
DSP Libraries - OpenCV

OpenCV is an open source computer vision library. It is available free for both academic and commercial use under a BSD (Berkeley Software Distribution) license.

Texas Instruments fully supports OpenCV 3.1 on all Arm® Cortex®-A based processors. This includes processors from the AM335x, AM437x and AM57xx Sitara™ processor families as well as 66AK2x multicore DSP + Arm® processor families. For processors with multiple Arm® Cortex®-A15 cores, multicore Arm® is supported for all OpenCV functions.

OpenCV has a modular structure, with each module containing several shared or static libraries. A sample of these modules and their associated functions is shown below.

Image processing

  • Image filtering
  • Image transformation
  • ColorMaps
  • Structural analysis
  • Feature detection

Video analysis

  • Motion analysis
  • Object tracking
  • Dense optical flow
  • Kalman filter

Object detection

  • Cascade classifier
  • Detection region of interest

DSP acceleration of OpenCV

OpenCV 3.x includes an architectural concept known as the Transparent API (T-API), which lets users call accelerated code through OpenCL automatically. When accelerated code exists, it becomes a branch inside the regular OpenCV call. For example, calling cv::resize() automatically takes the accelerated branch when that is possible and makes sense from a performance point of view.

On TI’s heterogeneous SoCs, T-API enables offloading and/or accelerating OpenCV routines to TI’s C66x DSP cores. The ocl module within OpenCV contains OpenCL implementations of ~200 OpenCV algorithms that are targeted for accelerated offload.

From a user’s point of view, leveraging DSP cores for acceleration is transparent. The OpenCV call does not change.

OpenCV performance

The following table shows example performance of OpenCV routines as measured on multicore Arm® Cortex®-A15 cores and C66x DSPs in two different TI SoCs.

Note: The performance measured on the DSPs is for the OpenCL implementation included in OpenCV’s ocl module. This code has been compiled but not optimized for the DSP. The current implementation does not take advantage of efficient data movement, software-pipelined loops or DSP intrinsics. Optimized OpenCV modules with improved performance will be available in future Processor SDK releases from TI. Use of these optimized modules will be transparent to the end user: the OpenCV calls will not change.

Benchmark configurations:

Hardware platform   Devices with core configuration benchmarked   ARM cores              DSP cores
AM572x EVM          AM5726, AM5728                                2x ARM A15 @ 1.5 GHz   2x C66x DSPs @ 750 MHz
66AK2H EVM          66AK2H12, 66AK2H14                            4x ARM A15 @ 1.2 GHz   8x C66x DSPs @ 1.2 GHz

Results (times in ms; speedup is measured when moving code from ARM to DSP):

                                                                      AM572x EVM                       66AK2H EVM
OpenCV function               Test image          Channels  Datatype  Arm ms(1)  DSP ms(2,3)  Speedup  Arm ms(1)  DSP ms(2,3)  Speedup
cv::add                       640x480             4         uint8     7.7        20.7         0.4x     5.9        3.6          1.6x
cv::dilate                    1920x1080           1         uint8     164.3      89.8         1.8x     201.9      18.7         10.8x
cv::erode                     1920x1080           1         uint8     164.1      89.8         1.8x     204.3      18.6         11.0x
cv::filter2D                  1920x1080           4         float32   601.5      249.4        2.4x     721.2      55.5         13.0x
cv::GaussianBlur              1920x1080           4         uint8     707.4      366          1.9x     883.4      70.5         12.5x
cv::CascadeClassifier         cv/shared/lena.png  -         -         118        412.2        0.3x     105        44.5         2.3x
cv::calcOpticalFlowFarneback  N/A                 -         -         5535.2     7227.5       0.8x     6852.7     1191.6       5.8x

(1) Time is the total processing time in ms across all ARM Cortex-A15 cores on the device running at the designated speed.

(2) Time is the total processing time in ms across all C66x DSP cores on the device running at the designated speed. Time includes OpenCL overhead for cached OpenCL kernels.

(3) This code has been compiled but not optimized for the DSP. The current implementation does not take advantage of efficient data movement, software-pipelined loops or DSP intrinsics. Optimization is likely to improve these numbers.

The "Speedup" column in the above table is the ratio of Arm® time to DSP time, and shows the relative performance improvement that can be gained by offloading from the Arm® cores to the DSP cores of the device. Where the speedup is > 1, the routine runs faster on the DSP cores in the specified SoC configuration. Where the speedup is < 1, the routine runs faster on the Arm® cores in that configuration. Even then, using the DSP may still be desirable in order to offload processing from the Arm® cores.
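The speedup figures can be reproduced with simple arithmetic. The sketch below assumes the column is the Arm time divided by the DSP time, rounded to one decimal place. That rule is inferred from the published values rather than stated by the source, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed definition: speedup = Arm time / DSP time for the same routine.
// A value above 1 means the DSP cores finished sooner than the Arm cores.
double speedup(double arm_ms, double dsp_ms)
{
    assert(arm_ms > 0.0 && dsp_ms > 0.0);
    return arm_ms / dsp_ms;
}

// Round to one decimal place, matching the "1.8x" style used in the table.
double roundToTenth(double x)
{
    return std::round(x * 10.0) / 10.0;
}

// Hypothetical helper: true when offloading to the DSP reduces run time.
bool dspIsFaster(double arm_ms, double dsp_ms)
{
    return speedup(arm_ms, dsp_ms) > 1.0;
}
```

For cv::dilate on the AM572x EVM, roundToTenth(speedup(164.3, 89.8)) gives 1.8, and on the 66AK2H EVM roundToTenth(speedup(201.9, 18.7)) gives 10.8, both matching the table. For cv::add on the AM572x EVM, dspIsFaster(7.7, 20.7) is false, which corresponds to the 0.4x entry.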

The following charts show the relative performance of these OpenCV routines on the Arm® cores vs. the DSP cores for the specified devices. The performance of the Cortex®-A15 cores has been normalized to 1, and the performance of the C66x cores is shown relative to the Cortex®-A15 cores. This comparison takes the number of cores and the processor speed into account.

Download OpenCV

OpenCV is included as part of TI’s free Processor SDK that can be downloaded for the following devices using the links below:

  • Sitara™ AM335x: Cortex®-A8 core
  • Sitara™ AM437x: Cortex®-A9 core
  • Sitara™ AM57x: Single/dual Cortex®-A15 and C66x DSPs
  • C6000 DSP + Arm® 66AK2Ex: Highest performance dual/quad Cortex®-A15
  • C6000 DSP + Arm® 66AK2Hx: High performance multicore C66x DSPs + Arm® Cortex®-A15s
  • C6000 DSP + Arm® 66AK2Gx: Single Cortex®-A15 and single C66x DSP