DSP Libraries - OpenCV
Texas Instruments fully supports OpenCV 3.1 on all ARM® Cortex®-A based processors. This includes processors from the AM335x, AM437x and AM57xx Sitara processor families as well as 66AK2x multicore DSP + ARM processor families. For processors with multiple ARM Cortex-A15 cores, multicore ARM is supported for all OpenCV functions.
OpenCV has a modular structure, with each module containing several shared or static libraries. A sample of these modules and their associated functions is shown below.
- Image filtering
- Image transformation
- Structural analysis
- Feature detection
- Motion analysis
- Object tracking
- Dense optical flow
- Kalman filter
- Cascade classifier
- Detection region of interest
DSP acceleration of OpenCV
OpenCV 3.x has an architectural concept known as the Transparent API (T-API) that lets users call accelerated code automatically through OpenCL. When accelerated code exists, it becomes a branch in a regular OpenCV call. For example, calling cv::resize() automatically takes the accelerated branch when that is possible and makes sense from a performance point of view.
On TI’s heterogeneous SoCs, T-API enables offloading and/or accelerating OpenCV routines to TI’s C66x DSP cores. The ocl module within OpenCV contains OpenCL implementations of ~200 OpenCV algorithms that are targeted for accelerated offload.
From a user’s point of view, leveraging DSP cores for acceleration is transparent. The OpenCV call does not change.
The following table shows example performance of OpenCV routines as measured on multicore ARM Cortex-A15 cores and C66x DSPs in two different TI SoCs.
Note: The performance measured on the DSPs is for the OpenCL implementation included in OpenCV’s ocl module. This code has been compiled but not optimized for the DSP. The current implementation does not take advantage of efficient data movement, software loop pipelines or DSP intrinsics. Optimized OpenCV modules with improved performance will be available in future Processor SDK releases from TI. Use of these optimized modules will be transparent to the end user: the OpenCV calls will not change.
1 Time represents total ms for processing across all ARM Cortex-A15 cores on the device running at the designated speed.
2 Time represents total ms for processing across all C66x DSP cores on the device running at the designated speed. Time includes OpenCL overhead for cached OpenCL kernels.
3 This code has been compiled but not optimized for the DSP. The current implementation does not take advantage of efficient data movement, software loop pipelines or DSP intrinsics. Optimization is likely to improve these numbers.
|Hardware platform used|AM572x EVM|66AK2H EVM|
|Devices with core configuration benchmarked|AM5726, AM5728|66AK2H12, 66AK2H14|

|OpenCV function|Test image details|AM572x EVM, 2x ARM A15: time in ms (1)|AM572x EVM, 2x C66x DSPs: time in ms (2, 3)|AM572x EVM: speedup measured when moving code from ARM to DSP|66AK2H EVM, 4x ARM A15: time in ms (1)|66AK2H EVM, 8x C66x DSPs: time in ms (2, 3)|66AK2H EVM: speedup measured when moving code from ARM to DSP|
The speedup columns in the table above show the relative performance gained by offloading from the ARM cores to the DSP cores on each device. Where the speedup is > 1, the routine runs faster on the DSP cores in the specified SoC configuration. Where the speedup is < 1, the routine runs faster on the ARM cores in that configuration. Even then, using the DSP may still be desirable in order to offload processing from the ARM cores.
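The interpretation above can be sketched in a few lines, assuming the speedup is defined as the ARM time divided by the DSP time (the definition consistent with "> 1 means faster on the DSP"). The function names and the timing values mentioned after the code are hypothetical and are not taken from the benchmark table.

```cpp
#include <cassert>
#include <string>

// Assumed definition: speedup = total ARM time / total DSP time, both in ms.
double speedup(double armMs, double dspMs) {
    assert(armMs > 0.0 && dspMs > 0.0);
    return armMs / dspMs;
}

// Hypothetical helper: name the faster target for a given speedup value.
std::string fasterTarget(double s) {
    if (s > 1.0) return "DSP";
    if (s < 1.0) return "ARM";
    return "equal";
}
```

For instance, a routine that takes 10 ms on the ARM cores and 5 ms on the DSP cores has a speedup of 2 under this definition, while one that takes 4 ms on the ARM cores and 8 ms on the DSP cores has a speedup of 0.5 and runs faster on the ARM cores.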
The following charts show the relative performance of these OpenCV routines on the ARM cores versus the DSP cores for the specified devices. The performance of the Cortex-A15 cores has been normalized to 1, and the performance of the C66x cores is shown relative to the Cortex-A15. This comparison takes the number of cores and the processor speed into account.
OpenCV is included as part of TI’s free Processor SDK that can be downloaded for the following devices using the links below: