OpenCL™ (Open Computing Language)
OpenCL is a framework for expressing programs where parallel computation is dispatched across heterogeneous devices. It is an open, royalty-free standard maintained by the technology consortium Khronos Group.
On a heterogeneous device, OpenCL views one of the programmable cores as a host and the other cores as OpenCL devices. For example, on a Sitara™ AM572x SoC, the host is the Arm® Cortex®-A15 cluster running SMP/Linux or TI-RTOS and the OpenCL device is the C66x DSP cluster.
The OpenCL runtime consists of two components:
- An API for the host program to create and submit kernels for execution
- A cross-platform language for expressing kernels – OpenCL C – which is based on C99 with some additions and restrictions
OpenCL supports both data parallel and task parallel programming paradigms.
- Data parallel execution parallelizes the execution across compute units on a device.
- Task parallel execution enables asynchronous dispatch of tasks to each compute unit.
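The difference between the two paradigms can be sketched with a small Python model. This is not OpenCL code: the function names `assign_work_groups` and `dispatch_tasks` are hypothetical, and the round-robin assignment is an assumption made for illustration, since how a runtime maps work to compute units is implementation-defined.

```python
from collections import defaultdict


def assign_work_groups(global_size, local_size, num_compute_units):
    """Data-parallel model: one kernel's index space is cut into work-groups,
    which are spread across compute units (round-robin is an assumption)."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    num_groups = global_size // local_size
    assignment = defaultdict(list)
    for group_id in range(num_groups):
        assignment[group_id % num_compute_units].append(group_id)
    return dict(assignment)


def dispatch_tasks(tasks, num_compute_units):
    """Task-parallel model: each independent task is sent whole to one compute unit."""
    return {name: index % num_compute_units for index, name in enumerate(tasks)}


# 1024 work-items in groups of 128 give 8 work-groups over 2 compute units.
print(assign_work_groups(1024, 128, 2))  # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}

# Three independent (hypothetical) tasks over 2 compute units.
print(dispatch_tasks(["fft", "filter", "scale"], 2))  # {'fft': 0, 'filter': 1, 'scale': 0}
```

Calling `assign_work_groups(1024, 128, 8)` instead gives each of eight compute units exactly one work-group, which shows how the same data-parallel kernel can spread over a different number of cores without changing the kernel itself.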
For more detailed information on TI’s OpenCL implementation and supported devices, see TI’s OpenCL User’s Guide.
Benefits of using OpenCL
Using a standard approach to programming heterogeneous SoCs simplifies development: the programmer can rely on standard, well-documented APIs to handle the mechanics of dispatching code and data to the DSPs, and can focus on optimizing the dispatched code. Other benefits include:
- Seamless migration of applications between TI SoCs (e.g., take an OpenCL application written for a 66AK2H SoC with eight C66x DSP cores and run it on an AM572x SoC with two C66x DSP cores with only a recompile).
- TI extensions to OpenCL enable programmers to leverage optimized TI-provided accelerated DSP libraries such as DSPLIB, MATHLIB, and IMGLIB.
- Use the DSPs to offload computation within open source libraries, for example linear algebra libraries and OpenCV.
Key features of TI’s OpenCL implementation
- OpenCL host is the Arm Cortex-A15 cluster running Linux
- One OpenCL device with the set of C66x DSP cores available on the device
- Compute unit is a single C66x DSP
- TI-specific extensions to OpenCL improve performance of code offloaded to the C66x DSP
- OpenCL implementation conformant to v1.1 (full profile) on AM57x and 66AK2H devices
Notes: TI’s OpenCL implementation does not provide support for images; image support is optional in the OpenCL v1.1 specification. Support for the double precision floating-point data type is enabled as an OpenCL extension and is not included in the conformance testing.
TI’s OpenCL implementation includes many examples illustrating various aspects of the implementation, including TI specific extensions to OpenCL and optimizing OpenCL kernels for the C66x DSP. A detailed description of the examples can be found in the examples section of the OpenCL User’s Guide.
Measuring OpenCL overheads for Arm® – DSP communication
The null example reports the time overhead that OpenCL requires to submit and dispatch a kernel. A null (empty) kernel is created and dispatched so that the profiling times queried from the OpenCL events reflect only the overhead necessary to submit and execute the kernel on the device. This overhead covers the round trip of a single kernel dispatch. In practice, when multiple kernels are enqueued, the dispatch overhead is overlapped with kernel execution and can approach zero.
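As a rough illustration of how such an overhead figure is derived and why it shrinks when dispatches are pipelined, the sketch below works in plain Python on made-up numbers. The four timestamp keys mirror the OpenCL event profiling queries `CL_PROFILING_COMMAND_QUEUED`, `CL_PROFILING_COMMAND_SUBMIT`, `CL_PROFILING_COMMAND_START`, and `CL_PROFILING_COMMAND_END`, which report nanoseconds. The timestamp values, the function names, and the simple overlap model are assumptions, not measurements or code from the null example.

```python
def dispatch_breakdown(ts):
    """Split one kernel event's profiling timestamps (ns) into phases."""
    return {
        "queued_to_submit": ts["submit"] - ts["queued"],
        "submit_to_start": ts["start"] - ts["submit"],
        "start_to_end": ts["end"] - ts["start"],
        "round_trip": ts["end"] - ts["queued"],
    }


def exposed_overhead_per_kernel(overhead_ns, exec_ns, num_kernels):
    """Simplified pipelining model: the dispatch of kernel i+1 overlaps the
    execution of kernel i, so only the first dispatch is left exposed.

    Assumes overhead_ns <= exec_ns; otherwise dispatch cannot be fully hidden.
    """
    if overhead_ns > exec_ns:
        raise ValueError("model assumes overhead_ns <= exec_ns")
    total_ns = overhead_ns + num_kernels * exec_ns   # pipelined wall time
    useful_ns = num_kernels * exec_ns                # time spent executing
    return (total_ns - useful_ns) / num_kernels


# Hypothetical timestamps for a single null-kernel dispatch.
null_event = {"queued": 0, "submit": 20_000, "start": 90_000, "end": 100_000}
print(dispatch_breakdown(null_event)["round_trip"])  # 100000

# Hypothetical 100 us dispatch overhead and 1 ms kernels.
for n in (1, 10, 100):
    print(n, exposed_overhead_per_kernel(100_000, 1_000_000, n))
```

Under this model the exposed overhead per kernel is the single-dispatch overhead divided by the number of enqueued kernels, so it falls from 100000 ns for one kernel to 1000 ns for one hundred.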
OpenCL is available on the following devices and is included as part of TI’s free Processor SDK that can be downloaded using the links below:
Note: TI’s OpenCL implementation is conformant to v1.1 (full profile) on AM57x and 66AK2H devices. The complete set of conformance tests has not been run on 66AK2Ex and 66AK2Gx devices.