# TI OpenVX<sup>™</sup> (TIOVX): Framework optimizations



















#### **Parallelism**

- Enabled by the graph model
- Independent nodes (nodes with no data dependency between them) can potentially be executed in parallel, using different execution units/cores
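As a minimal sketch of how a graph model exposes this parallelism, the snippet below groups the nodes of a small dependency graph into levels. Nodes in the same level do not depend on each other, so a framework could dispatch them to different cores. The node names and the `parallel_levels` helper are hypothetical illustrations, not part of the TIOVX or OpenVX API.

```python
from collections import defaultdict


def parallel_levels(nodes, edges):
    """Group DAG nodes into levels; nodes sharing a level are independent."""
    preds = defaultdict(set)
    for src, dst in edges:
        preds[dst].add(src)

    remaining = set(nodes)
    done = set()
    levels = []
    while remaining:
        # A node is ready once all of its producers have finished.
        ready = sorted(n for n in remaining if preds[n] <= done)
        if not ready:
            raise ValueError("graph has a cycle")
        levels.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return levels


# Hypothetical graph: one source feeds two independent filters, then a merge.
nodes = ["capture", "sobel_x", "sobel_y", "magnitude"]
edges = [
    ("capture", "sobel_x"),
    ("capture", "sobel_y"),
    ("sobel_x", "magnitude"),
    ("sobel_y", "magnitude"),
]

levels = parallel_levels(nodes, edges)
print(levels)  # [['capture'], ['sobel_x', 'sobel_y'], ['magnitude']]
```

Here `sobel_x` and `sobel_y` land in the same level, which is the condition under which they could run on different execution units.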





#### **Graph pipelining**

- Multiple invocations of a graph are pipelined across targets:
  - As images are fed to the system, the graph executions for successive input images are staged across targets in pipelined fashion, so different targets work on different images at the same time.
  - Enables full performance entitlement on SoCs with multiple cores



Example of 4 nodes pipelined across 4 target cores
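The 4-node, 4-target arrangement can be sketched as an idealized schedule. The snippet assumes that every node takes one equal time step, that each node is mapped to its own target, and that enough buffers exist between stages; real execution times and buffer depths will differ. The function names and the `target0`..`target3` labels are hypothetical.

```python
def pipeline_schedule(num_frames, num_stages):
    """Map (frame, stage) to the time step at which it runs in an ideal pipeline."""
    return {(f, s): f + s for f in range(num_frames) for s in range(num_stages)}


def total_steps(schedule):
    """Number of time steps until the last stage of the last frame finishes."""
    return max(schedule.values()) + 1


frames, stages = 8, 4
schedule = pipeline_schedule(frames, stages)

pipelined_steps = total_steps(schedule)  # frames + stages - 1
sequential_steps = frames * stages       # one frame fully finishes before the next starts

# Print one row per target: which frame it processes at each time step.
for s in range(stages):
    row = []
    for t in range(pipelined_steps):
        f = t - s
        row.append(f"F{f}" if 0 <= f < frames else "--")
    print(f"target{s}: " + " ".join(row))

print(pipelined_steps, sequential_steps)  # 11 32
```

Once the pipeline is full, every target is busy on a different frame in each step, which is where the throughput gain over sequential execution comes from.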



#### **Additional optimizations**

- Block/tile processing
  - When the output depends only on a subset of the input (not the entire input), the data to be processed can be broken into tiles and fed into the graph.
  - Each node processes 1 tile of data at a time, and the graph executes N times for 1 image (N is the number of tiles in the image)
  - Removes intermediate round-trips to memory
  - Requires a custom DMA framework to be implemented
- Abstraction of cache operations/address translation
  - TI provides internal memory APIs for optimized cache maintenance and address translation
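The block/tile idea can be illustrated with a small sketch that shows only the data decomposition, not the DMA transfers or on-chip buffering a real implementation would need. It assumes a hypothetical two-node graph of pointwise operations, so each output pixel depends only on the matching input pixel; neighborhood filters would additionally need overlapping tile borders. All names are illustrative.

```python
def run_graph(block):
    """Hypothetical two-node graph: brighten, then threshold (both pointwise)."""
    brightened = [[min(p + 10, 255) for p in row] for row in block]
    return [[255 if p >= 128 else 0 for p in row] for row in brightened]


def split_tiles(image, tile_h, tile_w):
    """Yield (y, x, tile) for each tile of the image in raster order."""
    height, width = len(image), len(image[0])
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            yield y, x, [row[x:x + tile_w] for row in image[y:y + tile_h]]


def run_tiled(image, tile_h, tile_w):
    """Execute the graph once per tile and reassemble the output image."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    executions = 0
    for y, x, tile in split_tiles(image, tile_h, tile_w):
        result = run_graph(tile)
        for dy, row in enumerate(result):
            out[y + dy][x:x + len(row)] = row
        executions += 1
    return out, executions


image = [[(x * 16 + y * 8) % 256 for x in range(8)] for y in range(8)]
tiled_out, n_tiles = run_tiled(image, 4, 4)
whole_out = run_graph(image)

print(n_tiles)                 # 4 graph executions for one 8x8 image
print(tiled_out == whole_out)  # True
```

Because a tile is much smaller than the full image, the intermediate result between the two nodes could stay in fast local memory instead of making a round-trip to external memory.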





#### For more information

- Jacinto 7 Processor SDK Automotive download: http://www.ti.com/tool/PROCESSOR-SDK-JACINTO-DRA8X-TDA4X
- Processor SDK Linux Automotive (PSDKLA) user guide: \${PSDKLA\_INSTALL\_PATH}/docs/linux/index.html
- Processor SDK RTOS Automotive (PSDKRA) user guide: \${PSDKRA\_INSTALL\_PATH}/index.html
- For additional questions, refer to the E2E community forums: https://e2e.ti.com/support/processors/f/791
- "Addressing System-Level Optimization with OpenVX Graphs": https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6910050





#### ©2020 Texas Instruments Incorporated. All rights reserved.

The material is provided strictly "as-is" for informational purposes only and without any warranty. Use of this material is subject to TI's **Terms of Use**, viewable at TI.com.





#### Graph model enables ... kernel fusion

- Aggregate function replacement:
  - Identifying a subset of nodes that can be replaced by a single equivalent node
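A minimal sketch of aggregate function replacement, assuming a linear chain of nodes and a hypothetical fused kernel `scale_offset` that is equivalent to running `scale` followed by `offset`. The kernel table and the `fuse` helper are illustrations only; they do not represent how TIOVX identifies or replaces nodes.

```python
def fuse(nodes, pattern, fused_name):
    """Replace each contiguous occurrence of `pattern` with one fused node."""
    out, i, k = [], 0, len(pattern)
    while i < len(nodes):
        if nodes[i:i + k] == pattern:
            out.append(fused_name)
            i += k
        else:
            out.append(nodes[i])
            i += 1
    return out


# Hypothetical kernels operating on a flat list of pixel values.
KERNELS = {
    "scale": lambda px: [p * 2 for p in px],
    "offset": lambda px: [p + 3 for p in px],
    "clamp": lambda px: [min(p, 255) for p in px],
    # Fused equivalent: a single pass with no intermediate buffer.
    "scale_offset": lambda px: [p * 2 + 3 for p in px],
}


def execute(nodes, pixels):
    """Run the nodes in order, feeding each output to the next node."""
    for name in nodes:
        pixels = KERNELS[name](pixels)
    return pixels


graph = ["scale", "offset", "clamp"]
fused_graph = fuse(graph, ["scale", "offset"], "scale_offset")
pixels = [0, 10, 100, 200]

print(fused_graph)                                           # ['scale_offset', 'clamp']
print(execute(graph, pixels) == execute(fused_graph, pixels))  # True
```

The fused graph produces the same output with one fewer node, so one intermediate buffer and one pass over the data are avoided.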


