SPRUIV4C May   2020  – December 2023

 

  1.   1
  2.   Read This First
    1.     About This Manual
    2.     Related Documentation
    3.     Trademarks
  3. 2Introduction
    1. 2.1 C7000 Digital Signal Processor CPU Architecture Overview
    2. 2.2 C7000 Split Datapath and Functional Units
  4. 3C7000 C/C++ Compiler Options
    1. 3.1 Overview
    2. 3.2 Selecting Compiler Options for Performance
    3. 3.3 Understanding Compiler Optimization
      1. 3.3.1 Software Pipelining
      2. 3.3.2 Vectorization and Vector Predication
      3. 3.3.3 Automatic Use of Streaming Engine and Streaming Address Generator
      4. 3.3.4 Loop Collapsing and Loop Coalescing
      5. 3.3.5 Automatic Inlining
      6. 3.3.6 If Conversion
  5. 4Basic Code Optimization
    1. 4.1  Signed Types for Iteration Counters and Limits
    2. 4.2  Floating-Point Division
    3. 4.3  Loop-Carried Dependencies and the Restrict Keyword
      1. 4.3.1 Loop-Carried Dependencies
      2. 4.3.2 The Restrict Keyword
      3. 4.3.3 Run-Time Alias Disambiguation
    4. 4.4  Function Calls and Inlining
    5. 4.5  MUST_ITERATE and PROB_ITERATE Pragmas and Attributes
    6. 4.6  If Statements and Nested If Statements
    7. 4.7  Intrinsics
    8. 4.8  Vector Types
    9. 4.9  C++ Features to Use and Avoid
    10. 4.10 Streaming Engine
    11. 4.11 Streaming Address Generator
    12. 4.12 Optimized Libraries
    13. 4.13 Memory Optimizations
  6. 5Understanding the Assembly Comment Blocks
    1. 5.1 Software Pipelining Processing Stages
    2. 5.2 Software Pipeline Information Comment Block
      1. 5.2.1 Loop and Iteration Count Information
      2. 5.2.2 Dependency and Resource Bounds
      3. 5.2.3 Initiation Interval (ii) and Iterations
      4. 5.2.4 Constant Extensions
      5. 5.2.5 Resources Used and Register Tables
      6. 5.2.6 Stage Collapsing
      7. 5.2.7 Memory Bank Conflicts
      8. 5.2.8 Loop Duration Formula
    3. 5.3 Single Scheduled Iteration Comment Block
    4. 5.4 Identifying Pipeline Failures and Performance Issues
      1. 5.4.1 Issues that Prevent a Loop from Being Software Pipelined
      2. 5.4.2 Software Pipeline Failure Messages
      3. 5.4.3 Performance Issues
  7. 6Revision History

C7000 Split Datapath and Functional Units

The following block diagram shows the datapath split on the C7100 DSP CPU. There is an A-side datapath and a B-side datapath. The diagram shows the functional units and multiple, heterogeneous register files. The A-side datapath is responsible for scalar computation, loading and storing scalars and vectors to and from memory, and control-flow (branches, calls). The B-side datapath handles vector math operations, permutations of data, and vector predication operations.

GUID-20200513-SS0I-GVDH-406R-GWV8NKTVNBTS-low.png Figure 2-1 C7000 Datapath Block Diagram

To simplify the image above, some data movement capabilities and data paths are not shown in this figure.

  • In general, a functional unit can write to any register file on the same datapath.
  • Most functional units can obtain data from one or both of the streaming engines.
  • There is one 64-bit cross path per datapath (A/B). Each cross path allows one read per cycle from the opposite side global register file.

C7100 and C7120 devices have a 512-bit vector width. C7504 and C7524 devices have a 256-bit vector width. Registers have 64 bits per register ("scalar") or a "vector-width" number of bits per register. Thus, C7100 and C7120 devices have 512-bit vector registers, while C7504 and C7524 devices have 256-bit vector registers.

On a given datapath, there are several different kinds of register files. On a given datapath, each functional unit can write to the global register file on that datapath and most of the “local” register files on that datapath. However, only some functional units can read from a “local” register file.

  • D1 and D2 units: These reside on the A-side datapath and can load from and store to memory. Two 64-bit loads can execute in parallel. Two 64-bit stores can execute in parallel. A 64-bit and vector-width load can execute in parallel with a 64-bit or vector-width store. It is not possible for two vector-width stores to execute in parallel or two vector-width loads to execute in parallel.
  • L1, S1, M1, and N1 units: These are general-purpose functional units, handling a varied mix of scalar and small vector computation. The M1 and N1 functional units perform various multiplication instructions.
  • L2, S2, M2, and N2 units: These are also general-purpose functional units, and can operate on full-width vector data. The M2 and N2 functional units perform various multiplication instructions.
  • B unit: This unit handles indirect branches and calls.
  • C unit: This unit performs permutations and shuffles of data.
  • P unit: This unit computes predicates used to mask off vector lanes so particular lanes are not computed or are not stored to memory.

In addition to the D1 and D2 units providing CPU access to the memory hierarchy, the C7100 DSP has two “streaming engines” that facilitate a fast path to obtain data from memory. A streaming engine is a hardware feature that allows you (or the compiler) to specify a pattern of memory addresses to obtain from memory. The streaming engine will do its best to pre-fetch that data from the memory hierarchy into a scratchpad memory close to the CPU, to minimize CPU stalls due to cold cache misses.