KeyStone I training: C66x CorePac overview - achieving high performance
CorePac: Achieving High Performance discusses how high performance can be achieved within each DSP core. Topics include CorePac architecture, Single Instruction Multiple Data (SIMD), memory access, and software pipelining.
The TI KeyStone family of multicore processors achieves very high floating-point and fixed-point performance. This presentation, C66x CorePac: Achieving High Performance, shows how each individual core achieves its performance.
The agenda for this presentation is the following. We will go through the CorePac architecture first. Then we'll cover the Single Instruction Multiple Data concept, followed by a description of memory access, and we will finish by describing the pipeline concept of the CorePac.
The first topic is the CorePac architecture. The drawing shows the internal structure of the CorePac. On the middle of the left side, we see the DSP core itself. It has two sides, A and B. Each side has its own register file, register file A and register file B. And each side has four functional units, M, L, S, and D. We'll talk more about those functional units in the next slide.
Each of the register files has its own 64-bit port to the Level 1 data memory. The data memory is 32 kilobytes, and it can be configured as cache, as RAM, or partitioned between cache and RAM.
Level 1 program memory is, again, a 32-kilobyte memory that can be configured as cache or RAM or partitioned between cache and RAM. And it has a 256-bit bus to the instruction fetch.
Also part of the CorePac is the Level 2 memory. The size of Level 2 memory depends on the family member. For the 6678, the L2 memory is 512 kilobytes. And again, this memory can be partitioned between cache and RAM.
This slide shows us the two sides of the core, the A side and the B side, with 32 registers on each side, each register 32 bits wide, and the four functional units on each side. The functional units are the M functional unit, which is responsible for all the multiplication operations, the L functional unit, which does the ALU operations, the D functional unit, which moves data between the core and memory, and the S functional unit, which does the control. As we will see in the next few slides, each of these functional units does much more than only multiplication or ALU or data or control.
The typical DSP processing operation is the MAC operation, multiply-accumulate. And this is implemented by the M and the L units. The D unit, as we said, moves data in and out of the functional units, from and to L1 memory. And the S unit provides the control: loop, branch, call, et cetera.
In each cycle, the DSP core can dispatch up to eight instructions that execute in parallel. All instructions are conditional, which enables efficient pipelining. We'll talk about it later. It's worth mentioning that the optimizing C compiler generates efficient target code.
While the two sides, A and B, are independent, there are two cross paths inside the core. One moves a B register into an A-side functional unit, and one moves an A register into a B-side functional unit.
A common instruction in a functional unit takes two inputs and generates one output. So on the A side, one of the two inputs can come from the B side, while at least one must come from the A side. The result is always on the A side.
For a B-side functional unit, again, out of two inputs, one may come from the A side. The other input and the output are on the B side. Each of the cross paths is 64 bits wide, and each cycle it can move either one register or a pair of registers from side A to a functional unit on side B, or from side B to a functional unit on side A.
The next few slides present a partial list of the instructions that can be executed by the D unit, the M unit, the L unit, and the S unit. The purpose of these slides is to illustrate the variety of operations that each of the functional units can perform.
For example, the D unit, in addition to all the load and store operations we see here (LDB, LDDW, LDH, STH, et cetera), has a full range of other operations, like addition operations such as ADD and logical operations such as AND. I'm not going to cover those instructions one by one. The user guide covers each of these instructions in detail. I just want to mention that this is not a complete list of all of the instructions.
Just as we showed before with the D functional unit, we can see a partial list of L functional unit instructions. We see, again, that it has a multiplicity of operations, not only addition, subtraction, and logical operations, but also packing and unpacking instructions, negate, compare, and shift. We have a great variety of instructions.
Moving on to the partial list of M instructions, again, we see here that the M functional unit does more than just multiplication. It does dot product, rotation, and of course, all the multiplications.
Last, but not least, are the S functional unit and partial list of the S instructions. Again, we see here that in addition to control instructions-- that is branch, call, return-- the S functional unit supports a great deal of logical and arithmetic and packing instructions. And of course, we can follow each of the instructions in this table in the user's guide.
The next part of the presentation is single instruction, multiple data, or what we like to call SIMD. Here, we give some examples that show how the same hardware that is used, the same functional unit that is used for one instruction can be used to implement SIMD.
For example, there is an instruction called ADDDP. This instruction adds two double-precision floating-point values. Double-precision floating-point is 64-bit. So this instruction takes in two 64-bit values and generates a 64-bit result.
The same hardware can be used by the instruction DADD2, which is a 4-way SIMD addition of packed, signed, 16-bit values. This instruction takes two sets of four 16-bit numbers, each set packed into a 64-bit register, and performs four additions. Again, there are four results, returned as four packed 16-bit values.
So what do we have here? In one cycle, we can do either a double-precision floating-point addition, or we can do 4 16-bit integer addition operations.
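As a rough illustration of what "4-way SIMD on packed 16-bit values" means, here is a minimal Python sketch. It is a software model only, not the DSP instruction itself: the helper names are hypothetical, the lane ordering (lane 0 in the low bits) is an assumption, and the sketch assumes each lane wraps around on overflow without saturating.

```python
MASK16 = 0xFFFF

def pack4x16(lanes):
    """Pack four signed 16-bit integers into one 64-bit word.
    Assumption: lane 0 sits in the least significant 16 bits."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & MASK16) << (16 * i)
    return word

def unpack4x16(word):
    """Split a 64-bit word back into four signed 16-bit integers."""
    lanes = []
    for i in range(4):
        raw = (word >> (16 * i)) & MASK16
        lanes.append(raw - 0x10000 if raw & 0x8000 else raw)
    return lanes

def simd_add4x16(a, b):
    """Add two packed words lane by lane.
    Assumption: each lane wraps modulo 2**16 and no carry crosses lanes."""
    result = 0
    for i in range(4):
        lane_a = (a >> (16 * i)) & MASK16
        lane_b = (b >> (16 * i)) & MASK16
        result |= ((lane_a + lane_b) & MASK16) << (16 * i)
    return result

a = pack4x16([1, -2, 300, -400])
b = pack4x16([10, 20, -30, 40])
print(unpack4x16(simd_add4x16(a, b)))  # [11, 18, 270, -360]
```

The point of the model is that one 64-bit operation carries four independent 16-bit additions, which is how the same adder width serves both the double-precision case and the packed integer case.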
The next example is FMPYDP, which performs fast double-precision floating-point multiplication. So the instruction takes two 64-bit double-precision floating-point values and multiplies them. And the result is, again, a 64-bit value. The same M functional unit that performs the double-precision multiplication can perform 4-way SIMD 32-bit multiplication.
The instruction QMPY32 performs a 4-way SIMD multiplication of two sets of values. Each set has four 32-bit signed values packed into a quad register, and the output is four results. Again, each result is a 32-bit signed value, and the four results are packed into a quad register.
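The 4-way 32-bit multiplication can be modeled in the same spirit. This is a hedged sketch, not a description of the hardware: the function names are hypothetical, the four lanes are represented as a plain Python list rather than a quad register, and the sketch assumes each result keeps only the low 32 bits of the product.

```python
def to_signed32(value):
    """Interpret the low 32 bits of an integer as a signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def simd_mul4x32(a_lanes, b_lanes):
    """Multiply four pairs of signed 32-bit values lane by lane.
    Assumption: each result is truncated to its low 32 bits."""
    if len(a_lanes) != 4 or len(b_lanes) != 4:
        raise ValueError("expected exactly four lanes per operand")
    return [to_signed32(a * b) for a, b in zip(a_lanes, b_lanes)]

print(simd_mul4x32([2, -3, 1000, 65536], [7, 5, -1000, 65536]))
# [14, -15, -1000000, 0]
```

The last lane shows the consequence of the truncation assumption: 65536 times 65536 is 2**32, whose low 32 bits are zero.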
The next example is CMATMPY. This is a complex matrix operation. Indeed, many applications use complex arithmetic. And this instruction, CMATMPY, multiplies a 1 by 2 complex vector by a 2 by 2 complex matrix, and the result is a 1 by 2 signed complex vector.
All the values are 16-bit, but they are complex. So we have 16-bit real and 16-bit imaginary parts. The input vector, in that case, is 64-bit. The input matrix is 128-bit. It has four values, each of them 32-bit. And in one cycle, each one of the M units performs this operation.
So let's calculate how many multiplications are involved in this operation. We know that when we multiply a 1 by 2 vector by a 2 by 2 matrix, there are 4 multiplications. In this case, since the vector and the matrix are complex, there are four complex multiplications. And we know that each complex multiplication is equal to 4 real multiplications. So altogether, in one cycle, this instruction performs 16 real-valued multiplications.
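The count of 16 can be checked with a small sketch that multiplies a two-element complex vector by a 2 by 2 complex matrix and counts every real multiplication. The class and function names are hypothetical, complex values are represented as (real, imaginary) tuples, and the code models only the arithmetic, not the fixed-point formats of the instruction.

```python
class CountingMultiplier:
    """Performs real multiplications and counts how many were issued."""
    def __init__(self):
        self.real_multiplies = 0

    def mul(self, a, b):
        self.real_multiplies += 1
        return a * b

def complex_mul(x, y, m):
    """(xr + j*xi) * (yr + j*yi) using four real multiplications."""
    (xr, xi), (yr, yi) = x, y
    real = m.mul(xr, yr) - m.mul(xi, yi)
    imag = m.mul(xr, yi) + m.mul(xi, yr)
    return (real, imag)

def vector_times_matrix(vec, mat, m):
    """Row vector (2 elements) times 2x2 matrix, all entries complex."""
    out = []
    for col in range(2):
        acc_r, acc_i = 0, 0
        for row in range(2):
            pr, pi = complex_mul(vec[row], mat[row][col], m)
            acc_r += pr
            acc_i += pi
        out.append((acc_r, acc_i))
    return out

counter = CountingMultiplier()
vec = [(1, 2), (3, 4)]                        # 1+2j, 3+4j
mat = [[(1, 0), (0, 1)], [(2, 0), (0, -1)]]   # [[1, j], [2, -j]]
result = vector_times_matrix(vec, mat, counter)
print(result)                    # [(7, 10), (2, -2)]
print(counter.real_multiplies)   # 16
```

Four complex products, each built from four real products, give the 16 real multiplications quoted above.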
Since we have two M units, each core can perform 32 real multiplications in one cycle. So in one second, we can perform 32 operations multiplied by 1.25 GHz, which is the clock rate of KeyStone, and that gives us the astonishing number of 40 billion multiplications per second.
But remember, the KeyStone 6678 has eight cores. So altogether, eight cores can give us up to 320 billion multiplications per second. That is an enormous number, so the question is, can we actually achieve this type of performance? Can we feed the functional units with data fast enough?
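The peak-rate arithmetic from the last two paragraphs can be written out as a short calculation. The constant names are made up for this sketch; the values are the ones quoted in the talk, and the result is a theoretical peak that assumes both M units issue this instruction on every cycle.

```python
REAL_MULTIPLIES_PER_INSTRUCTION = 16   # one CMATMPY, as counted above
M_UNITS_PER_CORE = 2
CLOCK_HZ = 1_250_000_000               # 1.25 GHz
CORES = 8                              # 6678

per_core_per_cycle = REAL_MULTIPLIES_PER_INSTRUCTION * M_UNITS_PER_CORE
per_core_per_second = per_core_per_cycle * CLOCK_HZ
per_device_per_second = per_core_per_second * CORES

print(per_core_per_cycle)                  # 32
print(per_core_per_second // 10**9)        # 40  (billion per second)
print(per_device_per_second // 10**9)      # 320 (billion per second)
```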
So there are two challenges in feeding the functional units with enough data. The first is how to provide enough data from memory to the core. Well, we know that L1 memory, which is the fastest memory (zero wait state, two [INAUDIBLE], each [INAUDIBLE] 64-bit), can push lots of data into the functional units. And the core has multiple mechanisms for feeding the L1 memory, which is fast but relatively small, with data from the slower and bigger L2 or external memory as well.
And the other challenge is how to get the data in and out of the functional units themselves. We'll talk about how pipelining enables the execution of an instruction every cycle. And this is the place to mention that the compiler is optimized to generate code with efficient instruction scheduling, to maximize utilization of the functional units.
Understanding the two challenges, first let's take a look at the memory access. This slide shows the buses that connect the functional units to the L1 memory. Let's look at the pink bus. The pink bus is the program bus. The program address is 32-bit, but the program data is 256-bit.
The CorePac can fetch up to 256 bits every cycle. 256 bits is eight instructions of 32 bits each.
Then we have the A side and the B side. Each of those buses has a 32-bit address and, depending on the type of core, either a 32-bit or a 64-bit data bus. For the KeyStone family, the CorePac has a 64-bit A bus and a 64-bit B bus.
So in each cycle, 128 bits can be read from or written to L1 memory by the functional units. In the next slide, we discuss the pipeline concept.
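To put the bus widths above in perspective, the following sketch turns them into per-cycle and per-second figures. The constant names are hypothetical, the 1.25 GHz clock is the rate quoted earlier in the talk, and the numbers are peak values that assume a transfer on every cycle with no stalls.

```python
CLOCK_HZ = 1_250_000_000      # 1.25 GHz, as quoted earlier in the talk
DATA_BUS_BITS = 64            # width of each of the A and B data buses
DATA_BUSES = 2
PROGRAM_FETCH_BITS = 256
INSTRUCTION_BITS = 32

data_bits_per_cycle = DATA_BUS_BITS * DATA_BUSES
instructions_per_fetch = PROGRAM_FETCH_BITS // INSTRUCTION_BITS
data_bytes_per_second = (data_bits_per_cycle // 8) * CLOCK_HZ

print(data_bits_per_cycle)                 # 128
print(instructions_per_fetch)              # 8
print(data_bytes_per_second // 10**9)      # 20 (billion bytes per second, peak)
```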
It is well known that to execute an instruction in the CPU, there are three main stages: fetch the instruction from memory, decode the instruction, and execute the instruction. And the functional unit, or any CPU, actually, should do all three, one after another.
The first line in the drawing shows what happens if there is no pipeline. A functional unit first fetches an instruction. In the next time slot it decodes the instruction, and then it executes it. And then it fetches the next instruction, decodes the next instruction, executes the next instruction, and so on and so forth.
But if the hardware architecture enables pipelining, it means that in the first time instance, the functional unit fetches the first instruction. In the next time instance, while the functional unit decodes the first instruction, it fetches the next instruction at the same time, in parallel. And then at the third time instance, the third clock, while it executes the first instruction, the functional unit decodes the second instruction and fetches the third instruction.
When the pipeline is full, the functional unit executes one instruction, decodes the next instruction, and fetches the third instruction. So in reality, even though each instruction goes through three time instances, the functional unit executes an instruction every cycle.
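The difference between the two rows of the drawing can be sketched as a cycle count. This is an idealized model under stated assumptions: every stage takes exactly one cycle, there are no stalls, and the function names are hypothetical.

```python
STAGES = ["fetch", "decode", "execute"]

def cycles_without_pipeline(n_instructions):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * len(STAGES)

def cycles_with_pipeline(n_instructions):
    """A new instruction enters every cycle; the last one needs the full depth."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1

def pipeline_timeline(n_instructions):
    """For each cycle, map each busy stage to the instruction index it holds."""
    timeline = []
    for cycle in range(cycles_with_pipeline(n_instructions)):
        active = {}
        for depth, stage in enumerate(STAGES):
            instruction = cycle - depth
            if 0 <= instruction < n_instructions:
                active[stage] = instruction
        timeline.append(active)
    return timeline

print(cycles_without_pipeline(4))   # 12
print(cycles_with_pipeline(4))      # 6
for cycle, active in enumerate(pipeline_timeline(4), start=1):
    print(cycle, active)
```

From the third cycle on, the timeline shows all three stages busy at once, which is the "pipeline is full" condition described above.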
Now, let's look a little bit deeper into the C66x pipeline. The instruction fetch itself has four phases. In the first phase, the program address is generated, that is, the address from which to fetch the next instruction.
In the next phase, this address is sent to the program memory. And then, depending on the speed of the program memory, we wait for the data to be ready, the instruction to be ready. And then in the last phase, the fetch reads the opcode and decoding can start.
The next slide shows the pipeline with the four phases of the fetch. So we have the program address generation, and then we have the send of the address. We have the wait for the memory and then the read of the memory. Four phases.
And we can see that even though the pipeline is deeper, instead of three phases, there are six phases in the pipeline. Still a new result is generated every cycle. There is execution every cycle.
This was about program fetch. Let's talk a little bit about the decode process. Is it only one cycle?
The decode, again, has two phases. In the DP phase, a dispatcher dispatches each instruction to its functional unit. Remember, there are eight functional units and up to eight instructions in each cycle. So in the DP phase, each instruction is sent to the right functional unit. And then in the next phase, the DC phase, the instruction is decoded at the functional unit.
So when we add the two phases of the decode, we actually increase the pipeline depth to seven phases. But again, when the pipeline is full, every cycle another instruction is executed. So the next question is how many cycles does it take to execute an instruction?
All the C66x instructions require only one cycle to execute, but some of the results are delayed. So the following table tells you how much delay there is. Most instructions are single-cycle, which means that they have zero delay.
Integer multiplication and fast floating-point multiplication have a one-cycle delay, which means that when you start execution, the result will be there not in the next cycle, but one cycle later. But you can still start a new execution every cycle, in pipelined fashion.
The legacy floating-point multiplication has a two-cycle delay, and we keep it to ensure backward compatibility with code that was already developed. The load instruction has a four-cycle delay, and we know why: we need to send the address, then wait for the memory, and then read the memory. And the branch instruction has a five-cycle delay.
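The delays just listed can be captured in a small lookup table. The dictionary keys and the function name are invented for this sketch, the delay values are the ones quoted in the talk, and the model assumes that a result with zero delay is usable in the cycle right after issue.

```python
# Delay (in cycles) before a result is available, as quoted in the talk.
DELAY_SLOTS = {
    "single_cycle": 0,
    "integer_multiply": 1,
    "fast_fp_multiply": 1,
    "legacy_fp_multiply": 2,
    "load": 4,
    "branch": 5,
}

def result_ready_cycle(issue_cycle, kind):
    """First cycle in which a following instruction can use the result.
    Assumption: with zero delay the result is usable in the next cycle."""
    return issue_cycle + 1 + DELAY_SLOTS[kind]

for kind in DELAY_SLOTS:
    print(kind, "issued in cycle 1, result usable in cycle",
          result_ready_cycle(1, kind))

# Issuing a multiply on every cycle still yields one result per cycle.
print([result_ready_cycle(c, "integer_multiply") for c in range(1, 5)])
# [3, 4, 5, 6]
```

The last line illustrates the point made above: the delay postpones each result, but it does not reduce the rate at which results arrive.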
Until now we've talked about hardware that supports pipelining instructions one by one. Now, we will talk a little bit about software pipelining. Let's look at the typical DSP MAC operation, a dot product.
So we load two values from memory. We multiply them, and we add them. How many cycles would it take to perform this loop five times, if we disregard any delay slots?
The answer is the following. In the first cycle, we load the two operands. In the second cycle, we multiply them, and we disregard the delay slots. And in the third cycle, we add them. So there are three cycles to implement this operation. And if we want to run it five times, it will take 15 cycles.
The next slide will show us a time diagram of those 15 cycles. So here is a time diagram of the instructions. In the first cycle, we load the first two operands. In the second cycle, we multiply them. And in the third cycle, we add them.
So we add them together, and now, in the fourth cycle, we load two more operands. In the fifth cycle, we multiply the two new operands. And then we add them together in the sixth cycle, et cetera. In the seventh cycle, we load the third pair of operands, then multiply them, and add.
And altogether it will take 15 cycles. We cut this diagram short, because it didn't fit into the slide.
If we can pipeline the code, here is what we see. In the first cycle, we load two operands. In the second cycle, while the M unit multiplies those operands, we load two new operands.
So in the second cycle, we have two loads and one multiply. In the third cycle, we add the result of the first multiplication. We multiply the second pair of values, and we load the third pair of values.
At this point, the pipeline is full. It has two loads, one multiply, and one add, and we do the same in the fourth cycle and the fifth cycle. After the fifth cycle, there are no more loads, because we have already loaded our five pairs of operands. Now, in cycle number six, we multiply the fifth pair, while we add the result of the fourth pair's multiplication.
And then in cycle number seven, we add the result of the fifth pair's multiplication. Altogether, it will take seven cycles, as opposed to the 15 cycles that the non-pipelined code takes. So pipelining those instructions took less than half the number of cycles.
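The seven-cycle schedule can be reproduced with a small cycle-by-cycle model of the dot product. As in the talk, delay slots are disregarded. The function name and the `log` structure are hypothetical, and the model assumes that in each cycle the add, multiply, and load all read values produced in earlier cycles.

```python
def dot_product_software_pipelined(a, b):
    """Cycle-by-cycle model: in one cycle the add uses the previous product,
    the multiply uses the previously loaded pair, and a new pair is loaded."""
    n = len(a)
    loaded = None    # operand pair loaded in the previous cycle
    product = None   # product computed in the previous cycle
    acc = 0
    cycles = 0
    log = []
    i = 0
    while i < n or loaded is not None or product is not None:
        cycles += 1
        ops = []
        new_acc = acc
        if product is not None:
            new_acc = acc + product
            ops.append("add")
        new_product = None
        if loaded is not None:
            new_product = loaded[0] * loaded[1]
            ops.append("multiply")
        new_loaded = None
        if i < n:
            new_loaded = (a[i], b[i])
            i += 1
            ops.append("load")
        # All results become visible together at the end of the cycle.
        acc, product, loaded = new_acc, new_product, new_loaded
        log.append(ops)
    return acc, cycles, log

total, cycles, log = dot_product_software_pipelined([1, 2, 3, 4, 5],
                                                    [6, 7, 8, 9, 10])
print(total, cycles)        # 130 7
print(3 * 5)                # 15 cycles for the non-pipelined version
for cycle, ops in enumerate(log, start=1):
    print(cycle, ops)
```

Cycles 3 through 5 show all three operations active at once, and cycles 6 and 7 drain the pipeline, matching the walk-through above.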
So the question is how can we stage or schedule instructions to get a full pipeline? The compiler is smart enough to schedule instruction efficiently. When we talk about DSP algorithms, we remember that DSP algorithms have lots of loops. They are, typically, loop intensive and very little control, lots of loop processing.
If you look at the pipeline, you'll see that there is a strong dependency on timing. Each result is ready exactly when the operation that needs it is performed. So generally speaking, servicing of interrupts is not allowed in the middle of a loop, because fixed timing is essential. And an interrupt service routine can break the deterministic processing of the loop.
That's why in a typical DSP application, we limit the number of [INAUDIBLE] in the loop. So if there is a very long loop, we break it into multiple loops so that interrupts can be serviced between the different loops.
However, Texas Instruments developed a piece of algorithm called SPLOOP. SPLOOP enables servicing of interrupts even in the middle of loops.
This concludes the presentation. For more information, refer to the C66x CPU and Instruction Set Reference Guide. It's publicly available, and the link to it is given below. For any questions regarding topics covered in this training or other KeyStone training, visit the support forum at the TI E2E community website. Thank you.