Four custom instructions have been added as part of CDE. These are common instructions supported by hardware acceleration, leveraging the CDE feature of the CM33. Ternary MAC and BNN are special types with weight quantization down to 2 bits and 1 bit respectively, which enables multiple operations, such as multiply and accumulate, to complete in a single clock cycle. Similarly, MMA (8x8 MAC) provides better throughput than the core instructions of the Arm CPU. The batch normalization layer is also an integral part of a CNN and is repeated depending on the network topology; batch normalization helps to speed up the training process, so instruction support for BN is important for the overall performance of a typical network.
Ternary Matrix Multiply and Accumulate - TMA
Matrix Multiply and Accumulate - MMA
Batch Normalization - BN
Support for Binary Neural Network - BNN
The custom instructions have the following format:
CX3{A} {cond}, <coproc>, <Rd>, <Rn>, <Rm>, #<imm>
CX3D{A} {cond}, <coproc>, <Rd>, <Rd+1>, <Rn>, <Rm>, #<imm>
Which of the four instructions to execute is selected by the #<imm> field, with the following encodings:
#imm=0 TMA (Signed)
#imm=1 BNORM
#imm=2 BNN
#imm=3 TMA (Unsigned)
#imm=4 MMA (Signed)
#imm=5 MMA (Unsigned)
The pseudocode for each instruction is given below:
#define COPROC       0
#define imm_TMA4X4S  0
#define imm_BNORM4   1
#define imm_BNN16X4  2
#define imm_TMA4X4U  3
#define imm_MMA2X2S  4
#define imm_MMA2X2U  5
uint64_t Rd;
uint64_t Y;
uint32_t Rn, Rm;
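As a usage sketch (not taken from this document), the ACLE CDE intrinsics declared in arm_cde.h can issue these instructions from C, using the defines above. The snippet assumes a toolchain with CDE support enabled for coprocessor 0 (for example, an -mcpu=cortex-m33+cdecp0 style option); the helper name tma_step is hypothetical.

#include <stdint.h>
#include <arm_cde.h>

/* Sketch: accumulate one TMA4X4S step into a 64-bit packed accumulator.
 * acc holds 4x16-bit signed accumulators, activations holds 4x8-bit
 * activations, and weights holds the packed 2-bit ternary weights. */
static inline uint64_t tma_step(uint64_t acc, uint32_t activations, uint32_t weights)
{
    return __arm_cx3da(COPROC, acc, activations, weights, imm_TMA4X4S);
}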
Ternary Matrix Multiply and Accumulate (TMA)
Y = __arm_cx3da(COPROC, Rd, Rn, Rm, imm_TMA4X4S); //Y is 64 bit result
Y = {Rd+1(t+1), Rd(t+1)}
Rd(t+1) = {Y[1], Y[0]}
Rd+1(t+1) = {Y[3], Y[2]}
Y[3] = Saturate (sign_extend(Rd+1[31:16]) + Rm[7:6] * sign_extend(Rn[31:24]) + Rm[5:4] * sign_extend(Rn[23:16]) + Rm[3:2] * sign_extend(Rn[15:8]) + Rm[1:0] * sign_extend(Rn[7:0]))
Y[2] = Saturate (sign_extend(Rd+1[15:0]) + Rm[7:6] * sign_extend(Rn[31:24]) + Rm[5:4] * sign_extend(Rn[23:16]) + Rm[3:2] * sign_extend(Rn[15:8]) + Rm[1:0] * sign_extend(Rn[7:0]))
Y[1] = Saturate (sign_extend(Rd[31:16]) + Rm[7:6] * sign_extend(Rn[31:24]) + Rm[5:4] * sign_extend(Rn[23:16]) + Rm[3:2] * sign_extend(Rn[15:8]) + Rm[1:0] * sign_extend(Rn[7:0]))
Y[0] = Saturate (sign_extend(Rd[15:0]) + Rm[7:6] * sign_extend(Rn[31:24]) + Rm[5:4] * sign_extend(Rn[23:16]) + Rm[3:2] * sign_extend(Rn[15:8]) + Rm[1:0] * sign_extend(Rn[7:0]))
Y = __arm_cx3da(COPROC, Rd, Rn, Rm, imm_TMA4X4U); //Y is 64 bit result
Y = {Rd+1(t+1), Rd(t+1)}
Rd(t+1) = {Y[1], Y[0]}
Rd+1(t+1) = {Y[3], Y[2]}
Y[3] = Saturate (sign_extend(Rd+1[31:16]) + Rm[7:6] * Rn[31:24] + Rm[5:4] * Rn[23:16] + Rm[3:2] * Rn[15:8] + Rm[1:0] * Rn[7:0])
Y[2] = Saturate (sign_extend(Rd+1[15:0]) + Rm[7:6] * Rn[31:24] + Rm[5:4] * Rn[23:16] + Rm[3:2] * Rn[15:8] + Rm[1:0] * Rn[7:0])
Y[1] = Saturate (sign_extend(Rd[31:16]) + Rm[7:6] * Rn[31:24] + Rm[5:4] * Rn[23:16] + Rm[3:2] * Rn[15:8] + Rm[1:0] * Rn[7:0])
Y[0] = Saturate (sign_extend(Rd[15:0]) + Rm[7:6] * Rn[31:24] + Rm[5:4] * Rn[23:16] + Rm[3:2] * Rn[15:8] + Rm[1:0] * Rn[7:0])
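To make the bit slicing concrete, the sketch below is a plain C reference model of the TMA4X4 (signed) pseudocode above. The helper names tma4x4s_ref and sat16 are hypothetical; the interpretation of each 2-bit weight field as a signed 2-bit value (covering the ternary set -1/0/+1) and the saturation to the signed 16-bit range are assumptions, not statements from this document.

#include <stdint.h>

/* Saturate a 32-bit intermediate to the signed 16-bit range (assumed range). */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Reference model of TMA4X4S: rd_pair is {Rd+1, Rd} packed as 4x16-bit
 * signed accumulators, rn holds 4 signed 8-bit activations, and rm[7:0]
 * holds 4 packed 2-bit ternary weights, as in the pseudocode. */
static uint64_t tma4x4s_ref(uint64_t rd_pair, uint32_t rn, uint32_t rm)
{
    /* Dot product of the four activation bytes with the four 2-bit weights. */
    int32_t dot = 0;
    for (int j = 0; j < 4; j++) {
        int8_t   a    = (int8_t)(rn >> (8 * j));          /* sign_extend(Rn byte) */
        uint32_t bits = (rm >> (2 * j)) & 0x3u;
        int32_t  w    = (bits & 0x2u) ? (int32_t)bits - 4  /* signed 2-bit weight  */
                                      : (int32_t)bits;     /* (assumed encoding)   */
        dot += w * a;
    }
    /* Add the dot product to each of the four 16-bit accumulators. */
    uint64_t y = 0;
    for (int i = 0; i < 4; i++) {
        int16_t acc = (int16_t)(rd_pair >> (16 * i));
        y |= (uint64_t)(uint16_t)sat16(acc + dot) << (16 * i);
    }
    return y;
}

The unsigned variant (imm_TMA4X4U) differs only in that the Rn bytes are zero-extended instead of sign-extended.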
Binary Neural Network (BNN)
Y = __arm_cx3da(COPROC, Rd, Rn, Rm, imm_BNN16X4); //Y is 64 bit result
Y = {Rd+1(t+1), Rd(t+1)}
Rd(t+1) = {Y[1], Y[0]}
Rd+1(t+1) = {Y[3], Y[2]}
Y[3] = sign_extend(Rd+1[31:16]) + sign_extend(POPCOUNT(Rn[31:16] XNOR Rm[31:16]))
Y[2] = sign_extend(Rd+1[15:0]) + sign_extend(POPCOUNT(Rn[31:16] XNOR Rm[15:0]))
Y[1] = sign_extend(Rd[31:16]) + sign_extend(POPCOUNT(Rn[15:0] XNOR Rm[31:16]))
Y[0] = sign_extend(Rd[15:0]) + sign_extend(POPCOUNT(Rn[15:0] XNOR Rm[15:0]))
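A corresponding plain C reference model of the BNN pseudocode is sketched below; the function name bnn16x4_ref is hypothetical, popcount is done with the GCC/Clang __builtin_popcount builtin, and the accumulators are treated as signed 16-bit lanes per the register-format note under "Other important points".

#include <stdint.h>

/* Reference model of BNN16X4: four 16-bit accumulators, each updated with
 * the popcount of a 16-bit XNOR between one half of Rn and one half of Rm. */
static uint64_t bnn16x4_ref(uint64_t rd_pair, uint32_t rn, uint32_t rm)
{
    uint16_t n[2] = { (uint16_t)rn, (uint16_t)(rn >> 16) };
    uint16_t m[2] = { (uint16_t)rm, (uint16_t)(rm >> 16) };
    uint64_t y = 0;
    for (int i = 0; i < 4; i++) {
        /* Lane i pairs Rn half (i >> 1) with Rm half (i & 1), as in the pseudocode. */
        uint16_t xnor = (uint16_t)~(n[i >> 1] ^ m[i & 1]);
        int16_t  acc  = (int16_t)(rd_pair >> (16 * i));
        int32_t  sum  = acc + __builtin_popcount(xnor);
        y |= (uint64_t)(uint16_t)sum << (16 * i);
    }
    return y;
}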
Matrix Multiply and Accumulate (MMA)
Y = __arm_cx3da(COPROC, Rd, Rn, Rm, imm_MMA2X2S); //Y is 64 bit result
Y = {Rd+1(t+1), Rd(t+1)}
Rd(t+1) = {Y[1], Y[0]}
Rd+1(t+1) = {Y[3], Y[2]}
{Y[1], Y[0]} = Saturate(sign_extend(Rd) + { sign_extend(Rn[7:0])* sign_extend(Rm[7:0]) + sign_extend(Rn[15:8])* sign_extend(Rm[15:8])})
{Y[3], Y[2]} = Saturate(sign_extend(Rd+1) + { sign_extend(Rn[23:16])* sign_extend(Rm[23:16]) + sign_extend(Rn[31:24])* sign_extend(Rm[31:24])})
Y = __arm_cx3da(COPROC, Rd, Rn, Rm, imm_MMA2X2U); //Y is 64 bit result
Y = {Rd+1(t+1), Rd(t+1)}
Rd(t+1) = {Y[1], Y[0]}
Rd+1(t+1) = {Y[3], Y[2]}
{Y[1], Y[0]} = Saturate(sign_extend(Rd) + {Rn[7:0] * sign_extend(Rm[7:0]) + Rn[15:8] * sign_extend(Rm[15:8])})
{Y[3], Y[2]} = Saturate(sign_extend(Rd+1) + {Rn[23:16] * sign_extend(Rm[23:16]) + Rn[31:24] * sign_extend(Rm[31:24])})
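A plain C reference model of the MMA2X2 (signed) pseudocode is sketched below; the names mma2x2s_ref and sat32 are hypothetical, and saturation to the signed 32-bit range is an assumption.

#include <stdint.h>

/* Saturate a 64-bit intermediate to the signed 32-bit range (assumed range). */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Reference model of MMA2X2S: rd_pair is {Rd+1, Rd} packed as 2x32-bit
 * signed accumulators, each updated with a 2-element dot product of
 * signed 8-bit values from Rn and Rm, as in the pseudocode. */
static uint64_t mma2x2s_ref(uint64_t rd_pair, uint32_t rn, uint32_t rm)
{
    uint64_t y = 0;
    for (int i = 0; i < 2; i++) {
        int32_t acc = (int32_t)(rd_pair >> (32 * i));
        int64_t dot = 0;
        for (int j = 0; j < 2; j++) {
            int8_t a = (int8_t)(rn >> (8 * (2 * i + j)));  /* sign_extend(Rn byte) */
            int8_t w = (int8_t)(rm >> (8 * (2 * i + j)));  /* sign_extend(Rm byte) */
            dot += (int64_t)a * w;
        }
        y |= (uint64_t)(uint32_t)sat32(acc + dot) << (32 * i);
    }
    return y;
}

For the unsigned variant (imm_MMA2X2U) the Rn bytes are zero-extended instead of sign-extended, matching the pseudocode above.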
Batch Normalization (BN)
Y = __arm_cx3da(COPROC, Rd, Rn, Rm, imm_BNORM4); //Y is 32 bit result
Y = Rd(t+1)
Cycle 1
Rd(t+1)[15:8] = clamp( [((Rd[31:24]* sign_extend(Rn[15:8]))<<8) + (Rd[23:16]* sign_extend(Rn[15:8]))] >> Rm[21:17] )
Rd(t+1)[7:0] = clamp( [((Rd[15:8]* sign_extend(Rn[7:0]))<<8) + (Rd[7:0]* sign_extend(Rn[7:0]))] >> Rm[16:12] )
Cycle 2
Rd(t+1)[31:24] = clamp( [((Rd+1[31:24]* sign_extend(Rn[31:24]))<<8) + (Rd+1[23:16]* sign_extend(Rn[31:24]))] >> Rm[31:27] )
Rd(t+1)[23:16] = clamp( [((Rd+1[15:8]* sign_extend(Rn[23:16]))<<8) + (Rd+1[7:0]* sign_extend(Rn[23:16]))] >> Rm[26:22] )
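Each BN output byte is effectively one signed 16-bit input lane scaled by a signed 8-bit factor from Rn, shifted right by a 5-bit amount from Rm, and clamped. The sketch below models a single lane; the name bn_lane_ref is hypothetical, the arithmetic shift and the treatment of the high/low bytes as one signed 16-bit lane are assumptions (consistent with the register-format note below), and the clamp bounds are passed in as parameters because they are decoded from additional opcode bits described under "Other important points".

#include <stdint.h>

/* Reference model of one BN output lane: x is one 16-bit lane of Rd or Rd+1,
 * scale is the corresponding signed 8-bit value from Rn, shift is the 5-bit
 * shift amount from Rm, and clamp_hi/clamp_lo come from the opcode decode. */
static int8_t bn_lane_ref(int16_t x, int8_t scale, uint32_t shift,
                          int32_t clamp_hi, int32_t clamp_lo)
{
    /* (int32_t)x * scale is equivalent to ((x_hi*scale)<<8) + (x_lo*scale)
     * in the pseudocode, with x_hi the signed high byte and x_lo the
     * unsigned low byte of the 16-bit lane (assumption). */
    int32_t prod = (int32_t)x * scale;
    int32_t v = prod >> (shift & 0x1F);   /* arithmetic shift assumed */
    if (v > clamp_hi) v = clamp_hi;
    if (v < clamp_lo) v = clamp_lo;
    return (int8_t)v;
}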
Other important points:
<coproc> must be 0
The 2x32-bit registers (Rd, Rd+1) represent 4x16-bit data; each 16-bit field is treated as signed, while Rd and Rd+1 individually are not signed 32-bit numbers. Applicable for instructions #imm = 0, 1, 2, 3
The 2x32-bit registers (Rd, Rd+1) represent 2x32-bit data, each treated as signed. Applicable for instructions #imm = 4, 5
Additional decodes from the instruction opcode - applicable only for BN:
Clamp High - the upper 9 bits (11:3) are used directly as a signed value for the clamp-high bound
Clamp Low - the lower 3 bits (2:0) are decoded as: 000 → 0, 001 → -2, 010 → -4, 011 → -8, 100 → -16, 101 → -32, 110 → -64, 111 → -128
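A small sketch of this decode is shown below; the 12-bit field is passed in as a parameter because its exact position within the opcode is not restated here, and the function name bn_decode_clamps is hypothetical.

#include <stdint.h>

/* Decode the BN clamp bounds from the 12-bit opcode field:
 * bits 11:3 are the signed clamp-high value, bits 2:0 select clamp-low. */
static void bn_decode_clamps(uint32_t field, int32_t *clamp_hi, int32_t *clamp_lo)
{
    static const int32_t lo_table[8] = { 0, -2, -4, -8, -16, -32, -64, -128 };
    uint32_t hi9 = (field >> 3) & 0x1FFu;                    /* bits 11:3 */
    *clamp_hi = (hi9 & 0x100u) ? (int32_t)hi9 - 512          /* sign-extend 9 bits */
                               : (int32_t)hi9;
    *clamp_lo = lo_table[field & 0x7u];                      /* bits 2:0 */
}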
Some SCB registers need to be programmed based on the security state of the processor:
SCB->CPACR (Coprocessor Access Control Register) - The CPACR register specifies the access privileges for coprocessors.
SCB->NSACR (Non-secure Access Control Register) - The NSACR register defines the Non-secure access permissions for both the FPU and coprocessors CP0 to CP7.
SCB->CPPWR (Coprocessor Power Control Register) - Applicable for coprocessors and not for the CDE logic, since the CDE logic shares its power domain with the CPU.
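A minimal sketch of this enablement is shown below, assuming CMSIS register definitions for the Cortex-M33 and that the CDE datapath is mapped to coprocessor 0 (consistent with <coproc> = 0 above); the function name cde_enable_cp0 is hypothetical and the exact bit values should be confirmed against the device documentation.

#include <stdint.h>
/* Include the device header that provides the CMSIS core (SCB) definitions
 * for this part; "device.h" below is a placeholder name. */
#include "device.h"

static void cde_enable_cp0(void)
{
    /* CPACR[1:0] = 0b11: full access to coprocessor 0 for both privilege levels. */
    SCB->CPACR |= (3UL << 0);

    /* From the Secure state only: NSACR bit 0 = 1 permits Non-secure access to CP0. */
    SCB->NSACR |= (1UL << 0);

    __DSB();   /* ensure the register writes complete */
    __ISB();   /* before any CDE instruction is issued */
}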
The 'popcount' block in the data flow diagram of the BNN operation calculates the number of 1s in the 16-bit signal that is input to the block.