SPRUIG3 User guide

SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

5.1 Load Operations

VCOP’s fundamental load operation fetches a vector of 8 consecutive elements from memory into the 8 lanes of a vector register. In memory, the elements are 8-, 16-, or 32-bit signed or unsigned values; they are sign- or zero-extended into the 40-bit lanes of the register. On C7x, the migration tool models VCOP’s 40-bit lanes as 32-bit lanes, and a vector register contains 16 of these lanes. (In 8-way SIMD mode only 8 of these 16 lanes are used; in 16-way SIMD mode all 16 are used.)
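The basic load described above can be modeled on a host machine in a few lines of C++. This is only an illustrative sketch: the names kLanes, Vec8, and model_load8 are hypothetical and do not appear in the migration tool's headers, and the 32-bit lane width follows the C7x model described above rather than VCOP's 40-bit lanes.

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

// Hypothetical host-side model; not part of the migration tool.
constexpr int kLanes = 8;                        // 8-way SIMD
using Vec8 = std::array<std::int32_t, kLanes>;   // 32-bit lanes, as modeled on C7x

// Load 8 consecutive elements of type T and widen each one to a 32-bit lane.
template <typename T>
Vec8 model_load8(const T* base) {
    static_assert(std::is_integral<T>::value && sizeof(T) <= 4,
                  "elements must be 8-, 16-, or 32-bit integers");
    Vec8 v{};
    for (int lane = 0; lane < kLanes; ++lane) {
        // Integral conversion sign-extends signed T and zero-extends unsigned T.
        v[lane] = static_cast<std::int32_t>(base[lane]);
    }
    return v;
}
```

With this model, loading from an array of std::int8_t whose first element is -1 leaves -1 in lane 0, while loading from an array of std::uint8_t whose first element is 255 leaves 255 in lane 0.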

VCOP has various additional “distribution modes” that provide for alternate data layouts in memory: for example, as the data is read in it may be oversampled (read each element multiple times), undersampled (skip elements), deinterleaved, and so on. Descriptions of the distribution modes can be found in the VCOP CPU manual or the programmer’s guide.
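To make the three layouts named above concrete, the following host-side sketch shows one plausible lane mapping for each. The function names and the factor of 2 are assumptions chosen for illustration; the exact behavior of each VCOP distribution mode is defined in the VCOP documentation, not here.

```cpp
#include <array>
#include <cstdint>
#include <utility>

using Vec8 = std::array<std::int32_t, 8>;  // hypothetical 8-lane vector model

// Oversample by 2 (assumed factor): each memory element fills two adjacent lanes.
Vec8 model_oversample2(const std::int16_t* base) {
    Vec8 v{};
    for (int lane = 0; lane < 8; ++lane) v[lane] = base[lane / 2];
    return v;
}

// Undersample by 2 (assumed factor): every other memory element is skipped.
Vec8 model_undersample2(const std::int16_t* base) {
    Vec8 v{};
    for (int lane = 0; lane < 8; ++lane) v[lane] = base[2 * lane];
    return v;
}

// Deinterleave: 16 interleaved elements are split into even and odd vectors.
std::pair<Vec8, Vec8> model_deinterleave(const std::int16_t* base) {
    Vec8 even{}, odd{};
    for (int lane = 0; lane < 8; ++lane) {
        even[lane] = base[2 * lane];
        odd[lane] = base[2 * lane + 1];
    }
    return {even, odd};
}
```

Note that the three mappings consume different amounts of memory for the same 8 lanes: 4 elements when oversampling, and 16 when undersampling or deinterleaving.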

Thus, load operations are characterized primarily by the element type in memory and distribution mode. The virtual machine has a template class called vcop_load that implements the various combinations using C7x operations. The template parameters of the class specify the type and distribution mode, along with a specification of what kind of low-level addressing to use (SE, SA, or indirect) and the number of SIMD lanes to emulate (8 or 16). The class has two methods, load() and SE_load(), which implement the load operation as specified by the template parameters. The load() method implements non-SE-based loads, and the SE_load() method implements SE-based loads.

For example, here is the translation of a VCOP load instruction using the CIRC2 distribution mode and SA-based addressing.

Kernel-C:

__vptr_uint16 in;
Vreg = in[Agen0].circ2();

translates to:

vcop_load<ushort, circ2, sa1adv>::load(Vreg, (uchar*)(tvals->p1));

The template parameters ushort and circ2 specify the data type and distribution mode. The template parameter sa1adv tells the template to use SA-based addressing, with SA1 and advancing enabled. The SIMD factor defaults to 8. The runtime argument Vreg is the destination vector register (passed by reference, since the load writes into it). The tvals->p1 expression is the base address. When the vcop_load template is specialized for ushort, circ2, and sa1adv, its load() method generates the specific C7x sequence that implements that combination.

To see the specific C7x instruction sequence that a specialized load() method generates for a given combination of type and distribution mode, examine the template specializations in the header files.
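The way template parameters select an implementation at compile time can be sketched with a toy class. Everything in this sketch is hypothetical: toy_load, the tag types linear_mode, dup2_mode, and direct_addr, and the helper read_elem are invented for illustration and are not the names or specializations found in the tool's headers. Only the overall shape (type, distribution mode, addressing, and a SIMD factor defaulting to 8, with a static load() taking a destination by reference and a byte pointer) follows the description above.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical tags standing in for distribution modes and addressing kinds.
struct linear_mode {};   // consecutive elements
struct dup2_mode {};     // each element read twice
struct direct_addr {};   // placeholder for SE, SA, or indirect addressing

template <int Simd>
using Lanes = std::array<std::int32_t, Simd>;

// Read element i of type T from a byte pointer and widen it to 32 bits.
template <typename T>
std::int32_t read_elem(const unsigned char* base, int i) {
    T value;
    std::memcpy(&value, base + i * sizeof(T), sizeof(T));
    return static_cast<std::int32_t>(value);
}

// Primary template is only declared; each supported combination is a specialization.
template <typename T, typename Dist, typename Addr, int Simd = 8>
struct toy_load;

template <typename T, typename Addr, int Simd>
struct toy_load<T, linear_mode, Addr, Simd> {
    static void load(Lanes<Simd>& dst, const unsigned char* base) {
        for (int lane = 0; lane < Simd; ++lane) dst[lane] = read_elem<T>(base, lane);
    }
};

template <typename T, typename Addr, int Simd>
struct toy_load<T, dup2_mode, Addr, Simd> {
    static void load(Lanes<Simd>& dst, const unsigned char* base) {
        for (int lane = 0; lane < Simd; ++lane) dst[lane] = read_elem<T>(base, lane / 2);
    }
};
```

A call such as toy_load<std::uint16_t, dup2_mode, direct_addr>::load(v, bytes) then resolves entirely at compile time to the dup2_mode body, with no runtime dispatch on the mode.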

The load() methods must both load the vector elements according to the distribution mode and perform sign- or zero-extension. Most combinations of type and distribution mode are covered by a single C7x instruction. A few need additional instructions to exactly mimic VCOP’s specific modes.

In general, SE-based loads, invoked via the SE_load() method, simply rely on the SE configuration as set up in the init() function (see Section 1.5) and translate to a simple access of the corresponding SE source register.

The basic setup for the SE is based on the data type. For example, signed 16-bit data uses element type __SE_ELETYPE_16BIT and __SE_PROMOTE_2X_SIGNEXT for sign-extension to 32 bits. The default vector length for 8-way SIMD is __SE_VECLEN_32BYTE, for 8 lanes of 32-bit (4-byte) data. Additional SE features used to implement the specific distribution modes are configured by the various specializations of the vcop_load template.
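The size arithmetic behind these settings can be checked at compile time. The helper names below are hypothetical and only restate the numbers given above; they are not part of the SE API.

```cpp
#include <cstddef>

// Hypothetical helpers that restate the arithmetic; not part of the SE API.
constexpr std::size_t promoted_bytes(std::size_t element_bytes, std::size_t factor) {
    return element_bytes * factor;
}

constexpr std::size_t vector_bytes(std::size_t lanes, std::size_t lane_bytes) {
    return lanes * lane_bytes;
}

// 16-bit (2-byte) data promoted 2x fills a 32-bit (4-byte) lane.
static_assert(promoted_bytes(2, 2) == 4, "2x promote of 16-bit data gives 32-bit lanes");

// 8 lanes of 4 bytes each give the 32-byte vector length used for 8-way SIMD.
static_assert(vector_bytes(8, promoted_bytes(2, 2)) == 32, "8-way SIMD vector is 32 bytes");
```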

C7x can speculatively load without risk of faulting. Therefore, most load sequences simply load a full C7x vector’s worth of data, that is, 16 lanes of 32-bit values. In 8-way SIMD mode, the extra values are simply unused.
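The idea of loading all 16 lanes and ignoring the upper 8 can be modeled on a host as follows. This is a sketch with hypothetical names (Vec16, model_full_load, sum_active). Unlike the C7x behavior described above, ordinary host code must not read past the end of a buffer, so this model assumes the caller supplies 16 readable elements.

```cpp
#include <array>
#include <cstdint>

using Vec16 = std::array<std::int32_t, 16>;  // hypothetical full-width vector model

// Always fill all 16 lanes, regardless of the SIMD mode in use.
// Assumption: base points to at least 16 readable elements on the host.
Vec16 model_full_load(const std::int16_t* base) {
    Vec16 v{};
    for (int lane = 0; lane < 16; ++lane) v[lane] = base[lane];
    return v;
}

// Consume only the active lanes: 8 in 8-way mode, 16 in 16-way mode.
std::int32_t sum_active(const Vec16& v, int simd) {
    std::int32_t total = 0;
    for (int lane = 0; lane < simd; ++lane) total += v[lane];
    return total;
}
```

Here the load itself is identical in both modes; only the consumer decides how many lanes matter.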