SPRUIG3 User guide

SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

5.2 Store Operations

As with loads, VCOP’s store operations are characterized by data type and distribution mode. Beyond distribution mode, there are two primary considerations when translating VCOP stores for C7x: packing and lane masking.

Packing is the opposite of sign extension. The source data in C7x registers is 32 bits wide, so when storing to 16- or 8-bit element types, each element must be truncated to that size. The C7x has direct instruction support for such packing stores. Packing depends on element size but not on signedness: signed and unsigned types generally use the same sequence, so there are three fundamental translation sequences for each mode, corresponding to 8-bit, 16-bit, and 32-bit data.
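The truncation described above can be sketched in portable C++. This is only an illustrative model, not C7x code: the names pack_store_8, pack_store_16, and kLanes are hypothetical, and a std::array of 32-bit values stands in for a vector register. It shows why signedness does not matter for plain packing: the stored bits are the low bits of each lane either way.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model: 16 lanes of 32-bit data stand in for one vector register.
constexpr std::size_t kLanes = 16;

// Packing store to 8-bit elements: keep only the low 8 bits of each lane.
// No rounding or saturation is applied; this is plain truncation.
inline void pack_store_8(const std::array<std::uint32_t, kLanes>& vreg,
                         std::uint8_t* dst) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        dst[i] = static_cast<std::uint8_t>(vreg[i] & 0xFFu);
    }
}

// Packing store to 16-bit elements: keep only the low 16 bits of each lane.
inline void pack_store_16(const std::array<std::uint32_t, kLanes>& vreg,
                          std::uint16_t* dst) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        dst[i] = static_cast<std::uint16_t>(vreg[i] & 0xFFFFu);
    }
}
```

A lane holding 0xFFFFFFFF is stored as the byte 0xFF whether the destination is later read as a signed or an unsigned 8-bit value, which is why a single sequence per element size suffices in this model.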

Signedness does come into play for rounding or saturation; these are covered in Section 5.3.3 and Section 5.3.4.

Unlike loads, translation of stores is sensitive to the number of lanes. While excess lanes can be safely loaded and ignored, stores must write only the number of lanes being modeled. In the default 8-way SIMD mode, that means stores are limited to 8 lanes, even though C7x vectors contain 16 lanes of (32-bit) data. There is instruction support for such partial-vector stores in some cases, but not all. In particular, there are no partial-vector packing stores. Thus, packing stores in 8-way SIMD mode that use regular indirect addressing require an explicit predicate to mask off the store of unused lanes. This lane-masking predicate is constant and can be computed outside the loop.
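The role of the constant lane-masking predicate can be illustrated with a small portable C++ model. It is a sketch under stated assumptions, not the C7x instruction sequence: LaneMask, make_lane_mask, and masked_pack_store_16 are hypothetical names, and the predicate is modeled as an array of booleans. The mask is built once, as it would be outside the loop, and the store then skips the upper 8 lanes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorLanes = 16;  // 32-bit lanes in the modeled vector
constexpr std::size_t kSimdLanes = 8;     // lanes modeled in 8-way SIMD mode

// Hypothetical predicate representation: one flag per vector lane.
using LaneMask = std::array<bool, kVectorLanes>;

// Build the loop-invariant mask: true for the first `active` lanes only.
inline LaneMask make_lane_mask(std::size_t active) {
    LaneMask mask{};  // all lanes start disabled
    for (std::size_t i = 0; i < kVectorLanes && i < active; ++i) {
        mask[i] = true;
    }
    return mask;
}

// Predicated packing store to 16-bit elements: lanes whose mask bit is
// clear leave the destination memory untouched.
inline void masked_pack_store_16(
        const std::array<std::uint32_t, kVectorLanes>& vreg,
        const LaneMask& mask, std::uint16_t* dst) {
    for (std::size_t i = 0; i < kVectorLanes; ++i) {
        if (mask[i]) {
            dst[i] = static_cast<std::uint16_t>(vreg[i] & 0xFFFFu);
        }
    }
}
```

Calling make_lane_mask(kSimdLanes) once and reusing the result on every iteration mirrors the point that the predicate is constant; memory corresponding to lanes 8 through 15 is never written.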

When using SA-based addressing, the SA automatically provides lane masking based on the VECLEN flag in the SA setup vector. In this case the explicit lane masks are not needed.

The virtual machine has a template class called vcop_store that implements the various combinations in terms of C7x operations. The template parameters of the class specify the type and distribution mode, along with a specification of what kind of low-level addressing to use (SA or indirect), and the number of SIMD lanes to emulate (8 or 16). The class has two methods: store() for regular stores and store_pred() for predicated stores. For example, here is the translation of a store instruction using the DS2 distribution mode and SA-based addressing.

Kernel-C source:

__vptr_s8 out;
out[Agen1].ds2() = Vreg;

translates to:

vcop_store<char, ds2, sa0adv>::store(Vreg, (uchar *)(tvals->p2));

The template parameters char and ds2 specify the data type and distribution mode. The template parameter sa0adv tells the template to use SA-based addressing, with SA0 and advancing enabled. The SIMD factor defaults to 8. The runtime argument Vreg is the source vector register. The tvals->p2 expression is the base address.
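To make the shape of such a template concrete, here is a deliberately simplified stand-in in portable C++. It is not the virtual machine's vcop_store: the names model_store and Vec are hypothetical, the distribution-mode and addressing-mode parameters are omitted, and the signature of store_pred() (a per-lane boolean predicate) is an assumption made for illustration. It only shows how an element type and a SIMD lane count, defaulting to 8, could select the behavior of store() and store_pred().

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

constexpr std::size_t kVecLanes = 16;            // lanes in the modeled vector
using Vec = std::array<std::uint32_t, kVecLanes>;  // hypothetical register model

// Hypothetical stand-in for illustration only; NOT the real vcop_store.
template <typename T, std::size_t SimdLanes = 8>
struct model_store {
    static_assert(std::is_integral<T>::value && sizeof(T) <= 4,
                  "element type must be an 8-, 16-, or 32-bit integer");
    static_assert(SimdLanes == 8 || SimdLanes == 16,
                  "only 8 or 16 SIMD lanes are modeled");

    // Unconditional store: truncate each lane to T and write the first
    // SimdLanes elements; any remaining lanes are not stored.
    static void store(const Vec& v, T* base) {
        for (std::size_t i = 0; i < SimdLanes; ++i) {
            base[i] = static_cast<T>(v[i]);
        }
    }

    // Predicated store (assumed signature): a lane is written only if it
    // is within SimdLanes and its predicate flag is set.
    static void store_pred(const Vec& v,
                           const std::array<bool, kVecLanes>& pred, T* base) {
        for (std::size_t i = 0; i < SimdLanes; ++i) {
            if (pred[i]) {
                base[i] = static_cast<T>(v[i]);
            }
        }
    }
};
```

In this sketch, model_store<std::uint8_t>::store(v, p) writes 8 truncated bytes, while model_store<std::uint16_t, 16>::store(v, p) writes all 16 lanes as halfwords.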

Most unconditional stores using the basic store distribution modes translate to a one- or two-instruction sequence. Collating stores write a data-dependent number of elements in packed fashion and are therefore more involved.

Efficiency Warning: Collating Stores
Collating stores require a long translation sequence and may perform poorly.
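The behavior that makes collating stores costly can be seen in a scalar model. The sketch below is an assumption-laden illustration in portable C++, not the actual translation sequence: collate_store and kLanes are hypothetical names, and lane selection is modeled with a per-lane boolean predicate. Selected lanes are written contiguously, so the number of elements stored, and therefore the next output position, is known only at run time.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 8;  // lanes modeled in 8-way SIMD mode

// Scalar model of a collating store: write only the lanes whose predicate
// flag is set, packed back to back, and return how many were written.
inline std::size_t collate_store(const std::array<std::uint32_t, kLanes>& vreg,
                                 const std::array<bool, kLanes>& pred,
                                 std::uint32_t* dst) {
    std::size_t written = 0;
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (pred[i]) {
            dst[written] = vreg[i];  // output index depends on earlier lanes
            ++written;
        }
    }
    return written;
}
```

Because each output index depends on how many earlier lanes were selected, the operation does not map onto a simple fixed-pattern vector store, which is consistent with the longer translation sequence noted in the warning.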