Introduction

With the worldwide data deluge hitting the wireless and wired networks, operators are challenged to roll out networks that meet today's demand and are well positioned for the future. As a result, communications infrastructure equipment manufacturers are seeking core technology that can meet today's requirements and beyond, while working to reduce cost per bit and power consumption. TI has met that challenge with its new KeyStone multicore SoC architecture.

Devices based on the new KeyStone multicore SoC architecture include up to eight TMS320C66x DSP CorePacs, delivering unprecedented levels of fixed- and floating-point processing capability. The KeyStone architecture is designed with a highly efficient multicore memory architecture, allowing for full processing entitlement across all CorePacs while executing concurrently. This paper discusses the KeyStone memory hierarchy and how it connects to the other elements of the SoC architecture to achieve full multicore processing entitlement.

KeyStone memory architecture

TI has risen to the challenge of the multicore SoC technology. The KeyStone architecture includes the new C66x fixed- and floating-point DSP core, configurable coprocessors for optimized, standards-based functions and interfaces, tiered memory architecture, a TeraNet switch fabric, and Multicore Navigator to tie it all together. There are three levels of memory in the KeyStone architecture. Each C66x CorePac has its own level 1 program (L1P) and level 1 data (L1D) memory. Additionally, each CorePac has a local level 2 unified memory. Each of the local memories can be independently configured as memory-mapped SRAM, cache, or a combination of the two.

The KeyStone architecture includes a shared memory subsystem that consists of internal and external memory connected via the Multicore Shared Memory Controller (MSMC). The MSMC allows the CorePacs to dynamically share the internal and external memories for both program and data. The MSMC internal RAM offers flexibility to the programmer by allowing portions to be configured as a shared level 2 (SL2) RAM or a shared level 3 (SL3) RAM. SL2 RAM is cacheable only within the local L1P and L1D caches, while SL3 is additionally cacheable in the local L2 caches.

To provide a fast path for software execution, external memory is connected through the same memory controller as the internal shared memory, rather than to the chip system interconnect as has traditionally been done on embedded processor architectures. External memory is always treated as SL3 memory and is cacheable in L1 and L2. The following sections discuss the various enhancements made in the KeyStone architecture.

Memory enhancements for performance

The internal memory architectures of the C66x CorePac are the result of four primary enhancements over previous generations of the C6000™ DSPs, targeted specifically at performance entitlement and usability. These improvements are designed to achieve: 1) better execution efficiency from all levels of memory regardless of high activity by multiple cores and data I/O, 2) ease of managing cache coherency across multiple cores and data I/O, 3) memory protection and address extension, and 4) extended protection for soft errors to upper levels of memory.

**Execution efficiency** – The C66x CorePac’s memory subsystem is functionally identical to that of the latest C64x+™ and C67x™ generations of DSPs. The local L1 memories are 32 KB each and can be configured as all-cache (default), all memory-mapped SRAM, or a combination with options of 4, 8, or 16 KB of cache. L1P is always direct-mapped and L1D is always two-way set-associative.

**Fig. 2 – CorePac memory enhancements** (figure not reproduced; the surviving label reads "32 KB L1P: memory controller (PMC) with memory protect/bandwidth management")

Similarly, the local L2 memory is up to 1 MB of unified memory (initial devices announced with either 512 KB or 1 MB). This memory is also configurable as all-cache, all memory-mapped SRAM (default), or a combination with options of 32, 64, 128, 256, or 512 KB of 4-way set-associative cache.
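The cache/SRAM split options listed above can be captured in a few lines of C. The following is a minimal sketch that only encodes the sizes named in the text; the function names are hypothetical and this is not the register-level configuration interface of the device.

```c
#include <stdbool.h>

/* Cache sizes (in KB) the text lists for a 32 KB L1 memory:
 * 0 (all SRAM), 4, 8, 16, or 32 (all cache). */
static bool l1_cache_kb_valid(unsigned cache_kb)
{
    return cache_kb == 0 || cache_kb == 4 || cache_kb == 8 ||
           cache_kb == 16 || cache_kb == 32;
}

/* Cache sizes (in KB) the text lists for the local L2 memory:
 * 0 (all SRAM, the default), 32..512 in powers of two, or the
 * whole memory (all cache). total_kb is 512 or 1024 on the
 * initially announced devices. */
static bool l2_cache_kb_valid(unsigned total_kb, unsigned cache_kb)
{
    if (cache_kb > total_kb)
        return false;
    if (cache_kb == 0 || cache_kb == total_kb)
        return true;
    for (unsigned kb = 32; kb <= 512; kb *= 2) {
        if (cache_kb == kb)
            return true;
    }
    return false;
}

/* Memory-mapped SRAM left over after carving out the cache,
 * or -1 if the requested split is not one of the listed options. */
static int l2_sram_kb(unsigned total_kb, unsigned cache_kb)
{
    if (!l2_cache_kb_valid(total_kb, cache_kb))
        return -1;
    return (int)(total_kb - cache_kb);
}
```

For example, a 1 MB L2 with 256 KB of cache leaves 768 KB of memory-mapped SRAM, while a request for 48 KB of cache is rejected because it is not one of the listed sizes.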

The access path to the shared memory subsystem has been redesigned to dramatically reduce latencies to the upper levels of memory, and to sustain the same level of efficiency regardless of the level of activity in all CorePacs and data I/O.

**Level-two memory efficiency** – The local L2 (LL2) memory devices and controller are operated at a higher clock rate than in the previous generation: the C66x LL2 memory operates at a clock rate equal to the CPU clock. The higher clock rate results in faster access times, and thus a reduction in any stalling brought on by an L1 cache miss, where memory must be fetched from LL2 cache or SRAM. This change alone automatically speeds up an application when migrating from a C64x+ or C67x device, without the need to recompile for the new C66x instruction set.

Additionally, the cache coherency operations, both hidden from the user and driven by software commands, are more efficient and take fewer cycles to execute. This in turn means that automatic cache coherency operations (e.g. snoops, evictions) are less intrusive to the processor and result in fewer stall cycles. Manual cache coherency operations (e.g. global or block writeback and/or invalidate) take fewer cycles to complete, translating into less waiting time when synchronizing between CorePacs or with DMA masters while arbitrating for shared memory.
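Block writeback and invalidate operations act on whole cache lines, so software usually rounds a buffer out to line boundaries before issuing them. The sketch below shows only that rounding step; the type and function names are hypothetical, and the line size is passed in as a parameter because the text does not state the line sizes of the C66x caches.

```c
#include <stdint.h>

/* A line-aligned address range to hand to a block coherency operation. */
typedef struct {
    uintptr_t start;   /* first byte of the first cache line touched */
    uintptr_t bytes;   /* length, a multiple of the line size */
} cache_range_t;

/* Round [addr, addr + len) outward to cache-line boundaries.
 * line_bytes must be a power of two (an assumption of this sketch).
 * A zero-length buffer yields a zero-length range. */
static cache_range_t cache_align_range(uintptr_t addr, uintptr_t len,
                                       uintptr_t line_bytes)
{
    uintptr_t mask = line_bytes - 1;
    uintptr_t first = addr & ~mask;
    cache_range_t r = { first, 0 };

    if (len != 0) {
        uintptr_t end = (addr + len + mask) & ~mask; /* one past last line */
        r.bytes = end - first;
    }
    return r;
}
```

Because a partial line may also hold unrelated data, buffers that are handed between CorePacs or to DMA masters are commonly allocated line-aligned and padded to a whole number of lines, so that the rounded range never covers someone else's data.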

**Shared memory efficiency** – To improve the efficiency of execution from the shared memory, the eXtended Memory Controller (XMC) was placed within the CorePac. The XMC is the path to the MSMC for both shared internal (SL2/SL3) and external (DDR3 SDRAM) memory and builds upon the architecture implemented in prior devices that had shared Level 2 (SL2) memory (such as the TMS320C6472 DSP).

**Fig. 3 – Shared memory architecture**

On prior devices that had an SL2 memory, the access path to the SL2 was the same as the path to the LL2, with a pre-fetch buffer close to the internal interface. The pre-fetching was designed to hide access latencies to the shared RAM banks, and to optimize for code execution and read-only data accesses (write operations are fully supported). The XMC follows a similar motivation, but adds much more pre-fetching capability, so that it is now optimal for program execution and R/W data fetches with performance similar to LL2. The pre-fetching not only reduces access latencies by pulling memory ‘closer’ to the C66x DSP core prior to access, but also mitigates contention when other CorePacs and data I/O transfers are vying for the same memory resources through the MSMC.

The MSMC connects to the XMC through a 256-bit-wide bus and connects directly to four 1024-bit-wide memory banks for the internal SL2/SL3 RAM. The internal banking allows the pre-fetching logic within the XMC to pull program and data ahead of future requests on every access to the physical RAM, preventing subsequent accesses from stalling at the XMC. The MSMC directly connects to the external memory interface controller through another 256-bit interface, extending the high-bandwidth interface of the CorePac all the way to the external memory.

For external memory, the KeyStone architecture greatly enhances previous architectures by accessing it through the same path as the shared internal memory. This path is twice the width and twice the speed of that in previous devices, dropping much of the latency to the external DDR3 memory controller (through the XMC and MSMC). Whereas in previous C6000 DSPs, and many embedded processor architectures, external CPU and cache accesses were issued through the chip-level interconnect, the XMC provides a more direct and optimal path. This improves the L1/L2 cache efficiency when executing from external memory, and reduces the stalling incurred when multiple cores and data I/O are arbitrating for the external memory concurrently.

For both internal and external memory, all data I/O traffic enters the MSMC through multiple direct paths off the chip interconnect and not through the CorePac memory controllers. This allows for complete orthogonality when the data I/O accesses target memory endpoints that the CorePacs are not accessing (e.g. CorePacs executing from SL2 and data I/O to/from DDR3), and provides arbitration after the XMC pre-fetch buffers such that memory bank conflicts are hidden from the CorePacs.

The XMC also has embedded multi-stream pre-fetching buffers for both data and program pre-fetching. The program pre-fetch buffer services read requests from L1P and L2, and can pre-fetch up to 128 bytes of program data before the CPU needs them. The data pre-fetch buffer services read requests from L1D and L2. The data pre-fetching unit supports eight pre-fetching streams, and each stream can independently pre-fetch data in either the increasing or the decreasing address direction. Pre-fetching can help reduce the compulsory miss penalty for streams of data entering the DSP cores. In a multicore environment, pre-fetching can also improve performance by spreading out the bandwidth peaks. To harness the performance improvement from pre-fetching without its negative side effects, pre-fetching can be enabled or disabled per 16 MB address range.

**External memory efficiency** – In addition to the benefits brought by attaching the external memory to the MSMC, the KeyStone architecture also includes significant improvements to the external memory controller (EMIF). The KeyStone architecture supports high-performance DDR3 SDRAM memory with speeds of 1,333 MT/s and higher. The data bus is up to 64 bits wide and is configurable to 16 or 32 bits (to save on board space and power). The greater width and speed compared to previous architectures allow for the integration of multiple higher-performance cores, accelerators, and data I/O.
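To put the bus-width options in perspective, the theoretical peak transfer rate follows directly from the transfer rate and the data width. The sketch below is plain arithmetic under the figures quoted above (1,333 MT/s; 16-, 32-, or 64-bit data bus); the function name is hypothetical, and sustained throughput on a real board will be lower because of refresh, bank conflicts, and bus turnaround.

```c
#include <stdint.h>

/* Theoretical peak rate in MB/s (10^6 bytes per second).
 * Each transfer moves bus_width_bits / 8 bytes of data;
 * ECC lanes, if fitted, carry no payload and are not counted. */
static uint32_t ddr3_peak_mb_per_s(uint32_t mega_transfers_per_s,
                                   uint32_t bus_width_bits)
{
    return mega_transfers_per_s * (bus_width_bits / 8u);
}
```

At 1,333 MT/s this gives 10,664 MB/s for the 64-bit configuration, 5,332 MB/s for 32 bits, and 2,666 MB/s for 16 bits, which is the bandwidth traded away when the narrower options are chosen to save board space and power.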

**Cache coherency control** – In a multicore device or a multi-device system, data is often shared among cores as part of the processing. The KeyStone architecture offers improvements that simplify coherency management operations for the shared internal and external memories.

In the KeyStone architecture, the LL2 memory is always coherent with the L1D cache, so no special provisions for coherency management are required (though it is possible to make some performance optimizations by using L1D coherency commands). The shared memory, both SL2 and SL3, is not guaranteed by the hardware to be coherent with the L1 and L2 caches. Software must therefore control handoffs to/from data I/O pages and access to shared buffers among the cores.

To simplify this, the fence operation has been added to the CorePac as a new MFENCE instruction. When combined with a simple CPU loop, MFENCE can be used to implement a fence operation that guarantees sequential consistency between groups of read/write accesses. It can be used to synchronize memory requests to a particular endpoint that may arrive by different paths. It can also force memory accesses to occur in a particular order from the perspective of all CPUs, as some multiprocessor algorithms require. This simplifies the coherency protocol required for shared data sections.
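The handoff pattern that a fence enables can be sketched in portable C. The example below uses C11 `atomic_thread_fence` as a host-side stand-in for the ordering role that MFENCE plays on the C66x; it is an analogy under that assumption, not TI's API, and it omits the cache writeback/invalidate steps that shared memory on a KeyStone device would additionally need. All names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUF_WORDS 4

/* Stands in for a buffer placed in shared (SL2/SL3) memory. */
static uint32_t shared_buf[BUF_WORDS];

/* Handoff flag polled by the consumer; static storage starts out false. */
static atomic_bool buf_ready;

/* Producer: write the data, fence, then publish the flag.
 * The fence keeps the data writes ordered before the flag write.
 * On a C66x this is roughly where a cache writeback followed by
 * MFENCE would go (assumption, not shown here). */
static void producer_publish(const uint32_t *src)
{
    for (int i = 0; i < BUF_WORDS; i++)
        shared_buf[i] = src[i];
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&buf_ready, true, memory_order_relaxed);
}

/* Consumer: check the flag, fence, then read the data.
 * Returns false without touching dst if nothing has been published. */
static bool consumer_try_read(uint32_t *dst)
{
    if (!atomic_load_explicit(&buf_ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    for (int i = 0; i < BUF_WORDS; i++)
        dst[i] = shared_buf[i];
    return true;
}
```

The design point is that only the flag needs special handling: once the fence guarantees that the data writes have landed before the flag becomes visible, the buffer itself can be accessed with ordinary loads and stores.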

**Shared memory protection and address extension** — Memory protection has been a part of both the C64x+ and C67x DSP architectures within the internal memories (L1, L2, SL2). The KeyStone architecture extends this protection to external memory as well, and also adds additional flexibility to the internal memory protections. Additionally, the MSMC allows for extending the address space from 32- to 36-bit addressing for external memory.

Each C66x DSP is assigned a unique privilege ID (PrivID) value. Data I/O masters are assigned one PrivID, with the exception of the EDMA, which inherits the PrivID value of the master that configures it for each transfer. There are 16 total PrivID values supported in KeyStone devices. The memory protection attributes include read/write/execute access permissions for supervisor- and user-accesses individually.

**Memory protection for local memory** – The C66x CorePac extends the memory protection protocol from previous C6000 architectures by allowing a flexible mapping, under software control, of requestors to memory. All requestors of memory (C66x CPU, EDMA, Navigator, PCIe, SRIO, etc.) have an associated privilege ID. The internal memory controllers allow distinguishing between up to six distinct requestors and provisioning for ‘all others.’ As the KeyStone devices integrate more cores and many more DMA masters (both I/O and accelerator), this number is insufficient. KeyStone CorePacs allow mapping the IDs of the system masters to the IDs used within the protection logic such that the protection is tailored to the application.

**Memory protection for shared memory** – For shared memory there are multiple Memory Protection and Address eXtension (MPAX) units. C66x DSP accesses to the MSMC pass through a local MPAX within the XMC, and data I/O accesses to MSMC pass through MPAX logic within the MSMC, with separate control of internal shared memory and external memory.

The MPAX units combine memory protection and address extension into one step. The MPAX operates based on the privilege ID carried by each transaction to denote the requestor of memory, as done for the local memories. For each PrivID, the associated MPAX unit supports the definition of up to 16 memory segments within the internal shared memory and 16 memory segments within the external memory. Each segment is individually configurable and provides its own memory protection and address extension attributes. Each segment can be sized to any power of 2 ranging from 4 KB to 4 GB. The address extension increases a 32-bit address to 36 bits for the external memory space.

---

**Fig. 4 – Memory protection attributes**

<table>
<thead>
<tr>
<th>Bit</th>
<th>Meaning when set</th>
<th>Bit</th>
<th>Meaning when set</th>
</tr>
</thead>
<tbody>
<tr>
<td>SR</td>
<td>Supervisor mode may read from segment</td>
<td>UR</td>
<td>User mode may read from segment</td>
</tr>
<tr>
<td>SW</td>
<td>Supervisor mode may write to segment</td>
<td>UW</td>
<td>User mode may write to segment</td>
</tr>
<tr>
<td>SX</td>
<td>Supervisor mode may execute from segment</td>
<td>UX</td>
<td>User mode may execute from segment</td>
</tr>
</tbody>
</table>
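The six attributes in Fig. 4 amount to a small permission mask that is checked against the mode and type of each access. The sketch below illustrates that check; the bit positions, type names, and function name are assumptions made for illustration and should not be read as the hardware register layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* One flag per attribute in Fig. 4. Bit positions are illustrative. */
enum {
    PERM_UX = 1u << 0,  /* user mode may execute from segment */
    PERM_UW = 1u << 1,  /* user mode may write to segment */
    PERM_UR = 1u << 2,  /* user mode may read from segment */
    PERM_SX = 1u << 3,  /* supervisor mode may execute from segment */
    PERM_SW = 1u << 4,  /* supervisor mode may write to segment */
    PERM_SR = 1u << 5   /* supervisor mode may read from segment */
};

typedef enum { ACCESS_READ, ACCESS_WRITE, ACCESS_EXECUTE } access_t;

/* Returns true if a segment with the given permission mask allows
 * an access of the given kind from supervisor or user mode. */
static bool access_allowed(uint8_t perm, bool supervisor, access_t kind)
{
    uint8_t needed = 0;

    switch (kind) {
    case ACCESS_READ:    needed = supervisor ? PERM_SR : PERM_UR; break;
    case ACCESS_WRITE:   needed = supervisor ? PERM_SW : PERM_UW; break;
    case ACCESS_EXECUTE: needed = supervisor ? PERM_SX : PERM_UX; break;
    }
    return (perm & needed) != 0;
}
```

A code segment might, for instance, carry `PERM_SR | PERM_SX | PERM_UR | PERM_UX` so that both modes can read and execute it while any write is rejected.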

---

**Fig. 5 – MSMC Address Extension**

The address range definitions for the memory segments are flexible and can be overlapped to create non-power-of-2 segments, and to create additional memory regions.

The address extension function accommodates a larger external memory by remapping the 32-bit address to 36 bits. The address extension can map identical virtual addresses from different DSP cores to different physical addresses, and can also map multiple virtual addresses to the same physical address, which gives different semantics to the same memory. The MPAX unit isolates operating systems and applications running on different DSP cores, and it also makes shared programs easy to support.

Figure 6 shows an example of how to use MPAX to build a multicore virtual memory in the KeyStone architecture. Assuming all the cores are running the same application, the shared program and static data can be put into shared internal and external memory regions. The private variables (for example data 1) can use the same virtual address in each core, with MPAX mapping them to different physical memory locations. This allows the exact same code image to execute on multiple cores without any remapping of the external memory in software during runtime.
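The per-core mapping described above can be modelled in software as a small segment table per core. The following is a minimal sketch: the struct, the function names, and all addresses are hypothetical, and the rule that a later table entry wins when segments overlap is an assumption of this model rather than something the text states.

```c
#include <stdbool.h>
#include <stdint.h>

/* One address-extension segment: a power-of-2 window of 32-bit
 * virtual space mapped onto 36-bit physical space. */
typedef struct {
    uint32_t base;       /* virtual base, aligned to size */
    uint64_t size;       /* power of 2, from 4 KB to 4 GB */
    uint64_t phys_base;  /* 36-bit physical base, aligned to size */
} mpax_segment_t;

/* Checks the sizing and alignment rules stated in the text. */
static bool segment_valid(const mpax_segment_t *s)
{
    bool pow2 = s->size != 0 && (s->size & (s->size - 1)) == 0;

    return pow2 && s->size >= (1ull << 12) && s->size <= (1ull << 32) &&
           (s->base & (s->size - 1)) == 0 &&
           (s->phys_base & (s->size - 1)) == 0 &&
           s->phys_base + s->size <= (1ull << 36);
}

/* Translates vaddr using the first matching segment, searching from
 * the last entry down (modelling assumption for overlaps).
 * Returns false if no segment covers the address. */
static bool mpax_translate(const mpax_segment_t *segs, int count,
                           uint32_t vaddr, uint64_t *paddr)
{
    for (int i = count - 1; i >= 0; i--) {
        uint64_t mask = segs[i].size - 1;

        if (((uint64_t)vaddr & ~mask) == segs[i].base) {
            *paddr = segs[i].phys_base | ((uint64_t)vaddr & mask);
            return true;
        }
    }
    return false;
}

/* Illustrative tables: both cores share one 16 MB program window,
 * and each core gets its own 1 MB private window at the same
 * virtual address. */
static const mpax_segment_t core0_map[] = {
    { 0x80000000u, 1ull << 24, 0x800000000ull },  /* shared program */
    { 0x90000000u, 1ull << 20, 0x810000000ull },  /* private data, core 0 */
};
static const mpax_segment_t core1_map[] = {
    { 0x80000000u, 1ull << 24, 0x800000000ull },  /* shared program */
    { 0x90000000u, 1ull << 20, 0x810100000ull },  /* private data, core 1 */
};
```

With these tables, virtual address `0x90000040` resolves to `0x810000040` on core 0 and to `0x810100040` on core 1, while `0x80001000` resolves to the same physical location on both, which is the property that lets one code image run unchanged on every core.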

**Error detection and correction** – Several generations of C6000 devices, targeted at infrastructure systems, have included soft error correction within the internal memory space. The KeyStone family of devices builds on this technology to extend soft error protection through all levels of memory.

**Fig. 6 – Multicore virtual memory**

The memory controllers within the hierarchy support several levels of protection, designed to allow full correction for code sections, and detection of errors in data space. The L1P, L2, and SL2 (or SL3 internal RAM) include parity information that detects and corrects single-bit errors within a 128- or 256-bit segment of memory. Additionally, two-bit errors within the same access are detected and trigger an exception to the accessing CorePac. For data accesses, bit errors are detected only, triggering an exception. Whenever a detection or correction event occurs, information about the faulting access is stored in the memory controller for the application and/or host to use in recovery.

Soft error protection is extended to the external memory space by ECC logic. There is an 8-bit ECC code for each 64 bits of data, supported for 32- and 64-bit data width configurations as an option. Adding ECC requires an additional 4-bit DDR3 IC for 32-bit data widths (making a 36-bit interface) or an additional 8-bit DDR3 IC (making a 72-bit interface) to hold the ECC values associated with the full external data space.
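The interface widths quoted above follow from the 8-to-64 ratio of ECC bits to data bits. The sketch below works through that arithmetic and, as a cross-check, computes how many check bits a conventional Hamming-style single-error-correct, double-error-detect (SECDED) code would need. The text does not say which code the controller uses, so the SECDED part is an assumption, and the function names are hypothetical.

```c
/* Extra DDR3 data lines needed for ECC: 8 ECC bits accompany every
 * 64 data bits, i.e. one ECC bit per 8 data bits of bus width. */
static unsigned ecc_lane_bits(unsigned data_width_bits)
{
    return data_width_bits / 8u;
}

/* Total interface width once the ECC lanes are added. */
static unsigned ecc_interface_bits(unsigned data_width_bits)
{
    return data_width_bits + ecc_lane_bits(data_width_bits);
}

/* Minimum check bits for a Hamming SECDED code over data_bits:
 * the smallest r with 2^r >= data_bits + r + 1 corrects one error,
 * and one more overall parity bit detects double errors. */
static unsigned secded_check_bits(unsigned data_bits)
{
    unsigned r = 0;

    while ((1u << r) < data_bits + r + 1u)
        r++;
    return r + 1u;
}
```

For a 64-bit word the SECDED bound gives 8 check bits, matching the 8-bit ECC code per 64 bits of data; the lane arithmetic gives 4 extra bits for a 32-bit bus (36 bits total) and 8 extra bits for a 64-bit bus (72 bits total).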

**Conclusion**

The various gains in the memory architecture of the new KeyStone architecture translate directly into performance improvement, both for execution by a single core and for execution across a multicore SoC. Compared to previous generations, the improvements span all levels of memory, including local L2, shared L2/L3, and external memory. The performance improvements hold up under concurrent accesses by multiple cores, accelerators, and data I/O through efficient arbitration and pre-fetching mechanisms.

The cache coherency control improvements allow for simpler arbitration for shared data pages, both between cores and between a core and data I/O.

The implementation of memory protection and address extension gives a flexible programming model, large address reach, and protection against errant accesses.

Soft error protection at all levels of memory ensures that run-time execution is unaffected by the random soft error events that affect all embedded processors.

The advances made within the KeyStone architecture for memory performance, ease of use, and flexibility ensure that programmers can achieve full performance entitlement from the powerful new C66x DSP generation. The architecture is extensible, and provides the foundation for a family of SoCs with varying numbers of cores, accelerators, and data I/O.

For more information visit [www.ti.com/c66multicore](http://www.ti.com/c66multicore)

IMPORTANT NOTICE

Texas Instruments Incorporated and its subsidiaries (TI) reserve the right to make corrections, modifications, enhancements, improvements, and other changes to its products and services at any time and to discontinue any product or service without notice. Customers should obtain the latest relevant information before placing orders and should verify that such information is current and complete. All products are sold subject to TI’s terms and conditions of sale supplied at the time of order acknowledgment.

TI warrants performance of its hardware products to the specifications applicable at the time of sale in accordance with TI’s standard warranty. Testing and other quality control techniques are used to the extent TI deems necessary to support this warranty. Except where mandated by government requirements, testing of all parameters of each product is not necessarily performed.

TI assumes no liability for applications assistance or customer product design. Customers are responsible for their products and applications using TI components. To minimize the risks associated with customer products and applications, customers should provide adequate design and operating safeguards.

TI does not warrant or represent that any license, either express or implied, is granted under any TI patent right, copyright, mask work right, or other TI intellectual property right relating to any combination, machine, or process in which TI products or services are used. Information published by TI regarding third-party products or services does not constitute a license from TI to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property of TI.

Reproduction of TI information in TI data books or data sheets is permissible only if reproduction is without alteration and is accompanied by all associated warranties, conditions, limitations, and notices. Reproduction of this information with alteration is an unfair and deceptive business practice. TI is not responsible or liable for such altered documentation. Information of third parties may be subject to additional restrictions.

Resale of TI products or services with statements different from or beyond the parameters stated by TI for that product or service voids all express and any implied warranties for the associated TI product or service and is an unfair and deceptive business practice. TI is not responsible or liable for any such statements.

TI products are neither designed nor intended for use in safety-critical applications (such as life support) where a failure of the TI product would reasonably be expected to cause severe personal injury or death, unless officers of the parties have executed an agreement specifically governing such use. Buyers represent that they have all necessary expertise in the safety and regulatory ramifications of their applications, and acknowledge and agree that they are solely responsible for all legal, regulatory and safety-related requirements concerning such use.

TI products are neither designed nor intended for use in military/aerospace applications or environments unless the TI products are specifically designated by TI as military-grade or "enhanced plastic." Only products designated by TI as military-grade meet military specifications. Buyers acknowledge and agree that any such use of TI products which TI has not designated as military-grade is solely at the Buyer's risk, and that they are solely responsible for compliance with all legal and regulatory requirements in connection with such use.

TI products are neither designed nor intended for use in automotive applications or environments unless the specific TI products are designated by TI as compliant with ISO/TS 16949 requirements. Buyers acknowledge and agree that they are solely responsible for all legal, regulatory and safety-related requirements concerning their products and any use of TI products in such safety-critical applications, notwithstanding any applications-related information or support that may be provided by TI. Further, Buyers must fully indemnify TI and its representatives against any damages arising out of the use of TI products in such safety-critical applications.

Mailing Address: Texas Instruments, Post Office Box 655303, Dallas, Texas 75265
Copyright © 2010, Texas Instruments Incorporated