Hello. This presentation will describe the ARM Cortex-A15 MPU for the Texas Instruments Sitara devices. We start by looking at the ARM Cortex-A15 MPCore. The ARM Cortex-A15 MPCore contains one or two ARM Cortex-A15 cores, each of which can operate at up to 1.5 gigahertz.

Each A15 core has a 32-kilobyte L1 data cache and a 32-kilobyte L1 program cache. In addition, there is an L2 cache that is shared between the two A15 cores when two cores are present. The size of the L2 cache for Sitara products is up to 1 megabyte.

The ARM Cortex-A15 MPCore was developed with advanced debug and trace features, which ARM calls CoreSight. We'll talk about it later in this presentation. ARM also provides a Generic Interrupt Controller. We'll talk about how interrupts are managed in the TI implementation later in this presentation.

There is a sophisticated power management capability. We'll talk about it later as well. And then there is a bus, the AMBA bus. The AMBA bus can provide coherency, but in the Sitara family the coherency is not utilized.

Now we get a little bit into the TI implementation of the MPU subsystem in AM57 Sitara processors. In the center of the MPU, we see the ARM Cortex-A15 MPCore that we saw in the previous slide. And we see the connections of the MPU subsystem to the outside world.

On the top right-hand side, we see the connection of the trace and debug into the debug subsystem. On the left side, we see the connection of the interrupt controller into the interrupt crossbar. The interrupt crossbar is the mechanism by which Sitara processors map the many events in the device into the MPU. This is done using an interrupt crossbar that is connected to the local interrupt controller.

We see several bridges to different memories. The main one goes to the main interconnect of the device. And then we have some other connections. I will mention the L4 CFG interconnect, which enables the MPU subsystem to control all the peripherals and the hardware IPs in the device.

Each one of the Cortex-A15 cores is described here. The ARM Cortex-A15 is a superscalar architecture. It has two ALUs, two shifters, a branch unit, a multiply-and-divide unit, and a load/store unit. And then we'll talk a little bit about the NEON and the FPU.

The pipeline can support three concurrent instructions. The functional units are a full implementation of the ARMv7-A architecture, with a few more MAC instructions compared to the previous architecture. There are integer divide instructions and automatic Thumb mode.

This is the point where I want to mention two coprocessors that are actually part of the pipeline. These are the NEON-- we'll talk about the NEON in the next slides-- and the floating point engine. The pipeline in the A15 was optimized. It has a relatively long pipeline, 13 stages, until an instruction is issued. When an instruction is issued, it takes two cycles to execute integer operations, four cycles for multiply and load operations, and more cycles for NEON and floating point operations.

But the main advantage of this long pipeline is the out-of-order feature. What does this mean? It means that as instructions progress through the pipeline, the pipeline is protected: if an instruction needs a certain register, and that register is not ready, the hardware will delay the execution of that instruction until the input register is ready.

But while the hardware delays the execution of that instruction, it can execute a later instruction in the pipeline whose inputs are all already available. So the hardware can change the order of instructions to optimize execution. This is one of the best features of the ARM Cortex-A15, and it is responsible for the high throughput of the device.

So in the previous slide, we talked about the long pipeline that enables this out-of-order operation. The problem with a long pipeline is when there is a branch. When there is a branch, the pipeline is flushed. In other words, you start loading new instructions that have to move through the whole pipeline. This is why, for processors with a long pipeline, it's very important to have dynamic branch prediction.

And the ARM Cortex-A15 has dynamic branch prediction. There are two cases. One of them is loop prediction, which is relatively easy. As we can see in this slide, when there is a loop, the hardware will always assume that we go back to the start of the loop. And of course, it will be wrong only on the last iteration.

The more complex case is indirect branch prediction, where the branch target depends on register values. For this, ARM developed a mechanism and an efficient algorithm to correctly predict what the branch will be. This algorithm is based on branch history, and ARM does not reveal exactly how it works.

Now we talk about the SIMD engine, NEON. SIMD, as you know, stands for Single Instruction Multiple Data. This is a vector coprocessor, and as I said, it sits on the pipeline. So NEON instructions travel through the pipeline like any other instructions. NEON instructions themselves operate on either 64-bit or 128-bit data. As I said, the NEON is fully integrated into the pipeline.

There are 32 registers, each 64 bits, that can be paired as 128-bit registers. The data in those registers can be interpreted as either bytes, half-words, words, or longs. So if the data is bytes and the NEON operates on 128 bits, in one operation the NEON will process 16 bytes, because 128 divided by 8 is 16. If the data is half-words, then in one operation the NEON can add or subtract eight pairs of half-words. Likewise, for words, or 32-bit data, it processes four pairs, and for longs, two pairs.
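To make those lane counts concrete, here is a toy Python model, not NEON code, of a 128-bit SIMD add: the same 128 bits are processed as 16 byte lanes, 8 half-word lanes, and so on, depending on the chosen element width.

```python
# Toy model of a 128-bit SIMD lane-wise add (conceptual, not NEON code).
# A "register" is just 128 bits held as a Python int; `lane_bits` selects
# the element width: 8 (byte), 16 (half-word), 32 (word), or 64 (long).

def simd_add(a, b, lane_bits):
    """Add two 128-bit values lane by lane, wrapping within each lane."""
    lanes = 128 // lane_bits          # 16, 8, 4, or 2 lanes
    mask = (1 << lane_bits) - 1
    result = 0
    for i in range(lanes):
        shift = i * lane_bits
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift   # wrap-around per lane
    return result

# 16 byte lanes: each byte of `a` is 1, each byte of `b` is 2.
a = int.from_bytes(bytes([1] * 16), "little")
b = int.from_bytes(bytes([2] * 16), "little")
print(simd_add(a, b, 8).to_bytes(16, "little"))   # 16 bytes, each 0x03
```

Note how the lane width also decides where carries stop: in 8-bit lanes an overflow wraps within the byte instead of rippling into the neighbor, which is exactly why reinterpreting the same register changes the arithmetic.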

One big advantage of the NEON is the load and store instructions that can do on-the-fly interleaving. What is on-the-fly interleaving? We can see it in this diagram, which is taken from an ARM document. You see the document name here.

And what it does is this: when the registers are loaded from or stored to memory, the NEON can rearrange the bytes in the registers. This is very efficient, for example, for video, where the data can have three components, x, y, and z. The application wants to work separately on the x, the y, and the z components. With one instruction, on the fly, we extract all the x's into one register, all the y's into a different register, and so on.
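The deinterleaving described above can be sketched in a few lines of Python; this is a conceptual model of what a NEON structure load such as VLD3 does in hardware, not actual intrinsics:

```python
# Toy model of an on-the-fly deinterleaving load: a packed stream of
# x0,y0,z0,x1,y1,z1,... triples is split into three separate
# "registers", one per component, in a single pass.

def deinterleave3(packed):
    xs = packed[0::3]   # every 3rd element starting at x0
    ys = packed[1::3]   # every 3rd element starting at y0
    zs = packed[2::3]   # every 3rd element starting at z0
    return xs, ys, zs

pixels = [10, 20, 30, 11, 21, 31, 12, 22, 32]   # (x, y, z) triples
xs, ys, zs = deinterleave3(pixels)
print(xs, ys, zs)   # [10, 11, 12] [20, 21, 22] [30, 31, 32]
```

The point of doing this in the load unit is that the rearrangement costs no extra instructions: the data arrives in the registers already separated by component.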

In addition to the NEON, another coprocessor that sits on the pipeline, fully integrated into the main pipeline, is the vector floating point, sometimes called the floating point unit. The vector floating point has 32 double-precision registers for floating point operations; double precision in floating point is 64 bits.

And the FPU supports in hardware all the IEEE-defined floating point operations and rounding modes. You might remember them: rounding to minus infinity, to plus infinity, to zero, whatever you want. It supports a set of what we call fused MAC operations. In a MAC operation, the hardware can round either only after the addition or after both the multiplication and the addition; both variants are supported. And it supports half precision per IEEE 754, which is 16 bits: 1 sign bit, 5 exponent bits, and 10 mantissa bits.
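The half-precision layout just mentioned, 1 sign bit, 5 exponent bits, 10 mantissa bits, can be checked with Python's standard library, which can pack a float into the IEEE 754 binary16 format:

```python
import struct

# Pack a float into IEEE 754 half precision: 1 sign bit, 5 exponent
# bits, 10 mantissa bits (16 bits total). "<e" is little-endian binary16.
def to_half_bits(x):
    return int.from_bytes(struct.pack("<e", x), "little")

# 1.0 -> sign 0, exponent 01111 (bias 15), mantissa 0 -> 0x3C00
print(hex(to_half_bits(1.0)))    # 0x3c00
# -2.0 -> sign 1, exponent 10000, mantissa 0 -> 0xC000
print(hex(to_half_bits(-2.0)))   # 0xc000
```

Reading the 16 bits back out confirms the field widths: the top bit is the sign, the next five are the biased exponent, and the low ten are the mantissa.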

Moving on to the memory management unit: like many other processors, the MPU has a Memory Management Unit, or MMU, that translates logical addresses into physical addresses. The advantages of an MMU are well known. It protects users in a multi-user environment from each other. It protects the operating system. And the hardware manages the actual memory.

The MPU supports large physical addressing, which is basically 40 bits, supporting 1 terabyte. But the TI implementation is limited to 8 gigabytes of external memory. The page size is programmable. Page size is a tradeoff: a smaller page size requires more computation to allocate pages, but a larger page size wastes more memory.

The TI implementation of the MMU enables the user to define the page size down to 4 kilobytes, which is the smallest page size. There is a three-level translation structure for these, and for the large Linux huge pages, which are 2 megabytes in size, there are two levels of translation.
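To make the level counts concrete, here is a small Python sketch of how an address splits into table indices, based on the generic ARM LPAE 4 KB-granule layout rather than TI-specific values: three 9-bit table indices plus a 12-bit offset for a 4 KB page, and only two levels when translation stops at a 2 MB block.

```python
# Sketch of virtual-address splitting under the ARM LPAE 4 KB granule:
# 9 bits of table index per level, 12-bit page offset.
#   4 KB page : L1[38:30] -> L2[29:21] -> L3[20:12] -> offset[11:0]  (3 levels)
#   2 MB block: L1[38:30] -> L2[29:21] -> offset[20:0]               (2 levels)

def split_4k(va):
    return ((va >> 30) & 0x1FF,   # level-1 table index
            (va >> 21) & 0x1FF,   # level-2 table index
            (va >> 12) & 0x1FF,   # level-3 table index
            va & 0xFFF)           # offset within the 4 KB page

def split_2m(va):
    return ((va >> 30) & 0x1FF,   # level-1 table index
            (va >> 21) & 0x1FF,   # level-2 table index
            va & 0x1FFFFF)        # offset within the 2 MB block

va = (1 << 30) | (2 << 21) | (3 << 12) | 0xABC
print(split_4k(va))   # (1, 2, 3, 0xabc)
print(split_2m(va))   # (1, 2, 0x3abc)
```

This also shows the tradeoff mentioned on the slide: a 2 MB page skips one whole level of lookup, at the cost of allocating memory in much larger chunks.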

The translation is sped up by using a Translation Lookaside Buffer, or TLB. Basically, it's a cache that remembers recently used address translations. This translation cache is split between an L1 TLB and an L2 TLB to support the MMU.

So this slide illustrates how the MMU works. When the MPU issues a logical address, the MMU first looks in the TLB to see if the translation is already cached. If it is, the MMU uses it. If it is not, the MMU walks through the translation tables in memory to find the mapping and then sends the physical address back to the MPU.
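The lookup flow on this slide can be sketched as a toy Python model, where a dictionary stands in for the in-memory translation tables:

```python
# Toy model of the MMU lookup flow: check the TLB cache first, and only
# walk the (simulated) in-memory page tables on a miss.

PAGE = 4096  # 4 KB pages

class ToyMMU:
    def __init__(self, page_tables):
        self.page_tables = page_tables  # dict: virtual page -> physical page
        self.tlb = {}                   # cache of recent translations
        self.walks = 0                  # how many table walks were needed

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE)
        if vpage not in self.tlb:       # TLB miss: walk the tables
            self.walks += 1
            self.tlb[vpage] = self.page_tables[vpage]
        return self.tlb[vpage] * PAGE + offset

mmu = ToyMMU({0: 7, 1: 3})
print(mmu.translate(0x0010))   # virtual page 0 -> physical page 7
print(mmu.translate(0x0020))   # same page: served from the TLB
print(mmu.walks)               # only one table walk so far
```

The second access to the same page never touches the tables, which is the whole point of the TLB: the expensive multi-level walk happens once per page, not once per access.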

One advantage of this MMU that ARM developed is the ability to support three privilege levels. User mode is for individual applications. Supervisor mode is for the operating system. And then there is hypervisor mode, which basically controls multiple operating systems. This is the ability of the device to build separate virtual machines: the MMU of the ARM Cortex-A15 allows the device to behave like multiple virtual machines.

The translation is done in two stages. First, each operating system's supervisor translates the logical address to an intermediate physical address. Then the hypervisor translates the intermediate physical address to the real physical address for each operating system.
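These two stages can be sketched as two chained translations in Python; this is a conceptual model, assuming the guest OS owns the stage-1 tables and the hypervisor owns stage 2:

```python
# Toy model of two-stage translation: the guest OS maps virtual pages to
# intermediate physical pages (stage 1), and the hypervisor maps those
# intermediate pages to real physical pages (stage 2).

PAGE = 4096

def translate_two_stage(vaddr, stage1, stage2):
    vpage, offset = divmod(vaddr, PAGE)
    ipage = stage1[vpage]      # stage 1: guest virtual -> intermediate physical
    ppage = stage2[ipage]      # stage 2: intermediate -> real physical
    return ppage * PAGE + offset

guest_tables = {0: 5}          # guest believes its page 0 lives at IPA page 5
hypervisor_tables = {5: 9}     # hypervisor placed IPA page 5 at real page 9
print(translate_two_stage(0x0123, guest_tables, hypervisor_tables))
```

The guest only ever sees intermediate addresses, so two guests can both "own" the same intermediate page numbers while the hypervisor keeps their real memory completely separate.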

The MPU subsystem has a small interrupt controller. The interrupt controller can support up to 160 events. In Sitara devices, there are many more events, so the device has an interrupt crossbar, which we see on the left side. The interrupt crossbar has 480 events, and it can map each of them onto one of the 160 inputs of the local interrupt controller. This interrupt controller can then interrupt each one of the ARM A15 cores.
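The routing just described, 480 device events funneled into 160 interrupt-controller inputs, can be sketched as a small Python model; the class and method names here are illustrative only, not the actual TI crossbar programming interface:

```python
# Toy model of an interrupt crossbar: each of the interrupt controller's
# 160 inputs can be programmed to select one of 480 device events.
# (Illustrative only -- not the real TI register interface.)

NUM_EVENTS = 480   # device-level events feeding the crossbar
NUM_IRQS = 160     # inputs on the local interrupt controller

class ToyCrossbar:
    def __init__(self):
        self.select = [None] * NUM_IRQS   # per-IRQ event selection

    def route(self, event, irq):
        """Program IRQ input `irq` to follow device event `event`."""
        assert 0 <= event < NUM_EVENTS and 0 <= irq < NUM_IRQS
        self.select[irq] = event

    def fire(self, event):
        """Return which IRQ inputs see this event (usually one or none)."""
        return [irq for irq, sel in enumerate(self.select) if sel == event]

xbar = ToyCrossbar()
xbar.route(event=321, irq=42)   # map device event 321 onto IRQ input 42
print(xbar.fire(321))           # [42]
print(xbar.fire(5))             # [] -- an unrouted event reaches no IRQ
```

The key property the model captures is that the selection is per IRQ input: software chooses which 160 of the 480 events the MPU can actually see at any one time.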

And as we said, that way the ARM can respond to any one of the peripherals and IPs in the device. It can respond to software interrupts generated by the ARM core itself, or by external masters through the bus interface. There are several types of interrupts: there are virtual and physical interrupts. And of course, there are interrupts going out of the core that can be distributed to other CPUs in the system.

We mentioned that the ARM Cortex-A15 MPU has advanced power management, in the sense that there are multiple power domains inside the ARM MPCore, which means that the user can shut down parts of the MPCore that are not used. In addition, ARM provides extremely fast state save and restore, which basically speeds up going to sleep and waking up from hibernation. This enables the processor to go into hibernation when there is a pause or a gap in the processing.

There is a small 32-entry loop buffer that can be used for short loops of up to 32 instructions. If the device runs from the loop buffer, the fetch stage and some of the decode pipeline stages are disabled, again to save power.

Debug and trace options are extremely important when we have a system on chip, because there is no external visibility into the ARM Cortex-A15 MPU. So we trust the hardware to tell us what's going on inside. The PMU, or Performance Monitoring Unit, is a set of counters that gather statistics about various processor and memory events. So we can get statistics on how long certain things took, how many times each one of the cores accessed a certain memory, et cetera.

In addition, ARM has what they call the System Trace Macrocell, or STM, which is a tracing facility. Now, when we say tracing, there are several things associated with it. First of all, there should be logic that determines when to start and when to finish the tracing. Then there is logic that determines what is traced. Is it instructions? Is it memory locations, et cetera?

And then there is the question of how the trace information goes out: to memory, or sent online to some device over peripherals. All of these are defined by the System Trace Macrocell, for which ARM basically publishes a specification of the hardware and of the software that drives it.

Last but not least, ARM has the Embedded Cross Trigger, or ECT. The Embedded Cross Trigger enables an event on one core to trigger an event on another core. This enables measuring, for example, how many cycles it takes for a message that goes from one core to be received by a second core.

We are getting close to the end. I wanted to show the AM572 device architecture. You will see that there are two MPUs at the top left of the slide. What you see here is the richness of the IPs, the peripherals, and the coprocessors that are part of the AM572. We have a display subsystem. There are lots of processors: the IPU, the IVA, the GPU, and of course the DSP and the MPU.

We have system modules. We have lots of connectivity that is supported, lots of serial interfaces, lots of program and data storage. The AM572 is an extremely versatile processor that can fit into many types of applications.

For more information, please refer to the documentation at the URLs that are shown. For questions regarding topics covered in this training, or any other training, there is a set of forums at the TI Engineer to Engineer, or E2E, community website, where you can post your question or search thousands of existing questions and answers. Thank you so much.