Building an Industrial ARM: Processors, interconnects, and memory
[MUSIC PLAYING]
Hello. This is Jian Wang. I'm a system architect with the TI Sitara processor business. In today's presentation, I want to cover the second half of our series, Building an Industrial ARM, where we talk about the architectural considerations and differentiations we put into the AM65 devices specifically for industrial applications. In the first part of the presentation, my colleague Ron Burkett walked us through some real-life examples of industrial systems, specifically a PLC controller processor, where he discussed the challenges around processor delays, freedom from interference, real-time applications, and industrial networking.
In this part of the presentation, we want to continue that discussion and focus specifically on the AM65 devices, mapping those challenges to the architecture and showing how we solve them. We follow a similar flow, covering the cores, the memory subsystems, the interconnect and quality of service, and finally a few major IPs we put in for industrial applications.
On processor cores, Ron described earlier that as control systems become very complex, the demands on the processor cores become very high. In the AM65 devices, we put in 64-bit Cortex-A53 cores from ARM, which serve as the main CPUs. We fully implemented IO coherency to the on-chip memories and to the DDR, and we enabled the SIMD and floating-point extensions. The virtualization support is integrated with the rest of the chip's virtualization architecture.
In addition to the standard ARM offerings, we also enhanced the cores for industrial-specific requirements. We have ECC protection on the L1 data cache and the L2 cache, and parity protection for the L1 instruction cache, the L1 data cache tag RAM, and the TLBs. There is no error protection on the branch predictor and the L2 victim buffers.
We also took a unique approach: instead of a single quad-core cluster, we put in two independent dual-core (MP2) clusters. The main purpose is to isolate the power and clocking paths for each CPU cluster, so we can ensure freedom from interference in mission-critical applications. However, that does not sacrifice multi-core coherency, because the interconnect outside the clusters ensures coherency between them. So for systems where a fully coherent quad-core CPU is still needed, the SoC behaves just like an MP4 cluster.
In addition to the main processors, we also integrated a dual-core Cortex-R5 MCU cluster. The R5 cluster sits on an isolated power and clocking island, which we call the safety island, and it can run in either split mode or lockstep mode. The purpose is to have an integrated safety processor on the same SoC.
Now let's take a look at the memory architecture of the device. Earlier, Ron mentioned that real-time determinism requirements demand a larger on-chip memory. That's why on this device we put in a two-megabyte high-performance on-chip SRAM, which is connected to the coherent interconnect. We also have a 32-bit DDR controller, which supports DDR4 and DDR3L as well as LPDDR4, and which is also connected to this coherent interconnect. The two memory subsystems, along with the coherent interconnect, form the high-performance interconnect network, which is fully coherent with the CPU cores. So any access coming into the SRAM or the DDR will be fully snooped by the CPU caches.
Both the SRAM and the DDR are shared resources within the device; they can be accessed by multiple masters simultaneously. To understand how the arbitration and routing work, we have to look at the MSMC, which stands for multi-core shared memory controller. It is basically the data-movement backbone between the high-performance memories, the system masters, and the CPUs. The SRAM is built as part of the MSMC subsystem, where we have two banks of SRAM totaling two megabytes.
Routing-wise, the MSMC is based on threads. There are a total of three threads within the MSMC, and each has its own credits and [INAUDIBLE]. By design, Thread 2-- called T2-- has a higher priority than T0 and can preempt it. So the idea is that you put background traffic on T0 and the more critical traffic on T2, and you are guaranteed T2 can preempt T0, but not the other way around. As for the traffic classes, the cache controller utilizes T0, while T1 is for master access from the CPU to the rest of the SoC.
There are two gateway ports from the rest of the SoC into the coherent interconnect: North Bridge 0 and North Bridge 1, marked as NB0 and NB1 on the diagram. Different classes of traffic can go through the North Bridges, and there are MMR controls for the user to program which traffic goes to T0 and which goes to T2. North Bridge 0 is the primary connection to the SRAM address space, and North Bridge 1 is connected to the DDR address space.
The DDR EMIF controller supports five classes of priorities, which we call classes of service-- CoS. They are based on read/write operations and on priorities. As an example, let's look at how a high-priority transaction is handled relative to a low-priority transaction. In the FIFO of the EMIF controller, the execution of commands can be out of order, and the read data coming back can also be out of order. The EMIF controller guarantees that high-priority transactions can proceed even when low-priority transactions are blocked, never the other way around.
Of course, we also have to maintain consistency. The bridge between the MSMC and the DDR maintains data coherency. As a specific example, if you have a high-priority transaction that depends on a low-priority transaction due to an address conflict, that high-priority transaction will be blocked until the low-priority transaction completes.
On the next slide, we look from the user's point of view at how the MSMC threads map to the DDR classes of service. There are two bridges within the SoC, and each bridge has MMRs that map each thread-- T0 or T2-- to one of the five DDR classes of service: high-priority read, variable-priority read, low-priority read, normal-priority write, or variable-priority write.
There is also an override capability, where you can define a range of RouteIDs. Each master has a unique RouteID-- it's similar to what we used to call the MasterID. Through an MMR, you can define a range of RouteIDs whose transactions override the priority coming from the bus. So those are the kinds of features on the DDR subsystem.
After the coherent interconnect and the main memory subsystems, let's look at the rest of the system interconnect beyond the coherent fabric. There are four parallel paths from the rest of the system interconnect into the coherent fabric: two of them are dedicated to the SRAM, and the other two are for the DDR. They are totally separate buses.
In the table below, we give some raw bandwidths for budgetary purposes, for each of the masters in the system and for some of the key slaves. Notice the last row of the master table, where we have a C-bus ×4, which is coming from the NAV subsystem to the coherent interconnect-- shown as four arrows in the diagram. Each one of them is a 128-bit bus running at 250 MHz, which gives each one 4 GB/s of bandwidth capability.
Let's take an overview of the logical composition of the overall interconnect topology. The whole interconnect can be logically regarded as five fabrics. First is the coherent fabric, which includes the CPUs, the SRAM, the MSMC, and the DDR-- all of the high-performance and, of course, coherent memory.
Then there is a deterministic-latency fabric, which carries all the mission-critical interfaces, including the ICSSG-- which is for industrial networking-- the GPU, and the PCIe. Then there is a high-bandwidth, high-performance fabric: the camera interface, the display, and the crypto are all on this fabric.
Then comes the rest of the connectivity, with all the slaves and normal subsystems connected to the main SoC interconnect. Finally, there is the MCU island, which, as we mentioned earlier, is a fully isolated fabric-- including power and clock-- and it also includes a small wake-up domain. There are paths between the island and the main domain, but they can be isolated logically as well.
To further understand what an end-to-end class-of-service configuration could look like for a real system, let's look at an example use case: an integrated PLC. In a PLC application, you would typically have industrial networking traffic-- basically a two-port switch, which can carry mixed real-time and non-real-time traffic. In this specific case, we also have different classes of service from the A53 cores: CPU accesses to the SRAM, CPU accesses to the DDR, and of course the CPU cache activities.
On the network side, the ICSSG, which is the industrial Ethernet switch, can have real-time Ethernet traffic from the fieldbus, marked here in red. Then you can have low-latency TSN data fetching for the Time-Aware Shaper, which can come either into the SRAM or into the DDR. And then you can have regular communication traffic, which is non-real time. In this case, you can see we can completely isolate the real-time and non-real-time traffic by physical path: the real-time traffic goes to the SRAM through the orange path, and the background traffic goes to the DDR through the blue path.
As we mentioned earlier, the selection of each path depends on two factors. The first is the destination address: SRAM accesses use these two paths, while DDR accesses use the two green ones. The second is the OrderID: for each DMA transaction coming from the ICSSG, you can program an OrderID group, and the OrderID group determines which path it takes. That gives us physical isolation between real-time and non-real-time traffic going to the same memory endpoints.
Let's take another look at a PCIe-based system. In this case, it could typically be a PLC host or a motion controller, where you still have the CPU traffic-- cache activities, SRAM accesses, or DDR accesses. In addition, you have PCIe traffic, where you have different virtual channels in the PCIe configuration. You can map, as an example, VC3 to high priority, VC1 to medium priority, and VC0 to low priority, where the low-priority traffic goes to the DDR and the high-priority traffic goes to the SRAM. And we can physically separate the L3 paths by selecting the OrderID groups, which determine which interconnect path each one takes.
In addition, you can have networking, where the background communication traffic goes into the DDR. This demonstrates how, in a PCIe-based system, the VCs are mapped into different physical paths to the memory endpoints.
Let's take another look with one of the multimedia subsystems. In this case, we look at an HMI-- human machine interface-- application, where a display is needed in addition to industrial networking. Here we still have the A53 traffic going to the SRAM, the DDR, and the cache activities, and we still have the industrial Ethernet, where the real-time traffic goes to the SRAM and the non-real-time traffic goes to the DDR through the DDR low-priority path.
In addition to that, you also need the display. The display is a real-time output endpoint, and it needs to constantly DMA the display buffer from the DDR. Without proper priority, you could starve the display buffer and get underflow conditions. So to optimize the system, we can put the Ethernet background traffic on the DDR low-priority port and the display on the DDR high-priority port, even though both are accessing the DDR. Again, these are configured through the OrderID configurations on each of the paths.