KeyStone I training: NETCP - packet accelerator (PA)
This NETCP Packet Accelerator (PA) module takes a look at the motivation behind the PA; the hardware, firmware, and low-level drivers; and a programming use case.
Hello, and welcome to the KeyStone Training for the Network Coprocessor Packet Accelerator. We will begin by looking at some applications for the Packet Accelerator, followed by a look at the hardware modules that make up the Packet Accelerator. Then we will move on to the firmware, followed by an examination of the PA low level driver. Lastly, we will take a look at a programming example using the PA and the PA LLD.
Let's get started with the applications. The purpose of the Packet Accelerator is to save the host DSP cycles. The Packet Accelerator does this mainly by classifying layer 2, layer 3, layer 4, and other custom received packets before they reach the host DSP. The Packet Accelerator also saves DSP time by performing checksums on packets before they are transmitted. The Packet Accelerator also supports multicasting packets to multiple queues so that, for example, each core can be assigned a different queue to watch and receive its own copy of a packet. The Packet Accelerator has access to 8,192 queues in the Multicore Navigator and through these queues has access to the Security Accelerator for IPsec, SRTP, Air Cipher, and other supported encryption schemes.
Now that we have covered the applications, let's move on to a high-level overview of the Packet Accelerator. The PA contains six engines, a packet ID manager, and a statistics module. Each of these six engines is responsible for a different task. Let's take a look at each of these modules at a high level to get a general idea of the function of each module within the PA.
Let's begin with a look at the engines. The engines can be broken into two rough groups, the classify engines and the modify/multi-route engines. The classify engines are typically used for classifying receive packets. The modify/multi-route engines are a little more flexible and can be used during the transmit process, the receive process, or for general PA system tasks.
Let's take a closer look at the classify engines. The PA has one layer 2 classify engine, which is responsible for matching layer 2 header fields, including MAC, VLAN, and LLC/SNAP headers. The PA has two layer 3 classify engines, which are responsible for matching layer 3 header fields, including IPv4, IPv6, and other custom layer 3 headers, including [INAUDIBLE]. The layer 3 classify engines can also find SPI headers and send packets through the Multicore Navigator to the SA for decryption.
While both layer 3 classify engines can match all of the fields mentioned, layer 3 classify engine number 1 is primarily used to match the inner IP addresses of IPsec tunnels. The PA also has one layer 4 classify engine, which is used for matching layer 4 header fields, such as UDP, TCP, and other custom layer 4 headers. We'll take a much more detailed look at each of these classify engines later in the training.
Now let's take a couple minutes to discuss the modify/multi-route engines. The modify and multi-route engines can be used for a variety of tasks, such as generating IP and UDP checksums during the transmit process. Additionally, the modify/multi-route engines are also used to read the PA statistics from the statistics module and return them back to the host. Lastly, the modify/multi-route engines can be used to route packets to multiple locations, like sending a packet to multiple receive queues. Multi-routing can be used during transmit or receive processes.
Now let's take a brief look at the PA statistics module. The PA statistics module works in conjunction with the classify engines to store statistics for receive packets. The statistics are typically read and returned to the host using modify/multi-route engine number 0. Lastly, let's take a look at the packet ID manager. The packet ID manager is used by the PDSPs to allocate IDs to the packets being processed.
Now that we have completed a high-level overview, let's take a more detailed look at the Packet Accelerator hardware and firmware. Let's take a look at each of the engines, particularly the hardware and the firmware. The Packet Accelerator has many hardware blocks, some of which are grouped together as engines to perform a specific function. Let's begin with the layer 2 and layer 3 classify engines.
Each of these engines is similar in the fact that each uses a group of three hardware modules, which are a PDSP, a pass 1 lookup table, and a 16-bit timer. These PDSPs are all loaded with the classify 1 firmware image. The PDSP is a Packed Data Structure Processor, which can parse incoming data packets.
Each of the PDSPs will have a firmware image loaded into its program memory, which will control the actions of the PDSP. Using the firmware, the PDSP will submit header lookup requests to its connected pass 1 lookup table to match packet headers against the entries stored in the table. The pass 1 lookup table will return the results of the search to the PDSP, which will determine, based on its configuration, where to send the packet. The PDSP is also responsible for adding and removing entries from its pass 1 lookup table, which can store up to 64 different entries.
The PDSP uses its associated timer for various timing related tasks. PDSPs can also communicate with each other by passing data along with the processed packets over the streaming interface switch. This allows the PDSPs to communicate the status of the previous lookup and allows entries in any of the classify engines to be linked together.
What differentiates the layer 2 and layer 3 classify engines is what they store in their lookup tables. The layer 2 engine will store layer 2 entries in its lookup table. And when processing packets, the PDSP will provide the layer 2 headers from the packet that it is currently processing to its lookup table for matching. Similarly, the layer 3 entries will be added to the layer 3 lookup tables. And the PDSPs will provide layer 3 headers from the packets that they're processing for matching.
What separates layer 3 classify engine 0 from layer 3 classify engine 1 is how a lookup table entry is linked to an entry in another lookup table. For example, if there is no link to a previous entry or if it links to a layer 2 entry, then the entry will be added to the layer 3 classify engine 0 lookup table. If the layer 3 entry links to another layer 3 entry, then it will be added to the layer 3 classify engine 1 lookup table. All of this is taken care of by the PA LLD, which will be covered in more detail later in this presentation.
The layer 4 classify engine is similar to the layer 2 and layer 3 classify engines except that it has a pass 2 lookup table, which can store up to around 8,000 entries instead of the 64 entries that are allowed by the pass 1 lookup tables. The layer 4 engine also uses the classify 2 firmware image instead of the classify 1 firmware image. Now that we have covered the classify engines, let's take a look at the modify/multi-route engines.
The modify/multi-route engines each have one PDSP and a timer and use the modify firmware image. These engines do not have lookup tables. So these PDSPs do not perform lookups.
For these engines, information must be loaded into the descriptors to determine what actions need to be taken on the packet, such as determining whether a packet should be modified by performing a checksum or whether the packet should be routed to multiple locations. For the packets that need to be modified, the descriptor will also contain information about where the packet should be sent after the modification has been completed. Which modify/multi-route engine a packet will be routed to will be taken care of automatically by the PA LLD. The PA LLD will be covered in more detail later in this presentation.
Now that we have taken a detailed look at the engines in the Packet Accelerator, let's discuss the Packet Accelerator Low Level Driver. The PA LLD provides an abstraction layer between the application and the PA. The PA LLD will translate packet headers and routing requirements that are provided by the user into configuration packets and [INAUDIBLE] commands that can be understood by the PA firmware.
The PA LLD also handles linking together entries in separate lookup tables. For example, an entry in a layer 2 classification lookup table can be linked to an entry in a layer 3 classification lookup table, which in turn can be linked to an entry in a layer 4 classification lookup table. This allows a packet to be matched with a specific layer 2, layer 3, and layer 4 address combination.
For this example, if an incoming packet matches a layer 2 address but the layer 3 address does not match, then the packet will be routed according to the specified failure route. In addition, the PA LLD allows an entry in a lookup table to be linked to several entries in another lookup table. For example, a single entry in a layer 3 classification lookup table can be linked to multiple entries in a layer 4 lookup table.
Now let's take a look at how the PA LLD communicates with the host, the Multicore Navigator, and the network coprocessor. The host application will make calls to the PA LLD to do things such as adding entries to lookup tables and generating commands for CRC checksums. All of this is done by the host application and will operate on host memory.
Once the configuration packet or command data has been generated, it will be sent to the network coprocessor through the Multicore Navigator. Once in the PA, the command will be executed by the appropriate module. Once the command finishes processing, the PA may also send back a response packet to the host application using the Multicore Navigator to notify the application and the PA LLD whether or not the configuration was successful. Now let's move on to some APIs provided by the PA LLD.
The PA LLD APIs can be broken down into four main parts: the system APIs, the configuration APIs, the custom configuration APIs, and the utility functions. Let's begin with the system APIs. The system APIs provide basic system functions, such as opening and closing a PA LLD instance using the Create and Close functions, resetting and initializing the PA hardware using the Reset Control function, and downloading the PDSP firmware using the Download Image function.
Now let's take a look at the configuration APIs. The configuration APIs provide functions for generating configuration packets that will instruct the PA firmware how to add and remove entries from the lookup tables in the PA as well as functions for how to route error packets. An example of a function that will generate a configuration packet to allow the PA firmware to add an entry to a lookup table is the Add IP function, which will add an entry to the layer 3 classify engine lookup table. If the application then wanted to generate a configuration packet to allow the PA firmware to remove that entry from the L3 classify engine lookup table, then it could call the Delete Handle function. The Delete Handle function can also generate configuration packets to remove entries from the L2 classify engine lookup tables.
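To make this concrete, here is a minimal sketch of calling the Add IP function (Pa_addIp) to build a configuration packet for an IPv4 entry linked to a previously added MAC entry. The structure and function names come from pa.h in the PA LLD, but the exact Pa_addIp argument list differs slightly between PA LLD releases (some add LUT instance and index arguments), and the IP address, reply queue, and flow numbers below are placeholder values, so treat this as an illustration rather than the definitive call sequence.

#include <string.h>
#include <stdint.h>
#include <ti/drv/pa/pa.h>

/* Illustrative sketch: build a configuration packet that adds an IPv4 entry
 * linked to an existing L2 (MAC) handle. Check pa.h for the exact Pa_addIp
 * prototype in your PA LLD release. */
paReturn_t add_ipv4_entry (Pa_Handle paInst, paHandleL2L3_t l2Handle,
                           paHandleL2L3_t *ipHandle,
                           paCmd_t cmdBuf, uint16_t *cmdSize, int *cmdDest)
{
    paIpInfo_t    ipInfo;
    paRouteInfo_t matchRoute;
    paRouteInfo_t failRoute;
    paCmdReply_t  reply;

    memset (&ipInfo,     0, sizeof(ipInfo));
    memset (&matchRoute, 0, sizeof(matchRoute));
    memset (&failRoute,  0, sizeof(failRoute));
    memset (&reply,      0, sizeof(reply));

    /* Match on destination IPv4 address 192.168.1.10 (example value) */
    ipInfo.ipType      = pa_IPV4;
    ipInfo.dst.ipv4[0] = 192;
    ipInfo.dst.ipv4[1] = 168;
    ipInfo.dst.ipv4[2] = 1;
    ipInfo.dst.ipv4[3] = 10;

    /* On a match, keep the packet inside the PA and continue parsing (toward L4) */
    matchRoute.dest = pa_DEST_CONTINUE_PARSE;

    /* On a lookup failure, discard the packet */
    failRoute.dest = pa_DEST_DISCARD;

    /* Route the configuration response packet back to the host (example queue/flow) */
    reply.dest    = pa_DEST_HOST;
    reply.queue   = 900;
    reply.flowId  = 0;
    reply.replyId = 0x5678;   /* arbitrary tag echoed back in the response */

    /* Builds the configuration packet in cmdBuf; l2Handle links this L3 entry to
     * the MAC entry (pass NULL for an unlinked entry). cmdDest returns the index
     * of the PA transmit queue the packet must be pushed onto. */
    return Pa_addIp (paInst, &ipInfo, l2Handle, &matchRoute, &failRoute,
                     ipHandle, cmdBuf, cmdSize, &reply, cmdDest);
}

If the application later needs to remove this entry, the Delete Handle function is called with the returned handle to build the corresponding removal packet.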
Now let's take a look at the custom configuration APIs. The custom configuration APIs are similar to the normal configuration APIs in that they are still used to generate configuration packets for adding entries to the lookup tables, except that the configuration packets generated in this case are for custom entries. The normal configuration APIs can still be used to generate configuration packets that will allow the firmware to remove these entries from the lookup tables.
The utility functions provided by the PA LLD will create instructions that can be provided to the modify/multi-route engines. For example, the Format Transmit Route API will create a command that can be placed in the protocol-specific data section of a descriptor. Then, when the packet DMA transfers that descriptor and its linked data packet to one of the modify/multi-route engines, a UDP checksum or other checksum can be generated for the linked packet. Then the modify/multi-route engine can route that packet to another destination, such as the transmit queue for the Gigabit Ethernet switch, which will allow the packet to be transmitted over an 802.3 compliant network.
Now let's walk through a programming example using the PA and the PA LLD, starting with the basic configuration for the Packet Accelerator.
In step 1, we set up the memory for the Packet Accelerator. We do this by first calling the Get Buffer Requirements API to determine the memory requirements for the PA LLD. Using those memory requirements, we then call the Create function to create the PA LLD instance.
In step 2, we initialize the PA and load the firmware. We do this by first calling the Reset Control function with the disable input. Then we call the Download Image function to download the firmware onto the PDSPs. Lastly, we call the Reset Control function with the enable input to complete the PA initialization. At this point, the basic configuration for the PA and the PA LLD has been completed, and we can begin doing things like adding entries to the lookup tables and generating commands for checksums.
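As an illustration, here is a minimal sketch of steps 1 and 2, assuming statically allocated buffers and the firmware image symbols exported by pafw.h. The firmware symbol names, the paConfig_t fields, the CSL base address constant, and the buffer sizes shown here are typical of KeyStone I PDK releases but vary between PA LLD versions, so check pa.h and pafw.h in your PDK before using them.

#include <string.h>
#include <stdint.h>
#include <ti/drv/pa/pa.h>
#include <ti/drv/pa/fw/pafw.h>     /* PDSP firmware images shipped with the PA LLD */
#include <ti/csl/cslr_device.h>    /* device base addresses */

/* Illustrative initialization following steps 1 and 2. Real code should honor
 * the sizes and alignments returned by Pa_getBufferReq. */
Pa_Handle pa_init (void)
{
    paSizeInfo_t   sizeCfg;
    paConfig_t     paCfg;
    int            sizes[pa_N_BUFS];
    int            aligns[pa_N_BUFS];
    void          *bases[pa_N_BUFS];
    static uint8_t paMemBuf[pa_N_BUFS][512];   /* assumed large and aligned enough */
    Pa_Handle      paInst;
    int            i;

    memset (&sizeCfg, 0, sizeof(sizeCfg));
    memset (&paCfg,   0, sizeof(paCfg));

    /* Step 1: query the memory requirements and create the PA LLD instance */
    sizeCfg.nMaxL2 = 8;    /* expected number of L2 (MAC) entries */
    sizeCfg.nMaxL3 = 8;    /* expected number of L3 (IP) entries  */
    if (Pa_getBufferReq (&sizeCfg, sizes, aligns) != pa_OK)
        return NULL;

    for (i = 0; i < pa_N_BUFS; i++)
        bases[i] = paMemBuf[i];

    paCfg.initTable        = 1;
    paCfg.initDefaultRoute = 1;
    paCfg.baseAddr         = CSL_PA_SS_CFG_REGS;   /* PASS config base from the CSL */
    paCfg.sizeCfg          = &sizeCfg;
    if (Pa_create (&paCfg, bases, &paInst) != pa_OK)
        return NULL;

    /* Step 2: disable (reset), download the PDSP firmware, then enable.
     * PDSPs 0-2 run the classify 1 image, PDSP 3 runs classify 2,
     * PDSPs 4-5 run the modify image. Image symbol names come from pafw.h. */
    Pa_resetControl (paInst, pa_STATE_RESET);

    Pa_downloadImage (paInst, 0, (void *)c1_0, c1_0Size);
    Pa_downloadImage (paInst, 1, (void *)c1_1, c1_1Size);
    Pa_downloadImage (paInst, 2, (void *)c1_2, c1_2Size);
    Pa_downloadImage (paInst, 3, (void *)c2,   c2Size);
    Pa_downloadImage (paInst, 4, (void *)m,    mSize);
    Pa_downloadImage (paInst, 5, (void *)m,    mSize);

    if (Pa_resetControl (paInst, pa_STATE_ENABLE) != pa_STATE_ENABLE)
        return NULL;   /* enable failed */

    return paInst;
}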
Before we look at another example, let's take a look at the routing structure used by the PA LLD. The routing structure is used by the PA LLD configuration functions to generate commands to tell the Packet Accelerator how a packet should be routed. This information can be placed in a configuration packet to add an entry to a lookup table to tell the PA how to route a packet after matching a header.
The PA LLD provides the following fields for configuration: destination, flow ID, queue, multi-route handle, software info 0, and software info 1. The destination can be one of six possible destinations. The host destination tells the Packet Accelerator to route the packet to a receive queue. This can be any queue, and the queue number is specified in the queue field of the routing structure.
The EMAC destination tells the Packet Accelerator to route the packet to the Gigabit Ethernet switch subsystem. This will actually route the packet out of the Packet Accelerator to the transmit queue for the Gigabit Ethernet switch. The SASS0 destination tells the Packet Accelerator to route the packet to the first transmit queue for the SA.
The SASS1 destination tells the Packet Accelerator to route the packet to the second transmit queue for the SA. The discard destination tells the Packet Accelerator to discard the packet. The last destination is the continue parse destination. This tells the Packet Accelerator to route the packet inside the Packet Accelerator from one module to another.
For example, this might be used when a MAC entry has been added to a lookup table. That way when the L2 classify engine matches a receive packet, the packet will stay inside the Packet Accelerator and be routed to the L3 classify engine 0 to try to match an IP or custom L3 address. The flow ID field in the data structure is used to specify which packet DMA receive flow should be used. This field is ignored for the discard and continue parse destinations.
The queue field in the data structure is used to specify which queue should be used when sending the packet to a queue. This field is ignored for the discard and continue parse destinations. If using the EMAC destination, then the queue number that should be used is the transmit queue for the Gigabit Ethernet switch.
If the SASS0 destination is used, then the queue number specified should be the first SA transmit queue. If sending the packet to the SASS1 destination, the queue number for the second SA transmit queue should be used. If using only the Packet Accelerator, then the software info words can be given any value that is desired. If routing a packet between the Packet Accelerator and the Security Accelerator through the Security Accelerator transmit queues, like in the case of IPsec encryption, software info words must contain the information required by the Security Accelerator.
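As a small illustration, the sketch below fills the routing structure (paRouteInfo_t from pa.h) for the host destination, delivering matched packets to queue 900 through receive flow 0. The queue, flow, and software info values are examples only, and the field and constant names may differ slightly between PA LLD releases.

#include <string.h>
#include <ti/drv/pa/pa.h>

/* Minimal sketch of a host route: deliver matched packets to a host receive
 * queue through a specific packet DMA receive flow. */
void fill_host_route (paRouteInfo_t *route)
{
    memset (route, 0, sizeof(*route));

    route->dest        = pa_DEST_HOST;        /* deliver to a host receive queue        */
    route->flowId      = 0;                   /* packet DMA receive flow to use         */
    route->queue       = 900;                 /* destination receive queue              */
    route->mRouteIndex = pa_NO_MULTI_ROUTE;   /* no multi-route for this entry          */
    route->swInfo0     = 0x12345678;          /* free-form when the SA is not involved  */
    route->swInfo1     = 0;
}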
Now let's take a look at how to set up the receive configuration using the PA LLD. Following this process, we will add entries to lookup tables in the classify engines. Step 1 is to call one of the PA LLD configuration or custom configuration APIs with the address that needs to be added to the lookup table.
The information about where to send the packet after the match and where to send the configuration response packet should also be provided so that it can be included in the configuration packet. In step 2, the application will receive a formatted configuration packet back from the PA LLD. The PA LLD will also return the index of the transmit queue where the packet should be sent.
Step 3 is to set the protocol-specific data section of the descriptor by calling the Set PS Data function. Once this is done, the packet can be pushed onto the transmit queue whose index was returned by the PA LLD. At this point, there is no more interaction required by the host until the configuration response packet is received from the Packet Accelerator. In step 4, the packet DMA will automatically pop the descriptor pointer from the transmit queue and send the packet to the network coprocessor. After the transfer has been completed, the transmit descriptor will be pushed onto the transmit completion queue by the packet DMA.
In step 5, the configuration packet is shown being transferred by the packet DMA from the transmit queue to the network coprocessor. Once in the network coprocessor, the configuration packet will be directed from the packet DMA controller through the streaming interface switch to the desired classify engine. The classify engine will then attempt to add the entry to its lookup table. In this example, we are showing the packet being transferred from transmit queue 640, which is the queue for PDSP0 and the L2 classify engine. However, the process is almost identical for all of the classify engines, with the only difference being which transmit queue the packet will be pushed onto, which depends on which configuration API was called.
In step 6, a configuration response packet is sent from the classify engine to indicate whether or not the entry was successfully added to the lookup table. The command response packet is routed from the classify engine through the packet streaming switch to the packet DMA controller. Once in the packet DMA controller, the packet DMA will transfer the packet from the network coprocessor using the receive flow and receive queue specified. In this example, we are showing the packet being routed to receive queue 900.
In step 7, once the packet has been transferred to host memory by the packet DMA, the receive flow will write the packet descriptor pointer onto the receive queue. In step 8, the host must once again take action. Here the host pops the descriptor off of the receive queue using the QMSS queue pop API.
In the last step, step 9, the result from adding the entry to the lookup table is forwarded to the PA LLD to be analyzed. If an error occurred while adding the entry to the lookup table, then the forward result API will return the appropriate error status. Steps 1 through 9 can be repeated to add more entries to the classify engine lookup tables.
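Here is a rough sketch of steps 2 through 9 from the host's point of view, assuming the command buffer, size, and destination index were returned by one of the configuration APIs and that the application has already opened the PA transmit queues, a free descriptor queue, and the receive queue named in the command reply. Descriptor and buffer bookkeeping is heavily simplified, and the QMSS, CPPI, and PA LLD calls (including the PS-data command word from pasahost.h) should be checked against the headers in your PDK release.

#include <stdint.h>
#include <ti/drv/pa/pa.h>
#include <ti/drv/pa/pasahost.h>    /* PS-data command words understood by the PA firmware */
#include <ti/drv/cppi/cppi_drv.h>
#include <ti/drv/cppi/cppi_desc.h>
#include <ti/drv/qmss/qmss_drv.h>

/* Sketch of steps 2-9: send a PA configuration packet and process the response.
 * paTxQ[] (the NETCP transmit queues), freeQ, and rxQ are queue handles the
 * application opened earlier; cmdBuf/cmdSize/cmdDest came back from a
 * configuration API such as Pa_addMac or Pa_addIp. */
paReturn_t send_pa_config (Pa_Handle paInst, Qmss_QueueHnd paTxQ[],
                           Qmss_QueueHnd freeQ, Qmss_QueueHnd rxQ,
                           uint8_t *cmdBuf, uint16_t cmdSize, int cmdDest)
{
    Cppi_HostDesc   *txDesc, *rxDesc;
    paEntryHandle_t  retHandle;
    int              handleType;
    int              retDest;
    uint8_t         *resp;
    uint32_t         respLen;
    uint32_t         psInfo = PASAHO_PACFG_CMD;   /* marks this as a configuration packet */

    /* Step 3: pop a free descriptor (mask off the size hint bits), attach the
     * command buffer, and set the PS data */
    txDesc = (Cppi_HostDesc *)(((uint32_t)Qmss_queuePop (freeQ)) & ~0xFu);
    Cppi_setData      (Cppi_DescType_HOST, (Cppi_Desc *)txDesc, cmdBuf, cmdSize);
    Cppi_setPacketLen (Cppi_DescType_HOST, (Cppi_Desc *)txDesc, cmdSize);
    Cppi_setPSData    (Cppi_DescType_HOST, (Cppi_Desc *)txDesc, (uint8_t *)&psInfo, 4);

    /* Steps 4-5: push onto the PA transmit queue selected by the LLD (cmdDest);
     * the packet DMA moves the packet into the network coprocessor from here */
    Qmss_queuePushDesc (paTxQ[cmdDest], (void *)txDesc);

    /* Steps 6-8: poll the receive queue named in the command reply structure for
     * the response packet (an accumulator/interrupt could be used instead) */
    do {
        rxDesc = (Cppi_HostDesc *)(((uint32_t)Qmss_queuePop (rxQ)) & ~0xFu);
    } while (rxDesc == NULL);

    /* Step 9: hand the response back to the PA LLD to check the result;
     * the response descriptor should then be recycled to its free queue */
    Cppi_getData (Cppi_DescType_HOST, (Cppi_Desc *)rxDesc, &resp, &respLen);
    return Pa_forwardResult (paInst, (void *)resp, &retHandle, &handleType, &retDest);
}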
Now that we have shown the process of how to configure the Packet Accelerator, let's take a look at how receive packets are processed by the PA. Assume for the sake of this example that the L2 classify engine lookup table, L3 classify engine 0 lookup table, and the L4 classify engine lookup table have all been configured with entries. Then if we receive a packet with MAC, IPv4, and UDP headers from the Gigabit Ethernet switch subsystem, the receive processing will go as follows.
In the first step, the packet will be routed from the Gigabit Ethernet switch subsystem over the streaming interface switch to the L2 classify engine. Now that the packet has arrived at the L2 classify engine, it will examine the MAC header and compare it against the entries in its lookup table. If the MAC header was successfully matched, then the packet will be routed according to the match route that was specified when the entry was added to the lookup table.
If the lookup fails, then the packet will be routed according to the failure route that was specified when the entry was added to the lookup table. For this example, let's assume that the addresses match and that the entry was entered with the PA dest continue parse destination, which means that the packet will be routed through the packet streaming switch to L3 classify engine 0 to try to match the IPv4 header. In step 3, once the packet reaches L3 classify engine 0, PDSP1 will examine the IPv4 header and submit it to its lookup table.
The lookup table will then compare the packet header against the entries stored in the lookup table. If the headers match, then the packet will be routed according to the match route that was specified when the entry was added to the lookup table. Again, if the lookup fails, then the packet will be routed according to the failure route that was specified when the entry was added to the lookup table.
For this example, let's assume that the addresses match and this entry was also added with the PA dest continue parse destination. This means that the packet will be routed through the packet streaming switch to the L4 classify engine to match the UDP header. In step 4, once the packet reaches the L4 classify engine, PDSP3 will examine the UDP header and submit it to its lookup table, where the lookup table will compare the packet header against the UDP entries that have been added to the lookup table.
Again, if the headers match, then the packet will be routed according to the match route that was specified when the entry was originally added to the lookup table. If the lookup fails, then the packet will be routed according to the failure route that was specified when the entry was added to the lookup table. For this example, let's assume that the addresses match and that the entry was added with the PA dest host destination, with the queue specified as queue 900.
In step 5, the packet will be routed through the packet streaming switch to the packet DMA controller. Once in the packet DMA controller, the packet DMA will transfer the packet from the network coprocessor to the host using the specified receive flow to transmit the packet to queue 900. This completes the hardware receive processing example.
It is important to note that once the configuration has completed, which we assumed was done before starting this example, the entire process requires no interaction from the host. It is also important to note that the packet could have come from another location and could have been formatted with different headers. An alternate example is that the packet could have arrived from the transmit queue for the L3 classify engine and could be checked against custom L3 entries that have been added to its lookup table.
Now let's take a look at what receiving packets from the Ethernet looks like from the perspective of the host. In step 1, after the packet has matched all the packet headers in the Packet Accelerator, the packet is transferred from the network coprocessor to a destination queue. At this point, if the host is watching the destination queue, then it can immediately take action.
Alternatively, if the queues are set up to be monitored through an interrupt service routine, then the host can be notified through an interrupt. In either case, once the host is ready to process the packet, in step 2, it can pop the descriptor off the receive queue using the QMSS queue pop function. Steps 1 and 2 can then be repeated for all received data packets.
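A minimal polling sketch of this host-side receive handling is shown below. The queue handles are assumed to have been opened by the application, and an accumulator channel with an interrupt could replace the polling loop.

#include <stdint.h>
#include <ti/drv/cppi/cppi_drv.h>
#include <ti/drv/cppi/cppi_desc.h>
#include <ti/drv/qmss/qmss_drv.h>

/* Polling sketch of steps 1 and 2: drain packets that the PA has routed to a
 * host receive queue. rxQ and freeQ are queue handles opened by the application. */
void drain_rx_queue (Qmss_QueueHnd rxQ, Qmss_QueueHnd freeQ)
{
    Cppi_HostDesc *desc;
    uint8_t       *payload;
    uint32_t       len;

    for (;;) {
        /* Step 2: pop the next descriptor (mask off the size hint bits) */
        desc = (Cppi_HostDesc *)(((uint32_t)Qmss_queuePop (rxQ)) & ~0xFu);
        if (desc == NULL)
            break;

        /* The descriptor points at the received packet data */
        Cppi_getData (Cppi_DescType_HOST, (Cppi_Desc *)desc, &payload, &len);

        /* ... process the packet here ... */

        /* Return the descriptor so the receive flow can reuse it */
        Qmss_queuePushDesc (freeQ, (void *)desc);
    }
}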
Now that we have completed the receive processing example, let's take a look at a transmit processing example. For the sake of this example, let's assume that we are transmitting the packet out over the Ethernet. We begin step 1 by calling the Format Transmit Route function with the pseudo header checksums for the desired header checksum operations.
Then in step 2, we receive a command back from the PA LLD to be placed in the protocol-specific data section of the transmit descriptor. In step 3, we write the protocol-specific data into the descriptor using the Set PS Data function. Then we link the descriptor to the data buffer and push the packet onto the transmit queue. At this point, there are no more actions required by the host.
In step 4, the packet DMA will automatically transfer the packet to the network coprocessor. After the transfer has been completed, the transmit descriptor is written to the transmit completion queue by the packet DMA. Steps 1 through 4 should be repeated for all transmit packets.
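Here is a minimal sketch of steps 1 through 3 for a UDP checksum, assuming a plain MAC/IPv4/UDP packet with no VLAN tag and leaving the pseudo-header checksum calculation to the caller. The paTxChksum_t field names and the Pa_formatTxRoute prototype come from pa.h but should be checked against your PA LLD release, and the queue numbers are device-specific examples.

#include <string.h>
#include <stdint.h>
#include <ti/drv/pa/pa.h>
#include <ti/drv/cppi/cppi_drv.h>
#include <ti/drv/cppi/cppi_desc.h>
#include <ti/drv/qmss/qmss_drv.h>

/* Sketch of steps 1-3: build a UDP checksum command with Pa_formatTxRoute,
 * place it in the protocol-specific data of the transmit descriptor, link the
 * packet buffer, and push toward modify/multi-route engine 0 (PDSP4).
 * txDesc, pkt, pktLen, and pdsp4TxQ are supplied by the caller. */
int send_udp_with_checksum (Cppi_HostDesc *txDesc, uint8_t *pkt, uint32_t pktLen,
                            uint16_t pseudoHdrChksum, Qmss_QueueHnd pdsp4TxQ)
{
    paTxChksum_t  chksum;
    paRouteInfo_t route;
    uint8_t       cmdBuf[32];
    uint16_t      cmdSize = sizeof(cmdBuf);

    memset (&chksum, 0, sizeof(chksum));
    memset (&route,  0, sizeof(route));

    /* UDP checksum: the UDP header starts after the 14-byte MAC and 20-byte IPv4 headers */
    chksum.startOffset  = 34;
    chksum.lengthBytes  = pktLen - 34;       /* UDP header plus payload                     */
    chksum.resultOffset = 6;                 /* checksum field offset within the UDP header */
    chksum.initialSum   = pseudoHdrChksum;   /* IPv4 pseudo-header checksum                 */
    chksum.negative0    = 1;                 /* a computed 0 is transmitted as 0xFFFF       */

    /* After inserting the checksum, send the packet to the Gigabit Ethernet switch */
    route.dest  = pa_DEST_EMAC;
    route.queue = 648;                       /* GbE switch transmit queue (device-specific) */

    /* Steps 1-2: the PA LLD formats the command into cmdBuf */
    if (Pa_formatTxRoute (&chksum, NULL, &route, (void *)cmdBuf, &cmdSize) != pa_OK)
        return -1;

    /* Step 3: the PS data carries the command, the linked buffer carries the packet */
    Cppi_setData      (Cppi_DescType_HOST, (Cppi_Desc *)txDesc, pkt, pktLen);
    Cppi_setPacketLen (Cppi_DescType_HOST, (Cppi_Desc *)txDesc, pktLen);
    Cppi_setPSData    (Cppi_DescType_HOST, (Cppi_Desc *)txDesc, cmdBuf, cmdSize);

    /* Push onto the transmit queue for PDSP4 (modify/multi-route engine 0) */
    Qmss_queuePushDesc (pdsp4TxQ, (void *)txDesc);
    return 0;
}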
Now that we have shown the transmit process from the perspective of the host, let's take a look at what happens in the PA hardware. In the first step, the packet is placed in the transmit queue for PDSP4, which lies in the modify/multi-route engine 0. The packet DMA will transfer the packet from the queue to the packet DMA controller on the network coprocessor.
The packet will then be routed to the modify/multi-route engine 0 via the packet streaming switch. For step 2, once the packet arrives in the modify/multi-route engine 0, PDSP4 will perform the checksum operation on the packet. PDSP4 can do up to two checksums simultaneously. So for example, it could do an IP and the UDP checksum simultaneously.
Once PDSP4 finishes the checksums, the packet will be routed through the packet streaming switch and the packet DMA controller where the packet DMA controller will transfer the packet to the transmit queue for the Gigabit Ethernet switch subsystem. In step 3, the packet DMA transfers the packet from the transmit queue for the Gigabit Ethernet switch to the packet DMA controller on the network coprocessor. From there, the packet will be transferred over the packet streaming switch to the Gigabit Ethernet switch subsystem for transmission over the Ethernet.
It is important to note that the data transfer from modify/multi-route engine 0 to the transmit queue for the Gigabit Ethernet switch, as well as the transfer from the transmit queue for the Gigabit Ethernet switch to the Gigabit Ethernet switch subsystem, happens without any interaction from the host. Thank you for completing the Network Coprocessor Packet Accelerator KeyStone Training. For more information about the Packet Accelerator, please see the resources referenced on this slide.