Single bit ECC (error correction) mechanism can cause an incorrect memory update
Revision(s) Affected:
AWR2544
Description:
Note: The issue was uncovered while debugging an incorrect memory access sequence in simulations. To date, no such issue has been reported from in-field or deployment scenarios by any customer.
In the uncommon event of a single bit upset in one of the memory ranges tabulated below, a specific combination of memory accesses can cause the single bit error correction mechanism to perform an incorrect memory update.
The RAM memories on AWR294x are ECC protected with a Single bit Error Correction, Double bit Error Detection (SECDED) mechanism.
For the issue to impact the application, all of the following conditions must be met:
1. A random hardware fault, due to environmental conditions or other factors, causes a single bit upset event, AND
2. The single bit upset event affects one of the impacted memory ranges, AND
3. A read or partial-write access to the memory location with the single bit error occurs (triggering the single bit error correction mechanism), AND
4. A specific memory access sequence combination occurs after the single bit error correction happens, AND
5. The incorrect memory update by the error correction mechanism is critical enough to impact the application program flow and is undetected by other safety mechanisms.
The following access combinations (Conditions 3 and 4 above) to the impacted memory range after the single bit error correction happens can cause the issue:
- Read or partial write access (from/to the location A with SEC) → full write (to one or more memory locations in the same memory range) → partial write (to any other location in the same memory range): leads to an incorrect update of the last full-write location.
- Partial write access (to the location A with SEC) → partial write (to any other location in the same memory range): leads to an incorrect update of location A.
Note: The issue does not occur for any other memory access sequence.
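The two access sequences above can be expressed as a small trace classifier, which may help when reviewing access patterns offline. This is only a sketch: the types and function name are hypothetical, the accesses are assumed to be back-to-back within a single memory range, and "any other location" is interpreted as any address different from location A.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical access model; names are illustrative, not from any SDK. */
typedef enum { ACC_READ, ACC_PARTIAL_WRITE, ACC_FULL_WRITE } access_kind_t;

typedef struct {
    access_kind_t kind;
    uint32_t      addr;  /* address within one ECC-protected memory range */
} access_t;

typedef enum {
    HAZARD_NONE,
    HAZARD_LAST_FULL_WRITE,  /* first sequence: last full-write location is corrupted */
    HAZARD_LOCATION_A        /* second sequence: location A is corrupted */
} hazard_t;

/*
 * seq[0] is assumed to be the access to location A that triggered the
 * single bit correction. Returns which listed sequence the trace matches.
 */
hazard_t classify_sequence(const access_t *seq, size_t n)
{
    if (seq == NULL || n < 2 || seq[0].kind == ACC_FULL_WRITE) {
        return HAZARD_NONE;
    }
    uint32_t loc_a = seq[0].addr;

    /* Partial write to A directly followed by a partial write elsewhere. */
    if (seq[0].kind == ACC_PARTIAL_WRITE &&
        seq[1].kind == ACC_PARTIAL_WRITE && seq[1].addr != loc_a) {
        return HAZARD_LOCATION_A;
    }

    /* Read/partial write of A, one or more full writes, then a partial write elsewhere. */
    size_t i = 1;
    while (i < n && seq[i].kind == ACC_FULL_WRITE) {
        i++;
    }
    if (i > 1 && i < n &&
        seq[i].kind == ACC_PARTIAL_WRITE && seq[i].addr != loc_a) {
        return HAZARD_LAST_FULL_WRITE;
    }
    return HAZARD_NONE;
}
```

A trace such as read, partial write (with no full write in between) is classified as `HAZARD_NONE`, in line with the note above.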
Workaround(s):
Single bit upset events are uncommon and have a low probability of occurrence.
- The scenario must lead to single bit errors alone. Double bit errors are detected but not corrected; on a double bit error, depending on the criticality, the device is taken to a safe state.
Partial write memory accesses (needed to cause the issue) are limited because:
- Cached memory ranges do not lead to partial write accesses, as cache line writes are always full writes.
- Code sections are read only (hence the entire code section accesses do not satisfy the conditions to cause the issue).
- Impacted memories with partial write accesses can have other safety mechanisms that can detect or avoid such random errors.
- Higher level processing algorithms of Radar data cube have built in outlier rejection capabilities due to tracking functions (temporal and logical monitoring).
- Information redundancy techniques may be used on impacted memories like Mailbox to detect errors.
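One common form of information redundancy is to store each value together with its bitwise complement and check the pair on every read. The sketch below illustrates the idea only; the record layout and function names are hypothetical and are not taken from any mailbox driver. Note that the two 32-bit stores may themselves be partial writes on a wider ECC word, so this detects a corrupted record but does not remove the access pattern.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record layout: payload plus its bitwise complement. */
typedef struct {
    uint32_t value;
    uint32_t value_inv;
} guarded_u32_t;

/* Store the value and its complement side by side. */
static inline void guarded_store(volatile guarded_u32_t *slot, uint32_t v)
{
    slot->value     = v;
    slot->value_inv = ~v;
}

/* Returns true and writes *out only if the two copies still agree. */
static inline bool guarded_load(const volatile guarded_u32_t *slot, uint32_t *out)
{
    uint32_t v  = slot->value;
    uint32_t vi = slot->value_inv;

    if ((v ^ vi) != 0xFFFFFFFFu) {
        return false;  /* mismatch: treat the record as corrupted */
    }
    *out = v;
    return true;
}
```

A failed `guarded_load` can then be routed to whatever fault reaction the application already uses for data integrity errors.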
1. In the impacted memory ranges, identify whether partial memory write accesses are possible.
2. Decide whether the identified memory ranges with partial writes are critical enough to need action. The possible courses of action are:
   a. No Action:
      i. If single bit upset events are unlikely in the operating environment.
      ii. If there are other safety mechanisms that can detect or avoid such spurious random errors.
   b. Action: One or more of the following options can be considered:
      i. Avoid the partial write access pattern to those memory ranges.
      ii. Re-initialise the impacted memory bank on a single bit memory correction event.
      iii. Treat the single bit memory correction event as an uncorrectable error and enter a safe state.
         - This does not impact the functional safety detectability claims, but it may impact availability when such a single bit upset occurs.
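The courses of action listed above can be sketched as a per-bank policy dispatch invoked when a single bit correction is reported. Everything here is hypothetical: the policy names, the hook structure, and the placeholder hooks (which only count requests so the sketch runs on any host) stand in for whatever re-initialisation and safe-state routines the application actually provides.

```c
#include <stddef.h>

/* Illustrative policy names mirroring the options listed above. */
typedef enum {
    SEC_POLICY_NO_ACTION,
    SEC_POLICY_REINIT_BANK,
    SEC_POLICY_SAFE_STATE
} sec_policy_t;

/* Hypothetical platform hooks; the real routines depend on the application. */
typedef struct {
    void (*reinit_bank)(unsigned bank_id);
    void (*enter_safe_state)(void);
} sec_hooks_t;

/* Placeholder hooks that only record the request. */
static unsigned g_reinit_requests;
static unsigned g_safe_state_requests;

static void demo_reinit_bank(unsigned bank_id)
{
    (void)bank_id;
    g_reinit_requests++;
}

static void demo_enter_safe_state(void)
{
    g_safe_state_requests++;
}

static const sec_hooks_t k_demo_hooks = { demo_reinit_bank, demo_enter_safe_state };

/* Intended to be called from the handler that services the correction event. */
void handle_sec_event(sec_policy_t policy, unsigned bank_id, const sec_hooks_t *hooks)
{
    if (hooks == NULL) {
        return;
    }
    switch (policy) {
    case SEC_POLICY_REINIT_BANK:
        if (hooks->reinit_bank != NULL) {
            hooks->reinit_bank(bank_id);
        }
        break;
    case SEC_POLICY_SAFE_STATE:
        if (hooks->enter_safe_state != NULL) {
            hooks->enter_safe_state();
        }
        break;
    case SEC_POLICY_NO_ACTION:
    default:
        break;
    }
}
```

Keeping the policy as data makes it straightforward to choose a different reaction per memory range, since the criticality decision is made range by range.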
Refer to the table below for each memory range and its corresponding ESM line and ECC aggregator status bit if action (2-b-ii) needs to be taken.
This table lists only the impacted memories and their corresponding details.
| Memory Name | Start Address | End Address | ESM Line | ECC Aggregator Status Bit |
|---|---|---|---|---|
| DSS L3 Bank0 | 0x88000000 | 0x880BFFFF | DSS_ESM:: GROUP1 Line No- 92 | DSS_ECC_AGG::SEC_STATUS_REG0:: DSS_L3RAM0_PEND |
| DSS L3 Bank1 | 0x880C0000 | 0x8817FFFF | DSS_ESM:: GROUP1 Line No- 92 | DSS_ECC_AGG::SEC_STATUS_REG0:: DSS_L3RAM1_PEND |
| DSS L3 Bank2 | 0x88180000 | 0x881FFFFF | DSS_ESM:: GROUP1 Line No- 92 | DSS_ECC_AGG::SEC_STATUS_REG0:: DSS_L3RAM2_PEND |
| DSS L3 Bank3 | 0x88200000 | 0x8827FFFF | DSS_ESM:: GROUP1 Line No- 92 | DSS_ECC_AGG::SEC_STATUS_REG0:: DSS_L3RAM3_PEND |
| MSS L2 Bank0 | 0xC0200000 | 0xC027FFFF | MSS_ESM:: GROUP1 Line No-18 | MSS_ECC_AGG_MSS::SEC_STATUS_REG0:: MSS_L2SLV0_PEND |
| MSS L2 Bank1 | 0xC0280000 | 0xC02EFFFF | MSS_ESM:: GROUP1 Line No-18 | MSS_ECC_AGG_MSS::SEC_STATUS_REG0:: MSS_L2SLV1_PEND |
| MSS Mailbox | 0xC5000000 | 0xC5001FFF | MSS_ESM:: GROUP1 Line No-18 | MSS_ECC_AGG_MSS::SEC_STATUS_REG0:: MSS_MBOX_PEND |
| MSS_RETRAM | 0xC5010000 | 0xC50107FF | MSS_ESM:: GROUP1 Line No-18 | MSS_ECC_AGG_MSS::SEC_STATUS_REG0:: MSS_RETRAM_PEND |
| DSS Mailbox | 0x83100000 | 0x83100FFF | DSS_ESM:: GROUP1 Line No- 92 | DSS_ECC_AGG::SEC_STATUS_REG0:: DSS_MAILBOX_PEND |
Note: The MSS_L2 addresses captured above are from the DSS and EDMA addressing view. The MSS_L2_BANK0 and MSS_L2_BANK1 address ranges as viewed from the MSS-R5 are 0x10200000-0x1027FFFF and 0x10280000-0x102EFFFF respectively.
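The table and the addressing note above can be captured as a lookup helper, for example to check whether a buffer address falls in an impacted range. This is a sketch under stated assumptions: the type and function names are hypothetical, the Bank1 start is taken as 0x880C0000 (contiguous with Bank0), and the MSS-R5 to DSS/EDMA translation is inferred as a fixed offset from the two address views given in the note.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t    start;         /* DSS/EDMA view */
    uint32_t    end;           /* inclusive */
    const char *sec_pend_bit;  /* ECC aggregator status bit from the table */
} impacted_mem_t;

static const impacted_mem_t k_impacted[] = {
    { "DSS L3 Bank0", 0x88000000u, 0x880BFFFFu, "DSS_L3RAM0_PEND"  },
    /* Assumption: Bank1 starts at 0x880C0000, contiguous with Bank0. */
    { "DSS L3 Bank1", 0x880C0000u, 0x8817FFFFu, "DSS_L3RAM1_PEND"  },
    { "DSS L3 Bank2", 0x88180000u, 0x881FFFFFu, "DSS_L3RAM2_PEND"  },
    { "DSS L3 Bank3", 0x88200000u, 0x8827FFFFu, "DSS_L3RAM3_PEND"  },
    { "MSS L2 Bank0", 0xC0200000u, 0xC027FFFFu, "MSS_L2SLV0_PEND"  },
    { "MSS L2 Bank1", 0xC0280000u, 0xC02EFFFFu, "MSS_L2SLV1_PEND"  },
    { "MSS Mailbox",  0xC5000000u, 0xC5001FFFu, "MSS_MBOX_PEND"    },
    { "MSS_RETRAM",   0xC5010000u, 0xC50107FFu, "MSS_RETRAM_PEND"  },
    { "DSS Mailbox",  0x83100000u, 0x83100FFFu, "DSS_MAILBOX_PEND" },
};

/* Map an MSS-R5 view MSS_L2 address to the DSS/EDMA view; other addresses pass through. */
static inline uint32_t mss_l2_r5_to_dss_view(uint32_t addr)
{
    if (addr >= 0x10200000u && addr <= 0x102EFFFFu) {
        return addr + 0xB0000000u;  /* 0x10200000 -> 0xC0200000 */
    }
    return addr;
}

/* Returns the table entry containing addr (DSS/EDMA view), or NULL if none. */
static inline const impacted_mem_t *find_impacted_mem(uint32_t addr)
{
    for (size_t i = 0; i < sizeof k_impacted / sizeof k_impacted[0]; i++) {
        if (addr >= k_impacted[i].start && addr <= k_impacted[i].end) {
            return &k_impacted[i];
        }
    }
    return NULL;
}
```

Code running on the MSS-R5 would translate first, for example `find_impacted_mem(mss_l2_r5_to_dss_view(addr))`.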
Other memories that are not used by the application but are used by the BSS, such as BSS_Mailbox and BSS_Static_RAM, are also affected by this erratum:
- The BSS mailbox is primarily used for communication between the BSS and MSS/DSS using mmWaveLink, following a message protocol that incorporates CRC for data integrity. Using CRC during message exchanges over the BSS mailbox reduces the risk associated with this memory.
- When a fault occurs (in this case, an ECC SEC), the BSS sends an ESM Fault Asynchronous event message to the MSS/DSS as a notification. The application must read bit b20:ECC_AGG_SEC_ERROR from the AWR_AE_RF_ADV_ESMFAULT_STATUS_SB async-event sent by the BSS, treat this single-bit memory correction event as an uncorrectable error, and enter a safe state.
- This workaround is only valid if the application uses BSS Patch from DFP version 2.4.14 or earlier
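The bit check described for the BSS async-event can be sketched as follows. The macro and function names are hypothetical, and the sketch assumes the ESM fault status is available as a 32-bit word with b0 as the least significant bit; how that word is extracted from the AWR_AE_RF_ADV_ESMFAULT_STATUS_SB payload depends on the mmWaveLink version in use and is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* b20:ECC_AGG_SEC_ERROR, per the description above. */
#define ESMFAULT_ECC_AGG_SEC_ERROR_BIT 20u

/* Returns true when the BSS reports a single bit ECC correction. */
static inline bool bss_reports_ecc_sec(uint32_t esm_fault_status)
{
    return ((esm_fault_status >> ESMFAULT_ECC_AGG_SEC_ERROR_BIT) & 1u) != 0u;
}
```

When this returns true, the application would invoke its existing safe-state entry path, as the workaround requires.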