Cortex-A53 BUS_ACCESS Event Underreporting and Read Allocate Mode Writes
The Cortex-A53 Performance Monitoring Unit (PMU) provides a suite of events that allow developers to profile and analyze system performance. One such event, BUS_ACCESS (event ID 0x19), is designed to count all bus accesses, including cacheable traffic, non-cacheable traffic, and write streaming. However, as noted in the ARM documentation, the BUS_ACCESS event is not always counted accurately due to an erratum. This issue becomes particularly problematic when attempting to measure bus accesses across multiple cores simultaneously, as the underreporting can lead to significant discrepancies in performance analysis.
When developers try to approximate the BUS_ACCESS count by summing related events, namely non-cacheable external memory accesses (event ID 0xC1), L2-cache refills (event ID 0x17), and L2-cache write-backs (event ID 0x18), the total often falls short of the expected bus access count. The discrepancy comes primarily from write streaming, specifically writes that occur in Read Allocate Mode. These writes are neither non-cacheable accesses nor write-backs, so none of the three events captures them. The missing count can be as much as 8% in single-core setups, and the error margin may increase in multi-core configurations.
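The approximation and its shortfall can be written out as a small calculation. The sketch below is illustrative only: the constant and function names are hypothetical labels, the sample counts are invented to show the arithmetic, and the reference count is assumed to come from a workload whose bus traffic is known in advance.

```python
# Event numbers named in the text; the constant names are illustrative labels.
NONCACHEABLE_EXT_MEM = 0xC1  # non-cacheable external memory accesses
L2_CACHE_REFILL = 0x17       # L2-cache refills
L2_CACHE_WRITEBACK = 0x18    # L2-cache write-backs


def approximate_bus_access(counts):
    """Sum the three events used to approximate BUS_ACCESS."""
    return (counts[NONCACHEABLE_EXT_MEM]
            + counts[L2_CACHE_REFILL]
            + counts[L2_CACHE_WRITEBACK])


def shortfall_percent(approximation, reference):
    """Percentage of the reference count that the approximation misses."""
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return 100.0 * (reference - approximation) / reference


# Hypothetical counts, chosen only to illustrate the arithmetic.
sample = {
    NONCACHEABLE_EXT_MEM: 1_000,
    L2_CACHE_REFILL: 60_000,
    L2_CACHE_WRITEBACK: 31_000,
}
approx = approximate_bus_access(sample)
print(approx, shortfall_percent(approx, 100_000))
```

With these invented numbers the three events sum to 92,000 against a reference of 100,000, a shortfall of 8%.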
The ARM documentation references event ID 0xC5, labeled "Read allocate mode," as a potential counter for these missing write streaming events. However, the available information on this event is sparse, and its behavior, as depicted in the documentation, appears inconsistent with the expected functionality. This lack of clarity complicates efforts to accurately measure bus accesses and necessitates a deeper exploration of the Cortex-A53’s PMU events and their interactions with the memory subsystem.
Erratum in BUS_ACCESS Event and Read Allocate Mode Behavior
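The core issue stems from an erratum in the Cortex-A53’s BUS_ACCESS event, which prevents it from accurately counting all bus accesses. This erratum is particularly evident when dealing with write streaming in Read Allocate Mode. According to the Technical Reference Manual, the Cortex-A53 normally allocates a cache line on both read misses and write misses, but when it detects a run of stores that overwrite entire cache lines (as a large memset() does), it switches to Read Allocate Mode. In this mode, loads can still cause linefills, whereas stores that miss in the L1 data cache are written out to the L2 rather than starting a linefill, and under sustained streaming they may pass on to external memory without allocating in the L2 either. These writes are not classified as non-cacheable accesses, refills, or write-backs, and thus are not counted by the standard events used to approximate BUS_ACCESS.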
The ARM documentation suggests that BUS_ACCESS should include cacheable traffic, non-cacheable traffic, and write streaming. However, the erratum causes the write streaming component to be underreported. This underreporting is exacerbated by the lack of detailed information on event ID 0xC5, which is supposed to capture Read Allocate Mode writes. The documentation for this event is limited to a name and a graph that does not clearly align with the expected behavior, leaving developers to speculate about its applicability and accuracy.
Additionally, the Cortex-A53’s PMU includes several "implementation-defined" events, which are not fully documented and whose behavior can vary between different implementations of the processor. This lack of standardization further complicates efforts to accurately measure bus accesses and other performance metrics. Developers must rely on trial and error, combined with limited documentation, to determine which events can be used to approximate the missing counts.
Accurate Bus Access Counting Through Event Combination and Cache Management
To address the underreporting of bus accesses in the Cortex-A53, developers must employ a combination of PMU events and cache management techniques. The first step is to identify and enable the events that collectively cover the full range of bus accesses. As previously mentioned, non-cacheable external memory accesses (event ID 0xC1), L2-cache refills (event ID 0x17), and L2-cache write-backs (event ID 0x18) provide a partial count. However, these events must be supplemented with additional counters to capture the missing write streaming in Read Allocate Mode.
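One way to enable these events together, assuming a Linux system with the perf tool, is perf's raw event syntax (rNN, where NN is the hexadecimal event number). The helper name and the ./workload path below are hypothetical, and the sketch only builds the command line without running it. On systems with more than one PMU type the events may need to be qualified with a PMU name, which is not shown here.

```python
def perf_stat_command(event_ids, workload):
    """Build a `perf stat` argv that counts raw PMU events for a workload.

    event_ids: iterable of integer PMU event numbers.
    workload:  list of strings, the command to profile (hypothetical here).
    """
    # perf accepts raw hardware events as "r" followed by the hex event number.
    events = ",".join(f"r{event_id:x}" for event_id in event_ids)
    # "-x ," asks perf stat for comma-separated output, which is easier to parse.
    return ["perf", "stat", "-x", ",", "-e", events, "--"] + list(workload)


# Partial-sum events, the Read Allocate Mode candidate, and BUS_ACCESS itself.
argv = perf_stat_command([0x17, 0x18, 0xC1, 0xC5, 0x19], ["./workload"])
print(" ".join(argv))
```

The Cortex-A53 PMU provides six event counters, so these five events should fit in one run without multiplexing, though that is worth confirming on the target kernel.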
Event ID 0xC5, labeled "Read allocate mode," is a potential candidate for capturing these missing writes. Despite the limited documentation, developers can experiment with this event to determine its effectiveness. Enabling event ID 0xC5 and comparing its counts with the expected bus access patterns can help validate its usefulness. If event ID 0xC5 proves unreliable, developers may need to explore other implementation-defined events or use indirect methods to estimate the missing counts.
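A simple way to run that comparison is to measure several workloads with different amounts of streaming writes and check whether event 0xC5 rises and falls with the unexplained gap. The sketch below assumes each run supplies a reference bus access count derived from the workload itself; the function names, the 0.95 threshold, and the sample numbers are all hypothetical. A high correlation would only show that the event tracks the gap. The event may count something other than individual writes, so its value is not necessarily addable to the other counts.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def candidate_tracks_gap(runs, threshold=0.95):
    """True if the candidate event correlates strongly with the missing count.

    Each run is a dict with "reference", "partial_sum", and "candidate" keys.
    """
    gaps = [run["reference"] - run["partial_sum"] for run in runs]
    candidate = [run["candidate"] for run in runs]
    return pearson(gaps, candidate) >= threshold


# Hypothetical measurements from three workloads of increasing write streaming.
runs = [
    {"reference": 101_000, "partial_sum": 100_000, "candidate": 510},
    {"reference": 102_000, "partial_sum": 100_000, "candidate": 990},
    {"reference": 104_000, "partial_sum": 100_000, "candidate": 2_020},
]
print(candidate_tracks_gap(runs))
```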
In addition to enabling the appropriate PMU events, developers must manage the cache carefully so that evictions and invalidations unrelated to the measured code do not distort the counts. Data Synchronization Barriers (DSBs) and cache maintenance operations help ensure that all memory transactions are accounted for. For example, a DSB instruction, typically followed by an ISB, can be inserted before reading the PMU counters so that all pending memory operations have completed, preventing underreporting due to outstanding writes.
Finally, developers should consider the impact of multi-core configurations on bus access counting. Each core has its own PMU, and traffic from several cores shares the path to external memory, so counts taken on a single core, or at different moments on different cores, can diverge further from the true total. To mitigate this, developers can configure the same events on every core and read the counters at synchronized points across cores. The per-core counts can then be aggregated into a more reliable measure of total bus accesses.
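The aggregation step might look like the following sketch. It assumes raw counter values have been read directly from each core's PMU at a common start point and a common end point, and that the event counters are 32 bits wide, so a single wraparound between the two reads has to be tolerated. The function names and sample values are hypothetical. Tools such as perf accumulate counts in software, in which case the wraparound handling is unnecessary.

```python
COUNTER_BITS = 32  # assumed width of the PMU event counters


def counter_delta(start, end, bits=COUNTER_BITS):
    """Difference between two raw counter reads, tolerating one wraparound."""
    return (end - start) & ((1 << bits) - 1)


def aggregate_across_cores(start_snapshots, end_snapshots):
    """Sum per-event deltas over all cores.

    Each snapshot maps core number -> {event_id: raw counter value}.
    """
    if start_snapshots.keys() != end_snapshots.keys():
        raise ValueError("start and end snapshots cover different cores")
    totals = {}
    for core in sorted(start_snapshots):
        start, end = start_snapshots[core], end_snapshots[core]
        if start.keys() != end.keys():
            raise ValueError(f"core {core}: event sets differ between snapshots")
        for event_id in start:
            delta = counter_delta(start[event_id], end[event_id])
            totals[event_id] = totals.get(event_id, 0) + delta
    return totals


# Hypothetical snapshots from two cores; core 0's 0x18 counter wraps around.
start = {0: {0x17: 100, 0x18: 0xFFFFFFF0}, 1: {0x17: 50, 0x18: 10}}
end = {0: {0x17: 600, 0x18: 0x10}, 1: {0x17: 250, 0x18: 40}}
totals = aggregate_across_cores(start, end)
print(totals)
```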
By combining these techniques, developers can overcome the limitations of the Cortex-A53’s BUS_ACCESS event and achieve a more accurate count of bus accesses. While the process requires careful experimentation and validation, the resulting insights can significantly enhance performance analysis and optimization efforts.
In summary, this post has examined the Cortex-A53 PMU’s BUS_ACCESS underreporting issue, focusing on Read Allocate Mode writes and the erratum affecting the event. Combining PMU events, managing cache operations, and accounting for multi-core configurations are the key steps toward reliable bus access measurements.