ARM Cortex-A55 PMU Limitations in Counting Non-Cacheable Accesses
The ARM Cortex-A55 processor, which implements the ARMv8.2-A architecture, is widely used in embedded systems for its balance of performance and power efficiency. Its Performance Monitoring Unit (PMU) provides hardware counters for events such as cache accesses, branch predictions, and bus transactions. A notable limitation of the Cortex-A55 PMU, however, is that it offers no event that counts non-cacheable memory accesses on their own. This is a problem for developers profiling systems that rely heavily on non-cacheable regions, for example memory-mapped I/O or memory shared between heterogeneous processors.
The Cortex-A55 PMU includes events such as L1I_CACHE_ACCESS and L1D_CACHE_ACCESS (the architectural L1I_CACHE, event 0x14, and L1D_CACHE, event 0x04), which count accesses to the L1 instruction and data caches, respectively. These events do not cover non-cacheable accesses, which bypass the cache hierarchy. The BUS_ACCESS event (0x19) does include them, but it counts them together with cacheable traffic such as line fills and write-backs, so the two cannot be separated from that count alone. As a result, memory access profiles of workloads that mix both kinds of access are incomplete, and bottlenecks caused by non-cacheable traffic can go unnoticed.
Working around this limitation requires an understanding of why it exists and which other facilities can observe non-cacheable accesses. The following sections look at how the ARMv8-A architecture assigns memory attributes, how the Cortex-A55 PMU relates to them, and which alternative methods give visibility into non-cacheable traffic.
Memory Access Attributes and PMU Event Filtering
The inability of the Cortex-A55 PMU to count non-cacheable accesses stems from the way memory access attributes are handled in the ARMv8-A architecture. Memory accesses in ARMv8-A are classified by attributes such as memory type, cacheability, and shareability. These attributes are set through the Memory Management Unit (MMU): each stage 1 translation table descriptor carries an AttrIndx field that selects one of eight attribute encodings in the MAIR_ELx register, and the descriptor's SH field sets shareability. (The Cortex-A55 is an A-profile core and has an MMU rather than a Memory Protection Unit.) Cacheable accesses are routed through the cache hierarchy, while non-cacheable accesses bypass the caches and go directly to memory or a peripheral.
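To make the attribute encoding concrete, the following C sketch decodes the MAIR_ELx attribute byte selected by a stage 1 descriptor and sorts it into three coarse classes. The type and function names are hypothetical, chosen only for illustration. The classification is a simplification: any Normal encoding other than 0x44 is treated as cacheable, which lumps together mixed inner/outer policies and does not reject encodings the architecture leaves unpredictable.

```c
#include <stdint.h>

/* Coarse memory classes derived from one MAIR_ELx attribute byte. */
enum mem_class {
    MEM_DEVICE,           /* Device-nGnRnE through Device-GRE           */
    MEM_NORMAL_NC,        /* Normal, Inner and Outer Non-cacheable      */
    MEM_NORMAL_CACHEABLE  /* Normal, cacheable at inner or outer level  */
};

/* AttrIndx occupies bits [4:2] of a stage 1 block or page descriptor. */
static inline unsigned desc_attr_index(uint64_t desc)
{
    return (unsigned)((desc >> 2) & 0x7u);
}

/* MAIR_ELx packs eight 8-bit encodings; Attr0 sits in bits [7:0]. */
static inline uint8_t mair_attr(uint64_t mair, unsigned idx)
{
    return (uint8_t)(mair >> (8u * (idx & 0x7u)));
}

/* Classify one attribute byte (simplified, see the caveats above). */
static inline enum mem_class classify_attr(uint8_t attr)
{
    unsigned outer = attr >> 4;    /* Attr[7:4] */
    unsigned inner = attr & 0xFu;  /* Attr[3:0] */

    if (outer == 0x0u)
        return MEM_DEVICE;         /* 0b0000dd00 encodings are Device memory */
    if (outer == 0x4u && inner == 0x4u)
        return MEM_NORMAL_NC;      /* 0x44: Normal Non-cacheable */
    return MEM_NORMAL_CACHEABLE;
}
```

For example, with an illustrative MAIR_EL1 value of 0x0000000000FF4400 (Attr0 = Device-nGnRnE, Attr1 = Normal Non-cacheable, Attr2 = Normal Write-Back), a descriptor whose AttrIndx field is 1 is classified as MEM_NORMAL_NC. Both MEM_DEVICE and MEM_NORMAL_NC accesses bypass the caches and are therefore invisible to the cache-related PMU events.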
The Cortex-A55 PMU is designed mainly around events related to cacheable accesses, as these are typically the most performance-critical. Events like L1I_CACHE_ACCESS and L1D_CACHE_ACCESS are tied to the cache hierarchy and do not account for non-cacheable accesses. The BUS_ACCESS event has the opposite problem: it counts all bus transactions, regardless of whether they are cacheable or non-cacheable. Neither kind of event isolates non-cacheable accesses, so they cannot be measured directly using the PMU.
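The aggregate nature of BUS_ACCESS is easy to see when the event is read from software. The sketch below assumes a Linux system where the perf events interface is enabled and the caller has permission to use it; the helper names are hypothetical. The event numbers come from the ARMv8-A common event list, where L1D_CACHE is the architectural name for what this article calls L1D_CACHE_ACCESS.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* ARMv8-A common PMU event numbers. */
#define ARMV8_PMU_EVT_L1D_CACHE   0x04u
#define ARMV8_PMU_EVT_BUS_ACCESS  0x19u

/* Fill a perf_event_attr that requests one raw PMU event, user space only. */
static void raw_event_attr(struct perf_event_attr *attr, uint64_t event)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_RAW;
    attr->size = sizeof(*attr);
    attr->config = event;
    attr->disabled = 1;        /* start stopped; enabled explicitly below */
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
}

/* Open a counter for the calling thread on any CPU. Returns an fd or -1. */
static int open_raw_counter(uint64_t event)
{
    struct perf_event_attr attr;

    raw_event_attr(&attr, event);
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Count `event` while fn(arg) runs. Returns 0 on success, -1 on failure. */
int count_event_during(uint64_t event, void (*fn)(void *), void *arg,
                       uint64_t *count)
{
    int fd = open_raw_counter(event);
    ssize_t n;

    if (fd < 0)
        return -1;             /* no PMU access, or event not supported */

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    n = read(fd, count, sizeof(*count));
    close(fd);
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

Calling count_event_during(ARMV8_PMU_EVT_BUS_ACCESS, ...) around a region of interest yields a single number. Nothing in the interface or in the event itself indicates how many of those transactions were non-cacheable, which is the limitation described above.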
One possible cause of this limitation is the complexity of filtering PMU events based on memory access attributes. The PMU hardware would need to inspect the memory attributes of each access and selectively count those that match specific criteria. Implementing such filtering would require additional hardware logic and could increase the complexity and power consumption of the PMU. Given the Cortex-A55’s focus on power efficiency, it is likely that this feature was omitted to simplify the design.
Another factor is the prioritization of use cases during the design of the Cortex-A55 PMU. The PMU is optimized for profiling cacheable memory accesses, which are more common in general-purpose computing and mobile applications. Non-cacheable accesses are typically used in specialized scenarios, such as direct memory access (DMA) or communication with peripherals. As a result, the PMU may not have been designed with these use cases in mind, leading to the observed limitations.
Profiling Non-Cacheable Accesses Using Alternative Methods
While the Cortex-A55 PMU does not directly support counting non-cacheable accesses, there are alternative methods to profile these accesses. These methods rely on facilities outside the core PMU: the Embedded Trace Macrocell (ETM), system-level performance counters, and software-based instrumentation.
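The Embedded Trace Macrocell (ETM) captures a chronological trace of program execution, in contrast to the aggregate counts provided by the PMU. On A-profile cores such as the Cortex-A55, however, the ETM traces instruction flow only; it does not emit data addresses or memory attributes, so a trace cannot simply be filtered for non-cacheable accesses. What it can do is show how often code known to touch non-cacheable regions, such as a driver's register accessors, actually executes: by restricting the trace to those address ranges and decoding it, developers can reconstruct the number of loads and stores issued to those regions. Using the ETM requires a trace capture path, a debug probe or on-chip trace buffer, and decoding tools, and the volume of trace data can be overwhelming for large workloads.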
System-level performance counters, available in some System-on-Chip (SoC) implementations, can also be used to profile non-cacheable accesses. These counters are typically implemented in the interconnect or memory controller and can provide visibility into bus transactions. By configuring these counters to track non-cacheable accesses, developers can obtain aggregate counts similar to those provided by the PMU. However, the availability and configurability of these counters depend on the specific SoC implementation, and they may not be as flexible as the PMU.
Software-based instrumentation is another approach to profiling non-cacheable accesses. It involves modifying the software so that every access to a non-cacheable region goes through code that records it. For example, a driver can route all reads and writes to a memory-mapped I/O region through accessor functions that increment a counter kept in ordinary cacheable memory. This approach provides fine-grained control over what is profiled, but it adds overhead to every instrumented access, misses accesses made by code that cannot be modified, and may not be practical for all applications.
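A minimal sketch of such instrumentation in C is shown here. The accessor and counter names are hypothetical, and the sketch assumes a single thread of execution: the counters are plain variables, so a multi-core system would need per-CPU or atomic counters instead.

```c
#include <stdint.h>

/* Running totals for instrumented non-cacheable accesses. These live in
 * normal cacheable memory, so updating them adds no non-cacheable traffic. */
struct nc_access_stats {
    uint64_t reads;
    uint64_t writes;
    uint64_t bytes;
};

static struct nc_access_stats nc_counters;

/* Read a 32-bit value from a non-cacheable location and record the access. */
static inline uint32_t nc_read32(const volatile uint32_t *addr)
{
    nc_counters.reads++;
    nc_counters.bytes += sizeof(uint32_t);
    return *addr;
}

/* Write a 32-bit value to a non-cacheable location and record the access. */
static inline void nc_write32(volatile uint32_t *addr, uint32_t value)
{
    nc_counters.writes++;
    nc_counters.bytes += sizeof(uint32_t);
    *addr = value;
}
```

Code that previously dereferenced a pointer into the non-cacheable region calls nc_read32() or nc_write32() instead, and nc_counters can be sampled before and after a workload. The counts reflect accesses issued by the instrumented code only; the resulting number of bus transactions may differ, for instance when the memory type permits the hardware to merge accesses.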
In addition to these methods, developers can use simulation and emulation tools to profile non-cacheable accesses. Tools like ARM Fast Models and Cycle Models simulate the Cortex-A55 and its memory system. By running workloads on these models, developers can analyze non-cacheable accesses without modifying the hardware or software. However, simulation can be time-consuming and may not fully capture the behavior of the real hardware. Fast Models in particular are functionally accurate rather than cycle-accurate, so they are better suited to counting accesses than to measuring what those accesses cost.
In conclusion, while the Cortex-A55 PMU does not directly support counting non-cacheable accesses, the ETM, system-level performance counters, software instrumentation, and simulation can each provide visibility into them. Each method has its own cost in tooling, overhead, or accuracy, so the right choice depends on the platform and the workload. Understanding the limitations of the PMU and the available alternatives is key to effective performance profiling and optimization on the Cortex-A55.