ARM Cortex-A9 L1 Data Cache Miss Rate Anomalies During Array Access

When profiling the Level 1 data cache (L1d) on an ARM Cortex-A9, in this case on a Zynq-7020 device, unexpectedly low cache miss rates can be reported during array iteration. The issue manifests when measuring cache utilization with the Performance Monitoring Unit (PMU): the Level 1 data cache refill (CACHEREFILL) and Level 1 data cache access (CACHEACCESS) counters are read, and the miss rate is calculated as 100 * CACHEREFILL / CACHEACCESS. The reported miss rates are anomalously low even when the array far exceeds the 32 KB L1d capacity. For example, with an array of 990,000 elements the miss rate comes out at 0.3%, whereas a sequential walk over, say, 4-byte elements with the Cortex-A9's 32-byte cache lines should miss roughly once per eight accesses, i.e. a miss rate on the order of 12.5%, once the array no longer fits in the cache.

The problem arises in a loop that iterates over a growing array, invalidating the data cache before each iteration and monitoring the PMU counters. Even after verifying that the cache is invalidated, that the PMU counters are reset, and that compiler optimizations are disabled, the observed miss rates do not align with theoretical expectations. This discrepancy suggests a misunderstanding of cache behavior or PMU counter usage, or a hardware-software interaction issue.

Cache Invalidation Timing and PMU Counter Misinterpretation

Several factors can contribute to the unexpected miss rates observed during L1d cache profiling on the ARM Cortex-A9. One primary cause is the timing of cache invalidation relative to the PMU measurements. The Xil_DCacheInvalidate() function, provided by the Xilinx Board Support Package (BSP), is used to invalidate the data cache before each iteration. However, cache maintenance operations are not guaranteed to have completed by the time the next instruction executes: if the invalidation has not finished before the PMU counters are started, the maintenance traffic and the resulting cold misses can be attributed to the wrong part of the measurement window, skewing the counter values.

Another potential cause is the misinterpretation of the PMU counters themselves. The CACHEACCESS counter increments on every data cache access, while the CACHEREFILL counter increments only when a cache miss occurs, requiring a refill from the next level of memory. However, the relationship between these counters and the actual cache behavior is not always straightforward. For example, prefetching, write-back policies, and cache line replacement strategies can influence the counter values in ways that are not immediately obvious.
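For reference, these two events correspond to the ARMv7 architectural event numbers 0x04 (L1 data cache access) and 0x03 (L1 data cache refill). The short sketch below, using illustrative names rather than anything from the original code, captures the event numbers and the miss-rate formula used in this article.

#define EVT_L1D_CACHE         0x04   /* CACHEACCESS: every L1d access */
#define EVT_L1D_CACHE_REFILL  0x03   /* CACHEREFILL: only accesses that cause a refill */

/* Miss rate in percent, as computed throughout this article. */
static double miss_rate_percent(unsigned refills, unsigned accesses)
{
    return (accesses != 0) ? 100.0 * (double)refills / (double)accesses : 0.0;
}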

Additionally, the ARM Cortex-A9's cache architecture includes features such as write-back caching and speculative fetching, which further complicate the interpretation of the PMU counters. With write-back caching, store data is held in the cache and written out only when a dirty line is evicted, so write traffic to the next level does not line up in time with the stores that produced it. Speculative fetching can generate cache accesses that do not correspond directly to the program's data access pattern, producing counter values that differ from a simple instruction-by-instruction accounting.

Finally, the role of compiler optimizations, even when disabled, cannot be entirely ruled out. While the code explicitly disables optimizations, subtle interactions between the compiler, the BSP, and the hardware might still affect the cache behavior. For example, the compiler might generate instructions that inadvertently influence the cache state or the PMU counters.

Accurate Cache Profiling Through Proper PMU Configuration and Cache Management

To address the unexpected cache miss rates during L1d cache profiling on the ARM Cortex-A9, a systematic approach to PMU configuration and cache management is required. The following steps outline a detailed troubleshooting process to ensure accurate cache profiling results.

Step 1: Verify PMU Counter Configuration and Initialization

Before starting the cache profiling, ensure that the PMU counters are correctly configured and initialized. This includes binding counters to the appropriate events (CACHEACCESS and CACHEREFILL), enabling them, and resetting them to zero. The ARM Architecture Reference Manual (ARMv7-A edition) and the Cortex-A9 Technical Reference Manual describe the PMU control registers and their bit fields. Verify that the counters are reset at the beginning of each iteration and that no other code or interrupt handlers touch the counters during the measurement.
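The exact register interface depends on how the existing startEventMonitoring() helper is written. As a minimal sketch, assuming bare-metal privileged execution and direct CP15 access to the ARMv7 PMU registers (PMCR, PMSELR, PMXEVTYPER, PMCNTENSET, PMXEVCNTR), configuration and read-back could look like this; the helper names are illustrative and the event macros are the ones defined earlier.

/* Select an event counter (PMSELR), then bind it to an event (PMXEVTYPER). */
static void pmu_set_event(unsigned counter, unsigned event)
{
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(counter)); /* PMSELR     */
    __asm__ volatile("mcr p15, 0, %0, c9, c13, 1" :: "r"(event));   /* PMXEVTYPER */
}

/* Reset all counters, bind counters 0 and 1 to the two cache events, and enable them. */
static void pmu_init_cache_counters(void)
{
    unsigned pmcr;
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));    /* PMCR */
    pmcr |= (1u << 0) | (1u << 1) | (1u << 2);   /* E: enable, P: reset event counters, C: reset cycle counter */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));

    pmu_set_event(0, EVT_L1D_CACHE);             /* counter 0: L1d accesses */
    pmu_set_event(1, EVT_L1D_CACHE_REFILL);      /* counter 1: L1d refills  */

    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(0x3u));    /* PMCNTENSET: counters 0 and 1 */
}

/* Read back an event counter (PMXEVCNTR). */
static unsigned pmu_read_counter(unsigned counter)
{
    unsigned value;
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(counter)); /* PMSELR    */
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(value));   /* PMXEVCNTR */
    return value;
}

If the counters are to be read from user mode, PMUSERENR must additionally be set from a privileged mode.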

Step 2: Synchronize Cache Invalidation with PMU Counter Start

To ensure that the cache invalidation completes before the PMU measurement begins, insert a memory barrier between the invalidation and the counter start. The DSB (Data Synchronization Barrier) instruction completes only when all preceding memory accesses and cache maintenance operations have completed, so placing one after Xil_DCacheInvalidate() guarantees that the measurement starts from a genuinely cold cache rather than racing against an in-flight invalidation.

Xil_DCacheInvalidate();  // Invalidate the data cache
__DSB();                 // Insert a memory barrier
startEventMonitoring();  // Start the PMU counters
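The end of the measured region deserves the same care: drain outstanding memory accesses with a barrier before the counters are read. Reusing the illustrative pmu_read_counter() helper from Step 1:

// ... array traversal under measurement ...
__DSB();                                   // Wait for outstanding accesses before reading the counters
unsigned accesses = pmu_read_counter(0);   // CACHEACCESS
unsigned refills  = pmu_read_counter(1);   // CACHEREFILL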

Step 3: Analyze Cache Behavior with Different Array Sizes

To better understand the cache behavior, profile the cache miss rates with a range of array sizes, from smaller than the L1d cache to significantly larger. This will help identify any patterns or anomalies in the cache miss rates. For example, if the miss rate remains low even for large arrays, it could indicate that the cache is not being fully utilized or that the PMU counters are not capturing all cache misses.
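A sketch of such a sweep, assuming the helpers from the previous steps and a buffer deliberately much larger than the 32 KB L1d (names and sizes are illustrative), might look like this:

#include "xil_cache.h"
#include "xil_printf.h"

#define MAX_ELEMS (1u << 20)               /* 4 MiB of ints, well beyond the 32 KB L1d */
static volatile int data[MAX_ELEMS];

void profile_l1d_sweep(void)
{
    for (unsigned n = 1024; n <= MAX_ELEMS; n *= 2) {
        Xil_DCacheInvalidate();            /* start each measurement from a cold cache */
        __DSB();                           /* make sure the invalidation has completed */
        pmu_init_cache_counters();         /* reset and enable the two event counters  */

        long long sum = 0;
        for (unsigned i = 0; i < n; i++)
            sum += data[i];                /* sequential reads: ~1 refill per cache line */

        __DSB();
        unsigned accesses = pmu_read_counter(0);
        unsigned refills  = pmu_read_counter(1);

        /* xil_printf has no floating-point support, so print raw counts;
           miss rate = 100 * refills / accesses. */
        xil_printf("n=%u accesses=%u refills=%u\r\n", n, accesses, refills);
        (void)sum;                         /* keep the traversal from being optimized away */
    }
}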

Step 4: Investigate Cache Prefetching and Write-Back Policies

The ARM Cortex-A9’s cache prefetching and write-back policies can significantly impact the PMU counter values. Prefetching can lead to cache accesses that do not correspond directly to the program’s data access patterns, while write-back caching can delay cache refills. To isolate the effects of these features, disable prefetching and write-back caching temporarily and repeat the cache profiling. Compare the results with and without these features enabled to determine their impact on the cache miss rates.
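On the Cortex-A9, the data-side prefetcher is controlled through the Auxiliary Control Register (ACTLR); a hedged sketch for toggling it follows (the bit position is the one documented in the Cortex-A9 TRM and should be re-checked there, and the write requires privileged, and on some configurations secure, execution). The cacheability policy of the buffer itself can instead be changed through the BSP's MMU helpers such as Xil_SetTlbAttributes().

/* Enable or disable the Cortex-A9 L1 data-side prefetcher.
   Assumption: ACTLR bit 2 is the L1 prefetch enable; verify against the Cortex-A9 TRM. */
static void set_l1_dside_prefetch(int enable)
{
    unsigned actlr;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 1" : "=r"(actlr));    /* read ACTLR  */
    if (enable)
        actlr |= (1u << 2);
    else
        actlr &= ~(1u << 2);
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 1" :: "r"(actlr));    /* write ACTLR */
    __asm__ volatile("isb");                                        /* make the change visible to subsequent code */
}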

Step 5: Validate Results with Hardware Performance Counters

In addition to the core PMU counters, validate the profiling results against an independent reference. On the Zynq-7020, the L2 cache controller (an Arm L2C-310) has its own event counters that observe the traffic missing the L1, and a simple analytical estimate of how many refills a given access pattern should generate is also a useful yardstick. Cross-referencing the PMU values against such references makes it easier to spot a mis-configured counter or a measurement window that does not actually cover the loop.
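One simple analytical reference, assuming a sequential cold-cache walk and the Cortex-A9's 32-byte cache lines, is the number of distinct lines the traversal touches, since each of those should produce roughly one refill:

/* Expected miss rate (percent) for a sequential, cold-cache walk:
   roughly one refill per cache line touched. For 4-byte elements this
   works out to about 12.5%, far above the reported 0.3%. */
static double expected_sequential_miss_rate(unsigned n_elems, unsigned elem_size)
{
    const unsigned line_bytes = 32;                        /* Cortex-A9 L1 line size */
    double refills  = ((double)n_elems * elem_size) / line_bytes;
    double accesses = (double)n_elems;
    return 100.0 * refills / accesses;
}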

Step 6: Review Compiler and BSP Interactions

Finally, review the interactions between the compiler, the BSP, and the hardware to ensure they are not inadvertently affecting the measurement. Examine the generated assembly to verify that the cache invalidation, barrier, and counter-start instructions appear in the intended order, with no memory accesses hoisted in between. Also consult the BSP sources and documentation to confirm exactly which cache levels Xil_DCacheInvalidate() operates on and whether it already issues its own barriers.
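Even with optimizations disabled, a full compiler barrier around the measured region is cheap insurance that no memory accesses are moved across the counter start; a minimal sketch, reusing the article's startEventMonitoring() helper:

/* Prevent the compiler from reordering memory accesses across this point.
   (Distinct from __DSB(), which orders the hardware's view of memory.) */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

static long long measured_walk(volatile int *buf, unsigned n)
{
    long long sum = 0;

    COMPILER_BARRIER();
    startEventMonitoring();        /* start the PMU counters */
    COMPILER_BARRIER();

    for (unsigned i = 0; i < n; i++)
        sum += buf[i];

    COMPILER_BARRIER();            /* keep the traversal inside the measured window */
    return sum;
}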

By following these steps, you can systematically identify and address the factors behind the unexpected miss rates during L1d cache profiling on the ARM Cortex-A9, and obtain measurements that genuinely reflect the processor's cache behavior.
