L2 Cache Initialization and Event Counter Monitoring in Cortex-A9 with PL310
The initialization of the L2 cache in ARM Cortex-A9 systems using the PL310 controller is a critical step in achieving good system performance. When properly configured, the L2 cache significantly reduces memory access latency by holding frequently accessed data. Improper initialization or configuration, however, can lead to unexpected behavior, such as cache hit events (DRHIT and DWHIT) never appearing in the event counters. This issue is often tied to the interaction between the Memory Management Unit (MMU), the L1 and L2 caches, and the DDR memory subsystem.
In the provided scenario, the L2 cache initialization sequence is implemented in U-Boot, and the event counters are used to monitor cache hits. However, the counters do not update until the MMU is enabled. This is exactly what the ARM architecture specifies: memory attributes, including cacheability, come from the translation tables, and with the MMU disabled all data accesses are treated as Strongly-ordered, i.e. non-cacheable, so the L2 cache is never allocated into.
The initialization sequence includes enabling the L1 instruction and data caches, configuring the PL310 L2 cache controller, and setting up the event counters, which are programmed to count data read hits (DRHIT) and data write hits (DWHIT). The counters only start incrementing after the MMU is enabled, confirming that the MMU must be on before any memory region can be treated as cacheable.
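As a concrete illustration, the sketch below programs the two PL310 event counters from C, using the register offsets documented in the L2C-310 TRM (Event Counter Control at 0x200, the counter configuration registers at 0x204/0x208, the counter values at 0x20C/0x210). The base address and helper name are assumptions for illustration; 0xF8F02000 is the Zynq-7000 mapping and will differ on other parts.

```c
#include <stdint.h>

/* PL310 register block; the base address is platform-specific
 * (0xF8F02000 is the Zynq-7000 mapping, shown here as an example). */
#define L2CC_BASE          0xF8F02000u
#define L2CC_EV_CTRL       (*(volatile uint32_t *)(L2CC_BASE + 0x200))
#define L2CC_EV_CNT1_CFG   (*(volatile uint32_t *)(L2CC_BASE + 0x204))
#define L2CC_EV_CNT0_CFG   (*(volatile uint32_t *)(L2CC_BASE + 0x208))
#define L2CC_EV_CNT1       (*(volatile uint32_t *)(L2CC_BASE + 0x20C))
#define L2CC_EV_CNT0       (*(volatile uint32_t *)(L2CC_BASE + 0x210))

/* Event sources occupy bits [5:2] of each counter configuration register */
#define L2CC_EV_DRHIT      (0x2u << 2)    /* data read hit  */
#define L2CC_EV_DWHIT      (0x4u << 2)    /* data write hit */

static void l2cc_event_counters_init(void)
{
    L2CC_EV_CTRL = 0;                     /* stop counters while configuring */
    L2CC_EV_CNT0_CFG = L2CC_EV_DRHIT;     /* counter 0 counts read hits      */
    L2CC_EV_CNT1_CFG = L2CC_EV_DWHIT;     /* counter 1 counts write hits     */
    L2CC_EV_CTRL = (1u << 2) | (1u << 1); /* reset both counter values       */
    L2CC_EV_CTRL = 1u;                    /* bit 0: enable the counters      */
}
```

Reading L2CC_EV_CNT0 and L2CC_EV_CNT1 afterwards shows whether read and write hits are occurring; in the scenario described here, both stay at zero until the MMU comes on.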
MMU Dependency and Cacheable Memory Region Configuration
The MMU plays a pivotal role in defining memory attributes, including cacheability. In ARM architectures, the MMU translates virtual addresses to physical addresses and assigns attributes such as cacheability, shareability, and access permissions. With the MMU disabled, data accesses default to Strongly-ordered, so every access bypasses the caches and goes directly to DDR memory. This explains why the event counters register no cache hits until the MMU is enabled.
When the MMU is enabled, memory regions can be marked as cacheable in the page tables, allowing the L1 and L2 caches to hold copies of frequently accessed data and reducing trips to the slower DDR memory. Marking memory as Normal also unlocks attribute-dependent optimizations: Normal memory accesses may be performed speculatively, reordered, and merged, none of which is permitted for Strongly-ordered accesses. These optimizations can significantly improve performance, but they rely on the correct configuration of the MMU and cache subsystems.
In the provided initialization sequence, the MMU is enabled after configuring the L2 cache and event counters. The page tables are set up to create a flat memory mapping with 1MB sections, and the first entry is marked as cacheable, normal, write-back, and write-allocate. This configuration ensures that memory accesses to the specified region are cached, allowing the event counters to register cache hits.
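A minimal sketch of such a table is shown below, assuming the ARMv7 short-descriptor format with the access flag and TEX remap disabled (SCTLR.AFE = TRE = 0); the names and the choice of domain 0 are illustrative, not taken from the original U-Boot code.

```c
#include <stdint.h>

/* 4096 x 1MB sections cover the full 4GB space; TTBR0 requires
 * 16KB alignment for this table. */
static uint32_t mmu_table[4096] __attribute__((aligned(16384)));

/* Short-descriptor section attributes (SCTLR.AFE = 0, TEX remap off) */
#define SECTION            0x2u           /* bits [1:0] = 0b10: section */
#define SECT_B             (1u << 2)
#define SECT_C             (1u << 3)
#define SECT_AP_RW         (0x3u << 10)   /* read/write at PL0 and PL1  */
#define SECT_TEX(x)        ((uint32_t)(x) << 12)

/* Normal memory, outer and inner write-back write-allocate:
 * TEX = 0b001, C = 1, B = 1 */
#define SECT_NORMAL_WBWA   (SECT_TEX(1) | SECT_C | SECT_B | SECT_AP_RW | SECTION)
/* Strongly-ordered: TEX = 0b000, C = 0, B = 0 */
#define SECT_STRONG        (SECT_AP_RW | SECTION)

static void mmu_table_init(void)
{
    /* Flat mapping: each 1MB virtual section maps to the same
     * physical address */
    for (uint32_t i = 0; i < 4096; i++)
        mmu_table[i] = (i << 20) | SECT_STRONG;

    /* First entry: Normal, write-back, write-allocate, so accesses to
     * the first megabyte are cached and produce L1/L2 hits */
    mmu_table[0] = SECT_NORMAL_WBWA;      /* base address 0x00000000 */
}
```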
Performance Anomalies and Cache Configuration Debugging
Despite the correct initialization of the L2 cache and MMU, performance anomalies were observed when testing the DDR memory subsystem. Specifically, enabling the D-Cache, branch prediction, and MMU resulted in longer access times compared to disabling these features. This behavior is counterintuitive, as enabling the cache and MMU should generally improve performance by reducing memory access latency.
The performance anomaly can be attributed to several factors, including cache thrashing, inefficient cache line utilization, or improper cache maintenance operations. Cache thrashing occurs when the cache is repeatedly filled and evicted due to conflicting memory access patterns, leading to increased memory traffic and longer access times. Inefficient cache line utilization can occur if the data access patterns do not align with the cache line size, resulting in frequent cache misses. Improper cache maintenance operations, such as failing to invalidate or clean the cache at the appropriate times, can also degrade performance by causing stale data to be accessed.
To diagnose and resolve the performance anomaly, the following steps should be taken:
- Cache Line Size and Access Pattern Analysis: Ensure that the data access patterns align with the cache line size. Both the Cortex-A9 L1 caches and the PL310 L2 cache use 32-byte lines. Accessing data in chunks that match the cache line size improves cache utilization and reduces thrashing.
- Cache Maintenance Operations: Verify that cache maintenance operations, such as invalidating or cleaning the cache, are performed at the appropriate times. The cache should be invalidated before enabling it so that stale data is never hit, and cleaned before disabling it so that dirty data is written back to memory (see the first sketch after this list).
- Memory Access Latency Measurement: Measure the memory access latency with and without the cache enabled to identify bottlenecks. This can be done using performance counters or timers to track the time taken for read and write operations (see the second sketch after this list).
- Event Counter Monitoring: Continue monitoring the event counters to confirm that cache hits are being registered. If they are not, verify that the memory region is marked cacheable in the page tables and that the cache is actually enabled.
- Branch Prediction Impact: Evaluate the impact of branch prediction on performance. While branch prediction usually improves performance by reducing pipeline stalls, it adds overhead when prediction accuracy is low; disabling it and comparing the results isolates its contribution.
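For the cache maintenance item, the following is a sketch of the invalidate-before-enable and clean-before-disable sequences on the PL310, using its by-way maintenance registers; an 8-way configuration and the platform-specific base address from the earlier sketch are assumptions.

```c
#include <stdint.h>

#define L2CC_BASE           0xF8F02000u   /* platform-specific, as before */
#define L2CC_CTRL           (*(volatile uint32_t *)(L2CC_BASE + 0x100))
#define L2CC_CACHE_SYNC     (*(volatile uint32_t *)(L2CC_BASE + 0x730))
#define L2CC_INV_WAY        (*(volatile uint32_t *)(L2CC_BASE + 0x77C))
#define L2CC_CLEAN_INV_WAY  (*(volatile uint32_t *)(L2CC_BASE + 0x7FC))

#define L2CC_ALL_WAYS       0xFFu         /* 8-way configuration assumed */

/* Invalidate everything before enabling, so no stale lines can hit. */
static void l2cc_enable(void)
{
    L2CC_INV_WAY = L2CC_ALL_WAYS;
    while (L2CC_INV_WAY & L2CC_ALL_WAYS)  /* bits clear as ways finish    */
        ;
    L2CC_CACHE_SYNC = 0;                  /* drain the controller buffers */
    L2CC_CTRL = 1u;                       /* bit 0: L2 enable             */
}

/* Clean and invalidate before disabling, so dirty lines reach DDR. */
static void l2cc_disable(void)
{
    L2CC_CLEAN_INV_WAY = L2CC_ALL_WAYS;
    while (L2CC_CLEAN_INV_WAY & L2CC_ALL_WAYS)
        ;
    L2CC_CACHE_SYNC = 0;
    L2CC_CTRL = 0;
}
```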
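For the latency measurement item, the Cortex-A9 PMU cycle counter (PMCCNTR) gives a cheap per-run cycle count. The sketch below is one way to use it; the helper names are ours, and U-Boot runs privileged, so no PMUSERENR setup is needed.

```c
#include <stddef.h>
#include <stdint.h>

static inline void ccnt_init(void)
{
    /* PMCR: bit 0 enables the counters, bit 2 resets the cycle counter */
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r"(0x5u));
    /* PMCNTENSET: bit 31 switches PMCCNTR on */
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31));
}

static inline uint32_t ccnt_read(void)
{
    uint32_t cycles;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
}

/* Time a read sweep over a DDR buffer; running it twice separates the
 * cold (miss-dominated) pass from the warm (hit-dominated) pass. */
static uint32_t time_read_sweep(const volatile uint32_t *buf, size_t words)
{
    volatile uint32_t sink;
    uint32_t t0, t1;

    __asm__ volatile ("dsb" ::: "memory"); /* settle outstanding accesses */
    t0 = ccnt_read();
    for (size_t i = 0; i < words; i++)
        sink = buf[i];
    __asm__ volatile ("dsb" ::: "memory");
    t1 = ccnt_read();
    (void)sink;
    return t1 - t0;
}
```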
By systematically analyzing and addressing these factors, the performance anomaly can be resolved, and the expected performance benefits of enabling the D-Cache, branch prediction, and MMU can be realized.
Implementing Data Synchronization Barriers and Cache Management
To ensure reliable and optimal performance, it is essential to use barriers and proper cache management. Barrier instructions such as the Data Synchronization Barrier (DSB) and the Instruction Synchronization Barrier (ISB) guarantee that memory accesses and cache maintenance operations complete in the correct order. They are particularly important in multi-core systems, where cache coherency must be maintained across cores.
In the provided initialization sequence, DSB and ISB instructions are used after enabling the D-side prefetch and before initializing the MMU. These barriers ensure that all cache maintenance operations are completed before proceeding with the MMU initialization. This is critical to prevent race conditions and ensure that the cache and MMU are in a consistent state.
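The pattern looks roughly like the sketch below, which places a DSB after the page-table writes and TLB invalidation and an ISB after setting SCTLR.M. Leaving TTBCR at its reset value and granting only domain 0 client access are simplifying assumptions, not details from the original sequence.

```c
#include <stdint.h>

static inline void dsb(void) { __asm__ volatile ("dsb" ::: "memory"); }
static inline void isb(void) { __asm__ volatile ("isb" ::: "memory"); }

static void mmu_enable(uint32_t *table)
{
    uint32_t sctlr;

    dsb();  /* page-table and cache-maintenance writes must land first */

    /* TTBR0 <- 16KB-aligned table base; DACR <- domain 0 as client */
    __asm__ volatile ("mcr p15, 0, %0, c2, c0, 0" :: "r"(table));
    __asm__ volatile ("mcr p15, 0, %0, c3, c0, 0" :: "r"(0x1u));

    /* TLBIALL: discard stale translations, then wait for completion */
    __asm__ volatile ("mcr p15, 0, %0, c8, c7, 0" :: "r"(0u));
    dsb();

    /* SCTLR.M (bit 0) turns the MMU on */
    __asm__ volatile ("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
    __asm__ volatile ("mcr p15, 0, %0, c1, c0, 0" :: "r"(sctlr | 0x1u));
    isb();  /* later instructions must execute in the new MMU context */
}
```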
Additionally, cache management techniques such as partitioning and locking can be used to optimize performance for specific workloads. Partitioning reserves part of the cache for critical data, reducing contention; locking pins specific contents so they cannot be evicted. On the PL310, both are typically realized through the lockdown-by-way registers, which exclude selected ways from new allocations.
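As a hedged sketch of lockdown by way (Data Lockdown 0 register at offset 0x900 for master 0): allocation is first restricted to the target way, the critical buffer is touched so it allocates there, and the mask is then flipped so that way can no longer be evicted. The helper and the choice of way 7 are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define L2CC_BASE         0xF8F02000u     /* platform-specific, as before */
#define L2CC_D_LOCKDOWN0  (*(volatile uint32_t *)(L2CC_BASE + 0x900))

/* Reserve way 7 of an 8-way PL310 for a critical buffer. */
static void l2cc_lock_buffer(const volatile uint32_t *buf, size_t words)
{
    volatile uint32_t sink;

    L2CC_D_LOCKDOWN0 = 0x7Fu;             /* ways 0-6 locked: only way 7
                                             accepts new allocations      */
    for (size_t i = 0; i < words; i += 8) /* one load per 32-byte line    */
        sink = buf[i];
    (void)sink;
    L2CC_D_LOCKDOWN0 = 0x80u;             /* pin way 7; ways 0-6 reopened */
}
```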
By implementing these techniques and carefully analyzing the system’s behavior, the performance of the Cortex-A9 system with the PL310 L2 cache can be optimized, ensuring that the expected benefits of cache and MMU enablement are realized.