ARM Cortex-A76 and DSU Cache Way Partitioning Misbehavior
Cache way partitioning is a critical feature in modern ARM processors, particularly in multi-core systems where shared resources such as the Last Level Cache (LLC) must be managed to ensure predictable performance. The ARM Cortex-A76, coupled with the DynamIQ Shared Unit (DSU), supports cache way partitioning to isolate LLC capacity for specific tasks or cores. In the scenario described here, however, the feature misbehaves: cache evictions occur despite the partitioning scheme, particularly once co-runners are introduced. This suggests that the way partitioning is not being fully respected, leading to performance degradation and unexpectedly high LLC miss rates.
The problem manifests when a victim task is allocated a portion of the LLC (e.g., 1MB of a 2MB cache) and co-runners are introduced to use the remaining capacity. Ideally, the co-runners should not evict the victim task's cache lines, since the partitions do not overlap. The observed behavior, however, shows a very high eviction rate (misses on more than 90% of the victim's cache accesses), which contradicts the expected outcome. The issue persists even with software-based cache set partitioning tools such as palloc, which have proven effective on other platforms such as the Raspberry Pi 4. Disabling prefetching mechanisms via the CPUECTLR register yields a partial improvement but does not fully resolve the problem, suggesting that prefetching behavior does not honor the partitioning boundaries.
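The set-partitioning approach mentioned above relies on page coloring: the kernel steers each page to a group of cache sets by choosing which physical frame backs it. A minimal sketch of the arithmetic, assuming a 2MB, 16-way LLC with 64-byte lines and 4KB pages (hypothetical geometry; check the actual DSU configuration):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical LLC geometry -- verify against the real DSU configuration:
 * 2 MiB capacity, 16 ways, 64-byte lines => 2048 sets. */
#define LLC_SIZE   (2u * 1024 * 1024)
#define LLC_WAYS   16u
#define LINE_SIZE  64u
#define NUM_SETS   (LLC_SIZE / (LLC_WAYS * LINE_SIZE))   /* 2048 */
#define PAGE_SHIFT 12u                                   /* 4 KiB pages */

/* The set index occupies physical-address bits [6..16] here (6 line-offset
 * bits plus 11 set-index bits). Bits [12..16] lie above the page offset,
 * i.e. inside the physical frame number, so the allocator controls them. */
static unsigned set_index(uint64_t pa)
{
    return (uint64_t)(pa / LINE_SIZE) % NUM_SETS;
}

/* "Color" of a physical frame: the 5 set-index bits the allocator can
 * choose, giving 32 colors of 64 KiB of LLC capacity each. */
static unsigned page_color(uint64_t pfn)
{
    return pfn & ((1u << 5) - 1);
}
```

With this geometry, a palloc-style allocator that reserves half of the 32 colors for the victim task confines it to half of the sets, and therefore to 1MB of the 2MB LLC, independently of way partitioning.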
Prefetching Mechanisms and Cache Partitioning Misalignment
One of the key factors contributing to the unexpected behavior is the interaction between prefetching mechanisms and cache way partitioning. Prefetching is designed to improve performance by anticipating memory accesses and loading data into the cache before it is explicitly requested. However, in the context of cache way partitioning, prefetching can inadvertently violate the isolation boundaries set by the partitioning scheme. This misalignment is evident in the performance counter data collected during the experiments.
When prefetching is disabled, the LLC miss rate for read-only co-runners drops by roughly 30 percentage points, indicating that prefetching contributes substantially to the observed evictions. The remaining miss rate of about 60%, however, shows that other factors are at play. The performance counters reveal prefetching activity even after the user-controllable prefetchers are disabled in the CPUECTLR register. This implies that certain prefetch operations are either not fully disabled or are initiated by other components of the system, such as the DSU or the memory controllers.
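A read-modify-write of CPUECTLR_EL1 is how the user-controllable prefetchers are toggled. The sketch below is hedged: CPUECTLR_EL1 is IMPLEMENTATION DEFINED, the `S3_0_C15_C1_4` encoding and the bit positions in `PF_DISABLE_BITS` are placeholders to be checked against the Cortex-A76 TRM, and the register is only accessible at EL1 or higher:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical prefetch-control bit positions in CPUECTLR_EL1.
 * The real fields are IMPLEMENTATION DEFINED; take the positions
 * from the Cortex-A76 TRM rather than trusting these placeholders. */
#define PF_DISABLE_BITS ((1ull << 32) | (1ull << 33))

static inline uint64_t read_cpuectlr(void)
{
#if defined(__aarch64__)
    uint64_t v;
    /* Assumed S-encoding for CPUECTLR_EL1; requires EL1 privilege. */
    __asm__ volatile("mrs %0, s3_0_c15_c1_4" : "=r"(v));
    return v;
#else
    return 0;   /* not AArch64: nothing to read in this sketch */
#endif
}

/* Value that would disable the targeted prefetchers while leaving
 * every other field of the register untouched. */
static inline uint64_t with_prefetch_disabled(uint64_t cpuectlr)
{
    return cpuectlr | PF_DISABLE_BITS;
}
```

The corresponding write-back (`msr`) and a context-synchronizing `isb` would follow in kernel code; the point of the read-modify-write is that CPUECTLR_EL1 also carries unrelated fields that must be preserved.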
The performance counter data also highlights discrepancies between read and write operations. Disabling prefetching has a more pronounced effect on reads, reducing their LLC miss rate significantly, while writes remain largely unaffected. This suggests that write operations may be bypassing the cache (for example, via a write-streaming, no-allocate mode) or being handled differently by the memory subsystem, further complicating the cache partitioning behavior.
Resolving Cache Partitioning Issues with Synchronization and Configuration Adjustments
To address the cache way partitioning misbehavior, a combination of synchronization techniques and configuration adjustments is necessary. The first step is to ensure that all prefetching mechanisms are properly disabled or configured to respect the partitioning scheme. This involves not only setting the appropriate bits in the CPUECTLR register but also verifying that other system components, such as the DSU and memory controllers, are not initiating prefetching operations that violate the partitioning boundaries.
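On the DSU, way partitioning is expressed as per-scheme way masks packed into a partition-control register (CLUSTERPARTCR_EL1 in the DSU documentation). The helper below builds such masks; the 16-bit field width and packing layout are assumptions to be checked against the DSU TRM, not a definitive register map:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: one 16-bit way mask per partitioning scheme, packed
 * low-to-high into a single CLUSTERPARTCR-style value. Verify the real
 * field widths and offsets in the DSU TRM. */
#define LLC_WAYS       16u
#define SCHEME_FIELD_W 16u

/* Way mask covering `nways` consecutive ways starting at way `first`. */
static uint32_t way_mask(unsigned first, unsigned nways)
{
    return ((1u << nways) - 1u) << first;
}

/* Pack per-scheme way masks into one register value. */
static uint64_t pack_partcr(const uint32_t mask[], unsigned nschemes)
{
    uint64_t v = 0;
    for (unsigned i = 0; i < nschemes; i++)
        v |= (uint64_t)mask[i] << (i * SCHEME_FIELD_W);
    return v;
}
```

For the scenario in question, giving the victim's scheme `way_mask(0, 8)` (0x00FF) and the co-runners' scheme `way_mask(8, 8)` (0xFF00) splits a 16-way, 2MB LLC into two non-overlapping 1MB halves.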
Implementing data synchronization barriers (DSBs) and cache management instructions can help enforce the partitioning scheme. A DSB ensures that all outstanding memory accesses and cache maintenance operations complete before execution proceeds, narrowing the window in which speculative prefetching can interfere with the partitioning. Additionally, explicit cache clean and invalidate operations can be used to maintain coherence and to purge lines that landed in the wrong partition.
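The clean/invalidate-plus-barrier sequence can be sketched as below. The maintenance instructions (`dc civac`, `dsb sy`) are architectural ARMv8-A operations; the 64-byte line size is an assumption (in practice it should be derived from CTR_EL0), and the function compiles to a counting no-op on non-AArch64 hosts so the sketch stays portable:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* assumed line size; read CTR_EL0 in real code */

/* Clean and invalidate every cache line backing [buf, buf+len) to the
 * point of coherency, then fence with DSB so all maintenance completes
 * before execution continues. Returns the number of lines covered. */
static size_t flush_buffer(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~((uintptr_t)LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    size_t n = 0;
    for (; p < end; p += LINE_SIZE, n++) {
#if defined(__aarch64__)
        __asm__ volatile("dc civac, %0" :: "r"(p) : "memory");
#endif
    }
#if defined(__aarch64__)
    __asm__ volatile("dsb sy" ::: "memory");
#endif
    return n;
}
```

On Linux, `dc civac` is usable from EL0 because the kernel sets SCTLR_EL1.UCI; running such a flush over the victim's working set after reconfiguring the partitions removes stale lines left in the wrong ways.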
Another critical step is to validate the cache way partitioning configuration. This involves verifying that the partitioning registers are correctly programmed and that the partitioning scheme is being applied consistently across all cores and cache levels. Debugging tools and performance counters can be used to monitor cache behavior and identify any discrepancies between the expected and actual partitioning outcomes.
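One cheap validation step is a consistency check on the way masks read back from the partitioning registers: schemes must be pairwise disjoint (otherwise isolation is broken by construction) and should jointly cover all ways (otherwise capacity is silently wasted). The register read-back itself is privileged; this sketch only checks the decoded masks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sanity-check decoded per-scheme way masks:
 *  - no two schemes may share a way (isolation),
 *  - no mask may name a nonexistent way,
 *  - together the schemes should cover every way (full utilization). */
static bool partition_ok(const uint32_t mask[], unsigned n, uint32_t all_ways)
{
    uint32_t seen = 0;
    for (unsigned i = 0; i < n; i++) {
        if (mask[i] & seen)        /* overlap => isolation is broken */
            return false;
        if (mask[i] & ~all_ways)   /* refers to a way that doesn't exist */
            return false;
        seen |= mask[i];
    }
    return seen == all_ways;       /* full coverage */
}
```

Running this check after programming the registers, and again periodically, catches configurations that were clobbered by firmware or by another core reprogramming the shared DSU registers.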
Finally, it is essential to consider the impact of software-based partitioning tools and their interaction with hardware partitioning mechanisms. Tools like palloc may need to be adapted to account for the specific behavior of the ARM Cortex-A76 and DSU, particularly in terms of prefetching and memory access patterns. By combining hardware and software approaches, it is possible to achieve the desired level of cache isolation and ensure predictable performance in multi-core systems.
In conclusion, the unexpected cache way partitioning behavior on the ARM Cortex-A76 with DSU can be attributed to misaligned prefetching mechanisms and incomplete partitioning configuration. By addressing these issues through synchronization techniques, configuration adjustments, and careful validation, it is possible to achieve the intended cache isolation and optimize system performance.