ARM Cortex-A Cache Coherency and Memory Access Latency in EL3 vs NS.EL1
The performance discrepancy of memcpy() between EL3 (ARM Trusted Firmware BL31) and NS.EL1 (a Linux kernel module) is rooted in differences in cache coherency handling, memory access latency, and the code each world runs. In NS.EL1, the Linux kernel operates in the non-secure state with the MMU enabled, fully cacheable kernel mappings, and heavily optimized memory routines. EL3 operates in the secure world under a different set of constraints: stricter memory isolation, possible cache maintenance around world switches, and a minimal C library that lacks the kernel's optimizations. The flat memory mapping used in EL3 prevents translation faults, but it does not inherently guarantee efficient cache utilization or low-latency memory access. The PMU cycle counts reveal a significant gap, with EL3 memcpy() operations roughly 10x slower than their NS.EL1 counterparts. This suggests the overhead is not merely the memcpy() implementation but the hardware-software interaction at each exception level, in particular how memory is mapped and cached at EL3.
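The cycle counts behind these numbers can be reproduced with the PMU cycle counter. A minimal sketch follows; it assumes PMCCNTR_EL0 has already been enabled on the target (PMCR_EL0.E, PMCNTENSET_EL0.C, and PMUSERENR_EL0 for EL0 access), with USE_PMU defined only for that target build so the sketch stays compilable elsewhere:

```c
#include <stdint.h>
#include <string.h>

/* Read the cycle counter. On the AArch64 target this is PMCCNTR_EL0,
 * which must already be enabled (PMCR_EL0.E, PMCNTENSET_EL0.C, and
 * PMUSERENR_EL0 for EL0 access); build with -DUSE_PMU there. On other
 * hosts a dummy counter keeps the sketch compilable. */
static inline uint64_t read_cycles(void)
{
#if defined(__aarch64__) && defined(USE_PMU)
    uint64_t c;
    __asm__ volatile("isb; mrs %0, pmccntr_el0" : "=r"(c));
    return c;
#else
    static uint64_t fake;
    return fake += 1000;          /* placeholder off-target */
#endif
}

/* Time one memcpy() of len bytes and return the elapsed cycle count. */
static uint64_t time_memcpy(void *dst, const void *src, size_t len)
{
    uint64_t start = read_cycles();
    memcpy(dst, src, len);
    return read_cycles() - start;
}
```

Sweeping len over powers of two and comparing cycles against buffer size is what exposes the linear scaling discussed below.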
The PMU cycle measurements indicate that the degradation scales linearly with buffer size, which points to a per-access cost rather than a one-time overhead. The most likely per-access cost is the memory attributes of the EL3 mapping: with the MMU enabled, cacheability is determined by the attributes in the EL3 translation regime, so a flat (identity) mapping that marks the buffers as Device or Non-cacheable memory forces every load and store out to DRAM, which alone can account for an order-of-magnitude slowdown. The secure world's handling of cache maintenance and memory barriers can add further latency, and the Juno R2 board's specific L1 and L2 cache configuration shapes the absolute numbers. Note that identity-mapped (physical) addressing is not inherently slow: an identity mapping with Normal, write-back cacheable, shareable attributes performs comparably to the virtual mappings used in NS.EL1.
Flat Memory Mapping and Cache Inefficiencies in EL3
One of the primary causes of the high overhead in EL3 is the flat (identity) memory mapping set up during BL31 initialization. Such a mapping guarantees that every covered address has a valid translation, so translation faults cannot occur, but it does not by itself make accesses cacheable: with the MMU enabled, cacheability comes from the attributes the mapping assigns. The ARM Cortex-A architecture relies on the MMU and cache hierarchy to hide memory latency; in NS.EL1 the Linux kernel maps kernel memory as Normal write-back cacheable and benefits from virtual address translation, cache-line prefetching, and data locality. If the EL3 flat mapping covers the relevant buffers with weaker attributes, or access patterns ignore cache-line granularity, the result is more cache misses, more DRAM traffic, and higher memory access latency.
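In TF-A, the cacheability of the flat mapping is decided by the attribute flags passed to the xlat tables library, so a first check is whether the buffers used by the secure-world copy fall in a region mapped as MT_MEMORY (Normal, write-back cacheable) rather than MT_NON_CACHEABLE or MT_DEVICE. A sketch using the xlat_tables_v2 API, with illustrative addresses only:

```c
#include <lib/xlat_tables/xlat_tables_v2.h>

/* Illustrative addresses only -- use the platform's real memory map. */
#define SEC_BUF_BASE   0x88000000ULL
#define SEC_BUF_SIZE   0x00100000U

void map_secure_copy_buffer(void)
{
	/* Normal, write-back cacheable, read/write, secure: accesses hit
	 * the caches and memcpy() behaves comparably to NS.EL1. */
	mmap_add_region(SEC_BUF_BASE, SEC_BUF_BASE, SEC_BUF_SIZE,
			MT_MEMORY | MT_RW | MT_SECURE);

	/* By contrast, MT_NON_CACHEABLE (or MT_DEVICE) forces every access
	 * out to DRAM, a common cause of ~10x memcpy() slowdowns at EL3:
	 *
	 *   mmap_add_region(base, base, size,
	 *                   MT_NON_CACHEABLE | MT_RW | MT_SECURE);
	 */
}
```

This is a memory-map configuration fragment for a BL31 platform port, not standalone code; the region macros and mmap_add_region() come from TF-A's xlat_tables_v2 library.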
Another contributing factor is the difference in cache coherency handling between the secure and non-secure worlds. In NS.EL1, the Linux kernel maintains coherency through a mix of hardware coherency and explicit software mechanisms, issuing data synchronization barriers and cache maintenance operations at DMA and mapping boundaries. In EL3, BL31 must issue equivalent barriers and maintenance itself, and doing so too conservatively, for example cleaning or invalidating whole ranges around every operation, adds overhead directly to each copy. The real cost of "using physical addresses" in EL3 is usually the mapping attributes behind them: if the identity-mapped region is not marked as Normal cacheable memory, accesses are not cached at all, resulting in far more DRAM traffic and higher memory access latency.
The ARM Trusted Firmware's own cache maintenance policy compounds this. The Linux kernel folds maintenance into its memory management paths, so data stays coherent across the hierarchy with little extra work per operation; if BL31 instead performs blanket clean/invalidate operations around secure-world entry and exit, that cost lands on every call. The flat mapping simplifies address translation but says nothing about coherency policy, so memory-intensive operations like memcpy() pay the price unless maintenance is scoped to exactly the ranges that need it.
Optimizing Cache Coherency and Memory Access in EL3
To address the high overhead of memcpy() in EL3, several optimizations can be applied. First, audit the flat memory mapping: confirm that the buffers involved are mapped as Normal, write-back cacheable, shareable memory in BL31's translation tables, since a Device or Non-cacheable mapping is the most common cause of an order-of-magnitude memcpy() slowdown at EL3. Beyond attributes, aligning buffers and access patterns to the cache-line size reduces misses and partial-line traffic.
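One portable piece of this is allocating buffers on cache-line boundaries so a bulk copy touches no partial lines. The 64-byte constant below is an assumption, typical for Cortex-A cores; on hardware the real line size can be read from CTR_EL0:

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64u  /* typical Cortex-A D-cache line; confirm via CTR_EL0 */

/* Allocate a buffer that starts on a cache-line boundary, with its size
 * rounded up to whole lines. aligned_alloc() (C11) requires the size to
 * be a multiple of the alignment, hence the rounding. */
static void *alloc_line_aligned(size_t size)
{
    size_t rounded = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    return aligned_alloc(CACHE_LINE, rounded);
}
```

In BL31 itself, where there is no heap allocator of this kind, the same effect is achieved with statically aligned buffers (e.g. __aligned(CACHE_LINE)).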
Second, cache maintenance in EL3 should be scoped tightly. Data synchronization barriers and clean/invalidate operations are only required where data crosses a coherency boundary, for example buffers shared with the non-secure world or with non-coherent devices. Performing maintenance over exactly the affected address ranges, once per transfer, rather than as blanket flushes, removes most of the maintenance overhead from memory-intensive operations like memcpy().
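TF-A already exposes range-scoped maintenance helpers (clean_dcache_range(), inv_dcache_range(), flush_dcache_range(), declared in arch_helpers.h), so the optimization is mostly about scoping them correctly. A sketch of the pattern for a buffer shared with a non-coherent observer; the two function names here are illustrative, only the TF-A helpers are real API:

```c
#include <stddef.h>
#include <stdint.h>
#include <arch_helpers.h>  /* TF-A: clean_dcache_range(), inv_dcache_range() */

/* Publish data written at EL3 to a non-coherent observer: clean only the
 * lines covering the buffer. TF-A's range helpers end with a DSB, so the
 * explicit dsbsy() before signalling is belt-and-braces. */
static void publish_to_nonsecure(void *buf, size_t len)
{
	clean_dcache_range((uintptr_t)buf, len);
	dsbsy();
}

/* Before reading data written by a non-coherent agent: invalidate the
 * range so stale cached lines are not returned. */
static void accept_from_nonsecure(void *buf, size_t len)
{
	inv_dcache_range((uintptr_t)buf, len);
}
```

This is a firmware fragment that builds only inside a TF-A platform port; the point is that maintenance covers len bytes once, not the whole cache on every call.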
Third, the assumption that physical addressing itself is the problem should be reconsidered. Cache behavior at EL3 is governed by the translation-table attributes, not by whether the mapping is an identity map, and BL31 already runs with the MMU and translation tables enabled at EL3. The fix is therefore usually to correct the attributes of the existing flat mapping rather than to introduce a separate virtual addressing scheme; where finer-grained control is needed, additional regions can be mapped on demand with the appropriate attributes.
Finally, the memcpy() implementation itself should be examined. TF-A's minimal libc provides a simple byte-at-a-time memcpy() loop, whereas the Linux kernel's AArch64 memcpy() is hand-tuned assembly that moves 16 bytes per iteration with paired loads and stores (LDP/STP), handles misalignment, and benefits from hardware prefetching. Replacing the byte loop with a routine that copies in 8- or 16-byte units recovers a large fraction of the gap even before the mapping attributes are addressed.
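As an illustration, a copy routine that moves eight bytes per iteration once the destination is aligned already outperforms a byte-wise loop; production implementations go further with 16-byte LDP/STP pairs, misaligned-source handling, and prefetch. This portable sketch shows only the shape and is not the kernel's actual routine:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy 8 bytes per iteration once the destination is 8-byte aligned.
 * Note: a freestanding memcpy() conventionally accesses memory through
 * wider types like this; in hosted ISO C it skirts strict aliasing. */
void *memcpy_wide(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: byte-copy until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 7u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: if the source is now aligned too, copy 8 bytes per step.
     * (The misaligned-source case falls through to the byte tail;
     * real implementations handle it with unaligned loads or shifts.) */
    if (((uintptr_t)s & 7u) == 0) {
        while (n >= 8) {
            *(uint64_t *)(void *)d = *(const uint64_t *)(const void *)s;
            d += 8;
            s += 8;
            n -= 8;
        }
    }

    /* Tail: remaining bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Even this naive widening cuts the instruction count per byte by roughly 8x for aligned bulk copies, which is why the implementation difference alone is a plausible part of the measured gap.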
In conclusion, the high overhead of memcpy() in EL3 relative to NS.EL1 comes down to how memory is mapped and cached at EL3, how cache maintenance is scoped, and how the copy routine itself is written. Correcting the mapping attributes, restricting maintenance to the affected ranges, and replacing the byte-wise copy loop should close most of the gap. Each change should be tailored to the secure world's requirements and validated with the same PMU cycle measurements on the Juno R2 to confirm its effect.