Cortex-A35 DDR3 Read Access Latency and Bandwidth Discrepancies

The Cortex-A35, ARM's power-efficient ARMv8-A application core, still requires careful tuning to reach good memory performance. This analysis focuses on the gap between DDR3 read and write bandwidth observed on the i.MX 8X SoC with DDR3L-1866 memory. The measured bandwidths, 6079 MB/s for MEMSET against 2880 MB/s for MEMREAD, reveal a significant performance gap even though both operations use NEON instructions and prefetching. The discrepancy points to architectural and configuration limits that are worth examining in detail.

The Cortex-A35’s memory subsystem is influenced by several factors, including cache line size, memory access patterns, and the behavior of the L1 and L2 caches. The L1 data cache is 32 KB with a 64-byte cache line, and the L2 unified cache is 512 KB. On the i.MX 8X's 32-bit DDR interface, DDR3L-1866 theoretically delivers 1866 MT/s × 4 B = 7464 MB/s, but reaching that figure requires optimal use of the memory controller and cache hierarchy. MEMSET achieves about 81% of theoretical bandwidth (6079 / 7464), which indicates that write operations are reasonably well optimized, while MEMREAD reaches only about 39% (2880 / 7464), bottlenecked by read latency, cache fill behavior, and memory controller arbitration.

One critical aspect is the asymmetry in access latency between reads and writes. At DDR3-1866 timings, the CAS read latency is typically CL = 13 cycles versus a CAS write latency of CWL = 9 cycles; at the 933 MHz DDR clock that is roughly 13.9 ns against 9.6 ns. This difference alone does not explain the bandwidth gap, but it compounds with a more important asymmetry: a read stalls its consumer until data returns, whereas a write can be posted and retired from the core's point of view. In addition, the Cortex-A35’s load/store unit (LSU) and data cache unit (DCU) limit the number of outstanding memory requests, which further constrains read performance.

Cache Line Fill Behavior and Memory Controller Arbitration

The Cortex-A35’s cache line fill behavior plays a significant role in read performance. Cache fills occur in bursts of 64 bytes, which aligns with the 64-byte cache line size. However, the memory controller’s ability to handle these bursts efficiently depends on several factors, including the number of outstanding requests and the arbitration policies of the snoop control unit (SCU). The SCU is responsible for managing coherency and arbitration between multiple cores, but its configuration can impact single-core performance as well.

In the case of MEMREAD operations, the memory controller must fetch entire cache lines from DDR3 memory, even if only a portion of the data is immediately needed. This results in a higher effective latency for read operations compared to writes, where data can be buffered and written back to memory in larger chunks. The Cortex-A35’s L1 data cache supports eight outstanding linefill requests, but this may not be sufficient to fully utilize the DDR3 interface bandwidth, especially when dealing with sequential read patterns.
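A back-of-the-envelope check with Little's Law makes this bottleneck concrete: sustained read bandwidth is bounded by the bytes in flight divided by the effective fill latency. The sketch below takes the eight-linefill figure quoted above; the 178 ns latency is an inferred value chosen to match the measurement, not a datasheet number.

```c
#include <stdio.h>

/* Little's Law sketch: bandwidth = bytes in flight / effective latency.
 * The 8-linefill limit is the figure quoted above; the 178 ns latency is
 * inferred from the measured bandwidth, not measured directly. */
int main(void)
{
    const double linefills  = 8.0;    /* outstanding L1 linefill requests  */
    const double line_bytes = 64.0;   /* cache line size                   */
    const double latency_ns = 178.0;  /* assumed end-to-end fill latency   */

    double bw_mbs = (linefills * line_bytes) / (latency_ns * 1e-9) / 1e6;
    printf("bounded read bandwidth: %.0f MB/s\n", bw_mbs); /* ~2876 MB/s */
    return 0;
}
```

Read the other way around, this says that at roughly 178 ns per fill, no amount of instruction-level tuning can push past about 2.9 GB/s; only more requests in flight (prefetching) or lower effective latency will move the number.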

Another factor is the memory controller’s handling of read and write transactions. Write operations benefit from write merging, where multiple writes to the same cache line are combined into a single transaction. This reduces the effective number of memory accesses and improves write bandwidth. Read operations, on the other hand, cannot be merged in the same way, as each read request must fetch a unique cache line. This fundamental difference in transaction handling contributes to the observed performance gap.

Optimizing Cortex-A35 DDR3 Read Performance

To address the DDR3 read performance bottlenecks on the Cortex-A35, several optimization strategies can be employed. These include tuning cache and memory controller settings, leveraging NEON instructions more effectively, and utilizing the performance monitoring unit (PMU) to identify and address specific bottlenecks.

Cache and Memory Controller Configuration

  1. Cache Line Prefetching: Enabling hardware prefetching helps hide read latency by fetching data into the cache before it is needed. The Cortex-A35 exposes prefetch control through the CPUACTLR register; adjusting the L1PCTL field changes how many prefetch requests the L1 data prefetcher may issue, which benefits sequential access patterns. A register-access sketch follows this list.

  2. Memory Barriers and Synchronization: The Data Synchronization Barrier (DSB) and Data Memory Barrier (DMB) instructions enforce ordering of memory operations, but every barrier serializes accesses and stalls the pipeline. Use them only where ordering is actually required; removing superfluous barriers from hot loops directly improves memory throughput.

  3. SCU Configuration: While the SCU is primarily designed for multi-core systems, its arbitration policies can impact single-core performance. Configuring the SCU to prioritize read requests or increase the number of outstanding transactions can help improve read bandwidth. However, this requires access to SCU configuration registers, which may not be exposed to software in all implementations.
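A minimal sketch of inspecting the prefetch setting mentioned in item 1. The system-register encoding s3_1_c15_c2_0 and the L1PCTL position at bits [15:13] are taken from related Cortex-A5x documentation and are assumptions to verify against the Cortex-A35 TRM; the read must run at EL1 or higher with access granted by ACTLR_EL2/ACTLR_EL3, for example from a small kernel module.

```c
#include <stdint.h>
#include <stdio.h>

/* Read CPUACTLR_EL1 and extract the L1 data-prefetch control field.
 * Assumptions: encoding s3_1_c15_c2_0 and L1PCTL at bits [15:13], as on
 * related Cortex-A5x cores -- verify against the Cortex-A35 TRM.
 * Must execute at EL1+ with access enabled by ACTLR_EL2/ACTLR_EL3. */
static inline uint64_t read_cpuactlr_el1(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, s3_1_c15_c2_0" : "=r"(v));
    return v;
}

void report_l1pctl(void)
{
    uint64_t actlr  = read_cpuactlr_el1();
    unsigned l1pctl = (unsigned)((actlr >> 13) & 0x7); /* prefetch depth */
    printf("CPUACTLR_EL1 = 0x%016llx, L1PCTL = %u\n",
           (unsigned long long)actlr, l1pctl);
}
```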

NEON Instruction Optimization

  1. Vectorized Load/Store Operations: The Cortex-A35’s NEON unit supports 128-bit load and store instructions, which can be used to maximize memory throughput. Aligning accesses to 128-bit boundaries and using LD1 and ST1 with multiple registers reduces the number of instructions needed to move a given amount of data; see the combined sketch after this list.

  2. Software Prefetching: The PRFM (prefetch memory) instruction, an A64 base instruction that pairs naturally with NEON load loops, fetches data into the cache ahead of use. This is particularly effective for sequential read patterns, where the prefetch stream runs ahead of the loads and hides DRAM latency; the placement and distance of the prefetches determine how much latency is actually hidden.
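A minimal read-kernel sketch combining both points. It XORs the data into vector accumulators so the compiler cannot discard the loads, and __builtin_prefetch compiles to a PRFM on AArch64. The 256-byte prefetch distance (four cache lines ahead) is an assumed starting point to tune, not a recommended value.

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Sequential read kernel: 64 bytes (one cache line) per iteration using
 * four 128-bit NEON loads, with a software prefetch running ahead.
 * Two accumulators break the dependency chain between loads. */
uint8x16_t read_stream(const uint8_t *src, size_t len)
{
    uint8x16_t acc0 = vdupq_n_u8(0);
    uint8x16_t acc1 = vdupq_n_u8(0);

    for (size_t i = 0; i + 64 <= len; i += 64) {
        __builtin_prefetch(src + i + 256, 0, 0); /* PRFM: read, streaming */
        acc0 = veorq_u8(acc0, vld1q_u8(src + i));
        acc1 = veorq_u8(acc1, vld1q_u8(src + i + 16));
        acc0 = veorq_u8(acc0, vld1q_u8(src + i + 32));
        acc1 = veorq_u8(acc1, vld1q_u8(src + i + 48));
    }
    return veorq_u8(acc0, acc1); /* consuming the data keeps the loads alive */
}
```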

Performance Monitoring and Analysis

  1. PMU Utilization: The Cortex-A35’s Performance Monitoring Unit (PMU) provides detailed insight into the memory subsystem. Configuring the PMU to count events such as cache refills, memory reads, and writes pinpoints specific bottlenecks; for example, a high L1 data cache refill count relative to useful loads indicates inefficient prefetching or poor cache line utilization. A counting sketch follows this list.

  2. Bandwidth Measurement: Accurate bandwidth measurement is critical for evaluating the effectiveness of optimization efforts. Using the PMU to measure memory transactions and compare them against theoretical limits can help identify areas for improvement. Additionally, tools such as ARM DS-5 or Streamline can provide detailed performance analysis and visualization.
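A self-contained Linux sketch of both ideas, using perf_event_open to count the architectural ARMv8 PMU event 0x03 (L1D_CACHE_REFILL) around a workload; multiplying refills by the 64-byte line size estimates the read traffic behind the measured bandwidth. The run_benchmark() call is a hypothetical stand-in for the code under test.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

/* Open a raw ARMv8 PMU counter for the calling thread; 0x03 is the
 * architectural event L1D_CACHE_REFILL (one 64-byte linefill each). */
static int pmu_open(uint64_t event)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type           = PERF_TYPE_RAW;
    attr.size           = sizeof(attr);
    attr.config         = event;   /* raw ARMv8 PMU event number */
    attr.disabled       = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

extern void run_benchmark(void);   /* hypothetical workload under test */

int main(void)
{
    int fd = pmu_open(0x03);       /* L1D_CACHE_REFILL */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    run_benchmark();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t refills = 0;
    read(fd, &refills, sizeof(refills));
    printf("L1D refills: %llu (~%llu bytes read from L2/DRAM)\n",
           (unsigned long long)refills,
           (unsigned long long)(refills * 64)); /* 64 B per linefill */
    return 0;
}
```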

Advanced Techniques

  1. DMA for Read Operations: Offloading bulk reads to a DMA engine frees the CPU for other work while keeping the memory interface busy. The Cortex-A35 itself has no DMA engine, but the surrounding SoC typically provides one; on the i.MX 8X, for example, the eDMA controllers can be programmed to handle large transfers, reducing CPU load and improving overall system throughput.

  2. Memory Attribute Configuration: Adjusting memory attributes in the MMU translation tables changes memory access behavior. For example, marking a region as Normal non-cacheable bypasses the cache hierarchy entirely, which can reduce contention and improve bandwidth for streaming workloads. Use this sparingly, since uncached accesses raise latency for everything else that touches the region; a MAIR encoding sketch follows.
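For bare-metal or bootloader code, the relevant knobs are MAIR_EL1 and the AttrIndx field of each page-table descriptor. The attribute bytes below are architectural ARMv8-A encodings; the index assignment is an arbitrary example, and under Linux one would instead rely on existing mechanisms (such as write-combining mappings) rather than editing translation tables directly.

```c
#include <stdint.h>

/* ARMv8-A MAIR_EL1 attribute bytes (architectural encodings).
 * Each stage-1 descriptor selects one of the eight bytes via AttrIndx. */
#define ATTR_DEVICE_nGnRnE  0x00u  /* device, strongly ordered           */
#define ATTR_NORMAL_NC      0x44u  /* normal, inner/outer non-cacheable  */
#define ATTR_NORMAL_WB      0xFFu  /* normal, write-back, RW-allocate    */

/* Example layout (the index choice is arbitrary):
 *   AttrIndx 0 -> device, 1 -> cached normal, 2 -> non-cacheable normal */
static const uint64_t mair_el1_value =
    ((uint64_t)ATTR_DEVICE_nGnRnE << 0)  |
    ((uint64_t)ATTR_NORMAL_WB     << 8)  |
    ((uint64_t)ATTR_NORMAL_NC     << 16);

/* AttrIndx occupies bits [4:2] of a stage-1 block/page descriptor. */
static inline uint64_t desc_with_attr(uint64_t desc, unsigned idx)
{
    return (desc & ~(7ull << 2)) | ((uint64_t)(idx & 7) << 2);
}
```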

Summary of Optimization Steps

Optimization Area                 | Specific Actions                                            | Expected Impact
Cache Configuration               | Enable prefetching, adjust L1PCTL, use memory barriers      | Reduced read latency, improved bandwidth
NEON Instructions                 | Vectorized load/store, software prefetching with PRFM       | Increased memory throughput
SCU and Memory Controller Tuning  | Prioritize read requests, increase outstanding transactions | Improved arbitration and bandwidth
Performance Monitoring            | Use PMU to count cache misses, measure bandwidth            | Identify and address bottlenecks
DMA and Memory Attributes         | Offload reads to DMA, configure non-cacheable regions       | Reduced CPU load, improved bandwidth

By systematically addressing these areas, the Cortex-A35’s DDR3 read performance can be significantly improved, bringing it closer to the theoretical bandwidth limits. Each optimization step should be carefully evaluated and tested to ensure compatibility with the specific application and workload requirements.
