ARM Cortex-A72 64-bit MADD Throughput Limitations

The ARM Cortex-A72, a high-performance core implementing the ARMv8-A architecture, shows a marked throughput gap when executing 64-bit integer multiply-accumulate (MADD) instructions compared with 32-bit integer, single-precision (float), and double-precision (double) operations: 64-bit MADD throughput is roughly one third that of the other types. This behavior is rooted in the architectural design of the Cortex-A72, particularly its execution pipeline and multiplier unit configuration.

The Cortex-A72 features a single integer multiplier unit that handles both 32-bit and 64-bit operations. For 64-bit MADD, however, latency is notably higher and throughput notably lower, owing to the added complexity of 64-bit arithmetic. According to the Cortex-A72 Software Optimization Guide, the MADD instruction on 64-bit registers (X-form) has a latency of 5 cycles and a throughput of 1/3 instruction per cycle. In contrast, the 32-bit MADD (W-form) has a latency of 3 cycles and a throughput of 1 instruction per cycle. The gap arises because a 64-bit MADD stalls the multiplier pipeline for two additional cycles, so a new multiply can issue at most once every three cycles.
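To make the numbers concrete, the difference between a latency-bound and a throughput-bound 64-bit multiply-accumulate loop can be sketched in C (a hypothetical microbenchmark; the function names are illustrative, and on hardware other than the A72 the ratios will differ):

```c
#include <stdint.h>

/* Hypothetical microbenchmark kernels. On the A72, a compiler
 * typically lowers x * m + a on 64-bit operands to one X-form MADD. */

/* Dependent chain: each MADD consumes the previous result, so the
 * loop is bound by the 5-cycle latency per iteration. */
uint64_t madd64_chain(uint64_t x, uint64_t m, uint64_t a, int n) {
    for (int i = 0; i < n; i++)
        x = x * m + a;
    return x;
}

/* Independent streams: latency is hidden, but the single multiplier
 * still accepts an X-form MADD only once every 3 cycles (1/3
 * throughput), so adding streams beyond three buys nothing. */
uint64_t madd64_streams(uint64_t m, uint64_t a, int n) {
    uint64_t s0 = 1, s1 = 2, s2 = 3;
    for (int i = 0; i < n; i++) {
        s0 = s0 * m + a;
        s1 = s1 * m + a;
        s2 = s2 * m + a;
    }
    return s0 ^ s1 ^ s2;
}
```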

Furthermore, the Cortex-A72’s floating-point unit (FPU) and Advanced SIMD units are heavily optimized for floating-point work. The core can execute floating-point multiplies (FMUL) at a throughput of 2 instructions per cycle even for double-precision operations, significantly higher than the 64-bit integer MADD throughput. This is counterintuitive, since integer arithmetic is usually expected to outpace floating point. The Cortex-A72’s design, however, prioritizes floating-point and SIMD performance, likely because the target workloads for this core often include multimedia and scientific computing tasks.

Pipeline Stalls and Multiplier Unit Constraints

The primary cause of the reduced throughput for 64-bit MADD instructions is the pipeline stall mechanism within the Cortex-A72’s multiplier unit. When a 64-bit MADD instruction is executed, the multiplier pipeline is stalled for two extra cycles to accommodate the increased data width and complexity of the operation. This stall directly impacts the throughput, as subsequent MADD instructions cannot be issued until the pipeline is cleared.

The following assembly code snippet illustrates the issue:

404538:    9b027e73    madd  x19, x19, x2, xzr
40453c:    f94037e2    ldr   x2, [sp, #104]
404540:    9b017c00    madd  x0, x0, x1, xzr
404544:    f94037e1    ldr   x1, [sp, #104]
404548:    9b037f5a    madd  x26, x26, x3, xzr

In this example, the 64-bit MADD instructions are interleaved with load (LDR) instructions. The loads do not themselves reduce throughput; the limiter is the multiplier pipeline, which cannot accept a new X-form MADD until the two-cycle stall from the previous one clears. Because these MADDs are independent of one another, the 5-cycle latency is not the bottleneck — the issue restriction alone caps the rate at one 64-bit MADD every three cycles, i.e. a throughput of 1/3 instruction per cycle.

In contrast, the 32-bit MADD and floating-point FMUL instructions do not suffer from the same pipeline stalls. The 32-bit MADD has a shorter latency (3 cycles) and does not stall the pipeline, allowing for a throughput of 1 instruction per cycle. Similarly, the FPU’s FMUL instructions can achieve a throughput of 2 instructions per cycle due to the presence of two FPU/SIMD units.
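The floating-point side of that comparison can be sketched in C as well (a hypothetical kernel; it assumes the compiler keeps the four streams in registers). With two FP/SIMD pipelines each able to accept an FMUL per cycle, independent double-precision streams can sustain 2 multiplies per cycle despite the 64-bit data width:

```c
/* Hypothetical FP contrast: four independent double-precision streams
 * can keep both of the A72's FP/ASIMD pipelines busy, sustaining
 * 2 FMULs/cycle — versus 1/3 MADD/cycle for 64-bit integers. */
double fmul_streams(double m, int n) {
    double s0 = 1.0, s1 = 2.0, s2 = 3.0, s3 = 4.0;
    for (int i = 0; i < n; i++) {
        s0 *= m;
        s1 *= m;
        s2 *= m;
        s3 *= m;
    }
    return s0 + s1 + s2 + s3;   /* fold to one value for checking */
}
```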

Optimizing 64-bit MADD Performance

To mitigate the performance impact of 64-bit MADD instructions on the Cortex-A72, several optimization strategies can be employed:

1. Instruction Scheduling and Loop Unrolling

  • Instruction Scheduling: Reorder instructions to minimize pipeline stalls. For example, interleave independent instructions between MADD operations to keep the pipeline busy.
  • Loop Unrolling: Unroll loops to reduce the overhead of branch instructions and increase the number of independent instructions available for scheduling.

Example of optimized assembly code:

.L3:
    subs  w0, w0, #1
    mul   x15, x15, x1
    mul   x14, x14, x1
    mul   x13, x13, x1
    mul   x12, x12, x1
    mul   x11, x11, x1
    mul   x10, x10, x1
    mul   x9, x9, x1
    mul   x8, x8, x1
    bne   .L3

This code keeps all operands in registers and feeds the multiplier a stream of independent MUL instructions (MUL is simply MADD with XZR as the addend), so the loop is limited only by the multiplier’s issue rate rather than by memory accesses or result dependencies.
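A C source shape that a compiler targeting the A72 can lower to an unrolled, register-only loop of this kind might look like the following (a sketch; the function name and initial values are illustrative):

```c
#include <stdint.h>

/* Eight independent 64-bit products per iteration, no memory traffic
 * in the hot loop. Each acc[j] *= k lowers to MUL Xd, Xn, Xm, which
 * is MADD with XZR as the addend. */
uint64_t scale_eight(uint64_t k, uint32_t iters) {
    uint64_t acc[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    for (uint32_t i = 0; i < iters; i++)
        for (int j = 0; j < 8; j++)
            acc[j] *= k;
    uint64_t r = 0;
    for (int j = 0; j < 8; j++)
        r ^= acc[j];            /* fold to one value for checking */
    return r;
}
```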

2. Compiler Optimizations

  • Compiler Flags: Use compiler flags to enable aggressive optimizations for the Cortex-A72. For example, -O3 and -mcpu=cortex-a72 can help the compiler generate more efficient code.
  • Intrinsics: Use ARM-specific intrinsics to manually control instruction scheduling and pipeline utilization.
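For instance, a plain 64-bit dot product compiled with `gcc -O3 -mcpu=cortex-a72 -funroll-loops` lets the compiler unroll and schedule around the multiplier’s issue restriction on its own (a sketch; the kernel shown is illustrative, not prescriptive):

```c
#include <stddef.h>
#include <stdint.h>

/* Compile with e.g.:
 *   gcc -O3 -mcpu=cortex-a72 -funroll-loops kernel.c
 * so the compiler uses the A72 pipeline model when scheduling.
 * restrict tells it the arrays do not alias, enabling reordering. */
uint64_t dot_madd(const uint64_t *restrict a,
                  const uint64_t *restrict b, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];     /* one X-form MADD per element */
    return acc;
}
```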

3. Algorithmic Adjustments

  • Data Type Selection: Where possible, use 32-bit integers or floating-point types instead of 64-bit integers to avoid the performance penalty of 64-bit MADD instructions.
  • SIMD Utilization: Leverage SIMD instructions for parallelizable workloads. While the Cortex-A72 does not support 64-bit integer vector multiplication, SIMD can still be beneficial for other operations.
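As a sketch of the data-type-selection point: when intermediate values are known to fit in 32 bits, keeping the accumulator at 32 bits lets the compiler emit W-form MADDs at full throughput, widening only at the end (the function name and constant are illustrative):

```c
#include <stdint.h>

/* Hypothetical rolling checksum: the 32-bit accumulator lowers to
 * W-form MADD (throughput 1/cycle on the A72) instead of X-form
 * (throughput 1/3). The result is widened to 64 bits only once. */
uint64_t checksum32(const uint32_t *v, int n) {
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = acc * 31u + v[i];
    return (uint64_t)acc;
}
```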

4. Hardware-Specific Considerations

  • Cache Optimization: Ensure that data is cache-aligned and prefetched to minimize memory access latency. This is particularly important for workloads that involve large datasets.
  • Power and Thermal Management: Monitor the CPU’s power and thermal state, as high utilization of the multiplier unit can lead to increased power consumption and thermal throttling.
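A minimal sketch of the cache-related points, assuming C11 `aligned_alloc` and the GCC/Clang `__builtin_prefetch` extension (which typically lowers to a PRFM instruction on AArch64):

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate a 64-bit working buffer aligned to the A72's 64-byte cache
 * line. C11 aligned_alloc requires the size to be a multiple of the
 * alignment, so round the byte count up. */
uint64_t *alloc_lines(size_t n_elems) {
    size_t bytes = n_elems * sizeof(uint64_t);
    size_t rounded = (bytes + 63) & ~(size_t)63;
    return aligned_alloc(64, rounded);
}

/* Stream over the buffer, prefetching one cache line (8 x 8 B) ahead
 * to hide memory latency behind the arithmetic. */
uint64_t sum_prefetched(const uint64_t *p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&p[i + 8]);
        s += p[i];
    }
    return s;
}
```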

Performance Comparison Table

Operation   Data Type   Latency (cycles)   Throughput (instr/cycle)   Pipeline Stalls
MADD        32-bit      3                  1                          None
MADD        64-bit      5                  1/3                        2-cycle stall
FMUL        Double      4                  2                          None
FMUL        Single      3                  2                          None

Conclusion

The reduced throughput of 64-bit MADD instructions on the ARM Cortex-A72 is a direct result of the core’s architectural design, specifically the pipeline stalls in the multiplier unit. While this limitation can impact performance in workloads heavily reliant on 64-bit integer arithmetic, it can be mitigated through careful instruction scheduling, compiler optimizations, and algorithmic adjustments. Understanding the underlying hardware constraints is crucial for maximizing the performance of the Cortex-A72 in embedded and high-performance computing applications.
