NEON Intrinsics vs. Plain C XOR Performance on Cortex-A72

The performance discrepancy between NEON intrinsics and plain C code for XOR operations on the ARM Cortex-A72 processor is a nuanced issue that requires a deep dive into the architecture, instruction latency, and throughput characteristics. The Cortex-A72, found in the Raspberry Pi 4, is a high-performance out-of-order execution core that supports Advanced SIMD (NEON) instructions. While NEON is designed to accelerate data-parallel workloads, its performance benefits are not always straightforward, especially for operations like XOR, which are relatively simple and have low latency in scalar execution units.

The primary observation is that the NEON implementation of XOR operations is slightly slower than the plain C implementation, despite the NEON code having fewer instructions. This counterintuitive result can be attributed to several factors, including instruction latency, throughput, and the out-of-order execution capabilities of the Cortex-A72. Understanding these factors requires a detailed analysis of the Cortex-A72’s microarchitecture, the specific instructions used, and the data flow in the code.
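The original benchmark code is not reproduced here, but a representative pair of implementations might look like the following sketch (the names xor_scalar and xor_neon are hypothetical; the NEON path assumes n_words is even so each iteration consumes a full 128-bit vector):

```c
#include <stdint.h>
#include <stddef.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* Scalar version: XOR two buffers 64 bits at a time.
   Each iteration compiles to a 1-cycle scalar EOR on the Cortex-A72. */
void xor_scalar(uint64_t *dst, const uint64_t *a,
                const uint64_t *b, size_t n_words)
{
    for (size_t i = 0; i < n_words; i++)
        dst[i] = a[i] ^ b[i];
}

#ifdef __ARM_NEON
/* NEON version: XOR two buffers 128 bits at a time.
   veorq_u64 maps to the vector EOR, which has a 3-cycle latency
   on the Cortex-A72.  Assumes n_words is even. */
void xor_neon(uint64_t *dst, const uint64_t *a,
              const uint64_t *b, size_t n_words)
{
    for (size_t i = 0; i < n_words; i += 2) {
        uint64x2_t va = vld1q_u64(a + i);
        uint64x2_t vb = vld1q_u64(b + i);
        vst1q_u64(dst + i, veorq_u64(va, vb));
    }
}
#endif
```

Even though the NEON loop retires half as many iterations, each result takes three cycles to become available instead of one, which is the crux of the discrepancy discussed below.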

Latency and Throughput of NEON EOR Instructions

The Cortex-A72’s Advanced SIMD (NEON) execution units have different latency and throughput characteristics compared to the scalar execution units. According to the Cortex-A72 Software Optimization Guide, the latency of the NEON EOR (Exclusive OR) instruction is 3 cycles, whereas the scalar EOR instruction has a latency of 1 cycle. This difference in latency is a critical factor in the performance discrepancy.

The NEON EOR instruction operates on 128-bit vectors (V registers), allowing it to process multiple data elements in parallel. However, the increased latency means that the results of the NEON EOR instruction take longer to become available for subsequent instructions. In contrast, the scalar EOR instruction, which operates on 64-bit registers (X registers), has a lower latency, allowing the results to be available sooner for dependent instructions.

The throughput of the NEON EOR instruction is also a factor. The Cortex-A72 can issue up to two NEON EOR instructions per cycle, but this theoretical throughput is often not achievable in practice due to dependencies between instructions and other bottlenecks in the pipeline. The scalar EOR instruction, on the other hand, has a throughput of up to two instructions per cycle, and its lower latency makes it easier to achieve this throughput in practice.

The following table summarizes the latency and throughput characteristics of the EOR instructions on the Cortex-A72:

Instruction Type    Latency (cycles)    Throughput (instructions/cycle)
Scalar EOR          1                   2
NEON EOR            3                   2

The higher latency of the NEON EOR instruction means that the results of the NEON XOR operations take longer to become available, which can lead to stalls in the pipeline if subsequent instructions depend on these results. This is particularly problematic in tight loops or when there are many dependent instructions, as is the case in the provided code.

Out-of-Order Execution and Instruction Scheduling

The Cortex-A72 is an out-of-order execution core, which means it can reorder instructions to maximize throughput and hide latency. However, the effectiveness of out-of-order execution depends on the availability of independent instructions that can be executed while waiting for the results of high-latency instructions.

In the case of the NEON XOR implementation, the higher latency of the NEON EOR instruction means that the processor must wait longer for the results to become available before it can proceed with dependent instructions. While the out-of-order execution engine can attempt to hide this latency by executing other independent instructions, the effectiveness of this approach is limited by the number of independent instructions available and the overall instruction mix.

The scalar XOR implementation, with its lower latency, allows the processor to proceed with dependent instructions sooner, reducing the likelihood of stalls in the pipeline. Additionally, the scalar implementation may have more opportunities for out-of-order execution, as the lower latency of the scalar EOR instruction allows the processor to complete more instructions in a given time frame.

The following table compares the potential for out-of-order execution in the scalar and NEON implementations:

Implementation    Latency of EOR    Potential for Out-of-Order Execution
Scalar            1 cycle           High
NEON              3 cycles          Moderate

The scalar implementation’s lower latency allows for more effective out-of-order execution, as the processor can complete more instructions in parallel and hide the latency of other operations. The NEON implementation, with its higher latency, is more likely to experience stalls in the pipeline, reducing the overall throughput.

Data Alignment and Memory Access Patterns

Another factor that can impact the performance of the NEON implementation is data alignment and memory access patterns. The NEON code loads and stores data using ldp (load pair) and stp (store pair) instructions on the 128-bit Q registers, so each instruction moves a pair of 128-bit vectors. AArch64 permits unaligned accesses, but these wide transfers perform best when the data is aligned to 16-byte boundaries; an access that straddles an alignment or cache-line boundary may be split into additional memory operations, increasing the overall latency of the memory accesses.

In contrast, the scalar implementation uses ldp and stp instructions on 64-bit X registers, which perform best with only 8-byte alignment. This looser alignment preference reduces the likelihood that an access is split across a boundary, so the scalar memory operations are more often completed in a single transaction.

The following table compares the alignment requirements and potential performance impact of the memory access patterns in the scalar and NEON implementations:

Implementation    Alignment Requirement    Potential Performance Impact
Scalar            8-byte                   Low
NEON              16-byte                  Moderate

If the data is not properly aligned for the NEON implementation, the processor may need to perform additional memory accesses or handle misaligned data, which can increase the overall latency of the memory operations. This can further exacerbate the performance discrepancy between the scalar and NEON implementations.
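Alignment can be guaranteed at allocation time rather than hoped for. A minimal sketch using C11's aligned_alloc (the helper name alloc_vec_buffer is hypothetical; aligned_alloc requires the size to be a multiple of the alignment):

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate a buffer aligned to a 16-byte boundary so that 128-bit
   NEON loads and stores never straddle an alignment boundary. */
uint64_t *alloc_vec_buffer(size_t n_words)
{
    size_t bytes = n_words * sizeof(uint64_t);
    /* Round the size up to a multiple of 16, as aligned_alloc requires. */
    bytes = (bytes + 15) & ~(size_t)15;
    return aligned_alloc(16, bytes);
}
```

Buffers returned by this helper can be freed with the ordinary free(). On platforms without C11's aligned_alloc, posix_memalign is the usual alternative.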

Compiler Optimization and Instruction Scheduling

The performance of the NEON implementation can also be influenced by the compiler’s ability to optimize the code and schedule instructions effectively. Modern compilers like GCC and Clang have sophisticated optimization passes that can reorder instructions, unroll loops, and apply other transformations to improve performance. However, the effectiveness of these optimizations depends on the specific code and the target architecture.

In the case of the NEON implementation, the compiler may not be able to fully hide the latency of the NEON EOR instructions, especially if there are many dependent instructions or if the code is not structured in a way that allows for effective out-of-order execution. The scalar implementation, with its lower latency, may be easier for the compiler to optimize, as there are fewer opportunities for stalls in the pipeline.

The following table compares the potential for compiler optimization in the scalar and NEON implementations:

Implementation    Latency of EOR    Potential for Compiler Optimization
Scalar            1 cycle           High
NEON              3 cycles          Moderate

The scalar implementation’s lower latency allows the compiler to more effectively schedule instructions and hide latency, leading to better overall performance. The NEON implementation, with its higher latency, may require more manual optimization to achieve similar performance.
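One common lever for letting the compiler do the scheduling is to avoid intrinsics entirely and give the optimizer the aliasing guarantees it needs. In the hypothetical sketch below, the restrict qualifiers tell the compiler that the three pointers do not overlap, and at -O3 GCC and Clang will typically auto-vectorize this loop into NEON EOR instructions on AArch64, with the compiler choosing the unroll factor and schedule:

```c
#include <stdint.h>
#include <stddef.h>

/* With restrict, the compiler knows dst, a, and b never alias,
   which is a precondition for safe auto-vectorization.  The loop
   body is then free to be unrolled and vectorized by the compiler. */
void xor_restrict(uint64_t *restrict dst,
                  const uint64_t *restrict a,
                  const uint64_t *restrict b, size_t n_words)
{
    for (size_t i = 0; i < n_words; i++)
        dst[i] = a[i] ^ b[i];
}
```

Inspecting the generated assembly (for example with objdump or the compiler's -S flag) is the only way to confirm what the compiler actually emitted for a given version and flag set.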

Recommendations for Optimizing NEON XOR Performance

To address the performance discrepancy between the NEON and scalar implementations, several optimizations can be considered:

  1. Increase Instruction-Level Parallelism: To hide the latency of the NEON EOR instructions, the code should be structured to maximize instruction-level parallelism. This can be achieved by unrolling loops, increasing the number of independent instructions, and reducing dependencies between instructions.

  2. Ensure Proper Data Alignment: The data accessed by the NEON instructions should be aligned to 16-byte boundaries to avoid misaligned memory accesses and reduce latency. This can be achieved by using aligned memory allocation functions or by manually aligning the data.

  3. Use Compiler Intrinsics and Manual Optimization: While modern compilers are capable of optimizing code, manual optimization using compiler intrinsics can sometimes yield better results. By manually scheduling instructions and using intrinsics, the programmer can ensure that the NEON instructions are used effectively and that the latency is hidden as much as possible.

  4. Profile and Analyze Performance: Profiling the code using performance counters can help identify bottlenecks and areas for improvement. The Cortex-A72 provides a rich set of performance counters that can be used to measure instruction latency, throughput, and other metrics. By analyzing these metrics, the programmer can identify specific areas where the NEON implementation can be optimized.

  5. Consider Mixed Scalar and NEON Approaches: In some cases, a mixed approach that combines scalar and NEON instructions can yield better performance. By using scalar instructions for low-latency operations and NEON instructions for data-parallel operations, the programmer can achieve a balance between latency and throughput.
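Recommendation 1 can be sketched concretely. In the hypothetical xor_unrolled below, the loop processes four 128-bit vectors per iteration; because the four EORs are independent of each other, the out-of-order core can overlap their 3-cycle latencies instead of serializing on a single result. The sketch assumes n_words is a multiple of 8 and includes a portable fallback for non-NEON builds:

```c
#include <stdint.h>
#include <stddef.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* XOR two buffers, four 128-bit vectors (eight 64-bit words) per
   iteration.  The four veorq_u64 results are independent, so their
   latencies can overlap in flight.  Assumes n_words % 8 == 0. */
void xor_unrolled(uint64_t *dst, const uint64_t *a,
                  const uint64_t *b, size_t n_words)
{
#ifdef __ARM_NEON
    for (size_t i = 0; i < n_words; i += 8) {
        uint64x2_t r0 = veorq_u64(vld1q_u64(a + i),     vld1q_u64(b + i));
        uint64x2_t r1 = veorq_u64(vld1q_u64(a + i + 2), vld1q_u64(b + i + 2));
        uint64x2_t r2 = veorq_u64(vld1q_u64(a + i + 4), vld1q_u64(b + i + 4));
        uint64x2_t r3 = veorq_u64(vld1q_u64(a + i + 6), vld1q_u64(b + i + 6));
        vst1q_u64(dst + i,     r0);
        vst1q_u64(dst + i + 2, r1);
        vst1q_u64(dst + i + 4, r2);
        vst1q_u64(dst + i + 6, r3);
    }
#else
    /* Portable scalar fallback for non-NEON builds. */
    for (size_t i = 0; i < n_words; i++)
        dst[i] = a[i] ^ b[i];
#endif
}
```

Whether this actually closes the gap on a given Cortex-A72 workload should be confirmed with profiling, as recommendation 4 suggests.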

Conclusion

The performance discrepancy between NEON intrinsics and plain C code for XOR operations on the ARM Cortex-A72 is primarily due to the higher latency of the NEON EOR instruction and the challenges of hiding this latency in an out-of-order execution core. While NEON instructions can provide significant performance benefits for data-parallel workloads, their effectiveness depends on the specific characteristics of the workload and the target architecture.

By understanding the latency and throughput characteristics of the NEON instructions, ensuring proper data alignment, and optimizing the code for instruction-level parallelism, it is possible to improve the performance of the NEON implementation. However, in some cases, a scalar implementation may still be more efficient, especially for simple operations like XOR.

Ultimately, the choice between NEON and scalar implementations should be guided by careful profiling and analysis, taking into account the specific requirements of the application and the characteristics of the target architecture.
