ARM NEON and SVE High-Bit Packing Challenges in 64-Byte Vectors
The task of packing the high bit of every byte in a 64-byte vector into a compact integer mask is a common operation in high-performance computing, particularly in image processing, compression, and machine learning workloads. On Intel architectures, this operation is handled efficiently by the AVX-512 vpmovb2m instruction, exposed through the _mm512_movepi8_mask intrinsic, which extracts the most significant bit (MSB) from each byte and produces a mask. However, ARM architectures, particularly those utilizing NEON and SVE (Scalable Vector Extension), lack a direct equivalent to this instruction, necessitating custom implementations.
The challenge lies in achieving performance parity with AVX-512 while adhering to ARM’s SIMD paradigms. ARM NEON, while powerful, operates on 128-bit vectors, requiring multiple iterations to process 64 bytes. SVE, on the other hand, offers scalable vector lengths but introduces complexity in handling variable vector sizes. The goal is to implement a solution that avoids memory stores, operates efficiently on both NEON and SVE, and returns the mask as a function result, similar to the AVX-512 intrinsics.
Scalar Inefficiencies and Vectorization Opportunities
The scalar implementation provided in the discussion, while functionally correct, is highly inefficient for ARM architectures. The loop visits one byte at a time, shifting its MSB into place and OR-ing it into the mask, leaving ARM's SIMD capabilities entirely unused. Each iteration extends a serial dependency chain on the accumulating mask, so loop overhead dominates the useful work. The approach also scales poorly to larger datasets, making it unsuitable for performance-critical applications.
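The scalar code itself is not reproduced here; a minimal sketch of what such a baseline typically looks like (function name hypothetical) is:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar baseline: walk all 64 bytes, moving each byte's
 * MSB into the corresponding bit of the result mask. */
uint64_t scalar_cvtb2mask512(const uint8_t src[64]) {
    uint64_t mask = 0;
    for (size_t i = 0; i < 64; i++) {
        mask |= (uint64_t)(src[i] >> 7) << i;  /* MSB of byte i -> bit i */
    }
    return mask;
}
```

Beyond serving as a readability baseline, a routine like this is useful as a reference oracle when validating the vectorized versions.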
ARM NEON and SVE offer vectorized operations that can process multiple bytes in parallel, significantly reducing the instruction count and improving throughput. However, the absence of a direct equivalent to vpmovb2m necessitates creative use of existing intrinsics. For NEON, the vshrq_n_u8 intrinsic can shift the high bit of each byte to the low bit position, enabling subsequent bitwise operations to construct the mask. SVE, with its predicate registers and flexible vector lengths, provides additional opportunities for optimization but requires careful handling to ensure compatibility across different implementations.
Efficient NEON and SVE Implementations for High-Bit Packing
To achieve optimal performance on ARM architectures, the high-bit packing operation must be broken down into a series of vectorized steps. For NEON, the process involves shifting the high bits, creating a mask, and combining the results into a single integer. For SVE, the operation can leverage predicate registers and scalable vector lengths to handle 64-byte vectors efficiently.
NEON Implementation
The NEON implementation begins by loading the 64-byte vector into four 128-bit NEON registers, each processed independently. The vshrq_n_u8 intrinsic shifts the high bit of each byte to the low bit position, leaving every byte as 0x00 or 0x01. An accumulating shift ladder (vsraq_n_u16, vsraq_n_u32, vsraq_n_u64) then folds these bits pairwise within progressively wider elements until each 64-bit half of the register holds its eight mask bits in its lowest byte. Two lane extracts per register yield a 16-bit mask, and the four 16-bit masks are shifted and OR-ed into the final 64-bit result.
#include <arm_neon.h>

/* Collapse the MSBs of 16 bytes into a 16-bit mask (bit i = MSB of byte i). */
static inline uint16_t neon_movemask_u8(uint8x16_t v) {
    uint16x8_t bits16 = vreinterpretq_u16_u8(vshrq_n_u8(v, 7));        /* MSB -> bit 0 of each byte */
    uint32x4_t bits32 = vreinterpretq_u32_u16(vsraq_n_u16(bits16, bits16, 7));   /* fold byte pairs   */
    uint64x2_t bits64 = vreinterpretq_u64_u32(vsraq_n_u32(bits32, bits32, 14));  /* fold 16-bit pairs */
    uint8x16_t packed = vreinterpretq_u8_u64(vsraq_n_u64(bits64, bits64, 28));   /* fold 32-bit pairs */
    /* Each 64-bit half now holds its 8 mask bits in its lowest byte. */
    return (uint16_t)(vgetq_lane_u8(packed, 0) | (vgetq_lane_u8(packed, 8) << 8));
}

uint64_t neon_cvtb2mask512(uint8x16_t v0, uint8x16_t v1, uint8x16_t v2, uint8x16_t v3) {
    return (uint64_t)neon_movemask_u8(v0)
         | ((uint64_t)neon_movemask_u8(v1) << 16)
         | ((uint64_t)neon_movemask_u8(v2) << 32)
         | ((uint64_t)neon_movemask_u8(v3) << 48);
}
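To see why an accumulating shift ladder of this kind works, the same folding can be modelled on a plain uint64_t. This is a scalar illustration only, not part of the NEON code; the AND masks emulate the element boundaries that NEON's vsraq shifts enforce automatically:

```c
#include <stdint.h>

/* Scalar model of an accumulating shift ladder: fold the MSB of each
 * byte in a 64-bit lane down into the lane's low byte. The AND masks
 * stand in for the element boundaries of the vector element shifts. */
uint8_t fold_msb_lane(uint64_t lane) {
    uint64_t b = (lane >> 7) & 0x0101010101010101ULL; /* MSB of each byte -> bit 0 */
    b |= (b >> 7)  & 0x0002000200020002ULL;  /* pair bits within each 16-bit lane */
    b |= (b >> 14) & 0x0000000C0000000CULL;  /* pair again within each 32-bit lane */
    b |= (b >> 28) & 0x00000000000000F0ULL;  /* final fold within the 64-bit lane */
    return (uint8_t)b;  /* bit j = MSB of byte j */
}
```

Because the folded bits never overlap, OR and add are interchangeable here, which is why the vector version can use the add-accumulate vsraq form.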
SVE Implementation
The SVE implementation uses a governing predicate from svwhilelt_b8 to restrict the operation to the first 64 byte lanes, which assumes a runtime vector length of at least 512 bits (for example, the Fujitsu A64FX). svlsr shifts each byte's MSB down to bit 0; a 64-bit multiply then gathers the eight byte-bits of every 64-bit lane into that lane's top byte. After each lane's 8-bit partial mask is shifted into position with svlsl and svindex, svorv OR-reduces the lanes into a single 64-bit scalar. On narrower implementations, the 64 bytes would instead have to be processed in vector-length-sized chunks.
#include <arm_sve.h>

/* Assumes the runtime SVE vector length is at least 512 bits, so one
 * svuint8_t holds all 64 input bytes (smaller VLs need a chunked loop). */
uint64_t sve_cvtb2mask512(svuint8_t input) {
    svbool_t pg8  = svwhilelt_b8(0, 64);   /* first 64 byte lanes  */
    svbool_t pg64 = svwhilelt_b64(0, 8);   /* first 8 64-bit lanes */
    /* Move each byte's MSB down to bit 0: every byte becomes 0 or 1. */
    svuint64_t bits = svreinterpret_u64_u8(svlsr_n_u8_x(pg8, input, 7));
    /* Within each 64-bit lane, one multiply gathers the eight byte-bits
     * into the top byte; shift them down to bits 0..7. */
    svuint64_t gathered = svlsr_n_u64_x(pg64, svmul_n_u64_x(pg64, bits, 0x0102040810204080ULL), 56);
    /* Lane k covers input bytes 8k..8k+7, so weight it by 8k bit positions. */
    svuint64_t shifted = svlsl_u64_x(pg64, gathered, svindex_u64(0, 8));
    /* OR-reduce the weighted lanes into the final 64-bit mask. */
    return svorv_u64(pg64, shifted);
}
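Independent of the vector ISA, the core per-lane bit gather can be modelled and verified in portable C: after moving each MSB to bit 0, a single multiply by 0x0102040810204080 deposits a copy of each of the eight bits, contiguously and carry-free, into the top byte of the lane.

```c
#include <stdint.h>

/* Multiply-and-shift gather: bit j of the result is the MSB of byte j.
 * After the shift/AND, byte j holds its MSB at bit 8j; multiplying by
 * 0x0102040810204080 sums one copy of each bit into bits 56..63. */
uint8_t gather_msb8(uint64_t lane) {
    uint64_t bits = (lane >> 7) & 0x0101010101010101ULL;
    return (uint8_t)((bits * 0x0102040810204080ULL) >> 56);
}
```

The trick is carry-free because no two partial products land on the same bit below position 64, so the top byte receives exactly one contribution per input byte.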
Performance Considerations
Both implementations avoid memory stores and return the mask as a function result, meeting the original requirements. The NEON implementation is optimized for fixed 128-bit vectors, while the SVE implementation assumes a vector length of at least 512 bits; narrower SVE hardware requires a chunked loop over the 64 bytes. Performance testing on target hardware is recommended to validate the optimizations and identify potential bottlenecks.
Comparison with AVX-512
While the ARM implementations are efficient, they may not match the raw throughput of AVX-512 due to architectural differences. However, the use of NEON and SVE ensures compatibility with a wide range of ARM processors, making these solutions viable for cross-platform applications.
Conclusion
By leveraging ARM NEON and SVE intrinsics, it is possible to implement high-bit packing operations that approach the performance of AVX-512. The provided implementations demonstrate how to extract and combine high bits efficiently, avoiding scalar inefficiencies and memory stores. These solutions are suitable for performance-critical applications on ARM architectures, providing a foundation for further optimization and adaptation to specific use cases.