ARMv8 PCIe Peer-to-Peer DMA Performance Drop Due to IOMMU_MMIO Attribute
The core issue is a significant performance degradation in PCIe peer-to-peer transactions between two GPU cards on an ARMv8 server when the IOMMU is enabled: throughput drops from an expected 28 GB/s to roughly 4 GB/s. The degradation traces back to the dma_map_resource() API used by the GPU kernel-mode driver to map the peer device's MMIO (BAR) space. The ARM IOMMU path hardcodes the IOMMU_MMIO protection flag for such mappings, which in turn causes the ARM_LPAE_PTE_MEMATTR_DEV attribute to be set in the resulting Page Table Entry (PTE). That attribute is intended for device memory, but it introduces a severe bottleneck for PCIe peer-to-peer traffic.
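For context, the mapping in question typically looks like the sketch below: a hypothetical GPU driver takes the physical address of the peer GPU's BAR and hands it to dma_map_resource(). The helper and variable names (map_peer_bar(), peer_bar_phys) are illustrative, not taken from any real driver.

```c
#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Hypothetical helper: map the peer GPU's BAR for DMA by the local GPU.
 * With the IOMMU enabled, this is the call that ends up with IOMMU_MMIO
 * (and therefore a device-memory PTE) on ARM systems. */
static dma_addr_t map_peer_bar(struct device *dma_dev, struct pci_dev *peer,
			       int bar)
{
	phys_addr_t peer_bar_phys = pci_resource_start(peer, bar);
	size_t len = pci_resource_len(peer, bar);

	return dma_map_resource(dma_dev, peer_bar_phys, len,
				DMA_BIDIRECTIONAL, 0);
}
```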
The ARM_LPAE_PTE_MEMATTR_DEV attribute enforces strict ordering and non-cacheable behavior, which is appropriate for MMIO register regions but far from optimal for high-throughput peer-to-peer DMA transfers. The loss is especially pronounced on PCIe Gen4 x16 links, where the expected bandwidth is much higher than what a device-memory mapping sustains. Making matters worse, dma_map_resource() offers no straightforward way to override the IOMMU_MMIO attribute without modifying the Linux kernel source code.
IOMMU_MMIO Hardcoding and ARM_LPAE_PTE_MEMATTR_DEV Impact
The root cause of the degradation is that the IOMMU_MMIO attribute is hardcoded in the path behind the dma_map_resource() API. When the IOMMU is enabled, the ARM IOMMU page-table code translates that flag into the ARM_LPAE_PTE_MEMATTR_DEV attribute in the PTE covering the mapped MMIO space. The attribute enforces device-memory semantics, i.e. non-cacheable, strictly ordered accesses. Those semantics are appropriate for MMIO registers, but they are a poor fit for PCIe peer-to-peer DMA transfers, where high throughput matters and relaxed ordering is usually acceptable.
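The hardcoding is visible in the kernel's common DMA-IOMMU glue. The excerpt below is paraphrased from drivers/iommu/dma-iommu.c; exact signatures and helper names vary between kernel versions, so treat it as illustrative rather than a verbatim quote.

```c
/* Paraphrased from drivers/iommu/dma-iommu.c (details vary by kernel
 * version): dma_map_resource() on an IOMMU-backed device lands here, and
 * IOMMU_MMIO is ORed in unconditionally, with no way for the caller to
 * request IOMMU_CACHE instead. */
static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
		size_t size, enum dma_data_direction dir, unsigned long attrs)
{
	return __iommu_dma_map(dev, phys, size,
			dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
			dma_get_mask(dev));
}
```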
ARM_LPAE_PTE_MEMATTR_DEV is the Linux io-pgtable encoding of the ARMv8 Device memory type, one of the memory attributes the architecture defines for different kinds of regions. It exists to ensure device-register accesses are handled correctly, but here it turns every peer-to-peer transfer into a stream of strictly ordered, uncached transactions. The problem is compounded by the fact that dma_map_resource() provides no mechanism for the caller to request an alternative attribute such as IOMMU_CACHE, which would allow far more efficient memory access patterns.
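The attribute selection happens in the ARM LPAE page-table code. The simplified excerpt below is based on arm_lpae_prot_to_pte() in drivers/iommu/io-pgtable-arm.c (stage-2 style encoding shown; constants and structure differ across kernel versions, and the helper name here is mine): IOMMU_MMIO always wins, and only its absence lets IOMMU_CACHE select a cacheable attribute.

```c
/* Simplified from arm_lpae_prot_to_pte() in drivers/iommu/io-pgtable-arm.c
 * (stage-2 style encoding; details differ by kernel version). */
static arm_lpae_iopte prot_to_memattr(int prot)
{
	if (prot & IOMMU_MMIO)
		return ARM_LPAE_PTE_MEMATTR_DEV;   /* device, non-cacheable   */
	else if (prot & IOMMU_CACHE)
		return ARM_LPAE_PTE_MEMATTR_OIWB;  /* outer/inner write-back  */
	else
		return ARM_LPAE_PTE_MEMATTR_NC;    /* normal, non-cacheable   */
}
```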
The impact is most visible in PCIe Gen4 x16 configurations, where the expected throughput is on the order of 28 GB/s. With ARM_LPAE_PTE_MEMATTR_DEV in place, the memory subsystem treats the mapped region as device memory, so every peer-to-peer transfer is strictly ordered and uncached, and throughput collapses to a fraction of what the link can deliver.
Optimizing PCIe Peer-to-Peer DMA with Alternative Memory Attributes
To address the degradation without modifying the Linux kernel source code, the mapping needs to end up with a more appropriate attribute, such as IOMMU_CACHE, so that the IOMMU page tables permit efficient access patterns for PCIe peer-to-peer DMA. Several options are discussed below.
One possible approach is the dma_alloc_coherent() API, which allocates a buffer in system memory that is coherent between the CPU and the device; on cache-coherent platforms such a buffer is mapped through the IOMMU with normal, cacheable attributes rather than device-memory attributes, so DMA into it runs at full speed. The drawback is that it allocates new host memory instead of mapping the peer GPU's existing MMIO region, so data must be staged through system RAM rather than moving directly between the two cards, which may not suit every use case.
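A minimal sketch of that approach, assuming a hypothetical driver that stages peer-to-peer data through a coherent bounce buffer (names such as alloc_stage_buffer and STAGE_SZ are illustrative):

```c
#include <linux/dma-mapping.h>

#define STAGE_SZ	(4 * 1024 * 1024)	/* illustrative staging size */

/* Hypothetical staging setup: allocate a coherent system-RAM buffer that the
 * GPU can DMA to/from at full speed; data is then copied onward to the peer
 * rather than written directly into the peer's BAR. */
static int alloc_stage_buffer(struct device *dev, void **cpu_addr,
			      dma_addr_t *dma_addr)
{
	*cpu_addr = dma_alloc_coherent(dev, STAGE_SZ, dma_addr, GFP_KERNEL);
	return *cpu_addr ? 0 : -ENOMEM;
}
```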
Another candidate is the dma_map_single() API, which maps an existing, virtually contiguous kernel buffer in system RAM for streaming DMA; such mappings receive normal memory attributes and perform well. However, dma_map_single() operates on kernel virtual addresses of system memory and cannot map another device's BAR, so like the coherent-allocation approach it forces traffic to be staged through host memory instead of flowing directly between the GPUs.
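A sketch of that variant, again with hypothetical names, mapping a kernel staging buffer for streaming DMA:

```c
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Hypothetical streaming-DMA staging buffer: the buffer lives in system RAM
 * (e.g. from kmalloc), so the IOMMU maps it with normal memory attributes;
 * it cannot stand in for the peer GPU's BAR itself. */
static dma_addr_t map_stage_buffer(struct device *dev, void *buf, size_t len)
{
	dma_addr_t dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, dma))
		pr_warn("staging buffer mapping failed\n");
	return dma;
}
```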
A more elegant solution would be to extend the existing dma_map_resource() API, or introduce a companion API, so that callers can specify the memory attributes appropriate to their use case without carrying out-of-tree kernel patches. This approach would require proposing the change to the Linux kernel community and getting it implemented upstream.
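Since dma_map_resource() already takes an unsigned long attrs argument, one natural shape for such an extension would be a new DMA attribute flag that the IOMMU layer translates into IOMMU_CACHE (or simply the absence of IOMMU_MMIO) for peer-to-peer mappings. The flag below, DMA_ATTR_P2P_RELAXED, is purely hypothetical and does not exist in mainline Linux; it only illustrates what such an API could look like.

```c
#include <linux/dma-mapping.h>

/* Hypothetical attribute, NOT in mainline Linux: ask the DMA/IOMMU layer not
 * to force device-memory semantics on this resource mapping. */
#define DMA_ATTR_P2P_RELAXED	(1UL << 15)

static dma_addr_t map_peer_bar_relaxed(struct device *dev,
				       phys_addr_t bar_phys, size_t len)
{
	return dma_map_resource(dev, bar_phys, len, DMA_BIDIRECTIONAL,
				DMA_ATTR_P2P_RELAXED);
}
```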
In the meantime, the workaround is to modify the Linux kernel source so that the iommu_dma_map_resource() function no longer forces the IOMMU_MMIO attribute, or allows it to be overridden. This is not recommended for production environments, since it means maintaining a custom kernel build and may introduce compatibility issues with future kernel updates.
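A sketch of what such an out-of-tree change could look like, based loosely on the dma-iommu.c implementation (details vary by kernel version) and shown only to illustrate the idea, not as a submitted patch:

```c
/* Out-of-tree workaround sketch (based loosely on drivers/iommu/dma-iommu.c;
 * not a mainline patch): map the resource with a cacheable attribute instead
 * of unconditionally ORing in IOMMU_MMIO as the upstream code does. */
static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
		size_t size, enum dma_data_direction dir, unsigned long attrs)
{
	int prot = dma_info_to_prot(dir, false, attrs);

	return __iommu_dma_map(dev, phys, size, prot | IOMMU_CACHE,
			       dma_get_mask(dev));
}
```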
In conclusion, the performance degradation observed in PCIe peer-to-peer transactions with the IOMMU enabled stems from the IOMMU_MMIO attribute hardcoded behind the dma_map_resource() API, which puts ARM_LPAE_PTE_MEMATTR_DEV into the PTE. Addressing it properly means giving callers a way to request alternative memory attributes, such as IOMMU_CACHE, that suit high-throughput transfers. Modifying the Linux kernel source is a possible stopgap, but the more elegant solution is a new or extended dma_map_resource() API, so that developers can optimize PCIe peer-to-peer DMA transfers without carrying kernel modifications.