ARM Cortex-A53/A57 Clusters and DMA Coherency Challenges
In a system featuring ARM Cortex-A53 and Cortex-A57 clusters connected through a CCI-400 cache coherent interconnect, keeping the CPU clusters and DMA engines coherent can be a complex task. The two clusters are typically configured in the same inner shareable domain, so hardware keeps their caches coherent with each other. The coherency model becomes less obvious when a DMA engine operating in the outer shareable domain is added. The DMA engine is an ACE-Lite master: it has no cache and no snoop channels, so unlike the ACE masters in the CPU clusters it can never be snooped, although it can still issue transactions that cause the interconnect to snoop the CPU caches. This raises the question of how the CCI-400 keeps the inner shareable CPU clusters coherent with the outer shareable DMA engine, particularly for the Tx (transmit) and Rx (receive) buffers that both sides access.
The key question is how the CCI-400 decides whether to snoop the CPU caches for the addresses a DMA transaction touches. The Page Table Entries (PTEs) for the Tx and Rx buffers are marked as outer shareable, but the CCI-400 never reads page tables. The MMU that walks the PTEs sits inside the CPU cluster, and the attributes it produces, including shareability, only affect the transactions that cluster issues. The Snoop Control Unit (SCU) within each CPU cluster maintains coherency between the cores of that cluster, and its interaction with the CCI-400 when another master accesses a cached line needs careful examination.
AxDOMAIN Signaling and Snoop Transaction Coordination
The coherency mechanism in this system relies on the AxDOMAIN and AxSNOOP signals (ARDOMAIN/AWDOMAIN and ARSNOOP/AWSNOOP on the read and write address channels) defined by the ACE (AXI Coherency Extensions) protocol. When an ACE master such as a Cortex-A53 or Cortex-A57 cluster issues a transaction, AxDOMAIN indicates its shareability domain: Non-shareable (0b00), Inner Shareable (0b01), Outer Shareable (0b10), or System (0b11). For a CPU access, this value is derived from the memory attributes produced by the cluster's own MMU. AxSNOOP, interpreted together with AxDOMAIN, identifies the transaction type, for example ReadOnce, ReadUnique, WriteUnique, or a cache maintenance operation such as CleanShared or CleanInvalid. The CCI-400 uses these two fields to decide whether a snoop is needed and which kind.
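The dependence of AxSNOOP on AxDOMAIN can be made concrete with a small decoder. The following C sketch covers only a subset of the read-channel encodings from the ACE specification and does not check whether every domain/snoop combination is legal; the names `decode_read` and `ace_lite_may_issue` are illustrative and not part of any library. It also flags which of these read transactions an ACE-Lite master may issue, since ACE-Lite has no cache and no snoop channels.

```c
#include <assert.h>
#include <stdbool.h>

/* AxDOMAIN encodings from the ACE specification. */
enum ace_domain {
    DOMAIN_NON_SHAREABLE   = 0x0,
    DOMAIN_INNER_SHAREABLE = 0x1,
    DOMAIN_OUTER_SHAREABLE = 0x2,
    DOMAIN_SYSTEM          = 0x3,
};

/* Read transaction types recognised by this sketch (a subset of ACE). */
enum ace_read_txn {
    RD_READ_NO_SNOOP,
    RD_READ_ONCE,
    RD_READ_SHARED,
    RD_READ_UNIQUE,
    RD_CLEAN_SHARED,
    RD_CLEAN_INVALID,
    RD_MAKE_INVALID,
    RD_OTHER,
};

static bool is_shareable(unsigned domain)
{
    return domain == DOMAIN_INNER_SHAREABLE || domain == DOMAIN_OUTER_SHAREABLE;
}

/* ARSNOOP only has meaning together with ARDOMAIN: the value 0b0000 is
 * ReadNoSnoop in the Non-shareable/System domains and ReadOnce in the
 * shareable domains. Legality of each combination is not checked. */
enum ace_read_txn decode_read(unsigned ardomain, unsigned arsnoop)
{
    assert(ardomain <= 0x3 && arsnoop <= 0xF); /* 2-bit and 4-bit fields */
    bool sh = is_shareable(ardomain);
    switch (arsnoop) {
    case 0x0: return sh ? RD_READ_ONCE : RD_READ_NO_SNOOP;
    case 0x1: return sh ? RD_READ_SHARED : RD_OTHER;
    case 0x7: return sh ? RD_READ_UNIQUE : RD_OTHER;
    case 0x8: return RD_CLEAN_SHARED;
    case 0x9: return RD_CLEAN_INVALID;
    case 0xD: return RD_MAKE_INVALID;
    default:  return RD_OTHER;
    }
}

/* Read-channel transactions available to an ACE-Lite master: none of them
 * allocates a line in a cache of the issuing master. */
bool ace_lite_may_issue(enum ace_read_txn t)
{
    return t == RD_READ_NO_SNOOP || t == RD_READ_ONCE ||
           t == RD_CLEAN_SHARED || t == RD_CLEAN_INVALID ||
           t == RD_MAKE_INVALID;
}
```

For example, a DMA read with ARSNOOP = 0b0000 decodes as ReadOnce when ARDOMAIN is Outer Shareable, but as ReadNoSnoop when ARDOMAIN is Non-shareable.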
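In the case of DMA transactions, the DMA engine is an ACE-Lite master. ACE-Lite keeps the AxDOMAIN, AxSNOOP, and AxBAR signals but omits the snoop channels (AC, CR, and CD), so the engine can issue a restricted set of transactions (ReadNoSnoop, ReadOnce, WriteNoSnoop, WriteUnique, WriteLineUnique, and the CleanShared, CleanInvalid, and MakeInvalid maintenance operations) but can never be snooped itself. The CCI-400 does not infer anything from the PTEs of the Tx and Rx buffers; it acts only on the AxDOMAIN and AxSNOOP values the DMA engine drives on the bus. Those values come from the engine's own configuration (an attribute register, a hardware tie-off, or the attributes assigned by a System MMU in front of it), optionally modified by the per-interface shareable override in the CCI-400.
If the DMA engine issues its Tx reads as ReadOnce and its Rx writes as WriteUnique or WriteLineUnique with an Inner or Outer Shareable domain, the CCI-400 snoops the CPU clusters on every such transaction. A read obtains dirty data directly from a CPU cache if one holds the line, and a write removes any copies of the line from the CPU caches, first writing back dirty data when the write covers only part of the line. Coherency is therefore maintained line by line as the transfer proceeds, not by a single write-back before the transfer and a single invalidation after it. The CCI-400 treats Inner Shareable and Outer Shareable transactions the same way, so the inner/outer distinction between the CPU mappings and the DMA buffers does not by itself prevent snooping; what matters is that the DMA transaction is marked shareable and that snooping is enabled on the CCI-400 interfaces connected to the CPU clusters. If the engine instead issues ReadNoSnoop or WriteNoSnoop (Non-shareable or System domain), the CCI-400 sends the access straight to memory and software must perform cache maintenance.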
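The per-transaction behaviour can be illustrated with a toy model of a single cache line. This is a hypothetical behavioural sketch, not a description of how the CCI-400 is implemented: it tracks one line in one CPU cluster, uses a `shareable` flag to select ReadOnce/WriteLineUnique versus ReadNoSnoop/WriteNoSnoop, and assumes a write-back, write-allocate CPU cache. All type and function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified state of one cache line held by a CPU cluster. */
enum line_state { LINE_INVALID, LINE_CLEAN, LINE_DIRTY };

struct system_model {
    uint32_t memory;        /* value of the line in DRAM */
    enum line_state state;  /* state of the line in the CPU cluster */
    uint32_t cached;        /* value held in the CPU cache, if valid */
    bool snoop_enabled;     /* snoop requests enabled on the CPU port */
};

/* CPU store: write-back, write-allocate, so DRAM is not updated yet. */
void cpu_write(struct system_model *s, uint32_t value)
{
    s->cached = value;
    s->state = LINE_DIRTY;
}

/* CPU load: hits if the line is valid, otherwise refills from DRAM. */
uint32_t cpu_read(struct system_model *s)
{
    if (s->state == LINE_INVALID) {
        s->cached = s->memory;
        s->state = LINE_CLEAN;
    }
    return s->cached;
}

/* DMA read: shareable=true models ReadOnce, false models ReadNoSnoop. */
uint32_t dma_read(const struct system_model *s, bool shareable)
{
    if (shareable && s->snoop_enabled && s->state != LINE_INVALID)
        return s->cached;   /* snoop hit: dirty data must come from the cache */
    return s->memory;       /* no snoop, or snoop miss: data from DRAM */
}

/* Full-line DMA write: shareable=true models WriteLineUnique,
 * false models WriteNoSnoop. */
void dma_write(struct system_model *s, uint32_t value, bool shareable)
{
    if (shareable && s->snoop_enabled)
        s->state = LINE_INVALID;  /* snoop removes the CPU copy */
    s->memory = value;
}
```

With snooping enabled, a DMA read after a CPU write returns the CPU's dirty data, and a CPU read after a snooped DMA write sees the new value. With the non-snooping variants, or with snooping disabled on the CPU port, the same sequences observe stale data, which is the case that requires software cache maintenance.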
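The SCU within each CPU cluster plays a supporting role in this process. It keeps the L1 data caches of the cores in the cluster coherent with each other and with the cluster's L2 cache, and it services the snoop requests that the CCI-400 sends to the cluster over its ACE snoop channels. When a DMA transaction needs a snoop, the CCI-400 sends the request to each cluster whose slave interface has snooping enabled; the SCU looks the line up, supplies data or invalidates the line as the snoop type requires, and returns a snoop response. The cluster therefore handles coherency among its own cores, while the CCI-400 handles coherency between clusters and with I/O masters such as the DMA engine.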
Implementing Proper Shareability and Coherency Management
To ensure proper coherency in a system with multiple shareability domains, several steps must be taken. First, the CPU-side PTEs for memory regions involved in DMA transactions must carry shareability and cacheability attributes that match how the buffers are used. Marking the Tx and Rx buffers as outer shareable (or inner shareable, which Linux on arm64 uses for all Normal memory) ensures that CPU accesses to them are marked shareable and take part in hardware coherency. These attributes govern only the transactions the CPUs issue; they do not set the attributes of the DMA engine's own transactions.
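For reference, the SH field of an AArch64 stage 1 block or page descriptor occupies bits [9:8], where 0b10 means Outer Shareable and 0b11 means Inner Shareable. The C sketch below sets and reads that field. The `nominal_axdomain` helper is hypothetical: it maps SH to the nominal AxDOMAIN a CPU would emit for a Normal, cacheable access, which is the only form in which the PTE attribute ever reaches the CCI-400. Note that the two encodings differ (Inner Shareable is 0b11 in SH but 0b01 in AxDOMAIN), and the exact behaviour of a given core should be checked in its Technical Reference Manual.

```c
#include <assert.h>
#include <stdint.h>

/* SH[1:0] lives in bits [9:8] of a VMSAv8-64 block or page descriptor. */
#define DESC_SH_SHIFT 8
#define DESC_SH_MASK  (UINT64_C(0x3) << DESC_SH_SHIFT)

enum sh_attr {
    SH_NON_SHAREABLE   = 0x0,
    /* 0x1 is reserved */
    SH_OUTER_SHAREABLE = 0x2,
    SH_INNER_SHAREABLE = 0x3,
};

/* Replace the SH field of a descriptor, leaving all other bits unchanged. */
static inline uint64_t desc_set_sh(uint64_t desc, enum sh_attr sh)
{
    return (desc & ~DESC_SH_MASK) | ((uint64_t)sh << DESC_SH_SHIFT);
}

static inline enum sh_attr desc_get_sh(uint64_t desc)
{
    return (enum sh_attr)((desc & DESC_SH_MASK) >> DESC_SH_SHIFT);
}

/* Hypothetical helper: nominal AxDOMAIN for a Normal, cacheable CPU access.
 * Real cores may treat some attribute combinations differently. */
unsigned nominal_axdomain(enum sh_attr sh)
{
    switch (sh) {
    case SH_INNER_SHAREABLE: return 0x1; /* AxDOMAIN Inner Shareable */
    case SH_OUTER_SHAREABLE: return 0x2; /* AxDOMAIN Outer Shareable */
    default:                 return 0x0; /* AxDOMAIN Non-shareable */
    }
}
```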
Second, every master must emit the correct AxDOMAIN values. For the CPU clusters this follows from the MMU configuration and memory attributes described above. The DMA engine, as an ACE-Lite master, does drive AxDOMAIN and AxSNOOP, but their values come from its own configuration rather than from the CPU page tables, so system software (or the SoC integration) must make sure it requests shareable transactions. Depending on the design, this may involve the engine's attribute registers, hardware tie-offs, a System MMU, or the Shareable Override Register on the CCI-400 slave interface the engine is connected to. Software must also enable snoop requests on the CCI-400 slave interfaces connected to the CPU clusters; otherwise the CCI-400 does not forward snoops to those clusters, even for shareable DMA transactions.
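Enabling snoops toward the clusters is a register write on the CCI-400. The sketch below follows the layout used by the Linux `arm-cci` driver: slave interface n at offset 0x1000 + n * 0x1000, a Snoop Control Register at offset 0x0 of each interface (bit 0 enables snoop requests, bit 1 enables DVM requests), and a change-pending bit in the Status Register at 0x000C. The Shareable Override Register at offset 0x4 and its encoding (0b11 forcing transactions to be treated as shareable) should be checked against the CCI-400 TRM. On the CCI-400 the ACE-capable interfaces are S3 and S4, while S0 to S2 are ACE-Lite; which cluster and which DMA engine sit on which port is board-specific. On most systems these registers are Secure-only, so firmware performs the write during boot. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* CCI-400 offsets in bytes; verify against the TRM for your revision. */
#define CCI_STATUS                0x000Cu  /* bit 0: change pending */
#define CCI_SLAVE_IF_BASE(n)      (0x1000u + 0x1000u * (n))
#define CCI_SNOOP_CTRL            0x0000u  /* per slave interface */
#define CCI_SHAREABLE_OVERRIDE    0x0004u  /* per ACE-Lite slave interface */

#define CCI_ENABLE_SNOOP_REQ      0x1u
#define CCI_ENABLE_DVM_REQ        0x2u
#define CCI_SHOVR_FORCE_SHAREABLE 0x3u

/* Address of the 32-bit register at byte offset `off` from `base`. */
static inline volatile uint32_t *cci_reg(volatile uint32_t *base, uint32_t off)
{
    return base + off / sizeof(uint32_t);
}

/* Enable snoop and DVM requests toward the CPU cluster on ACE slave
 * interface `port`, then wait until the CCI reports no pending change. */
void cci_enable_cluster_snoops(volatile uint32_t *base, unsigned port)
{
    *cci_reg(base, CCI_SLAVE_IF_BASE(port) + CCI_SNOOP_CTRL) =
        CCI_ENABLE_SNOOP_REQ | CCI_ENABLE_DVM_REQ;
    while (*cci_reg(base, CCI_STATUS) & 0x1u)
        ; /* spin while the change is still pending */
}

/* Treat every transaction from the ACE-Lite master on `port` (for example
 * a DMA engine) as shareable, whatever AxDOMAIN it drives. */
void cci_force_shareable(volatile uint32_t *base, unsigned port)
{
    *cci_reg(base, CCI_SLAVE_IF_BASE(port) + CCI_SHAREABLE_OVERRIDE) =
        CCI_SHOVR_FORCE_SHAREABLE;
}
```

Forcing shareability at the interconnect is a fallback for engines whose AxDOMAIN cannot be configured; setting the attributes at the engine or its System MMU keeps the choice per buffer.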
Third, the system software must perform cache maintenance for any DMA path that is not hardware coherent. If the DMA engine's transactions are snooped as described above, no CPU cache maintenance is required for the Tx and Rx buffers. If they are not (for example, because the engine issues ReadNoSnoop and WriteNoSnoop, or snooping is disabled), software must clean the Tx buffer to the Point of Coherency before starting a transfer so that the device reads up-to-date data. For the Rx buffer, software must invalidate it before handing it to the device, so that no dirty line is later evicted on top of the DMA data, and again after the transfer completes, to discard lines that were speculatively prefetched while the transfer was in progress. The driver orders these operations relative to the DMA transfer using barriers; the CCI-400 plays no part in that ordering.
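The rules for the non-coherent case can be sketched as follows. The code assumes 64-byte cache lines (the data cache line size of the Cortex-A53 and Cortex-A57; portable code reads CTR_EL0.DminLine), buffers that are line-aligned and a whole number of lines long (otherwise invalidating the edge lines would discard unrelated CPU data), and execution at EL1, since DC IVAC is not permitted at EL0. The helper names are hypothetical; a Linux driver would use the DMA mapping API, such as `dma_map_single()` and `dma_sync_single_for_cpu()`, rather than issuing these instructions directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size; real code derives it from CTR_EL0.DminLine. */
#define CACHE_LINE 64u

enum dma_dir { DMA_TO_DEVICE, DMA_FROM_DEVICE };
enum cmo { CMO_NONE, CMO_CLEAN, CMO_INVALIDATE };

/* Maintenance a non-coherent buffer needs before starting the transfer. */
enum cmo cmo_before(enum dma_dir dir)
{
    return dir == DMA_TO_DEVICE ? CMO_CLEAN : CMO_INVALIDATE;
}

/* Maintenance needed after completion (discards speculative refills). */
enum cmo cmo_after(enum dma_dir dir)
{
    return dir == DMA_TO_DEVICE ? CMO_NONE : CMO_INVALIDATE;
}

/* First line address and number of lines overlapping [addr, addr + len). */
size_t line_span(uintptr_t addr, size_t len, uintptr_t *first)
{
    uintptr_t mask = (uintptr_t)CACHE_LINE - 1;
    *first = addr & ~mask;
    if (len == 0)
        return 0;
    uintptr_t end = (addr + len + mask) & ~mask;
    return (size_t)((end - *first) / CACHE_LINE);
}

#if defined(__aarch64__)
/* EL1 only: clean or invalidate each line to the Point of Coherency,
 * then wait for the maintenance to complete before starting the DMA. */
void dma_maintain_range(uintptr_t addr, size_t len, enum cmo op)
{
    uintptr_t line;
    size_t n = line_span(addr, len, &line);
    for (; n > 0; n--, line += CACHE_LINE) {
        if (op == CMO_CLEAN)
            __asm__ volatile("dc cvac, %0" :: "r"(line) : "memory");
        else if (op == CMO_INVALIDATE)
            __asm__ volatile("dc ivac, %0" :: "r"(line) : "memory");
    }
    __asm__ volatile("dsb sy" ::: "memory");
}
#endif
```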
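Finally, the system software must order the hand-off of buffers between the CPUs and the DMA engine. Hardware coherency ensures that each access sees the latest data, but it does not decide which side owns a buffer at a given moment. The CPU must complete its writes to a Tx buffer and its descriptor before signalling the DMA engine, which on AArch64 requires a barrier covering the Outer Shareable domain (such as DMB OSHST before updating a descriptor in memory, and a barrier such as DSB ST before writing a device doorbell register). Likewise, the CPU must not read an Rx buffer until the engine has reported completion and a read barrier has been executed. Neither side should access a buffer while the other owns it.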
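A descriptor hand-off might look like the following sketch. The descriptor layout, the `DESC_OWN_DMA` bit, and the function names are hypothetical. On AArch64 the barriers are DMB OSHST and DMB OSHLD, the instructions behind Linux's arm64 `dma_wmb()` and `dma_rmb()`; the C11 fence fallback exists only so the sketch compiles on other hosts and is not a substitute for the platform's DMA barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Order normal-memory accesses for observers in the Outer Shareable
 * domain, which includes the DMA engine. */
#if defined(__aarch64__)
#define DMA_WMB() __asm__ volatile("dmb oshst" ::: "memory")
#define DMA_RMB() __asm__ volatile("dmb oshld" ::: "memory")
#else
#define DMA_WMB() atomic_thread_fence(memory_order_release)
#define DMA_RMB() atomic_thread_fence(memory_order_acquire)
#endif

#define DESC_OWN_DMA 0x1u  /* hypothetical ownership bit */

/* Hypothetical Tx descriptor shared between the CPU and the DMA engine. */
struct tx_desc {
    uint64_t buf_addr;
    uint32_t len;
    volatile uint32_t flags;
};

/* Fill in the descriptor, then hand it to the DMA engine. */
void tx_post(struct tx_desc *d, uint64_t buf_addr, uint32_t len)
{
    d->buf_addr = buf_addr;
    d->len = len;
    DMA_WMB();                /* buffer and descriptor writes before OWN */
    d->flags = DESC_OWN_DMA;  /* the engine may now read the descriptor */
}

/* Returns 1 once the engine has cleared OWN and handed the descriptor back. */
int tx_done(const struct tx_desc *d)
{
    if (d->flags & DESC_OWN_DMA)
        return 0;
    DMA_RMB();                /* read other fields only after OWN clears */
    return 1;
}
```

After `tx_post`, the driver would write the engine's doorbell register to start the transfer, and it would treat the buffer as owned by the engine until `tx_done` returns 1.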
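In summary, coherency in a system with multiple shareability domains depends on the attributes each master actually drives on the bus. The CPU clusters derive AxDOMAIN from their PTEs, the DMA engine derives AxDOMAIN and AxSNOOP from its own configuration, and the CCI-400 snoops the CPU clusters for shareable transactions when snooping is enabled on their interfaces. Where the DMA path is hardware coherent, software only needs to order buffer hand-offs with barriers; where it is not, software must also perform cache maintenance and respect the limits of the ACE-Lite protocol. With these pieces configured consistently, the system can handle DMA transfers coherently across multiple shareability domains.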