ARMv8-A AArch64: Synchronous Bus Errors Colliding with Interrupts at EL3
When working with ARMv8-A AArch64 systems, particularly in secure firmware running at EL3, handling a synchronous exception and an interrupt that arrive together can be challenging. The scenario described involves the TrustZone Address Space Controller (TZC-400) generating both a bus error (reported to the core as a synchronous exception) and an interrupt when an illegal transaction occurs due to insufficient security privileges. The system enters the synchronous exception handler but never services the interrupt, and the system hangs. Because a single illegal transaction produces both events at almost the same moment, the interrupt becomes pending while the synchronous exception is still being handled, and a handler written without that case in mind can leave it stranded.
The TZC-400 enforces memory access policies: when it detects an illegal transaction it can return a DECERR response, raise an interrupt, both, or neither, as selected by its ACTION register. However, the way AArch64 handles a synchronous exception and an interrupt targeting the same exception level (EL3 in this case) can lead to unexpected behavior. Specifically, on entry to the synchronous exception handler the processor itself sets PSTATE.I (along with PSTATE.D, A and F), so the interrupt cannot be taken until the handler clears PSTATE.I or returns with ERET, which restores PSTATE from SPSR_EL3. If the exception handler does not properly restore the saved state or handle the pending interrupt, the system can become stuck.
Synchronous Exception Masking and Interrupt Routing Configuration
The root cause of the issue lies in the interaction between the AArch64 exception entry mechanism and the TZC-400’s interrupt generation. Whenever an exception is taken to an AArch64 exception level, the processor automatically sets PSTATE.D, A, I and F, which masks interrupts that target that level. This lets the handler save ELR_EL3 and SPSR_EL3 before a nested exception could overwrite them. In this case, however, it also keeps the interrupt generated by the TZC-400 pending for as long as the handler runs with PSTATE.I set.
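To make the entry and return behavior concrete, here is a small host-side C model of it. It is a sketch, not firmware: `pe_model`, `take_exception_to_el3` and `eret_from_el3` are hypothetical names, PSTATE is held in SPSR bit layout, and mode bits and stack-pointer selection are ignored. The D, A, I and F positions (bits 9, 8, 7 and 6) are the architectural SPSR ones.

```c
#include <assert.h>
#include <stdint.h>

/* DAIF mask bits as they appear in SPSR_ELx. */
#define SPSR_F (1u << 6)
#define SPSR_I (1u << 7)
#define SPSR_A (1u << 8)
#define SPSR_D (1u << 9)
#define SPSR_DAIF (SPSR_D | SPSR_A | SPSR_I | SPSR_F)

/* Hypothetical, heavily simplified model of the relevant PE state. */
struct pe_model {
    uint64_t pc;
    uint64_t pstate;   /* held in SPSR layout */
    uint64_t elr_el3;
    uint64_t spsr_el3;
};

/* Exception entry to EL3: hardware copies PSTATE to SPSR_EL3, the
 * preferred return address to ELR_EL3, and sets D, A, I and F. */
static void take_exception_to_el3(struct pe_model *pe, uint64_t vector)
{
    assert((vector & 0x7f) == 0);  /* vector entries are 0x80 apart */
    pe->spsr_el3 = pe->pstate;
    pe->elr_el3 = pe->pc;
    pe->pstate |= SPSR_DAIF;
    pe->pc = vector;
}

/* ERET from EL3: PSTATE, including the I bit, comes back from SPSR_EL3. */
static void eret_from_el3(struct pe_model *pe)
{
    pe->pstate = pe->spsr_el3;
    pe->pc = pe->elr_el3;
}
```

The model shows that the mask is temporary: if the interrupted context had PSTATE.I clear, a normal ERET clears it again, so the interrupt stays stranded only if the handler never returns or returns with a modified SPSR_EL3.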
Additionally, the interrupt routing configuration plays a critical role. If the interrupt is routed to a lower exception level (controlled by SCR_EL3.IRQ/FIQ and, for EL2, HCR_EL2.IMO/FMO), it is never taken while the core executes at EL3; it simply stays pending. Even if the interrupt is routed to EL3, the Generic Interrupt Controller (GIC) must be configured to deliver it: the interrupt must be enabled, assigned to the intended group, and given a higher priority than the CPU interface’s priority mask. Reading the GIC’s Highest Priority Pending Interrupt Register (HPPIR; ICC_HPPIR0_EL1 or ICC_HPPIR1_EL1 on GICv3) confirms whether the interrupt is actually being reported to the core.
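The routing rule for code running at EL3 can be written as a single predicate. This is an illustrative sketch with a hypothetical function name; it uses the architectural SCR_EL3.IRQ (bit 1) and SCR_EL3.FIQ (bit 2) routing bits and the PSTATE.I/F mask bits. FIQ is included because on GICv3, Group 0 interrupts are always signalled as FIQ, so a TZC-400 interrupt configured as Group 0 depends on SCR_EL3.FIQ and PSTATE.F rather than on the IRQ bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SCR_EL3 routing bits. */
#define SCR_EL3_IRQ (1u << 1)
#define SCR_EL3_FIQ (1u << 2)

/* Interrupt mask bits in SPSR/PSTATE layout. */
#define PSTATE_F (1u << 6)
#define PSTATE_I (1u << 7)

enum phys_irq_kind { PHYS_IRQ, PHYS_FIQ };

/* True if a pending physical interrupt would be taken while executing at
 * EL3. One routed to a lower EL is never taken at EL3 (it stays pending);
 * one routed to EL3 is masked by PSTATE.I or PSTATE.F respectively. */
static bool interrupt_taken_at_el3(enum phys_irq_kind kind,
                                   uint64_t scr_el3, uint64_t pstate)
{
    if (kind == PHYS_IRQ)
        return (scr_el3 & SCR_EL3_IRQ) && !(pstate & PSTATE_I);
    assert(kind == PHYS_FIQ);
    return (scr_el3 & SCR_EL3_FIQ) && !(pstate & PSTATE_F);
}
```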
The synchronous exception handler in the code under discussion saves the processor state and attempts to resume execution by adjusting the Exception Link Register (ELR_EL3). However, it accounts neither for the pending interrupt nor for the possibility of nested exceptions, so the interrupt can end up never being serviced and the system hangs.
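Before adjusting ELR_EL3, a handler can check ESR_EL3 to confirm that the exception really is the expected bus error. The sketch below decodes the architectural ESR fields (EC in bits [31:26], IL in bit 25, DFSC in bits [5:0]); the helper names are hypothetical. It assumes the DECERR surfaces as a synchronous external data abort. On many cores an external abort on a store is instead reported asynchronously as an SError, which this check would not see, and skipping a faulting load leaves its destination register without the loaded value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ESR_ELx field layout. */
#define ESR_EC_SHIFT 26
#define ESR_EC_MASK 0x3fu
#define ESR_IL (1u << 25)
#define ESR_DFSC_MASK 0x3fu

#define EC_DABT_LOWER 0x24u        /* Data Abort from a lower EL */
#define EC_DABT_CUR 0x25u          /* Data Abort without a change in EL */
#define DFSC_SYNC_EXT_ABORT 0x10u  /* Synchronous External abort, not on a walk */

/* True if ESR_EL3 describes a synchronous external data abort. */
static bool is_sync_external_dabt(uint64_t esr)
{
    uint32_t ec = (uint32_t)(esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
    uint32_t dfsc = (uint32_t)esr & ESR_DFSC_MASK;
    return (ec == EC_DABT_LOWER || ec == EC_DABT_CUR) &&
           dfsc == DFSC_SYNC_EXT_ABORT;
}

/* Return address that skips the faulting instruction. IL set means a
 * 32-bit instruction (always the case for A64); clear means 16-bit T32. */
static uint64_t elr_after_skip(uint64_t elr, uint64_t esr)
{
    assert(is_sync_external_dabt(esr));
    return elr + ((esr & ESR_IL) ? 4u : 2u);
}
```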
Implementing Nested Exception Handling and Interrupt State Management
To resolve this issue, the synchronous exception handler must be modified to handle nested exceptions and manage the interrupt state explicitly. This involves several key steps:
First, before returning, the synchronous exception handler should check for a pending interrupt by reading the GIC’s HPPIR. On GICv3, an INTID of 1023 means nothing is pending; any other value indicates a pending interrupt. If one is pending, the handler should clear the exception condition and then either unmask interrupts itself (only after saving ELR_EL3 and SPSR_EL3) or make sure the SPSR_EL3 it returns with has PSTATE.I (or PSTATE.F, for an interrupt signalled as FIQ) clear. Either way, the interrupt is serviced as soon as the mask is lifted.
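The HPPIR check reduces to classifying the INTID field. A sketch, assuming GICv3 and a value already read from ICC_HPPIR0_EL1 at EL3 (the MRS itself is left out so the logic runs on a host). The INTID is taken from bits [23:0], 1023 means no pending interrupt, and 1020 to 1022 are special values that report a pending interrupt in another group (or legacy operation). Verify the special-INTID rules against the GIC specification for your part. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HPPIR_INTID_MASK 0xffffffull  /* INTID field, bits [23:0] */
#define INTID_SPURIOUS 1023u          /* nothing pending */

enum hppi_class {
    HPPI_NONE,       /* 1023: no pending interrupt */
    HPPI_SPECIAL,    /* 1020-1022: pending, but reported via a special INTID */
    HPPI_INTERRUPT,  /* an ordinary INTID that can be acknowledged */
};

static enum hppi_class classify_hppir(uint64_t hppir)
{
    /* Bits [63:24] are RES0 and are ignored by masking. */
    uint32_t intid = (uint32_t)(hppir & HPPIR_INTID_MASK);
    if (intid == INTID_SPURIOUS)
        return HPPI_NONE;
    if (intid >= 1020u && intid <= 1022u)
        return HPPI_SPECIAL;
    return HPPI_INTERRUPT;
}

/* The decision described in the text: is anything pending that the
 * handler should let in before or on return? */
static bool should_unmask_before_return(uint64_t hppir)
{
    return classify_hppir(hppir) != HPPI_NONE;
}
```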
Second, the exception handler must tolerate nested exceptions. A nested exception at EL3 overwrites ELR_EL3 and SPSR_EL3, so the handler must save both before it unmasks interrupts or does anything else that could fault, and restore them, with interrupts masked again, immediately before ERET. SPSR_EL3 is where the interrupted PSTATE lives; PSTATE itself cannot be saved as a register. The handler already saves the general-purpose registers, but it should also save and restore SCR_EL3 if it changes interrupt routing.
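In real firmware this save and restore is a few instructions of assembly on the EL3 stack. The C model below only illustrates the ordering and nesting discipline; `el3_ctx`, `ctx_push`, `ctx_pop` and the nesting limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NESTING 4  /* hypothetical limit for this sketch */

/* State a nested exception would overwrite (ELR_EL3, SPSR_EL3) plus
 * SCR_EL3, which the handler may change. */
struct el3_ctx {
    uint64_t elr_el3;
    uint64_t spsr_el3;
    uint64_t scr_el3;
};

struct el3_ctx_stack {
    struct el3_ctx frames[MAX_NESTING];
    size_t depth;
};

/* Call before clearing PSTATE.I or touching anything that can fault. */
static void ctx_push(struct el3_ctx_stack *s, const struct el3_ctx *c)
{
    assert(s->depth < MAX_NESTING);
    s->frames[s->depth++] = *c;
}

/* Call with interrupts masked again, immediately before ERET. */
static struct el3_ctx ctx_pop(struct el3_ctx_stack *s)
{
    assert(s->depth > 0);
    return s->frames[--s->depth];
}
```

A last-in, first-out stack matches the hardware: the innermost exception must return first, and each level restores exactly the ELR_EL3 and SPSR_EL3 it found on entry.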
Third, the TZC-400 interrupt needs a dedicated handler. It should read the TZC-400’s interrupt status register (INT_STATUS) to find which filter reported the failure, read that filter’s fail address, control and ID registers to identify the offending access, and take appropriate action, such as logging the event or resetting the affected state. It should then clear the interrupt at the TZC-400 (INT_CLEAR) and complete it at the GIC by acknowledging it and writing the end-of-interrupt register, so that it is not retriggered.
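The TZC-side part of that handler might look like the sketch below. The register offsets (INT_STATUS at 0x010, INT_CLEAR at 0x014, per-filter FAIL_* registers from 0x020 with a 0x10 stride) and the one-STATUS-bit-per-filter layout are assumptions to check against the TZC-400 TRM. The sketch also assumes four filters, although the implemented count should come from the controller’s configuration register. Register access goes through hypothetical hooks (`tzc_io`) so that a fake register bank can stand in for hardware; GIC acknowledge and EOI are left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed TZC-400 register offsets. */
#define TZC_INT_STATUS 0x010u
#define TZC_INT_CLEAR 0x014u
#define TZC_FAIL_ADDR_LOW(f) (0x020u + 0x10u * (f))
#define TZC_FAIL_ADDR_HIGH(f) (0x024u + 0x10u * (f))
#define TZC_FAIL_CONTROL(f) (0x028u + 0x10u * (f))
#define TZC_FAIL_ID(f) (0x02cu + 0x10u * (f))
#define TZC_NUM_FILTERS 4u
#define TZC_STATUS_MASK 0xfu  /* one STATUS bit per filter, bits [3:0] */

/* Hypothetical register-access hooks. */
struct tzc_io {
    uint32_t (*read32)(void *ctx, uint32_t off);
    void (*write32)(void *ctx, uint32_t off, uint32_t val);
    void *ctx;
};

struct tzc_fail_record {
    uint64_t address;
    uint32_t control;
    uint32_t id;
};

/* Record each failing filter, then clear those interrupts at the TZC so
 * its output drops before the caller writes the GIC end-of-interrupt.
 * Returns the bitmap of filters handled. */
static uint32_t tzc_handle_interrupt(const struct tzc_io *io,
                                     struct tzc_fail_record rec[TZC_NUM_FILTERS])
{
    assert(io && io->read32 && io->write32);
    uint32_t status = io->read32(io->ctx, TZC_INT_STATUS) & TZC_STATUS_MASK;

    for (uint32_t f = 0; f < TZC_NUM_FILTERS; f++) {
        if (!(status & (1u << f)))
            continue;
        rec[f].address =
            ((uint64_t)io->read32(io->ctx, TZC_FAIL_ADDR_HIGH(f)) << 32) |
            io->read32(io->ctx, TZC_FAIL_ADDR_LOW(f));
        rec[f].control = io->read32(io->ctx, TZC_FAIL_CONTROL(f));
        rec[f].id = io->read32(io->ctx, TZC_FAIL_ID(f));
    }
    if (status)
        io->write32(io->ctx, TZC_INT_CLEAR, status);
    return status;
}

/* Host-side fake register bank (hypothetical) for exercising the handler. */
struct fake_tzc {
    uint32_t regs[0x60 / 4];
};

static uint32_t fake_read32(void *ctx, uint32_t off)
{
    struct fake_tzc *t = ctx;
    assert(off % 4 == 0 && off / 4 < sizeof t->regs / sizeof t->regs[0]);
    return t->regs[off / 4];
}

static void fake_write32(void *ctx, uint32_t off, uint32_t val)
{
    struct fake_tzc *t = ctx;
    if (off == TZC_INT_CLEAR) {
        /* Model INT_CLEAR as clearing the matching INT_STATUS bits. */
        t->regs[TZC_INT_STATUS / 4] &= ~val;
        return;
    }
    assert(off % 4 == 0 && off / 4 < sizeof t->regs / sizeof t->regs[0]);
    t->regs[off / 4] = val;
}
```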
Finally, the system should be tested with a variety of illegal transactions to confirm that both the synchronous exception handler and the interrupt handler work correctly. This includes transactions that generate a bus error and an interrupt together, as well as transactions that generate only one or the other; the TZC-400 ACTION register can be used to select each combination.
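A test plan can be driven from the ACTION setting. The sketch assumes the TRM encoding of ACTION bits [1:0], where bit 0 selects the DECERR response and bit 1 drives the interrupt output; verify this for your TZC-400 revision. The enumerator and function names are hypothetical. A bus error that shows up as an SError rather than a synchronous abort (common for stores) still counts as "bus error expected" here.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ACTION register encoding (bits [1:0]). */
enum tzc_act {
    TZC_ACT_NONE = 0,     /* OKAY response, no interrupt */
    TZC_ACT_ERR = 1,      /* DECERR response, no interrupt */
    TZC_ACT_INT = 2,      /* OKAY response, interrupt */
    TZC_ACT_ERR_INT = 3,  /* DECERR response and interrupt */
};

struct tzc_expectation {
    bool bus_error;  /* expect the abort path to run */
    bool interrupt;  /* expect the TZC interrupt handler to run */
};

/* What one illegal access should trigger under a given ACTION value. */
static struct tzc_expectation expect_for_action(enum tzc_act a)
{
    assert((unsigned)a <= TZC_ACT_ERR_INT);
    struct tzc_expectation e = {
        .bus_error = (a & TZC_ACT_ERR) != 0,
        .interrupt = (a & TZC_ACT_INT) != 0,
    };
    return e;
}
```

Running the same illegal access under all four settings and comparing observed handler activity with `expect_for_action` covers the "both", "only one" and "neither" cases the text calls for.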
With these changes, a bus error and the interrupt raised by the same illegal transaction are both serviced instead of hanging the system. The approach relies on the standard AArch64 exception entry and return behavior while accounting for how the TZC-400 and the GIC report the failure.