Secure and Non-Secure Domain Execution Time Discrepancy in ARM Cortex-M23
The ARM Cortex-M23 processor implements the ARMv8-M Baseline architecture and supports the TrustZone security extension for microcontrollers, which divides code execution into secure and non-secure domains. This separation is crucial for applications requiring robust security, such as IoT devices, where sensitive operations must be isolated from less trusted code. However, the division can introduce unexpected performance discrepancies between the two domains, even when they execute seemingly identical code. One such discrepancy is a simple loop that takes significantly less time in the secure domain than the same loop takes in the non-secure domain. The result is particularly perplexing when the core clock frequency is constant and the code is compiled without optimizations.
The observed behavior, where a loop in the non-secure domain takes approximately 1.5 times longer to execute than the same loop in the secure domain, suggests that factors beyond the core clock frequency and instruction count are at play. This discrepancy can be attributed to several underlying causes, including differences in memory access timing, world switch overhead, and potential configuration mismatches between the secure and non-secure domains. Understanding these factors requires a deep dive into the ARM Cortex-M23 architecture, the TrustZone implementation, and the specific memory subsystem of the Nuvoton M2351 microcontroller.
Memory Access Timing Differences and World Switch Overhead
One of the primary reasons for the execution time discrepancy between the secure and non-secure domains lies in memory access timing. On the Nuvoton M2351 microcontroller, secure and non-secure code and data are placed in different memory regions, and these regions may have distinct timing characteristics. The Cortex-M23 core has no tightly coupled memory (TCM) interface and no cache of its own, so the difference has to come from the device's memory system: for instance, one image might execute from SRAM or from a flash region that the flash controller serves without stalls, while the other executes from embedded flash where fetches incur wait states. Bus arbitration and the alignment of the loop relative to the flash fetch width can also differ between the two images. Security attribution checks are performed in hardware alongside the access, so they are a less likely source of extra cycles than the memory itself, although this should be confirmed against the device documentation.
The Nuvoton M2351 microcontroller's memory architecture, as depicted in its block diagram, maps the secure and non-secure domains to different memory regions. The paths from the processor to these regions are not guaranteed to have the same latency. If the secure loop happens to sit in a region or at an address that the memory system serves without stalls, it executes at close to one instruction fetch per cycle. If the non-secure loop sits where some fetches stall, for example because of flash wait states or because the loop straddles a fetch boundary, every iteration pays the extra cycles, and the total execution time grows in proportion.
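The effect of stalled fetches can be illustrated with a toy model. The sketch below assumes that every memory access costs one cycle plus a fixed number of wait states on the accesses that stall; the function name and the numbers are illustrative only and are not taken from M2351 documentation or measurements.

```c
#include <stdint.h>

/* Toy model: each access costs 1 cycle, and each stalled access costs
 * an additional `wait_states` cycles. Illustrative only. */
uint32_t estimate_cycles(uint32_t accesses, uint32_t stalled, uint32_t wait_states)
{
    return accesses + stalled * wait_states;
}
```

Under this model, 1000 accesses with no stalls cost 1000 cycles, while 1000 accesses of which 500 stall for one wait state cost 1500 cycles. A single wait state on half of the fetches is therefore enough to produce a 1.5x slowdown with an unchanged core clock and an identical instruction count.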
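Another possible contributing factor is world switch overhead. When secure code calls a non-secure function, or non-secure code calls a secure entry function, the processor changes security state. On ARMv8-M this transition is performed in hardware by the SG, BXNS, and BLXNS instructions rather than by a software monitor, but it is not free: calls into the secure domain pass through a veneer, a call from secure to non-secure code pushes return state onto the secure stack, and compiler-generated code typically clears registers that could leak secure data before entering non-secure code. The Cortex-M23 core has no cache, so no cache maintenance is involved. While the overhead of a single transition is small, it accumulates in tight loops or frequently called functions. For the benchmark loop, this overhead matters only if the loop body or the timing calls cross the security boundary; a loop that runs entirely within one domain does not incur it. If the non-secure measurement does cross the boundary, the transitions could account for a significant portion of the observed difference.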
Investigating and Resolving Execution Time Discrepancies
To address the execution time discrepancy between the secure and non-secure domains, a systematic approach is required. The first step is to verify that the core clock frequency is the same during both measurements. There is a single core, so both security states share one clock, but the clock configuration can still change between the two test runs if either image reprograms it. The frequency can be checked by routing the clock to an output pin and measuring it with an oscilloscope, or by timing a known interval with a timer peripheral. If the clock frequency is confirmed to be the same, the next step is to analyze the memory access patterns and timing for both domains.
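For the timer-based measurement, a minimal sketch of the arithmetic is shown below. It assumes the loop is bracketed by two reads of a 24-bit down-counter such as SysTick (on a CMSIS-based target the values would come from `SysTick->VAL`), that the reload value is set to the maximum 0xFFFFFF, and that fewer than 2^24 cycles elapse between the reads. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick is a 24-bit down-counter, so counter values wrap at 2^24. */
#define SYSTICK_MASK 0x00FFFFFFu

/* Cycles elapsed between two counter reads.
 * Assumes reload = 0xFFFFFF and at most one wrap between the reads. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    assert(start <= SYSTICK_MASK && end <= SYSTICK_MASK);
    return (start - end) & SYSTICK_MASK;
}

/* Ratio of non-secure to secure cycle counts in permille
 * (a result of 1500 means the non-secure run took 1.5x as long). */
uint32_t ratio_permille(uint32_t ns_cycles, uint32_t s_cycles)
{
    assert(s_cycles != 0u);
    return (uint32_t)(((uint64_t)ns_cycles * 1000u) / s_cycles);
}
```

Reading the counter from the same security state that runs the loop keeps the measurement itself from crossing the security boundary, which would otherwise add transition overhead to one of the two results.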
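The next step is to measure how many clock cycles each domain spends on the loop. The Cortex-M23 does not provide a performance monitoring unit (PMU), and the DWT cycle counter belongs to the ARMv8-M Main Extension, so it is not available on this Baseline core; SysTick or a peripheral timer running from the core clock has to serve as the cycle counter instead. A useful experiment is to run the identical loop from a different memory in each domain, for example from SRAM instead of flash, and compare the results: if the gap between the domains closes, the non-secure code was paying for slower instruction fetches rather than for anything inherent in the security state. If memory access timing is indeed the culprit, adjusting the memory layout or the memory controller configuration might reduce the latency. For example, placing the time-critical non-secure code or data in a faster region, checking the flash wait-state setting, or enabling any flash cache or prefetch feature the device provides can improve performance.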
Another approach is to minimize the world switch overhead by reducing the number of transitions between the secure and non-secure domains. This can be achieved by batching operations so that one cross-domain call does the work of many, or by restructuring the code so that tight loops stay entirely within one domain. Because the state transition itself is already handled by hardware, the remaining cost is mostly the veneer and the compiler-generated code around each cross-domain call, so making fewer calls is usually the most effective reduction.
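The batching idea can be sketched as follows. This is a host-side model, not target code: the function names and the XOR transform are hypothetical, and a counter stands in for the cost of a security-state transition. On a real target, the two service functions would be secure entry functions (for example marked with `__attribute__((cmse_nonsecure_entry))` and built with CMSE support), and the batched variant would need to validate the non-secure buffer, for instance with `cmse_check_address_range`, before touching it.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the cost of crossing the security boundary:
 * incremented once per simulated cross-domain call. */
static uint32_t boundary_crossings;

void reset_crossings(void) { boundary_crossings = 0u; }
uint32_t get_crossings(void) { return boundary_crossings; }

/* Placeholder for the actual secure-side work. */
static uint32_t transform(uint32_t sample) { return sample ^ 0xA5A5A5A5u; }

/* Hypothetical per-item service: one crossing for every sample. */
uint32_t secure_process_one(uint32_t sample)
{
    boundary_crossings++;
    return transform(sample);
}

/* Hypothetical batched service: one crossing for the whole buffer.
 * Real secure code must validate `samples` before using it. */
void secure_process_many(uint32_t *samples, size_t count)
{
    boundary_crossings++;
    for (size_t i = 0; i < count; i++) {
        samples[i] = transform(samples[i]);
    }
}
```

Processing N samples through `secure_process_one` costs N crossings, while a single call to `secure_process_many` produces the same results with one crossing, so the per-call overhead is amortized over the whole buffer.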
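Finally, it is essential to review the compiler and linker settings to ensure that the secure and non-secure projects are built with the same optimization level and the same code alignment. The two images are usually separate projects with separate build configurations, so a difference in these settings can lead to different code generation or a different placement of the loop in memory, either of which might contribute to the execution time discrepancy. Comparing the disassembly of the loop in both images is a quick way to confirm that the instruction sequences really are identical. By carefully analyzing and addressing these factors, it is possible to achieve more consistent performance between the secure and non-secure domains on the ARM Cortex-M23 processor.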