ARM Cortex-M7 High CPU Load Despite Higher Clock Speed
The issue is a large disparity in CPU load between two systems: one based on an ARM Cortex-M7 microcontroller (CYT4BFX) running at 160 MHz, the other on an ARM Cortex-M4 microcontroller (CYT2B9X) running at 80 MHz. Despite its higher clock speed, the Cortex-M7 shows a CPU load of 95%, whereas the Cortex-M4 runs at only 25%. Both systems run the same software stack, an AUTOSAR architecture with an OSEK OS, and share nearly identical BSW configurations. The Cortex-M7 system uses neither the caches nor Tightly-Coupled Memory (TCM), both of which are critical to performance on Cortex-M7 processors.
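To make the disparity concrete, the two load figures can be converted into busy core cycles per second. The helper below is a minimal sketch; the name `busy_cycles_per_second` is hypothetical, and it assumes both load figures were measured over comparable intervals while running the same workload.

```c
#include <assert.h>
#include <stdint.h>

/* Busy core cycles per second implied by a clock rate and a CPU load figure.
 * Hypothetical helper for illustration; load is given in whole percent. */
static uint64_t busy_cycles_per_second(uint64_t clock_hz, unsigned load_percent)
{
    assert(load_percent <= 100u);
    return clock_hz * load_percent / 100u;
}
```

With the figures quoted above, the Cortex-M7 spends 152,000,000 cycles per second (95% of 160 MHz) and the Cortex-M4 spends 20,000,000 (25% of 80 MHz). The Cortex-M7 is therefore using roughly 7.6 times as many cycles for the same work, which points at stalls rather than at the amount of computation.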
The Cortex-M7’s higher CPU load is counterintuitive given its architectural advantages over the Cortex-M4. The Cortex-M7 features a 6-stage dual-issue pipeline, an optional Floating-Point Unit (FPU), and enhanced DSP capabilities, all of which should in theory reduce CPU load. Those advantages only pay off when the pipeline is kept fed, however. With the caches off and TCM unused, instruction fetches and data accesses go out over the bus to slower memory, so the core spends much of its time stalled. The Cortex-M7’s memory subsystem is more complex than the Cortex-M4’s, and without proper configuration it becomes the bottleneck.
The Cortex-M7’s memory hierarchy includes an Instruction Cache (I-Cache), a Data Cache (D-Cache), Instruction TCM (ITCM), and Data TCM (DTCM). These features reduce latency and improve throughput by keeping frequently accessed instructions and data close to the CPU. Enabling them without proper configuration can, however, lead to memory exceptions, as observed in the Cortex-M7 system. The Cortex-M7 also supports a Memory Protection Unit (MPU), which can be used to define cacheable and non-cacheable memory regions. With the MPU disabled, the architecture’s default memory map applies: code and SRAM regions are treated as cacheable Normal memory and peripheral regions as Device memory. That default is usually safe for peripherals, but memory shared with DMA or other bus masters may then be cached when it should not be, which can lead to stale data and hard-to-diagnose faults.
Cache Configuration Issues and TCM Underutilization
The high CPU load in the Cortex-M7 system can be attributed primarily to two factors: the caches are not configured, and TCM is underutilized. The caches are disabled at reset, and TCM only helps once the application places code and data in it. Configuring either one incorrectly can degrade performance or stability rather than improve it.
The first cause is the absence of cache configuration. The Cortex-M7’s caches are disabled at reset, and enabling them without a matching memory attribute configuration can lead to memory exceptions. The memory regions that should and should not be cached are defined through the MPU; if shared or DMA-accessed memory is left cacheable, the result can be stale data, memory exceptions, and wasted CPU time. Note also that the Cortex-M4 core has no cache of its own, so cache handling is new territory when porting from the Cortex-M4. The Cortex-M7 has a separate I-Cache and D-Cache, and each is enabled independently.
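The cost of running with the caches off can be estimated with a simple counting model. The sketch below is illustrative only: the function name `access_cycles` is hypothetical, and the model (1 cycle per hit, 1 cycle plus wait states per miss) ignores prefetching, bus arbitration, and flash accelerator behavior that a real device would have.

```c
#include <assert.h>
#include <stdint.h>

/* Total cycles for a run of memory accesses under a simple model:
 * a hit costs 1 cycle, a miss costs 1 cycle plus the memory's wait states.
 * Illustrative model only; real Cortex-M7 timing depends on bus
 * arbitration, prefetch, and the flash controller. */
static uint32_t access_cycles(uint32_t accesses, uint32_t hits, uint32_t wait_states)
{
    assert(hits <= accesses);
    uint32_t misses = accesses - hits;
    return hits + misses * (1u + wait_states);
}
```

Assuming, purely for illustration, 5 wait states on the backing memory, 1000 accesses with no cache hits cost 6000 cycles, while the same accesses with 950 hits cost 1250 cycles. The actual wait-state count for a given clock must be taken from the device datasheet.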
The second factor is the underutilization of TCM. TCM is high-speed memory tightly coupled to the CPU, giving low-latency, deterministic access to critical instructions and data. Whether the TCM interfaces are enabled at reset is device-specific, and even when they are enabled, nothing lives in TCM until the application places it there. TCM is particularly useful for frequently accessed code and data such as interrupt service routines (ISRs), real-time tasks, and critical data structures. Without TCM (and without caches), these must be fetched from slower flash or system SRAM across the bus, which increases CPU load.
The bus structure differs as well. The Cortex-M7 reaches main memory through an AXI (Advanced eXtensible Interface) master interface and peripherals through an AHB (Advanced High-performance Bus) peripheral port, whereas the Cortex-M4 uses AHB-Lite buses throughout. Accesses over AXI to flash or system SRAM incur more latency than TCM or cache hits, so an unconfigured system pays that latency on every access. The Cortex-M7 also buffers writes to improve write performance; software must use barriers where the ordering or completion of those writes matters.
Enabling and Configuring Cache and TCM for Optimal Performance
To reduce the CPU load in the Cortex-M7 system, the caches and TCM must be enabled and configured properly. This involves enabling the caches, configuring the MPU, and enabling and populating TCM. Each step must be performed carefully to avoid memory exceptions.
The first step is to enable the caches. They are disabled at reset and can be enabled with the CMSIS-Core cache functions. SCB_EnableICache() and SCB_EnableDCache() each invalidate the corresponding cache before enabling it, so the explicit invalidate call below is redundant but harmless:
SCB_InvalidateICache();
SCB_EnableICache();
SCB_EnableDCache();
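Whether the caches actually ended up enabled can be checked by reading back SCB->CCR. The sketch below decodes a CCR value on its own so that it builds without CMSIS headers; the macro and function names are hypothetical, while the bit positions (DC at bit 16, IC at bit 17) are those CMSIS exposes as SCB_CCR_DC_Msk and SCB_CCR_IC_Msk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cache enable flags in SCB->CCR on the Cortex-M7. CMSIS provides the
 * same bits as SCB_CCR_DC_Msk and SCB_CCR_IC_Msk; they are redefined
 * here only so the sketch builds without device headers. */
#define CCR_DC_MASK (1UL << 16) /* D-Cache enable */
#define CCR_IC_MASK (1UL << 17) /* I-Cache enable */

static bool icache_enabled(uint32_t ccr) { return (ccr & CCR_IC_MASK) != 0u; }
static bool dcache_enabled(uint32_t ccr) { return (ccr & CCR_DC_MASK) != 0u; }

/* Startup sanity check: aborts (in debug builds) if either cache is off. */
static void check_caches_enabled(uint32_t ccr)
{
    assert(icache_enabled(ccr));
    assert(dcache_enabled(ccr));
}
```

On the target, the argument would simply be the current value of SCB->CCR, for example `check_caches_enabled(SCB->CCR);` placed after the enable calls.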
However, enabling the caches without a matching memory attribute configuration can lead to memory exceptions or stale data. The MPU defines which regions are cacheable and which are not, and it should be configured before the caches are enabled. After the regions are programmed, the MPU itself must also be switched on through MPU->CTRL, followed by barrier instructions. The following code snippet demonstrates how to configure the MPU to define a cacheable memory region:
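MPU->RNR  = 0;                               // Select region 0
MPU->RBAR = 0x20000000;                      // Region base address (must be aligned to the region size)
MPU->RASR = (0x3UL << MPU_RASR_AP_Pos)       // AP = 0b011: full read/write access
          | MPU_RASR_C_Msk | MPU_RASR_B_Msk  // TEX = 0b000, C = 1, B = 1: Normal memory, write-back cacheable
          | (16UL << MPU_RASR_SIZE_Pos)      // SIZE = 16: 2^(16+1) bytes = 128 KB (example size, adjust to the device)
          | MPU_RASR_ENABLE_Msk;             // Enable this region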
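Because the RASR fields are easy to get wrong by hand, it can help to compute the register value from the intended region size. The following is a minimal host-testable sketch, assuming the ARMv7-M RASR layout (AP at bits 26:24, C at bit 17, B at bit 16, SIZE at bits 5:1, ENABLE at bit 0); the macro and function names are hypothetical and stand in for the CMSIS MPU_RASR_* definitions.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR field positions (CMSIS names these MPU_RASR_*_Pos). */
#define RASR_AP_POS    24u
#define RASR_C_POS     17u
#define RASR_B_POS     16u
#define RASR_SIZE_POS   1u
#define RASR_ENABLE     1ul

/* SIZE field for a power-of-two region: region bytes = 2^(SIZE + 1). */
static uint32_t rasr_size_field(uint32_t region_bytes)
{
    /* Regions must be a power of two and at least 32 bytes. */
    assert(region_bytes >= 32u && (region_bytes & (region_bytes - 1u)) == 0u);
    uint32_t exponent = 0u;
    while ((1ul << exponent) < region_bytes) {
        exponent++;
    }
    return exponent - 1u;
}

/* RASR value for a full-access, write-back cacheable Normal region
 * (TEX = 0b000, C = 1, B = 1, S = 0, no subregions disabled). */
static uint32_t rasr_cacheable_rw(uint32_t region_bytes)
{
    return (uint32_t)((0x3ul << RASR_AP_POS)
                    | (1ul << RASR_C_POS)
                    | (1ul << RASR_B_POS)
                    | ((unsigned long)rasr_size_field(region_bytes) << RASR_SIZE_POS)
                    | RASR_ENABLE);
}
```

For a 128 KB region, `rasr_size_field` yields 16 and `rasr_cacheable_rw` yields 0x03030021. The 128 KB size is only an example; the real size and base address depend on the device's memory map.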
The next step is to enable TCM. Whether the TCM interfaces are already enabled at reset is device-specific; if they are not, they are enabled by setting the EN bit in the ITCMCR and DTCMCR registers, followed by barriers so that subsequent accesses see the change. The following code snippet demonstrates how to enable ITCM and DTCM:
SCB->ITCMCR |= SCB_ITCMCR_EN_Msk; // Enable ITCM
SCB->DTCMCR |= SCB_DTCMCR_EN_Msk; // Enable DTCM
__DSB();                          // Ensure the register writes have completed
__ISB();                          // Flush the pipeline before TCM is accessed
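Once TCM is enabled, critical instructions and data must be moved into it. This is done by assigning them to dedicated sections in the source code and mapping those sections to the TCM address range in the linker script; the startup code must then copy the section contents from flash into TCM before they are used. The following code snippet demonstrates how to assign a function to a section named .itcm with a compiler attribute (the section name must match the one defined in the linker script):
__attribute__((section(".itcm"))) void critical_function(void) {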
// Critical code
}
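Code placed in an ITCM section is stored in flash and has to be copied to its run address during startup. The routine below is a minimal sketch of that copy step; the name `copy_section` is hypothetical, and on a real target its three pointers would come from linker-script symbols whose names are project-specific and therefore not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy a code or data section word by word from its load address
 * (typically flash) to its run address (for example ITCM).
 * On target, the three pointers would come from linker-script symbols;
 * those symbol names are project-specific and not shown here. */
static void copy_section(uint32_t *dst, const uint32_t *src, const uint32_t *src_end)
{
    assert(dst != NULL && src != NULL && src <= src_end);
    while (src < src_end) {
        *dst++ = *src++;
    }
}
```

On the target, the copy would run after TCM is enabled and before the first call into the relocated code, typically followed by `__DSB(); __ISB();` so that the copied instructions are visible to the fetch unit.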
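Finally, optimize memory access patterns to reduce CPU load: minimize the number of memory accesses, use DMA (Direct Memory Access) for data transfers, and arrange the memory layout so that hot code and data sit in the fastest memory. The Cortex-M7 also buffers writes to improve write performance. This buffering needs no enable bit. Note that bit 3 of SCB->CCR is UNALIGN_TRP, which traps unaligned accesses and is unrelated to the write buffer, so it should not be set for this purpose. What matters is ordering: when a buffered write must reach memory or a peripheral before execution continues, for example before starting a DMA transfer, a barrier is required. The following code snippet demonstrates how to wait for outstanding writes to complete:
__DSB(); // Data Synchronization Barrier: wait for outstanding writes to complete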
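When DMA is used together with an enabled D-Cache, buffers are maintained in whole cache lines, and the Cortex-M7 cache line is 32 bytes. The sketch below shows only the address arithmetic; the function names are hypothetical, and it assumes the buffer does not wrap around the top of the 32-bit address space.

```c
#include <assert.h>
#include <stdint.h>

#define DCACHE_LINE_BYTES 32u /* Cortex-M7 cache line size */

/* Round an address down to the start of its cache line. */
static uint32_t line_floor(uint32_t addr)
{
    return addr & ~(DCACHE_LINE_BYTES - 1u);
}

/* Number of bytes, in whole cache lines, covering [addr, addr + len). */
static uint32_t line_span(uint32_t addr, uint32_t len)
{
    assert(len > 0u);
    uint32_t start = line_floor(addr);
    uint32_t end = (addr + len + DCACHE_LINE_BYTES - 1u) & ~(DCACHE_LINE_BYTES - 1u);
    return end - start;
}
```

On the target, the aligned start and span would typically be passed to the CMSIS-Core functions SCB_CleanDCache_by_Addr() before a DMA read of the buffer and SCB_InvalidateDCache_by_Addr() after a DMA write to it. Aligning DMA buffers to 32 bytes in the first place avoids invalidating unrelated data that shares a cache line.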
In conclusion, the high CPU load in the Cortex-M7 system can be attributed to the unconfigured caches and the underutilization of TCM. The Cortex-M7’s memory subsystem is more complex than the Cortex-M4’s, and left in its reset state it becomes the bottleneck. With the MPU regions defined, the caches enabled, and critical code and data placed in TCM, the CPU load should drop substantially, and the Cortex-M7 can achieve significantly better performance than the Cortex-M4.