Cortex-A53 Data Prefetching Mechanisms and Access Control

The Cortex-A53 processor, a widely used ARMv8-A core, implements data prefetching mechanisms in both its L1 and L2 caches to improve memory access performance. These mechanisms predict future memory accesses and fetch data into the cache before it is explicitly requested by the CPU. While this is beneficial for most workloads, certain applications, such as real-time systems or specific benchmarking scenarios, may require fine-grained control over these prefetching mechanisms.

The Cortex-A53 exposes prefetch control through CPUACTLR_EL1, the CPU Auxiliary Control Register (EL1). Bits in this register enable or disable the L1 and L2 data prefetching engines. Modifying CPUACTLR_EL1 is not straightforward, however, because write access from lower exception levels is granted by the higher ones (EL2 and EL3). Out of reset that access is disabled, so an attempt to write the register from EL1 (where the Linux kernel typically runs) raises an exception and crashes the kernel.

Write access to CPUACTLR_EL1 is governed by bit 0 of two auxiliary control registers, ACTLR_EL2 and ACTLR_EL3. If either bit is 0, writes from EL1 are prohibited; both must be 1 for an EL1 write to succeed. This design leaves the decision to privileged software running at EL2 (hypervisor) or EL3 (secure monitor), preventing unintended or malicious changes to the prefetching behavior from lower exception levels.
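
The gating rule can be written down as a small predicate. The following is only an illustrative model in plain C, not firmware code. The names ACTLR_CPUACTLR_EN, el1_can_write_cpuactlr and cpuactlr_access_examples are made up for this sketch, and it assumes a hypervisor is present, so that both ACTLR_EL2 and ACTLR_EL3 take part in the decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of ACTLR_EL2 / ACTLR_EL3: CPUACTLR_EL1 write access control.
 * The macro name is hypothetical; the bit position is the one given in the text. */
#define ACTLR_CPUACTLR_EN (UINT64_C(1) << 0)

/* Model of the rule: an EL1 write to CPUACTLR_EL1 is allowed only when
 * both EL3 and EL2 have set their enable bit. */
static inline bool el1_can_write_cpuactlr(uint64_t actlr_el3, uint64_t actlr_el2)
{
    return (actlr_el3 & ACTLR_CPUACTLR_EN) != 0 &&
           (actlr_el2 & ACTLR_CPUACTLR_EN) != 0;
}

/* The four combinations of the two enable bits. */
static inline void cpuactlr_access_examples(void)
{
    assert(!el1_can_write_cpuactlr(0, 0)); /* reset state: write refused */
    assert(!el1_can_write_cpuactlr(1, 0)); /* EL3 allows, EL2 does not   */
    assert(!el1_can_write_cpuactlr(0, 1)); /* EL2 allows, EL3 does not   */
    assert(el1_can_write_cpuactlr(1, 1));  /* both allow: write permitted */
}
```

Only bit 0 is inspected, so other bits in either register do not affect the result.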

Hypervisor and Secure Monitor Configuration for CPUACTLR_EL1 Access

To enable write access to CPUACTLR_EL1 from EL1, modifications must be made to the hypervisor (EL2) or secure monitor (EL3) code. The specific steps depend on the software stack in use. For example, if the system employs ARM Trusted Firmware (ATF) as the secure monitor, the relevant code can be found in the Cortex-A53-specific initialization routines within the ATF source tree. The file cortex_a53.S in the ATF repository contains the necessary assembly code to configure ACTLR_EL3.

To allow write access to CPUACTLR_EL1, the secure monitor must set bit 0 of ACTLR_EL3 to 1. Similarly, if a hypervisor is present, bit 0 of ACTLR_EL2 must also be set to 1. These changes must be made during the early boot process, typically in the platform initialization code. Below is an example of how to modify the ATF code to enable CPUACTLR_EL1 write access:

// In cortex_a53.S, modify the CPU reset handler
func cortex_a53_reset_func
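    // Set ACTLR_EL3 bit 0 to allow CPUACTLR_EL1 writes from lower exception levels
    mrs x0, ACTLR_EL3
    orr x0, x0, #1
    msr ACTLR_EL3, x0
    isb // Ensure the new access control takes effect before continuing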
    // Continue with the rest of the reset handler
    ...
endfunc cortex_a53_reset_func

Once these changes are in place, the Linux kernel running at EL1 can write CPUACTLR_EL1 to disable data prefetching. CPUACTLR_EL1 is a per-core register, so the write must be executed on every core whose prefetcher should be turned off. The following code snippet demonstrates how to disable L1 and L2 data prefetching in the kernel:

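// CPUACTLR_EL1 is accessed through its generic encoding, S3_1_C15_C2_0.
// Runs only on the calling core.
static void __init disable_prefetch(void) {
    u64 value;
    asm volatile("mrs %0, S3_1_C15_C2_0" : "=r" (value)); // Read CPUACTLR_EL1
    // Confirm these bit positions against the CPUACTLR_EL1 field table
    // in the Cortex-A53 TRM for the core revision in use.
    value |= (1ULL << 0); // Disable L1 data prefetching
    value |= (1ULL << 1); // Disable L2 data prefetching
    asm volatile("msr S3_1_C15_C2_0, %0" :: "r" (value)); // Write CPUACTLR_EL1
    asm volatile("isb" ::: "memory"); // Synchronize before continuing
}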
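
A read-modify-write like this should change only the intended field and leave every other bit of CPUACTLR_EL1 as it was. The sketch below isolates that step in plain C so it can be checked off-target. The helper names set_field and cpuactlr_l1_prefetch_off are hypothetical. The field position is an assumption that is not stated in this post: it follows my reading of the Cortex-A53 TRM, which describes the L1 prefetch control as a 3-bit field, L1PCTL, at bits [15:13], with 0b000 meaning "prefetch disabled". Verify it against the TRM for your core revision before using it.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION (verify in the Cortex-A53 TRM): L1PCTL is CPUACTLR_EL1[15:13],
 * and the value 0b000 disables the L1 data prefetcher. */
#define L1PCTL_SHIFT 13u
#define L1PCTL_WIDTH 3u

/* Return reg with the width-bit field starting at bit `shift` replaced by val.
 * Bits of val beyond `width` are discarded; all other bits of reg are kept. */
static inline uint64_t set_field(uint64_t reg, unsigned shift,
                                 unsigned width, uint64_t val)
{
    assert(width > 0 && width < 64 && shift + width <= 64);
    uint64_t mask = ((UINT64_C(1) << width) - 1) << shift;
    return (reg & ~mask) | ((val << shift) & mask);
}

/* Compute the CPUACTLR_EL1 value to write back, with L1PCTL cleared. */
static inline uint64_t cpuactlr_l1_prefetch_off(uint64_t cpuactlr)
{
    return set_field(cpuactlr, L1PCTL_SHIFT, L1PCTL_WIDTH, 0);
}
```

In the kernel function, the value read with mrs would be passed through cpuactlr_l1_prefetch_off() and the result written back with msr. Keeping the bit arithmetic in a pure function makes it possible to unit-test it on a development machine before it ever touches the register.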

Diagnosing and Controlling L2 Prefetching Behavior

Even after disabling L1 data prefetching, the Cortex-A53 may still exhibit L2 prefetching behavior, as observed in the performance counters. This is because the L2 cache prefetching mechanism operates independently of the L1 cache and is controlled by separate bits in CPUACTLR_EL1. Disabling L2 prefetching requires setting additional bits in CPUACTLR_EL1, as shown in the previous code snippet.

To diagnose the L2 prefetching behavior, the Cortex-A53 provides Performance Monitoring Unit (PMU) events that can be used to track cache line fills caused by prefetching. Specifically, PMU event 0xC2, "Linefill because of prefetch," can be used to monitor the number of L2 cache line fills triggered by prefetching. However, accessing this event from user space requires proper configuration of the PMU and support from the kernel.

On Android systems, the PMU is typically accessed through the perf subsystem or through libraries such as PAPI (Performance Application Programming Interface). For the kernel to expose PMU events, the device tree must contain a PMU node for each CPU cluster. The following example shows the necessary additions; the interrupt numbers and CPU labels are specific to the example platform and must match the SoC's interrupt map:

pmu_a53_0 {
    compatible = "arm,armv8-pmuv3";
    interrupts = <GIC_SPI 50 IRQ_TYPE_LEVEL_HIGH>,
                 <GIC_SPI 51 IRQ_TYPE_LEVEL_HIGH>,
                 <GIC_SPI 52 IRQ_TYPE_LEVEL_HIGH>,
                 <GIC_SPI 53 IRQ_TYPE_LEVEL_HIGH>;
    interrupt-affinity = <&cpu0>, <&cpu1>, <&cpu2>, <&cpu3>;
};

pmu_a53_1 {
    compatible = "arm,armv8-pmuv3";
    interrupts = <GIC_SPI 54 IRQ_TYPE_LEVEL_HIGH>,
                 <GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>,
                 <GIC_SPI 56 IRQ_TYPE_LEVEL_HIGH>,
                 <GIC_SPI 57 IRQ_TYPE_LEVEL_HIGH>;
    interrupt-affinity = <&cpu4>, <&cpu5>, <&cpu6>, <&cpu7>;
};

pmu_a72 {
    compatible = "arm,armv8-pmuv3";
    interrupts = <GIC_SPI 58 IRQ_TYPE_LEVEL_HIGH>,
                 <GIC_SPI 59 IRQ_TYPE_LEVEL_HIGH>;
    interrupt-affinity = <&cpu8>, <&cpu9>;
};

Once the device tree is updated, the PMU events can be read from user space with the perf tool or PAPI. Implementation-defined events such as 0xC2 may not be exposed under a symbolic name by default. perf can usually still request them by raw event number (for example, perf stat -e r0c2); if that is rejected, custom kernel modifications may be required.
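
When an event has no symbolic name, one option worth trying before touching the kernel is to request it by raw number through the perf_event_open(2) system call. The following is a minimal user-space sketch under stated assumptions: the helper names raw_event_attr and open_raw_counter are hypothetical, the kernel is assumed to accept raw events for this PMU, and the calling thread is assumed to be pinned to a Cortex-A53 core, since event 0xC2 is implementation-defined and may mean something else on other cores in the system.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Build an attribute block that requests a raw PMU event by number.
 * The counter starts disabled and counts user-space execution only. */
static inline struct perf_event_attr raw_event_attr(uint64_t event)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;   /* config holds the raw event number */
    attr.size = sizeof(attr);
    attr.config = event;         /* e.g. 0xC2 */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return attr;
}

/* Open the counter for the calling thread on whichever CPU it runs on.
 * Returns a file descriptor, or -1 with errno set if the kernel refuses. */
static inline int open_raw_counter(uint64_t event)
{
    struct perf_event_attr attr = raw_event_attr(event);

    /* Arguments: attr, pid = 0 (this thread), cpu = -1 (any),
     * group_fd = -1 (no group), flags = 0. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}
```

With a valid descriptor from open_raw_counter(0xC2), the counter can be started with ioctl(fd, PERF_EVENT_IOC_ENABLE, 0), stopped with PERF_EVENT_IOC_DISABLE, and its 64-bit count obtained with read(). A return value of -1 is the signal that raw access is not available on that platform.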

Summary of Key Points

Key Point                                 Description
CPUACTLR_EL1                              Controls L1 and L2 data prefetching in Cortex-A53.
ACTLR_EL2/EL3                             Registers that control write access to CPUACTLR_EL1.
Hypervisor/Secure Monitor Modifications   Required to enable write access to CPUACTLR_EL1 from EL1.
PMU Event 0xC2                            Tracks L2 cache line fills caused by prefetching.
Device Tree Modifications                 Necessary to enable PMU access on Android systems.

By following the steps outlined above, developers can gain fine-grained control over the Cortex-A53’s data prefetching mechanisms, enabling them to optimize performance for specific workloads or disable prefetching entirely when necessary.
