Accuracy and Limitations of Performance Measurement on ARM FVP Models

Functional Accuracy vs. Cycle Accuracy in ARM Fast Models

ARM Fixed Virtual Platforms (FVPs) and Fast Models are designed to provide a functionally accurate representation of ARM-based systems. Functional accuracy ensures that all instructions are executed correctly, and the behavior of the software running on the model matches what would occur on real hardware. However, this functional accuracy does not extend to cycle accuracy. Cycle accuracy refers to the precise modeling of clock cycles, which is critical for performance analysis, especially in scenarios where timing and latency are paramount.

Fast Models are optimized for speed, typically executing around 100 million instructions per second. This high-speed execution is achieved by abstracting away the detailed timing of individual clock cycles, which is why they are not cycle-accurate. In contrast, Cycle Models, which are derived from the actual RTL (Register Transfer Level) design of the processor, provide cycle-accurate simulation but at a significantly slower execution speed, often in the range of 10,000 to 100,000 instructions per second.
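To make the speed gap concrete, the following sketch estimates how long a simulation would take at the throughput figures quoted above. The workload size of one billion instructions is a hypothetical value chosen for illustration, and the function name is not part of any ARM tool.

```python
def simulated_wall_time_s(instruction_count: int, instructions_per_second: float) -> float:
    """Return the host wall-clock seconds needed to simulate the workload."""
    if instructions_per_second <= 0:
        raise ValueError("instructions_per_second must be positive")
    return instruction_count / instructions_per_second


# Hypothetical workload: one billion guest instructions.
workload = 1_000_000_000

fast_model_s = simulated_wall_time_s(workload, 100e6)    # 10.0 s at ~100 MIPS
cycle_model_best_s = simulated_wall_time_s(workload, 100e3)   # 10000.0 s at 100 KIPS
cycle_model_worst_s = simulated_wall_time_s(workload, 10e3)   # 100000.0 s at 10 KIPS

print(f"Fast Model:  {fast_model_s:.0f} s")
print(f"Cycle Model: {cycle_model_best_s:.0f} s to {cycle_model_worst_s:.0f} s")
```

Under these assumptions, a run that takes seconds on a Fast Model takes hours on a Cycle Model, which is why the two are used for different stages of development.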

The distinction between functional and cycle accuracy is crucial when measuring the performance of programs on FVPs. While Fast Models can give a general sense of how a program will perform, they cannot provide precise timing information. For example, using clock_gettime to measure the execution time of a program on an FVP will yield results that are not directly comparable to real hardware. The timing measurements will reflect the functional execution of the program but will not account for the detailed timing variations that occur in a cycle-accurate model or real hardware.

Timing Annotation and Its Impact on Performance Measurement

To bridge the gap between functional and cycle accuracy, ARM provides a feature known as Timing Annotation (TA) in Fast Models. Timing Annotation allows users to introduce estimated timing delays into the simulation, which can be used to approximate the performance characteristics of real hardware. These annotations can be applied to various components of the system, such as caches, memory accesses, and even the CPU pipeline.

Each Fast Model CPU has parameters such as cpi_mul and cpi_div that can be used to adjust the Cycles Per Instruction (CPI) ratio. By default, Fast Models assume a CPI of 1, meaning one instruction is executed per clock cycle. However, real-world processors often have a CPI greater than 1 due to factors such as pipeline stalls, cache misses, and branch mispredictions. The cpi_mul and cpi_div parameters allow users to set a custom CPI, enabling a more realistic simulation of performance.

The effective CPI is the ratio cpi_mul / cpi_div. For example, to simulate a CPI of 1.25, you would set cpi_mul=5 and cpi_div=4. This adjustment can help approximate the average cost of pipeline inefficiencies, but it applies a single fixed ratio, so it is still an approximation and not a true cycle-accurate representation.
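The conversion from a target CPI to an integer pair can be sketched as follows. The helper name cpi_to_mul_div is hypothetical, and the sketch does not check any range limits the model may place on these parameters. The CPI is passed as a string so that decimal values such as 1.1 are converted exactly.

```python
from fractions import Fraction
from typing import Tuple


def cpi_to_mul_div(cpi: str) -> Tuple[int, int]:
    """Reduce a decimal CPI to the smallest integer (cpi_mul, cpi_div) pair."""
    ratio = Fraction(cpi)  # exact conversion from a decimal string
    if ratio <= 0:
        raise ValueError("CPI must be positive")
    return ratio.numerator, ratio.denominator


mul, div = cpi_to_mul_div("1.25")
print(f"cpi_mul={mul} cpi_div={div}")  # cpi_mul=5 cpi_div=4
```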

Cache modeling is another area where Timing Annotation can be applied. By default, cache modeling is disabled in Fast Models to maximize simulation speed. However, enabling cache modeling introduces estimated latencies for cache accesses, which can provide a more realistic performance profile. The latency parameters for caches, TLBs (Translation Lookaside Buffers), page-table walks, and other memory-related components can be configured to reflect the expected behavior of the target hardware.

It is important to note that Timing Annotation is not a substitute for cycle-accurate simulation. While it can provide a relative comparison of performance between different configurations or algorithms, it cannot guarantee precise timing measurements. Additionally, enabling Timing Annotation will slow down the simulation, as the model must now account for the introduced delays.

Practical Considerations for Performance Measurement on FVPs

When using ARM FVPs for performance measurement, there are several practical considerations to keep in mind. First, it is essential to understand the limitations of the model being used. Fast Models are not cycle-accurate, and while Timing Annotation can provide some level of timing information, it is not a replacement for real hardware or cycle-accurate models.

For users who require precise timing measurements, ARM Cycle Models are the recommended solution. However, Cycle Models are only available for released processors and are not typically available for pre-silicon or unreleased architectures. This limitation means that for early-stage performance analysis, Fast Models with Timing Annotation may be the only option.

When using Timing Annotation, it is crucial to carefully configure the parameters to match the expected behavior of the target hardware. This configuration includes setting appropriate CPI values, enabling cache modeling, and adjusting memory access latencies. The --list-params option can be used to view all available parameters for a given FVP, and the --stat output can provide insights into the impact of the configured parameters on the simulation.
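As a rough illustration, the sketch below assembles an FVP command line from a dictionary of parameters. It assumes the common FVP options -C (set a parameter) and -a (load an application) alongside --stat. The executable name ./my_fvp, the image name app.axf, and the instance paths cluster0.cpi_mul and cluster0.cpi_div are placeholders; the real parameter paths for a given platform should be taken from its --list-params output.

```python
from typing import Dict, List, Optional


def build_fvp_command(
    fvp_binary: str,
    params: Dict[str, object],
    application: Optional[str] = None,
    stats: bool = True,
) -> List[str]:
    """Build an argv list that sets each parameter with a separate -C option."""
    argv = [fvp_binary]
    for name, value in params.items():
        argv += ["-C", f"{name}={value}"]
    if stats:
        argv.append("--stat")  # print run statistics when the simulation exits
    if application is not None:
        argv += ["-a", application]
    return argv


# Placeholder names: check --list-params for the paths your FVP actually exposes.
command = build_fvp_command(
    "./my_fvp",
    {"cluster0.cpi_mul": 5, "cluster0.cpi_div": 4},
    application="app.axf",
)
print(" ".join(command))
```

Keeping the parameters in a dictionary makes it straightforward to run the same workload under several configurations and compare the --stat output of each run.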

For example, if you are simulating a system with a multi-level cache hierarchy, you would need to enable cache modeling for each level of the cache and set the appropriate latencies for cache hits and misses. Additionally, you may need to account for the latency of downstream memory accesses, which can be approximated by adjusting the parameters for the outermost cache (e.g., L2 cache in the Base FVP).
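When choosing latency values, it can help to estimate the average cost per memory access that a given set of numbers implies. The sketch below uses the textbook average-memory-access-time calculation; it is not a description of how Fast Models apply latencies internally, and the hit rates and latencies shown are hypothetical.

```python
from typing import List, Tuple


def average_access_latency(
    levels: List[Tuple[float, float]], memory_latency: float
) -> float:
    """Estimate average cycles per access.

    levels: (lookup_latency_cycles, hit_rate) per cache level, L1 first.
    memory_latency: cycles for an access that misses every cache level.
    """
    expected = 0.0
    reach_probability = 1.0  # fraction of accesses that get as far as this level
    for lookup_latency, hit_rate in levels:
        if not 0.0 <= hit_rate <= 1.0:
            raise ValueError("hit_rate must be between 0 and 1")
        expected += reach_probability * lookup_latency
        reach_probability *= 1.0 - hit_rate
    return expected + reach_probability * memory_latency


# Hypothetical values: L1 = 2 cycles, 90% hits; L2 = 10 cycles, 80% hits; DRAM = 100 cycles.
estimate = average_access_latency([(2, 0.9), (10, 0.8)], memory_latency=100)
print(f"{estimate:.2f} cycles per access")  # about 5 cycles
```

With these numbers the downstream memory latency contributes roughly 2 of the 5 cycles even though only 2% of accesses reach it, which shows why the latency behind the outermost cache deserves attention.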

In cases where precise timing is critical, it may be necessary to use a combination of Fast Models for functional verification and Cycle Models or real hardware for performance analysis. This approach allows for early-stage development and testing on Fast Models, with final performance validation on cycle-accurate models or hardware.

Performance Measurement Techniques and Tools

When measuring performance on ARM FVPs, it is important to use the appropriate techniques and tools to obtain meaningful results. One common approach is to use Performance Monitoring Unit (PMU) counters, which are available on many ARM processors. The PMU counters can provide detailed information about various aspects of program execution, such as the number of instructions executed, cache misses, and branch mispredictions.

In the context of Fast Models, the PMU counters can still be used, but the results should be interpreted with caution due to the lack of cycle accuracy. For example, the PMU cycle counter (PMCCNTR in AArch32, PMCCNTR_EL0 in AArch64) can be used to measure the number of cycles executed by the program. However, since Fast Models are not cycle-accurate, the cycle count reported by the PMU may not reflect the actual performance on real hardware.
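One way to sanity-check a cycle count from the model is to compare the observed cycles-per-instruction against the CPI that was configured. The sketch below works on counter values that have already been read out by some other means. The CounterSample type and the sample numbers are hypothetical, and counter wrap-around is not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """A snapshot of two PMU counters taken at the same point in the program."""
    cycles: int
    instructions: int


def observed_cpi(start: CounterSample, end: CounterSample) -> float:
    """Return cycles per instruction between two snapshots."""
    delta_cycles = end.cycles - start.cycles
    delta_instructions = end.instructions - start.instructions
    if delta_instructions <= 0:
        raise ValueError("no instructions retired between the two samples")
    return delta_cycles / delta_instructions


# Hypothetical readings taken before and after a region of interest.
before = CounterSample(cycles=1_000, instructions=800)
after = CounterSample(cycles=6_000, instructions=4_800)
print(observed_cpi(before, after))  # 1.25
```

If the observed value simply matches the configured cpi_mul / cpi_div ratio, the cycle count is most likely a scaled instruction count and says little about how the code would behave on a real pipeline.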

Another technique for performance measurement is to use software-based timing functions, such as clock_gettime or gettimeofday. These functions can be used to measure the elapsed time for specific sections of code. However, as previously mentioned, the timing measurements obtained from Fast Models are not cycle-accurate and should be used for relative comparisons rather than absolute performance analysis.
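A relative comparison of this kind can be sketched as below. The example uses Python's time.clock_gettime wrapper (available on Unix-like systems) with CLOCK_MONOTONIC. The two workloads are stand-ins for the code under test, and the helper names are hypothetical.

```python
import time
from typing import Callable


def best_elapsed_seconds(func: Callable[[], object], repeats: int = 5) -> float:
    """Run func several times and return the shortest elapsed time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.clock_gettime(time.CLOCK_MONOTONIC)
        func()
        end = time.clock_gettime(time.CLOCK_MONOTONIC)
        best = min(best, end - start)
    return best


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if candidate_s <= 0:
        raise ValueError("candidate time must be positive; use a longer workload")
    return baseline_s / candidate_s


# Stand-in workloads: replace these with the two variants being compared.
baseline = best_elapsed_seconds(lambda: sum(i * i for i in range(200_000)))
candidate = best_elapsed_seconds(lambda: sum(i * i for i in range(100_000)))
if candidate > 0:
    print(f"candidate is {speedup(baseline, candidate):.2f}x faster on this model")
```

The ratio is only meaningful when both variants run on the same model with the same configuration; the absolute times themselves should not be quoted as hardware performance.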

For deeper performance analysis, it may be necessary to use external tools or instrumentation. For example, ARM DS-5 Development Studio (since superseded by Arm Development Studio) includes the Streamline performance analyzer, which can be used to profile and analyze the performance of software running on ARM processors. This tool can provide detailed insights into the execution of the program, including function-level timing, cache usage, and memory access patterns.

Conclusion

ARM FVPs and Fast Models provide a powerful platform for early-stage software development and functional verification. However, they are not cycle-accurate, and their performance measurements should be interpreted with caution. Timing Annotation can be used to approximate the performance characteristics of real hardware, but it is not a substitute for cycle-accurate simulation or real hardware testing.

For precise timing measurements, ARM Cycle Models or real hardware should be used. Within those limits, FVPs remain a valuable tool for performance analysis: by carefully configuring Timing Annotation and relying on relative rather than absolute comparisons, developers can gain useful insights into the performance of their software while still benefiting from the speed and flexibility of Fast Models.
