A small experimental toolkit for measuring cache hit / memory access timing on x86_64 and AArch64 (ARMv8) systems.
The repository contains two self-contained C programs that demonstrate low-level timing approaches (x86 RDTSC + CPUID and ARM PMU counters) and write measured timings to text files for further analysis.
Why this exists
- Useful for learning how CPU caches (L1/L2/L3) and main memory affect access latencies
- Can help researchers and students collect micro-benchmarks for cache timing analysis
x86_cache_hits.c— x86_64 program that uses CPUID/RDTSC/RDTSCP assembly to time cache hits/misses and writes to output files (L1/L2/L3/SDRAM)Aarch64_cache_hits.c— AArch64 (ARMv8) program that uses the cycle counter (PMU) to measure L1/L2/SDRAM timings and writes to output filesREADME.md— this file
- Allocate large buffers sized to typical cache layers (L1/L2/L3) and repeatedly perform memory accesses designed to: 1) ensure the data is loaded into a particular cache level and 2) measure the time of an access that hits that level.
- Collects a large number of timing samples (default: 1,000,000 samples) and writes them into text files for post-processing/plotting.
Each program produces output files in the current working directory:
L1_cache_hits.txt— measured L1 hit cycles and last value readL2_cache_hits.txt— measured L2 hit cycles and last value readL3_cache_hits.txt— measured L3 hit cycles and last value read (x86 only)SDRAM_cache_hits.txt— measured main-memory (DRAM) misses
- A modern x86_64 CPU for running
x86_cache_hits.c(RDTSC + CPUID instructions used) - AArch64 (ARMv8) CPU with user access to cycle counter for
Aarch64_cache_hits.c(kernel configuration may affect access) - gcc or clang (a typical Linux toolchain available on Linux distributions)
- Sufficient memory — the test programs allocate multi-megabyte buffers and large arrays to store results. Also the default number of traces (1,000,000) produces large output files and may take considerable time to run.
Notes regarding AArch64 and PMU/cycle counters
- On some Linux distributions, reading/writing PMU control registers or enabling user access to PMU counter may be restricted. If you get permission errors or measurements that look invalid, check your kernel/perf settings and any platform-specific guidance about enabling user-mode PMU access (e.g.,
perf_event_paranoidand/or platform-specific sysfs knobs).
General recommendation: compile with optimization enabled (e.g. -O2) for representative timings. If you want to debug, use -O0.
Example (native x86_64):
gcc -O2 x86_cache_hits.c -o x86_cache_hitsExample (native AArch64 / cross-compile from x86):
# native on ARM64
gcc -O2 Aarch64_cache_hits.c -o aarch64_cache_hits
# or cross-compile using a toolchain (example):
aarch64-linux-gnu-gcc -O2 Aarch64_cache_hits.c -o aarch64_cache_hitsTip: reduce the number of traces while experimenting to get faster runs, e.g. by editing the NUM_OF_TRACES macro or passing a compile-time define:
gcc -O2 -DNUM_OF_TRACES=10000 x86_cache_hits.c -o x86_cache_hits_smallRun the binary from the repository folder. It will print progress and produce the output files in the same directory.
# on x86_64
./x86_cache_hits
# on ARM64
./aarch64_cache_hitsEach output file is textual and contains one measurement per line. Example lines (two-columns separated by whitespace):
1234 42
4567 12
Columns:
From these outputs you can compute statistics (mean / median / histogram), build visualizations and determine typical latency ranges for each cache level.
- These programs run tight loops and allocate large arrays — expect heavy CPU usage and large memory usage.
- The default configuration is intended for collecting high-quality, high-sample-count traces. That can take a long time; when exploring or debugging, reduce
NUM_OF_TRACESfirst. - The programs use inline assembly and architecture-specific registers — do not run an x86 binary on ARM or vice-versa.
- Add scripts to parse and plot the produced datasets (e.g., a Python script using matplotlib / pandas).
- Add a small harness to automatically compile, run with fewer traces and produce sample plots for CI or notebooks.
This repository doesn't currently include a LICENSE file. If you want to add a license, consider MIT, Apache-2.0 or similar. Contributions and pull requests are welcome — please open an issue first to propose larger changes.
If you'd like, I can also add a small Python notebook or scripts that parse the output and produce example histograms. Would you like that next? 👇