First Things First: Hardware Counters

Tue, 10 Mar 2026 19:00:00 +0100

12.9× slower — and that’s the easy part

Two loops over the same array. Same data. Same sum operation. One walks the array sequentially; the other uses a random permutation for indirection. BenchmarkDotNet says SumRandom is 12.88× slower at one million elements. No surprise — random memory access is slower. Everyone knows that.

But how much slower will it get when the dataset grows 64×?

BDN measures time. Time compresses everything the CPU did — cache behavior, prefetch, pipeline stalls, memory latency — into a single scalar. It answers how much. It cannot answer why. And without why, the next question — what happens when conditions change — is a guess.

The first four posts taught doubt. Design lies through omission. Environment masks distortion. Data collection coordinates with failure. Interpretation drifts from evidence. Each layer peeled back a way the measurement could mislead, and each time the tools were doing it to you while appearing to work with you.

This post goes somewhere different.

All code in this post: clone, build, run. Numbers below were measured on dual Xeon E5-2697 v2 using BenchmarkDotNet v0.14.0. Charts use milliseconds unless otherwise noted; tables reproduce BenchmarkDotNet output. BDN’s Error column is the half-width of the 99.9% confidence interval.

The setup — two paths, same operation

[Benchmark(Baseline = true)]
public long SumSequential()
{
    long sum = 0;
    long[] data = _data;
    for (int i = 0; i < data.Length; i++)
        sum += data[i];
    return sum;
}

[Benchmark]
public long SumRandom()
{
    long sum = 0;
    long[] data = _data;
    int[] indices = _indices;
    for (int i = 0; i < indices.Length; i++)
        sum += data[indices[i]];
    return sum;
}

Sequential vs random access over long[]. _indices is a Fisher-Yates shuffle of 0..N-1 — same elements, different order. Full source in companion code.

Both methods compute the same sum. Both touch every element exactly once. The only difference: the order of access. Sequential walks the array from start to end. Random jumps through a pre-shuffled index array.

At one million elements (8 MB of long[] — plus 4 MB of int[] indices for the random variant — both fit comfortably in the 30 MB L3 cache on Ivy Bridge-EP):

| Method        | N       | Mean       | Error    | StdDev   | Ratio | RatioSD |
|-------------- |-------- |-----------:|---------:|---------:|------:|--------:|
| SumSequential | 1000000 |   561.0 us |  1.85 us |  1.44 us |  1.00 |    0.00 |
| SumRandom     | 1000000 | 7,223.9 us | 12.74 us | 10.63 us | 12.88 |    0.04 |

12.88× slower. The confidence intervals don’t overlap. The difference is real and large. Random access is slower — water is wet. Ship the sequential version, move on.

BDN told you the what. It didn’t tell you the why. And without the why, you can’t predict what happens next.

Level 1 — perf stat: the vital signs

perf stat reads hardware performance counters — registers built into the CPU that count events like cycles, instructions, cache accesses, and cache misses. No sampling, no code instrumentation, and typically negligible overhead — the CPU increments these counters in hardware, and perf stat reads the registers at process start/stop. When you request more events than the CPU has physical counter registers, perf multiplexes (time-shares) and scales the results, which introduces estimation error — the percentages in the output below reflect this.¹
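The scaling step can be sketched in a few lines. When an event only occupied a counter for part of the run, the raw count is extrapolated by the ratio of enabled time to running time. The function name and the sample values below are illustrative assumptions, not output from the runs in this post.

```shell
#!/usr/bin/env bash
# Sketch: how a multiplexed counter is extrapolated.
#   scaled = raw * time_enabled / time_running
# The sample values are made up for illustration.
scale_count() {
  local raw="$1" enabled_ns="$2" running_ns="$3"
  awk -v raw="$raw" -v en="$enabled_ns" -v run="$running_ns" 'BEGIN {
    if (run == 0) { print "not counted"; exit }
    printf "%.0f (counter ran %.1f%% of the time)\n", raw * en / run, 100 * run / en
  }'
}

# An event counted 500 times while scheduled for 25 ms of a 100 ms run.
scale_count 500 100000000 25000000
```

The lower the running percentage, the more of the reported count is extrapolation rather than observation.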

perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter '*Sequential*'

If any event is unsupported on your CPU, perf stat will report `<not supported>` for that counter. Run perf list to see available events. At minimum, cycles and instructions (for IPC) are widely available on modern x86 CPUs.

Run this for both variants and you get a side-by-side comparison of what the CPU was actually doing:²

| Counter                   | Sequential | Random  | What it means |
|-------------------------- |-----------:|--------:|-------------- |
| IPC (instructions/cycle)  |       1.54 |    0.42 | CPU throughput: how many instructions retire per clock cycle |
| L1 data cache miss rate   |     11.64% |  24.90% | Fraction of loads that miss the fastest cache (32 KB, ~4 cycle latency) |
| LLC load miss rate        |    53.02%* | 30.38%* | Fraction of last-level cache loads that go to DRAM; inverted due to aggregation, see note³ |
| Branch misprediction rate |      0.87% |   2.60% | Fraction of branches predicted wrong; both are low |

A caveat these numbers have earned: they are aggregated across the full BDN process — warmup, pilot, and actual iterations at all three dataset sizes (1M, 8M, 64M). They diagnose the mechanism (memory-bound vs compute-bound), not behavior at any single N. The IPC gap (1.54 vs 0.42) and the L1 miss rate gap (11.64% vs 24.90%) are directionally stable across aggregation — random access is memory-bound regardless of how you slice the data. The LLC miss rates are less trustworthy: sequential appears worse (53% vs 30%) because it runs ~3× more total iterations, and the 64M dataset dominates its LLC totals — see ³ for details. The prediction in Level 3 rests on the mechanism (memory-bound + L3-dependent), not on the exact percentages.
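Every rate in that table is a ratio of two raw counters. A minimal sketch of the derivation, assuming the CSV layout that `perf stat -x,` emits without `-I` or `-A` (count in field 1, event name in field 3; check `man perf-stat` for your version). The function name is hypothetical and the counts in the heredoc are synthetic placeholders, not measurements from this post.

```shell
#!/usr/bin/env bash
# Sketch: derive IPC and miss rates from `perf stat -x,` CSV on stdin.
# Field 1 is the count, field 3 the event name (possibly with a :u suffix).
derive_metrics() {
  awk -F, '
    $1 ~ /^[0-9]+$/ {            # skip "<not supported>" and "<not counted>" rows
      name = $3
      sub(/:.*$/, "", name)      # instructions:u -> instructions
      count[name] = $1 + 0
    }
    function rate(label, num, den) {
      if (count[den] > 0)
        printf "%s %.2f%%\n", label, 100 * count[num] / count[den]
    }
    END {
      if (count["cycles"] > 0)
        printf "IPC %.2f\n", count["instructions"] / count["cycles"]
      rate("L1D miss rate", "L1-dcache-load-misses", "L1-dcache-loads")
      rate("LLC miss rate", "LLC-load-misses", "LLC-loads")
      rate("Branch miss rate", "branch-misses", "branch-instructions")
    }'
}

# Synthetic counts, NOT measurements from this post.
derive_metrics <<'EOF'
1000,,cycles:u,1000000,100.00,,
420,,instructions:u,1000000,100.00,0.42,insn per cycle
2000,,L1-dcache-loads:u,1000000,100.00,,
498,,L1-dcache-load-misses:u,1000000,100.00,24.90,of all L1-dcache accesses
<not supported>,,LLC-loads:u,0,100.00,,
1000,,branch-instructions:u,1000000,100.00,,
26,,branch-misses:u,1000000,100.00,2.60,of all branches
EOF
```

perf stat writes its report to stderr, so one way to feed real data in is `perf stat -x, -o counters.csv -e ... -- <command>` followed by `derive_metrics < counters.csv`. Rows whose denominator is missing are simply left out of the summary.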

The first four posts taught methodical doubt. What to trust? What to distrust? How deep does the distortion go? Descartes reached for the cogito — the one thing doubt couldn’t dissolve. Here the descent through software abstraction reaches something similar. perf stat doesn’t measure time. It doesn’t measure abstractions. It reads registers that the silicon increments whether anyone is watching or not. The counters exist at the boundary where software models end and physics begins. Doubt doesn’t end in nihilism. It ends in firmer ground.

IPC is the headline. Sequential executes 1.54 instructions per cycle. Random executes 0.42. The CPU is 3.7× more productive on sequential access on this Ivy Bridge-EP — not because it runs different instructions, but because it doesn’t stall. The hardware prefetcher detects the sequential stride, fetches cache lines ahead of the loop, and the data is waiting in L1 before the load instruction executes.⁴

Random access defeats the prefetcher. Every load is a surprise. The CPU issues the load, waits 10-40 cycles for L2/L3, and the pipeline stalls. The instructions are the same — the wait is different.

BDN said “12.88× slower.” The aggregate hardware counters² reveal the mechanism: the CPU is stalling on cache misses. The instructions aren’t slower — they’re waiting. And waiting scales with memory latency, which scales with working set size.

That’s the basis for a prediction.

Level 2 — Flame graphs: the shape of time

Hardware counters tell you what the CPU is doing — stalling on cache misses, mispredicting branches. Flame graphs tell you where the cost concentrates in the code path.⁵

A flame graph is a visualization of stack traces sampled by perf record. The x-axis is not time — it’s the population of samples. Wider frames mean more time spent in that function.

# Record stack traces at 99 Hz (standalone runner, no BDN overhead)
perf record -g -F 99 --call-graph dwarf -- \
    dotnet run -c Release -- perf-sequential 8000000

# Convert to flame graph SVG
perf script | stackcollapse-perf.pl | flamegraph.pl > sequential.svg

Sequential — one hot column, tight loop, no stalls wide enough to sample:

Random — wider, flatter. The hot loop is still there, but the sampled stacks spread more broadly around it, consistent with the CPU spending more time waiting on the memory subsystem:

perf stat diagnosed the disease. The flame graph shows the hot path around it. Sequential’s samples concentrate in one tight column — the loop body runs, the prefetcher feeds it, the pipeline stays full. Random spreads wider — same loop body, but more of the sampled time accumulates in and around that path while the core waits for data. The structure of time, not just the quantity of it.

Three tools, three levels:

| Tool        | Question         | Answer |
|------------ |----------------- |------- |
| BDN         | How much slower? | 12.88× (at 1M) |
| perf stat   | Why?             | IPC 0.42 vs 1.54: cache miss stalls |
| Flame graph | Where?           | The hot path around the inner loop |

Level 3 — The prediction

This is where hardware counters do something BDN cannot. They don’t just explain the past. They make the future falsifiable.

At one million elements, sequential walks 8 MB of long[]. Random also loads a 4 MB int[] index array, bringing its working set to 12 MB. The L3 cache on this Xeon E5-2697 v2 is 30 MB — everything fits. Random access is slow because it misses L1 and L2 — but it hits L3. L3 latency is ~30 cycles. Bad, but bounded.

The hypothesis: if random access is memory-bound — the aggregate counters showed IPC 0.42 and L1 miss rate 24.90%, diagnosing the mechanism even though they span all dataset sizes² — and the current performance depends on L3 absorbing those misses, then exceeding L3 capacity will force misses to DRAM at ~200 cycles. The ratio should jump.

At 8 million elements, the data array alone is 64 MB. With the 32 MB index array, random’s working set reaches 96 MB — well beyond the 30 MB L3. At 64 million elements (512 MB data + 256 MB indices), there’s no question. Random access now predominantly misses all cache levels and goes to DRAM.
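The working-set arithmetic is simple enough to script. The sketch below assumes 8-byte long elements, 4-byte int indices, decimal megabytes as used in the text, and the 30 MB L3 of this machine; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: working-set size versus L3 capacity for a given element count.
# Assumes 8-byte long elements, 4-byte int indices, decimal megabytes,
# and a 30 MB L3 (the Xeon E5-2697 v2 used in this post).
L3_MB=30

working_set() {
  local n="$1"
  local data_mb=$(( n * 8 / 1000000 ))
  local index_mb=$(( n * 4 / 1000000 ))
  local total_mb=$(( data_mb + index_mb ))
  local verdict="fits in L3"
  if (( total_mb > L3_MB )); then
    verdict="exceeds L3"
  fi
  echo "N=$n sequential=${data_mb}MB random=${total_mb}MB -> random $verdict"
}

for n in 1000000 8000000 64000000; do
  working_set "$n"
done
```

Change `L3_MB` to your own last-level cache size to see at which N the prediction would place the cliff on different hardware.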

This is a falsifiable prediction. Not a statistical extrapolation from benchmark numbers. A deduction from cache architecture, informed by hardware counters that revealed the mechanism. Run it. See what happens.

| Method        | N        | Mean           | Error       | StdDev      | Ratio | RatioSD |
|-------------- |--------- |---------------:|------------:|------------:|------:|--------:|
| SumSequential | 1000000  |       561.0 us |     1.85 us |     1.44 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,223.9 us |    12.74 us |    10.63 us | 12.88 |    0.04 |
|               |          |                |             |             |       |         |
| SumSequential | 8000000  |     6,434.3 us |    51.43 us |    48.11 us |  1.00 |    0.01 |
| SumRandom     | 8000000  |   125,635.5 us | 2,319.67 us | 2,056.33 us | 19.53 |    0.34 |
|               |          |                |             |             |       |         |
| SumSequential | 64000000 |    82,974.9 us |   933.52 us |   728.83 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,613,935.5 us | 3,277.61 us | 2,736.96 us | 19.45 |    0.17 |

Sequential scales smoothly but super-linearly. 8× more data yields ~11.5× more time; 64× more data yields ~148× more time.⁶ The extra factor comes from the L3 boundary: at 1M (8 MB), sequential reads hit L3 at ~30 cycle latency. At 8M+ (64 MB+), the prefetcher must pull from DRAM (~200 cycles). Even within the DRAM-resident range (8M to 64M), scaling is ~12.9× for 8× data — still super-linear, likely due to TLB pressure at large working sets. The prefetcher hides most of the latency increase — but not all of it.
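Those scaling factors come straight from the Mean column. A small sketch for rechecking them, using the SumSequential means from the table above; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: time-scaling factor between two dataset sizes.
# A factor above the data-growth factor means super-linear scaling.
scaling() {
  local small_us="$1" large_us="$2" data_growth="$3"
  awk -v a="$small_us" -v b="$large_us" -v g="$data_growth" 'BEGIN {
    f = b / a
    printf "%sx data -> %.1fx time (%s)\n", g, f, (f > g ? "super-linear" : "linear or better")
  }'
}

# SumSequential means from the table above, in microseconds.
scaling 561.0 6434.3 8      # 1M -> 8M
scaling 561.0 82974.9 64    # 1M -> 64M
scaling 6434.3 82974.9 8    # 8M -> 64M
```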

Random hits a cliff. In this run, the ratio jumps from ~13× at 1M to ~19.5× at 8M — about a 50% degradation on this dual-socket NUMA system.⁷ Beyond that, it stays in the same rough range rather than snapping back. The cliff happened between 1M and 8M, exactly where the working set crossed the L3 boundary. The exact ratios will differ on your hardware — the cliff at the L3 boundary won’t.

The prediction survived the test — on this hardware, on this run.

Popper, Logik der Forschung (1934)⁸: a falsifiable prediction distinguishes science from storytelling. “Random access is memory-bound (low IPC, high L1 miss rate). The working set fits L3 at 1M. At 8M, it won’t. The ratio will jump.” Run it. In this run, the ratio jumps from ~13× to ~19.5×. The exact numbers are unstable — dual-socket NUMA, thread migration, prefetcher heuristics all shift them.⁷ The mechanism isn’t. The theory wasn’t adjusted after the fact. It was stated before the data, derived from the cache hierarchy, and the data confirmed the shape — a cliff at the L3 boundary. That’s not extrapolation from a benchmark number. That’s deduction from architecture.

Through the first four posts, every tool revealed its distortion after the damage was done. Post-factum. Reactive. Hardware counters are the first tool in this series that generates a falsifiable hypothesis before the benchmark runs. Not a better explanation of the past — a testable claim about the future.

NUMA — where the numbers shift and the shape doesn’t

This machine has two sockets. Two Xeon E5-2697 v2, each with its own 30 MB L3 cache, its own memory controller, its own DRAM. When a thread runs on socket 0 and accesses memory allocated on socket 1, the load crosses the QPI interconnect — ~40 ns extra latency. When the OS migrates a thread between sockets mid-benchmark, the prefetcher resets, the L1/L2 are cold, and the next few thousand loads hit DRAM instead of cache.

BDN doesn’t know which socket it’s running on. It reports a single number. On dual-socket NUMA, that number carries noise from topology that has nothing to do with the code being measured.

Three runs: unpinned (OS schedules freely), pinned to socket 0 (taskset -c 0-11), pinned to socket 1 (taskset -c 12-23). Same binary, same data, same benchmark. Different answers.
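A sketch of how the three placements can be driven from one script. The CPU ranges 0-11 and 12-23 match this machine only; `lscpu` lists which CPU IDs belong to each NUMA node on yours. The `run_placed` helper and the `DRY_RUN` switch are hypothetical conveniences, and the block only prints the commands as written.

```shell
#!/usr/bin/env bash
# Sketch: run one command under the three placements described above.
# CPU ranges 0-11 and 12-23 match this machine; check yours with
#   lscpu | grep -i 'numa node'
# DRY_RUN=1 prints the command instead of executing it.
run_placed() {
  local cpus="$1"; shift
  local cmd=("$@")
  if [ -n "$cpus" ]; then
    cmd=(taskset -c "$cpus" "${cmd[@]}")
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

DRY_RUN=1
for cpus in "" 0-11 12-23; do
  run_placed "$cpus" dotnet run -c Release -- --filter '*'
done
```

taskset restricts where the thread runs, not where its memory is allocated. If you also want to control allocation, `numactl --cpunodebind=0 --membind=0` binds both, though that is a different experiment from the three runs reported here.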

Unpinned (OS schedules freely):
| Method        | N        | Mean           | Error        | StdDev       | Ratio | RatioSD |
|-------------- |--------- |---------------:|-------------:|-------------:|------:|--------:|
| SumSequential | 1000000  |       555.0 us |      1.26 us |      1.12 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,577.9 us |    181.88 us |    151.88 us | 13.65 |    0.26 |
| SumSequential | 8000000  |     9,146.4 us |  1,431.15 us |  1,338.70 us |  1.02 |    0.21 |
| SumRandom     | 8000000  |   127,720.5 us |  1,591.92 us |  1,329.32 us | 14.25 |    2.03 |
| SumSequential | 64000000 |    65,306.5 us |    522.56 us |    488.81 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,665,816.7 us | 35,695.93 us | 33,389.99 us | 25.51 |    0.53 |

Pinned to socket 0 (taskset -c 0-11):
| Method        | N        | Mean           | Error        | StdDev       | Ratio | RatioSD |
|-------------- |--------- |---------------:|-------------:|-------------:|------:|--------:|
| SumSequential | 1000000  |       564.9 us |      6.22 us |      5.51 us |  1.00 |    0.01 |
| SumRandom     | 1000000  |     8,199.0 us |    596.29 us |    557.77 us | 14.51 |    0.97 |
| SumSequential | 8000000  |     8,942.9 us |  1,201.30 us |  1,123.70 us |  1.02 |    0.18 |
| SumRandom     | 8000000  |   125,272.7 us |  2,566.78 us |  2,400.96 us | 14.24 |    1.93 |
| SumSequential | 64000000 |    67,722.2 us |    574.32 us |    479.59 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,675,995.7 us | 14,298.58 us | 11,939.97 us | 24.75 |    0.24 |

Pinned to socket 1 (taskset -c 12-23):
| Method        | N        | Mean           | Error       | StdDev      | Ratio | RatioSD |
|-------------- |--------- |---------------:|------------:|------------:|------:|--------:|
| SumSequential | 1000000  |       560.5 us |     1.47 us |     1.30 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,330.0 us |    40.50 us |    37.89 us | 13.08 |    0.07 |
| SumSequential | 8000000  |     6,961.7 us |   147.33 us |   137.82 us |  1.00 |    0.03 |
| SumRandom     | 8000000  |   124,440.5 us | 2,756.00 us | 2,577.97 us | 17.88 |    0.50 |
| SumSequential | 64000000 |    56,263.4 us |   685.72 us |   641.42 us |  1.00 |    0.02 |
| SumRandom     | 64000000 | 1,650,334.5 us | 4,652.78 us | 3,885.28 us | 29.34 |    0.33 |

The ratio at 1M is stable: 13.08–14.51×. Everything fits in L3 regardless of socket — NUMA doesn’t matter when the prefetcher keeps the pipeline full and the working set is cache-resident.

At 8M, the topology starts to show. Socket 1 reports 17.88× while unpinned and socket 0 hover around 14.2×. Sequential at 8M diverges the most: unpinned reports 9,146 us (BDN flagged bimodal distribution — thread migration mid-run), socket 0 reports 8,943 us, socket 1 reports 6,962 us. A 31% spread on the same sequential sum, same data, same binary. The difference is where the thread ran and whether it stayed there.

At 64M, the spread widens further: 24.75× (socket 0) to 29.34× (socket 1). An 18% swing in the ratio from thread placement alone. Random access times are close (~1.65–1.68s): DRAM latency dominates and both sockets pay roughly the same price. Sequential is where the sockets diverge: socket 1 finishes the sequential pass in 17% less time than socket 0 (56,263 vs 67,722 us), likely because socket 1’s memory controller has less contention from OS and runtime threads that default to socket 0.
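The percentages quoted for 8M and 64M use two different baselines, which is easy to lose track of. A sketch that recomputes them from the table values; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: two ways to express the gap between a pair of measurements.
# spread_pct: how far the larger value sits above the smaller one.
# saved_pct:  how much time the faster run saves relative to the slower one.
spread_pct() { awk -v hi="$1" -v lo="$2" 'BEGIN { printf "%.1f\n", 100 * (hi - lo) / lo }'; }
saved_pct()  { awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", 100 * (slow - fast) / slow }'; }

spread_pct 9146.4 6961.7     # sequential at 8M, unpinned vs socket 1
spread_pct 29.34 24.75       # ratio at 64M, socket 1 vs socket 0
saved_pct 67722.2 56263.4    # sequential at 64M, socket 0 vs socket 1
```

The first two calls measure relative to the smaller value; the third measures relative to the slower run, so the results are not directly comparable with each other.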

The exact ratios from the earlier section (12.88×, 19.53×, 19.45×) came from yet another run. They don’t match any of these three. That’s the point. On some runs the cliff at 8M is sharp (socket 1: 13.08× → 17.88×); on others it’s muted (unpinned: 13.65× → 14.25×, with the full impact deferred to 64M where DRAM dominates regardless of topology). Four runs, four sets of numbers, one shape: a cliff where the working set crosses the L3 boundary. Whether it lands at 8M or spreads across 8M–64M depends on thread placement and memory allocation, not on the code.

taskset and numactl aren’t exotic tools. They’re part of the measurement environment — the same environment that FTF-2 warned you about. On single-socket machines, none of this matters. On NUMA, it’s the difference between a 24.75× and a 29.34× — same code, same data, same question, different answer depending on which socket the OS picked.

The hardware checklist

Five questions hardware counters answer that benchmarks cannot:

| Question | Counter | What to look for |
|--------- |-------- |----------------- |
| Is my code cache-efficient? | `L1-dcache-load-misses`, `LLC-load-misses` | Miss rate above ~10% on this hardware suggests access pattern worth investigating |
| Is the CPU pipeline efficient? | `instructions` / `cycles` (IPC) | IPC below ~1.0 on this hardware suggests stalling on memory or branch misses |
| Is branch prediction working? | `branch-misses` / `branch-instructions` | Miss rate above ~5% on this hardware suggests unpredictable branches |
| Will this scale with data size? | Compare cache miss rates at small vs large N | Rising miss rate as N grows points toward a performance cliff |
| Where is time spent? | `perf record` + flame graph | Wide stacks indicate distributed stalls; narrow stacks indicate a hot loop |

These thresholds are priors, not axioms — useful starting points for investigation on this hardware, unverified on yours.
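As a sketch, the first three rows of the checklist reduce to three comparisons. The thresholds are the priors from the table, the inputs are the aggregate values from the perf stat comparison earlier in this post, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: apply the checklist thresholds to already-derived metrics.
# Thresholds (IPC 1.0, L1 miss 10%, branch miss 5%) are the priors from
# this post for this hardware, not universal constants.
check_counters() {
  awk -v ipc="$1" -v l1="$2" -v br="$3" 'BEGIN {
    if (ipc < 1.0) print "IPC " ipc ": possible memory or branch stalls"
    if (l1 > 10)   print "L1 miss " l1 "%: access pattern worth investigating"
    if (br > 5)    print "branch miss " br "%: unpredictable branches"
  }'
}

check_counters 1.54 11.64 0.87   # sequential, aggregate values from perf stat
check_counters 0.42 24.90 2.60   # random
```

Note that the sequential variant also trips the L1 threshold with these aggregate numbers, which is a reminder that a flag is a prompt to investigate rather than a verdict.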

When to use what

BDN suffices most of the time:

- You’re comparing two implementations and the ratio is clear (>1.5× or <0.7×)
- The result is stable across runs
- You’re making a ship/no-ship decision on a known bottleneck

Reach for perf stat when:

- Two variants show similar BDN times but you suspect different underlying behavior
- The ratio changes unexpectedly across dataset sizes
- You need to understand why something is slow, not just how much
- You want to predict scaling behavior before running the full benchmark suite

Use flame graphs when:

- perf stat says “cache misses” but you don’t know which access pattern causes them
- A complex function is slow and you need to identify the hot path
- You’re profiling an entire application, not an isolated benchmark

Run it yourself

git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/hardware-counters

# All benchmarks — 3 dataset sizes (~2 min)
dotnet run -c Release -- --filter '*'

# perf stat comparison (Linux only) — full event set matching the blog post
perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter '*Sequential*'

perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter '*Random*'

# Or use the included scripts
./Scripts/perf-stat.sh
./Scripts/run-scaling.sh

Benchmark environment

| Component       | Value |
|---------------- |------ |
| CPU             | 2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads) |
| L3 Cache        | 30 MB per socket |
| RAM             | ~115 GB DDR3-1866 (quad-channel per socket) |
| OS              | Fedora Linux 42 (kernel 6.17) |
| Runtime         | .NET 9.0.11 (RyuJIT AVX) |
| SDK             | .NET SDK 10.0.102 (targets net9.0; SDK 10 builds 9.0 apps) |
| BenchmarkDotNet | v0.14.0 |
| perf            | v6.18.6, `perf_event_paranoid=2` |
| GC              | Server GC, Concurrent (BDN enables Server GC in benchmark processes by default) |

Limitations: Different machine, different numbers. Dual-socket NUMA — thread migration can widen variance. perf stat numbers are aggregated over the full BDN process (warmup, pilot, actual iterations at all three dataset sizes), not isolated per-benchmark. Absolute counter values include BDN overhead; ratios between variants are meaningful. The L3 cache boundary (30 MB) is specific to Ivy Bridge-EP — your cache hierarchy will produce a cliff at a different dataset size. The IPC values reflect aggregate process behavior, not just the hot loop; isolated hot-loop IPC would be higher for sequential (~3.0+) and similar for random (~0.3-0.5).

Piercing through

Five posts. Five layers.

Design — what you measure. Environment — what surrounds the measurement. Data collection — how you gather it. Interpretation — what you do with the numbers. Cause — why the numbers are what they are.

Through the first four posts, the image moved steadily away from reality. Benchmark design distorted it. The environment masked the distortion. Coordinated omission replaced absent data with comfortable silence. Statistical interpretation severed the last thread connecting numbers to the thing they claimed to represent. Baudrillard’s phases of the simulacrum, played out in measurement: the image that distorts reality, the image that masks its absence, the image that bears no relation to reality at all.

perf stat pierces through. It doesn’t build another image. It reads registers that the silicon increments at every clock edge — cache miss, branch mispredict, instruction retired. Not a model of what happened. Not an abstraction of what happened. What happened, counted in hardware, whether anyone is watching or not. The first tool in five posts that measures the territory, not the map.

The series started with a lie — 27.2M ops/sec and three contradictory verdicts from the same optimization. It ends not with an answer but with a framework. Five layers, five dimensions. You don’t need to measure all of them every time. You need to know they exist, and when to reach for which one.

You have the tools. You know when to reach for which one.
