.NET on 0x3F

First Things First: Hardware Counters

Tue, 10 Mar 2026 19:00:00 +0100

12.9× slower — and that’s the easy part

Two loops over the same array. Same data. Same sum operation. One walks the array sequentially; the other uses a random permutation for indirection. BenchmarkDotNet says SumRandom is 12.88× slower at one million elements. No surprise — random memory access is slower. Everyone knows that.

But how much slower will it get when the dataset grows 64×?

BDN measures time. Time compresses everything the CPU did — cache behavior, prefetch, pipeline stalls, memory latency — into a single scalar. It answers how much. It cannot answer why. And without why, the next question — what happens when conditions change — is a guess.

The first four posts taught doubt. Design lies through omission. Environment masks distortion. Data collection coordinates with failure. Interpretation drifts from evidence. Each layer peeled back a way the measurement could mislead, and each time the tools were doing it to you while appearing to work with you.

This post goes somewhere different.

All code in this post: clone, build, run. Numbers below were measured on dual Xeon E5-2697 v2 using BenchmarkDotNet v0.14.0. Charts use milliseconds unless otherwise noted; tables reproduce BenchmarkDotNet output. BDN’s Error column is the half-width of the 99.9% confidence interval.

The setup — two paths, same operation

[Benchmark(Baseline = true)]
public long SumSequential()
{
    long sum = 0;
    long[] data = _data;
    for (int i = 0; i < data.Length; i++)
        sum += data[i];
    return sum;
}

[Benchmark]
public long SumRandom()
{
    long sum = 0;
    long[] data = _data;
    int[] indices = _indices;
    for (int i = 0; i < indices.Length; i++)
        sum += data[indices[i]];
    return sum;
}

Sequential vs random access over long[]. _indices is a Fisher-Yates shuffle of 0..N-1 — same elements, different order. Full source in companion code.

Both methods compute the same sum. Both touch every element exactly once. The only difference: the order of access. Sequential walks the array from start to end. Random jumps through a pre-shuffled index array.
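The setup that builds _indices is not shown here. A minimal sketch of a Fisher-Yates shuffle that would produce such a permutation follows; the class name, method name, and fixed seed are assumptions for illustration, not the companion code.

```csharp
using System;

public static class IndexSetup
{
    // Builds a random permutation of 0..n-1 with the Fisher-Yates shuffle.
    // The fixed seed is an assumption made here so that runs are repeatable.
    public static int[] ShuffledIndices(int n, int seed = 42)
    {
        var indices = new int[n];
        for (int i = 0; i < n; i++)
            indices[i] = i;

        var rng = new Random(seed);
        // Walk backwards; swap slot i with a uniformly chosen slot in [0, i].
        for (int i = n - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // upper bound is exclusive, so j is in [0, i]
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }
        return indices;
    }
}
```

Because the result is a permutation, every element of the data array is still read exactly once; only the order changes.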

At one million elements (8 MB of long[] — plus 4 MB of int[] indices for the random variant — both fit comfortably in the 30 MB L3 cache on Ivy Bridge-EP):

| Method        | N       | Mean       | Error    | StdDev   | Ratio | RatioSD |
|-------------- |-------- |-----------:|---------:|---------:|------:|--------:|
| SumSequential | 1000000 |   561.0 us |  1.85 us |  1.44 us |  1.00 |    0.00 |
| SumRandom     | 1000000 | 7,223.9 us | 12.74 us | 10.63 us | 12.88 |    0.04 |

12.88× slower. The confidence intervals don’t overlap. The difference is real and large. Random access is slower — water is wet. Ship the sequential version, move on.

BDN told you the what. It didn’t tell you the why. And without the why, you can’t predict what happens next.

Level 1 — perf stat: the vital signs

perf stat reads hardware performance counters — registers built into the CPU that count events like cycles, instructions, cache accesses, and cache misses. No sampling, no code instrumentation, and typically negligible overhead — the CPU increments these counters in hardware, and perf stat reads the registers at process start/stop. When you request more events than the CPU has physical counter registers, perf multiplexes (time-shares) and scales the results, which introduces estimation error — the percentages in the output below reflect this.¹
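The scaling that perf applies to a multiplexed event is a simple proportion: the raw count is multiplied by the ratio of the time the event was enabled to the time it actually occupied a counter. A small sketch of that arithmetic, with hypothetical names:

```csharp
using System;

public static class PerfScaling
{
    // Estimate the full-run count for an event that only occupied a hardware
    // counter for part of the run: raw * (time enabled / time running).
    public static double ScaledCount(ulong rawCount, ulong timeEnabledNs, ulong timeRunningNs)
    {
        if (timeRunningNs == 0) return double.NaN; // the event never got scheduled
        return rawCount * ((double)timeEnabledNs / timeRunningNs);
    }

    // Share of the enabled time during which the event was actually counted.
    public static double RunningPercent(ulong timeEnabledNs, ulong timeRunningNs)
        => timeEnabledNs == 0 ? 0.0 : 100.0 * timeRunningNs / timeEnabledNs;
}
```

An event that was on a counter for only half the run has its raw count doubled, which is accurate only if the unobserved half behaved like the observed half.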

perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter '*Sequential*'

If an event is unsupported on your CPU, perf stat reports `<not supported>` for that counter. Run perf list to see which events are available. cycles and instructions (enough to compute IPC) are widely available on modern x86 CPUs.

Run this for both variants and you get a side-by-side comparison of what the CPU was actually doing:²

| Counter                   | Sequential | Random  | What it means |
|-------------------------- |-----------:|--------:|-------------- |
| IPC (instructions/cycle)  |       1.54 |    0.42 | CPU throughput: how many instructions retire per clock cycle |
| L1 data cache miss rate   |     11.64% |  24.90% | Fraction of loads that miss the fastest cache (32 KB, ~4 cycle latency) |
| LLC load miss rate        |    53.02%* | 30.38%* | Fraction of last-level cache loads that go to DRAM (inverted due to aggregation; see note³) |
| Branch misprediction rate |      0.87% |   2.60% | Fraction of branches predicted wrong; both are low |

A caveat these numbers have earned: they are aggregated across the full BDN process — warmup, pilot, and actual iterations at all three dataset sizes (1M, 8M, 64M). They diagnose the mechanism (memory-bound vs compute-bound), not behavior at any single N. The IPC gap (1.54 vs 0.42) and the L1 miss rate gap (11.64% vs 24.90%) are directionally stable across aggregation — random access is memory-bound regardless of how you slice the data. The LLC miss rates are less trustworthy: sequential appears worse (53% vs 30%) because it runs ~3× more total iterations, and the 64M dataset dominates its LLC totals — see ³ for details. The prediction in Level 3 rests on the mechanism (memory-bound + L3-dependent), not on the exact percentages.

The first four posts taught methodical doubt. What to trust? What to distrust? How deep does the distortion go? Descartes reached for the cogito — the one thing doubt couldn’t dissolve. Here the descent through software abstraction reaches something similar. perf stat doesn’t measure time. It doesn’t measure abstractions. It reads registers that the silicon increments whether anyone is watching or not. The counters exist at the boundary where software models end and physics begins. Doubt doesn’t end in nihilism. It ends in firmer ground.

IPC is the headline. Sequential executes 1.54 instructions per cycle. Random executes 0.42. The CPU is 3.7× more productive on sequential access on this Ivy Bridge-EP — not because it runs different instructions, but because it doesn’t stall. The hardware prefetcher detects the sequential stride, fetches cache lines ahead of the loop, and the data is waiting in L1 before the load instruction executes.⁴
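The derived figures in the table are plain ratios of raw counts. A minimal sketch of that arithmetic follows; CounterMath and its method names are invented for illustration, and perf stat already prints IPC itself as "insn per cycle".

```csharp
using System;

public static class CounterMath
{
    // Instructions retired per clock cycle.
    public static double Ipc(ulong instructions, ulong cycles)
        => cycles == 0 ? 0.0 : (double)instructions / cycles;

    // Misses as a percentage of loads, e.g. L1-dcache-load-misses / L1-dcache-loads.
    public static double MissRatePercent(ulong misses, ulong loads)
        => loads == 0 ? 0.0 : 100.0 * misses / loads;
}
```

Dividing the two IPC values from the table (1.54 / 0.42) gives the 3.7× figure quoted above.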

Random access defeats the prefetcher. Every load is a surprise. The CPU issues the load, waits 10-40 cycles for L2/L3, and the pipeline stalls. The instructions are the same — the wait is different.

BDN said “12.88× slower.” The aggregate hardware counters² reveal the mechanism: the CPU is stalling on cache misses. The instructions aren’t slower — they’re waiting. And waiting scales with memory latency, which scales with working set size.

That’s the basis for a prediction.

Level 2 — Flame graphs: the shape of time

Hardware counters tell you what the CPU is doing — stalling on cache misses, mispredicting branches. Flame graphs tell you where the cost concentrates in the code path.⁵

A flame graph is a visualization of stack traces sampled by perf record. The x-axis is not time — it’s the population of samples. Wider frames mean more time spent in that function.
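To make "population of samples" concrete, here is a rough sketch of what the collapsing step does conceptually: identical stacks are counted, and a frame's width is its share of all samples. The folded format (frames joined by ';', root first) matches what stackcollapse-perf.pl emits; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FoldedStacks
{
    // Counts identical stacks. Each sample is one call stack written
    // root-first with frames joined by ';'.
    public static Dictionary<string, int> Collapse(IEnumerable<string> samples)
    {
        var counts = new Dictionary<string, int>();
        foreach (string stack in samples)
            counts[stack] = counts.TryGetValue(stack, out int c) ? c + 1 : 1;
        return counts;
    }

    // Fraction of all samples whose stack contains the given frame:
    // the combined relative width of that frame across the graph.
    public static double WidthOf(string frame, IReadOnlyDictionary<string, int> counts)
    {
        int total = counts.Values.Sum();
        if (total == 0) return 0.0;
        int hits = counts.Where(kv => kv.Key.Split(';').Contains(frame))
                         .Sum(kv => kv.Value);
        return (double)hits / total;
    }
}
```

If three of four samples land in a stack ending in SumRandom, that frame spans 75% of the width, regardless of when during the run those samples were taken.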

# Record stack traces at 99 Hz (standalone runner, no BDN overhead)
perf record -g -F 99 --call-graph dwarf -- \
    dotnet run -c Release -- perf-sequential 8000000

# Convert to flame graph SVG
perf script | stackcollapse-perf.pl | flamegraph.pl > sequential.svg

Sequential — one hot column, tight loop, no stalls wide enough to sample:

Random — wider, flatter. The hot loop is still there, but the sampled stacks spread more broadly around it, consistent with the CPU spending more time waiting on the memory subsystem:

perf stat diagnosed the disease. The flame graph shows the hot path around it. Sequential’s samples concentrate in one tight column — the loop body runs, the prefetcher feeds it, the pipeline stays full. Random spreads wider — same loop body, but more of the sampled time accumulates in and around that path while the core waits for data. The structure of time, not just the quantity of it.

Three tools, three levels:

| Tool        | Question         | Answer |
|------------ |----------------- |------- |
| BDN         | How much slower? | 12.88× (at 1M) |
| perf stat   | Why?             | IPC 0.42 vs 1.54: cache miss stalls |
| Flame graph | Where?           | The hot path around the inner loop |

Level 3 — The prediction

This is where hardware counters do something BDN cannot. They don’t just explain the past. They make the future falsifiable.

At one million elements, sequential walks 8 MB of long[]. Random also loads a 4 MB int[] index array, bringing its working set to 12 MB. The L3 cache on this Xeon E5-2697 v2 is 30 MB — everything fits. Random access is slow because it misses L1 and L2 — but it hits L3. L3 latency is ~30 cycles. Bad, but bounded.

The hypothesis: if random access is memory-bound — the aggregate counters showed IPC 0.42 and L1 miss rate 24.90%, diagnosing the mechanism even though they span all dataset sizes² — and the current performance depends on L3 absorbing those misses, then exceeding L3 capacity will force misses to DRAM at ~200 cycles. The ratio should jump.

At 8 million elements, the data array alone is 64 MB. With the 32 MB index array, random’s working set reaches 96 MB — well beyond the 30 MB L3. At 64 million elements (512 MB data + 256 MB indices), there’s no question. Random access now predominantly misses all cache levels and goes to DRAM.
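The working-set arithmetic behind the hypothesis can be written down directly. This is a sketch with hypothetical names; it treats the post's 30 MB L3 figure as 30 × 1024 × 1024 bytes and ignores everything else that competes for cache, such as the stack or runtime data.

```csharp
using System;

public static class WorkingSet
{
    // 30 MB of L3 per socket, as listed for the Xeon E5-2697 v2 in this post.
    public const long L3Bytes = 30L * 1024 * 1024;

    // Bytes touched by one pass: the long[] data, plus the int[] index
    // array when measuring the random variant.
    public static long Bytes(long n, bool randomVariant)
        => n * sizeof(long) + (randomVariant ? n * sizeof(int) : 0);

    public static bool FitsInL3(long n, bool randomVariant)
        => Bytes(n, randomVariant) <= L3Bytes;
}
```

For the random variant this gives 12,000,000 bytes at 1M (fits) and 96,000,000 bytes at 8M (does not), which is where the prediction places the cliff.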

This is a falsifiable prediction. Not a statistical extrapolation from benchmark numbers. A deduction from cache architecture, informed by hardware counters that revealed the mechanism. Run it. See what happens.

| Method        | N        | Mean           | Error       | StdDev      | Ratio | RatioSD |
|-------------- |--------- |---------------:|------------:|------------:|------:|--------:|
| SumSequential | 1000000  |       561.0 us |     1.85 us |     1.44 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,223.9 us |    12.74 us |    10.63 us | 12.88 |    0.04 |
|               |          |                |             |             |       |         |
| SumSequential | 8000000  |     6,434.3 us |    51.43 us |    48.11 us |  1.00 |    0.01 |
| SumRandom     | 8000000  |   125,635.5 us | 2,319.67 us | 2,056.33 us | 19.53 |    0.34 |
|               |          |                |             |             |       |         |
| SumSequential | 64000000 |    82,974.9 us |   933.52 us |   728.83 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,613,935.5 us | 3,277.61 us | 2,736.96 us | 19.45 |    0.17 |

Sequential scales smoothly but super-linearly. 8× more data yields ~11.5× more time; 64× more data yields ~148× more time.⁶ The extra factor comes from the L3 boundary: at 1M (8 MB), sequential reads hit L3 at ~30 cycle latency. At 8M+ (64 MB+), the prefetcher must pull from DRAM (~200 cycles). Even within the DRAM-resident range (8M to 64M), scaling is ~12.9× for 8× data — still super-linear, likely due to TLB pressure at large working sets. The prefetcher hides most of the latency increase — but not all of it.
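The scaling factors quoted above are ratios of the Mean column. A small sketch of that arithmetic, with hypothetical helper names:

```csharp
using System;

public static class Scaling
{
    // How many times longer the larger run took than the smaller one.
    public static double TimeFactor(double smallMeanUs, double largeMeanUs)
        => largeMeanUs / smallMeanUs;

    // Time growth divided by data growth; 1.0 would be perfectly linear scaling.
    public static double SuperLinearity(double timeFactor, double dataFactor)
        => timeFactor / dataFactor;
}
```

With the sequential means from the table, TimeFactor(561.0, 6434.3) comes out near 11.5, and dividing by the 8× data growth leaves roughly 1.4× of slowdown that the data size alone does not explain.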

Random hits a cliff. In this run, the ratio jumps from ~13× at 1M to ~19.5× at 8M — about a 50% degradation on this dual-socket NUMA system.⁷ Beyond that, it stays in the same rough range rather than snapping back. The cliff happened between 1M and 8M, exactly where the working set crossed the L3 boundary. The exact ratios will differ on your hardware — the cliff at the L3 boundary won’t.

The prediction survived the test — on this hardware, on this run.

Popper, Logik der Forschung (1934)⁸: a falsifiable prediction distinguishes science from storytelling. “Random access is memory-bound (low IPC, high L1 miss rate). The working set fits L3 at 1M. At 8M, it won’t. The ratio will jump.” Run it. In this run, the ratio jumps from ~13× to ~19.5×. The exact numbers are unstable — dual-socket NUMA, thread migration, prefetcher heuristics all shift them.⁷ The mechanism isn’t. The theory wasn’t adjusted after the fact. It was stated before the data, derived from the cache hierarchy, and the data confirmed the shape — a cliff at the L3 boundary. That’s not extrapolation from a benchmark number. That’s deduction from architecture.

Through the first four posts, every tool revealed its distortion after the damage was done. Post-factum. Reactive. Hardware counters are the first tool in this series that generates a falsifiable hypothesis before the benchmark runs. Not a better explanation of the past — a testable claim about the future.

NUMA — where the numbers shift and the shape doesn’t

This machine has two sockets. Two Xeon E5-2697 v2, each with its own 30 MB L3 cache, its own memory controller, its own DRAM. When a thread runs on socket 0 and accesses memory allocated on socket 1, the load crosses the QPI interconnect — ~40 ns extra latency. When the OS migrates a thread between sockets mid-benchmark, the prefetcher resets, the L1/L2 are cold, and the next few thousand loads hit DRAM instead of cache.

BDN doesn’t know which socket it’s running on. It reports a single number. On dual-socket NUMA, that number carries noise from topology that has nothing to do with the code being measured.

Three runs: unpinned (OS schedules freely), pinned to socket 0 (taskset -c 0-11), pinned to socket 1 (taskset -c 12-23). Same binary, same data, same benchmark. Different answers.

Unpinned (OS schedules freely):
| Method        | N        | Mean           | Error        | StdDev       | Ratio | RatioSD |
|-------------- |--------- |---------------:|-------------:|-------------:|------:|--------:|
| SumSequential | 1000000  |       555.0 us |      1.26 us |      1.12 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,577.9 us |    181.88 us |    151.88 us | 13.65 |    0.26 |
| SumSequential | 8000000  |     9,146.4 us |  1,431.15 us |  1,338.70 us |  1.02 |    0.21 |
| SumRandom     | 8000000  |   127,720.5 us |  1,591.92 us |  1,329.32 us | 14.25 |    2.03 |
| SumSequential | 64000000 |    65,306.5 us |    522.56 us |    488.81 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,665,816.7 us | 35,695.93 us | 33,389.99 us | 25.51 |    0.53 |

Pinned to socket 0 (taskset -c 0-11):
| Method        | N        | Mean           | Error        | StdDev       | Ratio | RatioSD |
|-------------- |--------- |---------------:|-------------:|-------------:|------:|--------:|
| SumSequential | 1000000  |       564.9 us |      6.22 us |      5.51 us |  1.00 |    0.01 |
| SumRandom     | 1000000  |     8,199.0 us |    596.29 us |    557.77 us | 14.51 |    0.97 |
| SumSequential | 8000000  |     8,942.9 us |  1,201.30 us |  1,123.70 us |  1.02 |    0.18 |
| SumRandom     | 8000000  |   125,272.7 us |  2,566.78 us |  2,400.96 us | 14.24 |    1.93 |
| SumSequential | 64000000 |    67,722.2 us |    574.32 us |    479.59 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,675,995.7 us | 14,298.58 us | 11,939.97 us | 24.75 |    0.24 |

Pinned to socket 1 (taskset -c 12-23):
| Method        | N        | Mean           | Error       | StdDev      | Ratio | RatioSD |
|-------------- |--------- |---------------:|------------:|------------:|------:|--------:|
| SumSequential | 1000000  |       560.5 us |     1.47 us |     1.30 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,330.0 us |    40.50 us |    37.89 us | 13.08 |    0.07 |
| SumSequential | 8000000  |     6,961.7 us |   147.33 us |   137.82 us |  1.00 |    0.03 |
| SumRandom     | 8000000  |   124,440.5 us | 2,756.00 us | 2,577.97 us | 17.88 |    0.50 |
| SumSequential | 64000000 |    56,263.4 us |   685.72 us |   641.42 us |  1.00 |    0.02 |
| SumRandom     | 64000000 | 1,650,334.5 us | 4,652.78 us | 3,885.28 us | 29.34 |    0.33 |

The ratio at 1M is stable: 13.08–14.51×. Everything fits in L3 regardless of socket — NUMA doesn’t matter when the prefetcher keeps the pipeline full and the working set is cache-resident.

At 8M, the topology starts to show. Socket 1 reports 17.88× while unpinned and socket 0 hover around 14.2×. Sequential at 8M diverges the most: unpinned reports 9,146 us (BDN flagged bimodal distribution — thread migration mid-run), socket 0 reports 8,943 us, socket 1 reports 6,962 us. A 31% spread on the same sequential sum, same data, same binary. The difference is where the thread ran and whether it stayed there.

At 64M, the spread widens further: 24.75× (socket 0) to 29.34× (socket 1). An 18% swing in the ratio from thread placement alone. Random access times are close (~1.65–1.68s) — DRAM latency dominates and both sockets pay roughly the same price. Sequential is where the sockets diverge: socket 1 runs sequential 17% faster than socket 0 (56,263 vs 67,722 us), likely because socket 1’s memory controller has less contention from OS and runtime threads that default to socket 0.

The exact ratios from the earlier section (12.88×, 19.53×, 19.45×) came from yet another run. They don’t match any of these three. That’s the point. On some runs the cliff at 8M is sharp (socket 1: 13.08× → 17.88×); on others it’s muted (unpinned: 13.65× → 14.25×, with the full impact deferred to 64M where DRAM dominates regardless of topology). Four runs, four sets of numbers, one shape: a cliff where the working set crosses the L3 boundary. Whether it lands at 8M or spreads across 8M–64M depends on thread placement and memory allocation, not on the code.

taskset and numactl aren’t exotic tools. They’re part of the measurement environment — the same environment that FTF-2 warned you about. On single-socket machines, none of this matters. On NUMA, it’s the difference between a 24.75× and a 29.34× — same code, same data, same question, different answer depending on which socket the OS picked.
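The core ranges passed to taskset correspond to a CPU bitmask. As a rough sketch, here is a helper (hypothetical name) that builds that mask from a range. It assumes, as the taskset ranges in this post do, that logical CPUs 0-11 sit on socket 0 and 12-23 on socket 1; other machines number their CPUs differently, so check lscpu first.

```csharp
using System;

public static class Affinity
{
    // Bit i set means logical CPU i is allowed. Valid for CPU numbers 0..63.
    public static long MaskForRange(int firstCpu, int lastCpu)
    {
        if (firstCpu < 0 || lastCpu > 63 || firstCpu > lastCpu)
            throw new ArgumentOutOfRangeException(nameof(firstCpu), "expected 0 <= first <= last <= 63");

        long mask = 0;
        for (int cpu = firstCpu; cpu <= lastCpu; cpu++)
            mask |= 1L << cpu;
        return mask;
    }
}
```

MaskForRange(0, 11) is 0xFFF, the same set of CPUs that taskset -c 0-11 selects; taskset also accepts the mask form directly.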

The hardware checklist

Five questions hardware counters answer that benchmarks cannot:

| Question | Counter | What to look for |
|--------- |-------- |----------------- |
| Is my code cache-efficient? | `L1-dcache-load-misses`, `LLC-load-misses` | Miss rate above ~10% on this hardware suggests an access pattern worth investigating |
| Is the CPU pipeline efficient? | `instructions` / `cycles` (IPC) | IPC below ~1.0 on this hardware suggests stalling on memory or branch misses |
| Is branch prediction working? | `branch-misses` / `branch-instructions` | Miss rate above ~5% on this hardware suggests unpredictable branches |
| Will this scale with data size? | Compare cache miss rates at small vs large N | Rising miss rate as N grows points toward a performance cliff |
| Where is time spent? | `perf record` + flame graph | Wide stacks indicate distributed stalls; narrow stacks indicate a hot loop |

These thresholds are priors, not axioms — useful starting points for investigation on this hardware, unverified on yours.

When to use what

BDN suffices most of the time:

- You’re comparing two implementations and the ratio is clear (>1.5× or <0.7×)
- The result is stable across runs
- You’re making a ship/no-ship decision on a known bottleneck

Reach for perf stat when:

- Two variants show similar BDN times but you suspect different underlying behavior
- The ratio changes unexpectedly across dataset sizes
- You need to understand why something is slow, not just how much
- You want to predict scaling behavior before running the full benchmark suite

Use flame graphs when:

- perf stat says “cache misses” but you don’t know which access pattern causes them
- A complex function is slow and you need to identify the hot path
- You’re profiling an entire application, not an isolated benchmark

Run it yourself

git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/hardware-counters

# All benchmarks — 3 dataset sizes (~2 min)
dotnet run -c Release -- --filter '*'

# perf stat comparison (Linux only) — full event set matching the blog post
perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter '*Sequential*'

perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter '*Random*'

# Or use the included scripts
./Scripts/perf-stat.sh
./Scripts/run-scaling.sh

Benchmark environment

| Component       | Value |
|---------------- |------ |
| CPU             | 2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads) |
| L3 Cache        | 30 MB per socket |
| RAM             | ~115 GB DDR3-1866 (quad-channel per socket) |
| OS              | Fedora Linux 42 (kernel 6.17) |
| Runtime         | .NET 9.0.11 (RyuJIT AVX) |
| SDK             | .NET SDK 10.0.102 (targets net9.0; SDK 10 builds 9.0 apps) |
| BenchmarkDotNet | v0.14.0 |
| perf            | v6.18.6, `perf_event_paranoid=2` |
| GC              | Server GC, Concurrent (BDN enables Server GC in benchmark processes by default) |

Limitations: Different machine, different numbers. Dual-socket NUMA — thread migration can widen variance. perf stat numbers are aggregated over the full BDN process (warmup, pilot, actual iterations at all three dataset sizes), not isolated per-benchmark. Absolute counter values include BDN overhead; ratios between variants are meaningful. The L3 cache boundary (30 MB) is specific to Ivy Bridge-EP — your cache hierarchy will produce a cliff at a different dataset size. The IPC values reflect aggregate process behavior, not just the hot loop; isolated hot-loop IPC would be higher for sequential (~3.0+) and similar for random (~0.3-0.5).

Piercing through

Five posts. Five layers.

Design — what you measure. Environment — what surrounds the measurement. Data collection — how you gather it. Interpretation — what you do with the numbers. Cause — why the numbers are what they are.

Through the first four posts, the image moved steadily away from reality. Benchmark design distorted it. The environment masked the distortion. Coordinated omission replaced absent data with comfortable silence. Statistical interpretation severed the last thread connecting numbers to the thing they claimed to represent. Baudrillard’s phases of the simulacrum, played out in measurement: the image that distorts reality, the image that masks its absence, the image that bears no relation to reality at all.

perf stat pierces through. It doesn’t build another image. It reads registers that the silicon increments at every clock edge — cache miss, branch mispredict, instruction retired. Not a model of what happened. Not an abstraction of what happened. What happened, counted in hardware, whether anyone is watching or not. The first tool in five posts that measures the territory, not the map.

The series started with a lie — 27.2M ops/sec and three contradictory verdicts from the same optimization. It ends not with an answer but with a framework. Five layers, five dimensions. You don’t need to measure all of them every time. You need to know they exist, and when to reach for which one.

You have the tools. You know when to reach for which one.

First Things First: Statistics That Matter

Fri, 06 Mar 2026 18:00:00 +0100

3% slower. Ship it.

Two filter variants over 20 million integers. Five benchmark iterations. FilterTernary: 26.11 ms. FilterBranch: 25.30 ms. The ternary is 3% slower. PR description writes itself. Merge. Deploy.

Next day, rollback. Regression in production — on hardware where the difference vanishes, on data where it reverses.

Design fixed. Environment defended. Data collected honestly. The benchmark is solid. The number is real. The interpretation is not.

All code in this post: clone, build, run. Numbers below were measured on dual Xeon E5-2697 v2 using BenchmarkDotNet v0.14.0, pinned to a single NUMA node — run the companion code on your hardware for your own results. Different machine, different numbers.

Convention: charts use milliseconds unless otherwise noted; tables reproduce BenchmarkDotNet output. BDN’s Error column is the half-width of the 99.9% confidence interval.

The number is the answer

[Benchmark(Baseline = true)]
public long FilterBranch()
{
    long sum = 0;
    int[] data = _data;
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] > 0)
            sum += data[i];
    }
    return sum;
}

[Benchmark]
public long FilterTernary()
{
    long sum = 0;
    int[] data = _data;
    for (int i = 0; i < data.Length; i++)
    {
        int v = data[i];
        sum += v > 0 ? v : 0;
    }
    return sum;
}

Two filter variants over 20M integers (~95% positive). Full source in companion code.

Every benchmarking tutorial ends here: compare two means, pick the lower one. FilterTernary = 26.11 ms, FilterBranch = 25.30 ms — 3% difference. The ternary loses.

How many times did you run it?

Layer 1 — Confidence intervals eat your win

BenchmarkDotNet doesn’t just give you a mean. It gives you Mean ± Error — where Error is the half-width of the 99.9% confidence interval, computed using a Student’s t-distribution with n-1 degrees of freedom.¹

The 5-iteration run — the one that said “3% slower”:

| Method        | N        | Mean     | Error    | StdDev   | Ratio | RatioSD |
|-------------- |--------- |---------:|---------:|---------:|------:|--------:|
| FilterBranch  | 20000000 | 25.30 ms | 0.408 ms | 0.063 ms |  1.00 |    0.00 |
| FilterTernary | 20000000 | 26.11 ms | 2.624 ms | 0.681 ms |  1.03 |    0.02 |

The 99.9% CI for FilterBranch: 25.30 ± 0.408 ms → [24.89, 25.71]. For FilterTernary: 26.11 ± 2.624 ms → [23.49, 28.73]. FilterBranch’s entire range sits inside FilterTernary’s confidence interval. The “3% slower” could be a scheduling hiccup. Five iterations cannot tell you that.
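The interval comparison is easy to mechanize. A minimal sketch follows, with a hypothetical Interval type built from BDN's Mean and Error columns:

```csharp
using System;

public readonly record struct Interval(double Low, double High)
{
    // Mean ± Error, where Error is the half-width of the confidence interval.
    public static Interval FromMeanError(double mean, double error)
        => new(mean - error, mean + error);

    // True when the two intervals share at least one point.
    public bool Overlaps(Interval other) => Low <= other.High && other.Low <= High;

    // True when the other interval lies entirely inside this one.
    public bool Contains(Interval other) => Low <= other.Low && other.High <= High;
}
```

For the 5-iteration table, Interval.FromMeanError(26.11, 2.624).Contains(Interval.FromMeanError(25.30, 0.408)) is true: the whole FilterBranch interval sits inside FilterTernary's.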

You know this from Part 1. Overlapping CIs, unresolved difference. Run more iterations.

Twenty iterations:

| Method        | N        | Mean     | Error    | StdDev   | Ratio |
|-------------- |--------- |---------:|---------:|---------:|------:|
| FilterBranch  | 20000000 | 25.25 ms | 0.173 ms | 0.177 ms |  1.00 |
| FilterTernary | 20000000 | 25.64 ms | 0.111 ms | 0.109 ms |  1.02 |

The 99.9% CI for FilterBranch: [25.08, 25.42]. For FilterTernary: [25.53, 25.75]. No overlap. A manual Welch t-test on this data gives p < 0.001.² The difference is real.
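For reference, a sketch of the Welch calculation from summary statistics. The helper names are hypothetical, and using n = 20 for both methods is an assumption: BDN may discard outliers, so the effective sample sizes can be smaller than the configured iteration count.

```csharp
using System;

public static class Welch
{
    // Welch's t statistic for two samples with unequal variances.
    public static double TStatistic(double mean1, double sd1, int n1,
                                    double mean2, double sd2, int n2)
    {
        double standardError = Math.Sqrt(sd1 * sd1 / n1 + sd2 * sd2 / n2);
        return (mean1 - mean2) / standardError;
    }

    // Welch-Satterthwaite approximation of the degrees of freedom.
    public static double DegreesOfFreedom(double sd1, int n1, double sd2, int n2)
    {
        double a = sd1 * sd1 / n1;
        double b = sd2 * sd2 / n2;
        return (a + b) * (a + b) / (a * a / (n1 - 1) + b * b / (n2 - 1));
    }
}
```

With the values from the table, TStatistic(25.64, 0.109, 20, 25.25, 0.177, 20) is about 8.4 on roughly 31.6 degrees of freedom. The two-sided critical value for p = 0.001 near 30 degrees of freedom is roughly 3.6, so the statistic clears it with room to spare.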

FilterTernary is 2% slower. The 5-iteration run saw the right direction but had no basis to trust it — the CI was so wide it could not separate signal from noise.

The Error on FilterTernary dropped from ±2.6 ms to ±0.1 ms. An order of magnitude. More iterations, sure. But .NET’s JIT compiles in tiers: Tier-0 (quick, unoptimized) on first calls, Tier-1 (full optimization) after enough invocations. If BDN’s warmup didn’t fully promote both methods, the 5-iteration run might have caught Tier-0 code while the 20-iteration run measured Tier-1. Different machine code, different variance profile.

Worth checking. Expand the ternary first:

// FilterBranch
if (data[i] > 0)
    sum += data[i];

// FilterTernary — expand v > 0 ? v : 0
if (v > 0) sum += v;
else        sum += 0;

The branch skips. The ternary always adds — even zero. Structurally different operations.

[DisassemblyDiagnoser] (introduced in Enemy 6) on the benchmark class dumps the native code: run the benchmark, then check BenchmarkDotNet.Artifacts/results/*-asm.md. Five iterations:

; FilterBranch — 54 bytes
M00_L00:
       mov       edi,[rcx]        ; load data[i]
       test      edi,edi          ; data[i] > 0?
       jle       short M00_L01    ; skip if not
       movsxd    rdi,edi          ; sign-extend to 64-bit
       add       rax,rdi          ; sum += data[i]
M00_L01:
       add       rcx,4            ; i++
       dec       edx
       jne       short M00_L00

; FilterTernary — 58 bytes
M00_L00:
       mov       edi,[rcx]        ; load v = data[i]
       test      edi,edi          ; v > 0?
       jle       short M00_L03    ; if not, jump to zero path
M00_L01:
       movsxd    rdi,edi          ; sign-extend
       add       rax,rdi          ; sum += v (or sum += 0)
       add       rcx,4            ; i++
       dec       edx
       jne       short M00_L00
; ...
M00_L03:
       xor       edi,edi          ; v = 0
       jmp       short M00_L01    ; jump back to add

Twenty iterations:

; FilterBranch — 54 bytes
M00_L00:
       mov       edi,[rcx]
       test      edi,edi
       jle       short M00_L01
       movsxd    rdi,edi
       add       rax,rdi
M00_L01:
       add       rcx,4
       dec       edx
       jne       short M00_L00

; FilterTernary — 58 bytes
M00_L00:
       mov       edi,[rcx]
       test      edi,edi
       jle       short M00_L03
M00_L01:
       movsxd    rdi,edi
       add       rax,rdi
       add       rcx,4
       dec       edx
       jne       short M00_L00
; ...
M00_L03:
       xor       edi,edi
       jmp       short M00_L01

Identical machine code. Both runs. The Error dropped because more iterations and lower observed variance both narrowed the confidence interval. BDN’s Error is t(0.0005, n−1) × StdDev / √n — StdDev for FilterTernary fell from 0.681 ms to 0.109 ms (6×), and the larger sample brought a smaller t-value and larger √n. The variance reduction did most of the work.
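The half-width formula can be checked against the table by hand. In this sketch (hypothetical names) the critical t value is a parameter, because the .NET base class library has no t-distribution; 8.610 is the standard table value for 4 degrees of freedom at the 99.9% two-sided level.

```csharp
using System;

public static class ErrorColumn
{
    // Half-width of a confidence interval: tCritical * StdDev / sqrt(n).
    // tCritical must be looked up for the chosen confidence level and
    // n - 1 degrees of freedom.
    public static double HalfWidth(double tCritical, double stdDev, int n)
    {
        if (n < 2)
            throw new ArgumentOutOfRangeException(nameof(n), "need at least two measurements");
        return tCritical * stdDev / Math.Sqrt(n);
    }
}
```

HalfWidth(8.610, 0.681, 5) comes out at about 2.62 ms, in line with the 2.624 ms Error reported for FilterTernary in the 5-iteration run.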

A number without error bars is an opinion. Five iterations produced CIs so wide that either outcome fit the data. Twenty produced CIs narrow enough to separate signal from noise — not certainty, but 99.9% confidence that FilterBranch is faster on this hardware. If you had shipped after five, you’d have deployed a guess as a conclusion.

CI answers one question: does a difference exist? It says nothing about whether the difference matters.

Layer 2 — Effect size: when “significant” doesn’t mean “meaningful”

The 20-iteration result says FilterTernary is 2% slower. The CIs don’t overlap. The difference is statistically real. But 0.4 ms on a 25 ms operation over 20 million integers. Is that worth changing the code?

Statistical significance asks does a difference exist? Practical significance asks does it matter? BDN answers the first. You answer the second.

Cohen’s d — the standardized effect size — measures the distance between two means in units of the pooled standard deviation:³

d = |mean_1 - mean_2| / pooled SD

public static double CohensD(double mean1, double stdDev1, double mean2, double stdDev2)
{
    double pooledSd = Math.Sqrt((stdDev1 * stdDev1 + stdDev2 * stdDev2) / 2.0);
    if (pooledSd == 0) return 0;
    return Math.Abs(mean1 - mean2) / pooledSd;
}

Cohen’s d computation — full source in Analysis/StatisticalReport.cs.

Cohen’s d for FilterBranch vs FilterTernary: |25.25 - 25.64| / sqrt((0.177^2 + 0.109^2)/2) = 0.39 / 0.147 = 2.65. By the standard thresholds (0.2 = small, 0.5 = medium, 0.8 = large), that’s a “large” effect.

But 2.65 for a 2% difference? Something is off.

The threshold trap

Cohen’s d thresholds were calibrated for psychology experiments where within-group variance is naturally high. BenchmarkDotNet’s within-run variance is very low in controlled microbenchmarks — sub-1% coefficient of variation for compute-bound loops. When the denominator (pooled SD) is tiny, even a trivial mean difference produces a massive d.

Three pairs from the companion code:

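| Pair                          | Ratio | Practical delta | Cohen’s d | “Interpretation” |
|------------------------------ |------:|----------------:|----------:|----------------- |
| FilterBranch vs FilterTernary |  1.02 |              2% |      2.65 | “large”          |
| SumArray vs SumSpan           |  1.01 |            0.5% |      1.98 | “large”          |
| SearchLinear vs SearchBinary  | 0.001 |          1,071x |       368 | “large”          |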

All three “large” by Cohen’s thresholds. Only one is a meaningful optimization. Wittgenstein (1953): meaning is use — a word means what it means in the language game where it was coined. Cohen’s thresholds were coined in a game where within-group variance is high and effect sizes are modest. Microbenchmarking is a different game — sub-1% coefficient of variation, deterministic loops, controlled environments. “Large” means something in psychology. The standard interpretation becomes misleading when BDN’s precision makes the denominator vanishingly small. A 0.5% difference and a 1,071x difference land in the same bucket.

Popper (1934): a hypothesis survives by resisting falsification, not by accumulating confirmation. “3% faster” is a hypothesis. Non-overlapping CIs survived the first test — the difference exists. But Cohen’s d at 2.65 for a 2% change is the hypothesis flattering itself. The effect size, on BDN’s terrain, does not survive scrutiny. Seek the conditions under which the claim fails, not the ones where it holds.

For microbenchmarks, rely primarily on BDN’s Ratio column rather than Cohen’s d. Ratio ~ 1.00 means “no practical difference.” Ratio ~ 0.001 means “algorithmic change.” Whether 2% matters depends on context — a hot loop called billions of times, or a function called once per request. Define your threshold before you run.

Two extremes

Small practical effect — array indexing vs Span indexing over 1M integers:

| Method   | Categories  | N       | Mean     | Error   | StdDev  | Ratio |
|--------- |------------ |-------- |---------:|--------:|--------:|------:|
| SumArray | SmallEffect | 1000000 | 512.7 us | 1.16 us | 1.19 us |  1.00 |
| SumSpan  | SmallEffect | 1000000 | 515.3 us | 1.28 us | 1.42 us |  1.01 |

Ratio = 1.01. The JIT produces nearly identical code for both: bounds-check elimination applies to int[] and ReadOnlySpan<int> alike on .NET 9. The 2.6 us difference (0.5%) is likely real, because the 99.9% CIs (512.7 ± 1.16 us vs 515.3 ± 1.28 us) don’t overlap, which is a conservative indicator. It is still not worth a code change.
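The two-step reading above (is the difference real, and is it big enough to act on) can be sketched as a small check. This is an illustration only: the 2% practical threshold is a hypothetical value chosen for the example, not a number from this post, and the function name is made up.

```csharp
using System;

// True when the intervals [mean - error, mean + error] share at least one point.
static bool IntervalsOverlap(double meanA, double errorA, double meanB, double errorB)
{
    double lowA = meanA - errorA, highA = meanA + errorA;
    double lowB = meanB - errorB, highB = meanB + errorB;
    return lowA <= highB && lowB <= highA;
}

// SumArray vs SumSpan: Mean and Error columns from the table above, in us.
bool overlap = IntervalsOverlap(512.7, 1.16, 515.3, 1.28);

// Hypothetical practical threshold: ignore anything under 2% (assumption for this sketch).
const double practicalThreshold = 0.02;
double relativeDelta = Math.Abs(515.3 - 512.7) / 512.7;
bool worthAChange = !overlap && relativeDelta >= practicalThreshold;

Console.WriteLine($"CIs overlap: {overlap}, delta: {relativeDelta:P2}, actionable: {worthAChange}");
```

Non-overlap is only the screening step; the threshold is what decides whether the result justifies touching the code.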

Large practical effect — linear search vs binary search over 1M integers:

| Method       | Categories  | N       | Mean         | Error    | StdDev   | Ratio |
|------------- |------------ |-------- |-------------:|---------:|---------:|------:|
| SearchLinear | LargeEffect | 1000000 | 248,303.3 us | 928.6 us | 953.6 us | 1.000 |
| SearchBinary | LargeEffect | 1000000 |     231.8 us |   1.5 us |   1.7 us | 0.001 |

Ratio = 0.001. O(n) vs O(log n). An algorithmic change — not a JIT quirk, not a cache alignment artifact. 1,071x faster on this hardware. The algorithmic advantage holds on any platform with sorted data, though the exact multiplier will vary.

A number with error bars but no effect size is only half an answer.

Layer 3 — Micro vs macro: right question, wrong scale

A microbenchmark isolates a function. A macrobenchmark places it inside a pipeline. They answer different questions — and the answers disagree.

// Micro: isolated lookup — Dictionary vs linear search over 10,000 elements
[BenchmarkCategory("Micro")]
[Benchmark(Baseline = true)]
public int LookupLinear()
{
    int found = 0;
    for (int i = 0; i < _searchKeys.Length; i++)
    {
        if (Array.IndexOf(_data, _searchKeys[i]) >= 0)
            found++;
    }
    return found;
}

[BenchmarkCategory("Micro")]
[Benchmark]
public int LookupDictionary()
{
    int found = 0;
    for (int i = 0; i < _searchKeys.Length; i++)
    {
        if (_dict.ContainsKey(_searchKeys[i]))
            found++;
    }
    return found;
}

Microbenchmark — isolated lookup comparison over 200 search keys. Full source in companion code.

| Method           | Categories | Mean       | Error    | StdDev   | Ratio |
|----------------- |----------- |-----------:|---------:|---------:|------:|
| LookupLinear     | Micro      | 412.089 us | 1.609 us | 1.788 us | 1.000 |
| LookupDictionary | Micro      |   1.571 us | 0.012 us | 0.014 us | 0.004 |

Dictionary is 262x faster. Ship it?

The lookup lives inside a pipeline:

[Benchmark(Baseline = true)]
public long PipelineLinear()
{
    long v = ValidateArray(_workload);     // ~40% — sequential scan, 3M elements
    long t = PolynomialTransform(_workload); // ~40% — multiply/add/xor, 3M elements
    int  l = LookupAllLinear(_data, _searchKeys); // ~6% — 200 keys × Array.IndexOf
    long a = Aggregate(_workload);          // ~15% — weighted sum, stride 4
    return v ^ t ^ l ^ a;
}

[Benchmark]
public long PipelineDictionary()
{
    long v = ValidateArray(_workload);
    long t = PolynomialTransform(_workload);
    int  l = LookupAllDictionary(_searchKeys); // Dictionary.ContainsKey
    long a = Aggregate(_workload);
    return v ^ t ^ l ^ a;
}

Only the lookup step changes. Full source in companion code.

94% of the work doesn’t change regardless of lookup strategy.

| Method             | Categories | Mean         | Error     | StdDev    | Ratio |
|------------------- |----------- |-------------:|----------:|----------:|------:|
| PipelineLinear     | Macro      | 7,181.115 us | 59.636 us | 66.285 us |  1.00 |
| PipelineDictionary | Macro      | 6,611.982 us | 11.094 us | 11.871 us |  0.92 |

Pipeline with Dictionary is 8% faster. Not 262x. Eight percent.

The lookup consumes 412 us out of 7,181 us total, or 5.7% of the pipeline. A 262x speedup on 5.7% of the runtime gives a theoretical overall speedup of 1 / (1 - 0.057 + 0.057/262) ≈ 1.060, about 6% (Amdahl’s law⁴). The measured 8% is higher: the pipeline saved 569 us, more than the entire 412 us lookup, so cache effects from eliminating the linear scan likely benefit subsequent pipeline steps.
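The Amdahl estimate can be recomputed directly from the two benchmark tables. A minimal sketch follows; AmdahlSpeedup is an illustrative helper name, and both results are expressed as speedup factors (baseline time divided by new time), so the 0.92 Ratio shows up as roughly 1.09x.

```csharp
using System;

// Amdahl's law: overall speedup when a fraction of the runtime is sped up by a given factor.
static double AmdahlSpeedup(double fraction, double localSpeedup) =>
    1.0 / ((1.0 - fraction) + fraction / localSpeedup);

double lookupFraction = 412.089 / 7181.115;   // LookupLinear share of PipelineLinear
double localSpeedup   = 412.089 / 1.571;      // LookupLinear / LookupDictionary

double predicted = AmdahlSpeedup(lookupFraction, localSpeedup);
double measured  = 7181.115 / 6611.982;       // PipelineLinear / PipelineDictionary

Console.WriteLine($"lookup share:      {lookupFraction:P1}");
Console.WriteLine($"predicted speedup: {predicted:F3}x");
Console.WriteLine($"measured speedup:  {measured:F3}x");
```

The measured factor exceeding the prediction is the signal that something beyond the lookup itself changed between the two pipelines.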

Micro answers “is this function faster?” Macro answers “will the user notice?”

Baudrillard (1981): the fourth phase of the simulacrum — the image bears no relation to any reality whatever. The microbenchmark says 262x. The macrobenchmark says 8%. Both have error bars. Both passed statistical tests. Both are internally consistent. Neither describes what the user experiences. Two maps orbiting each other, each valid within its own coordinate system, each detached from the territory they claim to represent. The micro number didn’t lie. The macro number didn’t lie. The lie was believing either one alone was the answer.

Eight percent might be worth it — or might not, depending on whether the pipeline runs once per request or once per hour. The microbenchmark alone cannot tell you.

Before you ship the number

Check	Question	If no…
Iterations	Did you run enough iterations? (>= 15 in this setup, configured via SimpleJob)	Your CIs are too wide — the result might be noise (see Layer 1)
CI overlap	Are the 99.9% CIs (mean ± BDN Error) free of overlap?	Overlapping CIs suggest noise, but non-overlap is a conservative screen, not a definitive test. Confirm with a formal test (Welch / Mann-Whitney)
Practical size	Is the Ratio meaningfully different from 1.00? Does it exceed your SESOI?	Statistically real but practically irrelevant — move on
Micro = Macro	Does the micro speedup translate to end-to-end improvement?	The bottleneck is elsewhere — profile before optimizing
Reproducible	Same result on different hardware / OS / runtime?	Environment-dependent — see Part 2

Three rules:

Always report confidence intervals. A mean without CI is a claim, not evidence. BenchmarkDotNet provides the Error column (99.9% CI half-width) — use it. CI overlap is a useful quick screening heuristic: overlapping CIs suggest noise, non-overlapping CIs suggest a real difference — but neither is definitive. Overlapping CIs can still hide a significant difference, and non-overlapping CIs are a conservative rule, not proof. For a formal conclusion, use a statistical test (Welch’s t-test, Mann-Whitney U). If you only ran 5 iterations, run more.
Distinguish statistical from practical significance. Non-overlapping CIs mean the difference exists. They don’t mean it matters. Define a SESOI (smallest effect size of interest) before running the benchmark — the minimum improvement that justifies the code change. BDN’s Ratio column tells you the proportional difference: if it doesn’t cross your SESOI threshold, the result is real but not actionable.
Confirm micro with macro. A microbenchmark shows a function is faster in isolation. A macrobenchmark shows the user will notice. Run both — or explain why you didn’t. A 262x micro speedup sounds compelling until Amdahl reduces it to 8%.
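For the formal test mentioned in the first rule, Welch’s t statistic needs only the summary statistics BenchmarkDotNet already reports. The sketch below computes the statistic and the Welch-Satterthwaite degrees of freedom. It assumes n = 20 samples per method (taken from the IterationCount in the job configuration, which may not match the run that produced these means) and stops short of a p-value, which needs a Student’s t distribution from a statistics library.

```csharp
using System;

// Welch's t statistic and Welch-Satterthwaite degrees of freedom from summary statistics.
static (double T, double Df) WelchT(double mean1, double sd1, int n1,
                                    double mean2, double sd2, int n2)
{
    double v1 = sd1 * sd1 / n1;   // squared standard error, sample 1
    double v2 = sd2 * sd2 / n2;   // squared standard error, sample 2
    double t = (mean1 - mean2) / Math.Sqrt(v1 + v2);
    double df = (v1 + v2) * (v1 + v2)
              / (v1 * v1 / (n1 - 1) + v2 * v2 / (n2 - 1));
    return (t, df);
}

// FilterTernary vs FilterBranch, means and SDs as quoted earlier in this post.
// n = 20 per method is an assumption for this sketch.
var (t, df) = WelchT(25.64, 0.109, 20, 25.25, 0.177, 20);
Console.WriteLine($"t = {t:F2}, df = {df:F1}");
```

A large t here confirms the difference is real. It says nothing about whether the difference clears the SESOI, which is the second rule’s job.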

Run it yourself

git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/statistics-that-matter

# All benchmarks (20 iterations, ~3 min)
# Pin to a single NUMA node to eliminate cross-socket variance
taskset -c 0-11 dotnet run -c Release

# Individual scenarios
taskset -c 0-11 dotnet run -c Release -- --filter '*NoisyComparison*'
taskset -c 0-11 dotnet run -c Release -- --filter '*EffectSizeDemo*'
taskset -c 0-11 dotnet run -c Release -- --filter '*MicroVsMacro*'

# Reproduce the CI overlap demo (5 iterations — wide error bars)
taskset -c 0-11 dotnet run -c Release -- --filter '*NoisyComparison*' --iterationCount 5 --warmupCount 3

Benchmark environment

Component	Value
CPU	2x Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)
RAM	~115 GB DDR3-1866 (quad-channel per socket)
OS	Fedora Linux 42 (kernel 6.17)
Runtime	.NET 9.0.11 (RyuJIT AVX)
SDK	.NET SDK 10.0.102
BenchmarkDotNet	v0.14.0
GC	Server GC, Concurrent (BDN enables Server GC in benchmark processes by default; host process uses Workstation)
Pinning	`taskset -c 0-11` — single socket, physical cores only
Job	SimpleJob (WarmupCount=5, IterationCount=20)

Limitations: Single machine, dual-socket NUMA. All benchmarks pinned to one socket to eliminate cross-socket memory access and thread migration — without pinning, NoisyComparison variance doubles and absolute values shift by 5-10% between runs (Part 2). EffectSizeDemo uses sorted data for binary search — the algorithmic advantage is inherent, not hardware-dependent. MicroVsMacro pipeline proportions (40/40/6/15%) are approximate — workload ratios on your hardware will vary.

Even with honest design, controlled environment, and correct measurement — the number still needs interpretation. Too few iterations and the CI swallows the difference. Tight CIs inflate Cohen’s d into meaninglessness. Microbenchmarks promise 262x while the user sees 8%.

Hume (1739): no finite number of observations guarantees the next will conform. But the problem isn’t too few observations — it’s too much readiness to conclude. The confirmation doesn’t come from the data. It comes from you. The number said “3% slower” and you heard “regression” because you were already looking for one. The CIs were wide enough to hold any story. You picked the one that matched.

“3% faster” is not a result. It’s a hypothesis. Treat it like one — confirm it with sufficient iterations, assess practical significance, and validate it against end-to-end behavior. Or revert the merge.

First Things First: Coordinated Omission

Tue, 03 Mar 2026 10:00:00 +0100

p99 = 1 ms — flip one switch — p99 = 195 ms

Same service. Same pause pattern. Same nominal target rate. One change in the client model — p99 jumps 182×. Not a system failure. A measurement failure.

Design can lie. The environment can lie. Fix both — the benchmark looks solid, the percentiles look clean. Too clean. The measurement method itself can lie — a systematic omission baked into how the test collects data.

All code in this post: clone, build, run. Numbers below were measured on dual Xeon E5-2697 v2 — run the companion code on your hardware for your own results. Different hardware, different numbers — that’s half the lesson.

Convention: charts use milliseconds; tables reproduce raw simulation output. Histograms are approximate visualizations of the recorded latency distribution — the percentile tables are the authoritative data.

Send, wait, measure, repeat

public static LatencyReport Run(SimulatedService service, int ratePerSec, int durationSec)
{
    int totalRequests = ratePerSec * durationSec;
    var recorder = new LatencyRecorder();

    for (int i = 0; i < totalRequests; i++)
    {
        long start = Stopwatch.GetTimestamp();
        service.Process();
        long elapsed = Stopwatch.GetTimestamp() - start;
        recorder.Record(elapsed);
    }

    return recorder.GetReport();
}

Closed-loop client — full source in companion code.

Send a request. Wait for the response. Measure the elapsed time. Send the next one. The client and the service take turns — a lockstep conversation where neither moves without the other. This pattern has a name: closed-loop.¹ Most load test frameworks default to it. Most dashboards assume it.

What does your test do when the system slows down?

The comfortable picture

The system under test: a simulated service with ~1 ms baseline latency (calibrated SpinWait) and a 200 ms pause every 500th request — modeling GC, compaction, or any periodic maintenance event. Target rate: 450 req/sec over 30 seconds (13,500 total). Average service time: (499 × 1 ms + 1 × 200 ms) / 500 = 1.4 ms. At 450 req/sec the service needs 630 ms of work per second — ~63% utilization, with headroom to spare. The pauses are the problem, not the capacity.
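The capacity arithmetic above is short enough to check in code. The constants below are the ones stated in the paragraph; nothing else is assumed.

```csharp
using System;

const double baselineMs = 1.0;   // normal request
const double pauseMs = 200.0;    // every 500th request
const int pauseEvery = 500;
const int ratePerSec = 450;

// Average over one full cycle of 499 normal requests and 1 paused request.
double avgServiceMs = ((pauseEvery - 1) * baselineMs + pauseMs) / pauseEvery;
double busyMsPerSec = ratePerSec * avgServiceMs;
double utilization = busyMsPerSec / 1000.0;

Console.WriteLine($"avg service time: {avgServiceMs:F3} ms");
Console.WriteLine($"busy time per second: {busyMsPerSec:F0} ms");
Console.WriteLine($"utilization: {utilization:P1}");
```

Utilization well below 100% is what makes this setup a measurement demonstration rather than an overload test.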

The closed-loop client has no rate limiter, no inter-request delay — totalRequests is just a count (rate × duration) to match the open-loop’s output volume. The effective rate is whatever the service delivers. During normal processing (~1 ms per request), well above 450 req/sec. During a 200 ms pause: zero. The arrival rate follows the system. When the system slows, the test slows with it.

| Metric | Closed-loop  |
|--------|-------------:|
| Count  |       13,500 |
| p50    |      1.00 ms |
| p90    |      1.00 ms |
| p99    |      1.07 ms |
| p99.9  |    200.15 ms |
| max    |    200.28 ms |

The dashboard looks clean. 99th percentile: 1 ms. Only p99.9 shows any trouble — and that’s 27 requests out of 13,500, the ones that directly hit a pause. Every other request: ~1 ms, tight distribution, no tail. You read the numbers and move on.
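Why 27 slow requests surface at p99.9 but not at p99 follows from how percentiles are ranked. The sketch below uses an idealized recording (exactly 1 ms or 200 ms per request, no jitter) and a nearest-rank percentile; the companion code’s LatencyRecorder may compute percentiles differently.

```csharp
using System;
using System.Linq;

// Nearest-rank percentile on a sorted array; perMille = 990 means p99, 999 means p99.9.
static double Percentile(double[] sorted, int perMille)
{
    int rank = (int)(((long)sorted.Length * perMille + 999) / 1000); // ceil(n * p)
    return sorted[Math.Max(rank, 1) - 1];
}

// Idealized closed-loop recording: 13,500 requests, every 500th one takes 200 ms.
double[] latencies = Enumerable.Range(1, 13_500)
    .Select(i => i % 500 == 0 ? 200.0 : 1.0)
    .OrderBy(ms => ms)
    .ToArray();

int slow = latencies.Count(ms => ms > 1.0);
Console.WriteLine($"slow requests: {slow} ({(double)slow / latencies.Length:P1} of all)");
Console.WriteLine($"p99   = {Percentile(latencies, 990)} ms");
Console.WriteLine($"p99.9 = {Percentile(latencies, 999)} ms");
```

The slow requests make up 0.2% of the sample: too few to reach p99, just enough to own p99.9.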

The dashboard maps what the test recorded — not what users experienced.

Hume (1739): no finite set of observations guarantees the next. A thousand closed-loop measurements say p99 = 1 ms. The thousand-and-first doesn’t have to agree. Induction from data that systematically omits the worst moments is induction from a sample that excludes its own counterexamples.

Flip one switch

Same service. Same pause injector. Same nominal target rate. One change: the client sends on a fixed schedule, regardless of whether the previous request came back.

public static LatencyReport Run(SimulatedService service, int ratePerSec, int durationSec)
{
    var recorder = new LatencyRecorder();
    long intervalTicks = Stopwatch.Frequency / ratePerSec;
    long deadline = Stopwatch.GetTimestamp() + (long)durationSec * Stopwatch.Frequency;
    long nextSend = Stopwatch.GetTimestamp();

    while (Stopwatch.GetTimestamp() < deadline)
    {
        long intendedStart = nextSend;
        nextSend += intervalTicks;

        service.Process();

        long now = Stopwatch.GetTimestamp();
        long latency = now - intendedStart;  // ← intended, not actual
        recorder.Record(latency);

        while (Stopwatch.GetTimestamp() < nextSend)
            Thread.SpinWait(10);
    }

    return recorder.GetReport();
}

Open-loop client — full source in companion code. Note: intervalTicks uses integer division, introducing sub-microsecond step quantization at 450 req/sec — negligible for this demonstration.
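The size of that quantization can be estimated with a few lines. The sketch assumes a 1 GHz tick rate (1 tick = 1 ns), which is an assumption for the example; read Stopwatch.Frequency on the actual machine for the real value.

```csharp
using System;

// Assumption: 1 GHz tick rate (1 tick = 1 ns). Check Stopwatch.Frequency on real hardware.
const long frequency = 1_000_000_000;
const int ratePerSec = 450;
const int durationSec = 30;

long intervalTicks = frequency / ratePerSec;     // integer division truncates
long lostTicksPerSec = frequency % ratePerSec;   // ticks dropped per 450 scheduled sends
double nsPerTick = 1e9 / frequency;
double perStepErrorNs = (double)lostTicksPerSec / ratePerSec * nsPerTick;
double driftNs = (double)lostTicksPerSec * durationSec * nsPerTick;

Console.WriteLine($"intervalTicks = {intervalTicks}");
Console.WriteLine($"error per step ~ {perStepErrorNs:F3} ns");
Console.WriteLine($"schedule drift over {durationSec} s ~ {driftNs:F0} ns");
```

Under that assumption the schedule runs ahead by a few microseconds over the whole test, far below the millisecond-scale effects being measured.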

The line that matters: now - intendedStart instead of now - actualStart. The user’s clock starts when they click, not when the server gets around to processing their request. When the service pauses, requests that should have been sent during the pause pile up, and each is measured from when it was supposed to start, because that’s when the user started waiting.

Bimodal. A peak at ~1 ms and a wide spread from 50–200 ms. Two different experiences on the same chart.

| Metric | Closed-loop  |    Open-loop |     Ratio |
|--------|-------------:|-------------:|----------:|
| Count  |       13,500 |       13,500 |           |
| p50    |      1.00 ms |      1.00 ms |      1.0x |
| p90    |      1.00 ms |    137.89 ms |    137.9x |
| p99    |      1.07 ms |    194.64 ms |    182.4x |
| p99.9  |    200.15 ms |    200.15 ms |      1.0x |
| max    |    200.28 ms |    200.41 ms |      1.0x |

Ratios computed from raw data before rounding to displayed precision.

Same system. Same load. Same pause. One variable: whether the test waits for a response before sending the next request.

Closed-loop p99 = 1 ms. Open-loop p99 = 195 ms. 182× on this workload.

The mechanism — coordinated omission

During a 200 ms pause, the closed-loop client waits. While waiting, it sends no new requests — it goes with the system, slowing down exactly when the system slows down. 200 ms × 450 req/sec = 90 requests that should have been sent but weren’t. They don’t appear in the histogram. They don’t exist in the data. The dashboard stays clean.

The open-loop client doesn’t coordinate. It tracks what the schedule should have been. After the pause resolves:

Request N+1: intended at T+2 ms, completed at T+201 ms → latency = 199 ms
Request N+2: intended at T+4 ms, completed at T+202 ms → latency = 198 ms
Request N+3: intended at T+7 ms, completed at T+203 ms → latency = 196 ms
…catch-up continues for ~160 requests until the schedule recovers

Each pause contaminates ~160 subsequent requests with elevated latency. 27 pauses × ~160 requests = ~4,300 requests — roughly a third of all traffic — experiencing latency between 2 ms and 200 ms. That’s why the open-loop p90 is 138 ms: the top 10% of requests (1,350 out of 13,500) fall squarely in that contaminated range.

The closed-loop client sees 27 bad requests. The open-loop client sees 4,300. Same service. Same pauses.
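The catch-up arithmetic can be replayed without a real clock. The following is a deterministic sketch on a virtual timeline, not the companion simulation: service times are exactly 1 ms or 200 ms, there is no SpinWait calibration and no jitter, so the percentiles land close to the measured ones without matching them exactly.

```csharp
using System;
using System.Linq;

const int ratePerSec = 450, totalRequests = 13_500, pauseEvery = 500;
const double baselineMs = 1.0, pauseMs = 200.0;
double intervalMs = 1000.0 / ratePerSec;

// Nearest-rank percentile; perMille = 900 means p90.
static double Percentile(double[] sorted, int perMille)
{
    int rank = (int)(((long)sorted.Length * perMille + 999) / 1000);
    return sorted[Math.Max(rank, 1) - 1];
}

var serviceOnly = new double[totalRequests];   // what a closed-loop client records
var fromIntended = new double[totalRequests];  // what an open-loop client records
double clockMs = 0.0;                          // virtual time at which the service is free

for (int i = 0; i < totalRequests; i++)
{
    double intendedMs = i * intervalMs;
    double startMs = Math.Max(intendedMs, clockMs);   // blocked behind the previous request
    double serviceMs = (i + 1) % pauseEvery == 0 ? pauseMs : baselineMs;
    clockMs = startMs + serviceMs;

    serviceOnly[i] = serviceMs;
    fromIntended[i] = clockMs - intendedMs;
}

int delayed = fromIntended.Count(ms => ms > baselineMs + 0.001);
Array.Sort(serviceOnly);
Array.Sort(fromIntended);

Console.WriteLine($"requests above baseline (open-loop view): {delayed}");
Console.WriteLine($"p90  closed {Percentile(serviceOnly, 900),7:F2} ms   open {Percentile(fromIntended, 900),7:F2} ms");
Console.WriteLine($"p99  closed {Percentile(serviceOnly, 990),7:F2} ms   open {Percentile(fromIntended, 990),7:F2} ms");
```

Both arrays come from the same run of the same service. The only difference is which timestamp the latency is subtracted from.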

The worse the failure, the more requests the closed-loop client skips, the cleaner the dashboard. Visibility is inversely proportional to severity. At 450 req/sec, a 200 ms pause omits 90 measurements. A 2-second pause omits 900. A 10-second GC stop-the-world omits 4,500. The worst event your system can produce is the one your test is least likely to record.

Gil Tene named this Coordinated Omission — the test coordinates with the system’s failures, omitting measurements precisely when they would be most damning.²

Baudrillard (1981): the third phase of the simulacrum — the image masks the absence of reality. The closed-loop benchmark doesn’t distort measurements. It masks their nonexistence. Those 90 requests during the pause aren’t poorly measured. They don’t exist. The dashboard is a simulacrum — it doesn’t lie about the system. It replaces it.

How to stop coordinating

Property	Closed-loop	Open-loop
Request timing	After previous response	Fixed schedule, independent of response
What it measures	Response time of sent requests (omits unsent)	Response time from intended start (incl. queuing)
During a pause	Stops sending → omits measurements	Tracks intended schedule → captures queuing
p99 under pauses	Looks clean (only direct hits visible)	Shows full impact (queued requests visible)
Best for	Throughput measurement, saturation testing	Latency measurement, SLA validation

Four rules for latency measurement:

Open-loop by default for latency load tests. Closed-loop is still useful for throughput and saturation testing — finding the breaking point. But if your SLAs are latency percentiles, you need open-loop. Closed-loop tells you the system can handle the load; open-loop tells you what users experience while it does.¹
Measure from intended time, not actual time. latency = now - intendedStart, not now - actualStart. The user’s clock starts when they click, not when the server gets around to reading their request.
Record the full tail. p50 and p99 are not enough. Report p99.9 and max. In the tables above, the closed-loop pauses surface only at p99.9, while open-loop shows the damage from p90 upward; a report that stops at p99 can miss the tail entirely.
Use histograms that can handle it. HdrHistogram³ records values across a wide dynamic range with configurable precision — from sub-millisecond to multi-second latencies in the same histogram. Fixed-bucket histograms clip the tail.

Tools that get it right

Tool	Open-loop	CO correction	Notes
wrk2⁴	Yes	Built-in	Constant-rate HTTP benchmark, HdrHistogram output
Gatling	Yes	Configurable	Open-loop mode available, reports percentiles
k6	Partial	Manual	Constant-rate via scenarios, no auto-correction
Custom (this post)	Yes	By design	`intendedStart` tracking, HdrHistogram.NET

Capabilities and defaults vary by tool version and configuration; verify settings in your release.
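The “CO correction” column refers to backfilling samples that a closed-loop recording never took. The sketch below illustrates the general idea behind expected-interval correction as popularized by HdrHistogram; it is a plain-list illustration written for this example, not the library’s code, and CorrectForOmission is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// A sample longer than the send interval implies requests that were never issued.
// Each missing request would have waited one interval less than the one before it.
static List<long> CorrectForOmission(IEnumerable<long> recordedUs, long expectedIntervalUs)
{
    var corrected = new List<long>();
    foreach (long value in recordedUs)
    {
        corrected.Add(value);
        for (long missing = value - expectedIntervalUs;
             missing >= expectedIntervalUs;
             missing -= expectedIntervalUs)
        {
            corrected.Add(missing);
        }
    }
    return corrected;
}

// Closed-loop style recording in microseconds: two normal requests and one 200 ms pause.
long[] recorded = { 1_000, 1_000, 200_000 };
long intervalUs = 1_000_000 / 450;   // 2,222 us between intended sends at 450 req/sec

List<long> result = CorrectForOmission(recorded, intervalUs);
Console.WriteLine($"recorded {recorded.Length} samples, corrected to {result.Count}");
```

Backfilling is an estimate applied after the fact. Measuring from the intended start time, as the open-loop client does, records the queuing directly and needs no correction.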

Run it yourself

git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/coordinated-omission
dotnet run -c Release

Benchmark environment

Component	Value
CPU	2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)
RAM	~115 GB DDR3-1866 (quad-channel per socket)
OS	Fedora Linux 42 (kernel 6.17)
Runtime	.NET 9.0.11 (RyuJIT AVX)
SDK	.NET SDK 10.0.102
HdrHistogram	HdrHistogram.NET 2.5.0
Simulation	450 req/sec, 30 sec, 200 ms pause every 500 requests

Not BenchmarkDotNet — this is a custom in-process simulation. SpinWait calibrated at startup for ~1 ms baseline on current hardware (binary search, 50 samples, median). Fresh SimulatedService instance per client — no counter contamination.

Limitations: In-process simulation — no HTTP, no network stack, no kernel-level queuing. The open-loop client is single-threaded and blocks on Process(), so it tracks the intended schedule rather than dispatching concurrently (a real open-loop system like wrk2 or Gatling sends requests asynchronously). These simplifications isolate the coordinated omission mechanism from transport noise — the measurement effect is the same, but absolute numbers would differ in a networked setup.

Popper (1934): a meaningful test must be capable of producing a negative result. The closed-loop client cannot falsify the hypothesis “the system is healthy” — it hides the counterexamples. Measurements that would disprove it don’t exist. Open-loop is the falsification instrument: it doesn’t ask the system whether it’s ready. It measures regardless.

Each layer of deception sits closer to you. Design — visible in the code. Environment — visible in the configuration. The method of collection — buried in an assumption you never questioned. Data collected correctly. But what do the data mean?

A metric that looks better the worse the system performs isn’t a metric. It’s anesthesia.

First Things First: Enemies of Measurement

Fri, 27 Feb 2026 17:00:00 +0100

Same engine, different answers

Design fixed. Environment changed: cache temperature, GC pressure, data order, JIT tier. The numbers move by 2–6× without touching the algorithm.

Enemy	Effect	What it distorts
1. JIT Optimization Level	6×	Machine code quality
2. GC Pauses	2.3×	Allocation in hot path
3. System Noise	3.7× σ	Measurement variance
4. Cache State	2.9×	Memory hierarchy
5. Branch Predictor	5.0×	Data order
6. Dead Code Elimination	5.9×	Return type

The first three, BenchmarkDotNet defends against — if you know to look. The last three, you’re on your own. Some enemies use the storage engine directly (E2, E3, E4). Others isolate CPU-level effects using data derived from the storage engine (E1, E5, E6) — because these distortions hide in any hot path, not just Insert and Get.

All code: clone, build, run. Numbers below: dual Xeon E5-2697 v2, 48 threads, 30 MB L3 per socket, ~115 GB DDR3-1866, Fedora 42, .NET 9.0.11 (RyuJIT AVX), BenchmarkDotNet v0.14.0.¹ No WAL — these enemies hide in the in-memory path, where fsync can’t drown the signal. Different hardware, different numbers — that’s half the lesson.

Enemy 1 — JIT Optimization Level

The storage engine holds 100,000 rows (via Row.Generate). Setup extracts all payloads into a contiguous byte[] of ~14.5 MB — an integrity-check scenario. Two versions of the same loop. Same data. Same operation. One difference: [MethodImpl(MethodImplOptions.NoOptimization)] — forcing the JIT to emit completely unoptimized code (no register promotion, no SIMD, no bounds check elimination).

Descartes: de omnibus dubitandum est — doubt everything, starting with your own setup. This is not a Tier-0 vs Tier-1 comparison.² NoOptimization disables all optimizations — the absolute lower bound. Real Tier-0 → Tier-1 transitions (short methods without loops, where Tier-0 applies) show 2–4×. The 6× here is the extreme case, deliberately exaggerated to make the enemy visible.

[DisassemblyDiagnoser(maxDepth: 3)]
public class E1_JitWarmup
{
    private byte[] _payload; // ~14.5 MB — all payloads from 100K rows

    [GlobalSetup]
    public void Setup()
    {
        using var table = new StripedTable();
        for (int i = 0; i < 100_000; i++)
            table.Insert(i, Row.Generate(i));

        // Extract all payloads into contiguous array
        // ... (full source in companion code)
    }

    [Benchmark]
    [MethodImpl(MethodImplOptions.NoOptimization)]
    public long SumPayloadCold()
    {
        long sum = 0;
        var data = _payload;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }

    [Benchmark(Baseline = true)]
    public long SumPayloadWarm()
    {
        long sum = 0;
        var data = _payload;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }
}

Identical loop. Identical data. Identical result.

Method	Mean	Code Size	Ratio
SumPayloadCold	49.764 ms	124 B	6.03
SumPayloadWarm	8.247 ms	49 B	1.00

6× on this hardware. The [DisassemblyDiagnoser] on the class generates full JIT output in BenchmarkDotNet.Artifacts/results/ — 124 bytes of machine code vs 49. The unoptimized path pays for stack-based locals, bounds checks on every array access, scalar arithmetic — one byte at a time. The optimized path gets register promotion, bounds check elimination, and potentially SIMD vectorization. Same source code. Different machine code. 6× gap (remember: this is the extreme case — real Tier-0 → Tier-1 deltas are smaller but still significant).

BenchmarkDotNet runs warmup iterations by default (6–50 adaptive, plus 15–100 measurement iterations)³ — conservative enough that Tier-0 compiles to Tier-1 before measurement begins. Defense exists. But in benchmarks where tiered compilation actually applies (short methods without loops — where Tier-0 is the first compile), overriding warmup count too low or testing a method short enough to stay below the recompilation threshold can let unoptimized code leak into the measurement window. The first enemy hides in the JIT pipeline — and the DisassemblyDiagnoser is the only way to see it.

Enemy 2 — GC Pauses

Insert 100,000 rows into StripedTable. Same keys, same table, same final state. One difference: where the Row objects come from.

[MemoryDiagnoser]
public class E2_GcPauses
{
    private const int N = 100_000;
    private ITable _table;
    private int[] _keys;
    private Row[] _preAllocated;

    [GlobalSetup]  // keys + rows generated once, reused across iterations
    public void Setup()
    {
        var rng = new Random(42);
        _keys = new int[N];
        _preAllocated = new Row[N];
        for (int i = 0; i < N; i++)
        {
            _keys[i] = rng.Next(0, 200_000);
            _preAllocated[i] = Row.Generate(_keys[i]);
        }
    }

    [IterationSetup]
    public void IterationSetup()
    {
        _table = new StripedTable(); // fresh table per iteration
    }

    [Benchmark]
    public void InsertAllocHeavy()
    {
        for (int i = 0; i < N; i++)
            _table.Insert(_keys[i], Row.Generate(_keys[i])); // new byte[] per insert
    }

    [Benchmark(Baseline = true)]
    public void InsertPreAllocated()
    {
        for (int i = 0; i < N; i++)
            _table.Insert(_keys[i], _preAllocated[i]); // no per-insert allocation
    }
}

Row.Generate(key) allocates a fresh byte[32..256] every call. 100K inserts = 100K allocations = GC pressure. The baseline pre-allocates all rows in GlobalSetup — no per-insert payload allocations in the hot path. (The 7.52 MB in the table comes from ConcurrentDictionary internal growth — both methods pay that cost.)

Method	Mean	StdDev	Allocated	Alloc Ratio	Ratio
InsertAllocHeavy	36.81 ms	1.744 ms	23.9 MB	3.18	2.25
InsertPreAllocated	16.35 ms	1.448 ms	7.52 MB	1.00	1.00

2.3× on this workload. MemoryDiagnoser shows why: 23.9 MB allocated vs 7.52 MB. Both methods grow the ConcurrentDictionary from scratch (fresh table per iteration), but InsertAllocHeavy adds 100K Row.Generate allocations on top, each creating a new byte[]. The extra allocation pressure triggers GC collections mid-measurement, and each pause adds microseconds that accumulate into milliseconds. Look at StdDev: 1.74 ms for the allocating path. BenchmarkDotNet also flagged InsertPreAllocated as bimodal (mValue = 3.94), consistent with GC pauses splitting the distribution into two clusters: iterations where a collection fired vs iterations where it didn’t. GC pauses are non-deterministic: sometimes a collection lands inside the timed region, sometimes it doesn’t.

BenchmarkDotNet can force GC between iterations (GcForce) and report allocation pressure (MemoryDiagnoser). The defense exists — but you have to look. A benchmark that allocates in the hot path and doesn’t report memory is measuring GC behavior, not your algorithm. The StdDev rises and nobody knows why.
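Outside BenchmarkDotNet, a rough version of the same check is possible with the runtime’s per-thread allocation counter. The sketch below is an in-process approximation and not a substitute for MemoryDiagnoser; the arrays stand in for the table and rows, and the sizes are arbitrary.

```csharp
using System;

const int n = 10_000;
var slots = new byte[n][];          // stand-in for the table
var preAllocated = new byte[n][];   // payloads built once, during setup
for (int i = 0; i < n; i++)
    preAllocated[i] = new byte[64];

// Bytes allocated on the current thread while the action runs.
static long AllocatedBy(Action action)
{
    long before = GC.GetAllocatedBytesForCurrentThread();
    action();
    return GC.GetAllocatedBytesForCurrentThread() - before;
}

long allocHeavy = AllocatedBy(() =>
{
    for (int i = 0; i < n; i++)
        slots[i] = new byte[64];    // fresh payload per "insert"
});

long reuse = AllocatedBy(() =>
{
    for (int i = 0; i < n; i++)
        slots[i] = preAllocated[i]; // reuse payloads, no allocation in the loop
});

Console.WriteLine($"allocating path: {allocHeavy:N0} bytes, pre-allocated path: {reuse:N0} bytes");
```

If the hot path reports non-trivial allocation, part of the measured time belongs to the allocator and the GC rather than to the algorithm.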

Enemy 3 — System Noise

Two identical methods. Same table. Same data. Same code — literally copy-paste. The table is pre-populated in GlobalSetup — every Insert is an update, not a growth event. Deterministic, constant-cost work where OS noise is the only variable.⁴

public class E3_OsNoise
{
    private const int N = 100_000;
    private ITable _table;
    private int[] _keys;
    private Row[] _rows;

    [GlobalSetup]
    public void Setup()
    {
        _table = new StripedTable();
        // ... generate keys and rows ...
        for (int i = 0; i < N; i++)
            _table.Insert(_keys[i], _rows[i]);  // pre-populate
    }

    [Benchmark(Baseline = true)]
    public void InsertBaseline()
    {
        for (int i = 0; i < N; i++)
            _table.Insert(_keys[i], _rows[i]);
    }

    [Benchmark]
    public void InsertSame()
    {
        for (int i = 0; i < N; i++)
            _table.Insert(_keys[i], _rows[i]);
    }
}

The interesting number isn’t the ratio between methods — it’s the StdDev across two runs of the same benchmark under different conditions:

# Linux-only: taskset ships with util-linux and is not available on macOS/Windows

# === Run 1: Noisy — saturate all CPU cores, then benchmark ===

# If the script exits (Ctrl-C or error), kill all background jobs automatically
trap 'kill $(jobs -p) 2>/dev/null' EXIT

# Spawn one infinite busy loop per CPU core — fills the scheduler with work
# $(nproc) returns your core count (e.g. 48), each loop burns 100% of one core
for i in $(seq 1 $(nproc)); do
  (while true; do :; done) &   # & sends each loop to background
done

# Now run the benchmark — the OS scheduler must fight for CPU time
dotnet run -c Release -- --filter '*E3*'

# Stop all busy loops
kill $(jobs -p)

# === Run 2: Isolated — pin benchmark to a single core, no contention ===

# taskset -c 0 = run only on CPU 0: no migration between cores (other processes may still be scheduled there)
taskset -c 0 dotnet run -c Release -- --filter '*E3*'

Noisy run (all cores saturated):

Method	Mean	StdDev	Ratio
InsertBaseline	18.95 ms	0.945 ms	1.00
InsertSame	18.11 ms	0.583 ms	0.96

Isolated run (pinned to core 0, idle system):

Method	Mean	StdDev	Ratio
InsertBaseline	12.98 ms	0.252 ms	1.00
InsertSame	13.17 ms	0.254 ms	1.01

Same code. Same data. Same machine. The noisy run is 46% slower (mean) and 3.7× noisier (StdDev). The noise isn’t just the OS scheduler — it’s the entire system under contention. Thread migration between cores flushes caches. Context switches inject 10–100 μs of jitter.⁵ Competing processes saturate the memory bus and evict cache lines that the benchmark needs. Interrupts and kernel work preempt the benchmark thread mid-iteration. Under CPU saturation, these effects stack: on a 13 ms insert loop, the mean shifts by 46% and the variance explodes. On a 100 μs microbenchmark, the effect is destruction — not noise.

The defense: taskset pins the process to a core (add nice -n -20 with root for higher priority), and more iterations average out the noise. BenchmarkDotNet’s MinIterationCount and Affinity (a CPU core mask, the in-process equivalent of taskset) settings help. But the scheduler is always there, and the smaller your operation, the larger the enemy.

Three enemies down. All three live in the execution environment — BenchmarkDotNet can detect or mitigate them because it controls the process. The next three live at the boundary between your code and the hardware. Korzybski (1933): the map is not the territory. The framework maps the process. It can’t map a dataset that fits in L3, a data order that trains the branch predictor, or a return type that lets the JIT eliminate your computation. Those are your choices — and the hardware responds to them silently.

Enemy 4 — Cache State

Random Get() on StripedTable — in-memory, no WAL (hence nanosecond latencies, not microsecond-scale numbers where fsync dominates). Same operation. Same code. One parameter: how many entries in the table.

public class E4_CacheState
{
    private const int LookupCount = 100_000;

    [Params(10_000, 2_000_000)]
    public int TableSize { get; set; }

    private ITable _table;
    private int[] _lookupKeys;

    [GlobalSetup]
    public void Setup()
    {
        _table = new StripedTable();
        for (int i = 0; i < TableSize; i++)
            _table.Insert(i, Row.Generate(i));
        // ... random lookup keys ...
    }

    [Benchmark(OperationsPerInvoke = LookupCount)]
    public Row? LookupRandom()
    {
        Row? last = default;
        var table = _table;
        var keys = _lookupKeys;
        for (int i = 0; i < LookupCount; i++)
            last = table.Get(keys[i]);
        return last;
    }
}

OperationsPerInvoke divides total time by 100K — reporting per-lookup latency. Same Get(). Same StripedTable. Different table size.

TableSize	Mean	StdDev
10,000	17.05 ns	0.190 ns
2,000,000	50.08 ns	1.919 ns

2.9× on this hardware. StdDev tells the rest of the story.

10K entries: the benchmark’s working set — ConcurrentDictionary bucket arrays (~80 KB) and Node objects (~400 KB) — totals ~500 KB, comfortably within the 30 MB L3 on the local socket (dual-socket NUMA — each socket has its own 30 MB L3; the benchmark thread runs on one).⁶ The Row payloads (~1.4 MB of byte[]) exist on the heap but LookupRandom never dereferences them — it returns the Row struct, not the data. So only the dictionary traversal structure needs to fit in cache. Every lookup hits cached memory. StdDev is 0.19 ns — tight, repeatable.

2M entries: the dictionary working set (bucket arrays ~32 MB + nodes ~80 MB ≈ 112 MB) exceeds L3 by a wide margin and spills to DRAM. Random access means random cache misses — each miss costs 60–100 ns instead of 4–12 ns. StdDev jumps to 1.9 ns — 10× noisier — because DRAM latency varies with access pattern, NUMA topology, and memory controller contention.
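The two cases reduce to one comparison: does the dictionary's traversal structure fit in the local L3? This back-of-envelope sketch uses only the estimates quoted above. The bucket and node sizes are this post's approximations, not measured values, and CachePressure is a name made up for illustration:

```csharp
using System;

const double KB = 1e3, MB = 1e6;
const double L3Bytes = 30 * MB;   // one socket's L3 on this machine

// Hypothetical helper: working set divided by cache size.
// Below 1.0 the structure fits; above 1.0 lookups spill to DRAM.
static double CachePressure(double bucketBytes, double nodeBytes, double cacheBytes)
    => (bucketBytes + nodeBytes) / cacheBytes;

double small = CachePressure(80 * KB, 400 * KB, L3Bytes);   // 10K entries
double large = CachePressure(32 * MB, 80 * MB, L3Bytes);    // 2M entries

Console.WriteLine($"10K entries: {small:P1} of L3");
Console.WriteLine($"2M entries:  {large:F1}x L3");
```

With these inputs the 10K table occupies under 2% of L3, while the 2M table is roughly 3.7 times larger than L3. The estimate ignores everything else competing for the cache, so treat it as a lower bound on pressure.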

Cache doesn’t just change the speed; it changes the quality of the measurement. The cached run produces tight numbers, low StdDev, and repeatable results, and it is potentially misleading if the production working set does not fit in L3. Popper (1934): a benchmark can falsify a hypothesis but never confirm one. The 2.9× gap and 10× StdDev increase point at the cache hierarchy; perf stat -e cache-misses,cache-references would test that directly, but the measurement already suggests the answer.

Same symptom — inflated speed and false confidence. Different cause. Hot cache vs cold DRAM.

Enemy 5 — Branch Predictor Training

Scan the results from the storage engine. Row.Generate(key) produces payloads of 32–256 bytes (formula: 32 + key % 225). Count how many exceed a threshold. Standard aggregation — the kind you’d run after querying the table.

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class E5_BranchPredictor
{
    [Params(8_000_000)]
    public int N { get; set; }

    private int[] _sorted;  // Row sizes from Row.Generate formula, sorted
    private int[] _random;  // Same values, shuffled

    [GlobalSetup]
    public void Setup()
    {
        _sorted = new int[N];
        for (int i = 0; i < N; i++)
            _sorted[i] = 32 + (i % 225);  // Row.Generate payload formula
        Array.Sort(_sorted);

        _random = _sorted.ToArray();
        new Random(42).Shuffle(_random);
    }

    [Benchmark]
    public int ScanSorted()
    {
        int count = 0, threshold = 150;
        var data = _sorted;
        for (int i = 0; i < data.Length; i++)
            if (data[i] > threshold) count++;
        return count;
    }

    [Benchmark(Baseline = true)]
    public int ScanRandom()
    {
        int count = 0, threshold = 150;
        var data = _random;
        for (int i = 0; i < data.Length; i++)
            if (data[i] > threshold) count++;
        return count;
    }
}

Same values. Same count returned. Both arrays accessed sequentially — the prefetcher treats them identically.⁷ Same memory layout, same access pattern. Only the value order differs — which is what branch predictors respond to.

Method	Mean	Ratio
ScanSorted	8.214 ms	0.20
ScanRandom	41.363 ms	1.00

5.0× on this hardware. Same algorithm, same data, same cache behavior — different order.

Threshold 150 splits the range roughly in half — 106 out of 225 possible values exceed it (~47%). Near 50–50 is maximum branch unpredictability.⁸ The sorted array presents a clean pattern: all values below threshold, then all above. The branch predictor learns after a few iterations and predicts correctly for millions of subsequent elements. The shuffled array is a coin flip every iteration — the predictor guesses wrong ~47% of the time, and each misprediction costs 15–20 cycles while the pipeline flushes and refills.
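Those figures can be turned into a rough estimate of the penalty. This sketch assumes the nominal 2.70 GHz clock (no turbo), takes the ~47% miss rate and the 15–20 cycle cost from the paragraph above at face value, and ignores any overlap between a pipeline refill and other work:

```csharp
using System;

const int N = 8_000_000;          // elements scanned (the benchmark's Params value)
const int Threshold = 150;
const double ClockHz = 2.7e9;     // assumption: nominal clock, no turbo

// Count how many of the 225 distinct payload sizes (32..256) exceed the threshold.
int above = 0;
for (int k = 0; k < 225; k++)
    if (32 + k > Threshold) above++;

double missRate = above / 225.0;          // assumed share of mispredicted branches
double mispredictions = N * missRate;

// Penalty at the low and high end of the 15-20 cycle range, in milliseconds.
double lowMs = mispredictions * 15 / ClockHz * 1e3;
double highMs = mispredictions * 20 / ClockHz * 1e3;

Console.WriteLine($"{above}/225 values above threshold");
Console.WriteLine($"estimated penalty: {lowMs:F1} to {highMs:F1} ms");
```

The estimate comes out at roughly 21 to 28 ms, the same order as the 33 ms gap between ScanRandom and ScanSorted in the table. It is a plausibility check under the stated assumptions, not a measurement.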

Sequential keys feeding the prefetcher is a data design problem, and so is this. Here the values are realistic but their order is sorted, and the branch predictor likely changes the result without your knowledge. You’re trying to measure the storage engine’s aggregation cost. You’re mostly measuring the CPU pipeline’s response to data order.
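One way to check that attribution is to remove the data-dependent branch and see whether the sorted/shuffled gap closes. This sketch shows a branch-free count, assuming values and threshold are small enough that the subtraction cannot overflow (true for the 32–256 payload sizes here). CountAboveBranchless is a hypothetical helper, not part of the companion code, and it was not timed for this post:

```csharp
using System;

// Branch-free variant: the sign bit of (threshold - value) is 1 exactly
// when value > threshold, provided the subtraction does not overflow.
static int CountAboveBranchless(int[] data, int threshold)
{
    int count = 0;
    for (int i = 0; i < data.Length; i++)
        count += (int)((uint)(threshold - data[i]) >> 31);
    return count;
}

// Reference version with the data-dependent branch, as in ScanSorted/ScanRandom.
static int CountAboveBranchy(int[] data, int threshold)
{
    int count = 0;
    for (int i = 0; i < data.Length; i++)
        if (data[i] > threshold) count++;
    return count;
}

// Every payload size from 32 to 256 exactly once.
int[] sample = new int[225];
for (int i = 0; i < sample.Length; i++)
    sample[i] = 32 + i;

Console.WriteLine(CountAboveBranchless(sample, 150) == CountAboveBranchy(sample, 150)); // True
```

If a branch-free scan shows similar times for sorted and shuffled input on your hardware, that supports the branch predictor explanation. If the gap persists, something else is at work.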

Enemy 6 — Dead Code Elimination

Sum the data from Row.Generate’s formula — a checksum for integrity verification. 10 million iterations, pure arithmetic: 32 + (i % 225). No memory access. No exceptions. No side effects.

using BenchmarkDotNet.Attributes;

[DisassemblyDiagnoser(maxDepth: 3)]
public class E6_DeadCode
{
    [Params(10_000_000)]
    public int N { get; set; }

    [Benchmark]
    public void ChecksumEliminated()
    {
        long checksum = 0;
        for (int i = 0; i < N; i++)
            checksum += 32 + (i % 225);
        // checksum not returned — JIT drops the accumulation
    }

    [Benchmark(Baseline = true)]
    public long ChecksumPreserved()
    {
        long checksum = 0;
        for (int i = 0; i < N; i++)
            checksum += 32 + (i % 225);
        return checksum;
    }
}

Identical loop. One returns the result. One doesn’t.

Method	Mean	Code Size	Ratio
ChecksumEliminated	3.750 ms	21 B	0.17
ChecksumPreserved	22.220 ms	66 B	1.00

5.9× on this hardware. The DisassemblyDiagnoser shows why — the actual machine code for both methods:

; ChecksumEliminated — 21 bytes
M00_L00:
  inc   eax          ; i++
  cmp   eax, ecx     ; i < N?
  jl    M00_L00      ; loop

; ChecksumPreserved — 66 bytes
M00_L00:
  mov   edx, 91A2B3C5 ; magic constant for i % 225
  imul  esi            ; compiler-generated modulo
  ; ... 8 more instructions for 32 + (i % 225) ...
  add   rcx, rax      ; checksum += result
  inc   esi            ; i++
  cmp   esi, edi       ; i < N?
  jl    M00_L00        ; loop

[DisassemblyDiagnoser] on the class generates this — run the benchmark and check BenchmarkDotNet.Artifacts/results/ for the full listing (HTML + Markdown).

The JIT determined that checksum has no observable side effects — nobody reads it — and stripped out the entire accumulation. What remains is inc/cmp/jl: the loop counter, iterating 10 million times over nothing.⁹ The fix is simple: always return the computed value so the JIT must preserve it.
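Returning the value is not always possible, for example when a method has to stay void. A minimal sketch of the alternative publishes the result through a volatile write, so the store counts as observable. The class and member names are hypothetical; BenchmarkDotNet ships a Consumer type in BenchmarkDotNet.Engines for the same purpose:

```csharp
using System;
using System.Threading;

public static class ChecksumSink
{
    // Written with Volatile.Write so the store is treated as observable.
    private static long _sink;

    public static long LastValue => Volatile.Read(ref _sink);

    // Same loop as ChecksumEliminated, but the result escapes the method.
    public static void ChecksumKept(int n)
    {
        long checksum = 0;
        for (int i = 0; i < n; i++)
            checksum += 32 + (i % 225);
        Volatile.Write(ref _sink, checksum);
    }
}
```

Calling ChecksumSink.ChecksumKept(225) leaves 32400 in LastValue (225 × 32 plus the sum of 0 through 224). The disassembly listing is still the check that the loop body survived.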

Here’s what makes this the most dangerous enemy: 3.75 ms looks plausible. It’s not zero. It’s not suspiciously fast. It looks like a reasonable time for 10 million iterations of lightweight arithmetic. Without DisassemblyDiagnoser, you’d trust it. You’d compare it against another implementation. You’d ship a conclusion based on a number that measures empty loop iterations.

21 bytes vs 66 bytes. The disassembler is the only reliable way to catch this. Because the lie that looks reasonable is worse than the lie that looks absurd.

Know your enemies

Enemy	Effect	Symptom	Defense
1. JIT Optimization Level	6×†	NoOptimization 6× slower (†extreme case; real Tier-0→1: 2–4×)	Warmup (BDN default) + DisassemblyDiagnoser
2. GC Pauses	2.3×	Allocation in hot path, StdDev spike	MemoryDiagnoser + GcForce + pre-allocate
3. System Noise	3.7× StdDev	Mean +46%, StdDev 3.7× under load	taskset + nice + more iterations
4. Cache State	2.9×	Working set > L3	Conscious choice: cold vs warm vs hot
5. Branch Predictor	5.0×	Sorted data 5× faster	Realistic (shuffled) data
6. Dead Code Elimination	5.9×	Code Size 21 B vs 66 B	Return result + DisassemblyDiagnoser

Each enemy alone shifted the result 2–6× on this hardware. Stack three and the benchmark and production are different universes.

A reference checklist — not a universal shield, but a starting point that covers what BDN configuration can cover (enemies 1–3) and adds inspection tooling for what it can’t (enemies 4–6). The enemy benchmarks in the companion code intentionally don’t use it — defenses must be down to show the enemies in action:

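using System;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Jobs;

public class EnemyDefenseConfig : ManualConfig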
{
    public EnemyDefenseConfig()
    {
        AddJob(Job.Default
            .WithWarmupCount(3)               // E1: ensure Tier-1 before measurement
            .WithGcServer(true)               // E2: Server GC — fewer, larger collections
            .WithGcForce(true)                // E2: force GC between iterations
            .WithMinIterationCount(15)        // E3: average out scheduler noise
            .WithMaxIterationCount(100)       // E3: let BDN adapt when noise is present
            .WithAffinity((IntPtr)0b11));     // E3: pin to cores 0–1

        AddDiagnoser(MemoryDiagnoser.Default);              // E2: allocation pressure
        AddDiagnoser(new DisassemblyDiagnoser(              // E1+E6: JIT output
            new DisassemblyDiagnoserConfig(maxDepth: 3)));

        AddColumn(StatisticColumn.StdDev);                  // E3: noise visible
    }
}

Enemies 1–3: configuration. Enemies 4–6: conscious data design. No config setting shuffles your test data for you.

Run it yourself

git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/enemies-of-measurement

# All six enemies
dotnet run -c Release

# One enemy at a time
dotnet run -c Release -- --filter '*E5*'

# OS noise comparison (Linux) — see E3 section for full commands
trap 'kill $(jobs -p) 2>/dev/null' EXIT
for i in $(seq 1 $(nproc)); do (while true; do :; done) & done
dotnet run -c Release -- --filter '*E3*'
kill $(jobs -p)
taskset -c 0 dotnet run -c Release -- --filter '*E3*'

The direction reproduces. The exact ratios depend on your hardware.

Benchmark environment

Component	Details
CPU	2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)
L3 Cache	30 MB per socket
RAM	~115 GB DDR3-1866 (quad-channel per socket)
OS	Fedora Linux 42 (kernel 6.17)
Runtime	.NET 9.0.11 (RyuJIT AVX)
SDK	.NET SDK 10.0.102
BenchmarkDotNet	v0.14.0
Job	DefaultJob (BDN auto-selects iteration count, typically 15+)
GC	Server GC, Concurrent (BDN enables Server GC in benchmark processes by default; host process uses Workstation)
Storage	In-memory (no WAL) — enemies hide in the in-memory path
Power	`performance` governor, no frequency scaling
Hygiene	No browser, IDE, or heavy processes during runs

No WAL in this post. These enemies operate in the in-memory path, where fsync can’t drown the signal.

We walked the same path

Same storage engine. Same path. Different place.

Heraclitus (~500 BCE): you cannot step into the same river twice. JIT, GC, scheduler, cache, branch predictor, dead code — the river moved between measurements. Six enemies, each shifting the answer 2–6× on this hardware. They stack.

A number that survives design review but not these six enemies is a comfortable lie — it looks right, it feels reproducible, and it’s wrong.

Don’t trust a number that hasn’t survived six enemies.

First Things First: Why Benchmarks Lie

Tue, 24 Feb 2026 21:00:00 +0100

27.2M ops/sec — and a lie

Same two classes. Same data. Same machine. One benchmark says Dictionary + lock is 2× faster. Another says ConcurrentDictionary is 17× faster. A third says it doesn’t matter — fsync buries the difference in noise. Same optimization — three verdicts.

All code in this post: clone, build, run. Numbers below: dual Xeon E5-2697 v2, 48 threads, 30 MB L3 per socket (NUMA — two sockets, two separate caches, cross-socket traffic is real), ~115 GB DDR3-1866, Fedora 42, .NET 9.0.11 (RyuJIT AVX), BenchmarkDotNet v0.14.0, ShortRun job (LaunchCount=1, WarmupCount=3, IterationCount=3). No process or thread affinity pinning — on dual-socket NUMA, thread migration and cross-socket memory access can widen variance. Different hardware, different numbers — that’s half the lesson.

Throughput in this post is reported in M ops/sec, derived from BenchmarkDotNet’s per-operation time via OperationsPerInvoke = 500_000.
