First Things First: Statistics That Matter

3% slower. Ship it.

Two filter variants over 20 million integers. Five benchmark iterations. FilterTernary: 26.11 ms. FilterBranch: 25.30 ms. The ternary is 3% slower. PR description writes itself. Merge. Deploy.

Next day, rollback. Regression in production — on hardware where the difference vanishes, on data where it reverses.

Design fixed. Environment defended. Data collected honestly. The benchmark is solid. The number is real. The interpretation is not.

All code in this post: clone, build, run. Numbers below were measured on dual Xeon E5-2697 v2 using BenchmarkDotNet v0.14.0, pinned to a single NUMA node — run the companion code on your hardware for your own results. Different machine, different numbers.

Convention: charts use milliseconds unless otherwise noted; tables reproduce BenchmarkDotNet output. BDN’s Error column is the half-width of the 99.9% confidence interval.

The number is the answer

[Benchmark(Baseline = true)]
public long FilterBranch()
{
    long sum = 0;
    int[] data = _data;
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] > 0)
            sum += data[i];
    }
    return sum;
}

[Benchmark]
public long FilterTernary()
{
    long sum = 0;
    int[] data = _data;
    for (int i = 0; i < data.Length; i++)
    {
        int v = data[i];
        sum += v > 0 ? v : 0;
    }
    return sum;
}

Two filter variants over 20M integers (~95% positive). Full source in companion code.

Every benchmarking tutorial ends here: compare two means, pick the lower one. FilterTernary = 26.11 ms, FilterBranch = 25.30 ms — 3% difference. The ternary loses.

How many times did you run it?

Layer 1 — Confidence intervals eat your win

BenchmarkDotNet doesn’t just give you a mean. It gives you Mean ± Error — where Error is the half-width of the 99.9% confidence interval, computed using a Student’s t-distribution with n-1 degrees of freedom.¹

The 5-iteration run — the one that said “3% slower”:

| Method        | N        | Mean     | Error    | StdDev   | Ratio | RatioSD |
|-------------- |--------- |---------:|---------:|---------:|------:|--------:|
| FilterBranch  | 20000000 | 25.30 ms | 0.408 ms | 0.063 ms |  1.00 |    0.00 |
| FilterTernary | 20000000 | 26.11 ms | 2.624 ms | 0.681 ms |  1.03 |    0.02 |

The 99.9% CI for FilterBranch: 25.30 ± 0.408 ms → [24.89, 25.71]. For FilterTernary: 26.11 ± 2.624 ms → [23.49, 28.73]. FilterBranch’s entire range sits inside FilterTernary’s confidence interval. The “3% slower” could be a scheduling hiccup. Five iterations cannot tell you that.
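The overlap check is mechanical enough to script. A minimal sketch, assuming you copy Mean and Error by hand from the BDN table; `ConfidenceInterval` and its members are hypothetical names, not part of BenchmarkDotNet or the companion code:

```csharp
using System;

// Hypothetical helper for illustration; not a BenchmarkDotNet type.
public readonly record struct ConfidenceInterval(double Lower, double Upper)
{
    // BDN's Error column is the CI half-width, so the interval is Mean ± Error.
    public static ConfidenceInterval FromMeanError(double mean, double error)
        => new(mean - error, mean + error);

    // Two closed intervals overlap when neither lies entirely below the other.
    public bool Overlaps(ConfidenceInterval other)
        => Lower <= other.Upper && other.Lower <= Upper;

    // True when 'inner' lies entirely within this interval.
    public bool Contains(ConfidenceInterval inner)
        => Lower <= inner.Lower && inner.Upper <= Upper;
}
```

Fed the 5-iteration numbers, `FromMeanError(25.30, 0.408)` and `FromMeanError(26.11, 2.624)` overlap, and the second contains the first. Treat the result as a screening heuristic, not a test.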

You know this from Part 1. Overlapping CIs, unresolved difference. Run more iterations.

Twenty iterations:

| Method        | N        | Mean     | Error    | StdDev   | Ratio |
|-------------- |--------- |---------:|---------:|---------:|------:|
| FilterBranch  | 20000000 | 25.25 ms | 0.173 ms | 0.177 ms |  1.00 |
| FilterTernary | 20000000 | 25.64 ms | 0.111 ms | 0.109 ms |  1.02 |

The 99.9% CI for FilterBranch: [25.08, 25.42]. For FilterTernary: [25.53, 25.75]. No overlap. A manual Welch t-test on this data gives p < 0.001.² The difference is real.
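For readers who want to redo the manual test, here is a sketch of the Welch statistic computed from summary numbers. Assumptions, stated loudly: the `WelchTest` class is hypothetical, it is not the companion code's implementation, and the sample size of 20 per method is a guess (BDN may drop outliers, so the effective n can be smaller). The p-value itself needs a t-distribution CDF, which the base class library does not ship, so the sketch stops at t and the degrees of freedom.

```csharp
using System;

// Hypothetical helper for illustration; not the companion code's implementation.
public static class WelchTest
{
    // Welch's t statistic: mean difference over the unpooled standard error.
    public static double TStatistic(double mean1, double sd1, int n1,
                                    double mean2, double sd2, int n2)
    {
        double standardError = Math.Sqrt(sd1 * sd1 / n1 + sd2 * sd2 / n2);
        return (mean1 - mean2) / standardError;
    }

    // Welch-Satterthwaite approximation of the degrees of freedom.
    public static double DegreesOfFreedom(double sd1, int n1, double sd2, int n2)
    {
        double v1 = sd1 * sd1 / n1;   // variance of the first mean
        double v2 = sd2 * sd2 / n2;   // variance of the second mean
        return (v1 + v2) * (v1 + v2)
             / (v1 * v1 / (n1 - 1) + v2 * v2 / (n2 - 1));
    }
}
```

With the table values and the assumed n = 20, `TStatistic(25.64, 0.109, 20, 25.25, 0.177, 20)` comes out near 8.4 on roughly 32 degrees of freedom. The two-sided 0.001 critical value at that many degrees of freedom is about 3.6, which is consistent with the reported p < 0.001.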

FilterTernary is 2% slower. The 5-iteration run saw the right direction but had no basis to trust it — the CI was so wide it could not separate signal from noise.

The Error on FilterTernary dropped from ±2.6 ms to ±0.1 ms. More than an order of magnitude. More iterations, sure. But .NET’s JIT compiles in tiers: Tier-0 (quick, unoptimized) on first calls, Tier-1 (full optimization) after enough invocations. If BDN’s warmup didn’t fully promote both methods, the 5-iteration run might have caught Tier-0 code while the 20-iteration run measured Tier-1. Different machine code, different variance profile.

Worth checking. Expand the ternary first:

// FilterBranch
if (data[i] > 0)
    sum += data[i];

// FilterTernary — expand v > 0 ? v : 0
if (v > 0) sum += v;
else        sum += 0;

The branch skips. The ternary always adds — even zero. Structurally different operations.

Putting [DisassemblyDiagnoser] (the tool Enemy 6 introduced) on the benchmark class dumps the native code: run the benchmark, then check BenchmarkDotNet.Artifacts/results/*-asm.md. Five iterations:

; FilterBranch — 54 bytes
M00_L00:
       mov       edi,[rcx]        ; load data[i]
       test      edi,edi          ; data[i] > 0?
       jle       short M00_L01    ; skip if not
       movsxd    rdi,edi          ; sign-extend to 64-bit
       add       rax,rdi          ; sum += data[i]
M00_L01:
       add       rcx,4            ; advance to the next int (i++)
       dec       edx              ; one fewer element remaining
       jne       short M00_L00    ; loop until the count reaches zero

; FilterTernary — 58 bytes
M00_L00:
       mov       edi,[rcx]        ; load v = data[i]
       test      edi,edi          ; v > 0?
       jle       short M00_L03    ; if not, jump to zero path
M00_L01:
       movsxd    rdi,edi          ; sign-extend
       add       rax,rdi          ; sum += v (or sum += 0)
       add       rcx,4            ; advance to the next int (i++)
       dec       edx              ; one fewer element remaining
       jne       short M00_L00    ; loop until the count reaches zero
; ...
M00_L03:
       xor       edi,edi          ; v = 0
       jmp       short M00_L01    ; jump back to add

Twenty iterations:

; FilterBranch — 54 bytes
M00_L00:
       mov       edi,[rcx]
       test      edi,edi
       jle       short M00_L01
       movsxd    rdi,edi
       add       rax,rdi
M00_L01:
       add       rcx,4
       dec       edx
       jne       short M00_L00

; FilterTernary — 58 bytes
M00_L00:
       mov       edi,[rcx]
       test      edi,edi
       jle       short M00_L03
M00_L01:
       movsxd    rdi,edi
       add       rax,rdi
       add       rcx,4
       dec       edx
       jne       short M00_L00
; ...
M00_L03:
       xor       edi,edi
       jmp       short M00_L01

Identical machine code. Both runs. The Error dropped because more iterations and lower observed variance both narrowed the confidence interval. BDN’s Error is t(0.0005, n−1) × StdDev / √n — StdDev for FilterTernary fell from 0.681 ms to 0.109 ms (6×), and the larger sample brought a smaller t-value and larger √n. The variance reduction did most of the work.
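The split between the two factors can be read straight off the tables, without knowing how many samples BDN kept after outlier removal. A worked version, using only the Error and StdDev columns for FilterTernary (the grouping into a "StdDev" factor and a "sample size" factor is my framing):

```latex
\[
\text{Error} = t_{0.0005,\,n-1}\cdot\frac{s}{\sqrt{n}}
\quad\Longrightarrow\quad
\frac{\text{Error}}{s} = \frac{t_{0.0005,\,n-1}}{\sqrt{n}}
\]
\[
\text{5 iterations: } \frac{2.624}{0.681} \approx 3.85
\qquad
\text{20 iterations: } \frac{0.111}{0.109} \approx 1.02
\]
\[
\frac{2.624}{0.111} \approx 23.6
\;\approx\;
\underbrace{\frac{0.681}{0.109}}_{\approx 6.2\ \text{(StdDev)}}
\times
\underbrace{\frac{3.85}{1.02}}_{\approx 3.8\ \text{(sample size)}}
\]
```

The Error shrank about 24x overall: roughly 6x from lower observed variance and roughly 4x from the larger sample.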

A number without error bars is an opinion. Five iterations produced CIs so wide that either outcome fit the data. Twenty produced CIs narrow enough to separate signal from noise — not certainty, but 99.9% confidence that FilterBranch is faster on this hardware. If you had shipped after five, you’d have deployed a guess as a conclusion.

CI answers one question: does a difference exist? It says nothing about whether the difference matters.

Layer 2 — Effect size: when “significant” doesn’t mean “meaningful”

The 20-iteration result says FilterTernary is 2% slower. The CIs don’t overlap. The difference is statistically real. But 0.4 ms on a 25 ms operation over 20 million integers. Is that worth changing the code?

Statistical significance asks does a difference exist? Practical significance asks does it matter? BDN answers the first. You answer the second.

Cohen’s d — the standardized effect size — measures the distance between two means in units of the pooled standard deviation:³

d = |mean_1 - mean_2| / pooled SD

public static double CohensD(double mean1, double stdDev1, double mean2, double stdDev2)
{
    double pooledSd = Math.Sqrt((stdDev1 * stdDev1 + stdDev2 * stdDev2) / 2.0);
    if (pooledSd == 0) return 0;
    return Math.Abs(mean1 - mean2) / pooledSd;
}

Cohen’s d computation — full source in Analysis/StatisticalReport.cs.

Cohen’s d for FilterBranch vs FilterTernary: |25.25 - 25.64| / sqrt((0.177^2 + 0.109^2)/2) = 0.39 / 0.147 = 2.65. By the standard thresholds (0.2 = small, 0.5 = medium, 0.8 = large), that’s a “large” effect.

But 2.65 for a 2% difference? Something is off.

The threshold trap

Cohen’s d thresholds were calibrated for psychology experiments where within-group variance is naturally high. BenchmarkDotNet’s within-run variance is very low in controlled microbenchmarks — sub-1% coefficient of variation for compute-bound loops. When the denominator (pooled SD) is tiny, even a trivial mean difference produces a massive d.
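One way to see the trap is to divide numerator and denominator by the baseline mean: d becomes the relative difference divided by the relative spread. A worked sketch with the 20-iteration filter numbers (the symbols are my notation, with the pooled SD written as s_p):

```latex
\[
d = \frac{|\bar{x}_1 - \bar{x}_2|}{s_p}
  = \frac{|\bar{x}_1 - \bar{x}_2| \,/\, \bar{x}_1}{s_p \,/\, \bar{x}_1}
\]
\[
d \approx \frac{0.39 / 25.25}{0.147 / 25.25}
  \approx \frac{1.54\%}{0.58\%}
  \approx 2.65
\]
```

A 1.5% difference measured against a 0.6% spread already clears the "large" threshold three times over. The tighter the benchmark, the smaller the difference that earns the label.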

Three pairs from the companion code:

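| Pair                          | Ratio | Practical delta | Cohen’s d | “Interpretation” |
|------------------------------ |------:|----------------:|----------:|----------------- |
| FilterBranch vs FilterTernary |  1.02 |              2% |      2.65 | “large”          |
| SumArray vs SumSpan           |  1.01 |            0.5% |      1.98 | “large”          |
| SearchLinear vs SearchBinary  | 0.001 |          1,071x |       368 | “large”          |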

All three “large” by Cohen’s thresholds. Only one is a meaningful optimization. Wittgenstein (1953): meaning is use — a word means what it means in the language game where it was coined. Cohen’s thresholds were coined in a game where within-group variance is high and effect sizes are modest. Microbenchmarking is a different game — sub-1% coefficient of variation, deterministic loops, controlled environments. “Large” means something in psychology. The standard interpretation becomes misleading when BDN’s precision makes the denominator vanishingly small. A 0.5% difference and a 1,071x difference land in the same bucket.

Popper (1934): a hypothesis survives by resisting falsification, not by accumulating confirmation. “3% faster” is a hypothesis. Non-overlapping CIs survived the first test — the difference exists. But Cohen’s d at 2.65 for a 2% change is the hypothesis flattering itself. The effect size, on BDN’s terrain, does not survive scrutiny. Seek the conditions under which the claim fails, not the ones where it holds.

For microbenchmarks, rely primarily on BDN’s Ratio column rather than Cohen’s d. Ratio ~ 1.00 means “no practical difference.” Ratio ~ 0.001 means “algorithmic change.” Whether 2% matters depends on context — a hot loop called billions of times, or a function called once per request. Define your threshold before you run.
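Committing to the threshold in code keeps it from drifting after the numbers arrive. A minimal sketch, assuming a hypothetical `PracticalSignificance` helper and a threshold you pick yourself; nothing here is part of BenchmarkDotNet:

```csharp
using System;

public enum Verdict { NotActionable, Actionable }

// Hypothetical helper for illustration; the threshold is yours to choose.
public static class PracticalSignificance
{
    // Ratio as BDN reports it: candidate mean divided by baseline mean.
    public static double Ratio(double baselineMean, double candidateMean)
        => candidateMean / baselineMean;

    // Actionable only if the relative change reaches the threshold fixed
    // before the run (0.05 means 5%).
    public static Verdict Judge(double baselineMean, double candidateMean, double threshold)
    {
        double relativeChange = Math.Abs(Ratio(baselineMean, candidateMean) - 1.0);
        return relativeChange >= threshold ? Verdict.Actionable : Verdict.NotActionable;
    }
}
```

With an illustrative 5% threshold, `Judge(25.25, 25.64, 0.05)` returns `NotActionable`: the filter difference is real but below the bar. Whether 5% is the right bar is a product decision, not a statistical one.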

Two extremes

Small practical effect — array indexing vs Span indexing over 1M integers:

| Method   | Categories  | N       | Mean     | Error   | StdDev  | Ratio |
|--------- |------------ |-------- |---------:|--------:|--------:|------:|
| SumArray | SmallEffect | 1000000 | 512.7 us | 1.16 us | 1.19 us |  1.00 |
| SumSpan  | SmallEffect | 1000000 | 515.3 us | 1.28 us | 1.42 us |  1.01 |

Ratio = 1.01. The JIT produces nearly identical code for both — bounds-check elimination applies to int[] and ReadOnlySpan<int> alike on .NET 9. The 2.6 us difference (0.5%) is likely real — the CIs don’t overlap, which is a conservative indicator — but not worth a code change.

Large practical effect — linear search vs binary search over 1M integers:

| Method       | Categories  | N       | Mean         | Error    | StdDev   | Ratio |
|------------- |------------ |-------- |-------------:|---------:|---------:|------:|
| SearchLinear | LargeEffect | 1000000 | 248,303.3 us | 928.6 us | 953.6 us | 1.000 |
| SearchBinary | LargeEffect | 1000000 |     231.8 us |   1.5 us |   1.7 us | 0.001 |

Ratio = 0.001. O(n) vs O(log n). An algorithmic change — not a JIT quirk, not a cache alignment artifact. 1,071x faster on this hardware. The algorithmic advantage holds on any platform with sorted data, though the exact multiplier will vary.

A number with error bars but no effect size is only half an answer.

Layer 3 — Micro vs macro: right question, wrong scale

A microbenchmark isolates a function. A macrobenchmark places it inside a pipeline. They answer different questions — and the answers disagree.

// Micro: isolated lookup — Dictionary vs linear search over 10,000 elements
[BenchmarkCategory("Micro")]
[Benchmark(Baseline = true)]
public int LookupLinear()
{
    int found = 0;
    for (int i = 0; i < _searchKeys.Length; i++)
    {
        if (Array.IndexOf(_data, _searchKeys[i]) >= 0)
            found++;
    }
    return found;
}

[BenchmarkCategory("Micro")]
[Benchmark]
public int LookupDictionary()
{
    int found = 0;
    for (int i = 0; i < _searchKeys.Length; i++)
    {
        if (_dict.ContainsKey(_searchKeys[i]))
            found++;
    }
    return found;
}

Microbenchmark — isolated lookup comparison over 200 search keys. Full source in companion code.

| Method           | Categories | Mean       | Error    | StdDev   | Ratio |
|----------------- |----------- |-----------:|---------:|---------:|------:|
| LookupLinear     | Micro      | 412.089 us | 1.609 us | 1.788 us | 1.000 |
| LookupDictionary | Micro      |   1.571 us | 0.012 us | 0.014 us | 0.004 |

Dictionary is 262x faster. Ship it?

The lookup lives inside a pipeline:

[Benchmark(Baseline = true)]
public long PipelineLinear()
{
    long v = ValidateArray(_workload);     // ~40% — sequential scan, 3M elements
    long t = PolynomialTransform(_workload); // ~40% — multiply/add/xor, 3M elements
    int  l = LookupAllLinear(_data, _searchKeys); // ~6% — 200 keys × Array.IndexOf
    long a = Aggregate(_workload);          // ~15% — weighted sum, stride 4
    return v ^ t ^ l ^ a;
}

[Benchmark]
public long PipelineDictionary()
{
    long v = ValidateArray(_workload);
    long t = PolynomialTransform(_workload);
    int  l = LookupAllDictionary(_searchKeys); // Dictionary.ContainsKey
    long a = Aggregate(_workload);
    return v ^ t ^ l ^ a;
}

Only the lookup step changes. Full source in companion code.

94% of the work doesn’t change regardless of lookup strategy.

| Method             | Categories | Mean         | Error     | StdDev    | Ratio |
|------------------- |----------- |-------------:|----------:|----------:|------:|
| PipelineLinear     | Macro      | 7,181.115 us | 59.636 us | 66.285 us |  1.00 |
| PipelineDictionary | Macro      | 6,611.982 us | 11.094 us | 11.871 us |  0.92 |

Pipeline with Dictionary is 8% faster. Not 262x. Eight percent.

The lookup consumes 412 us out of 7,181 us total, or 5.7% of the pipeline. By Amdahl’s law⁴, a 262x speedup on that 5.7% caps the overall speedup at 1 / (1 - 0.057 + 0.057/262) ≈ 1.06, about 6%. The measured 8% is higher: cache effects from eliminating the linear scan likely benefit subsequent pipeline steps.
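Written out, with p as the fraction of time spent in the lookup and s as the local speedup (my notation, not the post's):

```latex
\[
S = \frac{1}{(1 - p) + \dfrac{p}{s}}
\]
\[
S_{\text{predicted}} = \frac{1}{(1 - 0.057) + \dfrac{0.057}{262}}
  = \frac{1}{0.9432} \approx 1.060
\]
\[
S_{\text{measured}} = \frac{7181.115}{6611.982} \approx 1.086
\]
```

Note that the Ratio column (0.92) is time relative to baseline, while Amdahl's S is a speedup factor. The gap between prediction and measurement holds in either unit.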

Micro answers “is this function faster?” Macro answers “will the user notice?”

Baudrillard (1981): the fourth phase of the simulacrum — the image bears no relation to any reality whatever. The microbenchmark says 262x. The macrobenchmark says 8%. Both have error bars. Both passed statistical tests. Both are internally consistent. Neither describes what the user experiences. Two maps orbiting each other, each valid within its own coordinate system, each detached from the territory they claim to represent. The micro number didn’t lie. The macro number didn’t lie. The lie was believing either one alone was the answer.

Eight percent might be worth it — or might not, depending on whether the pipeline runs once per request or once per hour. The microbenchmark alone cannot tell you.

Before you ship the number

| Check         | Question | If no… |
|-------------- |--------- |------- |
| Iterations    | Did you run enough iterations? (>= 15 in this setup, configured via SimpleJob) | Your CIs are too wide and the result might be noise (see Layer 1) |
| CI overlap    | Are the 99.9% CIs (Mean ± BDN Error) non-overlapping? | Overlapping CIs suggest noise, but non-overlap is conservative, not definitive. Confirm with a formal test (Welch / Mann-Whitney) |
| Practical size | Is the Ratio meaningfully different from 1.00? Does it exceed your SESOI? | Statistically real but practically irrelevant: move on |
| Micro = Macro | Does the micro speedup translate to end-to-end improvement? | The bottleneck is elsewhere: profile before optimizing |
| Reproducible  | Same result on different hardware / OS / runtime? | Environment-dependent (see Part 2) |

Three rules:

1. Always report confidence intervals. A mean without CI is a claim, not evidence. BenchmarkDotNet provides the Error column (99.9% CI half-width), so use it. CI overlap is a useful quick screening heuristic: overlapping CIs suggest noise and non-overlapping CIs suggest a real difference, but neither is definitive. Overlapping CIs can still hide a significant difference, and non-overlapping CIs are a conservative rule, not proof. For a formal conclusion, use a statistical test (Welch’s t-test, Mann-Whitney U). If you only ran 5 iterations, run more.
2. Distinguish statistical from practical significance. Non-overlapping CIs mean the difference exists. They don’t mean it matters. Define a SESOI (smallest effect size of interest) before running the benchmark: the minimum improvement that justifies the code change. BDN’s Ratio column tells you the proportional difference. If it doesn’t cross your SESOI threshold, the result is real but not actionable.
3. Confirm micro with macro. A microbenchmark shows a function is faster in isolation. A macrobenchmark shows the user will notice. Run both, or explain why you didn’t. A 262x micro speedup sounds compelling until the pipeline reduces it to 8%.

Run it yourself

git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/statistics-that-matter

# All benchmarks (20 iterations, ~3 min)
# Pin to a single NUMA node to eliminate cross-socket variance
taskset -c 0-11 dotnet run -c Release

# Individual scenarios
taskset -c 0-11 dotnet run -c Release -- --filter '*NoisyComparison*'
taskset -c 0-11 dotnet run -c Release -- --filter '*EffectSizeDemo*'
taskset -c 0-11 dotnet run -c Release -- --filter '*MicroVsMacro*'

# Reproduce the CI overlap demo (5 iterations — wide error bars)
taskset -c 0-11 dotnet run -c Release -- --filter '*NoisyComparison*' --iterationCount 5 --warmupCount 3

Benchmark environment

| Component       | Value |
|---------------- |------ |
| CPU             | 2x Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads) |
| RAM             | ~115 GB DDR3-1866 (quad-channel per socket) |
| OS              | Fedora Linux 42 (kernel 6.17) |
| Runtime         | .NET 9.0.11 (RyuJIT AVX) |
| SDK             | .NET SDK 10.0.102 |
| BenchmarkDotNet | v0.14.0 |
| GC              | Server GC, Concurrent (BDN enables Server GC in benchmark processes by default; host process uses Workstation) |
| Pinning         | `taskset -c 0-11` (single socket, physical cores only) |
| Job             | SimpleJob (WarmupCount=5, IterationCount=20) |

Limitations: Single machine, dual-socket NUMA. All benchmarks pinned to one socket to eliminate cross-socket memory access and thread migration — without pinning, NoisyComparison variance doubles and absolute values shift by 5-10% between runs (Part 2). EffectSizeDemo uses sorted data for binary search — the algorithmic advantage is inherent, not hardware-dependent. MicroVsMacro pipeline proportions (40/40/6/15%) are approximate — workload ratios on your hardware will vary.
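For orientation, this is roughly how the Job row maps onto BenchmarkDotNet attributes. A configuration sketch only: the class name is inferred from the `--filter` patterns, and the setup body (seed, value range, how the ~95% positive mix is produced) is a hypothetical stand-in for whatever the companion code actually does.

```csharp
using System;
using BenchmarkDotNet.Attributes;

// Sketch: attribute values come from the environment table above;
// everything inside Setup() is an assumption for illustration.
[SimpleJob(warmupCount: 5, iterationCount: 20)]
[DisassemblyDiagnoser]
public class NoisyComparison
{
    private int[] _data = Array.Empty<int>();

    [Params(20_000_000)]
    public int N;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);   // fixed seed is an assumption
        _data = new int[N];
        for (int i = 0; i < N; i++)
        {
            int magnitude = rng.Next(1, 1000);
            // Roughly 95% positive values, 5% negative.
            _data[i] = rng.Next(100) < 95 ? magnitude : -magnitude;
        }
    }

    // FilterBranch and FilterTernary go here, as listed earlier in the post.
}
```

The `--iterationCount 5 --warmupCount 3` flags in the commands above override these attribute values for a single run, which is how the wide-CI demo is reproduced without editing the source.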

Even with honest design, controlled environment, and correct measurement — the number still needs interpretation. Too few iterations and the CI swallows the difference. Tight CIs inflate Cohen’s d into meaninglessness. Microbenchmarks promise 262x while the user sees 8%.

Hume (1739): no finite number of observations guarantees the next will conform. But the problem isn’t too few observations — it’s too much readiness to conclude. The confirmation doesn’t come from the data. It comes from you. The number said “3% slower” and you heard “regression” because you were already looking for one. The CIs were wide enough to hold any story. You picked the one that matched.

“3% faster” is not a result. It’s a hypothesis. Treat it like one — confirm it with sufficient iterations, assess practical significance, and validate it against end-to-end behavior. Or revert the merge.
