<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>.NET on 0x3F</title>
    <link>https://0x3f.blog/tags/dotnet/</link>
    <description>Recent content in .NET on 0x3F</description>
    <generator>Hugo -- 0.152.2</generator>
    <language>en-us</language>
    <lastBuildDate>Tue, 10 Mar 2026 19:00:00 +0100</lastBuildDate>
    <atom:link href="https://0x3f.blog/tags/dotnet/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>First Things First: Hardware Counters</title>
      <link>https://0x3f.blog/posts/first-things-first-hardware-counters/</link>
      <pubDate>Tue, 10 Mar 2026 19:00:00 +0100</pubDate>
      <guid>https://0x3f.blog/posts/first-things-first-hardware-counters/</guid>
      <description>12.9× slower — BDN tells you that. But it cannot tell you why, or predict that the ratio will jump when the dataset outgrows the cache. Hardware counters can. Time is a result, not a cause.</description>
      <content:encoded><![CDATA[<h2 id="129-slower--and-thats-the-easy-part">12.9× slower — and that&rsquo;s the easy part</h2>
<p>Two loops over the same array. Same data. Same sum operation. One walks the array sequentially; the other uses a random permutation for indirection. BenchmarkDotNet says SumRandom is 12.88× slower at one million elements. No surprise — random memory access is slower. Everyone knows that.</p>
<p>But <em>how much slower will it get</em> when the dataset grows 64×?</p>
<p>BDN measures time. Time compresses everything the CPU did — cache behavior, prefetch, pipeline stalls, memory latency — into a single scalar. It answers <em>how much</em>. It cannot answer <em>why</em>. And without <em>why</em>, the next question — <em>what happens when conditions change</em> — is a guess.</p>
<p>The first four posts taught doubt. Design lies through omission. Environment masks distortion. Data collection coordinates with failure. Interpretation drifts from evidence. Each layer peeled back a way the measurement could mislead, and each time the tools were doing it <em>to you</em> while appearing to work <em>with you</em>.</p>
<p>This post goes somewhere different.</p>
<p>All code in this post: clone, build, run. Numbers below were measured on a dual-socket Xeon E5-2697 v2 machine using BenchmarkDotNet v0.14.0. <em>Charts use milliseconds unless otherwise noted; tables reproduce BenchmarkDotNet output in microseconds (us). BDN&rsquo;s Error column is the half-width of the 99.9% confidence interval.</em></p>
<hr>
<h2 id="the-setup--two-paths-same-operation">The setup — two paths, same operation</h2>
<div class="highlight"><pre data-lang="csharp"><code>[Benchmark(Baseline = true)]
public long SumSequential()
{
    long sum = 0;
    long[] data = _data;
    for (int i = 0; i &lt; data.Length; i&#43;&#43;)
        sum &#43;= data[i];
    return sum;
}

[Benchmark]
public long SumRandom()
{
    long sum = 0;
    long[] data = _data;
    int[] indices = _indices;
    for (int i = 0; i &lt; indices.Length; i&#43;&#43;)
        sum &#43;= data[indices[i]];
    return sum;
}</code></pre></div>
<p><small>Sequential vs random access over <code>long[]</code>. <code>_indices</code> is a Fisher-Yates shuffle of 0..N-1 — same elements, different order. Full source in companion code.</small></p>
<p>Both methods compute the same sum. Both touch every element exactly once. The only difference: the order of access. Sequential walks the array from start to end. Random jumps through a pre-shuffled index array.</p>
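<p>The companion code holds the real setup. As a minimal sketch of how such an index array could be produced, assuming a fixed seed and a hypothetical helper named <code>BuildShuffledIndices</code> (neither is taken from the companion code):</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class IndexSetup
{
    // Hypothetical helper: returns a permutation of 0..n-1 (Fisher-Yates shuffle).
    public static int[] BuildShuffledIndices(int n, int seed = 42)
    {
        var indices = new int[n];
        for (int i = 0; i &lt; n; i&#43;&#43;)
            indices[i] = i;

        // The fixed seed is an assumption; it keeps the permutation identical across runs.
        var rng = new Random(seed);
        for (int i = n - 1; i &gt; 0; i--)
        {
            int j = rng.Next(i &#43; 1); // uniform index in [0, i]
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }
        return indices;
    }
}</code></pre></div>
<p>Because the result is a permutation, <code>SumRandom</code> reads every element of <code>data</code> exactly once, so the two benchmarks do the same arithmetic and differ only in access order.</p>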
<p>At one million elements (8 MB of <code>long[]</code> — plus 4 MB of <code>int[]</code> indices for the random variant — both fit comfortably in the 30 MB L3 cache on Ivy Bridge-EP):</p>
<div class="highlight"><pre data-lang=""><code>| Method        | N       | Mean       | Error    | StdDev   | Ratio | RatioSD |
|-------------- |-------- |-----------:|---------:|---------:|------:|--------:|
| SumSequential | 1000000 |   561.0 us |  1.85 us |  1.44 us |  1.00 |    0.00 |
| SumRandom     | 1000000 | 7,223.9 us | 12.74 us | 10.63 us | 12.88 |    0.04 |</code></pre></div>
<p>12.88× slower. The confidence intervals don&rsquo;t overlap. The difference is real and large. Random access is slower — water is wet. Ship the sequential version, move on.</p>
<p>BDN told you the <em>what</em>. It didn&rsquo;t tell you the <em>why</em>. And without the <em>why</em>, you can&rsquo;t predict <em>what happens next</em>.</p>
<hr>
<h2 id="level-1--perf-stat-the-vital-signs">Level 1 — perf stat: the vital signs</h2>
<p><code>perf stat</code> reads hardware performance counters — registers built into the CPU that count events like cycles, instructions, cache accesses, and cache misses. No sampling, no code instrumentation, and typically negligible overhead — the CPU increments these counters in hardware, and <code>perf stat</code> reads the registers at process start/stop. When you request more events than the CPU has physical counter registers, <code>perf</code> multiplexes (time-shares) and scales the results, which introduces estimation error — the percentages in the output below reflect this.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<div class="highlight"><pre data-lang="bash"><code>perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter &#39;*Sequential*&#39;</code></pre></div>
<p><em>If any event is unsupported on your CPU, <code>perf stat</code> will report <code>&lt;not supported&gt;</code> for that counter. Run <code>perf list</code> to see available events. At minimum, <code>cycles</code> and <code>instructions</code> (for IPC) are widely available on modern x86 CPUs; verify with <code>perf list</code>.</em></p>
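<p>The scaling applied to multiplexed events can be written out. A minimal sketch, assuming the usual perf convention that a multiplexed count is extrapolated by the ratio of the time the event was enabled to the time it was actually counting; the class and parameter names are illustrative:</p>
<div class="highlight"><pre data-lang="csharp"><code>public static class Multiplexing
{
    // Extrapolates a raw count to the whole run: raw * enabled / running.
    public static double ScaledCount(long rawCount, long timeEnabledNs, long timeRunningNs) =&gt;
        timeRunningNs == 0 ? 0.0 : rawCount * ((double)timeEnabledNs / timeRunningNs);

    // Share of the run during which the event occupied a physical counter.
    public static double RunningPercent(long timeEnabledNs, long timeRunningNs) =&gt;
        timeEnabledNs == 0 ? 0.0 : 100.0 * timeRunningNs / timeEnabledNs;
}</code></pre></div>
<p>With invented inputs, an event that registered 1,000 occurrences while scheduled for half of the run is reported as 2,000. The lower the running share, the more the reported figure is an estimate rather than a count.</p>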
<p>Run this for both variants and you get a side-by-side comparison of what the CPU was actually doing:<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<table>
  <thead>
      <tr>
          <th>Counter</th>
          <th>Sequential</th>
          <th>Random</th>
          <th>What it means</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>IPC (instructions/cycle)</td>
          <td>1.54</td>
          <td>0.42</td>
          <td>CPU throughput — how many instructions retire per clock cycle</td>
      </tr>
      <tr>
          <td>L1 data cache miss rate</td>
          <td>11.64%</td>
          <td>24.90%</td>
          <td>Fraction of loads that miss the fastest cache (32 KB, ~4 cycle latency)</td>
      </tr>
      <tr>
          <td>LLC load miss rate</td>
          <td>53.02%*</td>
          <td>30.38%*</td>
          <td>Fraction of last-level cache loads that go to DRAM — <em>inverted due to aggregation; see note</em><sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></td>
      </tr>
      <tr>
          <td>Branch misprediction rate</td>
          <td>0.87%</td>
          <td>2.60%</td>
          <td>Fraction of branches predicted wrong — both are low</td>
      </tr>
  </tbody>
</table>
<p><em>A caveat these numbers have earned: they are aggregated across the full BDN process — warmup, pilot, and actual iterations at all three dataset sizes (1M, 8M, 64M). They diagnose the mechanism (memory-bound vs compute-bound), not behavior at any single N. The IPC gap (1.54 vs 0.42) and the L1 miss rate gap (11.64% vs 24.90%) are directionally stable across aggregation — random access is memory-bound regardless of how you slice the data. The LLC miss rates are less trustworthy: sequential appears worse (53% vs 30%) because it runs ~3× more total iterations, and the 64M dataset dominates its LLC totals — see <sup id="fnref1:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> for details. The prediction in Level 3 rests on the mechanism (memory-bound + L3-dependent), not on the exact percentages.</em></p>
<p>The first four posts taught methodical doubt. What to trust? What to distrust? How deep does the distortion go? Descartes reached for the <em>cogito</em> — the one thing doubt couldn&rsquo;t dissolve. Here the descent through software abstraction reaches something similar. <code>perf stat</code> doesn&rsquo;t measure time. It doesn&rsquo;t measure abstractions. It reads registers that the silicon increments whether anyone is watching or not. The counters exist at the boundary where software models end and physics begins. Doubt doesn&rsquo;t end in nihilism. It ends in firmer ground.</p>
<p><strong>IPC is the headline.</strong> Sequential executes 1.54 instructions per cycle. Random executes 0.42. The CPU is 3.7× more productive on sequential access on this Ivy Bridge-EP — not because it runs different instructions, but because it <em>doesn&rsquo;t stall</em>. The hardware prefetcher detects the sequential stride, fetches cache lines ahead of the loop, and the data is waiting in L1 before the load instruction executes.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<p>Random access defeats the prefetcher. Every load is a surprise. The CPU issues the load, waits 10-40 cycles for L2/L3, and the pipeline stalls. The instructions are the same — the wait is different.</p>
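<p>The derived figures in the table are plain ratios of raw counter values. A small sketch of that arithmetic, with illustrative names; the counts mentioned in the note after it are invented round numbers, not perf output from this post:</p>
<div class="highlight"><pre data-lang="csharp"><code>public static class CounterMath
{
    // Instructions retired per clock cycle.
    public static double Ipc(long instructions, long cycles) =&gt;
        cycles == 0 ? 0.0 : (double)instructions / cycles;

    // Misses as a percentage of the matching access counter,
    // e.g. L1-dcache-load-misses over L1-dcache-loads.
    public static double MissRatePercent(long misses, long accesses) =&gt;
        accesses == 0 ? 0.0 : 100.0 * misses / accesses;
}</code></pre></div>
<p>For example, 1,540,000 instructions over 1,000,000 cycles gives an IPC of 1.54. The 3.7× gap quoted above is the same kind of division applied to the two IPC values: 1.54 / 0.42.</p>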
<div class="chart-container">
  <canvas id="chart-1983876d39dfa57fd14a677d1487a52b"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-1983876d39dfa57fd14a677d1487a52b').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['IPC\n(insn/cycle)', 'L1 miss rate\n(%)', 'Branch miss rate\n(%)'],
    datasets: [
      {
        label: 'Sequential',
        data: [1.54, 11.64, 0.87],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'Random',
        data: [0.42, 24.90, 2.60],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'perf stat — same code, different microarchitecture behavior' },
      subtitle: { display: true, text: 'IPC: 1.54 vs 0.42, so sequential retires about 3.7× more instructions per cycle' },
      legend: { display: true }
    },
    scales: {
      y: {
        title: { display: true, text: 'Value' }
      }
    }
  }
}
);
  })();
</script>

<p>BDN said &ldquo;12.88× slower.&rdquo; The aggregate hardware counters<sup id="fnref1:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> reveal the mechanism: the CPU is stalling on cache misses. The instructions aren&rsquo;t slower — they&rsquo;re <em>waiting</em>. And waiting scales with memory latency, which scales with working set size.</p>
<p>That&rsquo;s the basis for a prediction.</p>
<hr>
<h2 id="level-2--flame-graphs-the-shape-of-time">Level 2 — Flame graphs: the shape of time</h2>
<p>Hardware counters tell you <em>what</em> the CPU is doing — stalling on cache misses, mispredicting branches. Flame graphs tell you <em>where the cost concentrates</em> in the code path.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></p>
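<p>A flame graph is a visualization of stack traces sampled by <code>perf record</code>. The x-axis is not time: it is the population of samples, with frames ordered alphabetically rather than chronologically. A wider frame means more samples contained that function, which approximates more CPU time spent in it and its callees.</p>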
<div class="highlight"><pre data-lang="bash"><code># Record stack traces at 99 Hz (standalone runner, no BDN overhead)
perf record -F 99 --call-graph dwarf -- \
    dotnet run -c Release -- perf-sequential 8000000

# Convert to flame graph SVG
perf script | stackcollapse-perf.pl | flamegraph.pl &gt; sequential.svg</code></pre></div>
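<p>The intermediate output of <code>stackcollapse-perf.pl</code> is one line per unique stack: frames joined by semicolons, followed by a sample count. To make the idea of width as sample population concrete, here is a hedged sketch that totals samples per frame name from such lines. The class and method names are illustrative and are not part of the FlameGraph scripts:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.Collections.Generic;

public static class FoldedStacks
{
    // Input lines look like &#34;main;Run;SumRandom 42&#34;.
    // Returns, per frame name, how many samples contained that frame.
    public static Dictionary&lt;string, long&gt; SamplesPerFrame(IEnumerable&lt;string&gt; lines)
    {
        var totals = new Dictionary&lt;string, long&gt;();
        foreach (var line in lines)
        {
            // Frame names may contain spaces, so split on the last one.
            int split = line.LastIndexOf(&#39; &#39;);
            if (split &lt;= 0) continue; // blank or malformed line
            if (!long.TryParse(line.Substring(split &#43; 1), out long count)) continue;

            // De-duplicate frames within one stack so recursion is not counted twice.
            var frames = new HashSet&lt;string&gt;(line.Substring(0, split).Split(&#39;;&#39;));
            foreach (var frame in frames)
            {
                totals.TryGetValue(frame, out long current);
                totals[frame] = current &#43; count;
            }
        }
        return totals;
    }
}</code></pre></div>
<p>Given two invented lines, <code>main;Run;SumRandom 90</code> and <code>main;Run;GC 10</code>, the totals are 100 for <code>main</code> and <code>Run</code>, 90 for <code>SumRandom</code>, and 10 for <code>GC</code>. Those proportions are what the frame widths encode.</p>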
<p>Sequential — one hot column, tight loop, no stalls wide enough to sample:</p>
<div class="flamegraph-wrap">
  <div id="fg-ca5bd376abffe4f2d28d4b6288df8c15" class="flamegraph-canvas"></div>
</div>
<script>
  (function() {
    fetch('\/flamegraphs\/sequential.json')
      .then(function(r) { return r.json(); })
      .then(function(data) {
        var el = document.getElementById('fg-ca5bd376abffe4f2d28d4b6288df8c15');
        new FlameGraph(el, data, { title: 'SumSequential — 8M elements, 2000 iterations' });
      });
  })();
</script>

<p>Random — wider, flatter. The hot loop is still there, but the sampled stacks spread more broadly around it, consistent with the CPU spending more time waiting on the memory subsystem:</p>
<div class="flamegraph-wrap">
  <div id="fg-b761154c9fcbff5118ff07c2f405d7a2" class="flamegraph-canvas"></div>
</div>
<script>
  (function() {
    fetch('\/flamegraphs\/random.json')
      .then(function(r) { return r.json(); })
      .then(function(data) {
        var el = document.getElementById('fg-b761154c9fcbff5118ff07c2f405d7a2');
        new FlameGraph(el, data, { title: 'SumRandom — 8M elements, 100 iterations' });
      });
  })();
</script>

<p><code>perf stat</code> diagnosed the disease. The flame graph shows the hot path around it. Sequential&rsquo;s samples concentrate in one tight column — the loop body runs, the prefetcher feeds it, the pipeline stays full. Random spreads wider — same loop body, but more of the sampled time accumulates in and around that path while the core waits for data. The structure of time, not just the quantity of it.</p>
<p>Three tools, three levels:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Question</th>
          <th>Answer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BDN</td>
          <td>How much slower?</td>
          <td>12.88× (at 1M)</td>
      </tr>
      <tr>
          <td>perf stat</td>
          <td>Why?</td>
          <td>IPC 0.42 vs 1.54 — cache miss stalls</td>
      </tr>
      <tr>
          <td>Flame graph</td>
          <td>Where?</td>
          <td>The hot path around the inner loop</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="level-3--the-prediction">Level 3 — The prediction</h2>
<p>This is where hardware counters do something BDN cannot. They don&rsquo;t just explain the past. They make the future falsifiable.</p>
<p>At one million elements, sequential walks 8 MB of <code>long[]</code>. Random also loads a 4 MB <code>int[]</code> index array, bringing its working set to 12 MB. The L3 cache on this Xeon E5-2697 v2 is 30 MB — everything fits. Random access is slow because it misses L1 and L2 — but it hits L3. L3 latency is ~30 cycles. Bad, but bounded.</p>
<p>The hypothesis: if random access is memory-bound — the aggregate counters showed IPC 0.42 and L1 miss rate 24.90%, diagnosing the <em>mechanism</em> even though they span all dataset sizes<sup id="fnref2:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> — and the current performance depends on L3 absorbing those misses, then exceeding L3 capacity will force misses to DRAM at ~200 cycles. The ratio should jump.</p>
<p>At 8 million elements, the data array alone is 64 MB. With the 32 MB index array, random&rsquo;s working set reaches 96 MB — well beyond the 30 MB L3. At 64 million elements (512 MB data + 256 MB indices), there&rsquo;s no question. Random access now predominantly misses all cache levels and goes to DRAM.</p>
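<p>The working-set arithmetic behind the prediction is small enough to write down. A sketch, assuming the 30 MB L3 figure from this post and illustrative names; it is a capacity check only and ignores everything else competing for the cache:</p>
<div class="highlight"><pre data-lang="csharp"><code>public static class WorkingSet
{
    // 30 MB L3 per socket, as stated in this post (taken here as 30 * 1024 * 1024 bytes).
    public const long L3Bytes = 30L * 1024 * 1024;

    // SumSequential touches only the long[] data array.
    public static long SequentialBytes(long n) =&gt; n * sizeof(long);

    // SumRandom touches the long[] data array plus the int[] index array.
    public static long RandomBytes(long n) =&gt; n * sizeof(long) &#43; n * sizeof(int);

    public static bool FitsInL3(long bytes) =&gt; bytes &lt;= L3Bytes;
}</code></pre></div>
<p>At N = 1,000,000 the random variant touches 12,000,000 bytes and fits. At N = 8,000,000 it touches 96,000,000 bytes and does not, and neither does the 64,000,000-byte sequential array. The boundary falls between the first two dataset sizes, which is where the prediction says the ratio should move.</p>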
<p>This is a falsifiable prediction. Not a statistical extrapolation from benchmark numbers. A deduction from cache architecture, informed by hardware counters that revealed the mechanism. Run it. See what happens.</p>
<div class="highlight"><pre data-lang=""><code>| Method        | N        | Mean           | Error       | StdDev      | Ratio | RatioSD |
|-------------- |--------- |---------------:|------------:|------------:|------:|--------:|
| SumSequential | 1000000  |       561.0 us |     1.85 us |     1.44 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,223.9 us |    12.74 us |    10.63 us | 12.88 |    0.04 |
|               |          |                |             |             |       |         |
| SumSequential | 8000000  |     6,434.3 us |    51.43 us |    48.11 us |  1.00 |    0.01 |
| SumRandom     | 8000000  |   125,635.5 us | 2,319.67 us | 2,056.33 us | 19.53 |    0.34 |
|               |          |                |             |             |       |         |
| SumSequential | 64000000 |    82,974.9 us |   933.52 us |   728.83 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,613,935.5 us | 3,277.61 us | 2,736.96 us | 19.45 |    0.17 |</code></pre></div>
<div class="chart-container">
  <canvas id="chart-f454392f3be961dee1f24f4af208edff"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-f454392f3be961dee1f24f4af208edff').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['1M (8 MB data)\nfits L3', '8M (64 MB data)\nexceeds L3', '64M (512 MB data)\nDRAM only'],
    datasets: [
      {
        label: 'Sequential',
        data: [0.561, 6.434, 82.975],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'Random',
        data: [7.224, 125.636, 1613.936],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Sequential vs Random — scaling with dataset size' },
      subtitle: { display: true, text: 'Random: ~13× at 1M → ~20× at 8M (this run). Data array sizes shown; random also loads int[] indices.' },
      legend: { display: true }
    },
    scales: {
      y: {
        type: 'logarithmic',
        title: { display: true, text: 'Time (ms) — log scale' },
        min: 0.1,
        max: 10000
      }
    }
  }
}
);
  })();
</script>

<p><strong>Sequential scales smoothly but super-linearly.</strong> 8× more data yields ~11.5× more time; 64× more data yields ~148× more time.<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup> The extra factor comes from the L3 boundary: at 1M (8 MB), sequential reads hit L3 at ~30 cycle latency. At 8M+ (64 MB+), the prefetcher must pull from DRAM (~200 cycles). Even within the DRAM-resident range (8M to 64M), scaling is ~12.9× for 8× data — still super-linear, likely due to TLB pressure at large working sets. The prefetcher hides most of the latency increase — but not all of it.</p>
<p><strong>Random hits a cliff.</strong> In this run, the ratio jumps from ~13× at 1M to ~19.5× at 8M — about a 50% degradation on this dual-socket NUMA system.<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup> Beyond that, it stays in the same rough range rather than snapping back. The cliff happened between 1M and 8M, exactly where the working set crossed the L3 boundary. The exact ratios will differ on your hardware — the cliff at the L3 boundary won&rsquo;t.</p>
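<p>Both observations reduce to two divisions over the Mean column of the table above. A short sketch with illustrative names:</p>
<div class="highlight"><pre data-lang="csharp"><code>public static class Scaling
{
    // Time growth divided by data growth; 1.0 would be perfectly linear scaling.
    public static double SuperLinearFactor(double timeSmall, double timeLarge, double dataGrowth) =&gt;
        (timeLarge / timeSmall) / dataGrowth;

    // Random-to-sequential ratio at one dataset size.
    public static double SlowdownRatio(double sequentialTime, double randomTime) =&gt;
        randomTime / sequentialTime;
}</code></pre></div>
<p>With the sequential means, <code>SuperLinearFactor(561.0, 6434.3, 8)</code> is about 1.43 and <code>SuperLinearFactor(6434.3, 82974.9, 8)</code> is about 1.61. With both variants, <code>SlowdownRatio(561.0, 7223.9)</code> is about 12.88 and <code>SlowdownRatio(6434.3, 125635.5)</code> is about 19.53, a relative increase of roughly 1.5×.</p>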
<div class="chart-container">
  <canvas id="chart-d4fc857cdb14550dcc03b6ed24e25bad"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-d4fc857cdb14550dcc03b6ed24e25bad').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['1M (8 MB)', '8M (64 MB)', '64M (512 MB)'],
    datasets: [
      {
        label: 'Ratio (Random / Sequential)',
        data: [12.88, 19.53, 19.45],
        backgroundColor: ['#a6e3a1', '#fab387', '#f38ba8'],
        borderColor: ['#a6e3a1', '#fab387', '#f38ba8'],
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'The ratio cliff — where L3 runs out' },
      subtitle: { display: true, text: '~13× → ~20× when working set exceeds 30 MB L3 cache (exact ratios shift with NUMA topology)' },
      legend: { display: false }
    },
    scales: {
      y: {
        title: { display: true, text: 'Ratio (Random / Sequential)' },
        min: 0,
        max: 25
      }
    }
  }
}
);
  })();
</script>

<p>The prediction survived the test — on this hardware, on this run.</p>
<p>Popper, <em>Logik der Forschung</em> (1934)<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup>: a falsifiable prediction distinguishes science from storytelling. &ldquo;Random access is memory-bound (low IPC, high L1 miss rate). The working set fits L3 at 1M. At 8M, it won&rsquo;t. The ratio will jump.&rdquo; Run it. In this run, the ratio jumps from ~13× to ~19.5×. The exact numbers are unstable — dual-socket NUMA, thread migration, prefetcher heuristics all shift them.<sup id="fnref1:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup> The mechanism isn&rsquo;t. The theory wasn&rsquo;t adjusted after the fact. It was stated before the data, derived from the cache hierarchy, and the data confirmed the <em>shape</em> — a cliff at the L3 boundary. That&rsquo;s not extrapolation from a benchmark number. That&rsquo;s deduction from architecture.</p>
<p>Through the first four posts, every tool revealed its distortion after the damage was done. Post-factum. Reactive. Hardware counters are the first tool in this series that generates a falsifiable hypothesis <em>before</em> the benchmark runs. Not a better explanation of the past — a testable claim about the future.</p>
<hr>
<h2 id="numa--where-the-numbers-shift-and-the-shape-doesnt">NUMA — where the numbers shift and the shape doesn&rsquo;t</h2>
<p>This machine has two sockets. Two Xeon E5-2697 v2, each with its own 30 MB L3 cache, its own memory controller, its own DRAM. When a thread runs on socket 0 and accesses memory allocated on socket 1, the load crosses the QPI interconnect — ~40 ns extra latency. When the OS migrates a thread between sockets mid-benchmark, the prefetcher resets, the L1/L2 are cold, and the next few thousand loads hit DRAM instead of cache.</p>
<p>BDN doesn&rsquo;t know which socket it&rsquo;s running on. It reports a single number. On dual-socket NUMA, that number carries noise from topology that has nothing to do with the code being measured.</p>
<p>Three runs: unpinned (OS schedules freely), pinned to socket 0 (<code>taskset -c 0-11</code>), pinned to socket 1 (<code>taskset -c 12-23</code>). Same binary, same data, same benchmark. Different answers.</p>
<div class="highlight"><pre data-lang=""><code>Unpinned (OS schedules freely):
| Method        | N        | Mean           | Error        | StdDev       | Ratio | RatioSD |
|-------------- |--------- |---------------:|-------------:|-------------:|------:|--------:|
| SumSequential | 1000000  |       555.0 us |      1.26 us |      1.12 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,577.9 us |    181.88 us |    151.88 us | 13.65 |    0.26 |
| SumSequential | 8000000  |     9,146.4 us |  1,431.15 us |  1,338.70 us |  1.02 |    0.21 |
| SumRandom     | 8000000  |   127,720.5 us |  1,591.92 us |  1,329.32 us | 14.25 |    2.03 |
| SumSequential | 64000000 |    65,306.5 us |    522.56 us |    488.81 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,665,816.7 us | 35,695.93 us | 33,389.99 us | 25.51 |    0.53 |

Pinned to socket 0 (taskset -c 0-11):
| Method        | N        | Mean           | Error        | StdDev       | Ratio | RatioSD |
|-------------- |--------- |---------------:|-------------:|-------------:|------:|--------:|
| SumSequential | 1000000  |       564.9 us |      6.22 us |      5.51 us |  1.00 |    0.01 |
| SumRandom     | 1000000  |     8,199.0 us |    596.29 us |    557.77 us | 14.51 |    0.97 |
| SumSequential | 8000000  |     8,942.9 us |  1,201.30 us |  1,123.70 us |  1.02 |    0.18 |
| SumRandom     | 8000000  |   125,272.7 us |  2,566.78 us |  2,400.96 us | 14.24 |    1.93 |
| SumSequential | 64000000 |    67,722.2 us |    574.32 us |    479.59 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,675,995.7 us | 14,298.58 us | 11,939.97 us | 24.75 |    0.24 |

Pinned to socket 1 (taskset -c 12-23):
| Method        | N        | Mean           | Error       | StdDev      | Ratio | RatioSD |
|-------------- |--------- |---------------:|------------:|------------:|------:|--------:|
| SumSequential | 1000000  |       560.5 us |     1.47 us |     1.30 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,330.0 us |    40.50 us |    37.89 us | 13.08 |    0.07 |
| SumSequential | 8000000  |     6,961.7 us |   147.33 us |   137.82 us |  1.00 |    0.03 |
| SumRandom     | 8000000  |   124,440.5 us | 2,756.00 us | 2,577.97 us | 17.88 |    0.50 |
| SumSequential | 64000000 |    56,263.4 us |   685.72 us |   641.42 us |  1.00 |    0.02 |
| SumRandom     | 64000000 | 1,650,334.5 us | 4,652.78 us | 3,885.28 us | 29.34 |    0.33 |</code></pre></div>
<div class="chart-container">
  <canvas id="chart-df02df7162ee4c354c4ab6183399391c"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-df02df7162ee4c354c4ab6183399391c').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['1M\nUnpinned', '1M\nSocket 0', '1M\nSocket 1', '8M\nUnpinned', '8M\nSocket 0', '8M\nSocket 1', '64M\nUnpinned', '64M\nSocket 0', '64M\nSocket 1'],
    datasets: [
      {
        label: 'Ratio (Random / Sequential)',
        data: [13.65, 14.51, 13.08, 14.25, 14.24, 17.88, 25.51, 24.75, 29.34],
        backgroundColor: ['#89b4fa', '#89b4fa', '#89b4fa', '#fab387', '#fab387', '#fab387', '#f38ba8', '#f38ba8', '#f38ba8'],
        borderColor: ['#89b4fa', '#89b4fa', '#89b4fa', '#fab387', '#fab387', '#fab387', '#f38ba8', '#f38ba8', '#f38ba8'],
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'The ratio cliff across NUMA topologies' },
      subtitle: { display: true, text: 'Same code, same data, different thread placement — the cliff is always there' },
      legend: { display: false }
    },
    scales: {
      y: {
        title: { display: true, text: 'Ratio (Random / Sequential)' },
        min: 0,
        max: 35
      }
    }
  }
}
);
  })();
</script>

<p>The ratio at 1M is stable: 13.08–14.51×. Everything fits in L3 regardless of socket — NUMA doesn&rsquo;t matter when the prefetcher keeps the pipeline full and the working set is cache-resident.</p>
<p>At 8M, the topology starts to show. Socket 1 reports 17.88× while unpinned and socket 0 hover around 14.2×. Sequential at 8M diverges the most: unpinned reports 9,146 us (BDN flagged a bimodal distribution, consistent with thread migration mid-run), socket 0 reports 8,943 us, socket 1 reports 6,962 us. That is a 31% spread on the same sequential sum, same data, same binary. The difference is where the thread ran and whether it stayed there.</p>
<p>At 64M, the spread widens further: 24.75× (socket 0) to 29.34× (socket 1). An 18% swing in the ratio from thread placement alone. Random access times are close (~1.65–1.68s) — DRAM latency dominates and both sockets pay roughly the same price. Sequential is where the sockets diverge: socket 1 runs sequential 17% faster than socket 0 (56,263 vs 67,722 us), likely because socket 1&rsquo;s memory controller has less contention from OS and runtime threads that default to socket 0.</p>
<p>The exact ratios from the earlier section (12.88×, 19.53×, 19.45×) came from yet another run. They don&rsquo;t match any of these three. That&rsquo;s the point. On some runs the cliff at 8M is sharp (socket 1: 13.08× → 17.88×); on others it&rsquo;s muted (unpinned: 13.65× → 14.25×, with the full impact deferred to 64M where DRAM dominates regardless of topology). Four runs, four sets of numbers, one shape: a cliff where the working set crosses the L3 boundary. Whether it lands at 8M or spreads across 8M–64M depends on thread placement and memory allocation, not on the code.</p>
<p><code>taskset</code> and <code>numactl</code> aren&rsquo;t exotic tools. They&rsquo;re part of the measurement environment — the same environment that FTF-2 warned you about. On single-socket machines, none of this matters. On NUMA, it&rsquo;s the difference between a 24.75× and a 29.34× — same code, same data, same question, different answer depending on which socket the OS picked.</p>
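<p>The same CPU pinning can be requested from inside a .NET process. This is a hedged sketch, not what the runs above used (they used <code>taskset</code>): it relies on <code>Process.ProcessorAffinity</code>, which .NET supports on Windows and Linux, and the helper names are illustrative. Whether CPUs 0-11 actually belong to socket 0 depends on how the machine numbers its cores, and the mask constrains CPU placement only, not where memory is allocated.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.Diagnostics;

public static class Pinning
{
    // Builds an affinity bitmask covering logical CPUs first..last (inclusive).
    // Valid for CPU numbers 0..62 with a 64-bit mask.
    public static long MaskForCpuRange(int first, int last)
    {
        long mask = 0;
        for (int cpu = first; cpu &lt;= last; cpu&#43;&#43;)
            mask |= 1L &lt;&lt; cpu;
        return mask;
    }

    // Restricts the current process to the given CPU range.
    public static void PinCurrentProcess(int first, int last)
    {
        using var self = Process.GetCurrentProcess();
        self.ProcessorAffinity = (IntPtr)MaskForCpuRange(first, last);
    }
}</code></pre></div>
<p><code>MaskForCpuRange(0, 11)</code> yields <code>0xFFF</code> and <code>MaskForCpuRange(12, 23)</code> yields <code>0xFFF000</code>, matching the two <code>taskset</code> ranges used above.</p>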
<hr>
<h2 id="the-hardware-checklist">The hardware checklist</h2>
<p>Five questions hardware counters answer that benchmarks cannot:</p>
<table>
  <thead>
      <tr>
          <th>Question</th>
          <th>Counter</th>
          <th>What to look for</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Is my code cache-efficient?</td>
          <td><code>L1-dcache-load-misses</code>, <code>LLC-load-misses</code></td>
          <td>Miss rate above ~10% on this hardware suggests an access pattern worth investigating</td>
      </tr>
      <tr>
          <td>Is the CPU pipeline efficient?</td>
          <td><code>instructions</code> / <code>cycles</code> (IPC)</td>
          <td>IPC below ~1.0 on this hardware suggests stalling on memory or branch misses</td>
      </tr>
      <tr>
          <td>Is branch prediction working?</td>
          <td><code>branch-misses</code> / <code>branch-instructions</code></td>
          <td>Miss rate above ~5% on this hardware suggests unpredictable branches</td>
      </tr>
      <tr>
          <td>Will this scale with data size?</td>
          <td>Compare cache miss rates at small vs large N</td>
          <td>Rising miss rate as N grows points toward a performance cliff</td>
      </tr>
      <tr>
          <td>Where is time spent?</td>
          <td><code>perf record</code> + flame graph</td>
          <td>Wide stacks indicate distributed stalls; narrow stacks indicate a hot loop</td>
      </tr>
  </tbody>
</table>
<p><em>These thresholds are priors, not axioms — useful starting points for investigation on this hardware, unverified on yours.</em></p>
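<p>The first three rows of the checklist can be turned into a rough triage function. A sketch only: the thresholds are the priors from the table, and the names and messages are illustrative.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System.Collections.Generic;

public static class CounterTriage
{
    // Applies the checklist priors (IPC below 1.0, cache miss rate above 10%,
    // branch miss rate above 5%) and returns one message per triggered check.
    public static List&lt;string&gt; Flags(double ipc, double cacheMissPct, double branchMissPct)
    {
        var flags = new List&lt;string&gt;();
        if (ipc &lt; 1.0)
            flags.Add(&#34;low IPC: likely stalling on memory or branch misses&#34;);
        if (cacheMissPct &gt; 10.0)
            flags.Add(&#34;cache miss rate: access pattern worth investigating&#34;);
        if (branchMissPct &gt; 5.0)
            flags.Add(&#34;branch miss rate: unpredictable branches&#34;);
        return flags;
    }
}</code></pre></div>
<p>Fed the aggregate figures from this post, the random variant (IPC 0.42, L1 miss rate 24.90%, branch miss rate 2.60%) triggers the IPC and cache checks. The sequential variant (1.54, 11.64%, 0.87%) triggers only the cache check, and only just, which shows why a threshold is a prompt to look closer rather than a verdict.</p>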
<h3 id="when-to-use-what">When to use what</h3>
<p><strong>BDN suffices</strong> most of the time:</p>
<ul>
<li>You&rsquo;re comparing two implementations and the ratio is clear (&gt;1.5× or &lt;0.7×)</li>
<li>The result is stable across runs</li>
<li>You&rsquo;re making a ship/no-ship decision on a known bottleneck</li>
</ul>
<p><strong>Reach for perf stat</strong> when:</p>
<ul>
<li>Two variants show similar BDN times but you suspect different underlying behavior</li>
<li>The ratio changes unexpectedly across dataset sizes</li>
<li>You need to understand <em>why</em> something is slow, not just <em>how much</em></li>
<li>You want to predict scaling behavior before running the full benchmark suite</li>
</ul>
<p><strong>Use flame graphs</strong> when:</p>
<ul>
<li><code>perf stat</code> says &ldquo;cache misses&rdquo; but you don&rsquo;t know which access pattern causes them</li>
<li>A complex function is slow and you need to identify the hot path</li>
<li>You&rsquo;re profiling an entire application, not an isolated benchmark</li>
</ul>
<hr>
<h3 id="run-it-yourself">Run it yourself</h3>
<div class="highlight"><pre data-lang="bash"><code>git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/hardware-counters

# All benchmarks — 3 dataset sizes (~2 min)
dotnet run -c Release -- --filter &#39;*&#39;

# perf stat comparison (Linux only) — full event set matching the blog post
perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter &#39;*Sequential*&#39;

perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter &#39;*Random*&#39;

# Or use the included scripts
./Scripts/perf-stat.sh
./Scripts/run-scaling.sh</code></pre></div>
<hr>
<h2 id="benchmark-environment">Benchmark environment</h2>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CPU</td>
          <td>2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)</td>
      </tr>
      <tr>
          <td>L3 Cache</td>
          <td>30 MB per socket</td>
      </tr>
      <tr>
          <td>RAM</td>
          <td>~115 GB DDR3-1866 (quad-channel per socket)</td>
      </tr>
      <tr>
          <td>OS</td>
          <td>Fedora Linux 42 (kernel 6.17)</td>
      </tr>
      <tr>
          <td>Runtime</td>
          <td>.NET 9.0.11 (RyuJIT AVX)</td>
      </tr>
      <tr>
          <td>SDK</td>
          <td>.NET SDK 10.0.102 (targets net9.0 — SDK 10 builds 9.0 apps)</td>
      </tr>
      <tr>
          <td>BenchmarkDotNet</td>
          <td>v0.14.0</td>
      </tr>
      <tr>
          <td>perf</td>
          <td>v6.18.6, <code>perf_event_paranoid=2</code></td>
      </tr>
      <tr>
          <td>GC</td>
          <td>Server GC, Concurrent (Server GC is opt-in: BDN uses it only when the project or job configuration enables it)</td>
      </tr>
  </tbody>
</table>
<p><strong>Limitations:</strong> Different machine, different numbers. Dual-socket NUMA — thread migration can widen variance. <code>perf stat</code> numbers are aggregated over the full BDN process (warmup, pilot, actual iterations at all three dataset sizes), not isolated per-benchmark. Absolute counter values include BDN overhead; ratios between variants are meaningful. The L3 cache boundary (30 MB) is specific to Ivy Bridge-EP — your cache hierarchy will produce a cliff at a different dataset size. The IPC values reflect aggregate process behavior, not just the hot loop; isolated hot-loop IPC would be higher for sequential (~3.0+) and similar for random (~0.3-0.5).</p>
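<p>One way around the aggregation limit is to run each access pattern as its own small process under <code>perf stat</code>. The following is a minimal sketch, not the companion code: the class name <code>StandaloneSum</code>, the command-line arguments, the seed, and the <code>int[]</code> permutation are all assumptions made for illustration.</p>

```csharp
using System;

// Hypothetical standalone harness (not part of the companion code).
// Arguments: mode ("seq" or "rand"), element count, number of passes.
public static class StandaloneSum
{
    // Sequential walk: a pattern the hardware prefetcher can follow.
    public static long SumSequential(long[] data)
    {
        long sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }

    // Indirect walk: visits every element once, in permutation order.
    public static long SumRandom(long[] data, int[] order)
    {
        long sum = 0;
        for (int i = 0; i < order.Length; i++)
            sum += data[order[i]];
        return sum;
    }

    // Fisher-Yates shuffle of 0..n-1 with a fixed seed so runs are repeatable.
    public static int[] Permutation(int n, int seed)
    {
        var order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        var rng = new Random(seed);
        for (int i = n - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (order[i], order[j]) = (order[j], order[i]);
        }
        return order;
    }

    public static void Main(string[] args)
    {
        string mode = args.Length > 0 ? args[0] : "seq";
        int n = args.Length > 1 ? int.Parse(args[1]) : 1_000_000;
        int passes = args.Length > 2 ? int.Parse(args[2]) : 50;

        var data = new long[n];
        for (int i = 0; i < n; i++) data[i] = i;
        int[] order = mode == "rand" ? Permutation(n, 42) : Array.Empty<int>();

        long total = 0;
        for (int p = 0; p < passes; p++)
            total += mode == "rand" ? SumRandom(data, order) : SumSequential(data);

        // Print the result so the JIT cannot discard the loops.
        Console.WriteLine($"{mode} n={n} passes={passes} total={total}");
    }
}
```

<p>Run one mode and one dataset size per <code>perf stat</code> invocation, with the same event list as above. The counts still include runtime startup, JIT compilation, and array setup, so choose enough passes for the loop to dominate.</p>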
<hr>
<h2 id="piercing-through">Piercing through</h2>
<p>Five posts. Five layers.</p>
<p>Design — what you measure. Environment — what surrounds the measurement. Data collection — how you gather it. Interpretation — what you do with the numbers. Cause — why the numbers are what they are.</p>
<p>Through the first four posts, the image moved steadily away from reality. Benchmark design distorted it. The environment masked the distortion. Coordinated omission replaced absent data with comfortable silence. Statistical interpretation severed the last thread connecting numbers to the thing they claimed to represent. Baudrillard&rsquo;s phases of the simulacrum, played out in measurement: the image that distorts reality, the image that masks its absence, the image that bears no relation to reality at all.</p>
<p><code>perf stat</code> pierces through. It doesn&rsquo;t build another image. It reads registers that the silicon increments each time an event occurs: a cache miss, a branch mispredict, an instruction retired. Not a model of what happened. Not an abstraction of what happened. What happened, counted in hardware, whether anyone is watching or not. The first tool in five posts that measures the territory, not the map.</p>
<p>The series started with a lie — 27.2M ops/sec and three contradictory verdicts from the same optimization. It ends not with an answer but with a framework. Five layers, five dimensions. You don&rsquo;t need to measure all of them every time. You need to know they exist, and when to reach for which one.</p>
<p>You have the tools. You know when to reach for which one.</p>
<hr>
<h2 id="further-reading">Further reading</h2>
<ul>
<li>Brendan Gregg, <em>Systems Performance</em>, 2nd ed. (Addison-Wesley, 2020) — chapters 6 (CPU) and 7 (Memory). The definitive reference for PMU counters, <code>perf</code>, and flame graphs.<sup id="fnref:9"><a href="#fn:9" class="footnote-ref" role="doc-noteref">9</a></sup></li>
<li>Brendan Gregg, <a href="https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html">CPU Flame Graphs</a> (2016) — the original methodology for flame graph generation and interpretation.<sup id="fnref1:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></li>
<li>Ahmad Yasin, <a href="https://ieeexplore.ieee.org/document/6844459">A Top-Down Method for Performance Analysis and Counters Architecture</a> (ISPASS 2014) — the framework Intel uses: Frontend Bound, Backend Bound, Bad Speculation, Retiring.<sup id="fnref:10"><a href="#fn:10" class="footnote-ref" role="doc-noteref">10</a></sup></li>
<li>Ulrich Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a> (2007) — cache hierarchy, prefetch, TLB. The foundation for understanding cache miss counters.<sup id="fnref:11"><a href="#fn:11" class="footnote-ref" role="doc-noteref">11</a></sup></li>
<li>Agner Fog, <a href="https://www.agner.org/optimize/microarchitecture.pdf">Microarchitecture of Intel, AMD and VIA CPUs</a> (2025, continuously updated) — pipeline, execution ports, cache latencies at the microarchitecture level.<sup id="fnref:12"><a href="#fn:12" class="footnote-ref" role="doc-noteref">12</a></sup></li>
<li>Denis Bakhvalov, <a href="https://book.easyperf.net/perf_book"><em>Performance Analysis and Tuning on Modern CPUs</em></a> (easyperf.net, 2020) — practical guide to PMU, <code>perf</code>, and top-down analysis.<sup id="fnref:13"><a href="#fn:13" class="footnote-ref" role="doc-noteref">13</a></sup></li>
<li>Andi Kleen, <a href="https://github.com/andikleen/pmu-tools">pmu-tools / toplev</a> — automated top-down microarchitecture analysis using hardware counters.<sup id="fnref:14"><a href="#fn:14" class="footnote-ref" role="doc-noteref">14</a></sup></li>
<li>perf wiki, <a href="https://perfwiki.github.io/main/tutorial/">Tutorial</a> — official documentation for Linux <code>perf</code> tools.<sup id="fnref:15"><a href="#fn:15" class="footnote-ref" role="doc-noteref">15</a></sup></li>
<li>BenchmarkDotNet, <a href="https://benchmarkdotnet.org/articles/configs/diagnosers.html">Diagnosers</a> — BDN&rsquo;s built-in hardware counter collection via ETW (Windows only).<sup id="fnref:16"><a href="#fn:16" class="footnote-ref" role="doc-noteref">16</a></sup></li>
<li>Intel, <a href="https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html"><em>64 and IA-32 Architectures Optimization Reference Manual</em></a> (2025) — chapter 3: top-down analysis, performance counter event codes.<sup id="fnref:17"><a href="#fn:17" class="footnote-ref" role="doc-noteref">17</a></sup></li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Ivy Bridge-EP exposes 3 fixed and 4 general-purpose hardware counter registers per logical CPU when Hyper-Threading is enabled, as it is on this machine (8 general-purpose per core when it is disabled). When you request more events than there are registers (as in the 10-event <code>perf stat</code> command above), <code>perf</code> time-multiplexes: it rotates events through the available registers and scales each raw count by the ratio of time enabled to time running. The percentages shown in the output (e.g., <code>(39.98%)</code>) indicate what fraction of runtime each counter was actually active. This introduces estimation error, but for long-running workloads like BDN benchmarks the error is typically small.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>The <code>perf stat</code> numbers shown are aggregated over the entire BDN process, which includes warmup, pilot runs, and actual iterations at all three dataset sizes. This means the absolute values include BDN framework overhead. The <em>ratios</em> between Sequential and Random are still meaningful because both variants carry the same overhead. For isolated hot-loop counters, use BDN&rsquo;s <code>[HardwareCounters]</code> diagnoser (Windows only, via ETW) or run <code>perf stat</code> on a standalone loop outside BDN.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref2:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>The LLC miss rates appear counterintuitive: sequential (53%) is higher than random (30%). This is an artifact of aggregation: sequential runs ~3× more total iterations (being much faster per-op), and the 64M dataset (512 MB, far exceeding 30 MB L3) dominates the sequential counter totals. Random access, being slower, runs fewer iterations, so its LLC counters are weighted more toward the smaller (L3-resident) datasets. For a fair LLC comparison, you would need per-dataset-size <code>perf stat</code> runs, which means running the benchmarks outside BDN on Linux or using BDN&rsquo;s <code>[HardwareCounters]</code> diagnoser on Windows. The IPC and L1 miss rate comparisons are more robust to this aggregation effect.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>The Intel hardware prefetcher on Ivy Bridge-EP detects sequential and strided access patterns and prefetches cache lines into L1/L2 before the load instruction executes. For a sequential <code>long[]</code> walk with 8-byte stride, the prefetcher can stay ahead of the loop, effectively hiding memory latency. Random access has no predictable stride for the prefetcher to follow, so once the working set outgrows the cache, nearly every load is a miss that the CPU must wait for.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>Brendan Gregg, <a href="https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html">CPU Flame Graphs</a> (2016). Flame graphs collapse stack traces into a single visualization where width = time. The x-axis is alphabetical (not temporal) — a common source of misinterpretation.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:6">
<p>Sequential access is not perfectly linear across this range. At 1M (8 MB), data fits in L3 (~30 cycle latency). At 8M (64 MB), it exceeds L3 and every cache line comes from DRAM (~200 cycle latency). The prefetcher hides most of this increase by issuing DRAM requests ahead of the loop, but the transition from L3-resident to DRAM-resident adds a constant factor. Within the DRAM-resident range (8M → 64M), scaling is closer to linear: 8× more data → ~12.9× more time. The remaining super-linearity likely reflects TLB pressure and NUMA effects at 512 MB working set on this dual-socket system.&#160;<a href="#fnref:6" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:7">
<p>Dual-socket NUMA adds variance to these ratios. Thread migration between sockets, local vs remote memory access, and OS scheduling decisions can shift the ratio by 1-2× between runs. Pinning to a single socket with <code>taskset</code> or <code>numactl</code> reduces this. The shape of the curve — cliff at the L3 boundary — is stable; the exact height of the cliff is not.&#160;<a href="#fnref:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:8">
<p>Karl Popper, <em>Logik der Forschung</em> (1934), published in English as <em>The Logic of Scientific Discovery</em> (Hutchinson, 1959). The demarcation criterion — a theory is scientific if and only if it is falsifiable — applies directly: &ldquo;cache miss rate is high, working set fits L3, exceeding L3 will degrade the ratio&rdquo; is falsifiable. &ldquo;Random access is slow because it&rsquo;s random&rdquo; is not.&#160;<a href="#fnref:8" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:9">
<p>Brendan Gregg, <em>Systems Performance</em>, 2nd ed. (Addison-Wesley, 2020). Chapters 6 and 7 cover CPU and memory performance analysis with <code>perf</code>. The methodology sections — USE method, TSA method — apply directly to interpreting the IPC and cache miss data in this post.&#160;<a href="#fnref:9" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:10">
<p>Ahmad Yasin, <a href="https://ieeexplore.ieee.org/document/6844459">A Top-Down Method for Performance Analysis and Counters Architecture</a>, ISPASS 2014. Classifies every cycle into four categories: Frontend Bound, Backend Bound (memory/core), Bad Speculation, Retiring. The random access pattern in this post is Backend Bound (memory) — the CPU is ready to execute but waiting for data.&#160;<a href="#fnref:10" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:11">
<p>Ulrich Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a> (2007). Sections 3 (CPU caches) and 6 (programming for performance) explain why sequential access is fast (hardware prefetch, spatial locality) and random access is slow (no predictable stride, no prefetch).&#160;<a href="#fnref:11" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:12">
<p>Agner Fog, <a href="https://www.agner.org/optimize/microarchitecture.pdf">Microarchitecture of Intel, AMD and VIA CPUs</a> (2025, continuously updated). Table of cache latencies: L1 ~4 cycles, L2 ~12 cycles, L3 ~30 cycles, DRAM ~200 cycles on Ivy Bridge-EP. These latencies explain the ~13× ratio (L3 hits) vs ~20–29× ratio (DRAM) observed across runs in the benchmark results.&#160;<a href="#fnref:12" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:13">
<p>Denis Bakhvalov, <a href="https://book.easyperf.net/perf_book"><em>Performance Analysis and Tuning on Modern CPUs</em></a> (easyperf.net, 2020). Chapters on PMU counters and <code>perf</code> provide practical workflows for exactly the kind of analysis shown in this post — from <code>perf stat</code> to diagnosis.&#160;<a href="#fnref:13" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:14">
<p>Andi Kleen, <a href="https://github.com/andikleen/pmu-tools">pmu-tools / toplev</a>. Automates the Yasin top-down analysis method. For Intel CPUs, <code>toplev</code> can classify bottlenecks without manual counter selection.&#160;<a href="#fnref:14" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:15">
<p>perf wiki, <a href="https://perfwiki.github.io/main/tutorial/">Tutorial</a>. Documents <code>perf stat</code> (counter aggregation), <code>perf record</code> (sampling), <code>perf report</code> (analysis). The <code>perf_event_paranoid</code> sysctl controls access: <code>2</code> lets unprivileged users count user-space events in their own processes, but not kernel events.&#160;<a href="#fnref:15" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:16">
<p>BenchmarkDotNet, <a href="https://benchmarkdotnet.org/articles/configs/diagnosers.html">Diagnosers</a>. BDN can collect hardware counters per-benchmark via ETW on Windows. The <code>[HardwareCounters]</code> attribute enables collection of specific counters (e.g., <code>InstructionRetired</code>, <code>CacheMisses</code>). On Linux, BDN does not natively collect hardware counters — use <code>perf stat</code> externally as shown in this post.&#160;<a href="#fnref:16" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:17">
<p>Intel, <a href="https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html"><em>64 and IA-32 Architectures Optimization Reference Manual</em></a> (2025), chapter 3. Defines the performance monitoring events and their architectural guarantees. <code>cycles</code> and <code>instructions</code> are the safest architectural counters — widely available on modern x86 CPUs; verify with <code>perf list</code>. <code>cache-references</code> and <code>cache-misses</code> are also architectural in the Intel PMU spec, but their mapping to physical events varies by microarchitecture (e.g., they may count LLC references on one µarch and L2 on another). On non-Intel CPUs, check <code>perf list</code> for available mappings.&#160;<a href="#fnref:17" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>]]></content:encoded>
    </item>
    <item>
      <title>First Things First: Statistics That Matter</title>
      <link>https://0x3f.blog/posts/first-things-first-statistics-that-matter/</link>
      <pubDate>Fri, 06 Mar 2026 18:00:00 +0100</pubDate>
      <guid>https://0x3f.blog/posts/first-things-first-statistics-that-matter/</guid>
      <description>3% slower — or noise? The error bar stretches 10%. Confidence intervals, effect size, and micro vs macro: the three layers between a number and a conclusion.</description>
      <content:encoded><![CDATA[<h2 id="3-slower-ship-it">3% slower. Ship it.</h2>
<p>Two filter variants over 20 million integers. Five benchmark iterations. FilterTernary: 26.11 ms. FilterBranch: 25.30 ms. The ternary is 3% slower. PR description writes itself. Merge. Deploy.</p>
<p>Next day, rollback. Regression in production — on hardware where the difference vanishes, on data where it reverses.</p>
<p>Design fixed. Environment defended. Data collected honestly. The benchmark is solid. The number is real. The interpretation is not.</p>
<p>All code in this post: clone, build, run. Numbers below were measured on dual Xeon E5-2697 v2 using BenchmarkDotNet v0.14.0, pinned to a single NUMA node — run the companion code on your hardware for your own results. Different machine, different numbers.</p>
<p><em>Convention: charts use milliseconds unless otherwise noted; tables reproduce BenchmarkDotNet output. BDN&rsquo;s Error column is the half-width of the 99.9% confidence interval.</em></p>
<hr>
<h2 id="the-number-is-the-answer">The number is the answer</h2>
<div class="highlight"><pre data-lang="csharp"><code>[Benchmark(Baseline = true)]
public long FilterBranch()
{
    long sum = 0;
    int[] data = _data;
    for (int i = 0; i &lt; data.Length; i&#43;&#43;)
    {
        if (data[i] &gt; 0)
            sum &#43;= data[i];
    }
    return sum;
}

[Benchmark]
public long FilterTernary()
{
    long sum = 0;
    int[] data = _data;
    for (int i = 0; i &lt; data.Length; i&#43;&#43;)
    {
        int v = data[i];
        sum &#43;= v &gt; 0 ? v : 0;
    }
    return sum;
}</code></pre></div>
<p><small>Two filter variants over 20M integers (~95% positive). Full source in companion code.</small></p>
<p>Every benchmarking tutorial ends here: compare two means, pick the lower one. FilterTernary = 26.11 ms, FilterBranch = 25.30 ms — 3% difference. The ternary loses.</p>
<p>How many times did you run it?</p>
<hr>
<h2 id="layer-1--confidence-intervals-eat-your-win">Layer 1 — Confidence intervals eat your win</h2>
<p>BenchmarkDotNet doesn&rsquo;t just give you a mean. It gives you Mean ± Error — where Error is the half-width of the 99.9% confidence interval, computed using a Student&rsquo;s t-distribution with n-1 degrees of freedom.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>The 5-iteration run — the one that said &ldquo;3% slower&rdquo;:</p>
<div class="highlight"><pre data-lang=""><code>| Method        | N        | Mean     | Error    | StdDev   | Ratio | RatioSD |
|-------------- |--------- |---------:|---------:|---------:|------:|--------:|
| FilterBranch  | 20000000 | 25.30 ms | 0.408 ms | 0.063 ms |  1.00 |    0.00 |
| FilterTernary | 20000000 | 26.11 ms | 2.624 ms | 0.681 ms |  1.03 |    0.02 |</code></pre></div>
<p>The 99.9% CI for FilterBranch: 25.30 ± 0.408 ms → <strong>[24.89, 25.71]</strong>. For FilterTernary: 26.11 ± 2.624 ms → <strong>[23.49, 28.73]</strong>. FilterBranch&rsquo;s entire range sits inside FilterTernary&rsquo;s confidence interval. The &ldquo;3% slower&rdquo; could be a scheduling hiccup, and five iterations cannot tell you whether it is.</p>
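<p>The interval arithmetic is easy to script. Here is a minimal sketch, assuming you copy Mean and Error from BDN&rsquo;s summary table by hand; <code>CiCheck</code> and its method names are hypothetical helpers, not part of BenchmarkDotNet or the companion code.</p>

```csharp
using System;

public static class CiCheck
{
    // Interval implied by BDN's Mean and Error columns (Error = CI half-width).
    public static (double Lo, double Hi) Interval(double mean, double error)
        => (mean - error, mean + error);

    // Two closed intervals overlap when each one starts before the other ends.
    public static bool Overlaps((double Lo, double Hi) a, (double Lo, double Hi) b)
        => a.Lo <= b.Hi && b.Lo <= a.Hi;
}
```

<p>For the 5-iteration run, <code>CiCheck.Overlaps(CiCheck.Interval(25.30, 0.408), CiCheck.Interval(26.11, 2.624))</code> returns <code>true</code>: the data cannot separate the two variants.</p>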
<p>You know this from <a href="/posts/first-things-first-why-benchmarks-lie/">Part 1</a>. Overlapping CIs, unresolved difference. Run more iterations.</p>
<p>Twenty iterations:</p>
<div class="highlight"><pre data-lang=""><code>| Method        | N        | Mean     | Error    | StdDev   | Ratio |
|-------------- |--------- |---------:|---------:|---------:|------:|
| FilterBranch  | 20000000 | 25.25 ms | 0.173 ms | 0.177 ms |  1.00 |
| FilterTernary | 20000000 | 25.64 ms | 0.111 ms | 0.109 ms |  1.02 |</code></pre></div>
<p>The 99.9% CI for FilterBranch: <strong>[25.08, 25.42]</strong>. For FilterTernary: <strong>[25.53, 25.75]</strong>. No overlap. A manual Welch t-test on this data gives p &lt; 0.001.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> The difference is real.</p>
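<p>The manual Welch test can be approximated from the summary table alone. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom. It assumes n = 20 for both variants, which may overstate the sample size if BDN discarded outliers. <code>Welch.TTest</code> is a hypothetical helper, not a BDN API.</p>

```csharp
using System;

public static class Welch
{
    // Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    // from summary statistics (mean, sample standard deviation, sample size).
    public static (double T, double Df) TTest(
        double mean1, double sd1, int n1,
        double mean2, double sd2, int n2)
    {
        double v1 = sd1 * sd1 / n1;   // squared standard error of mean 1
        double v2 = sd2 * sd2 / n2;   // squared standard error of mean 2
        double t = (mean1 - mean2) / Math.Sqrt(v1 + v2);
        double df = (v1 + v2) * (v1 + v2)
                  / (v1 * v1 / (n1 - 1) + v2 * v2 / (n2 - 1));
        return (t, df);
    }
}
```

<p>With the table values, <code>Welch.TTest(25.64, 0.109, 20, 25.25, 0.177, 20)</code> gives t of roughly 8.4 at about 32 degrees of freedom. Turning t into a p-value needs the t-distribution CDF, which the base class library does not provide. The tabulated two-sided 0.001 critical value near 30 degrees of freedom is about 3.6, so the statistic is consistent with p &lt; 0.001.</p>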
<p>FilterTernary is 2% slower. The 5-iteration run saw the right direction but had no basis to trust it — the CI was so wide it could not separate signal from noise.</p>
<div class="chart-container">
  <canvas id="chart-1735f2ba415780ef75e867f7bf9e0cef"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-1735f2ba415780ef75e867f7bf9e0cef').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['5 iterations', '20 iterations'],
    datasets: [
      {
        label: 'FilterBranch',
        data: [25.30, 25.25],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'FilterTernary',
        data: [26.11, 25.64],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'More iterations, narrower confidence intervals' },
      subtitle: { display: true, text: '5 iter: CIs overlap — inconclusive. 20 iter: CIs separate — confirmed.' },
      legend: { display: true }
    },
    scales: {
      y: {
        title: { display: true, text: 'Time (ms)' },
        min: 22,
        max: 30
      }
    }
  },
  plugins: [{
    id: 'errBars',
    afterDraw: function(chart) {
      var ctx = chart.ctx;
      // 99.9% CI half-widths — BDN Error column directly (no conversion)
      // FilterBranch: 5iter=0.408, 20iter=0.173
      // FilterTernary: 5iter=2.624, 20iter=0.111
      var ci = [[0.408, 0.173], [2.624, 0.111]];
      chart.data.datasets.forEach(function(ds, di) {
        var meta = chart.getDatasetMeta(di);
        meta.data.forEach(function(bar, i) {
          var hw = ci[di][i];
          var yLo = chart.scales.y.getPixelForValue(ds.data[i] - hw);
          var yHi = chart.scales.y.getPixelForValue(ds.data[i] + hw);
          ctx.save();
          ctx.strokeStyle = '#cdd6f4';
          ctx.lineWidth = 2;
          ctx.beginPath(); ctx.moveTo(bar.x, yLo); ctx.lineTo(bar.x, yHi); ctx.stroke();
          ctx.beginPath(); ctx.moveTo(bar.x - 6, yLo); ctx.lineTo(bar.x + 6, yLo); ctx.stroke();
          ctx.beginPath(); ctx.moveTo(bar.x - 6, yHi); ctx.lineTo(bar.x + 6, yHi); ctx.stroke();
          ctx.restore();
        });
      });
    }
  }]
}
);
  })();
</script>

<p>The Error on FilterTernary dropped from ±2.6 ms to ±0.1 ms. An order of magnitude. More iterations, sure. But .NET&rsquo;s JIT compiles in tiers: Tier-0 (quick, unoptimized) on first calls, Tier-1 (full optimization) after enough invocations. If BDN&rsquo;s warmup didn&rsquo;t fully promote both methods, the 5-iteration run might have caught Tier-0 code while the 20-iteration run measured Tier-1. Different machine code, different variance profile.</p>
<p>Worth checking. Expand the ternary first:</p>
<div class="highlight"><pre data-lang="csharp"><code>// FilterBranch
if (data[i] &gt; 0)
    sum &#43;= data[i];

// FilterTernary — expand v &gt; 0 ? v : 0
if (v &gt; 0) sum &#43;= v;
else        sum &#43;= 0;</code></pre></div>
<p>The branch skips. The ternary always adds — even zero. Structurally different operations.</p>
<p><code>[DisassemblyDiagnoser]</code> (<a href="/posts/first-things-first-enemies-of-measurement/">Enemy 6</a> introduced the tool) on the class dumps native code — run the benchmark, check <code>BenchmarkDotNet.Artifacts/results/*-asm.md</code>. Five iterations:</p>
<div class="highlight"><pre data-lang="nasm"><code>; FilterBranch — 54 bytes
M00_L00:
       mov       edi,[rcx]        ; load data[i]
       test      edi,edi          ; data[i] &gt; 0?
       jle       short M00_L01    ; skip if not
       movsxd    rdi,edi          ; sign-extend to 64-bit
       add       rax,rdi          ; sum &#43;= data[i]
M00_L01:
       add       rcx,4            ; i&#43;&#43;
       dec       edx
       jne       short M00_L00

; FilterTernary — 58 bytes
M00_L00:
       mov       edi,[rcx]        ; load v = data[i]
       test      edi,edi          ; v &gt; 0?
       jle       short M00_L03    ; if not, jump to zero path
M00_L01:
       movsxd    rdi,edi          ; sign-extend
       add       rax,rdi          ; sum &#43;= v (or sum &#43;= 0)
       add       rcx,4            ; i&#43;&#43;
       dec       edx
       jne       short M00_L00
; ...
M00_L03:
       xor       edi,edi          ; v = 0
       jmp       short M00_L01    ; jump back to add</code></pre></div>
<p>Twenty iterations:</p>
<div class="highlight"><pre data-lang="nasm"><code>; FilterBranch — 54 bytes
M00_L00:
       mov       edi,[rcx]
       test      edi,edi
       jle       short M00_L01
       movsxd    rdi,edi
       add       rax,rdi
M00_L01:
       add       rcx,4
       dec       edx
       jne       short M00_L00

; FilterTernary — 58 bytes
M00_L00:
       mov       edi,[rcx]
       test      edi,edi
       jle       short M00_L03
M00_L01:
       movsxd    rdi,edi
       add       rax,rdi
       add       rcx,4
       dec       edx
       jne       short M00_L00
; ...
M00_L03:
       xor       edi,edi
       jmp       short M00_L01</code></pre></div>
<p>Identical machine code. Both runs. The Error dropped because more iterations and lower observed variance both narrowed the confidence interval. BDN&rsquo;s Error is t(0.0005, n−1) × StdDev / √n — StdDev for FilterTernary fell from 0.681 ms to 0.109 ms (6×), and the larger sample brought a smaller t-value and larger √n. The variance reduction did most of the work.</p>
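<p>The Error formula can be checked against the 5-iteration table. This is a sketch under one assumption: the critical value 8.610 (two-sided 99.9%, 4 degrees of freedom) comes from a standard t-table, since the base class library has no inverse t-distribution. <code>ErrorColumn.HalfWidth</code> is a hypothetical helper.</p>

```csharp
using System;

public static class ErrorColumn
{
    // Half-width of a two-sided confidence interval: tCritical * StdDev / sqrt(n).
    // tCritical must be looked up for the chosen confidence level
    // and n - 1 degrees of freedom.
    public static double HalfWidth(double tCritical, double stdDev, int n)
        => tCritical * stdDev / Math.Sqrt(n);
}
```

<p><code>ErrorColumn.HalfWidth(8.610, 0.681, 5)</code> comes out at about 2.62 ms, matching the reported 2.624 ms to within the rounding of StdDev in the table.</p>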
<p>A number without error bars is an opinion. Five iterations produced CIs so wide that either outcome fit the data. Twenty produced CIs narrow enough to separate signal from noise — not certainty, but 99.9% confidence that FilterBranch is faster on this hardware. If you had shipped after five, you&rsquo;d have deployed a guess as a conclusion.</p>
<p>CI answers one question: does a difference exist? It says nothing about whether the difference matters.</p>
<hr>
<h2 id="layer-2--effect-size-when-significant-doesnt-mean-meaningful">Layer 2 — Effect size: when &ldquo;significant&rdquo; doesn&rsquo;t mean &ldquo;meaningful&rdquo;</h2>
<p>The 20-iteration result says FilterTernary is 2% slower. The CIs don&rsquo;t overlap. The difference is statistically real. But 0.4 ms on a 25 ms operation over 20 million integers. Is that worth changing the code?</p>
<p>Statistical significance asks <em>does a difference exist?</em> Practical significance asks <em>does it matter?</em> BDN answers the first. You answer the second.</p>
<p>Cohen&rsquo;s d — the standardized effect size — measures the distance between two means in units of the pooled standard deviation:<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<blockquote>
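<p>d = |mean<sub>1</sub> − mean<sub>2</sub>| / SD<sub>pooled</sub>, where SD<sub>pooled</sub> = √((SD<sub>1</sub>² + SD<sub>2</sub>²) / 2)</p>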
</blockquote>
<div class="highlight"><pre data-lang="csharp"><code>public static double CohensD(double mean1, double stdDev1, double mean2, double stdDev2)
{
    double pooledSd = Math.Sqrt((stdDev1 * stdDev1 &#43; stdDev2 * stdDev2) / 2.0);
    if (pooledSd == 0) return 0;
    return Math.Abs(mean1 - mean2) / pooledSd;
}</code></pre></div>
<p><small>Cohen&rsquo;s d computation — full source in <code>Analysis/StatisticalReport.cs</code>.</small></p>
<p>Cohen&rsquo;s d for FilterBranch vs FilterTernary: |25.25 − 25.64| / √((0.177² + 0.109²) / 2) = 0.39 / 0.147 = <strong>2.65</strong>. By the standard thresholds (0.2 = small, 0.5 = medium, 0.8 = large), that&rsquo;s a &ldquo;large&rdquo; effect.</p>
<p>But 2.65 for a 2% difference? Something is off.</p>
<h3 id="the-threshold-trap">The threshold trap</h3>
<p>Cohen&rsquo;s d thresholds were calibrated for psychology experiments where within-group variance is naturally high. BenchmarkDotNet&rsquo;s within-run variance is very low in controlled microbenchmarks — sub-1% coefficient of variation for compute-bound loops. When the denominator (pooled SD) is tiny, even a trivial mean difference produces a massive d.</p>
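<p>To see how much the denominator drives d, hold the 0.39 ms gap fixed and change only the standard deviation. The sketch below reuses the pooled-SD formula from <code>CohensD</code> above. The 2.5 ms standard deviation (roughly 10% coefficient of variation) is a made-up value for contrast, not a measurement, and <code>ThresholdTrap</code> is a hypothetical name.</p>

```csharp
using System;

public static class ThresholdTrap
{
    // Coefficient of variation as a percentage of the mean.
    public static double CvPercent(double mean, double stdDev)
        => 100.0 * stdDev / mean;

    // Cohen's d with equally weighted pooled SD, same formula as CohensD above.
    public static double CohensD(double mean1, double sd1, double mean2, double sd2)
    {
        double pooled = Math.Sqrt((sd1 * sd1 + sd2 * sd2) / 2.0);
        return pooled == 0 ? 0 : Math.Abs(mean1 - mean2) / pooled;
    }
}
```

<p>With the measured values, <code>CvPercent(25.25, 0.177)</code> is about 0.7% and <code>CohensD(25.25, 0.177, 25.64, 0.109)</code> is about 2.65. With the hypothetical 2.5 ms standard deviation on both sides, the same means give d of about 0.16: below Cohen&rsquo;s &ldquo;small&rdquo; threshold, for the identical 2% difference.</p>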
<p>Three pairs from the companion code:</p>
<table>
  <thead>
      <tr>
          <th>Pair</th>
          <th style="text-align: right">Ratio</th>
          <th style="text-align: right">Practical difference</th>
          <th style="text-align: right">Cohen&rsquo;s d</th>
          <th>&ldquo;Interpretation&rdquo;</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FilterBranch vs FilterTernary</td>
          <td style="text-align: right">1.02</td>
          <td style="text-align: right">2%</td>
          <td style="text-align: right">2.65</td>
          <td>&ldquo;large&rdquo;</td>
      </tr>
      <tr>
          <td>SumArray vs SumSpan</td>
          <td style="text-align: right">1.01</td>
          <td style="text-align: right">0.5%</td>
          <td style="text-align: right">1.98</td>
          <td>&ldquo;large&rdquo;</td>
      </tr>
      <tr>
          <td>SearchLinear vs SearchBinary</td>
          <td style="text-align: right">0.001</td>
          <td style="text-align: right">1,071x</td>
          <td style="text-align: right">368</td>
          <td>&ldquo;large&rdquo;</td>
      </tr>
  </tbody>
</table>
<p>All three &ldquo;large&rdquo; by Cohen&rsquo;s thresholds. Only one is a meaningful optimization. Wittgenstein (1953): meaning is use — a word means what it means in the language game where it was coined. Cohen&rsquo;s thresholds were coined in a game where within-group variance is high and effect sizes are modest. Microbenchmarking is a different game — sub-1% coefficient of variation, deterministic loops, controlled environments. &ldquo;Large&rdquo; means something in psychology. The standard interpretation becomes misleading when BDN&rsquo;s precision makes the denominator vanishingly small. A 0.5% difference and a 1,071x difference land in the same bucket.</p>
<p>Popper (1934): a hypothesis survives by resisting falsification, not by accumulating confirmation. &ldquo;3% slower&rdquo; is a hypothesis. Non-overlapping CIs survived the first test: the difference exists. But Cohen&rsquo;s d at 2.65 for a 2% change is the hypothesis flattering itself. The effect size, on BDN&rsquo;s terrain, does not survive scrutiny. Seek the conditions under which the claim fails, not the ones where it holds.</p>
<p>For microbenchmarks, <strong>rely primarily on BDN&rsquo;s Ratio column</strong> rather than Cohen&rsquo;s d. Ratio ~ 1.00 means &ldquo;no practical difference.&rdquo; Ratio ~ 0.001 means &ldquo;algorithmic change.&rdquo; Whether 2% matters depends on context — a hot loop called billions of times, or a function called once per request. Define your threshold before you run.</p>
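<p>One way to commit to a threshold up front is to encode it next to the benchmark. The following is a minimal sketch: <code>RatioGate</code>, <code>Verdict</code>, and the tolerance values are illustrative assumptions, not BDN features.</p>

```csharp
using System;

public enum Verdict { NoPracticalDifference, Faster, Slower }

public static class RatioGate
{
    // ratio     = candidate mean / baseline mean (BDN's Ratio column).
    // tolerance = relative change you decided, before running, to ignore.
    public static Verdict Classify(double ratio, double tolerance)
    {
        if (ratio < 1.0 - tolerance) return Verdict.Faster;
        if (ratio > 1.0 + tolerance) return Verdict.Slower;
        return Verdict.NoPracticalDifference;
    }
}
```

<p>With a 5% tolerance, a Ratio of 1.02 classifies as <code>NoPracticalDifference</code> and 0.001 as <code>Faster</code>. Tighten the tolerance to 1% for a hot loop and the same 1.02 becomes <code>Slower</code>. The numbers did not change, only the context did.</p>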
<h3 id="two-extremes">Two extremes</h3>
<p><strong>Small practical effect</strong> — array indexing vs Span indexing over 1M integers:</p>
<div class="highlight"><pre data-lang=""><code>| Method   | Categories  | N       | Mean     | Error   | StdDev  | Ratio |
|--------- |------------ |-------- |---------:|--------:|--------:|------:|
| SumArray | SmallEffect | 1000000 | 512.7 us | 1.16 us | 1.19 us |  1.00 |
| SumSpan  | SmallEffect | 1000000 | 515.3 us | 1.28 us | 1.42 us |  1.01 |</code></pre></div>
<p>Ratio = 1.01. The JIT produces nearly identical code for both — bounds-check elimination applies to <code>int[]</code> and <code>ReadOnlySpan&lt;int&gt;</code> alike on .NET 9. The 2.6 us difference (0.5%) is likely real — the CIs don&rsquo;t overlap, which is a conservative indicator — but not worth a code change.</p>
<p><strong>Large practical effect</strong> — linear search vs binary search over 1M integers:</p>
<div class="highlight"><pre data-lang=""><code>| Method       | Categories  | N       | Mean         | Error    | StdDev   | Ratio |
|------------- |------------ |-------- |-------------:|---------:|---------:|------:|
| SearchLinear | LargeEffect | 1000000 | 248,303.3 us | 928.6 us | 953.6 us | 1.000 |
| SearchBinary | LargeEffect | 1000000 |     231.8 us |   1.5 us |   1.7 us | 0.001 |</code></pre></div>
<p>Ratio = 0.001. O(n) vs O(log n). An algorithmic change — not a JIT quirk, not a cache alignment artifact. 1,071x faster on this hardware. The algorithmic advantage holds on any platform with sorted data, though the exact multiplier will vary.</p>
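<p>To see how the two measures diverge on the tables above, here is a minimal sketch that computes Cohen&rsquo;s d and the ratio from the reported Mean and StdDev values. It assumes equal sample sizes and a simple pooled standard deviation; <code>CohensDSketch</code> and <code>CohensD</code> are hypothetical names, not part of BDN or the companion code.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class CohensDSketch
{
    // Cohen&#39;s d with a pooled standard deviation (assumes equal sample sizes).
    public static double CohensD(double meanA, double sdA, double meanB, double sdB)
    {
        double pooled = Math.Sqrt((sdA * sdA &#43; sdB * sdB) / 2.0);
        return Math.Abs(meanB - meanA) / pooled;
    }

    public static void Main()
    {
        // Mean and StdDev in microseconds, copied from the two tables above.
        double dSmall = CohensD(512.7, 1.19, 515.3, 1.42);
        double dLarge = CohensD(248303.3, 953.6, 231.8, 1.7);

        Console.WriteLine($&#34;SumSpan / SumArray:          d = {dSmall:F2}, ratio = {515.3 / 512.7:F3}&#34;);
        Console.WriteLine($&#34;SearchBinary / SearchLinear: d = {dLarge:F1}, ratio = {231.8 / 248303.3:F4}&#34;);
    }
}</code></pre></div>
<p><small>Illustrative sketch. Both comparisons should land well above Cohen&rsquo;s 0.8 threshold, while the ratios sit near 1.00 and near 0.001. Only the ratio separates them in a way that maps to a shipping decision.</small></p>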
<div class="chart-container">
  <canvas id="chart-8acdec6e1897b21786f81e66d908845e"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-8acdec6e1897b21786f81e66d908845e').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['Array vs Span\n(0.5% difference)', 'Linear vs Binary\n(1,071× difference)'],
    datasets: [
      {
        label: 'SumArray / SearchLinear',
        data: [0.513, 248.3],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'SumSpan / SearchBinary',
        data: [0.515, 0.232],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Small vs large practical effect' },
      subtitle: { display: true, text: 'Both "statistically significant" — only one worth shipping' },
      legend: { display: true }
    },
    scales: {
      y: {
        type: 'logarithmic',
        title: { display: true, text: 'Time (ms) — log scale' },
        min: 0.1,
        max: 1000
      }
    }
  }
}
);
  })();
</script>

<p>A number with error bars but no effect size is only half an answer.</p>
<hr>
<h2 id="layer-3--micro-vs-macro-right-question-wrong-scale">Layer 3 — Micro vs macro: right question, wrong scale</h2>
<p>A microbenchmark isolates a function. A macrobenchmark places it inside a pipeline. They answer different questions — and the answers disagree.</p>
<div class="highlight"><pre data-lang="csharp"><code>// Micro: isolated lookup — Dictionary vs linear search over 10,000 elements
[BenchmarkCategory(&#34;Micro&#34;)]
[Benchmark(Baseline = true)]
public int LookupLinear()
{
    int found = 0;
    for (int i = 0; i &lt; _searchKeys.Length; i&#43;&#43;)
    {
        if (Array.IndexOf(_data, _searchKeys[i]) &gt;= 0)
            found&#43;&#43;;
    }
    return found;
}

[BenchmarkCategory(&#34;Micro&#34;)]
[Benchmark]
public int LookupDictionary()
{
    int found = 0;
    for (int i = 0; i &lt; _searchKeys.Length; i&#43;&#43;)
    {
        if (_dict.ContainsKey(_searchKeys[i]))
            found&#43;&#43;;
    }
    return found;
}</code></pre></div>
<p><small>Microbenchmark — isolated lookup comparison over 200 search keys. Full source in companion code.</small></p>
<div class="highlight"><pre data-lang=""><code>| Method           | Categories | Mean       | Error    | StdDev   | Ratio |
|----------------- |----------- |-----------:|---------:|---------:|------:|
| LookupLinear     | Micro      | 412.089 us | 1.609 us | 1.788 us | 1.000 |
| LookupDictionary | Micro      |   1.571 us | 0.012 us | 0.014 us | 0.004 |</code></pre></div>
<p>Dictionary is <strong>262x faster</strong>. Ship it?</p>
<p>The lookup lives inside a pipeline:</p>
<div class="highlight"><pre data-lang="csharp"><code>[Benchmark(Baseline = true)]
public long PipelineLinear()
{
    long v = ValidateArray(_workload);     // ~40% — sequential scan, 3M elements
    long t = PolynomialTransform(_workload); // ~40% — multiply/add/xor, 3M elements
    int  l = LookupAllLinear(_data, _searchKeys); // ~6% — 200 keys × Array.IndexOf
    long a = Aggregate(_workload);          // ~15% — weighted sum, stride 4
    return v ^ t ^ l ^ a;
}

[Benchmark]
public long PipelineDictionary()
{
    long v = ValidateArray(_workload);
    long t = PolynomialTransform(_workload);
    int  l = LookupAllDictionary(_searchKeys); // Dictionary.ContainsKey
    long a = Aggregate(_workload);
    return v ^ t ^ l ^ a;
}</code></pre></div>
<p><small>Only the lookup step changes. Full source in companion code.</small></p>
<p>94% of the work doesn&rsquo;t change regardless of lookup strategy.</p>
<div class="highlight"><pre data-lang=""><code>| Method             | Categories | Mean         | Error     | StdDev    | Ratio |
|------------------- |----------- |-------------:|----------:|----------:|------:|
| PipelineLinear     | Macro      | 7,181.115 us | 59.636 us | 66.285 us |  1.00 |
| PipelineDictionary | Macro      | 6,611.982 us | 11.094 us | 11.871 us |  0.92 |</code></pre></div>
<p>Pipeline with Dictionary is <strong>8% faster</strong>. Not 262x. Eight percent.</p>
<div class="chart-container">
  <canvas id="chart-483253b27459f166f6cda9715e4b2e69"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-483253b27459f166f6cda9715e4b2e69').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['Micro: Lookup only', 'Macro: Full pipeline'],
    datasets: [
      {
        label: 'Linear (baseline)',
        data: [0.412, 7.181],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'Dictionary (variant)',
        data: [0.002, 6.612],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Micro vs macro — isolated speedup vs end-to-end impact' },
      subtitle: { display: true, text: '262× micro speedup on 6% of pipeline → 8% end-to-end' },
      legend: { display: true }
    },
    scales: {
      y: {
        title: { display: true, text: 'Time (ms)' }
      }
    }
  }
}
);
  })();
</script>

<p>The lookup consumes 412 us out of 7,181 us total, or 5.7% of the pipeline. A 262x speedup on 5.7% gives a theoretical maximum speedup of 1 / (1 - 0.057 + 0.057/262) = 1.060, a <strong>6.0%</strong> improvement (Amdahl&rsquo;s law<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>). The measured 8% is higher: the pipeline saved 569 us, more than the 412 us the lookup takes in isolation. Cache effects from eliminating the linear scan likely benefit subsequent pipeline steps.</p>
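<p>The Amdahl arithmetic can be reproduced from the two result tables. The sketch below derives the lookup fraction by dividing the micro mean by the pipeline mean, as the paragraph above does; that is an approximation, since the lookup may cost something different inside the pipeline. <code>AmdahlSketch</code> and <code>Speedup</code> are hypothetical names.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class AmdahlSketch
{
    // Overall speedup when a fraction f of the runtime is accelerated by a factor s.
    public static double Speedup(double f, double s) =&gt; 1.0 / (1.0 - f &#43; f / s);

    public static void Main()
    {
        double f = 412.089 / 7181.115;         // lookup share of PipelineLinear
        double s = 412.089 / 1.571;            // micro speedup of the lookup
        double predicted = Speedup(f, s);      // what Amdahl&#39;s law allows
        double measured = 7181.115 / 6611.982; // what the macrobenchmark reported

        Console.WriteLine($&#34;f = {f:F3}, S = {s:F0}x&#34;);
        Console.WriteLine($&#34;predicted = {predicted:F3}x, measured = {measured:F3}x&#34;);
    }
}</code></pre></div>
<p><small>Illustrative sketch. A measured speedup above the predicted one is a hint that the change affected more than the step it replaced.</small></p>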
<p>Micro answers <em>&ldquo;is this function faster?&rdquo;</em> Macro answers <em>&ldquo;will the user notice?&rdquo;</em></p>
<p>Baudrillard (1981): the fourth phase of the simulacrum — the image bears no relation to any reality whatever. The microbenchmark says 262x. The macrobenchmark says 8%. Both have error bars. Both passed statistical tests. Both are internally consistent. Neither describes what the user experiences. Two maps orbiting each other, each valid within its own coordinate system, each detached from the territory they claim to represent. The micro number didn&rsquo;t lie. The macro number didn&rsquo;t lie. The lie was believing either one alone was the answer.</p>
<p>Eight percent might be worth it — or might not, depending on whether the pipeline runs once per request or once per hour. The microbenchmark alone cannot tell you.</p>
<hr>
<h2 id="before-you-ship-the-number">Before you ship the number</h2>
<table>
  <thead>
      <tr>
          <th>Check</th>
          <th>Question</th>
          <th>If no&hellip;</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Iterations</td>
          <td>Did you run enough iterations? (&gt;= 15 in this setup, configured via SimpleJob)</td>
          <td>Your CIs are too wide — the result might be noise (see Layer 1)</td>
      </tr>
      <tr>
          <td>CI overlap</td>
          <td>Do the 99.9% CIs (BDN Error) <em>not</em> overlap?</td>
          <td>Overlapping CIs suggest noise — but non-overlap is conservative, not definitive. Confirm with a formal test (Welch / Mann-Whitney)</td>
      </tr>
      <tr>
          <td>Practical size</td>
          <td>Is the Ratio meaningfully different from 1.00? Does it exceed your SESOI?</td>
          <td>Statistically real but practically irrelevant — move on</td>
      </tr>
      <tr>
          <td>Micro = Macro</td>
          <td>Does the micro speedup translate to end-to-end improvement?</td>
          <td>The bottleneck is elsewhere — profile before optimizing</td>
      </tr>
      <tr>
          <td>Reproducible</td>
          <td>Same result on different hardware / OS / runtime?</td>
          <td>Environment-dependent — see <a href="/posts/first-things-first-enemies-of-measurement/">Part 2</a></td>
      </tr>
  </tbody>
</table>
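<p>The &ldquo;CI overlap&rdquo; and &ldquo;Practical size&rdquo; rows can be screened mechanically from BDN&rsquo;s Mean and Error columns. A minimal sketch, assuming a SESOI of 5% purely for illustration; the helper names are hypothetical, and the overlap check is the quick heuristic from the table, not a formal test.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class ShipCheckSketch
{
    // BDN&#39;s Error column is the CI half-width, so each interval is mean ± error.
    public static bool IntervalsOverlap(double meanA, double errA, double meanB, double errB)
        =&gt; Math.Abs(meanA - meanB) &lt;= errA &#43; errB;

    // True when the ratio differs from 1.00 by at least the chosen SESOI (0.05 = 5%).
    public static bool ExceedsSesoi(double baselineMean, double variantMean, double sesoi)
        =&gt; Math.Abs(variantMean / baselineMean - 1.0) &gt;= sesoi;

    public static void Main()
    {
        const double sesoi = 0.05; // hypothetical threshold; choose yours before running

        // SumArray vs SumSpan (us): a real difference, but below the threshold.
        Console.WriteLine(IntervalsOverlap(512.7, 1.16, 515.3, 1.28));
        Console.WriteLine(ExceedsSesoi(512.7, 515.3, sesoi));

        // PipelineLinear vs PipelineDictionary (us): a real difference, above the threshold.
        Console.WriteLine(IntervalsOverlap(7181.115, 59.636, 6611.982, 11.094));
        Console.WriteLine(ExceedsSesoi(7181.115, 6611.982, sesoi));
    }
}</code></pre></div>
<p><small>Illustrative sketch. Neither pair of intervals overlaps; only the pipeline comparison clears the 5% threshold.</small></p>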
<p>Three rules:</p>
<ol>
<li>
<p><strong>Always report confidence intervals.</strong> A mean without CI is a claim, not evidence. BenchmarkDotNet provides the Error column (99.9% CI half-width) — use it. CI overlap is a useful quick screening heuristic: overlapping CIs suggest noise, non-overlapping CIs suggest a real difference — but neither is definitive. Overlapping CIs can still hide a significant difference, and non-overlapping CIs are a conservative rule, not proof. For a formal conclusion, use a statistical test (Welch&rsquo;s t-test, Mann-Whitney U). If you only ran 5 iterations, run more.</p>
</li>
<li>
<p><strong>Distinguish statistical from practical significance.</strong> Non-overlapping CIs mean the difference exists. They don&rsquo;t mean it matters. Define a SESOI (smallest effect size of interest) before running the benchmark — the minimum improvement that justifies the code change. BDN&rsquo;s Ratio column tells you the proportional difference: if it doesn&rsquo;t cross your SESOI threshold, the result is real but not actionable.</p>
</li>
<li>
<p><strong>Confirm micro with macro.</strong> A microbenchmark shows a function is faster in isolation. A macrobenchmark shows the user will notice. Run both, or explain why you didn&rsquo;t. A 262x micro speedup sounds compelling until the full pipeline reduces it to 8%, close to the 6% that Amdahl&rsquo;s law predicts.</p>
</li>
</ol>
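<p>For the formal test that rule 1 asks for, Welch&rsquo;s t statistic and its degrees of freedom can be computed from summary statistics alone. The sketch below uses the means and standard deviations quoted in the Welch footnote and assumes n = 20 per group (the counts after outlier removal may be lower). Turning t and df into a p-value needs a Student&rsquo;s t distribution, which the .NET base class library does not provide, so the sketch stops there. <code>WelchSketch</code> is a hypothetical name.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class WelchSketch
{
    // Welch&#39;s t statistic and Welch-Satterthwaite degrees of freedom
    // from per-group mean, standard deviation, and sample size.
    public static (double T, double Df) Welch(
        double mean1, double sd1, int n1,
        double mean2, double sd2, int n2)
    {
        double v1 = sd1 * sd1 / n1; // squared standard error, group 1
        double v2 = sd2 * sd2 / n2; // squared standard error, group 2
        double t = (mean1 - mean2) / Math.Sqrt(v1 &#43; v2);
        double df = (v1 &#43; v2) * (v1 &#43; v2)
                  / (v1 * v1 / (n1 - 1) &#43; v2 * v2 / (n2 - 1));
        return (t, df);
    }

    public static void Main()
    {
        // Values from the Welch footnote; n = 20 per group is an assumption.
        var (t, df) = Welch(25.25, 0.177, 20, 25.64, 0.109, 20);
        Console.WriteLine($&#34;t = {t:F1}, df = {df:F0}&#34;);
    }
}</code></pre></div>
<p><small>Illustrative sketch. For routine use, prefer BDN&rsquo;s built-in statistical tests over hand-rolled arithmetic.</small></p>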
<h3 id="run-it-yourself">Run it yourself</h3>
<div class="highlight"><pre data-lang="bash"><code>git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/statistics-that-matter

# All benchmarks (20 iterations, ~3 min)
# Pin to a single NUMA node to eliminate cross-socket variance
taskset -c 0-11 dotnet run -c Release

# Individual scenarios
taskset -c 0-11 dotnet run -c Release -- --filter &#39;*NoisyComparison*&#39;
taskset -c 0-11 dotnet run -c Release -- --filter &#39;*EffectSizeDemo*&#39;
taskset -c 0-11 dotnet run -c Release -- --filter &#39;*MicroVsMacro*&#39;

# Reproduce the CI overlap demo (5 iterations — wide error bars)
taskset -c 0-11 dotnet run -c Release -- --filter &#39;*NoisyComparison*&#39; --iterationCount 5 --warmupCount 3</code></pre></div>
<hr>
<h2 id="benchmark-environment">Benchmark environment</h2>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CPU</td>
          <td>2x Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)</td>
      </tr>
      <tr>
          <td>RAM</td>
          <td>~115 GB DDR3-1866 (quad-channel per socket)</td>
      </tr>
      <tr>
          <td>OS</td>
          <td>Fedora Linux 42 (kernel 6.17)</td>
      </tr>
      <tr>
          <td>Runtime</td>
          <td>.NET 9.0.11 (RyuJIT AVX)</td>
      </tr>
      <tr>
          <td>SDK</td>
          <td>.NET SDK 10.0.102</td>
      </tr>
      <tr>
          <td>BenchmarkDotNet</td>
          <td>v0.14.0</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>Server GC, Concurrent (BDN enables Server GC in benchmark processes by default; host process uses Workstation)</td>
      </tr>
      <tr>
          <td>Pinning</td>
          <td><code>taskset -c 0-11</code> — single socket, physical cores only</td>
      </tr>
      <tr>
          <td>Job</td>
          <td>SimpleJob (WarmupCount=5, IterationCount=20)</td>
      </tr>
  </tbody>
</table>
<p><strong>Limitations:</strong> Single machine, dual-socket NUMA. All benchmarks pinned to one socket to eliminate cross-socket memory access and thread migration — without pinning, NoisyComparison variance doubles and absolute values shift by 5-10% between runs (<a href="/posts/first-things-first-enemies-of-measurement/">Part 2</a>). <code>EffectSizeDemo</code> uses sorted data for binary search — the algorithmic advantage is inherent, not hardware-dependent. <code>MicroVsMacro</code> pipeline proportions (40/40/6/15%) are approximate — workload ratios on your hardware will vary.</p>
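<p>For orientation, the Job row above corresponds to a BDN attribute configuration along these lines. This is a sketch, not the companion code: the class name, field names, and data setup are assumptions, and it needs the BenchmarkDotNet NuGet package to build.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;

[SimpleJob(warmupCount: 5, iterationCount: 20)]           // matches the Job row
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByCategory)] // one baseline per category
[CategoriesColumn]
public class SmallEffectSketch
{
    private int[] _data = Array.Empty&lt;int&gt;();

    [Params(1_000_000)]
    public int N;

    [GlobalSetup]
    public void Setup()
    {
        _data = new int[N];
        for (int i = 0; i &lt; N; i&#43;&#43;) _data[i] = i;
    }

    [BenchmarkCategory(&#34;SmallEffect&#34;)]
    [Benchmark(Baseline = true)]
    public long SumArray()
    {
        long sum = 0;
        for (int i = 0; i &lt; _data.Length; i&#43;&#43;) sum &#43;= _data[i];
        return sum;
    }

    [BenchmarkCategory(&#34;SmallEffect&#34;)]
    [Benchmark]
    public long SumSpan()
    {
        ReadOnlySpan&lt;int&gt; span = _data;
        long sum = 0;
        for (int i = 0; i &lt; span.Length; i&#43;&#43;) sum &#43;= span[i];
        return sum;
    }
}

public static class Program
{
    // BenchmarkSwitcher parses command-line options such as --filter.
    public static void Main(string[] args)
        =&gt; BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}</code></pre></div>
<p><small>Illustrative sketch. Run it in Release mode, pinned as described above, and expect different absolute numbers on different hardware.</small></p>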
<hr>
<p>Even with honest design, controlled environment, and correct measurement — the number still needs interpretation. Too few iterations and the CI swallows the difference. Tight CIs inflate Cohen&rsquo;s d into meaninglessness. Microbenchmarks promise 262x while the user sees 8%.</p>
<p>Hume (1739): no finite number of observations guarantees the next will conform. But the problem isn&rsquo;t too few observations — it&rsquo;s too much readiness to conclude. The confirmation doesn&rsquo;t come from the data. It comes from you. The number said &ldquo;3% slower&rdquo; and you heard &ldquo;regression&rdquo; because you were already looking for one. The CIs were wide enough to hold any story. You picked the one that matched.</p>
<p>&ldquo;3% faster&rdquo; is not a result. It&rsquo;s a hypothesis. Treat it like one — confirm it with sufficient iterations, assess practical significance, and validate it against end-to-end behavior. Or revert the merge.</p>
<hr>
<h2 id="further-reading">Further reading</h2>
<ul>
<li>Cohen, <em>Statistical Power Analysis for the Behavioral Sciences</em> (1988) — the standard reference for effect size. Defines Cohen&rsquo;s d and the small/medium/large thresholds.<sup id="fnref1:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></li>
<li>Georges, Buytaert, Eeckhout, <a href="https://dl.acm.org/doi/10.1145/1297027.1297033">Statistically Rigorous Java Performance Evaluation</a> (OOPSLA 2007) — how many iterations, which statistical tests, how to report. Directly applies to BDN methodology.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></li>
<li>Mytkowicz et al., <a href="https://dl.acm.org/doi/10.1145/1508244.1508275">Producing Wrong Data Without Doing Anything Obviously Wrong</a> (ASPLOS 2009) — measurement bias from setup sensitivity. Small environmental changes flip benchmark results.<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup></li>
<li>Kalibera &amp; Jones, <a href="https://dl.acm.org/doi/10.1145/2491894.2464160">Rigorous Benchmarking in Reasonable Time</a> (ISMM 2013) — how many iterations you actually need, steady-state detection, randomization.<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup></li>
<li>Andrey Akinshin, <em>Pro .NET Benchmarking</em> (Apress, 2019) — the BenchmarkDotNet author on statistics, confidence intervals, comparing results.<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup></li>
<li>BenchmarkDotNet documentation, <a href="https://benchmarkdotnet.org/articles/features/statistics.html">Statistics</a> and <a href="https://benchmarkdotnet.org/articles/samples/IntroStatisticalTesting.html">IntroStatisticalTesting</a> — Mann-Whitney, Welch&rsquo;s t-test, the Ratio column, CI computation.<sup id="fnref1:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></li>
<li>Brendan Gregg, <a href="https://www.brendangregg.com/activebenchmarking.html">Benchmarking Gone Wrong</a> (LISA 2014) — visual comparison, ignoring variance, cherry picking. Anti-patterns that match the &ldquo;3% slower&rdquo; scenario.<sup id="fnref:9"><a href="#fn:9" class="footnote-ref" role="doc-noteref">9</a></sup></li>
<li>Matt Dowle, <a href="https://h2oai.github.io/db-benchmark/">Database-like ops benchmark</a> — ratio-based comparison and reproducibility in practice.<sup id="fnref:10"><a href="#fn:10" class="footnote-ref" role="doc-noteref">10</a></sup></li>
<li>Gene M. Amdahl, <a href="https://dl.acm.org/doi/10.1145/1465482.1465560">Validity of the single processor approach to achieving large scale computing capabilities</a> (AFIPS 1967) — the law that explains why micro speedups vanish at macro scale.<sup id="fnref1:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>BenchmarkDotNet, <a href="https://benchmarkdotnet.org/articles/features/statistics.html">Statistics</a> and <a href="https://benchmarkdotnet.org/articles/samples/IntroStatisticalTesting.html">IntroStatisticalTesting</a>. Documents the Mann-Whitney U test and Welch&rsquo;s t-test implementations, the Ratio column semantics, and how the Error column is computed. Error is the half-width of the 99.9% confidence interval using a Student&rsquo;s t-distribution: Error = t(0.0005, n-1) x StdDev / sqrt(n), where n is the number of iterations after outlier removal. Because the t-distribution has heavier tails at small n, the Error column naturally grows when iterations are few — making CI overlap a useful (conservative) visual screening tool. For formal inference, prefer BDN&rsquo;s built-in Welch or Mann-Whitney tests.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Welch&rsquo;s t-test (unequal variances), computed manually from BDN&rsquo;s summary statistics. With the 20-iteration data: t = (25.25 - 25.64) / sqrt(0.177^2/n_1 + 0.109^2/n_2) = -8.4, df = 32 (Welch-Satterthwaite), p &lt; 0.001. BDN&rsquo;s own StatisticalTestColumn uses a Welch-based TOST equivalence test or Mann-Whitney — see <a href="https://benchmarkdotnet.org/articles/samples/IntroStatisticalTesting.html">IntroStatisticalTesting</a> for details on the built-in tests.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Jacob Cohen, <em>Statistical Power Analysis for the Behavioral Sciences</em>, 2nd ed. (Lawrence Erlbaum, 1988). The canonical source for effect size conventions. d = 0.2 (small), 0.5 (medium), 0.8 (large) — thresholds that became standard by widespread adoption, not mathematical derivation. Cohen himself warned against rigid cutoffs; in microbenchmarking, BDN&rsquo;s sub-1% CoV makes d misleadingly large for trivial differences.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Gene M. Amdahl, <a href="https://dl.acm.org/doi/10.1145/1465482.1465560">Validity of the single processor approach to achieving large scale computing capabilities</a>, AFIPS 1967. If the optimized component is fraction f of total runtime, the maximum speedup is 1 / (1 - f + f/S), where S is the component speedup. For f = 0.057 and S = 262: 1 / (1 - 0.057 + 0.057/262) = 1 / 0.9432 = 1.060 — a 6.0% end-to-end improvement.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>Georges, Buytaert, Eeckhout, <a href="https://dl.acm.org/doi/10.1145/1297027.1297033">Statistically Rigorous Java Performance Evaluation</a>, OOPSLA 2007. Demonstrates that many published benchmarks use insufficient iterations and no confidence intervals. Proposes a methodology that BenchmarkDotNet later adopted — including the minimum iteration count that prevents the instability shown in this post.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:6">
<p>Mytkowicz, Diwan, Hauswirth, Sweeney, <a href="https://dl.acm.org/doi/10.1145/1508244.1508275">Producing Wrong Data Without Doing Anything Obviously Wrong</a>, ASPLOS 2009. Changing the UNIX environment size or link order flips benchmark results. The case for randomization and effect sizes over raw means.&#160;<a href="#fnref:6" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:7">
<p>Kalibera &amp; Jones, <a href="https://dl.acm.org/doi/10.1145/2491894.2464160">Rigorous Benchmarking in Reasonable Time</a>, ISMM 2013. A practical methodology for choosing iteration counts — too few and your CIs are meaningless, too many and you&rsquo;re wasting time. The sweet spot depends on the coefficient of variation.&#160;<a href="#fnref:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:8">
<p>Andrey Akinshin, <em>Pro .NET Benchmarking</em> (Apress, 2019). Chapters 5-7 cover statistics, confidence intervals, and comparing benchmark results. The authoritative guide for BenchmarkDotNet users.&#160;<a href="#fnref:8" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:9">
<p>Brendan Gregg, <a href="https://www.brendangregg.com/activebenchmarking.html">Benchmarking Gone Wrong / Active Benchmarking</a>, LISA 2014. Anti-patterns: visual comparison (&ldquo;this graph looks faster&rdquo;), ignoring variance, cherry-picking runs. The &ldquo;3% slower with 5 iterations&rdquo; scenario in this post is Gregg&rsquo;s &ldquo;visual comparison&rdquo; anti-pattern compounded with insufficient sample size.&#160;<a href="#fnref:9" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:10">
<p>Matt Dowle, <a href="https://h2oai.github.io/db-benchmark/">Database-like ops benchmark</a>. A practical example of ratio-based comparison across implementations, with reproducibility as a first-class concern.&#160;<a href="#fnref:10" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>]]></content:encoded>
    </item>
    <item>
      <title>First Things First: Coordinated Omission</title>
      <link>https://0x3f.blog/posts/first-things-first-coordinated-omission/</link>
      <pubDate>Tue, 03 Mar 2026 10:00:00 +0100</pubDate>
      <guid>https://0x3f.blog/posts/first-things-first-coordinated-omission/</guid>
      <description>Same service, same pause pattern, different client model — p99 jumps from 1 ms to 195 ms. The measurement method itself lies.</description>
      <content:encoded><![CDATA[<h2 id="p99--1-ms--flip-one-switch--p99--195-ms">p99 = 1 ms — flip one switch — p99 = 195 ms</h2>
<p>Same service. Same pause pattern. Same nominal target rate. One change in the client model — p99 jumps 182×. Not a system failure. A measurement failure.</p>
<p>Design can lie. The environment can lie. Fix both — the benchmark looks solid, the percentiles look clean. Too clean. The measurement method itself can lie — a systematic omission baked into how the test collects data.</p>
<p>All code in this post: clone, build, run. Numbers below were measured on dual Xeon E5-2697 v2 — run the companion code on your hardware for your own results. Different hardware, different numbers — that&rsquo;s half the lesson.</p>
<p><em>Convention: charts use milliseconds; tables reproduce raw simulation output. Histograms are approximate visualizations of the recorded latency distribution — the percentile tables are the authoritative data.</em></p>
<hr>
<h2 id="send-wait-measure-repeat">Send, wait, measure, repeat</h2>
<div class="highlight"><pre data-lang="csharp"><code>public static LatencyReport Run(SimulatedService service, int ratePerSec, int durationSec)
{
    int totalRequests = ratePerSec * durationSec;
    var recorder = new LatencyRecorder();

    for (int i = 0; i &lt; totalRequests; i&#43;&#43;)
    {
        long start = Stopwatch.GetTimestamp();
        service.Process();
        long elapsed = Stopwatch.GetTimestamp() - start;
        recorder.Record(elapsed);
    }

    return recorder.GetReport();
}</code></pre></div>
<p><small>Closed-loop client — full source in companion code.</small></p>
<p>Send a request. Wait for the response. Measure the elapsed time. Send the next one. The client and the service take turns — a lockstep conversation where neither moves without the other. This pattern has a name: <strong>closed-loop</strong>.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> Most load test frameworks default to it. Most dashboards assume it.</p>
<p>What does your test do when the system slows down?</p>
<hr>
<h2 id="the-comfortable-picture">The comfortable picture</h2>
<p>The system under test: a simulated service with ~1 ms baseline latency (calibrated SpinWait) and a 200 ms pause every 500th request — modeling GC, compaction, or any periodic maintenance event. Target rate: 450 req/sec over 30 seconds (13,500 total). Average service time: (499 × 1 ms + 1 × 200 ms) / 500 = 1.4 ms. At 450 req/sec the service needs 630 ms of work per second — ~63% utilization, with headroom to spare. The pauses are the problem, not the capacity.</p>
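<p>The capacity arithmetic above is easy to check. A minimal sketch, assuming exactly 1 ms of baseline service time and exactly 200 ms for the paused request; <code>UtilizationSketch</code> and its method are hypothetical names.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class UtilizationSketch
{
    // Mean service time when one request in every pauseEvery takes pauseMs
    // and the others take baselineMs.
    public static double AverageServiceMs(double baselineMs, double pauseMs, int pauseEvery)
        =&gt; ((pauseEvery - 1) * baselineMs &#43; pauseMs) / pauseEvery;

    public static void Main()
    {
        const int ratePerSec = 450;
        double avgMs = AverageServiceMs(1.0, 200.0, 500);
        double busyMsPerSec = ratePerSec * avgMs;   // work demanded per wall-clock second
        double utilization = busyMsPerSec / 1000.0;

        Console.WriteLine($&#34;avg = {avgMs:F2} ms, busy = {busyMsPerSec:F0} ms/s, utilization = {utilization:F2}&#34;);
    }
}</code></pre></div>
<p><small>Illustrative sketch. Utilization well below 1.0 means the service can always catch up between pauses; the tail comes from the pauses, not from saturation.</small></p>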
<p>The closed-loop client has no rate limiter, no inter-request delay — <code>totalRequests</code> is just a count (rate × duration) to match the open-loop&rsquo;s output volume. The effective rate is whatever the service delivers. During normal processing (~1 ms per request), well above 450 req/sec. During a 200 ms pause: zero. The arrival rate follows the system. When the system slows, the test slows with it.</p>
<div class="chart-container">
  <canvas id="chart-4bc91d3463c210fa54b820e8639d3096"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-4bc91d3463c210fa54b820e8639d3096').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['0–2', '2–50', '50–100', '100–150', '150–200', '200+'],
    datasets: [{
      label: 'Request count',
      data: [13473, 0, 0, 0, 0, 27],
      backgroundColor: '#89b4fa',
      borderColor: '#89b4fa',
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Closed-loop latency distribution — 13,500 requests' },
      subtitle: { display: true, text: '~1ms baseline, 200ms pause every 500 requests' },
      legend: { display: false }
    },
    scales: {
      x: { title: { display: true, text: 'Latency bucket (ms)' } },
      y: { title: { display: true, text: 'Request count' } }
    }
  }
}
);
  })();
</script>

<div class="highlight"><pre data-lang=""><code>| Metric | Closed-loop  |
|--------|-------------:|
| Count  |       13,500 |
| p50    |      1.00 ms |
| p90    |      1.00 ms |
| p99    |      1.07 ms |
| p99.9  |    200.15 ms |
| max    |    200.28 ms |</code></pre></div>
<p>The dashboard looks clean. 99th percentile: 1 ms. Only p99.9 shows any trouble — and that&rsquo;s 27 requests out of 13,500, the ones that directly hit a pause. Every other request: ~1 ms, tight distribution, no tail. You read the numbers and move on.</p>
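<p>Why p99 stays clean while p99.9 does not follows from the counts: 27 slow requests are 0.2% of 13,500, which is below the 1% that p99 looks past and above the 0.1% that p99.9 looks past. A minimal sketch with an idealized distribution (exactly 1 ms and exactly 200 ms) and a nearest-rank percentile; the companion <code>LatencyRecorder</code> may compute percentiles differently, and <code>PercentileSketch</code> is a hypothetical name.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.Linq;

public static class PercentileSketch
{
    // Nearest-rank percentile over an ascending-sorted array.
    public static double Percentile(double[] sorted, double p)
    {
        int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }

    // Idealized closed-loop run: 13,473 requests at 1 ms, 27 at 200 ms.
    public static double[] IdealizedClosedLoop()
    {
        double[] latencies = Enumerable.Repeat(1.0, 13_473)
            .Concat(Enumerable.Repeat(200.0, 27))
            .ToArray();
        Array.Sort(latencies);
        return latencies;
    }

    public static void Main()
    {
        double[] latencies = IdealizedClosedLoop();
        foreach (double p in new[] { 50.0, 90.0, 99.0, 99.9 })
            Console.WriteLine($&#34;p{p} = {Percentile(latencies, p)} ms&#34;);
    }
}</code></pre></div>
<p><small>Illustrative sketch. In this idealized data the first three percentiles report 1 ms and only p99.9 reports 200 ms, the same shape as the table above.</small></p>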
<p>The dashboard maps what the test recorded — not what users experienced.</p>
<p>Hume (1739): no finite set of observations guarantees the next. A thousand closed-loop measurements say p99 = 1 ms. The thousand-and-first doesn&rsquo;t have to agree. Induction from data that systematically omits the worst moments is induction from a sample that excludes its own counterexamples.</p>
<hr>
<h2 id="flip-one-switch">Flip one switch</h2>
<p>Same service. Same pause injector. Same nominal target rate. One change: the client sends on a fixed schedule, regardless of whether the previous request came back.</p>
<div class="highlight"><pre data-lang="csharp"><code>public static LatencyReport Run(SimulatedService service, int ratePerSec, int durationSec)
{
    var recorder = new LatencyRecorder();
    long intervalTicks = Stopwatch.Frequency / ratePerSec;
    long deadline = Stopwatch.GetTimestamp() &#43; (long)durationSec * Stopwatch.Frequency;
    long nextSend = Stopwatch.GetTimestamp();

    while (Stopwatch.GetTimestamp() &lt; deadline)
    {
        long intendedStart = nextSend;
        nextSend &#43;= intervalTicks;

        service.Process();

        long now = Stopwatch.GetTimestamp();
        long latency = now - intendedStart;  // ← intended, not actual
        recorder.Record(latency);

        while (Stopwatch.GetTimestamp() &lt; nextSend)
            Thread.SpinWait(10);
    }

    return recorder.GetReport();
}</code></pre></div>
<p><small>Open-loop client — full source in companion code. Note: <code>intervalTicks</code> uses integer division, introducing sub-microsecond step quantization at 450 req/sec — negligible for this demonstration.</small></p>
<p>One line carries the difference: <code>now - intendedStart</code>, where the closed-loop client measured from <code>start</code>, the moment it actually sent the request. The user&rsquo;s clock starts when they click, not when the server gets around to processing their request. When the service pauses, requests that should have been sent during the pause pile up, and each is measured from when it was <em>supposed</em> to start, because that&rsquo;s when the user started waiting.</p>
<div class="chart-container">
  <canvas id="chart-b8b70701f9ac8d7fb985049ac934642f"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-b8b70701f9ac8d7fb985049ac934642f').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['0–2', '2–50', '50–100', '100–150', '150–200', '200+'],
    datasets: [{
      label: 'Request count',
      data: [9100, 900, 1100, 1100, 1273, 27],
      backgroundColor: '#f38ba8',
      borderColor: '#f38ba8',
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Open-loop latency distribution — 13,500 requests' },
      subtitle: { display: true, text: 'Same service, same target rate — bimodal distribution' },
      legend: { display: false }
    },
    scales: {
      x: { title: { display: true, text: 'Latency bucket (ms)' } },
      y: { title: { display: true, text: 'Request count' } }
    }
  }
}
);
  })();
</script>

<p>Bimodal. A peak at ~1 ms and a wide spread from 2 ms to 200 ms. Two different experiences on the same chart.</p>
<div class="chart-container">
  <canvas id="chart-0bccaa457ae3fd03e19b00bddc547d51"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-0bccaa457ae3fd03e19b00bddc547d51').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['p50', 'p90', 'p99', 'p99.9', 'max'],
    datasets: [
      {
        label: 'Closed-loop',
        data: [1.00, 1.00, 1.07, 200.15, 200.28],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'Open-loop',
        data: [1.00, 137.89, 194.64, 200.15, 200.41],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Closed-loop vs open-loop — percentile comparison' },
      subtitle: { display: true, text: 'Same service, same target rate, same pauses — different measurement' },
      legend: { display: true }
    },
    scales: {
      x: { title: { display: true, text: 'Percentile' } },
      y: {
        type: 'logarithmic',
        title: { display: true, text: 'Latency (ms) — log scale' },
        min: 0.5,
        max: 500
      }
    }
  }
}
);
  })();
</script>

<div class="highlight"><pre data-lang=""><code>| Metric | Closed-loop  |    Open-loop |     Ratio |
|--------|-------------:|-------------:|----------:|
| Count  |       13,500 |       13,500 |           |
| p50    |      1.00 ms |      1.00 ms |      1.0x |
| p90    |      1.00 ms |    137.89 ms |    137.9x |
| p99    |      1.07 ms |    194.64 ms |    182.4x |
| p99.9  |    200.15 ms |    200.15 ms |      1.0x |
| max    |    200.28 ms |    200.41 ms |      1.0x |</code></pre></div>
<p><small>Ratios computed from raw data before rounding to displayed precision.</small></p>
<p>Same system. Same load. Same pause. One variable: whether the test waits for a response before sending the next request.</p>
<p>Closed-loop p99 = 1 ms. Open-loop p99 = 195 ms. <strong>182× on this workload.</strong></p>
<hr>
<h2 id="the-mechanism--coordinated-omission">The mechanism — coordinated omission</h2>
<p>During a 200 ms pause, the closed-loop client waits. While waiting, it sends no new requests: it coordinates with the system, slowing down exactly when the system slows down. At the target rate, 200 ms × 450 req/sec = 90 requests that <em>should have</em> been sent but weren&rsquo;t. They don&rsquo;t appear in the histogram. They don&rsquo;t exist in the data. The dashboard stays clean.</p>
<p>The open-loop client doesn&rsquo;t coordinate. It tracks what the schedule <em>should have been</em>. After the pause resolves:</p>
<ul>
<li>Request N+1: intended at T+2 ms, completed at T+201 ms → latency = <strong>199 ms</strong></li>
<li>Request N+2: intended at T+4 ms, completed at T+202 ms → latency = <strong>198 ms</strong></li>
<li>Request N+3: intended at T+7 ms, completed at T+203 ms → latency = <strong>196 ms</strong></li>
<li>&hellip;catch-up continues for ~160 requests until the schedule recovers</li>
</ul>
<p>Each pause contaminates ~160 subsequent requests with elevated latency: a request takes ~1 ms to serve while the schedule advances ~2.2 ms, so the backlog shrinks by only ~1.2 ms per request. 27 pauses × ~160 requests = ~4,300 requests (roughly a third of all traffic) experiencing latency between 2 ms and 200 ms. That&rsquo;s why the open-loop p90 is 138 ms: the top 10% of requests (1,350 out of 13,500) fall squarely in that contaminated range.</p>
<p>The closed-loop client sees 27 bad requests. The open-loop client sees 4,300. Same service. Same pauses.</p>
<p>The worse the failure, the more requests the closed-loop client skips, and the cleaner the dashboard looks. The visibility of a problem shrinks as its severity grows. At 450 req/sec, a 200 ms pause omits 90 measurements. A 2-second pause omits 900. A 10-second GC stop-the-world omits 4,500. The worst event your system can produce is the one your test is least likely to record.</p>
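<p>The arithmetic above can be sketched in a few lines. This is a hedged illustration, not the companion code: it assumes a fixed 450 req/sec schedule and a service time of exactly 1 ms, and the helper names (<code>OmittedRequests</code>, <code>CatchUpRequests</code>) are invented for this example.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

// Requests a closed-loop client never sends while it waits out a pause.
static long OmittedRequests(long pauseMs, long ratePerSec) =&gt; pauseMs * ratePerSec / 1000;

// Requests an open-loop client needs before its backlog is drained:
// each request costs serviceMs while the schedule advances by one interval.
static double CatchUpRequests(double pauseMs, double ratePerSec, double serviceMs)
{
    double intervalMs = 1000.0 / ratePerSec;
    return pauseMs / (intervalMs - serviceMs);
}

foreach (long pauseMs in new long[] { 200, 2_000, 10_000 })
    Console.WriteLine($&#34;{pauseMs} ms pause omits {OmittedRequests(pauseMs, 450)} requests&#34;);

Console.WriteLine($&#34;catch-up after a 200 ms pause: ~{CatchUpRequests(200, 450, 1.0):F0} requests&#34;);</code></pre></div>
<p>With these assumptions the catch-up estimate comes out at about 164 requests per pause, in line with the ~160 quoted above; the measured service time is only approximately 1 ms, so the real figure moves a little.</p>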
<p>Gil Tene named this <strong>Coordinated Omission</strong> — the test coordinates with the system&rsquo;s failures, omitting measurements precisely when they would be most damning.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<p>Baudrillard (1981): the third phase of the simulacrum — the image masks the <em>absence</em> of reality. The closed-loop benchmark doesn&rsquo;t distort measurements. It masks their nonexistence. Those 90 requests during the pause aren&rsquo;t poorly measured. They don&rsquo;t exist. The dashboard is a simulacrum — it doesn&rsquo;t lie about the system. It replaces it.</p>
<hr>
<h2 id="how-to-stop-coordinating">How to stop coordinating</h2>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Closed-loop</th>
          <th>Open-loop</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Request timing</td>
          <td>After previous response</td>
          <td>Fixed schedule, independent of response</td>
      </tr>
      <tr>
          <td>What it measures</td>
          <td>Response time of sent requests (omits unsent)</td>
          <td>Response time from intended start (incl. queuing)</td>
      </tr>
      <tr>
          <td>During a pause</td>
          <td>Stops sending → omits measurements</td>
          <td>Tracks intended schedule → captures queuing</td>
      </tr>
      <tr>
          <td>p99 under pauses</td>
          <td>Looks clean (only direct hits visible)</td>
          <td>Shows full impact (queued requests visible)</td>
      </tr>
      <tr>
          <td>Best for</td>
          <td>Throughput measurement, saturation testing</td>
          <td>Latency measurement, SLA validation</td>
      </tr>
  </tbody>
</table>
<p>Four rules for latency measurement:</p>
<ol>
<li>
<p><strong>Open-loop by default for latency load tests.</strong> Closed-loop is still useful for throughput and saturation testing — finding the breaking point. But if your SLAs are latency percentiles, you need open-loop. Closed-loop tells you the system <em>can</em> handle the load; open-loop tells you what users <em>experience</em> while it does.<sup id="fnref1:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
</li>
<li>
<p><strong>Measure from intended time, not actual time.</strong> <code>latency = now - intendedStart</code>, not <code>now - actualStart</code>. The user&rsquo;s clock starts when they click, not when the server gets around to reading their request.</p>
</li>
<li>
<p><strong>Record the full tail.</strong> p50 and p99 are not enough. Report p99.9 and max. In the run above, the closed-loop p99 is 1.07 ms and the 200 ms pauses only surface at p99.9; a report that stops at p99 shows nothing, while open-loop already shows the damage at p90.</p>
</li>
<li>
<p><strong>Use histograms that can handle it.</strong> HdrHistogram<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> records values across a wide dynamic range with configurable precision — from sub-millisecond to multi-second latencies in the same histogram. Fixed-bucket histograms clip the tail.</p>
</li>
</ol>
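<p>Rules 1 and 2 can be illustrated with a small virtual-time simulation. The sketch below is not the companion code: it assumes a constant 1 ms service time, a 200 ms pause on every 500th request, and a single-threaded client, and it uses a simulated clock instead of real waiting. It records each request twice, once from the moment it was actually sent and once from the moment the schedule says it should have been sent.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.Linq;

const int Requests = 13_500;            // 450 req/sec for 30 sec
const double IntervalMs = 1000.0 / 450; // intended spacing between requests
const double ServiceMs = 1.0;           // assumed baseline service time
const double PauseMs = 200.0;           // injected on every 500th request

var fromActualStart = new double[Requests];   // closed-loop style measurement
var fromIntendedStart = new double[Requests]; // open-loop style measurement

double clock = 0; // virtual time in ms; nothing here sleeps
for (int i = 0; i &lt; Requests; i&#43;&#43;)
{
    double intendedStart = i * IntervalMs;
    // A request cannot start before the previous one has finished.
    double actualStart = Math.Max(clock, intendedStart);
    double service = ServiceMs &#43; (i % 500 == 499 ? PauseMs : 0);
    clock = actualStart &#43; service;

    fromActualStart[i] = clock - actualStart;     // now - actualStart
    fromIntendedStart[i] = clock - intendedStart; // now - intendedStart
}

// Nearest-rank percentile over an already sorted array.
static double Percentile(double[] sorted, double p)
{
    int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
    return sorted[Math.Clamp(rank - 1, 0, sorted.Length - 1)];
}

var closed = fromActualStart.OrderBy(x =&gt; x).ToArray();
var open = fromIntendedStart.OrderBy(x =&gt; x).ToArray();
foreach (double p in new[] { 50.0, 90.0, 99.0, 99.9, 100.0 })
    Console.WriteLine($&#34;p{p}: actualStart={Percentile(closed, p):F2} ms  intendedStart={Percentile(open, p):F2} ms&#34;);</code></pre></div>
<p>Both columns come from the same simulated timeline; only the starting point of the stopwatch differs. The exact values will not match the table above, since the real run has measured rather than idealized service times.</p>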
<h3 id="tools-that-get-it-right">Tools that get it right</h3>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Open-loop</th>
          <th>CO correction</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>wrk2<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></td>
          <td>Yes</td>
          <td>Built-in</td>
          <td>Constant-rate HTTP benchmark, HdrHistogram output</td>
      </tr>
      <tr>
          <td>Gatling</td>
          <td>Yes</td>
          <td>Configurable</td>
          <td>Open-loop mode available, reports percentiles</td>
      </tr>
      <tr>
          <td>k6</td>
          <td>Partial</td>
          <td>Manual</td>
          <td>Constant-rate via scenarios, no auto-correction</td>
      </tr>
      <tr>
          <td>Custom (this post)</td>
          <td>Yes</td>
          <td>By design</td>
          <td><code>intendedStart</code> tracking, HdrHistogram.NET</td>
      </tr>
  </tbody>
</table>
<p>Capabilities and defaults vary by tool version and configuration; verify settings in your release.</p>
<h3 id="run-it-yourself">Run it yourself</h3>
<div class="highlight"><pre data-lang="bash"><code>git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/coordinated-omission
dotnet run -c Release</code></pre></div>
<hr>
<h2 id="benchmark-environment">Benchmark environment</h2>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CPU</td>
          <td>2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)</td>
      </tr>
      <tr>
          <td>RAM</td>
          <td>~115 GB DDR3-1866 (quad-channel per socket)</td>
      </tr>
      <tr>
          <td>OS</td>
          <td>Fedora Linux 42 (kernel 6.17)</td>
      </tr>
      <tr>
          <td>Runtime</td>
          <td>.NET 9.0.11 (RyuJIT AVX)</td>
      </tr>
      <tr>
          <td>SDK</td>
          <td>.NET SDK 10.0.102</td>
      </tr>
      <tr>
          <td>HdrHistogram</td>
          <td>HdrHistogram.NET 2.5.0</td>
      </tr>
      <tr>
          <td>Simulation</td>
          <td>450 req/sec, 30 sec, 200 ms pause every 500 requests</td>
      </tr>
  </tbody>
</table>
<p>Not BenchmarkDotNet — this is a custom in-process simulation. SpinWait calibrated at startup for ~1 ms baseline on current hardware (binary search, 50 samples, median). Fresh <code>SimulatedService</code> instance per client — no counter contamination.</p>
<p><strong>Limitations:</strong> In-process simulation — no HTTP, no network stack, no kernel-level queuing. The open-loop client is single-threaded and blocks on <code>Process()</code>, so it tracks the intended schedule rather than dispatching concurrently (a real open-loop system like wrk2 or Gatling sends requests asynchronously). These simplifications isolate the coordinated omission mechanism from transport noise — the measurement effect is the same, but absolute numbers would differ in a networked setup.</p>
<hr>
<p>Popper (1934): a meaningful test must be capable of producing a negative result. The closed-loop client cannot falsify the hypothesis &ldquo;the system is healthy&rdquo; — it hides the counterexamples. Measurements that would disprove it don&rsquo;t exist. Open-loop is the falsification instrument: it doesn&rsquo;t ask the system whether it&rsquo;s ready. It measures regardless.</p>
<p>Each layer of deception sits closer to you. Design — visible in the code. Environment — visible in the configuration. The method of collection — buried in an assumption you never questioned. Data collected correctly. But what do the data mean?</p>
<p>A metric that looks better the worse the system performs isn&rsquo;t a metric. It&rsquo;s anesthesia.</p>
<hr>
<h2 id="further-reading">Further reading</h2>
<ul>
<li>Gil Tene, <a href="https://www.youtube.com/watch?v=lJ8ydIuPFeU">How NOT to Measure Latency</a> (Strange Loop 2015) — the definitive talk on coordinated omission, open vs closed loop, and why percentile measurements lie.<sup id="fnref1:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></li>
<li>Gil Tene, <a href="https://www.infoq.com/presentations/latency-response-time/">How NOT to Measure Latency</a> (QCon San Francisco 2015) — recorded version of the talk, more on why averages and even p99 are insufficient without the full distribution.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></li>
<li>Schroeder, Wierman, Harchol-Balter, <a href="https://www.usenix.org/conference/nsdi-06/open-versus-closed-cautionary-tale">Open Versus Closed: A Cautionary Tale</a> (NSDI 2006) — the formal paper showing that open-loop and closed-loop produce fundamentally different results.<sup id="fnref2:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></li>
<li>Dean &amp; Barroso, <a href="https://dl.acm.org/doi/10.1145/2408776.2408794">The Tail at Scale</a> (CACM 2013) — why tail latency matters in distributed systems, fan-out amplification.<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup></li>
<li>Ousterhout, <a href="https://dl.acm.org/doi/10.1145/3213770">Always Measure One Level Deeper</a> (CACM 2018) — the general principle: measure the layer below where you think the problem is.<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup></li>
<li><a href="https://hdrhistogram.github.io/HdrHistogram/">HdrHistogram</a> — high dynamic range histogram for latency recording, with coordinated omission correction. Ports: Java, C#, C, Go, Rust, JavaScript, Python, Erlang.<sup id="fnref1:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></li>
<li>Gil Tene, <a href="https://github.com/giltene/wrk2">wrk2</a> — constant-rate HTTP benchmark with built-in coordinated omission correction and HdrHistogram output.<sup id="fnref1:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></li>
<li>Brendan Gregg, <a href="https://www.brendangregg.com/activebenchmarking.html">Active Benchmarking</a> — methodology and anti-patterns for honest measurement.<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup></li>
<li>Martin Thompson, <a href="https://mechanical-sympathy.blogspot.com/">Mechanical Sympathy</a> — latency-focused systems programming, false sharing, memory access patterns.<sup id="fnref:9"><a href="#fn:9" class="footnote-ref" role="doc-noteref">9</a></sup></li>
<li>Andrey Akinshin, <em>Pro .NET Benchmarking</em> (Apress, 2019) — comprehensive guide to .NET measurement, including percentile pitfalls.<sup id="fnref:10"><a href="#fn:10" class="footnote-ref" role="doc-noteref">10</a></sup></li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Schroeder, Wierman, Harchol-Balter, <a href="https://www.usenix.org/conference/nsdi-06/open-versus-closed-cautionary-tale">Open Versus Closed: A Cautionary Tale</a>, NSDI 2006. The formal demonstration that open-loop and closed-loop benchmarks produce fundamentally different performance characteristics — even on the same system under the same nominal load.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref2:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Gil Tene, <a href="https://www.youtube.com/watch?v=lJ8ydIuPFeU">How NOT to Measure Latency</a>, Strange Loop 2015. Defines coordinated omission, demonstrates the mechanism, introduces HdrHistogram. The single most important talk on latency measurement.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p><a href="https://hdrhistogram.github.io/HdrHistogram/">HdrHistogram</a> by Gil Tene. Records values across a configurable dynamic range (e.g., 1 microsecond to 1 hour) with uniform precision at any percentile level. .NET port: <a href="https://www.nuget.org/packages/HdrHistogram/">HdrHistogram.NET</a> on NuGet.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Gil Tene, <a href="https://github.com/giltene/wrk2">wrk2</a>. A fork of wrk that maintains a constant request rate (open-loop) and records latency from intended send time. The output includes full HdrHistogram percentile data — no coordinated omission by construction.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>Gil Tene, <a href="https://www.infoq.com/presentations/latency-response-time/">How NOT to Measure Latency</a>, QCon San Francisco 2015. Why the mean is useless, why p99 isn&rsquo;t enough, why you need the full distribution.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:6">
<p>Dean &amp; Barroso, <a href="https://dl.acm.org/doi/10.1145/2408776.2408794">The Tail at Scale</a>, CACM 2013. In a fan-out architecture, the probability of hitting at least one slow backend grows with the number of backends. Tail latency isn&rsquo;t a statistics curiosity — it&rsquo;s the dominant user experience at scale.&#160;<a href="#fnref:6" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:7">
<p>Ousterhout, <a href="https://dl.acm.org/doi/10.1145/3213770">Always Measure One Level Deeper</a>, CACM 2018. The general principle: if the numbers don&rsquo;t make sense, measure the layer below. Coordinated omission is a measurement-layer problem — you have to look at <em>how</em> the test records latency, not just <em>what</em> it reports.&#160;<a href="#fnref:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:8">
<p>Brendan Gregg, <a href="https://www.brendangregg.com/activebenchmarking.html">Active Benchmarking</a>. Methodology for honest benchmarking: verify work done, eliminate perturbation, report confidence. Includes a section on coordinated omission as a common anti-pattern.&#160;<a href="#fnref:8" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:9">
<p>Martin Thompson, <a href="https://mechanical-sympathy.blogspot.com/">Mechanical Sympathy</a>. Blog series on latency-sensitive systems programming — false sharing, memory access patterns, lock-free data structures. Context for understanding why sub-millisecond measurement matters.&#160;<a href="#fnref:9" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:10">
<p>Andrey Akinshin, <em>Pro .NET Benchmarking</em> (Apress, 2019). The BenchmarkDotNet author&rsquo;s comprehensive treatment of measurement in .NET — warmup, outliers, statistics, environment control, percentile reporting.&#160;<a href="#fnref:10" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>]]></content:encoded>
    </item>
    <item>
      <title>First Things First: Enemies of Measurement</title>
      <link>https://0x3f.blog/posts/first-things-first-enemies-of-measurement/</link>
      <pubDate>Fri, 27 Feb 2026 17:00:00 +0100</pubDate>
      <guid>https://0x3f.blog/posts/first-things-first-enemies-of-measurement/</guid>
      <description>Six forces that change benchmark results 2–6× without changing the algorithm. Same storage engine, same data, same machine — different answers.</description>
      <content:encoded><![CDATA[<h2 id="same-engine-different-answers">Same engine, different answers</h2>
<p>Design fixed. Environment changed: cache temperature, GC pressure, data order, JIT tier. The numbers move by 2–6× without touching the algorithm.</p>
<table>
  <thead>
      <tr>
          <th>Enemy</th>
          <th>Effect</th>
          <th>What it distorts</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1. JIT Optimization Level</td>
          <td>6×</td>
          <td>Machine code quality</td>
      </tr>
      <tr>
          <td>2. GC Pauses</td>
          <td>2.3×</td>
          <td>Allocation in hot path</td>
      </tr>
      <tr>
          <td>3. System Noise</td>
          <td>3.7× σ</td>
          <td>Measurement variance</td>
      </tr>
      <tr>
          <td>4. Cache State</td>
          <td>2.9×</td>
          <td>Memory hierarchy</td>
      </tr>
      <tr>
          <td>5. Branch Predictor</td>
          <td>5.0×</td>
          <td>Data order</td>
      </tr>
      <tr>
          <td>6. Dead Code Elimination</td>
          <td>5.9×</td>
          <td>Return type</td>
      </tr>
  </tbody>
</table>
<p>The first three, BenchmarkDotNet defends against — if you know to look. The last three, you&rsquo;re on your own. Some enemies use the storage engine directly (E2, E3, E4). Others isolate CPU-level effects using data derived from the storage engine (E1, E5, E6) — because these distortions hide in any hot path, not just <code>Insert</code> and <code>Get</code>.</p>
<p>All code: <a href="https://github.com/0x3f-blog/companion-code">clone, build, run</a>. Numbers below: dual Xeon E5-2697 v2, 48 threads, 30 MB L3 per socket, ~115 GB DDR3-1866, Fedora 42, .NET 9.0.11 (RyuJIT AVX), BenchmarkDotNet v0.14.0.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> No WAL — these enemies hide in the in-memory path, where fsync can&rsquo;t drown the signal. Different hardware, different numbers — that&rsquo;s half the lesson.</p>
<hr>
<h2 id="enemy-1--jit-optimization-level">Enemy 1 — JIT Optimization Level</h2>
<p>The storage engine holds 100,000 rows (via <code>Row.Generate</code>). Setup extracts all payloads into a contiguous <code>byte[]</code> of ~14.5 MB — an integrity-check scenario. Two versions of the same loop. Same data. Same operation. One difference: <code>[MethodImpl(MethodImplOptions.NoOptimization)]</code> — forcing the JIT to emit completely unoptimized code (no register promotion, no SIMD, no bounds check elimination).</p>
<p>Descartes: <em>de omnibus dubitandum est</em> — doubt everything, starting with your own setup. This is <em>not</em> a Tier-0 vs Tier-1 comparison.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> <code>NoOptimization</code> disables <em>all</em> optimizations — the absolute lower bound. Real Tier-0 → Tier-1 transitions (short methods without loops, where Tier-0 applies) show 2–4×. The 6× here is the extreme case, deliberately exaggerated to make the enemy visible.</p>
<div class="highlight"><pre data-lang="csharp"><code>[DisassemblyDiagnoser(maxDepth: 3)]
public class E1_JitWarmup
{
    private byte[] _payload; // ~14.5 MB — all payloads from 100K rows

    [GlobalSetup]
    public void Setup()
    {
        using var table = new StripedTable&lt;int, Row&gt;();
        for (int i = 0; i &lt; 100_000; i&#43;&#43;)
            table.Insert(i, Row.Generate(i));

        // Extract all payloads into contiguous array
        // ... (full source in companion code)
    }

    [Benchmark]
    [MethodImpl(MethodImplOptions.NoOptimization)]
    public long SumPayloadCold()
    {
        long sum = 0;
        var data = _payload;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            sum &#43;= data[i];
        return sum;
    }

    [Benchmark(Baseline = true)]
    public long SumPayloadWarm()
    {
        long sum = 0;
        var data = _payload;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            sum &#43;= data[i];
        return sum;
    }
}</code></pre></div>
<p>Identical loop. Identical data. Identical result.</p>
<div class="chart-container">
  <canvas id="chart-fbc74b58a700254e6911c71790250880"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-fbc74b58a700254e6911c71790250880').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['Unoptimized (NoOptimization — 124 B)', 'Optimized (default JIT — 49 B)'],
    datasets: [{
      label: 'Mean (ms)',
      data: [49.764, 8.247],
      backgroundColor: ['#f38ba8', '#89b4fa'],
      borderColor: ['#f38ba8', '#89b4fa'],
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'JIT optimization impact — payload checksum over 100K rows' },
      subtitle: { display: true, text: 'Same loop, same data — NoOptimization vs default JIT (not Tier-0 vs Tier-1)' },
      legend: { display: false }
    },
    scales: {
      y: { title: { display: true, text: 'Mean (ms)' }, min: 0, max: 55 }
    }
  }
}
);
  })();
</script>

<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">Code Size</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SumPayloadCold</td>
          <td style="text-align: right">49.764 ms</td>
          <td style="text-align: right">124 B</td>
          <td style="text-align: right">6.03</td>
      </tr>
      <tr>
          <td>SumPayloadWarm</td>
          <td style="text-align: right">8.247 ms</td>
          <td style="text-align: right">49 B</td>
          <td style="text-align: right">1.00</td>
      </tr>
  </tbody>
</table>
<p><strong>6× on this hardware.</strong> The <code>[DisassemblyDiagnoser]</code> on the class generates full JIT output in <code>BenchmarkDotNet.Artifacts/results/</code>: 124 bytes of machine code vs 49. The unoptimized path pays for stack-based locals, bounds checks on every array access, and scalar arithmetic, one byte at a time. The optimized path gets register promotion and bounds check elimination; at 49 bytes it is still a scalar loop, because RyuJIT does not auto-vectorize a reduction like this one. Same source code. Different machine code. 6× gap (remember: this is the extreme case; real Tier-0 → Tier-1 deltas are smaller but still significant).</p>
<p>BenchmarkDotNet runs warmup iterations by default (6–50 adaptive, plus 15–100 measurement iterations)<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>, conservative enough that Tier-0 code is normally recompiled at Tier-1 before measurement begins. Defense exists. But in benchmarks where tiered compilation actually applies (short methods without loops, where Tier-0 <em>is</em> the first compile), setting the warmup count too low, or benchmarking a method that is invoked too few times to cross the recompilation threshold, can let unoptimized code leak into the measurement window. The first enemy hides in the JIT pipeline, and the <code>DisassemblyDiagnoser</code> is the most direct way to see it.</p>
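<p>A rough way to see the optimization level at work outside BenchmarkDotNet is to time the same loop under different <code>MethodImplOptions</code> settings. The sketch below rests on several assumptions: random bytes stand in for the real payloads, a crude <code>Stopwatch</code> loop replaces BDN&rsquo;s statistics, and <code>AggressiveOptimization</code> is added here only to show a method that skips tiering entirely. The <code>Sums</code> class and <code>Time</code> helper are invented for this example.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

var data = new byte[14_500_000]; // stand-in for the ~14.5 MB payload array
new Random(42).NextBytes(data);

Time(&#34;NoOptimization&#34;, Sums.Unoptimized, data);
Time(&#34;default JIT&#34;, Sums.Default, data);
Time(&#34;AggressiveOptimization&#34;, Sums.Aggressive, data);

static void Time(string label, Func&lt;byte[], long&gt; sum, byte[] data)
{
    for (int i = 0; i &lt; 50; i&#43;&#43;) sum(data); // crude warmup so tiering can settle

    var sw = Stopwatch.StartNew();
    long result = 0;
    for (int i = 0; i &lt; 20; i&#43;&#43;) result = sum(data);
    sw.Stop();

    Console.WriteLine($&#34;{label,-24} {sw.Elapsed.TotalMilliseconds / 20:F3} ms/call (sum={result})&#34;);
}

static class Sums
{
    // Never optimized and never promoted by tiered compilation.
    [MethodImpl(MethodImplOptions.NoOptimization)]
    public static long Unoptimized(byte[] data)
    {
        long sum = 0;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;) sum &#43;= data[i];
        return sum;
    }

    // Subject to the runtime default tiering policy.
    public static long Default(byte[] data)
    {
        long sum = 0;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;) sum &#43;= data[i];
        return sum;
    }

    // Compiled fully optimized on first call, bypassing tiering.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static long Aggressive(byte[] data)
    {
        long sum = 0;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;) sum &#43;= data[i];
        return sum;
    }
}</code></pre></div>
<p>All three methods print the same sum, which is the point: the source is identical and only the generated machine code differs. Treat the timings as indicative only; this loop has none of the safeguards the benchmark above relies on.</p>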
<hr>
<h2 id="enemy-2--gc-pauses">Enemy 2 — GC Pauses</h2>
<p>Insert 100,000 rows into <code>StripedTable</code>. Same keys, same table, same final state. One difference: where the <code>Row</code> objects come from.</p>
<div class="highlight"><pre data-lang="csharp"><code>[MemoryDiagnoser]
public class E2_GcPauses
{
    private const int N = 100_000;
    private ITable&lt;int, Row&gt; _table;
    private int[] _keys;
    private Row[] _preAllocated;

    [GlobalSetup]  // keys &#43; rows generated once, reused across iterations
    public void Setup()
    {
        var rng = new Random(42);
        _keys = new int[N];
        _preAllocated = new Row[N];
        for (int i = 0; i &lt; N; i&#43;&#43;)
        {
            _keys[i] = rng.Next(0, 200_000);
            _preAllocated[i] = Row.Generate(_keys[i]);
        }
    }

    [IterationSetup]
    public void IterationSetup()
    {
        _table = new StripedTable&lt;int, Row&gt;(); // fresh table per iteration
    }

    [Benchmark]
    public void InsertAllocHeavy()
    {
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _table.Insert(_keys[i], Row.Generate(_keys[i])); // new byte[] per insert
    }

    [Benchmark(Baseline = true)]
    public void InsertPreAllocated()
    {
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _table.Insert(_keys[i], _preAllocated[i]); // no per-insert allocation
    }
}</code></pre></div>
<p><code>Row.Generate(key)</code> allocates a fresh <code>byte[32..256]</code> on every call. 100K inserts = 100K allocations = GC pressure. The baseline pre-allocates all rows in <code>GlobalSetup</code>, so the hot path makes no per-insert payload allocations. (The 7.52 MB reported for the baseline in the results table comes from <code>ConcurrentDictionary</code> internal growth, a cost both methods pay.)</p>
<div class="chart-container">
  <canvas id="chart-0fd375f3d9bcb7aa2f7d3d92f7afe5b1"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-0fd375f3d9bcb7aa2f7d3d92f7afe5b1').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['AllocHeavy (23.9 MB alloc)', 'PreAllocated (7.52 MB alloc)'],
    datasets: [{
      label: 'Mean (ms)',
      data: [36.81, 16.35],
      backgroundColor: ['#f38ba8', '#89b4fa'],
      borderColor: ['#f38ba8', '#89b4fa'],
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'GC pause impact — 100K inserts into StripedTable' },
      subtitle: { display: true, text: 'Row.Generate per insert (allocation) vs pre-allocated rows' },
      legend: { display: false }
    },
    scales: {
      y: { title: { display: true, text: 'Mean (ms)' }, min: 0, max: 45 }
    }
  }
}
);
  })();
</script>

<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">StdDev</th>
          <th style="text-align: right">Allocated</th>
          <th style="text-align: right">Alloc Ratio</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>InsertAllocHeavy</td>
          <td style="text-align: right">36.81 ms</td>
          <td style="text-align: right">1.744 ms</td>
          <td style="text-align: right">23.9 MB</td>
          <td style="text-align: right">3.18</td>
          <td style="text-align: right">2.25</td>
      </tr>
      <tr>
          <td>InsertPreAllocated</td>
          <td style="text-align: right">16.35 ms</td>
          <td style="text-align: right">1.448 ms</td>
          <td style="text-align: right">7.52 MB</td>
          <td style="text-align: right">1.00</td>
          <td style="text-align: right">1.00</td>
      </tr>
  </tbody>
</table>
<p><strong>2.3× on this workload.</strong> <code>MemoryDiagnoser</code> shows why: 23.9 MB allocated vs 7.52 MB. Both methods grow the <code>ConcurrentDictionary</code> from scratch (fresh table per iteration), but <code>AllocHeavy</code> adds 100K <code>Row.Generate</code> allocations on top, each creating a new <code>byte[]</code>. The extra allocation pressure triggers GC collections mid-measurement, and each pause adds microseconds that accumulate into milliseconds. Look at <code>StdDev</code>: 1.74 ms for the allocating path. BenchmarkDotNet also flagged <code>PreAllocated</code>, which still allocates 7.52 MB through dictionary growth, as <em>bimodal</em> (mValue = 3.94). That is consistent with GC pauses splitting the distribution into two clusters: iterations where a collection fired and iterations where it didn&rsquo;t. GC pauses are non-deterministic: sometimes a collection lands inside the timed region, sometimes it doesn&rsquo;t.</p>
<p>BenchmarkDotNet can force GC between iterations (<a href="https://benchmarkdotnet.org/articles/configs/jobs.html"><code>GcForce</code></a>) and report allocation pressure (<a href="https://benchmarkdotnet.org/articles/configs/diagnosers.html"><code>MemoryDiagnoser</code></a>). The defense exists — but you have to <em>look</em>. A benchmark that allocates in the hot path and doesn&rsquo;t report memory is measuring GC behavior, not your algorithm. The <code>StdDev</code> rises and nobody knows why.</p>
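<p>The kind of signal <code>MemoryDiagnoser</code> surfaces can be approximated by hand with two runtime counters. This is a minimal sketch under stated assumptions: the <code>new byte[...]</code> line is a hypothetical stand-in for <code>Row.Generate</code>, and the <code>sink</code> array plays the role of the table that keeps rows alive.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

const int N = 100_000;
var rng = new Random(42);
var sink = new byte[N][]; // keeps payloads reachable, like rows stored in a table

int gen0Before = GC.CollectionCount(0);
long bytesBefore = GC.GetAllocatedBytesForCurrentThread();

// Hot path under observation: one fresh payload per iteration.
for (int i = 0; i &lt; N; i&#43;&#43;)
    sink[i] = new byte[rng.Next(32, 257)]; // 32..256 bytes, upper bound exclusive

long allocated = GC.GetAllocatedBytesForCurrentThread() - bytesBefore;
int gen0 = GC.CollectionCount(0) - gen0Before;

Console.WriteLine($&#34;allocated: {allocated / 1e6:F1} MB, gen0 collections inside the region: {gen0}&#34;);
GC.KeepAlive(sink);</code></pre></div>
<p>A non-zero collection count inside the measured region is the hint that the timing includes GC work. The counts depend on GC mode and heap state, so they will differ between machines and runs.</p>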
<hr>
<h2 id="enemy-3--system-noise">Enemy 3 — System Noise</h2>
<p>Two identical methods. Same table. Same data. Same code — literally copy-paste. The table is pre-populated in <code>GlobalSetup</code> — every <code>Insert</code> is an update, not a growth event. Deterministic, constant-cost work where OS noise is the only variable.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<div class="highlight"><pre data-lang="csharp"><code>public class E3_OsNoise
{
    private const int N = 100_000;
    private ITable&lt;int, Row&gt; _table;
    private int[] _keys;
    private Row[] _rows;

    [GlobalSetup]
    public void Setup()
    {
        _table = new StripedTable&lt;int, Row&gt;();
        // ... generate keys and rows ...
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _table.Insert(_keys[i], _rows[i]);  // pre-populate
    }

    [Benchmark(Baseline = true)]
    public void InsertBaseline()
    {
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _table.Insert(_keys[i], _rows[i]);
    }

    [Benchmark]
    public void InsertSame()
    {
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _table.Insert(_keys[i], _rows[i]);
    }
}</code></pre></div>
<p>The interesting number isn&rsquo;t the ratio between methods — it&rsquo;s the <code>StdDev</code> across two <em>runs</em> of the same benchmark under different conditions:</p>
<div class="highlight"><pre data-lang="bash"><code># Linux-only: taskset (part of util-linux) sets CPU affinity and is not available on macOS/Windows

# === Run 1: Noisy — saturate all CPU cores, then benchmark ===

# If the script exits (Ctrl-C or error), kill all background jobs automatically
trap &#39;kill $(jobs -p) 2&gt;/dev/null&#39; EXIT

# Spawn one infinite busy loop per CPU core — fills the scheduler with work
# $(nproc) returns your core count (e.g. 48), each loop burns 100% of one core
for i in $(seq 1 $(nproc)); do
  (while true; do :; done) &amp;   # &amp; sends each loop to background
done

# Now run the benchmark — the OS scheduler must fight for CPU time
dotnet run -c Release -- --filter &#39;*E3*&#39;

# Stop all busy loops
kill $(jobs -p)

# === Run 2: Isolated — pin benchmark to a single core, no contention ===

# taskset -c 0 = run only on CPU 0, no migration (other processes can still be scheduled there unless the core is isolated)
taskset -c 0 dotnet run -c Release -- --filter &#39;*E3*&#39;</code></pre></div>
<p><strong>Noisy run</strong> (all cores saturated):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">StdDev</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>InsertBaseline</td>
          <td style="text-align: right">18.95 ms</td>
          <td style="text-align: right">0.945 ms</td>
          <td style="text-align: right">1.00</td>
      </tr>
      <tr>
          <td>InsertSame</td>
          <td style="text-align: right">18.11 ms</td>
          <td style="text-align: right">0.583 ms</td>
          <td style="text-align: right">0.96</td>
      </tr>
  </tbody>
</table>
<p><strong>Isolated run</strong> (pinned to core 0, idle system):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">StdDev</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>InsertBaseline</td>
          <td style="text-align: right">12.98 ms</td>
          <td style="text-align: right">0.252 ms</td>
          <td style="text-align: right">1.00</td>
      </tr>
      <tr>
          <td>InsertSame</td>
          <td style="text-align: right">13.17 ms</td>
          <td style="text-align: right">0.254 ms</td>
          <td style="text-align: right">1.01</td>
      </tr>
  </tbody>
</table>
<div class="chart-container">
  <canvas id="chart-784ff5851d7c3933d8da2d1e3dc9ceb8"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-784ff5851d7c3933d8da2d1e3dc9ceb8').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['InsertBaseline', 'InsertSame'],
    datasets: [
      {
        label: 'Noisy (all cores saturated)',
        data: [0.945, 0.583],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      },
      {
        label: 'Isolated (pinned core)',
        data: [0.252, 0.254],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'System noise impact — StdDev of identical insert loops' },
      subtitle: { display: true, text: 'Same code, same data, same machine — different running conditions' },
      legend: { display: true }
    },
    scales: {
      y: { title: { display: true, text: 'StdDev (ms)' }, min: 0, max: 1.1 }
    }
  }
}
);
  })();
</script>

<p>Same code. Same data. Same machine. For <code>InsertBaseline</code>, the noisy run is 46% slower (mean) and <strong>3.7× noisier</strong> (StdDev); <code>InsertSame</code> shifts by 38% and 2.3×. The noise isn&rsquo;t just the OS scheduler; it&rsquo;s the entire system under contention. Thread migration between cores leaves the warm L1/L2 contents behind. Context switches inject 10–100 μs of jitter.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup> Competing processes saturate the memory bus and evict cache lines that the benchmark needs. Interrupts and kernel work preempt the benchmark thread mid-iteration. Under CPU saturation, these effects stack: on a 13 ms insert loop, the mean shifts by 46% and the variance explodes. On a 100 μs microbenchmark, the effect is destruction, not noise.</p>
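<p>The ratios quoted above are plain arithmetic on the two tables. A minimal sketch, assuming the <code>InsertBaseline</code> rows; the <code>NoiseSummary</code> and <code>Run</code> names are hypothetical, and the coefficient of variation (StdDev divided by mean) is an extra normalization that the tables do not report:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

static class NoiseSummary
{
    // One row of a BenchmarkDotNet summary table (hypothetical helper type).
    public readonly record struct Run(double MeanMs, double StdDevMs)
    {
        // Relative noise: StdDev as a fraction of the mean.
        public double CoefficientOfVariation =&gt; StdDevMs / MeanMs;
    }

    // Fractional slowdown of the noisy mean relative to the isolated mean.
    public static double MeanShift(Run noisy, Run isolated) =&gt;
        noisy.MeanMs / isolated.MeanMs - 1.0;

    // How many times larger the noisy StdDev is.
    public static double StdDevRatio(Run noisy, Run isolated) =&gt;
        noisy.StdDevMs / isolated.StdDevMs;

    static void Main()
    {
        var noisy = new Run(18.95, 0.945);     // InsertBaseline, all cores saturated
        var isolated = new Run(12.98, 0.252);  // InsertBaseline, pinned to core 0

        Console.WriteLine($&#34;mean shift:   {MeanShift(noisy, isolated):P0}&#34;);
        Console.WriteLine($&#34;StdDev ratio: {StdDevRatio(noisy, isolated):F2}x&#34;);
        Console.WriteLine($&#34;CV noisy:     {noisy.CoefficientOfVariation:P1}&#34;);
        Console.WriteLine($&#34;CV isolated:  {isolated.CoefficientOfVariation:P1}&#34;);
    }
}</code></pre></div>
<p>Dividing StdDev by the mean is worth the extra line because the noisy run moves both numbers at once; the raw StdDev ratio alone does not say how much of the spread is relative to a larger mean.</p>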
<p>The defense: <code>taskset</code> pins the process to a core (add <code>nice -n -20</code> as root for higher priority), and more iterations average out the noise. BenchmarkDotNet&rsquo;s <a href="https://benchmarkdotnet.org/articles/configs/jobs.html"><code>MinIterationCount</code></a> and <a href="https://benchmarkdotnet.org/articles/configs/jobs.html"><code>Affinity</code></a> (a CPU core mask, the in-process equivalent of <code>taskset</code>) settings help. But the scheduler is always there, and the smaller your operation, the larger the enemy.</p>
<hr>
<p>Three enemies down. All three live in the execution environment — BenchmarkDotNet can detect or mitigate them because it controls the process. The next three live at the boundary between your code and the hardware. Korzybski (1933): <em>the map is not the territory.</em> The framework maps the process. It can&rsquo;t map a dataset that fits in L3, a data order that trains the branch predictor, or a return type that lets the JIT eliminate your computation. Those are your choices — and the hardware responds to them silently.</p>
<hr>
<h2 id="enemy-4--cache-state">Enemy 4 — Cache State</h2>
<p>Random <code>Get()</code> on <code>StripedTable</code> — in-memory, no WAL (hence nanosecond latencies, not microsecond-scale numbers where fsync dominates). Same operation. Same code. One parameter: how many entries in the table.</p>
<div class="highlight"><pre data-lang="csharp"><code>public class E4_CacheState
{
    private const int LookupCount = 100_000;

    [Params(10_000, 2_000_000)]
    public int TableSize { get; set; }

    private ITable&lt;int, Row&gt; _table;
    private int[] _lookupKeys;

    [GlobalSetup]
    public void Setup()
    {
        _table = new StripedTable&lt;int, Row&gt;();
        for (int i = 0; i &lt; TableSize; i&#43;&#43;)
            _table.Insert(i, Row.Generate(i));
        // ... random lookup keys ...
    }

    [Benchmark(OperationsPerInvoke = LookupCount)]
    public Row? LookupRandom()
    {
        Row? last = default;
        var table = _table;
        var keys = _lookupKeys;
        for (int i = 0; i &lt; LookupCount; i&#43;&#43;)
            last = table.Get(keys[i]);
        return last;
    }
}</code></pre></div>
<p><code>OperationsPerInvoke</code> divides total time by 100K — reporting per-lookup latency. Same <code>Get()</code>. Same <code>StripedTable</code>. Different table size.</p>
<div class="chart-container">
  <canvas id="chart-a3ed0e909a3d6554cae2af400231ddbd"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-a3ed0e909a3d6554cae2af400231ddbd').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['10K entries (fits L3)', '2M entries (spills to DRAM)'],
    datasets: [{
      label: 'Per-lookup latency (ns)',
      data: [17.05, 50.08],
      backgroundColor: ['#89b4fa', '#f38ba8'],
      borderColor: ['#89b4fa', '#f38ba8'],
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Cache hierarchy impact — random Get() on StripedTable' },
      subtitle: { display: true, text: 'Same operation, same code — different table size' },
      legend: { display: false }
    },
    scales: {
      y: { title: { display: true, text: 'Latency per lookup (ns)' }, min: 0, max: 70 }
    }
  }
}
);
  })();
</script>

<table>
  <thead>
      <tr>
          <th>TableSize</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">StdDev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>10,000</td>
          <td style="text-align: right">17.05 ns</td>
          <td style="text-align: right">0.190 ns</td>
      </tr>
      <tr>
          <td>2,000,000</td>
          <td style="text-align: right">50.08 ns</td>
          <td style="text-align: right">1.919 ns</td>
      </tr>
  </tbody>
</table>
<p><strong>2.9× on this hardware.</strong> <code>StdDev</code> tells the rest of the story.</p>
<p>10K entries: the benchmark&rsquo;s working set — <code>ConcurrentDictionary</code> bucket arrays (~80 KB) and <code>Node</code> objects (~400 KB) — totals ~500 KB, comfortably within the 30 MB L3 on the local socket (dual-socket NUMA — each socket has its own 30 MB L3; the benchmark thread runs on one).<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup> The <code>Row</code> payloads (~1.4 MB of <code>byte[]</code>) exist on the heap but <code>LookupRandom</code> never dereferences them — it returns the <code>Row</code> struct, not the data. So only the dictionary traversal structure needs to fit in cache. Every lookup hits cached memory. <code>StdDev</code> is 0.19 ns — tight, repeatable.</p>
<p>2M entries: the dictionary working set (bucket arrays ~32 MB + nodes ~80 MB ≈ 112 MB) exceeds L3 by a wide margin and spills to DRAM. Random access means random cache misses — each miss costs 60–100 ns instead of 4–12 ns. <code>StdDev</code> jumps to 1.9 ns — 10× noisier — because DRAM latency varies with access pattern, NUMA topology, and memory controller contention.</p>
<p>Cache doesn&rsquo;t just change the speed; it changes the <em>quality</em> of the measurement. Tight numbers, low StdDev, repeatable results, and potentially misleading. Popper (1934): a benchmark can falsify a hypothesis but never confirm one. The 2.9× gap and 10× StdDev increase point at the cache hierarchy. <code>perf stat -e cache-misses,cache-references</code> would corroborate that, but the measurement already suggests the answer.</p>
<p>Same symptom — inflated speed and false confidence. Different cause. Hot cache vs cold DRAM.</p>
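<p>The fits-or-spills reasoning above can be written down as a quick check before running anything. A minimal sketch, assuming the size estimates quoted in the text (they are estimates, not measurements) and the 30 MB per-socket L3 of this machine; <code>WorkingSetCheck</code> and <code>FitsInL3</code> are hypothetical names:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

static class WorkingSetCheck
{
    const long KB = 1024, MB = 1024 * KB;
    const long L3Bytes = 30 * MB;  // per-socket L3 on the benchmark machine

    // Bucket-array and node sizes as estimated in the text.
    static readonly (string Name, long Buckets, long Nodes)[] Cases =
    {
        (&#34;10K entries&#34;, 80 * KB, 400 * KB),
        (&#34;2M entries&#34;, 32 * MB, 80 * MB),
    };

    public static bool FitsInL3(long workingSetBytes) =&gt; workingSetBytes &lt;= L3Bytes;

    static void Main()
    {
        foreach (var c in Cases)
        {
            long total = c.Buckets &#43; c.Nodes;
            double shareOfL3 = total / (double)L3Bytes;
            Console.WriteLine(
                $&#34;{c.Name}: {total / (double)MB:F1} MB, {shareOfL3:F2}x L3, fits: {FitsInL3(total)}&#34;);
        }
    }
}</code></pre></div>
<p>The check is deliberately crude: it ignores everything else competing for L3 and says nothing about L1/L2. Its only job is to make the cold-versus-hot choice a conscious one when picking <code>TableSize</code>.</p>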
<hr>
<h2 id="enemy-5--branch-predictor-training">Enemy 5 — Branch Predictor Training</h2>
<p>Scan the results from the storage engine. <code>Row.Generate(key)</code> produces payloads of 32–256 bytes (formula: <code>32 + key % 225</code>). Count how many exceed a threshold. Standard aggregation — the kind you&rsquo;d run after querying the table.</p>
<div class="highlight"><pre data-lang="csharp"><code>public class E5_BranchPredictor
{
    [Params(8_000_000)]
    public int N { get; set; }

    private int[] _sorted;  // Row sizes from Row.Generate formula, sorted
    private int[] _random;  // Same values, shuffled

    [GlobalSetup]
    public void Setup()
    {
        _sorted = new int[N];
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _sorted[i] = 32 &#43; (i % 225);  // Row.Generate payload formula
        Array.Sort(_sorted);

        _random = _sorted.ToArray();
        new Random(42).Shuffle(_random);
    }

    [Benchmark]
    public int ScanSorted()
    {
        int count = 0, threshold = 150;
        var data = _sorted;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            if (data[i] &gt; threshold) count&#43;&#43;;
        return count;
    }

    [Benchmark(Baseline = true)]
    public int ScanRandom()
    {
        int count = 0, threshold = 150;
        var data = _random;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            if (data[i] &gt; threshold) count&#43;&#43;;
        return count;
    }
}</code></pre></div>
<p>Same values. Same count returned. Both arrays accessed sequentially — the prefetcher treats them identically.<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup> Same memory layout, same access pattern. Only the value order differs — which is what branch predictors respond to.</p>
<div class="chart-container">
  <canvas id="chart-19f69055fb99c286b2bc0911279df80d"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-19f69055fb99c286b2bc0911279df80d').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['Sorted (predictable)', 'Random (unpredictable)'],
    datasets: [{
      label: 'Mean (ms)',
      data: [8.214, 41.363],
      backgroundColor: ['#a6e3a1', '#f38ba8'],
      borderColor: ['#a6e3a1', '#f38ba8'],
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Branch prediction impact — 8M Row sizes, threshold filter' },
      subtitle: { display: true, text: 'Same values from Row.Generate formula, different order' },
      legend: { display: false }
    },
    scales: {
      y: { title: { display: true, text: 'Mean (ms)' }, min: 0, max: 50 }
    }
  }
}
);
  })();
</script>

<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ScanSorted</td>
          <td style="text-align: right">8.214 ms</td>
          <td style="text-align: right">0.20</td>
      </tr>
      <tr>
          <td>ScanRandom</td>
          <td style="text-align: right">41.363 ms</td>
          <td style="text-align: right">1.00</td>
      </tr>
  </tbody>
</table>
<p><strong>5.0× on this hardware.</strong> Same algorithm, same data, same cache behavior — different order.</p>
<p>Threshold 150 splits the range roughly in half — 106 out of 225 possible values exceed it (~47%). Near 50–50 is maximum branch unpredictability.<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup> The sorted array presents a clean pattern: all values below threshold, then all above. The branch predictor learns after a few iterations and predicts correctly for millions of subsequent elements. The shuffled array is a coin flip every iteration — the predictor guesses wrong ~47% of the time, and each misprediction costs 15–20 cycles while the pipeline flushes and refills.</p>
<p>Sequential keys feed the prefetcher, which is a data design problem. Here the memory access pattern is identical in both runs and only the value order differs, yet the branch predictor likely changes the result without your knowledge. You&rsquo;re trying to measure the storage engine&rsquo;s aggregation cost. You&rsquo;re mostly measuring the CPU pipeline&rsquo;s response to data order.</p>
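<p>One way to test the branch-predictor explanation is to remove the branch and see whether the sorted/shuffled gap goes with it. The sketch below is not part of the companion code; it swaps the <code>if</code> for a sign-bit trick and only demonstrates that both forms return the same count. The names are hypothetical, and whether the gap actually closes on a given CPU and runtime is something to measure, not assume:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.Linq;

static class BranchlessScan
{
    // Branchy form, same shape as ScanSorted / ScanRandom.
    public static int CountBranchy(int[] data, int threshold)
    {
        int count = 0;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            if (data[i] &gt; threshold) count&#43;&#43;;
        return count;
    }

    // Branch-free form: (threshold - v) is negative exactly when v &gt; threshold,
    // so its sign bit (bit 31) is the 0 or 1 to add.
    // Assumes threshold - v cannot overflow, which holds for values in 32..256.
    public static int CountBranchless(int[] data, int threshold)
    {
        int count = 0;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            count &#43;= (int)((uint)(threshold - data[i]) &gt;&gt; 31);
        return count;
    }

    // Same generator as the benchmark setup: 32 &#43; (i % 225), then sorted.
    public static int[] MakeSorted(int n)
    {
        var values = new int[n];
        for (int i = 0; i &lt; n; i&#43;&#43;)
            values[i] = 32 &#43; (i % 225);
        Array.Sort(values);
        return values;
    }

    static void Main()
    {
        int[] sorted = MakeSorted(225 * 1000);   // 1000 full cycles of the 225 values
        int[] shuffled = sorted.ToArray();
        new Random(42).Shuffle(shuffled);        // Random.Shuffle requires .NET 8 or later

        // 106 of the 225 values exceed 150, so every line prints 106000.
        Console.WriteLine(CountBranchy(sorted, 150));
        Console.WriteLine(CountBranchless(sorted, 150));
        Console.WriteLine(CountBranchless(shuffled, 150));
    }
}</code></pre></div>
<p>If a branch-free variant benchmarks the same on sorted and shuffled input while the branchy one does not, data order was the variable being measured.</p>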
<hr>
<h2 id="enemy-6--dead-code-elimination">Enemy 6 — Dead Code Elimination</h2>
<p>Sum the data from <code>Row.Generate</code>&rsquo;s formula — a checksum for integrity verification. 10 million iterations, pure arithmetic: <code>32 + (i % 225)</code>. No memory access. No exceptions. No side effects.</p>
<div class="highlight"><pre data-lang="csharp"><code>[DisassemblyDiagnoser(maxDepth: 3)]
public class E6_DeadCode
{
    [Params(10_000_000)]
    public int N { get; set; }

    [Benchmark]
    public void ChecksumEliminated()
    {
        long checksum = 0;
        for (int i = 0; i &lt; N; i&#43;&#43;)
            checksum &#43;= 32 &#43; (i % 225);
        // checksum not returned — JIT drops the accumulation
    }

    [Benchmark(Baseline = true)]
    public long ChecksumPreserved()
    {
        long checksum = 0;
        for (int i = 0; i &lt; N; i&#43;&#43;)
            checksum &#43;= 32 &#43; (i % 225);
        return checksum;
    }
}</code></pre></div>
<p>Identical loop. One returns the result. One doesn&rsquo;t.</p>
<div class="chart-container">
  <canvas id="chart-30568c15ff06091da31d3c38fb6c0b2d"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-30568c15ff06091da31d3c38fb6c0b2d').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['Eliminated (void — 21 B)', 'Preserved (return — 66 B)'],
    datasets: [{
      label: 'Mean (ms)',
      data: [3.750, 22.220],
      backgroundColor: ['#f38ba8', '#89b4fa'],
      borderColor: ['#f38ba8', '#89b4fa'],
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Dead code elimination — checksum over Row.Generate formula' },
      subtitle: { display: true, text: 'void (JIT strips accumulation) vs return (full computation)' },
      legend: { display: false }
    },
    scales: {
      y: { title: { display: true, text: 'Mean (ms)' }, min: 0, max: 25 }
    }
  }
}
);
  })();
</script>

<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">Code Size</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChecksumEliminated</td>
          <td style="text-align: right">3.750 ms</td>
          <td style="text-align: right">21 B</td>
          <td style="text-align: right">0.17</td>
      </tr>
      <tr>
          <td>ChecksumPreserved</td>
          <td style="text-align: right">22.220 ms</td>
          <td style="text-align: right">66 B</td>
          <td style="text-align: right">1.00</td>
      </tr>
  </tbody>
</table>
<p><strong>5.9× on this hardware.</strong> The <code>DisassemblyDiagnoser</code> shows why — the actual machine code for both methods:</p>
<div class="highlight"><pre data-lang="nasm"><code>; ChecksumEliminated — 21 bytes
M00_L00:
  inc   eax          ; i&#43;&#43;
  cmp   eax, ecx     ; i &lt; N?
  jl    M00_L00      ; loop

; ChecksumPreserved — 66 bytes
M00_L00:
  mov   edx, 91A2B3C5 ; magic constant for i % 225
  imul  esi            ; compiler-generated modulo
  ; ... 8 more instructions for 32 &#43; (i % 225) ...
  add   rcx, rax      ; checksum &#43;= result
  inc   esi            ; i&#43;&#43;
  cmp   esi, edi       ; i &lt; N?
  jl    M00_L00        ; loop</code></pre></div>
<p><code>[DisassemblyDiagnoser]</code> on the class generates this — run the benchmark and check <code>BenchmarkDotNet.Artifacts/results/</code> for the full listing (HTML + Markdown).</p>
<p>The JIT determined that <code>checksum</code> has no observable side effects — nobody reads it — and stripped out the entire accumulation. What remains is <code>inc/cmp/jl</code>: the loop counter, iterating 10 million times over nothing.<sup id="fnref:9"><a href="#fn:9" class="footnote-ref" role="doc-noteref">9</a></sup> The fix is simple: <a href="https://benchmarkdotnet.org/articles/guides/good-practices.html#avoid-dead-code-elimination">always return the computed value</a> so the JIT must preserve it.</p>
<p>Here&rsquo;s what makes this the most dangerous enemy: <strong>3.75 ms looks plausible.</strong> It&rsquo;s not zero. It&rsquo;s not suspiciously fast. It looks like a reasonable time for 10 million iterations of lightweight arithmetic. Without <code>DisassemblyDiagnoser</code>, you&rsquo;d trust it. You&rsquo;d compare it against another implementation. You&rsquo;d ship a conclusion based on a number that measures empty loop iterations.</p>
<p>21 bytes vs 66 bytes. The disassembler is the only reliable way to catch this. Because the lie that looks reasonable is worse than the lie that looks absurd.</p>
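<p>Returning the value keeps the loop alive; checking the returned value against an independent calculation also confirms that the loop computes what you think it does. A minimal sketch outside BenchmarkDotNet, assuming the same formula as <code>ChecksumPreserved</code>; <code>ChecksumCheck</code> and its methods are hypothetical names, and the closed form is derived here rather than taken from the companion code:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

static class ChecksumCheck
{
    // Same loop as ChecksumPreserved: the result is returned, so it is observable.
    public static long Loop(int n)
    {
        long checksum = 0;
        for (int i = 0; i &lt; n; i&#43;&#43;)
            checksum &#43;= 32 &#43; (i % 225);
        return checksum;
    }

    // Closed form: n copies of 32, plus q full cycles of 0..224
    // (each summing to 224 * 225 / 2 = 25,200), plus a partial cycle 0..r-1.
    public static long ClosedForm(int n)
    {
        long q = n / 225, r = n % 225;
        return 32L * n &#43; q * (224L * 225 / 2) &#43; r * (r - 1) / 2;
    }

    static void Main()
    {
        const int N = 10_000_000;
        long loop = Loop(N), expected = ClosedForm(N);
        Console.WriteLine($&#34;loop={loop} closedForm={expected} match={loop == expected}&#34;);
    }
}</code></pre></div>
<p>A check like this belongs in a unit test or <code>[GlobalSetup]</code>, not in the measured method. It does not replace the disassembly, but a benchmark whose result is consumed and verified cannot have had its work silently dropped.</p>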
<hr>
<h2 id="know-your-enemies">Know your enemies</h2>
<table>
  <thead>
      <tr>
          <th>Enemy</th>
          <th>Effect</th>
          <th>Symptom</th>
          <th>Defense</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1. JIT Optimization Level</td>
          <td>6×†</td>
          <td>NoOptimization 6× slower (†extreme case; real Tier-0→1: 2–4×)</td>
          <td>Warmup (BDN default) + DisassemblyDiagnoser</td>
      </tr>
      <tr>
          <td>2. GC Pauses</td>
          <td>2.3×</td>
          <td>Allocation in hot path, StdDev spike</td>
          <td>MemoryDiagnoser + GcForce + pre-allocate</td>
      </tr>
      <tr>
          <td>3. System Noise</td>
          <td>3.7× StdDev</td>
          <td>Mean +46%, StdDev 3.7× under load</td>
          <td>taskset + nice + more iterations</td>
      </tr>
      <tr>
          <td>4. Cache State</td>
          <td>2.9×</td>
          <td>Working set &gt; L3</td>
          <td>Conscious choice: cold vs warm vs hot</td>
      </tr>
      <tr>
          <td>5. Branch Predictor</td>
          <td>5.0×</td>
          <td>Sorted data 5× faster</td>
          <td>Realistic (shuffled) data</td>
      </tr>
      <tr>
          <td>6. Dead Code Elimination</td>
          <td>5.9×</td>
          <td>Code Size 21 B vs 66 B</td>
          <td><a href="https://benchmarkdotnet.org/articles/guides/good-practices.html#avoid-dead-code-elimination">Return result</a> + DisassemblyDiagnoser</td>
      </tr>
  </tbody>
</table>
<p>Each enemy alone shifted the result 2–6× on this hardware. Stack three and the benchmark and production are different universes.</p>
<p>A reference checklist — not a universal shield, but a starting point that covers what BDN configuration <em>can</em> cover (enemies 1–3) and adds inspection tooling for what it can&rsquo;t (enemies 4–6). The enemy benchmarks in the companion code intentionally don&rsquo;t use it — defenses must be <em>down</em> to show the enemies in action:</p>
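<div class="highlight"><pre data-lang="csharp"><code>using System;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Jobs;

public class EnemyDefenseConfig : ManualConfig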
{
    public EnemyDefenseConfig()
    {
        AddJob(Job.Default
            .WithWarmupCount(3)               // E1: ensure Tier-1 before measurement
            .WithGcServer(true)               // E2: Server GC — fewer, larger collections
            .WithGcForce(true)                // E2: force GC between iterations
            .WithMinIterationCount(15)        // E3: average out scheduler noise
            .WithMaxIterationCount(100)       // E3: let BDN adapt when noise is present
            .WithAffinity((IntPtr)0b11));     // E3: pin to cores 0–1

        AddDiagnoser(MemoryDiagnoser.Default);              // E2: allocation pressure
        AddDiagnoser(new DisassemblyDiagnoser(              // E1&#43;E6: JIT output
            new DisassemblyDiagnoserConfig(maxDepth: 3)));

        AddColumn(StatisticColumn.StdDev);                  // E3: noise visible
    }
}</code></pre></div>
<p>Enemies 1–3: configuration. Enemies 4–6: conscious data design. No config setting shuffles your test data for you.</p>
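<p>Applying the checklist is one attribute per benchmark class. A minimal sketch, assuming the <code>EnemyDefenseConfig</code> class above is in the same project and the BenchmarkDotNet package is referenced; <code>ChecksumBenchmarks</code> is a hypothetical class, not one from the companion code:</p>
<div class="highlight"><pre data-lang="csharp"><code>using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[Config(typeof(EnemyDefenseConfig))]   // job, diagnosers and StdDev column from the checklist
public class ChecksumBenchmarks        // hypothetical benchmark class
{
    [Params(10_000_000)]
    public int N { get; set; }

    [Benchmark]
    public long Checksum()
    {
        long checksum = 0;
        for (int i = 0; i &lt; N; i&#43;&#43;)
            checksum &#43;= 32 &#43; (i % 225);
        return checksum;               // returned so the JIT must keep the accumulation
    }
}

public static class Program
{
    // Passing args through lets --filter select benchmarks from the command line.
    public static void Main(string[] args) =&gt;
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}</code></pre></div>
<p>The config only covers the process-level enemies. The data fed to the benchmark is still chosen in <code>[GlobalSetup]</code>, which is where enemies 4 and 5 have to be handled.</p>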
<h3 id="run-it-yourself">Run it yourself</h3>
<div class="highlight"><pre data-lang="bash"><code>git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/enemies-of-measurement

# All six enemies
dotnet run -c Release

# One enemy at a time
dotnet run -c Release -- --filter &#39;*E5*&#39;

# OS noise comparison (Linux) — see E3 section for full commands
trap &#39;kill $(jobs -p) 2&gt;/dev/null&#39; EXIT
for i in $(seq 1 $(nproc)); do (while true; do :; done) &amp; done
dotnet run -c Release -- --filter &#39;*E3*&#39;
kill $(jobs -p)
taskset -c 0 dotnet run -c Release -- --filter &#39;*E3*&#39;</code></pre></div>
<p>The <em>direction</em> reproduces. The exact ratios depend on your hardware.</p>
<hr>
<h2 id="benchmark-environment">Benchmark environment</h2>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CPU</td>
          <td>2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)</td>
      </tr>
      <tr>
          <td>L3 Cache</td>
          <td>30 MB per socket</td>
      </tr>
      <tr>
          <td>RAM</td>
          <td>~115 GB DDR3-1866 (quad-channel per socket)</td>
      </tr>
      <tr>
          <td>OS</td>
          <td>Fedora Linux 42 (kernel 6.17)</td>
      </tr>
      <tr>
          <td>Runtime</td>
          <td>.NET 9.0.11 (RyuJIT AVX)</td>
      </tr>
      <tr>
          <td>SDK</td>
          <td>.NET SDK 10.0.102</td>
      </tr>
      <tr>
          <td>BenchmarkDotNet</td>
          <td>v0.14.0</td>
      </tr>
      <tr>
          <td>Job</td>
          <td>DefaultJob (BDN auto-selects iteration count, typically 15+)</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>Server GC, Concurrent (BDN enables Server GC in benchmark processes by default; host process uses Workstation)</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>In-memory (no WAL) — enemies hide in the in-memory path</td>
      </tr>
      <tr>
          <td>Power</td>
          <td><code>performance</code> governor, no frequency scaling</td>
      </tr>
      <tr>
          <td>Hygiene</td>
          <td>No browser, IDE, or heavy processes during runs</td>
      </tr>
  </tbody>
</table>
<p><strong>No WAL in this post.</strong> These enemies operate in the in-memory path, where fsync can&rsquo;t drown the signal.</p>
<hr>
<h2 id="we-walked-the-same-path">We walked the same path</h2>
<p>Same storage engine. Same path. Different place.</p>
<p>Heraclitus (~500 BCE): <em>you cannot step into the same river twice.</em> JIT, GC, scheduler, cache, branch predictor, dead code — the river moved between measurements. Six enemies, each shifting the answer 2–6× on this hardware. They stack.</p>
<p>A number that survives design review but not these six enemies is a comfortable lie — it looks right, it feels reproducible, and it&rsquo;s wrong.</p>
<p>Don&rsquo;t trust a number that hasn&rsquo;t survived six enemies.</p>
<hr>
<h2 id="further-reading">Further reading</h2>
<ul>
<li>Mytkowicz et al., <a href="https://dl.acm.org/doi/10.1145/1508244.1508275">Producing Wrong Data Without Doing Anything Obviously Wrong</a>, ASPLOS 2009 — how link order, environment variable size, and filesystem layout change benchmark results by 30%+.</li>
<li>Georges, Buytaert &amp; Eeckhout, <a href="https://dl.acm.org/doi/10.1145/1297027.1297033">Statistically Rigorous Java Performance Evaluation</a>, OOPSLA 2007 — methodology: how many iterations, confidence intervals, how to report results.</li>
<li>Curtsinger &amp; Berger, <a href="https://dl.acm.org/doi/10.1145/2451116.2451141">Stabilizer: Statistically Sound Performance Evaluation</a>, ASPLOS 2013 — randomizing code/data layout to eliminate cache alignment bias.</li>
<li>Blackburn et al., <a href="https://dl.acm.org/doi/10.1145/1378704.1378723">Wake Up and Smell the Coffee</a>, CACM 2008 — GC-aware benchmarking, steady-state vs startup.</li>
<li>Fog, <a href="https://www.agner.org/optimize/microarchitecture.pdf">Microarchitecture of Intel, AMD and VIA CPUs</a>, 2024 — branch prediction, cache hierarchy, instruction latency tables.</li>
<li>Fog, <a href="https://www.agner.org/optimize/optimizing_cpp.pdf">Optimizing Software in C++</a>, 2024 — compiler optimizations, dead code elimination, benchmark pitfalls.</li>
<li>Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a>, 2007 — cache hierarchy, prefetch, NUMA, TLB. Still the definitive reference.</li>
<li>Akinshin, <a href="https://link.springer.com/book/10.1007/978-1-4842-4941-3">Pro .NET Benchmarking</a>, Apress 2019 — author of BenchmarkDotNet, comprehensive treatment of all six enemies.</li>
<li>Gregg, <a href="https://www.brendangregg.com/systems-performance-2nd-edition-book.html">Systems Performance</a>, 2nd ed., 2020 — CPU scheduler, context switches, interrupt coalescing.</li>
<li><a href="https://benchmarkdotnet.org/">BenchmarkDotNet Documentation</a> — <a href="https://benchmarkdotnet.org/articles/configs/jobs.html">Jobs</a> (WarmupCount, IterationCount defaults), <a href="https://benchmarkdotnet.org/articles/guides/good-practices.html">Good Practices</a> (dead code, setup/cleanup), <a href="https://benchmarkdotnet.org/articles/configs/diagnosers.html">MemoryDiagnoser</a>, <a href="https://benchmarkdotnet.org/articles/configs/diagnosers.html">DisassemblyDiagnoser</a>.</li>
<li>.NET Runtime, <a href="https://github.com/dotnet/runtime/blob/main/docs/design/features/tiered-compilation.md">Tiered Compilation Design Doc</a> — Tier 0 → Tier 1 → OSR.</li>
<li>.NET Runtime, <a href="https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/garbage-collection.md">GC Design Doc</a> — Gen0/1/2, Server vs Workstation GC, suspension.</li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>BenchmarkDotNet uses <code>DefaultJob</code> for all benchmarks. E2 reports a custom job name (<code>Job-XSSCPO</code>) because <code>[IterationSetup]</code> forces <code>InvocationCount=1</code> and <code>UnrollFactor=1</code> — BDN cannot batch-invoke methods that require per-iteration setup. The iteration count is still auto-selected.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>.NET&rsquo;s tiered compilation: Tier-0 (quick JIT: fast compile, slow code) → Tier-1 (optimized: slow compile, fast code). From .NET Core 3.0 through .NET 6, <em>quick JIT for loops</em> was disabled by default (<code>TC_QuickJitForLoops</code> off), so methods containing loops were compiled straight to fully optimized code. Since .NET 7 it is enabled by default on x64 and Arm64, and on-stack replacement (OSR) moves a hot loop from Tier-0 code to optimized code while it is running. <code>NoOptimization</code> is more extreme than Tier-0: it disables <em>all</em> optimizations, not just the expensive ones. For the full pipeline, see .NET Runtime <a href="https://github.com/dotnet/runtime/blob/main/docs/design/features/tiered-compilation.md">Tiered Compilation Design Doc</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>BenchmarkDotNet <code>DefaultJob</code> settings: <code>MinWarmupIterationCount</code> = 6, <code>MaxWarmupIterationCount</code> = 50 (adaptive), <code>MinIterationCount</code> = 15, <code>MaxIterationCount</code> = 100 (adaptive). See <a href="https://benchmarkdotnet.org/articles/configs/jobs.html">BenchmarkDotNet Jobs documentation</a> and source: <a href="https://github.com/dotnet/BenchmarkDotNet/blob/master/src/BenchmarkDotNet/Jobs/JobExtensions.cs"><code>DefaultConfig</code></a>.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Unlike E2 (which uses <code>[IterationSetup]</code> for a fresh table per iteration — because GC pressure needs fresh allocations), E3 intentionally uses <code>[GlobalSetup]</code> with a pre-populated table. Every iteration does updates to existing keys, not inserts that grow the <code>ConcurrentDictionary</code>. Fresh-table inserts add resize variance that drowns the OS noise signal we&rsquo;re trying to isolate.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>Gregg, <em>Systems Performance</em> (2nd ed., 2020), Ch. 6. Context switch overhead varies from ~5 μs (hot cache, same core) to 100+ μs (cold cache, cross-NUMA migration). On a dual-socket system, thread migration between sockets adds memory access latency on top of the pipeline flush.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:6">
<p>Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a>, 2007, sections 3 and 6. L1: ~1 ns, L2: ~4 ns, L3: ~12 ns, DRAM: 60–100 ns. Random access to a dataset larger than L3 falls back to full DRAM latency — no prefetch, no spatial locality, every access is a cache miss.&#160;<a href="#fnref:6" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:7">
<p>Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a>, 2007, sections 3.3 and 6.2. Sequential access triggers hardware prefetch — the CPU loads cache lines before code asks for them. Random access falls back to full DRAM latency.&#160;<a href="#fnref:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:8">
<p>Fog, <a href="https://www.agner.org/optimize/microarchitecture.pdf">Microarchitecture of Intel, AMD and VIA CPUs</a>, 2024, section 3. Branch prediction uses pattern history tables. A perfectly sorted sequence is trivially predictable after the transition point. A uniformly random ~50/50 pattern achieves the worst-case misprediction rate — the predictor has no pattern to learn. Each misprediction flushes the pipeline (15–20 cycles on modern Intel).&#160;<a href="#fnref:8" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:9">
<p>The JIT&rsquo;s dead code elimination for <code>ChecksumEliminated</code> is partial: it removes the accumulation (<code>checksum += 32 + (i % 225)</code>) because the result is never observed, but retains the loop counter (<code>i++</code>, compare, branch). The method still executes 10M loop iterations — it just does nothing useful in each one. This produces a plausible-looking 3.75 ms instead of the expected ~22 ms. The <code>DisassemblyDiagnoser</code> reveals the difference: 21 bytes of machine code (inc/cmp/jl) vs 66 bytes (full arithmetic + accumulation).&#160;<a href="#fnref:9" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>]]></content:encoded>
    </item>
    <item>
      <title>First Things First: Why Benchmarks Lie</title>
      <link>https://0x3f.blog/posts/first-things-first-why-benchmarks-lie/</link>
      <pubDate>Tue, 24 Feb 2026 21:00:00 +0100</pubDate>
      <guid>https://0x3f.blog/posts/first-things-first-why-benchmarks-lie/</guid>
      <description>Dictionary vs ConcurrentDictionary in C#: one benchmark says 2× faster, another 17× slower. Three scenarios showing why BenchmarkDotNet results lie.</description>
      <content:encoded><![CDATA[<h2 id="272m-opssec--and-a-lie">27.2M ops/sec — and a lie</h2>
<p>Same two classes. Same data. Same machine. One benchmark says <code>Dictionary + lock</code> is 2× faster. Another says <code>ConcurrentDictionary</code> is 17× faster. A third says it doesn&rsquo;t matter — fsync buries the difference in noise. Same optimization — three verdicts.</p>
<p>All code in this post: clone, build, run. Numbers below: dual Xeon E5-2697 v2, 48 threads, 30 MB L3 per socket (NUMA — two sockets, two separate caches, cross-socket traffic is real), ~115 GB DDR3-1866, Fedora 42, .NET 9.0.11 (RyuJIT AVX), BenchmarkDotNet v0.14.0, <code>ShortRun</code> job (LaunchCount=1, WarmupCount=3, IterationCount=3). No process or thread affinity pinning — on dual-socket NUMA, thread migration and cross-socket memory access can widen variance. Different hardware, different numbers — that&rsquo;s half the lesson.</p>
<p>Throughput in this post is reported in <strong>M ops/sec</strong>, derived from BenchmarkDotNet&rsquo;s per-operation time via <code>OperationsPerInvoke = 500_000</code>.</p>
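<p>A minimal sketch of that conversion, assuming only that BenchmarkDotNet&rsquo;s <code>Mean</code> is nanoseconds per operation. The <code>Throughput</code> class and its method names are illustrative, not part of the companion code:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

// Hypothetical helper: converts BDN&#39;s per-operation time into throughput.
public static class Throughput
{
    // Nanoseconds per operation to operations per second.
    public static double OpsPerSecond(double nsPerOp) =&gt; 1e9 / nsPerOp;

    // Nanoseconds per operation to millions of operations per second.
    public static double MopsPerSecond(double nsPerOp) =&gt; 1e3 / nsPerOp;
}</code></pre></div>
<p>For example, a mean of 81.28 ns per insert works out to roughly 12.3 M ops/sec.</p>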
<details>
<summary>Methodology details (click to expand)</summary>
<ul>
<li><strong>Inserts per invocation:</strong> N = 500,000 (insert-or-update via <code>dictionary[key] = value</code>). BDN divides invocation time by N to report nanoseconds per insert, from which ops/sec follows directly.</li>
<li><strong>Sequential/Narrow keys (0..N-1):</strong> every invocation after the first is pure overwrite — steady-state from the second call onward.</li>
<li><strong>Realistic keys (random from [0..1M)):</strong> <code>_randomKeys</code> is pre-generated once in <code>GlobalSetup</code>: a deterministic multiset with ~393K unique keys (the expected count for m = 500K draws from a key space of n = 1M is <code>n(1 - e^{-m/n})</code> ≈ 393K). Subsequent invocations replay the same keys, modeling steady-state overwrites.</li>
<li><strong>Why <code>GlobalSetup</code>, not <code>IterationSetup</code>:</strong> the table is created once and persists across all iterations — no fresh table per run. This is deliberate. <code>IterationSetup</code> (reset the table before every iteration) measures cold inserts into an empty dictionary; <code>GlobalSetup</code> measures steady-state overwrites into a populated one. Cold inserts are dominated by dictionary resizing and memory allocation — costs that flatten the Lock vs Striped difference to ~1×, hiding the contention the post is about. Production systems don&rsquo;t restart between requests. The 17× Narrow advantage and the three-verdict divergence only appear under steady-state, where lock contention — not allocation — is the bottleneck.</li>
<li><strong>WAL storage:</strong> btrfs (<code>/var/tmp</code>, not tmpfs) — fsync hits a real filesystem, verified via <code>stat -f -c %T</code>.</li>
<li><strong>ShortRun job:</strong> <a href="https://benchmarkdotnet.org/articles/configs/jobs.html"><code>ShortRun</code></a> is deliberately quick (3 measurement iterations) — enough to show <em>relative</em> behavior across scenarios, not enough for tight confidence intervals or absolute production throughput. BDN emits the exact job config in every run: <code>Job=ShortRun-.NET 9.0  Runtime=.NET 9.0  IterationCount=3  LaunchCount=1  WarmupCount=3</code>.</li>
<li><strong>GC:</strong> BenchmarkDotNet can force a full GC between invocations (controlled by <code>GcForce</code> in the job config) — another laboratory condition that limits real-world representativeness.</li>
<li><strong>For publication-grade numbers:</strong> use <code>[SimpleJob]</code> (15+ iterations, tighter CIs); <code>ShortRun</code> trades precision for speed.</li>
</ul>
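<p>The unique-key estimate for the Realistic scenario can be checked with a short sketch. <code>KeyStats</code> and its methods are hypothetical names introduced here; the seed and key space mirror the values used in this post, but the code is not taken from the companion repository:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.Collections.Generic;

// Hypothetical helper for sanity-checking the unique-key count.
public static class KeyStats
{
    // Expected number of distinct keys after m uniform draws from n possible keys.
    public static double ExpectedDistinct(int n, int m) =&gt;
        n * (1.0 - Math.Exp(-(double)m / n));

    // Empirical count of distinct keys for a seeded generator.
    public static int CountDistinct(int n, int m, int seed)
    {
        var rng = new Random(seed);
        var seen = new HashSet&lt;int&gt;();
        for (int i = 0; i &lt; m; i&#43;&#43;)
            seen.Add(rng.Next(0, n));
        return seen.Count;
    }
}</code></pre></div>
<p><code>ExpectedDistinct(1_000_000, 500_000)</code> evaluates to about 393.5K; an empirical count from <code>CountDistinct</code> should land within a fraction of a percent of that.</p>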
</details>
<hr>
<h2 id="the-anatomy-of-a-comfortable-lie">The anatomy of a comfortable lie</h2>
<div class="highlight"><pre data-lang="csharp"><code>private LockTable&lt;int, Row&gt; _table;

[Benchmark]
public void InsertBenchmark()
{
    for (int i = 0; i &lt; 1_000_000; i&#43;&#43;)
        _table.Insert(i, GenerateRow(i));
}</code></pre></div>
<p>A storage engine — the running example here because it stacks every distortion layer: disk I/O, caching, concurrency, durability, all tangled. <code>LockTable</code> wraps <code>Dictionary&lt;TKey, TValue&gt;</code> behind a single <code>lock</code>; <code>StripedTable</code> wraps <code>ConcurrentDictionary</code> (striped locks<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>). They exist so the examples compile and run, and the locking semantics match. Swap &ldquo;storage engine&rdquo; for HTTP server, JSON serializer, or crypto library and the mechanics are identical. Anything touching memory hierarchies or I/O carries this gap.</p>
<p>Four assumptions hiding inside that one method:</p>
<table>
  <thead>
      <tr>
          <th>What the benchmark assumes</th>
          <th>What load exposes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sequential keys (0, 1, 2, &hellip;)</td>
          <td>Random keys (uniform here; production often Zipfian<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>)</td>
      </tr>
      <tr>
          <td>Single thread, zero contention</td>
          <td>Dozens of threads, lock pressure</td>
      </tr>
      <tr>
          <td>Data fits in L3 cache (30 MB per socket)</td>
          <td>Working set 10–100× larger</td>
      </tr>
      <tr>
          <td>No durability (no fsync)</td>
          <td>WAL<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> + fsync per batch</td>
      </tr>
  </tbody>
</table>
<p>The instinct is correct: fix it. Swap the lock for <code>ConcurrentDictionary</code>, add threads, use real data. The mistake comes next — believing the benchmark that exposed the problem can also verify the fix.</p>
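<p>For orientation, here is a minimal sketch of what the two wrappers being swapped might look like. It is an assumption for illustration only: the versions in the companion repository also accept a WAL path and are not reproduced here. Only the locking shape is shown.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System.Collections.Concurrent;
using System.Collections.Generic;

// Sketch: one global lock around a plain Dictionary.
public sealed class LockTable&lt;TKey, TValue&gt; where TKey : notnull
{
    private readonly Dictionary&lt;TKey, TValue&gt; _map = new();
    private readonly object _gate = new();

    // Insert-or-update; every caller serializes on the same lock.
    public void Insert(TKey key, TValue value)
    {
        lock (_gate) { _map[key] = value; }
    }

    public int Count { get { lock (_gate) { return _map.Count; } } }
}

// Sketch: ConcurrentDictionary, which locks per stripe internally.
public sealed class StripedTable&lt;TKey, TValue&gt; where TKey : notnull
{
    private readonly ConcurrentDictionary&lt;TKey, TValue&gt; _map = new();

    // Insert-or-update; writers contend only when they hit the same stripe.
    public void Insert(TKey key, TValue value) =&gt; _map[key] = value;

    public int Count =&gt; _map.Count;
}</code></pre></div>
<p>Both expose the same <code>Insert</code> semantics, so a benchmark method can run unchanged against either backend.</p>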
<hr>
<h2 id="scenario-a--the-flat-line-lazy">Scenario A — The flat line (Lazy)</h2>
<div class="highlight"><pre data-lang="csharp"><code>[Benchmark(Baseline = true, OperationsPerInvoke = N)]
public void InsertSequential()
{
    for (int i = 0; i &lt; N; i&#43;&#43;)
        _table.Insert(i, Row.Default);
}</code></pre></div>
<p>Sequential keys. Single method. No durability. Run against both backends — <code>Dictionary + lock</code> and <code>ConcurrentDictionary</code> — with a <code>ThreadCount</code> parameter set to 1, 4, 16, 32. <code>InsertSequential</code> ignores it. That&rsquo;s the point.</p>
<div class="chart-container">
  <canvas id="chart-f84b01ddf308fb613addd1e7ee0fcff8"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-f84b01ddf308fb613addd1e7ee0fcff8').getContext('2d');
    new Chart(ctx, 
{
  type: 'line',
  data: {
    labels: ['1 thread', '4 threads', '16 threads', '32 threads'],
    datasets: [{
      label: 'Dictionary + lock (M ops/sec)',
      data: [27.2, 27.6, 27.3, 24.6],
      borderColor: '#f38ba8',
      backgroundColor: 'rgba(243, 139, 168, 0.1)',
      tension: 0.3
    }, {
      label: 'ConcurrentDictionary (M ops/sec)',
      data: [12.3, 10.5, 10.4, 10.6],
      borderColor: '#89b4fa',
      backgroundColor: 'rgba(137, 180, 250, 0.1)',
      tension: 0.3
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Insert throughput — baseline conditions' },
      subtitle: { display: true, text: '500K sequential inserts · single thread · no durability' }
    },
    scales: {
      y: { title: { display: true, text: 'M ops/sec' }, min: 0, max: 55 }
    }
  }
}
);
  })();
</script>

<div class="highlight"><pre data-lang="text"><code>| Backend              | 1 thread | 4 threads | 16 threads | 32 threads |
|----------------------|----------|-----------|------------|------------|
| Dictionary &#43; lock    | 27.2M    | 27.6M     | 27.3M      | 24.6M      |
| ConcurrentDictionary | 12.3M    | 10.5M     | 10.4M      | 10.6M      |</code></pre></div>
<p>These are steady-state overwrite numbers — state persists across invocations within a parameter combination, so after the first invocation the dictionary is full and every subsequent insert (including warmup and measurement) is a key update, not a growth/rehash event. Lock ranges from 24.6M to 27.6M across all thread counts — essentially flat (InsertSequential ignores <code>ThreadCount</code>; the spread is ShortRun noise). <code>ConcurrentDictionary</code> ranges from 10.4M to 12.3M. Is that variation real? Here&rsquo;s the raw BenchmarkDotNet output for <code>ConcurrentDictionary</code> InsertSequential:</p>
<div class="highlight"><pre data-lang="text"><code>| ThreadCount | Mean      | Error      | StdDev   | → M ops/sec |
|-------------|-----------|------------|----------|-------------|
| 1           | 81.28 ns  | ±24.88 ns  | 1.36 ns  | 12.3M       |
| 4           | 95.66 ns  | ±62.53 ns  | 3.43 ns  | 10.5M       |
| 16          | 96.15 ns  | ±24.42 ns  | 1.34 ns  | 10.4M       |
| 32          | 94.09 ns  | ±67.09 ns  | 3.68 ns  | 10.6M       |</code></pre></div>
<p><code>Error</code> is half of BenchmarkDotNet&rsquo;s confidence interval — throughout this post, &ldquo;CI&rdquo; means BDN&rsquo;s default (99.9%, very conservative). At <code>ShortRun</code>&rsquo;s 3 measurement iterations the t-distribution with 2 degrees of freedom inflates the CI further. <code>StdDev</code> is small (1.3–3.7 ns), meaning the individual runs were consistent, but the statistical confidence is low. The CIs overlap massively across all four rows. The variation looks like noise.</p>
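<p>To see where the inflation comes from, here is a sketch of the half-width calculation, assuming <code>Error</code> is the usual Student-t margin (critical value times standard error). The class name and the hard-coded critical value are illustrative choices, not BenchmarkDotNet API:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

// Hypothetical helper reproducing the CI half-width by hand.
public static class ConfidenceInterval
{
    // Two-sided 99.9% critical value of Student&#39;s t with 2 degrees of freedom (n = 3).
    public const double T999Df2 = 31.599;

    // Half-width of the interval: t * s / sqrt(n).
    public static double HalfWidth(double stdDev, int n, double tCritical) =&gt;
        tCritical * stdDev / Math.Sqrt(n);
}</code></pre></div>
<p>With <code>StdDev</code> = 1.36 ns and n = 3 this gives roughly 24.8 ns, in line with the ±24.88 ns in the first row (the small gap comes from the rounded <code>StdDev</code>). At 15 iterations the critical value drops to roughly 4.1, which is why a longer job tightens the interval so sharply.</p>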
<p>Neither backend reacts to thread count. Sequential keys feed the prefetcher<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>, spatial locality holds the working set in cache, no fsync means no I/O stalls.</p>
<p>The benchmark doesn&rsquo;t react because it has nothing to react <em>to</em>.</p>
<p>Verdict: <code>ConcurrentDictionary</code> is a regression. Revert.</p>
<hr>
<h2 id="scenario-b--the-better-lie-narrow">Scenario B — The better lie (Narrow)</h2>
<div class="highlight"><pre data-lang="csharp"><code>[Benchmark(OperationsPerInvoke = N)]
public void InsertNarrow()
{
    int opsPerThread = N / ThreadCount;
    var options = new ParallelOptions { MaxDegreeOfParallelism = ThreadCount };
    Parallel.For(0, ThreadCount, options, threadIdx =&gt;
    {
        int start = threadIdx * opsPerThread;
        for (int i = 0; i &lt; opsPerThread; i&#43;&#43;)
            _table.Insert(start &#43; i, Row.Default);
    });
}</code></pre></div>
<p>One lie removed: threads now run in parallel. Three remain — sequential keys per partition, working set still fits in L3 cache, and no durability. <code>MaxDegreeOfParallelism</code> caps concurrency at <code>ThreadCount</code>; the actual threads come from the .NET ThreadPool, which ramps up workers on demand — for long-running CPU-bound work like this, the pool usually saturates quickly.</p>
<div class="chart-container">
  <canvas id="chart-9ad0113d54c400617d9b492a3708be27"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-9ad0113d54c400617d9b492a3708be27').getContext('2d');
    new Chart(ctx, 
{
  type: 'line',
  data: {
    labels: ['1 thread', '4 threads', '16 threads', '32 threads'],
    datasets: [{
      label: 'Dictionary + lock (M ops/sec)',
      data: [26.5, 5.8, 4.1, 2.8],
      borderColor: '#f38ba8',
      backgroundColor: 'rgba(243, 139, 168, 0.1)',
      tension: 0.3
    }, {
      label: 'ConcurrentDictionary (M ops/sec)',
      data: [10.4, 16.8, 40.5, 49.3],
      borderColor: '#89b4fa',
      backgroundColor: 'rgba(137, 180, 250, 0.1)',
      tension: 0.3
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Insert throughput — parallel, no durability' },
      subtitle: { display: true, text: '500K inserts · 1–32 threads · sequential keys per partition' }
    },
    scales: {
      y: { title: { display: true, text: 'M ops/sec' }, min: 0, max: 55 }
    }
  }
}
);
  })();
</script>

<div class="highlight"><pre data-lang="text"><code>| Backend              | 1 thread | 4 threads | 16 threads | 32 threads |
|----------------------|----------|-----------|------------|------------|
| Dictionary &#43; lock    | 26.5M    | 5.8M      | 4.1M       | 2.8M       |
| ConcurrentDictionary | 10.4M    | 16.8M     | 40.5M      | 49.3M      |</code></pre></div>
<p>The direction of the crossover is clear: even with ShortRun&rsquo;s wide CIs, the magnitude is too large to dismiss. Raw BenchmarkDotNet output for InsertNarrow follows (ns per insert, via <code>OperationsPerInvoke</code>). Here ns/op is wall‑clock time divided by N, not single‑thread latency; parallel overlap makes it smaller than any individual thread&rsquo;s per‑insert time. Watch the Error column: at ShortRun&rsquo;s n=3, the half-width exceeds the Mean in both 32-thread rows:</p>
<div class="highlight"><pre data-lang="text"><code>| Backend    | ThreadCount | Mean      | Error       | StdDev    | → M ops/sec |
|------------|-------------|-----------|-------------|-----------|-------------|
| Lock       | 1           | 37.67 ns  | ±5.63 ns    | 0.31 ns   | 26.5M       |
| Lock       | 32          | 351.35 ns | ±1,159.4 ns | 63.55 ns  | 2.8M        |
| ConcDic    | 1           | 95.79 ns  | ±60.33 ns   | 3.31 ns   | 10.4M       |
| ConcDic    | 32          | 20.28 ns  | ±35.85 ns   | 1.97 ns   | 49.3M       |</code></pre></div>
<p>Lock collapses from 26.5M to 2.8M. <code>ConcurrentDictionary</code> climbs from 10.4M to 49.3M. The direction is unambiguous; the exact ratio (~17× at 32 threads) is approximate given ShortRun&rsquo;s wide CIs. The verdict flips: ship the optimization.</p>
<p>Except — 49.3M ops/sec will never happen in production. Sequential keys per partition mean the prefetcher runs at full tilt. The working set still fits comfortably in L3 cache. No fsync means zero I/O wait. On this hardware, the gap between Narrow and Realistic is 391× (49.3M vs 126K). Fsync drowns both backends to the point where the CIs overlap completely.</p>
<p><a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart&rsquo;s Law</a> (1975), in <a href="https://doi.org/10.1002/(SICI)1234-981X(199707)5:3%3C305::AID-EURO184%3E3.0.CO;2-4">Strathern&rsquo;s phrasing</a> (1997): <em>&ldquo;When a measure becomes a target, it ceases to be a good measure.&rdquo;</em></p>
<p>Volkswagen — <em>Deutsche Gründlichkeit</em> — shipped ECU software that detected test conditions and switched calibration<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>. When the car knew it was being tested, it changed behavior. Passed every bench in Europe and the US for years. Not better engineering. Better cheating.</p>
<p>A benchmark that reacts in the right direction but overstates the magnitude is more dangerous than one that doesn&rsquo;t react at all — it breeds the confidence to ship without measuring further.</p>
<hr>
<h2 id="scenario-c--same-optimization-opposite-conclusions-realistic">Scenario C — Same optimization, opposite conclusions (Realistic)</h2>
<div class="highlight"><pre data-lang="csharp"><code>[GlobalSetup]
public void Setup()
{
    var walDir = Path.Combine(&#34;/var/tmp&#34;, &#34;bench-wal&#34;);
    Directory.CreateDirectory(walDir);
    var walPath = Path.Combine(walDir, $&#34;wal-{BackendType}-{ThreadCount}.log&#34;);
    _table = BackendType == Backend.Lock
        ? new LockTable&lt;int, Row&gt;(walPath: walPath)
        : new StripedTable&lt;int, Row&gt;(walPath: walPath);
    var rng = new Random(42);
    _randomKeys = new int[N];
    _randomRows = new Row[N];
    for (int i = 0; i &lt; N; i&#43;&#43;)
    {
        _randomKeys[i] = rng.Next(0, KeySpace);
        _randomRows[i] = Row.Generate(_randomKeys[i]);
    }
}

[Benchmark(OperationsPerInvoke = N)]
public void InsertRealistic()
{
    int opsPerThread = N / ThreadCount;
    var options = new ParallelOptions { MaxDegreeOfParallelism = ThreadCount };
    Parallel.For(0, ThreadCount, options, threadIdx =&gt;
    {
        int start = threadIdx * opsPerThread;
        for (int i = 0; i &lt; opsPerThread; i&#43;&#43;)
            _table.Insert(_randomKeys[start &#43; i], _randomRows[start &#43; i]);
    });
    _table.FlushWAL();  // one fsync per batch — group commit, not per-insert
}</code></pre></div>
<p>All four distortions from the opening table addressed. Random keys. Parallel threads. Working set that approaches L3 capacity — at ~500K entries with variable-size Row payloads, the hash table reaches the 30 MB L3 boundary (keys + rows + bucket overhead), before counting the WAL buffer. Durability enforced — <code>FlushWAL()</code> calls fsync once per batch (group commit), not per insert. WAL on btrfs (<code>/var/tmp</code>), not tmpfs — fsync hits a real filesystem. A real OLTP system calling fsync per transaction would be slower still.</p>
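<p>As a sketch of what a once-per-batch durable flush can look like in .NET: <code>DurableWrite</code> and <code>AppendBatch</code> are hypothetical names, and the companion code&rsquo;s <code>FlushWAL()</code> may be implemented differently. The one real API relied on is <code>FileStream.Flush(flushToDisk: true)</code>, which asks the OS to push its file buffers to the device (an fsync on Linux) rather than only emptying .NET&rsquo;s own buffer:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.IO;

// Hypothetical helper: append one batch, then force it to stable storage once.
public static class DurableWrite
{
    public static void AppendBatch(string path, byte[] batch)
    {
        using var stream = new FileStream(
            path, FileMode.Append, FileAccess.Write, FileShare.Read);

        // All records of the batch go into the file first...
        stream.Write(batch, 0, batch.Length);

        // ...and a single durable flush covers the whole batch (group commit).
        stream.Flush(flushToDisk: true);
    }
}</code></pre></div>
<p>Calling the durable flush once per batch instead of once per record is what keeps the amortized cost per insert in the microsecond range.</p>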
<div class="chart-container">
  <canvas id="chart-4cf20a219dabbca812ff900a3c717627"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-4cf20a219dabbca812ff900a3c717627').getContext('2d');
    new Chart(ctx, 
{
  type: 'line',
  data: {
    labels: ['1 thread', '4 threads', '16 threads', '32 threads'],
    datasets: [{
      label: 'Dictionary + lock (K ops/sec)',
      data: [125, 122, 124, 120],
      borderColor: '#f38ba8',
      backgroundColor: 'rgba(243, 139, 168, 0.1)',
      tension: 0.3
    }, {
      label: 'ConcurrentDictionary (K ops/sec)',
      data: [125, 124, 126, 126],
      borderColor: '#89b4fa',
      backgroundColor: 'rgba(137, 180, 250, 0.1)',
      tension: 0.3
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Insert throughput — production-like conditions' },
      subtitle: { display: true, text: '500K inserts · 1–32 threads · random keys · WAL + fsync on btrfs · same chart shape, different unit, different bottleneck' }
    },
    scales: {
      y: { title: { display: true, text: 'K ops/sec' }, min: 0, max: 150 }
    }
  }
}
);
  })();
</script>

<div class="highlight"><pre data-lang="text"><code>| Backend              | 1 thread | 4 threads | 16 threads | 32 threads |
|----------------------|----------|-----------|------------|------------|
| Dictionary &#43; lock    | 125K     | 122K      | 124K       | 120K       |
| ConcurrentDictionary | 125K     | 124K      | 126K       | 126K       |</code></pre></div>
<p>Are these differences real? Raw BenchmarkDotNet output for InsertRealistic at 32 threads:</p>
<div class="highlight"><pre data-lang="text"><code>| Backend    | Mean        | Error       | StdDev     | → K ops/sec |
|------------|-------------|-------------|------------|-------------|
| Lock       | 8,335 ns/op | ±4,166 ns   | 228 ns     | 120K        |
| ConcDic    | 7,934 ns/op | ±1,538 ns   | 84 ns      | 126K        |</code></pre></div>
<p>The 99.9% CIs overlap (Lock: 4,169–12,501 ns; ConcDic: 6,396–9,472 ns). With only 3 iterations (ShortRun), the CIs are too wide to resolve a 5% difference — it&rsquo;s indistinguishable from noise at this sample size.</p>
<p>The lines don&rsquo;t cross. They overlap. Both backends converge to ~124K ops/sec regardless of thread count. Note the unit: ns/op here is wall‑clock time per invocation divided by N, so each insert&rsquo;s share includes 1/N of the single <code>FlushWAL()</code> call — amortized fsync, not per‑insert fsync. In batch terms: ~8,000 ns/op × 500K = ~4 seconds per batch, one fsync per batch — a commit rate of ~0.25 batches/sec.</p>
<p>How do we know fsync is the bottleneck and not just noise? Remove it. A fourth benchmark, <code>InsertRealisticNoFlush</code>, runs the same random‑key parallel inserts without calling <code>FlushWAL()</code>:</p>
<div class="highlight"><pre data-lang="csharp"><code>[Benchmark(OperationsPerInvoke = N)]
public void InsertRealisticNoFlush()
{
    int opsPerThread = N / ThreadCount;
    var options = new ParallelOptions { MaxDegreeOfParallelism = ThreadCount };
    Parallel.For(0, ThreadCount, options, threadIdx =&gt;
    {
        int start = threadIdx * opsPerThread;
        for (int i = 0; i &lt; opsPerThread; i&#43;&#43;)
            _table.Insert(_randomKeys[start &#43; i], _randomRows[start &#43; i]);
    });
    // no FlushWAL — pure in-memory
}</code></pre></div>
<p>Raw BenchmarkDotNet output at 32 threads:</p>
<div class="highlight"><pre data-lang="text"><code>| Method                 | Backend | Mean        | Error       | StdDev    |
|------------------------|---------|-------------|-------------|-----------|
| InsertRealistic        | Lock    | 8,335 ns/op | ±4,166 ns   | 228 ns    |
| InsertRealisticNoFlush | Lock    | 422 ns/op   | ±684 ns     | 38 ns     |
| InsertRealistic        | ConcDic | 7,934 ns/op | ±1,538 ns   | 84 ns     |
| InsertRealisticNoFlush | ConcDic | 33 ns/op    | ±15 ns      | 0.8 ns    |</code></pre></div>
<div class="chart-container">
  <canvas id="chart-e1996b7e22b178c34aa252d6e9fb204c"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-e1996b7e22b178c34aa252d6e9fb204c').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['With fsync', 'Without fsync'],
    datasets: [
      {
        label: 'Dictionary + lock',
        data: [0.120, 2.37],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      },
      {
        label: 'ConcurrentDictionary',
        data: [0.126, 30.3],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'fsync is the bottleneck — remove it and the difference reappears' },
      subtitle: { display: true, text: '32 threads · random keys · same code, ± one FlushWAL() call' }
    },
    scales: {
      y: {
        type: 'logarithmic',
        title: { display: true, text: 'M ops/sec (log scale)' },
        min: 0.01,
        max: 100
      }
    }
  }
}
);
  })();
</script>

<p>Without fsync, <code>ConcurrentDictionary</code> appears substantially faster than lock at 32 threads (33 ns vs 422 ns, ~12.8×) — but note Lock&rsquo;s Error (±684 ns) exceeds its Mean (422 ns), so the exact ratio is approximate at this sample size. The direction is consistent with the Narrow scenario: lock contention dominates when I/O is removed. With fsync, both backends are crushed to ~0.12M ops/sec. The ~7,900 ns that fsync adds per insert is the same for both backends, and it drowns everything else. The optimization that the Narrow benchmark promised at 49.3M ops/sec is unresolved at this sample size under production I/O — the CIs overlap completely.</p>
<p>Three benchmarks. Three verdicts.</p>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Lazy (flat)</th>
          <th>Narrow (inflated)</th>
          <th>Realistic (fsync)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dictionary + lock</td>
          <td>27.2M ops/sec</td>
          <td>2.8M ops/sec</td>
          <td>120K ops/sec</td>
      </tr>
      <tr>
          <td>ConcurrentDictionary</td>
          <td>12.3M ops/sec</td>
          <td>49.3M ops/sec</td>
          <td>126K ops/sec</td>
      </tr>
      <tr>
          <td><strong>ConcDic / Lock</strong></td>
          <td><strong>0.45×</strong></td>
          <td><strong>~17×</strong></td>
          <td><strong>~1.05×</strong></td>
      </tr>
      <tr>
          <td><strong>Verdict</strong></td>
          <td><strong>Revert</strong> — lock is ~2× faster</td>
          <td><strong>Ship</strong> — ConcDic is ~17× faster</td>
          <td><strong>Irrelevant</strong> — both ≈124K</td>
      </tr>
  </tbody>
</table>
<p>M ops/sec → K ops/sec. Korzybski (1933): <em>the map is not the territory.</em> Three benchmarks, three maps — each accurate within its own borders, each blind beyond them.</p>
<h3 id="run-it-yourself">Run it yourself</h3>
<div class="highlight"><pre data-lang="bash"><code>git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/why-benchmarks-lie

# All benchmarks (Lazy, Narrow, Realistic)
dotnet run -c Release

# Two benchmark classes — WhyBenchmarksLieSequentialBenchmark (Lazy)
# and WhyBenchmarksLieBenchmark (Narrow, Realistic, NoFlush)
dotnet run -c Release -- --filter &#39;*WhyBenchmarksLie*&#39;</code></pre></div>
<p>The <em>direction</em> reproduces; the exact magnitudes don&rsquo;t.</p>
<h2 id="benchmark-environment">Benchmark environment</h2>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CPU</td>
          <td>2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)</td>
      </tr>
      <tr>
          <td>L3 Cache</td>
          <td>30 MB per socket</td>
      </tr>
      <tr>
          <td>RAM</td>
          <td>~115 GB DDR3-1866 (quad-channel per socket)</td>
      </tr>
      <tr>
          <td>OS</td>
          <td>Fedora Linux 42 (kernel 6.17)</td>
      </tr>
      <tr>
          <td>Runtime</td>
          <td>.NET 9.0.11 (RyuJIT AVX)</td>
      </tr>
      <tr>
          <td>SDK</td>
          <td>.NET SDK 10.0.102</td>
      </tr>
      <tr>
          <td>BenchmarkDotNet</td>
          <td>v0.14.0</td>
      </tr>
      <tr>
          <td>Job</td>
          <td>ShortRun (WarmupCount=3, IterationCount=3)</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>Server GC, Concurrent (BDN default for benchmark processes)</td>
      </tr>
      <tr>
          <td>WAL storage</td>
          <td>btrfs (<code>/var/tmp</code>, not tmpfs) — fsync hits a real filesystem</td>
      </tr>
      <tr>
          <td>NVMe</td>
          <td>Samsung SSD 970 EVO Plus 1TB (consumer, no PLP)</td>
      </tr>
  </tbody>
</table>
<p><strong>Limitations:</strong> Dual-socket NUMA — no process or thread affinity pinning, so thread migration and cross-socket memory access can widen variance. ShortRun (n=3) provides wide CIs — sufficient for relative comparisons, not for tight absolute numbers.</p>
<hr>
<h2 id="what-to-do-starting-tomorrow">What to do starting tomorrow</h2>
<p><strong><a href="https://en.wikipedia.org/wiki/Little%27s_law">Little&rsquo;s Law</a></strong> (1961): <strong>L = λ × W</strong> — <em>in-flight requests = throughput × latency</em>.</p>
<p>Throughput without latency is half the story:</p>
<table>
  <thead>
      <tr>
          <th>Throughput</th>
          <th>Latency</th>
          <th>In-flight (L = λ × W)</th>
          <th>What it looks like</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>100K ops/sec</td>
          <td>10 μs</td>
          <td>~1</td>
          <td>Single-threaded, no queuing</td>
      </tr>
      <tr>
          <td>100K ops/sec</td>
          <td>1 ms</td>
          <td>100</td>
          <td>Moderate concurrency</td>
      </tr>
      <tr>
          <td>100K ops/sec</td>
          <td>10 ms</td>
          <td>1,000</td>
          <td>Heavy concurrency, queuing</td>
      </tr>
  </tbody>
</table>
<p>Same throughput. Three different systems. Identical throughput can hide wildly different p99 latencies; queueing shows up in the tails first. Reporting throughput without a latency distribution is a hospital bragging about patient volume and omitting the mortality rate.</p>
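<p>The arithmetic behind the table is small enough to sketch. <code>LittlesLaw.InFlight</code> is an illustrative name, not an existing API:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

// Hypothetical helper for L = λ × W.
public static class LittlesLaw
{
    // Average number of in-flight requests, given throughput (ops/sec) and mean latency.
    public static double InFlight(double opsPerSecond, TimeSpan latency) =&gt;
        opsPerSecond * latency.TotalSeconds;
}</code></pre></div>
<p><code>LittlesLaw.InFlight(100_000, TimeSpan.FromMilliseconds(10))</code> comes out at about 1,000 in-flight requests, matching the third row. Run the formula the other way and a measured queue depth plus throughput yields the mean latency the throughput number was hiding.</p>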
<p>Red flags — any of these and the numbers are suspect:</p>
<table>
  <thead>
      <tr>
          <th>Red flag</th>
          <th>Why it matters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sequential keys only</td>
          <td>Perfect branch prediction, zero hash collisions, sequential cache access</td>
      </tr>
      <tr>
          <td>Single thread</td>
          <td>No lock contention, no false sharing, no NUMA effects</td>
      </tr>
      <tr>
          <td>Data &lt; L3 cache</td>
          <td>Measures cache speed, not the algorithm</td>
      </tr>
      <tr>
          <td>No durability (no fsync)</td>
          <td>fsync cost spans orders of magnitude across hardware classes<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup></td>
      </tr>
      <tr>
          <td>Fixed-size values</td>
          <td>No memory fragmentation, no variable-length encoding overhead</td>
      </tr>
      <tr>
          <td>Clean state every run</td>
          <td>No WAL growth, no fragmentation, no background compaction</td>
      </tr>
  </tbody>
</table>
<p>Each row can shift results by an order of magnitude. They don&rsquo;t stack linearly — some overlap, some cancel — but they compound.</p>
<p><strong>1. Compare two things.</strong> A number alone is noise.</p>
<div class="highlight"><pre data-lang="csharp"><code>[Benchmark(Baseline = true)]
public byte[] SerializeJson() =&gt; JsonSerializer.SerializeToUtf8Bytes(_payload);

[Benchmark]
public byte[] SerializeMessagePack() =&gt; MessagePackSerializer.Serialize(_payload);</code></pre></div>
<p><code>Baseline = true</code> tells <a href="https://benchmarkdotnet.org/">BenchmarkDotNet</a>: this is the reference point. Every other <code>[Benchmark]</code> method gets a <code>Ratio</code> column (how many times faster or slower relative to baseline), plus <code>Error</code> and <code>StdDev</code>, so you can judge whether the difference is real or just noise.</p>
<p>&ldquo;312 ns.&rdquo; Fast? Slow? No way to tell. &ldquo;0.43× baseline&rdquo; — MessagePack is 2.3× faster than JSON. That&rsquo;s a decision, not a number.</p>
<p><strong>2. Realistic data, realistic scale.</strong> CPUs have a hierarchy of memory, each level larger and slower than the last. L1 cache (~32 KB) responds in ~1 ns. L2 (~256 KB) in ~4 ns. L3 (~30 MB, shared across cores) in ~12 ns. Main memory (DRAM) in ~60–100 ns. A benchmark whose dataset fits in L1 measures cache speed, not the algorithm.</p>
<p><code>[Params]</code> tells BenchmarkDotNet to run the same benchmark at each data size — crossing every boundary in the hierarchy:</p>
<div class="highlight"><pre data-lang="csharp"><code>[Params(
    // Element counts; the byte sizes assume 4-byte int elements.
    1_000,        // 4 KB  → fits in L1
    64_000,       // 256 KB → fits in L2
    8_000_000,    // 32 MB → near L3 boundary (30 MB)
    64_000_000    // 256 MB → spills to DRAM
)]
public int DataSize { get; set; }</code></pre></div>
<p>If the numbers don&rsquo;t change across sizes — the algorithm is cache-insensitive. If they cliff at 32 MB — the benchmark was measuring cache, and production (where the working set is 40 GB) will see a very different number.</p>
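<p>Put together, a size sweep might look like the sketch below. The class name, the seeded fill, and the sequential sum are assumptions chosen for illustration; it needs the BenchmarkDotNet package referenced to compile:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using BenchmarkDotNet.Attributes;

// Hypothetical benchmark: the same loop measured at four working-set sizes.
public class CacheSweepBenchmark
{
    private int[] _data = Array.Empty&lt;int&gt;();

    [Params(1_000, 64_000, 8_000_000, 64_000_000)]
    public int DataSize { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        // Fixed seed so every run sees the same data.
        var rng = new Random(42);
        _data = new int[DataSize];
        for (int i = 0; i &lt; _data.Length; i&#43;&#43;)
            _data[i] = rng.Next(0, 100);
    }

    [Benchmark]
    public long Sum()
    {
        long sum = 0;
        for (int i = 0; i &lt; _data.Length; i&#43;&#43;)
            sum &#43;= _data[i];
        return sum; // returned so the result is observed
    }
}</code></pre></div>
<p>Divide each reported <code>Mean</code> by <code>DataSize</code> to compare cost per element across rows. A sequential sum is prefetch-friendly, so expect a softer cliff here than a random-access pattern would show.</p>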
<p><strong>3. Error bars.</strong> BenchmarkDotNet reports <code>Error</code> (half-width of the 99.9% confidence interval) and <code>StdDev</code> (spread of measurements) for every benchmark. Overlapping CIs mean the difference is unresolved. Popper (1934): a benchmark can falsify a hypothesis — &ldquo;A is not faster than B&rdquo; — but overlapping confidence intervals can never confirm one. For formal tests, BDN supports Welch&rsquo;s t-test and Mann-Whitney U.<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup></p>
<div class="highlight"><pre data-lang="text"><code>| Method | Mean     | Error   | StdDev  |
|--------|----------|---------|---------|
| V1     | 152.3 ns | ±2.1 ns | 1.98 ns |
| V2     | 148.7 ns | ±3.4 ns | 3.19 ns |</code></pre></div>
<p>V1 ranges from 150.2 to 154.4 ns. V2 ranges from 145.3 to 152.1 ns. The ranges overlap, so the ~2.4% difference between the means is unresolved at this confidence level. Without a formal statistical test, shipping this as a &ldquo;2% win&rdquo; risks shipping noise.</p>
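<p>The overlap check itself is mechanical. A small sketch, with <code>CiCheck.Overlaps</code> as a hypothetical helper that takes each benchmark&rsquo;s <code>Mean</code> and <code>Error</code>:</p>
<div class="highlight"><pre data-lang="csharp"><code>// Hypothetical helper: do two mean ± error intervals overlap?
public static class CiCheck
{
    public static bool Overlaps(double meanA, double errorA, double meanB, double errorB)
    {
        double loA = meanA - errorA, hiA = meanA &#43; errorA;
        double loB = meanB - errorB, hiB = meanB &#43; errorB;

        // Two closed intervals overlap when each starts before the other ends.
        return loA &lt;= hiB &amp;&amp; loB &lt;= hiA;
    }
}</code></pre></div>
<p><code>CiCheck.Overlaps(152.3, 2.1, 148.7, 3.4)</code> returns <code>true</code> for the table above. Treat this as a coarse screen: non-overlapping intervals are strong evidence of a difference, while overlapping ones only say the data cannot settle it.</p>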
<p><strong>4. Check JIT output.</strong> .NET compiles code at runtime (Just-In-Time). The JIT compiler is smart — sometimes too smart. It might inline a method (copy its body into the caller, eliminating call overhead), eliminate a loop it can prove does nothing, auto-vectorize scalar operations to use SIMD instructions, or optimize away the very thing being benchmarked because the result is never used.</p>
<div class="highlight"><pre data-lang="csharp"><code>[DisassemblyDiagnoser(maxDepth: 3)]
public class MyBenchmark { ... }</code></pre></div>
<p><code>DisassemblyDiagnoser</code> dumps the actual machine code the JIT produced. If the benchmark method compiled down to three instructions — the JIT optimized away the work. The numbers are real. What they measure is not.</p>
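<p>A hedged sketch of the pattern worth checking in the disassembly. The class and method names are hypothetical. Whether the JIT actually removes the discarded work depends on the runtime version, which is exactly why the machine code is worth reading:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System.Linq;
using BenchmarkDotNet.Attributes;

[DisassemblyDiagnoser(maxDepth: 3)]
public class DeadCodeSketch
{
    private int[] data = new int[0];

    [GlobalSetup]
    public void Setup()
    {
        data = Enumerable.Range(0, 1_000).ToArray();
    }

    // Risky: the sum is computed and never observed, so the JIT is
    // free to treat the accumulation as dead code.
    [Benchmark(Baseline = true)]
    public void SumDiscarded()
    {
        int sum = 0;
        foreach (int x in data) sum += x;
    }

    // Safer: returning the value makes the result observable.
    // BenchmarkDotNet consumes benchmark return values.
    [Benchmark]
    public int SumReturned()
    {
        int sum = 0;
        foreach (int x in data) sum += x;
        return sum;
    }
}</code></pre></div>
<p>If the two methods disassemble to visibly different loop bodies, the gap between their timings reflects what the JIT removed, not what the code was meant to do.</p>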
<p><strong>5. Reproducibility.</strong> A benchmark result without its environment is anecdote, not data. Minimum context: CPU model and cache sizes, RAM speed and capacity, OS and kernel version, .NET version, GC mode (Workstation vs Server), dataset size + distribution + random seed, and the exact command to run.</p>
<p>The environment is part of the result. Publishing numbers without it is publishing conclusions without evidence.</p>
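<p>Part of that context can be captured programmatically. The sketch below collects what the standard library exposes; the class <code>EnvironmentNote</code> and its parameters are hypothetical. CPU model, cache sizes, and RAM speed are not available through these APIs and still have to be recorded separately:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.Runtime;
using System.Runtime.InteropServices;

public static class EnvironmentNote
{
    // Builds a plain-text context block to publish next to the results.
    // datasetSize and randomSeed are supplied by the caller (hypothetical parameters).
    public static string Describe(int datasetSize, int randomSeed)
    {
        return string.Join(Environment.NewLine, new[]
        {
            "OS:       " + RuntimeInformation.OSDescription,
            "Arch:     " + RuntimeInformation.ProcessArchitecture,
            "Runtime:  " + RuntimeInformation.FrameworkDescription,
            "GC mode:  " + (GCSettings.IsServerGC ? "Server" : "Workstation"),
            "Cores:    " + Environment.ProcessorCount,
            "Dataset:  " + datasetSize + " elements, seed " + randomSeed,
        });
    }
}</code></pre></div>
<p>Calling <code>EnvironmentNote.Describe(1_000_000, 42)</code> and pasting the output under the results table keeps the numbers and their context in one place.</p>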
<hr>
<p>Benchmark design is architecture. Not metaphorically. Data distribution, concurrency model, durability semantics, warm-up strategy, statistical methodology — each choice shapes what the measurement can see and what it&rsquo;s blind to.</p>
<p>The Lazy benchmark measured real throughput. Single-threaded sequential inserts into an in-memory structure — that&rsquo;s exactly what it tested. The Narrow benchmark promised 49.3M ops/sec — production delivered 126K. On this hardware, on this workload — 391× between the map and the territory. Under production I/O, <code>ConcurrentDictionary</code> and <code>Dictionary + lock</code> converge — the CIs overlap completely. Three benchmarks, three maps. None of them the territory.</p>
<p>Without measurement architecture, the right optimization looks like a regression, the wrong one looks like progress, and every decision downstream inherits the distortion.</p>
<hr>
<h2 id="further-reading">Further reading</h2>
<ul>
<li><a href="https://research.google/pubs/the-tail-at-scale/">The Tail at Scale</a> — Dean &amp; Barroso, 2013. The definitive paper on tail latency.</li>
<li><a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a> — Drepper, 2007. Memory hierarchy, caches, prefetch, NUMA. Still relevant.</li>
<li><a href="https://benchmarkdotnet.org/">BenchmarkDotNet</a> — .NET benchmarking framework.</li>
<li><a href="https://www.brendangregg.com/activebenchmarking.html">Active Benchmarking</a> — Brendan Gregg on methodology.</li>
<li><a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart&rsquo;s Law</a> — Goodhart (1975) on statistical regularity; Strathern (1997) for the popular phrasing: &ldquo;When a measure becomes a target, it ceases to be a good measure.&rdquo;</li>
<li><a href="https://en.wikipedia.org/wiki/Little%27s_law">Little&rsquo;s Law</a> — L = λW.</li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>.NET <code>ConcurrentDictionary</code> uses conceptually striped/fine-grained locking for writes and lock-free reads on common paths like <code>TryGetValue</code>. The exact internal structure may evolve across .NET versions, but the design principle — partition contention across multiple locks instead of one — is stable. <a href="https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Collections/Concurrent/ConcurrentDictionary.cs">Source</a>, <a href="https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.concurrentdictionary-2">docs</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Cooper et al., <a href="https://dl.acm.org/doi/10.1145/1807128.1807152">Benchmarking Cloud Serving Systems with YCSB</a>, SoCC 2010. Core workloads (A/B/C) use Zipfian record selection — a small fraction of keys receives a disproportionate share of requests.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
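<p>Mohan et al., <a href="https://dl.acm.org/doi/10.1145/128765.128770">ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging</a>, ACM TODS 1992. The canonical write-ahead logging (WAL) recovery protocol; PostgreSQL, MySQL, SQLite, and RocksDB all rely on some form of WAL.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>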
</li>
<li id="fn:4">
<p>Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a>, 2007, §3.3 and §6.2. Sequential access triggers hardware prefetch — the CPU loads cache lines before code asks for them. Random access falls back to full DRAM latency (~60–100 ns).&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>United States EPA, <a href="https://www.epa.gov/sites/default/files/2015-10/documents/vw-nov-caa-09-18-15.pdf">Notice of Violation</a>, Sep 18 2015. It lists the defeat device inputs: steering wheel position, vehicle speed, duration of engine operation, and barometric pressure (p. 2). Real-world NOx emissions: up to 40× the US legal limit. Overview: <a href="https://www.epa.gov/vw/learn-about-volkswagen-violations">EPA VW violations page</a>.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:6">
<p>fsync latency spans orders of magnitude depending on hardware class (enterprise NVMe with PLP vs consumer SSD vs spinning disk), firmware, filesystem, and write pattern. The spread between device classes alone can be 100×. Group commit amortizes single-fsync cost across many transactions — standard practice in PostgreSQL, MySQL, and most OLTP databases. See Pillai et al., <a href="https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai">All File Systems Are Not Created Equal</a>, OSDI 2014, for how filesystem and protocol choices further multiply the difference.&#160;<a href="#fnref:6" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:7">
<p>CI overlap is a conservative heuristic: if 99.9% CIs overlap, the difference is unresolved, not disproven, because two intervals can overlap while a formal test still detects a difference. Conversely, non-overlapping CIs don&rsquo;t guarantee practical significance. A full treatment requires effect size and formal statistical tests (Welch&rsquo;s t-test, Mann-Whitney U).&#160;<a href="#fnref:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>]]></content:encoded>
    </item>
  </channel>
</rss>
