<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Statistics on 0x3F</title>
    <link>https://0x3f.blog/tags/statistics/</link>
    <description>Recent content in Statistics on 0x3F</description>
    <generator>Hugo -- 0.152.2</generator>
    <language>en-us</language>
    <lastBuildDate>Fri, 06 Mar 2026 18:00:00 +0100</lastBuildDate>
    <atom:link href="https://0x3f.blog/tags/statistics/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>First Things First: Statistics That Matter</title>
      <link>https://0x3f.blog/posts/first-things-first-statistics-that-matter/</link>
      <pubDate>Fri, 06 Mar 2026 18:00:00 +0100</pubDate>
      <guid>https://0x3f.blog/posts/first-things-first-statistics-that-matter/</guid>
      <description>3% slower — or noise? The error bar stretches 10%. Confidence intervals, effect size, and micro vs macro: the three layers between a number and a conclusion.</description>
      <content:encoded><![CDATA[<h2 id="3-slower-ship-it">3% slower. Ship it.</h2>
<p>Two filter variants over 20 million integers. Five benchmark iterations. FilterTernary: 26.11 ms. FilterBranch: 25.30 ms. The ternary is 3% slower. PR description writes itself. Merge. Deploy.</p>
<p>Next day, rollback. Regression in production — on hardware where the difference vanishes, on data where it reverses.</p>
<p>Design fixed. Environment defended. Data collected honestly. The benchmark is solid. The number is real. The interpretation is not.</p>
<p>All code in this post: clone, build, run. Numbers below were measured on dual Xeon E5-2697 v2 using BenchmarkDotNet v0.14.0, pinned to a single NUMA node — run the companion code on your hardware for your own results. Different machine, different numbers.</p>
<p><em>Convention: charts use milliseconds unless otherwise noted; tables reproduce BenchmarkDotNet output. BDN&rsquo;s Error column is the half-width of the 99.9% confidence interval.</em></p>
<hr>
<h2 id="the-number-is-the-answer">The number is the answer</h2>
<div class="highlight"><pre data-lang="csharp"><code>[Benchmark(Baseline = true)]
public long FilterBranch()
{
    long sum = 0;
    int[] data = _data;
    for (int i = 0; i &lt; data.Length; i&#43;&#43;)
    {
        if (data[i] &gt; 0)
            sum &#43;= data[i];
    }
    return sum;
}

[Benchmark]
public long FilterTernary()
{
    long sum = 0;
    int[] data = _data;
    for (int i = 0; i &lt; data.Length; i&#43;&#43;)
    {
        int v = data[i];
        sum &#43;= v &gt; 0 ? v : 0;
    }
    return sum;
}</code></pre></div>
<p><small>Two filter variants over 20M integers (~95% positive). Full source in companion code.</small></p>
<p>Every benchmarking tutorial ends here: compare two means, pick the lower one. FilterTernary = 26.11 ms, FilterBranch = 25.30 ms — 3% difference. The ternary loses.</p>
<p>How many times did you run it?</p>
<hr>
<h2 id="layer-1--confidence-intervals-eat-your-win">Layer 1 — Confidence intervals eat your win</h2>
<p>BenchmarkDotNet doesn&rsquo;t just give you a mean. It gives you Mean ± Error — where Error is the half-width of the 99.9% confidence interval, computed using a Student&rsquo;s t-distribution with n-1 degrees of freedom.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>The 5-iteration run — the one that said &ldquo;3% slower&rdquo;:</p>
<div class="highlight"><pre data-lang=""><code>| Method        | N        | Mean     | Error    | StdDev   | Ratio | RatioSD |
|-------------- |--------- |---------:|---------:|---------:|------:|--------:|
| FilterBranch  | 20000000 | 25.30 ms | 0.408 ms | 0.063 ms |  1.00 |    0.00 |
| FilterTernary | 20000000 | 26.11 ms | 2.624 ms | 0.681 ms |  1.03 |    0.02 |</code></pre></div>
<p>The 99.9% CI for FilterBranch: 25.30 ± 0.408 ms → <strong>[24.89, 25.71]</strong>. For FilterTernary: 26.11 ± 2.624 ms → <strong>[23.49, 28.73]</strong>. FilterBranch&rsquo;s entire range sits inside FilterTernary&rsquo;s confidence interval. The &ldquo;3% slower&rdquo; could be a scheduling hiccup. Five iterations cannot tell you either way.</p>
<p>You know this from <a href="/posts/first-things-first-why-benchmarks-lie/">Part 1</a>. Overlapping CIs, unresolved difference. Run more iterations.</p>
<p>Twenty iterations:</p>
<div class="highlight"><pre data-lang=""><code>| Method        | N        | Mean     | Error    | StdDev   | Ratio |
|-------------- |--------- |---------:|---------:|---------:|------:|
| FilterBranch  | 20000000 | 25.25 ms | 0.173 ms | 0.177 ms |  1.00 |
| FilterTernary | 20000000 | 25.64 ms | 0.111 ms | 0.109 ms |  1.02 |</code></pre></div>
<p>The 99.9% CI for FilterBranch: <strong>[25.08, 25.42]</strong>. For FilterTernary: <strong>[25.53, 25.75]</strong>. No overlap. A manual Welch t-test on this data gives p &lt; 0.001.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> The difference is real.</p>
<p>FilterTernary is 2% slower. The 5-iteration run saw the right direction but had no basis to trust it — the CI was so wide it could not separate signal from noise.</p>
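<p>The overlap check is easy to script. A minimal sketch, assuming Mean and Error are copied by hand from the two BDN summary tables; <code>Interval</code>, <code>FromBdn</code> and <code>CiOverlapDemo</code> are illustrative names, not BenchmarkDotNet APIs:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

// Illustrative type, not part of BenchmarkDotNet.
public readonly record struct Interval(double Lo, double Hi)
{
    // BDN&#39;s Error column is the CI half-width, so the interval is Mean ± Error.
    public static Interval FromBdn(double mean, double error) =&gt;
        new Interval(mean - error, mean &#43; error);

    // Two closed intervals overlap when each one starts before the other ends.
    public bool Overlaps(Interval other) =&gt; Lo &lt;= other.Hi &amp;&amp; other.Lo &lt;= Hi;
}

public static class CiOverlapDemo
{
    public static void Main()
    {
        // 5 iterations (values from the first table, in ms)
        var branch5 = Interval.FromBdn(25.30, 0.408);
        var ternary5 = Interval.FromBdn(26.11, 2.624);

        // 20 iterations (values from the second table, in ms)
        var branch20 = Interval.FromBdn(25.25, 0.173);
        var ternary20 = Interval.FromBdn(25.64, 0.111);

        Console.WriteLine($&#34;5 iterations overlap:  {branch5.Overlaps(ternary5)}&#34;);   // True
        Console.WriteLine($&#34;20 iterations overlap: {branch20.Overlaps(ternary20)}&#34;); // False
    }
}</code></pre></div>
<p><small>Illustrative sketch, not from the companion code. Overlap is a screening heuristic only; it does not replace a formal test.</small></p>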
<div class="chart-container">
  <canvas id="chart-1735f2ba415780ef75e867f7bf9e0cef"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-1735f2ba415780ef75e867f7bf9e0cef').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['5 iterations', '20 iterations'],
    datasets: [
      {
        label: 'FilterBranch',
        data: [25.30, 25.25],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'FilterTernary',
        data: [26.11, 25.64],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'More iterations, narrower confidence intervals' },
      subtitle: { display: true, text: '5 iter: CIs overlap — inconclusive. 20 iter: CIs separate — confirmed.' },
      legend: { display: true }
    },
    scales: {
      y: {
        title: { display: true, text: 'Time (ms)' },
        min: 22,
        max: 30
      }
    }
  },
  plugins: [{
    id: 'errBars',
    afterDraw: function(chart) {
      var ctx = chart.ctx;
      // 99.9% CI half-widths — BDN Error column directly (no conversion)
      // FilterBranch: 5iter=0.408, 20iter=0.173
      // FilterTernary: 5iter=2.624, 20iter=0.111
      var ci = [[0.408, 0.173], [2.624, 0.111]];
      chart.data.datasets.forEach(function(ds, di) {
        var meta = chart.getDatasetMeta(di);
        meta.data.forEach(function(bar, i) {
          var hw = ci[di][i];
          var yLo = chart.scales.y.getPixelForValue(ds.data[i] - hw);
          var yHi = chart.scales.y.getPixelForValue(ds.data[i] + hw);
          ctx.save();
          ctx.strokeStyle = '#cdd6f4';
          ctx.lineWidth = 2;
          ctx.beginPath(); ctx.moveTo(bar.x, yLo); ctx.lineTo(bar.x, yHi); ctx.stroke();
          ctx.beginPath(); ctx.moveTo(bar.x - 6, yLo); ctx.lineTo(bar.x + 6, yLo); ctx.stroke();
          ctx.beginPath(); ctx.moveTo(bar.x - 6, yHi); ctx.lineTo(bar.x + 6, yHi); ctx.stroke();
          ctx.restore();
        });
      });
    }
  }]
}
);
  })();
</script>

<p>The Error on FilterTernary dropped from ±2.6 ms to ±0.1 ms. A factor of about 24. More iterations, sure. But .NET&rsquo;s JIT compiles in tiers: Tier-0 (quick, unoptimized) on first calls, Tier-1 (full optimization) after enough invocations. If BDN&rsquo;s warmup didn&rsquo;t fully promote both methods, the 5-iteration run might have caught Tier-0 code while the 20-iteration run measured Tier-1. Different machine code, different variance profile.</p>
<p>Worth checking. Expand the ternary first:</p>
<div class="highlight"><pre data-lang="csharp"><code>// FilterBranch
if (data[i] &gt; 0)
    sum &#43;= data[i];

// FilterTernary — expand v &gt; 0 ? v : 0
if (v &gt; 0) sum &#43;= v;
else        sum &#43;= 0;</code></pre></div>
<p>The branch skips. The ternary always adds — even zero. Structurally different operations.</p>
<p><code>[DisassemblyDiagnoser]</code> on the benchmark class dumps the native code (<a href="/posts/first-things-first-enemies-of-measurement/">Enemy 6</a> introduced the tool). Run the benchmark, then check <code>BenchmarkDotNet.Artifacts/results/*-asm.md</code>. Five iterations:</p>
<div class="highlight"><pre data-lang="nasm"><code>; FilterBranch — 54 bytes
M00_L00:
       mov       edi,[rcx]        ; load data[i]
       test      edi,edi          ; data[i] &gt; 0?
       jle       short M00_L01    ; skip if not
       movsxd    rdi,edi          ; sign-extend to 64-bit
       add       rax,rdi          ; sum &#43;= data[i]
M00_L01:
       add       rcx,4            ; i&#43;&#43;
       dec       edx
       jne       short M00_L00

; FilterTernary — 58 bytes
M00_L00:
       mov       edi,[rcx]        ; load v = data[i]
       test      edi,edi          ; v &gt; 0?
       jle       short M00_L03    ; if not, jump to zero path
M00_L01:
       movsxd    rdi,edi          ; sign-extend
       add       rax,rdi          ; sum &#43;= v (or sum &#43;= 0)
       add       rcx,4            ; i&#43;&#43;
       dec       edx
       jne       short M00_L00
; ...
M00_L03:
       xor       edi,edi          ; v = 0
       jmp       short M00_L01    ; jump back to add</code></pre></div>
<p>Twenty iterations:</p>
<div class="highlight"><pre data-lang="nasm"><code>; FilterBranch — 54 bytes
M00_L00:
       mov       edi,[rcx]
       test      edi,edi
       jle       short M00_L01
       movsxd    rdi,edi
       add       rax,rdi
M00_L01:
       add       rcx,4
       dec       edx
       jne       short M00_L00

; FilterTernary — 58 bytes
M00_L00:
       mov       edi,[rcx]
       test      edi,edi
       jle       short M00_L03
M00_L01:
       movsxd    rdi,edi
       add       rax,rdi
       add       rcx,4
       dec       edx
       jne       short M00_L00
; ...
M00_L03:
       xor       edi,edi
       jmp       short M00_L01</code></pre></div>
<p>Identical machine code. Both runs. The Error dropped because more iterations and lower observed variance both narrowed the confidence interval. BDN&rsquo;s Error is t(0.0005, n−1) × StdDev / √n. StdDev for FilterTernary fell from 0.681 ms to 0.109 ms (about 6×), and the larger sample brought a smaller t-value and a larger √n (together roughly another 4×). The lower StdDev did most of the work.</p>
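<p>To see how the two factors split, the Error formula can be evaluated by hand. A sketch under stated assumptions: 8.610 (4 degrees of freedom) and 4.073 (15 degrees of freedom) are two-sided 99.9% critical values from a standard t-table, and n = 16 for the 20-iteration run is inferred from the reported Error/StdDev ratio (BDN removes outliers), not read from the BDN log. <code>ErrorDecomposition</code> and <code>HalfWidth</code> are illustrative names:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class ErrorDecomposition
{
    // Error = t(0.0005, n-1) * StdDev / sqrt(n).
    // tCritical must be the t-table value for n-1 degrees of freedom.
    public static double HalfWidth(double tCritical, double stdDev, int n) =&gt;
        tCritical * stdDev / Math.Sqrt(n);

    public static void Main()
    {
        // FilterTernary, 5 iterations: n = 5, t-table value for df = 4.
        double error5 = HalfWidth(8.610, 0.681, 5);

        // FilterTernary, 20 iterations: n = 16 after outlier removal (assumed), df = 15.
        double error20 = HalfWidth(4.073, 0.109, 16);

        // How much of the drop comes from the data spread vs the sample-size terms.
        double stdDevFactor = 0.681 / 0.109;
        double sampleFactor = (8.610 / Math.Sqrt(5)) / (4.073 / Math.Sqrt(16));

        Console.WriteLine($&#34;Error  5 iter: {error5:F3} ms&#34;);
        Console.WriteLine($&#34;Error 20 iter: {error20:F3} ms&#34;);
        Console.WriteLine($&#34;StdDev factor: {stdDevFactor:F2}x, sample-size factor: {sampleFactor:F2}x&#34;);
    }
}</code></pre></div>
<p><small>Illustrative sketch, not from the companion code. Under these assumptions it reproduces the Error column to within rounding (about 2.62 ms and 0.111 ms) and splits the drop into roughly 6.2× from StdDev and 3.8× from the t-value and √n together.</small></p>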
<p>A number without error bars is an opinion. Five iterations produced CIs so wide that either outcome fit the data. Twenty produced CIs narrow enough to separate signal from noise — not certainty, but 99.9% confidence that FilterBranch is faster on this hardware. If you had shipped after five, you&rsquo;d have deployed a guess as a conclusion.</p>
<p>CI answers one question: does a difference exist? It says nothing about whether the difference matters.</p>
<hr>
<h2 id="layer-2--effect-size-when-significant-doesnt-mean-meaningful">Layer 2 — Effect size: when &ldquo;significant&rdquo; doesn&rsquo;t mean &ldquo;meaningful&rdquo;</h2>
<p>The 20-iteration result says FilterTernary is 2% slower. The CIs don&rsquo;t overlap. The difference is statistically real. But it amounts to 0.4 ms on a 25 ms operation over 20 million integers. Is that worth changing the code?</p>
<p>Statistical significance asks <em>does a difference exist?</em> Practical significance asks <em>does it matter?</em> BDN answers the first. You answer the second.</p>
<p>Cohen&rsquo;s d — the standardized effect size — measures the distance between two means in units of the pooled standard deviation:<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<blockquote>
<p>d = |mean_1 - mean_2| / pooled SD</p>
</blockquote>
<div class="highlight"><pre data-lang="csharp"><code>public static double CohensD(double mean1, double stdDev1, double mean2, double stdDev2)
{
    double pooledSd = Math.Sqrt((stdDev1 * stdDev1 &#43; stdDev2 * stdDev2) / 2.0);
    if (pooledSd == 0) return 0;
    return Math.Abs(mean1 - mean2) / pooledSd;
}</code></pre></div>
<p><small>Cohen&rsquo;s d computation — full source in <code>Analysis/StatisticalReport.cs</code>.</small></p>
<p>Cohen&rsquo;s d for FilterBranch vs FilterTernary: |25.25 - 25.64| / sqrt((0.177^2 + 0.109^2)/2) = 0.39 / 0.147 = <strong>2.65</strong>. By the standard thresholds (0.2 = small, 0.5 = medium, 0.8 = large), that&rsquo;s a &ldquo;large&rdquo; effect.</p>
<p>But 2.65 for a 2% difference? Something is off.</p>
<h3 id="the-threshold-trap">The threshold trap</h3>
<p>Cohen&rsquo;s d thresholds were calibrated for psychology experiments where within-group variance is naturally high. BenchmarkDotNet&rsquo;s within-run variance is very low in controlled microbenchmarks — sub-1% coefficient of variation for compute-bound loops. When the denominator (pooled SD) is tiny, even a trivial mean difference produces a massive d.</p>
<p>Three pairs from the companion code:</p>
<table>
  <thead>
      <tr>
          <th>Pair</th>
          <th style="text-align: right">Ratio</th>
          <th style="text-align: right">Practical difference</th>
          <th style="text-align: right">Cohen&rsquo;s d</th>
          <th>&ldquo;Interpretation&rdquo;</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FilterBranch vs FilterTernary</td>
          <td style="text-align: right">1.02</td>
          <td style="text-align: right">2%</td>
          <td style="text-align: right">2.65</td>
          <td>&ldquo;large&rdquo;</td>
      </tr>
      <tr>
          <td>SumArray vs SumSpan</td>
          <td style="text-align: right">1.01</td>
          <td style="text-align: right">0.5%</td>
          <td style="text-align: right">1.98</td>
          <td>&ldquo;large&rdquo;</td>
      </tr>
      <tr>
          <td>SearchLinear vs SearchBinary</td>
          <td style="text-align: right">0.001</td>
          <td style="text-align: right">1,071x</td>
          <td style="text-align: right">368</td>
          <td>&ldquo;large&rdquo;</td>
      </tr>
  </tbody>
</table>
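<p>The d column can be reproduced from the Mean and StdDev values in this post&rsquo;s BDN tables. A sketch that repeats the <code>CohensD</code> helper shown earlier so it runs on its own; the <code>EffectSizeTable</code> wrapper is an illustrative name:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class EffectSizeTable
{
    // Same formula as the CohensD helper above (equal-weight pooled SD).
    public static double CohensD(double mean1, double stdDev1, double mean2, double stdDev2)
    {
        double pooledSd = Math.Sqrt((stdDev1 * stdDev1 &#43; stdDev2 * stdDev2) / 2.0);
        if (pooledSd == 0) return 0;
        return Math.Abs(mean1 - mean2) / pooledSd;
    }

    public static void Main()
    {
        // Baseline first, variant second. Each pair uses its own table&#39;s unit (ms or us);
        // d and the ratio are unitless, so the units do not need to match across pairs.
        var pairs = new (string Name, double M1, double S1, double M2, double S2)[]
        {
            (&#34;FilterBranch vs FilterTernary&#34;, 25.25, 0.177, 25.64, 0.109),
            (&#34;SumArray vs SumSpan&#34;, 512.7, 1.19, 515.3, 1.42),
            (&#34;SearchLinear vs SearchBinary&#34;, 248303.3, 953.6, 231.8, 1.7),
        };

        foreach (var p in pairs)
        {
            double d = CohensD(p.M1, p.S1, p.M2, p.S2);
            double ratio = p.M2 / p.M1; // BDN-style Ratio: variant mean / baseline mean
            Console.WriteLine($&#34;{p.Name}: d = {d:F2}, ratio = {ratio:F3}&#34;);
        }
    }
}</code></pre></div>
<p><small>Illustrative sketch. d comes out near 2.65, 1.98 and 368, while the ratios stay close to 1.02, 1.01 and 0.004 of a percent scale apart: the ratio separates the pairs, the threshold label does not.</small></p>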
<p>All three &ldquo;large&rdquo; by Cohen&rsquo;s thresholds. Only one is a meaningful optimization. Wittgenstein (1953): meaning is use — a word means what it means in the language game where it was coined. Cohen&rsquo;s thresholds were coined in a game where within-group variance is high and effect sizes are modest. Microbenchmarking is a different game — sub-1% coefficient of variation, deterministic loops, controlled environments. &ldquo;Large&rdquo; means something in psychology. The standard interpretation becomes misleading when BDN&rsquo;s precision makes the denominator vanishingly small. A 0.5% difference and a 1,071x difference land in the same bucket.</p>
<p>Popper (1934): a hypothesis survives by resisting falsification, not by accumulating confirmation. &ldquo;3% faster&rdquo; is a hypothesis. Non-overlapping CIs survived the first test — the difference exists. But Cohen&rsquo;s d at 2.65 for a 2% change is the hypothesis flattering itself. The effect size, on BDN&rsquo;s terrain, does not survive scrutiny. Seek the conditions under which the claim fails, not the ones where it holds.</p>
<p>For microbenchmarks, <strong>rely primarily on BDN&rsquo;s Ratio column</strong> rather than Cohen&rsquo;s d. Ratio ~ 1.00 means &ldquo;no practical difference.&rdquo; Ratio ~ 0.001 means &ldquo;algorithmic change.&rdquo; Whether 2% matters depends on context — a hot loop called billions of times, or a function called once per request. Define your threshold before you run.</p>
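<p>One way to make &ldquo;define your threshold before you run&rdquo; concrete is to fix the threshold as a constant and compare the Ratio against it. A minimal sketch; the <code>Verdict</code> enum, the <code>Classify</code> helper and the 5% threshold are illustrative choices, not BDN features and not a recommended value:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public enum Verdict { NoPracticalDifference, Faster, Slower }

public static class PracticalSignificance
{
    // ratio:     BDN&#39;s Ratio column (variant mean / baseline mean).
    // threshold: smallest relative change you decided to care about, e.g. 0.05 for 5%.
    public static Verdict Classify(double ratio, double threshold)
    {
        if (ratio &lt;= 0) throw new ArgumentOutOfRangeException(nameof(ratio));
        if (threshold &lt; 0) throw new ArgumentOutOfRangeException(nameof(threshold));

        double change = ratio - 1.0;
        if (Math.Abs(change) &lt; threshold) return Verdict.NoPracticalDifference;
        return change &lt; 0 ? Verdict.Faster : Verdict.Slower;
    }

    public static void Main()
    {
        const double Threshold = 0.05; // hypothetical, fixed before benchmarking

        Console.WriteLine(Classify(1.02, Threshold));  // FilterTernary: NoPracticalDifference
        Console.WriteLine(Classify(1.01, Threshold));  // SumSpan: NoPracticalDifference
        Console.WriteLine(Classify(0.001, Threshold)); // SearchBinary: Faster
    }
}</code></pre></div>
<p><small>Illustrative sketch. The point is the order of operations: the threshold is chosen first, the Ratio is read second.</small></p>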
<h3 id="two-extremes">Two extremes</h3>
<p><strong>Small practical effect</strong> — array indexing vs Span indexing over 1M integers:</p>
<div class="highlight"><pre data-lang=""><code>| Method   | Categories  | N       | Mean     | Error   | StdDev  | Ratio |
|--------- |------------ |-------- |---------:|--------:|--------:|------:|
| SumArray | SmallEffect | 1000000 | 512.7 us | 1.16 us | 1.19 us |  1.00 |
| SumSpan  | SmallEffect | 1000000 | 515.3 us | 1.28 us | 1.42 us |  1.01 |</code></pre></div>
<p>Ratio = 1.01. The JIT produces nearly identical code for both — bounds-check elimination applies to <code>int[]</code> and <code>ReadOnlySpan&lt;int&gt;</code> alike on .NET 9. The 2.6 us difference (0.5%) is likely real — the CIs don&rsquo;t overlap, which is a conservative indicator — but not worth a code change.</p>
<p><strong>Large practical effect</strong> — linear search vs binary search over 1M integers:</p>
<div class="highlight"><pre data-lang=""><code>| Method       | Categories  | N       | Mean         | Error    | StdDev   | Ratio |
|------------- |------------ |-------- |-------------:|---------:|---------:|------:|
| SearchLinear | LargeEffect | 1000000 | 248,303.3 us | 928.6 us | 953.6 us | 1.000 |
| SearchBinary | LargeEffect | 1000000 |     231.8 us |   1.5 us |   1.7 us | 0.001 |</code></pre></div>
<p>Ratio = 0.001. O(n) vs O(log n). An algorithmic change — not a JIT quirk, not a cache alignment artifact. 1,071x faster on this hardware. The algorithmic advantage holds on any platform with sorted data, though the exact multiplier will vary.</p>
<div class="chart-container">
  <canvas id="chart-8acdec6e1897b21786f81e66d908845e"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-8acdec6e1897b21786f81e66d908845e').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: [['Array vs Span', '(0.5% difference)'], ['Linear vs Binary', '(1,071× difference)']],
    datasets: [
      {
        label: 'SumArray / SearchLinear',
        data: [0.513, 248.3],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'SumSpan / SearchBinary',
        data: [0.515, 0.232],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Small vs large practical effect' },
      subtitle: { display: true, text: 'Both "statistically significant" — only one worth shipping' },
      legend: { display: true }
    },
    scales: {
      y: {
        type: 'logarithmic',
        title: { display: true, text: 'Time (ms) — log scale' },
        min: 0.1,
        max: 1000
      }
    }
  }
}
);
  })();
</script>

<p>A number with error bars but no effect size is only half an answer.</p>
<hr>
<h2 id="layer-3--micro-vs-macro-right-question-wrong-scale">Layer 3 — Micro vs macro: right question, wrong scale</h2>
<p>A microbenchmark isolates a function. A macrobenchmark places it inside a pipeline. They answer different questions — and the answers disagree.</p>
<div class="highlight"><pre data-lang="csharp"><code>// Micro: isolated lookup — Dictionary vs linear search over 10,000 elements
[BenchmarkCategory(&#34;Micro&#34;)]
[Benchmark(Baseline = true)]
public int LookupLinear()
{
    int found = 0;
    for (int i = 0; i &lt; _searchKeys.Length; i&#43;&#43;)
    {
        if (Array.IndexOf(_data, _searchKeys[i]) &gt;= 0)
            found&#43;&#43;;
    }
    return found;
}

[BenchmarkCategory(&#34;Micro&#34;)]
[Benchmark]
public int LookupDictionary()
{
    int found = 0;
    for (int i = 0; i &lt; _searchKeys.Length; i&#43;&#43;)
    {
        if (_dict.ContainsKey(_searchKeys[i]))
            found&#43;&#43;;
    }
    return found;
}</code></pre></div>
<p><small>Microbenchmark — isolated lookup comparison over 200 search keys. Full source in companion code.</small></p>
<div class="highlight"><pre data-lang=""><code>| Method           | Categories | Mean       | Error    | StdDev   | Ratio |
|----------------- |----------- |-----------:|---------:|---------:|------:|
| LookupLinear     | Micro      | 412.089 us | 1.609 us | 1.788 us | 1.000 |
| LookupDictionary | Micro      |   1.571 us | 0.012 us | 0.014 us | 0.004 |</code></pre></div>
<p>Dictionary is <strong>262x faster</strong>. Ship it?</p>
<p>The lookup lives inside a pipeline:</p>
<div class="highlight"><pre data-lang="csharp"><code>[Benchmark(Baseline = true)]
public long PipelineLinear()
{
    long v = ValidateArray(_workload);     // ~40% — sequential scan, 3M elements
    long t = PolynomialTransform(_workload); // ~40% — multiply/add/xor, 3M elements
    int  l = LookupAllLinear(_data, _searchKeys); // ~6% — 200 keys × Array.IndexOf
    long a = Aggregate(_workload);          // ~15% — weighted sum, stride 4
    return v ^ t ^ l ^ a;
}

[Benchmark]
public long PipelineDictionary()
{
    long v = ValidateArray(_workload);
    long t = PolynomialTransform(_workload);
    int  l = LookupAllDictionary(_searchKeys); // Dictionary.ContainsKey
    long a = Aggregate(_workload);
    return v ^ t ^ l ^ a;
}</code></pre></div>
<p><small>Only the lookup step changes. Full source in companion code.</small></p>
<p>94% of the work doesn&rsquo;t change regardless of lookup strategy.</p>
<div class="highlight"><pre data-lang=""><code>| Method             | Categories | Mean         | Error     | StdDev    | Ratio |
|------------------- |----------- |-------------:|----------:|----------:|------:|
| PipelineLinear     | Macro      | 7,181.115 us | 59.636 us | 66.285 us |  1.00 |
| PipelineDictionary | Macro      | 6,611.982 us | 11.094 us | 11.871 us |  0.92 |</code></pre></div>
<p>Pipeline with Dictionary is <strong>8% faster</strong>. Not 262x. Eight percent.</p>
<div class="chart-container">
  <canvas id="chart-483253b27459f166f6cda9715e4b2e69"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-483253b27459f166f6cda9715e4b2e69').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['Micro: Lookup only', 'Macro: Full pipeline'],
    datasets: [
      {
        label: 'Linear (baseline)',
        data: [0.412, 7.181],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'Dictionary (variant)',
        data: [0.002, 6.612],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Micro vs macro — isolated speedup vs end-to-end impact' },
      subtitle: { display: true, text: '262× micro speedup on 6% of pipeline → 8% end-to-end' },
      legend: { display: true }
    },
    scales: {
      y: {
        title: { display: true, text: 'Time (ms)' }
      }
    }
  }
}
);
  })();
</script>

<p>The lookup consumes 412 us out of 7,181 us total: 5.7% of the pipeline. For a 262x speedup on that 5.7%, Amdahl&rsquo;s law<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> predicts an overall speedup of 1 / (1 - 0.057 + 0.057/262) = <strong>1.06x</strong>, which is about 5.7% less run time (Ratio ≈ 0.94). The measured Ratio is 0.92, about 8% less. The pipeline saves roughly 569 us, more than the 412 us the lookup costs in isolation, so cache effects from eliminating the linear scan likely benefit subsequent pipeline steps.</p>
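<p>The Amdahl estimate can be checked numerically. A sketch assuming the isolated micro timings carry over unchanged into the pipeline, which the cache effects mentioned above suggest is only approximately true; the <code>Amdahl</code> class and <code>Speedup</code> helper are illustrative names:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class Amdahl
{
    // Overall speedup when a fraction p of the run time is accelerated by a factor s.
    public static double Speedup(double p, double s) =&gt; 1.0 / ((1.0 - p) &#43; p / s);

    public static void Main()
    {
        double pipelineUs = 7181.115;  // PipelineLinear mean (macro table)
        double lookupUs = 412.089;     // LookupLinear mean (micro table)
        double dictLookupUs = 1.571;   // LookupDictionary mean (micro table)

        double p = lookupUs / pipelineUs;    // fraction of the pipeline spent in the lookup
        double s = lookupUs / dictLookupUs;  // local speedup of the lookup
        double overall = Speedup(p, s);
        double predictedUs = pipelineUs / overall;

        Console.WriteLine($&#34;p = {100 * p:F1}%, s = {s:F0}x&#34;);
        Console.WriteLine($&#34;Predicted speedup: {overall:F3}x&#34;);
        Console.WriteLine($&#34;Predicted time:    {predictedUs:F0} us&#34;);
        Console.WriteLine(&#34;Measured time:     6612 us&#34;);
    }
}</code></pre></div>
<p><small>Illustrative sketch. The prediction lands near 6,771 us against a measured 6,612 us, so the pipeline gains more than the lookup alone can explain.</small></p>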
<p>Micro answers <em>&ldquo;is this function faster?&rdquo;</em> Macro answers <em>&ldquo;will the user notice?&rdquo;</em></p>
<p>Baudrillard (1981): in the fourth phase of the simulacrum, the image bears no relation to any reality whatever. The microbenchmark says 262x. The macrobenchmark says 8%. Both have error bars. Both passed statistical tests. Both are internally consistent. Neither describes what the user experiences. Two maps orbiting each other, each valid within its own coordinate system, each detached from the territory it claims to represent. The micro number didn&rsquo;t lie. The macro number didn&rsquo;t lie. The lie was believing either one alone was the answer.</p>
<p>Eight percent might be worth it — or might not, depending on whether the pipeline runs once per request or once per hour. The microbenchmark alone cannot tell you.</p>
<hr>
<h2 id="before-you-ship-the-number">Before you ship the number</h2>
<table>
  <thead>
      <tr>
          <th>Check</th>
          <th>Question</th>
          <th>If no&hellip;</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Iterations</td>
          <td>Did you run enough iterations? (&gt;= 15 in this setup, configured via SimpleJob)</td>
          <td>Your CIs are too wide — the result might be noise (see Layer 1)</td>
      </tr>
      <tr>
          <td>CI overlap</td>
          <td>Do the 99.9% CIs (BDN Error) <em>not</em> overlap?</td>
          <td>Overlapping CIs suggest noise — but non-overlap is conservative, not definitive. Confirm with a formal test (Welch / Mann-Whitney)</td>
      </tr>
      <tr>
          <td>Practical size</td>
          <td>Is the Ratio meaningfully different from 1.00? Does it exceed your SESOI?</td>
          <td>Statistically real but practically irrelevant — move on</td>
      </tr>
      <tr>
          <td>Micro = Macro</td>
          <td>Does the micro speedup translate to end-to-end improvement?</td>
          <td>The bottleneck is elsewhere — profile before optimizing</td>
      </tr>
      <tr>
          <td>Reproducible</td>
          <td>Same result on different hardware / OS / runtime?</td>
          <td>Environment-dependent — see <a href="/posts/first-things-first-enemies-of-measurement/">Part 2</a></td>
      </tr>
  </tbody>
</table>
<p>Three rules:</p>
<ol>
<li>
<p><strong>Always report confidence intervals.</strong> A mean without CI is a claim, not evidence. BenchmarkDotNet provides the Error column (99.9% CI half-width) — use it. CI overlap is a useful quick screening heuristic: overlapping CIs suggest noise, non-overlapping CIs suggest a real difference — but neither is definitive. Overlapping CIs can still hide a significant difference, and non-overlapping CIs are a conservative rule, not proof. For a formal conclusion, use a statistical test (Welch&rsquo;s t-test, Mann-Whitney U). If you only ran 5 iterations, run more.</p>
</li>
<li>
<p><strong>Distinguish statistical from practical significance.</strong> Non-overlapping CIs mean the difference exists. They don&rsquo;t mean it matters. Define a SESOI (smallest effect size of interest) before running the benchmark — the minimum improvement that justifies the code change. BDN&rsquo;s Ratio column tells you the proportional difference: if it doesn&rsquo;t cross your SESOI threshold, the result is real but not actionable.</p>
</li>
<li>
<p><strong>Confirm micro with macro.</strong> A microbenchmark shows a function is faster in isolation. A macrobenchmark shows the user will notice. Run both, or explain why you didn&rsquo;t. A 262x micro speedup sounds compelling until the full pipeline reduces it to 8%.</p>
</li>
</ol>
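<p>Rule 1 points to Welch&rsquo;s t-test for a formal conclusion. A minimal sketch of the test statistic, computed from summary values only. The sample sizes n = 16 and n = 17 are assumptions (iterations left after outlier removal, inferred from the Error/StdDev ratios); read the real values from your own BDN log. <code>WelchTest</code> and <code>Compute</code> are illustrative names, not the BDN implementation:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class WelchTest
{
    // Welch&#39;s t statistic and Welch-Satterthwaite degrees of freedom
    // for two samples summarized by mean, standard deviation and size.
    public static (double T, double Df) Compute(
        double mean1, double sd1, int n1,
        double mean2, double sd2, int n2)
    {
        double v1 = sd1 * sd1 / n1; // squared standard error of mean 1
        double v2 = sd2 * sd2 / n2; // squared standard error of mean 2

        double t = (mean1 - mean2) / Math.Sqrt(v1 &#43; v2);
        double df = (v1 &#43; v2) * (v1 &#43; v2)
                  / (v1 * v1 / (n1 - 1) &#43; v2 * v2 / (n2 - 1));
        return (t, df);
    }

    public static void Main()
    {
        // 20-iteration run: FilterTernary vs FilterBranch (ms). Sample sizes are assumed.
        var (t, df) = Compute(25.64, 0.109, 16, 25.25, 0.177, 17);
        Console.WriteLine($&#34;t = {t:F2}, df = {df:F1}&#34;);

        // To get a p-value, compare |t| with the two-sided critical value for df
        // from a t-table or a statistics library.
    }
}</code></pre></div>
<p><small>Illustrative sketch. With these assumed sample sizes t is about 7.7 at roughly 27 degrees of freedom, well beyond the two-sided 0.1% critical value of about 3.7, which is consistent with the p &lt; 0.001 reported in Layer 1.</small></p>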
<h3 id="run-it-yourself">Run it yourself</h3>
<div class="highlight"><pre data-lang="bash"><code>git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/statistics-that-matter

# All benchmarks (20 iterations, ~3 min)
# Pin to a single NUMA node to eliminate cross-socket variance
taskset -c 0-11 dotnet run -c Release

# Individual scenarios
taskset -c 0-11 dotnet run -c Release -- --filter &#39;*NoisyComparison*&#39;
taskset -c 0-11 dotnet run -c Release -- --filter &#39;*EffectSizeDemo*&#39;
taskset -c 0-11 dotnet run -c Release -- --filter &#39;*MicroVsMacro*&#39;

# Reproduce the CI overlap demo (5 iterations — wide error bars)
taskset -c 0-11 dotnet run -c Release -- --filter &#39;*NoisyComparison*&#39; --iterationCount 5 --warmupCount 3</code></pre></div>
<hr>
<h2 id="benchmark-environment">Benchmark environment</h2>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CPU</td>
          <td>2x Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)</td>
      </tr>
      <tr>
          <td>RAM</td>
          <td>~115 GB DDR3-1866 (quad-channel per socket)</td>
      </tr>
      <tr>
          <td>OS</td>
          <td>Fedora Linux 42 (kernel 6.17)</td>
      </tr>
      <tr>
          <td>Runtime</td>
          <td>.NET 9.0.11 (RyuJIT AVX)</td>
      </tr>
      <tr>
          <td>SDK</td>
          <td>.NET SDK 10.0.102</td>
      </tr>
      <tr>
          <td>BenchmarkDotNet</td>
          <td>v0.14.0</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>Server GC, Concurrent (BDN enables Server GC in benchmark processes by default; host process uses Workstation)</td>
      </tr>
      <tr>
          <td>Pinning</td>
          <td><code>taskset -c 0-11</code> — single socket, physical cores only</td>
      </tr>
      <tr>
          <td>Job</td>
          <td>SimpleJob (WarmupCount=5, IterationCount=20)</td>
      </tr>
  </tbody>
</table>
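<p>The Job row maps onto BenchmarkDotNet attributes roughly as follows. A sketch, not the companion code: the class name <code>NoisyComparison</code> is taken from the <code>--filter</code> patterns above, while the data generator (seed, value range, exact share of positives) is an assumption chosen to match the &ldquo;~95% positive&rdquo; description. It needs the BenchmarkDotNet NuGet package:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[SimpleJob(warmupCount: 5, iterationCount: 20)] // matches the Job row
[DisassemblyDiagnoser]                           // writes the *-asm.md report
public class NoisyComparison
{
    private int[] _data = Array.Empty&lt;int&gt;();

    [GlobalSetup]
    public void Setup()
    {
        // Assumed data shape: 20M ints, roughly 95% positive, fixed seed.
        var rng = new Random(42);
        _data = new int[20_000_000];
        for (int i = 0; i &lt; _data.Length; i&#43;&#43;)
            _data[i] = rng.Next(100) &lt; 95 ? rng.Next(1, 1000) : -rng.Next(1, 1000);
    }

    [Benchmark(Baseline = true)]
    public long FilterBranch()
    {
        long sum = 0;
        int[] data = _data;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
        {
            if (data[i] &gt; 0)
                sum &#43;= data[i];
        }
        return sum;
    }
}

public static class Program
{
    // Forwards command-line arguments such as --filter to BenchmarkDotNet.
    public static void Main(string[] args) =&gt;
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}</code></pre></div>
<p><small>Illustrative sketch with one benchmark method only. Build in Release; BDN refuses to measure Debug builds by default.</small></p>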
<p><strong>Limitations:</strong> Single machine, dual-socket NUMA. All benchmarks pinned to one socket to eliminate cross-socket memory access and thread migration — without pinning, NoisyComparison variance doubles and absolute values shift by 5-10% between runs (<a href="/posts/first-things-first-enemies-of-measurement/">Part 2</a>). <code>EffectSizeDemo</code> uses sorted data for binary search — the algorithmic advantage is inherent, not hardware-dependent. <code>MicroVsMacro</code> pipeline proportions (40/40/6/15%) are approximate — workload ratios on your hardware will vary.</p>
<hr>
<p>Even with honest design, controlled environment, and correct measurement — the number still needs interpretation. Too few iterations and the CI swallows the difference. Tight CIs inflate Cohen&rsquo;s d into meaninglessness. Microbenchmarks promise 262x while the user sees 8%.</p>
<p>Hume (1739): no finite number of observations guarantees the next will conform. But the problem isn&rsquo;t too few observations — it&rsquo;s too much readiness to conclude. The confirmation doesn&rsquo;t come from the data. It comes from you. The number said &ldquo;3% slower&rdquo; and you heard &ldquo;regression&rdquo; because you were already looking for one. The CIs were wide enough to hold any story. You picked the one that matched.</p>
<p>&ldquo;3% faster&rdquo; is not a result. It&rsquo;s a hypothesis. Treat it like one — confirm it with sufficient iterations, assess practical significance, and validate it against end-to-end behavior. Or revert the merge.</p>
<hr>
<h2 id="further-reading">Further reading</h2>
<ul>
<li>Cohen, <em>Statistical Power Analysis for the Behavioral Sciences</em> (1988) — the standard reference for effect size. Defines Cohen&rsquo;s d and the small/medium/large thresholds.<sup id="fnref1:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></li>
<li>Georges, Buytaert, Eeckhout, <a href="https://dl.acm.org/doi/10.1145/1297027.1297033">Statistically Rigorous Java Performance Evaluation</a> (OOPSLA 2007) — how many iterations, which statistical tests, how to report. Directly applies to BDN methodology.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></li>
<li>Mytkowicz et al., <a href="https://dl.acm.org/doi/10.1145/1508244.1508275">Producing Wrong Data Without Doing Anything Obviously Wrong</a> (ASPLOS 2009) — measurement bias from setup sensitivity. Small environmental changes flip benchmark results.<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup></li>
<li>Kalibera &amp; Jones, <a href="https://dl.acm.org/doi/10.1145/2491894.2464160">Rigorous Benchmarking in Reasonable Time</a> (ISMM 2013) — how many iterations you actually need, steady-state detection, randomization.<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup></li>
<li>Andrey Akinshin, <em>Pro .NET Benchmarking</em> (Apress, 2019) — the BenchmarkDotNet author on statistics, confidence intervals, comparing results.<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup></li>
<li>BenchmarkDotNet documentation, <a href="https://benchmarkdotnet.org/articles/features/statistics.html">Statistics</a> and <a href="https://benchmarkdotnet.org/articles/samples/IntroStatisticalTesting.html">IntroStatisticalTesting</a> — Mann-Whitney, Welch&rsquo;s t-test, the Ratio column, CI computation.<sup id="fnref1:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></li>
<li>Brendan Gregg, <a href="https://www.brendangregg.com/activebenchmarking.html">Benchmarking Gone Wrong</a> (LISA 2014) — visual comparison, ignoring variance, cherry picking. Anti-patterns that match the &ldquo;3% slower&rdquo; scenario.<sup id="fnref:9"><a href="#fn:9" class="footnote-ref" role="doc-noteref">9</a></sup></li>
<li>Matt Dowle, <a href="https://h2oai.github.io/db-benchmark/">Database-like ops benchmark</a> — ratio-based comparison and reproducibility in practice.<sup id="fnref:10"><a href="#fn:10" class="footnote-ref" role="doc-noteref">10</a></sup></li>
<li>Gene M. Amdahl, <a href="https://dl.acm.org/doi/10.1145/1465482.1465560">Validity of the single processor approach to achieving large scale computing capabilities</a> (AFIPS 1967) — the law that explains why micro speedups vanish at macro scale.<sup id="fnref1:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>BenchmarkDotNet, <a href="https://benchmarkdotnet.org/articles/features/statistics.html">Statistics</a> and <a href="https://benchmarkdotnet.org/articles/samples/IntroStatisticalTesting.html">IntroStatisticalTesting</a>. Documents the Mann-Whitney U test and Welch&rsquo;s t-test implementations, the Ratio column semantics, and how the Error column is computed. Error is the half-width of the 99.9% confidence interval using a Student&rsquo;s t-distribution: Error = t(0.0005, n-1) &times; StdDev / sqrt(n), where n is the number of iterations after outlier removal and t(0.0005, n-1) is the critical value that leaves 0.05% in each tail at n-1 degrees of freedom. Because the t-distribution has heavier tails at small n, the critical value, and with it the Error column, grows when iterations are few, which makes CI overlap a useful (conservative) visual screening tool. For formal inference, prefer BDN&rsquo;s built-in Welch or Mann-Whitney tests.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Welch&rsquo;s t-test (unequal variances), computed manually from BDN&rsquo;s summary statistics. With the 20-iteration data: t = (25.25 - 25.64) / sqrt(0.177^2/n_1 + 0.109^2/n_2) = -8.4, df = 32 (Welch-Satterthwaite), p &lt; 0.001. BDN&rsquo;s own StatisticalTestColumn uses a Welch-based TOST equivalence test or Mann-Whitney — see <a href="https://benchmarkdotnet.org/articles/samples/IntroStatisticalTesting.html">IntroStatisticalTesting</a> for details on the built-in tests.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Jacob Cohen, <em>Statistical Power Analysis for the Behavioral Sciences</em>, 2nd ed. (Lawrence Erlbaum, 1988). The canonical source for effect size conventions. d = 0.2 (small), 0.5 (medium), 0.8 (large) — thresholds that became standard by widespread adoption, not mathematical derivation. Cohen himself warned against rigid cutoffs; in microbenchmarking, BDN&rsquo;s sub-1% CoV makes d misleadingly large for trivial differences.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Gene M. Amdahl, <a href="https://dl.acm.org/doi/10.1145/1465482.1465560">Validity of the single processor approach to achieving large scale computing capabilities</a>, AFIPS 1967. If the optimized component is fraction f of total runtime and S is the component speedup, the overall speedup is 1 / (1 - f + f/S). For f = 0.057 and S = 262: 1 / (1 - 0.057 + 0.057/262) = 1 / 0.9432 = 1.060, a 6.0% end-to-end improvement.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>Georges, Buytaert, Eeckhout, <a href="https://dl.acm.org/doi/10.1145/1297027.1297033">Statistically Rigorous Java Performance Evaluation</a>, OOPSLA 2007. Demonstrates that many published benchmarks use insufficient iterations and no confidence intervals. Proposes a methodology that BenchmarkDotNet later adopted — including the minimum iteration count that prevents the instability shown in this post.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:6">
<p>Mytkowicz, Diwan, Hauswirth, Sweeney, <a href="https://dl.acm.org/doi/10.1145/1508244.1508275">Producing Wrong Data Without Doing Anything Obviously Wrong</a>, ASPLOS 2009. Changing the UNIX environment size or link order flips benchmark results. The case for randomization and effect sizes over raw means.&#160;<a href="#fnref:6" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:7">
<p>Kalibera &amp; Jones, <a href="https://dl.acm.org/doi/10.1145/2491894.2464160">Rigorous Benchmarking in Reasonable Time</a>, ISMM 2013. A practical methodology for choosing iteration counts — too few and your CIs are meaningless, too many and you&rsquo;re wasting time. The sweet spot depends on the coefficient of variation.&#160;<a href="#fnref:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:8">
<p>Andrey Akinshin, <em>Pro .NET Benchmarking</em> (Apress, 2019). Chapters 5-7 cover statistics, confidence intervals, and comparing benchmark results. The authoritative guide for BenchmarkDotNet users.&#160;<a href="#fnref:8" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:9">
<p>Brendan Gregg, <a href="https://www.brendangregg.com/activebenchmarking.html">Benchmarking Gone Wrong / Active Benchmarking</a>, LISA 2014. Anti-patterns: visual comparison (&ldquo;this graph looks faster&rdquo;), ignoring variance, cherry-picking runs. The &ldquo;3% slower with 5 iterations&rdquo; scenario in this post is Gregg&rsquo;s &ldquo;visual comparison&rdquo; anti-pattern compounded with insufficient sample size.&#160;<a href="#fnref:9" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:10">
<p>Matt Dowle, <a href="https://h2oai.github.io/db-benchmark/">Database-like ops benchmark</a>. A practical example of ratio-based comparison across implementations, with reproducibility as a first-class concern.&#160;<a href="#fnref:10" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
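<p>The arithmetic in footnotes 2 and 4 can be reproduced from the summary statistics alone. What follows is a minimal Python sketch, not BDN&rsquo;s implementation. The function names <code>welch_t</code> and <code>amdahl_speedup</code> are made up for illustration, and n = 20 per variant is an assumption: outlier removal can lower the effective count.</p>
<pre><code class="language-python">import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, df) for Welch's unequal-variance t-test from summary statistics."""
    var_a = sd_a ** 2 / n_a  # squared standard error of mean A
    var_b = sd_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of total runtime is accelerated by factor s."""
    return 1.0 / ((1.0 - f) + f / s)


# Means and standard deviations are the values quoted in footnote 2.
# ASSUMPTION: n = 20 per variant; BDN may drop outliers, which lowers n.
t, df = welch_t(25.25, 0.177, 20, 25.64, 0.109, 20)
print(f"t = {t:.1f}, df = {df:.0f}")

# f and S are the values quoted in footnote 4.
speedup = amdahl_speedup(0.057, 262)
print(f"end-to-end speedup = {speedup:.3f}")
</code></pre>
<p>Under those assumptions the script prints t = -8.4, df = 32 and an end-to-end speedup of 1.060, in line with the footnotes. It stops short of a p-value, which needs the t-distribution&rsquo;s CDF and therefore a statistics library.</p>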
</div>]]></content:encoded>
    </item>
  </channel>
</rss>
