<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>CPU Architecture on 0x3F</title>
    <link>https://0x3f.blog/tags/cpu-architecture/</link>
    <description>Recent content in CPU Architecture on 0x3F</description>
    <generator>Hugo -- 0.152.2</generator>
    <language>en-us</language>
    <lastBuildDate>Fri, 27 Feb 2026 17:00:00 +0100</lastBuildDate>
    <atom:link href="https://0x3f.blog/tags/cpu-architecture/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>First Things First: Enemies of Measurement</title>
      <link>https://0x3f.blog/posts/first-things-first-enemies-of-measurement/</link>
      <pubDate>Fri, 27 Feb 2026 17:00:00 +0100</pubDate>
      <guid>https://0x3f.blog/posts/first-things-first-enemies-of-measurement/</guid>
      <description>Six forces that change benchmark results 2–6× without changing the algorithm. Same storage engine, same data, same machine — different answers.</description>
      <content:encoded><![CDATA[<h2 id="same-engine-different-answers">Same engine, different answers</h2>
<p>Design fixed. Environment changed: cache temperature, GC pressure, data order, JIT tier. The numbers move by 2–6× without touching the algorithm.</p>
<table>
  <thead>
      <tr>
          <th>Enemy</th>
          <th>Effect</th>
          <th>What it distorts</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1. JIT Optimization Level</td>
          <td>6×</td>
          <td>Machine code quality</td>
      </tr>
      <tr>
          <td>2. GC Pauses</td>
          <td>2.3×</td>
          <td>Allocation in hot path</td>
      </tr>
      <tr>
          <td>3. System Noise</td>
          <td>3.7× σ</td>
          <td>Measurement variance</td>
      </tr>
      <tr>
          <td>4. Cache State</td>
          <td>2.9×</td>
          <td>Memory hierarchy</td>
      </tr>
      <tr>
          <td>5. Branch Predictor</td>
          <td>5.0×</td>
          <td>Data order</td>
      </tr>
      <tr>
          <td>6. Dead Code Elimination</td>
          <td>5.9×</td>
          <td>Return type</td>
      </tr>
  </tbody>
</table>
<p>The first three, BenchmarkDotNet defends against — if you know to look. The last three, you&rsquo;re on your own. Some enemies use the storage engine directly (E2, E3, E4). Others isolate CPU-level effects using data derived from the storage engine (E1, E5, E6) — because these distortions hide in any hot path, not just <code>Insert</code> and <code>Get</code>.</p>
<p>All code: <a href="https://github.com/0x3f-blog/companion-code">clone, build, run</a>. Numbers below: dual Xeon E5-2697 v2, 48 threads, 30 MB L3 per socket, ~115 GB DDR3-1866, Fedora 42, .NET 9.0.11 (RyuJIT AVX), BenchmarkDotNet v0.14.0.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> No WAL — these enemies hide in the in-memory path, where fsync can&rsquo;t drown the signal. Different hardware, different numbers — that&rsquo;s half the lesson.</p>
<hr>
<h2 id="enemy-1--jit-optimization-level">Enemy 1 — JIT Optimization Level</h2>
<p>The storage engine holds 100,000 rows (via <code>Row.Generate</code>). Setup extracts all payloads into a contiguous <code>byte[]</code> of ~14.5 MB — an integrity-check scenario. Two versions of the same loop. Same data. Same operation. One difference: <code>[MethodImpl(MethodImplOptions.NoOptimization)]</code> — forcing the JIT to emit completely unoptimized code (no register promotion, no SIMD, no bounds check elimination).</p>
<p>Descartes: <em>de omnibus dubitandum est</em> — doubt everything, starting with your own setup. This is <em>not</em> a Tier-0 vs Tier-1 comparison.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> <code>NoOptimization</code> disables <em>all</em> optimizations — the absolute lower bound. Real Tier-0 → Tier-1 transitions (short methods without loops, where Tier-0 applies) show 2–4×. The 6× here is the extreme case, deliberately exaggerated to make the enemy visible.</p>
<div class="highlight"><pre data-lang="csharp"><code>[DisassemblyDiagnoser(maxDepth: 3)]
public class E1_JitWarmup
{
    private byte[] _payload; // ~14.5 MB — all payloads from 100K rows

    [GlobalSetup]
    public void Setup()
    {
        using var table = new StripedTable&lt;int, Row&gt;();
        for (int i = 0; i &lt; 100_000; i&#43;&#43;)
            table.Insert(i, Row.Generate(i));

        // Extract all payloads into contiguous array
        // ... (full source in companion code)
    }

    [Benchmark]
    [MethodImpl(MethodImplOptions.NoOptimization)]
    public long SumPayloadCold()
    {
        long sum = 0;
        var data = _payload;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            sum &#43;= data[i];
        return sum;
    }

    [Benchmark(Baseline = true)]
    public long SumPayloadWarm()
    {
        long sum = 0;
        var data = _payload;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            sum &#43;= data[i];
        return sum;
    }
}</code></pre></div>
<p>Identical loop. Identical data. Identical result.</p>
<div class="chart-container">
  <canvas id="chart-fbc74b58a700254e6911c71790250880"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-fbc74b58a700254e6911c71790250880').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['Unoptimized (NoOptimization — 124 B)', 'Optimized (default JIT — 49 B)'],
    datasets: [{
      label: 'Mean (ms)',
      data: [49.764, 8.247],
      backgroundColor: ['#f38ba8', '#89b4fa'],
      borderColor: ['#f38ba8', '#89b4fa'],
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'JIT optimization impact — payload checksum over 100K rows' },
      subtitle: { display: true, text: 'Same loop, same data — NoOptimization vs default JIT (not Tier-0 vs Tier-1)' },
      legend: { display: false }
    },
    scales: {
      y: { title: { display: true, text: 'Mean (ms)' }, min: 0, max: 55 }
    }
  }
}
);
  })();
</script>

<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">Code Size</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SumPayloadCold</td>
          <td style="text-align: right">49.764 ms</td>
          <td style="text-align: right">124 B</td>
          <td style="text-align: right">6.03</td>
      </tr>
      <tr>
          <td>SumPayloadWarm</td>
          <td style="text-align: right">8.247 ms</td>
          <td style="text-align: right">49 B</td>
          <td style="text-align: right">1.00</td>
      </tr>
  </tbody>
</table>
<p><strong>6× on this hardware.</strong> The <code>[DisassemblyDiagnoser]</code> on the class generates full JIT output in <code>BenchmarkDotNet.Artifacts/results/</code>: 124 bytes of machine code vs 49. The unoptimized path pays for stack-based locals and a bounds check on every array access. The optimized path gets register promotion and bounds check elimination. Both stay scalar, one byte per iteration: RyuJIT does not auto-vectorize a loop like this, and 49 bytes is consistent with a plain scalar loop. Same source code. Different machine code. 6× gap (remember: this is the extreme case; real Tier-0 → Tier-1 deltas are smaller but still significant).</p>
<p>BenchmarkDotNet runs warmup iterations by default (6–50 adaptive, plus 15–100 measurement iterations)<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>, conservative enough that Tier-0 code is recompiled at Tier-1 before measurement begins. Defense exists. But in benchmarks where tiered compilation actually applies (short methods without loops, where Tier-0 <em>is</em> the first compile), setting the warmup count too low, or testing a method that is called too rarely to cross the recompilation threshold, can let unoptimized code leak into the measurement window. The first enemy hides in the JIT pipeline, and the <code>DisassemblyDiagnoser</code> is the only way to see it.</p>
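<p>One way to check whether warmup or tiering affects a given benchmark is to run it under several jobs and compare the rows. The sketch below is a config fragment, not part of the companion code: the class name <code>TieringConfig</code> and the job ids are made up for illustration, and it assumes the BenchmarkDotNet package is referenced. <code>DOTNET_TieredCompilation=0</code> is the runtime switch that turns tiering off.</p>
<div class="highlight"><pre data-lang="csharp"><code>using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

// Hypothetical config class; attach with [Config(typeof(TieringConfig))].
public class TieringConfig : ManualConfig
{
    public TieringConfig()
    {
        // Default job: adaptive warmup, tiered compilation enabled.
        AddJob(Job.Default.WithId(&#34;Default&#34;));

        // Starved warmup: a single warmup iteration, fixed measurement count.
        AddJob(Job.Default
            .WithWarmupCount(1)
            .WithIterationCount(15)
            .WithId(&#34;ShortWarmup&#34;));

        // Tiering disabled: methods are compiled fully optimized on first call.
        AddJob(Job.Default
            .WithEnvironmentVariable(&#34;DOTNET_TieredCompilation&#34;, &#34;0&#34;)
            .WithId(&#34;NoTiering&#34;));
    }
}</code></pre></div>
<p>If the three rows agree, warmup is probably not the problem. If <code>ShortWarmup</code> is slower or noisier than the others, unoptimized code is likely reaching the measurement window.</p>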
<hr>
<h2 id="enemy-2--gc-pauses">Enemy 2 — GC Pauses</h2>
<p>Insert 100,000 rows into <code>StripedTable</code>. Same keys, same table, same final state. One difference: where the <code>Row</code> objects come from.</p>
<div class="highlight"><pre data-lang="csharp"><code>[MemoryDiagnoser]
public class E2_GcPauses
{
    private const int N = 100_000;
    private ITable&lt;int, Row&gt; _table;
    private int[] _keys;
    private Row[] _preAllocated;

    [GlobalSetup]  // keys &#43; rows generated once, reused across iterations
    public void Setup()
    {
        var rng = new Random(42);
        _keys = new int[N];
        _preAllocated = new Row[N];
        for (int i = 0; i &lt; N; i&#43;&#43;)
        {
            _keys[i] = rng.Next(0, 200_000);
            _preAllocated[i] = Row.Generate(_keys[i]);
        }
    }

    [IterationSetup]
    public void IterationSetup()
    {
        _table = new StripedTable&lt;int, Row&gt;(); // fresh table per iteration
    }

    [Benchmark]
    public void InsertAllocHeavy()
    {
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _table.Insert(_keys[i], Row.Generate(_keys[i])); // new byte[] per insert
    }

    [Benchmark(Baseline = true)]
    public void InsertPreAllocated()
    {
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _table.Insert(_keys[i], _preAllocated[i]); // no per-insert allocation
    }
}</code></pre></div>
<p><code>Row.Generate(key)</code> allocates a fresh <code>byte[]</code> of 32–256 bytes on every call. 100K inserts = 100K allocations = GC pressure. The baseline pre-allocates all rows in <code>GlobalSetup</code>, so there are no per-insert payload allocations in the hot path. (The 7.52 MB in the results table below comes from <code>ConcurrentDictionary</code> internal growth; both methods pay that cost.)</p>
<div class="chart-container">
  <canvas id="chart-0fd375f3d9bcb7aa2f7d3d92f7afe5b1"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-0fd375f3d9bcb7aa2f7d3d92f7afe5b1').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['AllocHeavy (23.9 MB alloc)', 'PreAllocated (7.52 MB alloc)'],
    datasets: [{
      label: 'Mean (ms)',
      data: [36.81, 16.35],
      backgroundColor: ['#f38ba8', '#89b4fa'],
      borderColor: ['#f38ba8', '#89b4fa'],
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'GC pause impact — 100K inserts into StripedTable' },
      subtitle: { display: true, text: 'Row.Generate per insert (allocation) vs pre-allocated rows' },
      legend: { display: false }
    },
    scales: {
      y: { title: { display: true, text: 'Mean (ms)' }, min: 0, max: 45 }
    }
  }
}
);
  })();
</script>

<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">StdDev</th>
          <th style="text-align: right">Allocated</th>
          <th style="text-align: right">Alloc Ratio</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>InsertAllocHeavy</td>
          <td style="text-align: right">36.81 ms</td>
          <td style="text-align: right">1.744 ms</td>
          <td style="text-align: right">23.9 MB</td>
          <td style="text-align: right">3.18</td>
          <td style="text-align: right">2.25</td>
      </tr>
      <tr>
          <td>InsertPreAllocated</td>
          <td style="text-align: right">16.35 ms</td>
          <td style="text-align: right">1.448 ms</td>
          <td style="text-align: right">7.52 MB</td>
          <td style="text-align: right">1.00</td>
          <td style="text-align: right">1.00</td>
      </tr>
  </tbody>
</table>
<p><strong>2.3× on this workload.</strong> <code>MemoryDiagnoser</code> shows why: 23.9 MB allocated vs 7.52 MB. Both methods grow the <code>ConcurrentDictionary</code> from scratch (fresh table per iteration), but <code>AllocHeavy</code> adds 100K <code>Row.Generate</code> allocations on top, each creating a new <code>byte[]</code>. The extra allocation pressure triggers GC collections mid-measurement, and each pause adds microseconds that accumulate into milliseconds. Look at <code>StdDev</code>: 1.74 ms for the allocating path. Even <code>PreAllocated</code>, which still allocates 7.52 MB of dictionary internals per iteration, was flagged by BenchmarkDotNet as <em>bimodal</em> (mValue = 3.94), consistent with GC pauses splitting the distribution into two clusters: iterations where a collection fired vs iterations where it didn&rsquo;t. GC pauses are non-deterministic: sometimes a collection lands inside the timed region, sometimes it doesn&rsquo;t.</p>
<p>BenchmarkDotNet can force GC between iterations (<a href="https://benchmarkdotnet.org/articles/configs/jobs.html"><code>GcForce</code></a>) and report allocation pressure (<a href="https://benchmarkdotnet.org/articles/configs/diagnosers.html"><code>MemoryDiagnoser</code></a>). The defense exists — but you have to <em>look</em>. A benchmark that allocates in the hot path and doesn&rsquo;t report memory is measuring GC behavior, not your algorithm. The <code>StdDev</code> rises and nobody knows why.</p>
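<p>Outside BenchmarkDotNet, the same question (does this code path allocate, and did a collection fire while it ran?) can be answered with two runtime counters. The helper below is a minimal sketch using only <code>System.GC</code>; <code>AllocationProbe</code> is a made-up name, and the workload in <code>Main</code> only imitates the payload size formula, not the real <code>Row.Generate</code>.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class AllocationProbe
{
    // Runs the action once and returns the bytes allocated on this thread
    // plus the number of gen0 collections that happened in the meantime.
    public static (long Bytes, int Gen0) Measure(Action action)
    {
        long bytesBefore = GC.GetAllocatedBytesForCurrentThread();
        int gen0Before = GC.CollectionCount(0);

        action();

        long bytes = GC.GetAllocatedBytesForCurrentThread() - bytesBefore;
        int gen0 = GC.CollectionCount(0) - gen0Before;
        return (bytes, gen0);
    }

    public static void Main()
    {
        var sink = new byte[100_000][];
        var (bytes, gen0) = Measure(() =&gt;
        {
            for (int i = 0; i &lt; sink.Length; i&#43;&#43;)
                sink[i] = new byte[32 &#43; (i % 225)];  // same size formula as the payloads
        });
        Console.WriteLine($&#34;allocated: {bytes / 1e6:F1} MB, gen0 collections: {gen0}&#34;);
    }
}</code></pre></div>
<p>A non-zero collection count inside a timed region is the signal to look for: the measurement then includes GC work, whatever the algorithm under test does.</p>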
<hr>
<h2 id="enemy-3--system-noise">Enemy 3 — System Noise</h2>
<p>Two identical methods. Same table. Same data. Same code — literally copy-paste. The table is pre-populated in <code>GlobalSetup</code> — every <code>Insert</code> is an update, not a growth event. Deterministic, constant-cost work where OS noise is the only variable.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<div class="highlight"><pre data-lang="csharp"><code>public class E3_OsNoise
{
    private const int N = 100_000;
    private ITable&lt;int, Row&gt; _table;
    private int[] _keys;
    private Row[] _rows;

    [GlobalSetup]
    public void Setup()
    {
        _table = new StripedTable&lt;int, Row&gt;();
        // ... generate keys and rows ...
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _table.Insert(_keys[i], _rows[i]);  // pre-populate
    }

    [Benchmark(Baseline = true)]
    public void InsertBaseline()
    {
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _table.Insert(_keys[i], _rows[i]);
    }

    [Benchmark]
    public void InsertSame()
    {
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _table.Insert(_keys[i], _rows[i]);
    }
}</code></pre></div>
<p>The interesting number isn&rsquo;t the ratio between methods — it&rsquo;s the <code>StdDev</code> across two <em>runs</em> of the same benchmark under different conditions:</p>
<div class="highlight"><pre data-lang="bash"><code># Linux-only: taskset (util-linux) sets CPU affinity via sched_setaffinity; not available on macOS/Windows

# === Run 1: Noisy — saturate all CPU cores, then benchmark ===

# If the script exits (Ctrl-C or error), kill all background jobs automatically
trap &#39;kill $(jobs -p) 2&gt;/dev/null&#39; EXIT

# Spawn one infinite busy loop per CPU core — fills the scheduler with work
# $(nproc) returns your core count (e.g. 48), each loop burns 100% of one core
for i in $(seq 1 $(nproc)); do
  (while true; do :; done) &amp;   # &amp; sends each loop to background
done

# Now run the benchmark — the OS scheduler must fight for CPU time
dotnet run -c Release -- --filter &#39;*E3*&#39;

# Stop all busy loops
kill $(jobs -p)

# === Run 2: Isolated — pin benchmark to a single core, no contention ===

# taskset -c 0 = run only on core 0, no migration (other processes may still be scheduled on that core)
taskset -c 0 dotnet run -c Release -- --filter &#39;*E3*&#39;</code></pre></div>
<p><strong>Noisy run</strong> (all cores saturated):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">StdDev</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>InsertBaseline</td>
          <td style="text-align: right">18.95 ms</td>
          <td style="text-align: right">0.945 ms</td>
          <td style="text-align: right">1.00</td>
      </tr>
      <tr>
          <td>InsertSame</td>
          <td style="text-align: right">18.11 ms</td>
          <td style="text-align: right">0.583 ms</td>
          <td style="text-align: right">0.96</td>
      </tr>
  </tbody>
</table>
<p><strong>Isolated run</strong> (pinned to core 0, idle system):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">StdDev</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>InsertBaseline</td>
          <td style="text-align: right">12.98 ms</td>
          <td style="text-align: right">0.252 ms</td>
          <td style="text-align: right">1.00</td>
      </tr>
      <tr>
          <td>InsertSame</td>
          <td style="text-align: right">13.17 ms</td>
          <td style="text-align: right">0.254 ms</td>
          <td style="text-align: right">1.01</td>
      </tr>
  </tbody>
</table>
<div class="chart-container">
  <canvas id="chart-784ff5851d7c3933d8da2d1e3dc9ceb8"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-784ff5851d7c3933d8da2d1e3dc9ceb8').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['InsertBaseline', 'InsertSame'],
    datasets: [
      {
        label: 'Noisy (all cores saturated)',
        data: [0.945, 0.583],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      },
      {
        label: 'Isolated (pinned core)',
        data: [0.252, 0.254],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'System noise impact — StdDev of identical insert loops' },
      subtitle: { display: true, text: 'Same code, same data, same machine — different running conditions' },
      legend: { display: true }
    },
    scales: {
      y: { title: { display: true, text: 'StdDev (ms)' }, min: 0, max: 1.1 }
    }
  }
}
);
  })();
</script>

<p>Same code. Same data. Same machine. The noisy run is 46% slower (mean) and <strong>3.7× noisier</strong> (StdDev of <code>InsertBaseline</code>; <code>InsertSame</code> is 2.3× noisier). The noise isn&rsquo;t just the OS scheduler: it&rsquo;s the entire system under contention. Thread migration between cores leaves the benchmark thread with cold caches. Context switches inject 10–100 μs of jitter.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup> Competing processes saturate the memory bus and evict cache lines that the benchmark needs. Interrupts and kernel work preempt the benchmark thread mid-iteration. Under CPU saturation, these effects stack: on a 13 ms insert loop, the mean shifts by 46% and the variance explodes. On a 100 μs microbenchmark, the effect is destruction, not noise.</p>
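<p>The 46% and 3.7× figures are plain ratios of the numbers in the two tables above. A small sketch of the arithmetic, with <code>NoiseMath</code> as a made-up helper name:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class NoiseMath
{
    // Relative slowdown of the noisy mean over the isolated mean, in percent.
    public static double SlowdownPercent(double noisyMean, double isolatedMean)
        =&gt; (noisyMean / isolatedMean - 1.0) * 100.0;

    // How many times larger the noisy StdDev is than the isolated one.
    public static double StdDevRatio(double noisyStdDev, double isolatedStdDev)
        =&gt; noisyStdDev / isolatedStdDev;

    public static void Main()
    {
        // InsertBaseline: Mean and StdDev in ms, noisy run vs isolated run.
        Console.WriteLine($&#34;baseline mean shift:   {SlowdownPercent(18.95, 12.98):F0}%&#34;);
        Console.WriteLine($&#34;baseline StdDev ratio: {StdDevRatio(0.945, 0.252):F2}&#34;);

        // InsertSame: identical code, measured in the same two runs.
        Console.WriteLine($&#34;same mean shift:       {SlowdownPercent(18.11, 13.17):F0}%&#34;);
        Console.WriteLine($&#34;same StdDev ratio:     {StdDevRatio(0.583, 0.254):F2}&#34;);
    }
}</code></pre></div>
<p>The two methods are copy-paste identical, yet their ratios differ. That spread is itself a measure of how much the environment contributes.</p>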
<p>The defense: <code>taskset</code> pins to a core (add <code>nice -n -20</code> with root for higher priority), and more iterations average out the noise. BenchmarkDotNet&rsquo;s <a href="https://benchmarkdotnet.org/articles/configs/jobs.html"><code>MinIterationCount</code></a> and <a href="https://benchmarkdotnet.org/articles/configs/jobs.html"><code>Affinity</code></a> (CPU core mask, the in-process equivalent of <code>taskset</code>) settings help. But the scheduler is always there, and the smaller your operation, the larger the enemy.</p>
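<p>The two BenchmarkDotNet settings named above can be combined in one job. This is a config fragment under assumptions: <code>QuietConfig</code> and the job id are illustrative names, the iteration counts are arbitrary, and whether the affinity mask takes effect depends on the platform.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

// Hypothetical config class; attach with [Config(typeof(QuietConfig))].
public class QuietConfig : ManualConfig
{
    public QuietConfig()
    {
        AddJob(Job.Default
            .WithAffinity((IntPtr)0b0001)   // bit mask: core 0 only
            .WithMinIterationCount(30)      // never stop before 30 measured iterations
            .WithMaxIterationCount(200)     // allow the adaptive engine to run longer
            .WithId(&#34;PinnedCore0&#34;));
    }
}</code></pre></div>
<p>Pinning inside the process does not reserve the core. Other processes can still be scheduled on it, so an otherwise idle machine matters as much as the mask.</p>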
<hr>
<p>Three enemies down. All three live in the execution environment — BenchmarkDotNet can detect or mitigate them because it controls the process. The next three live at the boundary between your code and the hardware. Korzybski (1933): <em>the map is not the territory.</em> The framework maps the process. It can&rsquo;t map a dataset that fits in L3, a data order that trains the branch predictor, or a return type that lets the JIT eliminate your computation. Those are your choices — and the hardware responds to them silently.</p>
<hr>
<h2 id="enemy-4--cache-state">Enemy 4 — Cache State</h2>
<p>Random <code>Get()</code> on <code>StripedTable</code> — in-memory, no WAL (hence nanosecond latencies, not microsecond-scale numbers where fsync dominates). Same operation. Same code. One parameter: how many entries in the table.</p>
<div class="highlight"><pre data-lang="csharp"><code>public class E4_CacheState
{
    private const int LookupCount = 100_000;

    [Params(10_000, 2_000_000)]
    public int TableSize { get; set; }

    private ITable&lt;int, Row&gt; _table;
    private int[] _lookupKeys;

    [GlobalSetup]
    public void Setup()
    {
        _table = new StripedTable&lt;int, Row&gt;();
        for (int i = 0; i &lt; TableSize; i&#43;&#43;)
            _table.Insert(i, Row.Generate(i));
        // ... random lookup keys ...
    }

    [Benchmark(OperationsPerInvoke = LookupCount)]
    public Row? LookupRandom()
    {
        Row? last = default;
        var table = _table;
        var keys = _lookupKeys;
        for (int i = 0; i &lt; LookupCount; i&#43;&#43;)
            last = table.Get(keys[i]);
        return last;
    }
}</code></pre></div>
<p><code>OperationsPerInvoke</code> divides total time by 100K — reporting per-lookup latency. Same <code>Get()</code>. Same <code>StripedTable</code>. Different table size.</p>
<div class="chart-container">
  <canvas id="chart-a3ed0e909a3d6554cae2af400231ddbd"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-a3ed0e909a3d6554cae2af400231ddbd').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['10K entries (fits L3)', '2M entries (spills to DRAM)'],
    datasets: [{
      label: 'Per-lookup latency (ns)',
      data: [17.05, 50.08],
      backgroundColor: ['#89b4fa', '#f38ba8'],
      borderColor: ['#89b4fa', '#f38ba8'],
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Cache hierarchy impact — random Get() on StripedTable' },
      subtitle: { display: true, text: 'Same operation, same code — different table size' },
      legend: { display: false }
    },
    scales: {
      y: { title: { display: true, text: 'Latency per lookup (ns)' }, min: 0, max: 70 }
    }
  }
}
);
  })();
</script>

<table>
  <thead>
      <tr>
          <th>TableSize</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">StdDev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>10,000</td>
          <td style="text-align: right">17.05 ns</td>
          <td style="text-align: right">0.190 ns</td>
      </tr>
      <tr>
          <td>2,000,000</td>
          <td style="text-align: right">50.08 ns</td>
          <td style="text-align: right">1.919 ns</td>
      </tr>
  </tbody>
</table>
<p><strong>2.9× on this hardware.</strong> <code>StdDev</code> tells the rest of the story.</p>
<p>10K entries: the benchmark&rsquo;s working set — <code>ConcurrentDictionary</code> bucket arrays (~80 KB) and <code>Node</code> objects (~400 KB) — totals ~500 KB, comfortably within the 30 MB L3 on the local socket (dual-socket NUMA — each socket has its own 30 MB L3; the benchmark thread runs on one).<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup> The <code>Row</code> payloads (~1.4 MB of <code>byte[]</code>) exist on the heap but <code>LookupRandom</code> never dereferences them — it returns the <code>Row</code> struct, not the data. So only the dictionary traversal structure needs to fit in cache. Every lookup hits cached memory. <code>StdDev</code> is 0.19 ns — tight, repeatable.</p>
<p>2M entries: the dictionary working set (bucket arrays ~32 MB + nodes ~80 MB ≈ 112 MB) exceeds L3 by a wide margin and spills to DRAM. Random access means random cache misses — each miss costs 60–100 ns instead of 4–12 ns. <code>StdDev</code> jumps to 1.9 ns — 10× noisier — because DRAM latency varies with access pattern, NUMA topology, and memory controller contention.</p>
<p>Cache doesn&rsquo;t just change the speed — it changes the <em>quality</em> of the measurement. Tight numbers, low StdDev, repeatable results — and potentially misleading. Popper (1934): a benchmark can falsify a hypothesis but never confirm one. The 2.9× gap and 10× StdDev increase point at cache hierarchy — <code>perf stat -e cache-misses,cache-references</code> would confirm, but the measurement already suggests the answer.</p>
<p>Same symptom — inflated speed and false confidence. Different cause. Hot cache vs cold DRAM.</p>
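<p>The same cliff can be reproduced without <code>StripedTable</code>. The sketch below is a generic pointer-chasing probe, not the companion benchmark: it builds a random single-cycle permutation and follows it, so every load depends on the previous one. The array sizes are assumptions chosen to sit well below and well above a 30 MB L3, and <code>Stopwatch</code> timing is a rough probe rather than a substitute for BenchmarkDotNet.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class PointerChase
{
    private static int _sink;   // keeps results observable so the JIT cannot drop the loop

    // Sattolo&#39;s algorithm: produces a permutation that is one single cycle,
    // so following next[i] from slot 0 visits every slot exactly once.
    public static int[] BuildCycle(int length, int seed)
    {
        var next = new int[length];
        for (int i = 0; i &lt; length; i&#43;&#43;) next[i] = i;
        var rng = new Random(seed);
        for (int i = length - 1; i &gt; 0; i--)
        {
            int j = rng.Next(i);            // 0 &lt;= j &lt; i, never i itself
            (next[i], next[j]) = (next[j], next[i]);
        }
        return next;
    }

    // Each load depends on the previous one, so cache misses cannot overlap.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static int Chase(int[] next, int steps)
    {
        int pos = 0;
        for (int s = 0; s &lt; steps; s&#43;&#43;) pos = next[pos];
        return pos;
    }

    public static double NanosPerStep(int length, int steps)
    {
        var next = BuildCycle(length, seed: 42);
        _sink = Chase(next, steps);         // warm-up pass
        var sw = Stopwatch.StartNew();
        _sink = Chase(next, steps);
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds * 1e6 / steps;
    }

    public static void Main()
    {
        // 64 K ints = 256 KB (cache-resident); 64 M ints = 256 MB (far past L3).
        Console.WriteLine($&#34;small: {NanosPerStep(1 &lt;&lt; 16, 10_000_000):F2} ns/step&#34;);
        Console.WriteLine($&#34;large: {NanosPerStep(1 &lt;&lt; 26, 10_000_000):F2} ns/step&#34;);
    }
}</code></pre></div>
<p>The code is identical for both sizes; only the working set changes. Any gap between the two lines comes from the memory hierarchy.</p>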
<hr>
<h2 id="enemy-5--branch-predictor-training">Enemy 5 — Branch Predictor Training</h2>
<p>Scan the results from the storage engine. <code>Row.Generate(key)</code> produces payloads of 32–256 bytes (formula: <code>32 + key % 225</code>). Count how many exceed a threshold. Standard aggregation — the kind you&rsquo;d run after querying the table.</p>
<div class="highlight"><pre data-lang="csharp"><code>public class E5_BranchPredictor
{
    [Params(8_000_000)]
    public int N { get; set; }

    private int[] _sorted;  // Row sizes from Row.Generate formula, sorted
    private int[] _random;  // Same values, shuffled

    [GlobalSetup]
    public void Setup()
    {
        _sorted = new int[N];
        for (int i = 0; i &lt; N; i&#43;&#43;)
            _sorted[i] = 32 &#43; (i % 225);  // Row.Generate payload formula
        Array.Sort(_sorted);

        _random = _sorted.ToArray();
        new Random(42).Shuffle(_random);
    }

    [Benchmark]
    public int ScanSorted()
    {
        int count = 0, threshold = 150;
        var data = _sorted;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            if (data[i] &gt; threshold) count&#43;&#43;;
        return count;
    }

    [Benchmark(Baseline = true)]
    public int ScanRandom()
    {
        int count = 0, threshold = 150;
        var data = _random;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            if (data[i] &gt; threshold) count&#43;&#43;;
        return count;
    }
}</code></pre></div>
<p>Same values. Same count returned. Both arrays accessed sequentially — the prefetcher treats them identically.<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup> Same memory layout, same access pattern. Only the value order differs — which is what branch predictors respond to.</p>
<div class="chart-container">
  <canvas id="chart-19f69055fb99c286b2bc0911279df80d"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-19f69055fb99c286b2bc0911279df80d').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['Sorted (predictable)', 'Random (unpredictable)'],
    datasets: [{
      label: 'Mean (ms)',
      data: [8.214, 41.363],
      backgroundColor: ['#a6e3a1', '#f38ba8'],
      borderColor: ['#a6e3a1', '#f38ba8'],
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Branch prediction impact — 8M Row sizes, threshold filter' },
      subtitle: { display: true, text: 'Same values from Row.Generate formula, different order' },
      legend: { display: false }
    },
    scales: {
      y: { title: { display: true, text: 'Mean (ms)' }, min: 0, max: 50 }
    }
  }
}
);
  })();
</script>

<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ScanSorted</td>
          <td style="text-align: right">8.214 ms</td>
          <td style="text-align: right">0.20</td>
      </tr>
      <tr>
          <td>ScanRandom</td>
          <td style="text-align: right">41.363 ms</td>
          <td style="text-align: right">1.00</td>
      </tr>
  </tbody>
</table>
<p><strong>5.0× on this hardware.</strong> Same algorithm, same data, same cache behavior — different order.</p>
<p>Threshold 150 splits the range roughly in half — 106 out of 225 possible values exceed it (~47%). Near 50–50 is maximum branch unpredictability.<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup> The sorted array presents a clean pattern: all values below threshold, then all above. The branch predictor learns after a few iterations and predicts correctly for millions of subsequent elements. The shuffled array is a coin flip every iteration — the predictor guesses wrong ~47% of the time, and each misprediction costs 15–20 cycles while the pipeline flushes and refills.</p>
<p>Sequential keys feed the prefetcher, which is a data design problem. Here the memory access pattern is identical in both runs and only the order of the values differs, yet the branch predictor changes the result without your knowledge. You&rsquo;re trying to measure the storage engine&rsquo;s aggregation cost. You&rsquo;re mostly measuring the CPU pipeline&rsquo;s response to data order.</p>
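<p>One way to test whether data order is what you are measuring is to remove the data-dependent branch and see if the gap closes. The sketch below is not from the companion code. <code>ThresholdScan</code> is a made-up name, and the sign-bit trick assumes <code>threshold - value</code> cannot overflow, which holds for sizes in 32–256.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class ThresholdScan
{
    // Branchy version: same shape as ScanSorted / ScanRandom.
    public static int CountBranchy(int[] data, int threshold)
    {
        int count = 0;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            if (data[i] &gt; threshold) count&#43;&#43;;
        return count;
    }

    // Branch-free loop body: (threshold - value) is negative exactly when
    // value &gt; threshold, so its sign bit is the 0 or 1 to add.
    public static int CountBranchless(int[] data, int threshold)
    {
        int count = 0;
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            count &#43;= (int)((uint)(threshold - data[i]) &gt;&gt; 31);
        return count;
    }

    public static void Main()
    {
        var data = new int[225 * 1000];
        for (int i = 0; i &lt; data.Length; i&#43;&#43;)
            data[i] = 32 &#43; (i % 225);       // payload size formula
        new Random(42).Shuffle(data);        // requires .NET 8 or later

        Console.WriteLine(CountBranchy(data, 150));
        Console.WriteLine(CountBranchless(data, 150));
    }
}</code></pre></div>
<p>Both calls return 106000 for this input: 1,000 repetitions of the 225-value cycle, with 106 values above 150 in each. The only branch left in the second loop is the loop condition, which is trivially predictable, so its timing should no longer depend on whether the array is sorted or shuffled.</p>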
<hr>
<h2 id="enemy-6--dead-code-elimination">Enemy 6 — Dead Code Elimination</h2>
<p>Sum the data from <code>Row.Generate</code>&rsquo;s formula — a checksum for integrity verification. 10 million iterations, pure arithmetic: <code>32 + (i % 225)</code>. No memory access. No exceptions. No side effects.</p>
<div class="highlight"><pre data-lang="csharp"><code>[DisassemblyDiagnoser(maxDepth: 3)]
public class E6_DeadCode
{
    [Params(10_000_000)]
    public int N { get; set; }

    [Benchmark]
    public void ChecksumEliminated()
    {
        long checksum = 0;
        for (int i = 0; i &lt; N; i&#43;&#43;)
            checksum &#43;= 32 &#43; (i % 225);
        // checksum not returned — JIT drops the accumulation
    }

    [Benchmark(Baseline = true)]
    public long ChecksumPreserved()
    {
        long checksum = 0;
        for (int i = 0; i &lt; N; i&#43;&#43;)
            checksum &#43;= 32 &#43; (i % 225);
        return checksum;
    }
}</code></pre></div>
<p>Identical loop. One returns the result. One doesn&rsquo;t.</p>
<div class="chart-container">
  <canvas id="chart-30568c15ff06091da31d3c38fb6c0b2d"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-30568c15ff06091da31d3c38fb6c0b2d').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['Eliminated (void — 21 B)', 'Preserved (return — 66 B)'],
    datasets: [{
      label: 'Mean (ms)',
      data: [3.750, 22.220],
      backgroundColor: ['#f38ba8', '#89b4fa'],
      borderColor: ['#f38ba8', '#89b4fa'],
      borderWidth: 1
    }]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Dead code elimination — checksum over Row.Generate formula' },
      subtitle: { display: true, text: 'void (JIT strips accumulation) vs return (full computation)' },
      legend: { display: false }
    },
    scales: {
      y: { title: { display: true, text: 'Mean (ms)' }, min: 0, max: 25 }
    }
  }
}
);
  })();
</script>

<table>
  <thead>
      <tr>
          <th>Method</th>
          <th style="text-align: right">Mean</th>
          <th style="text-align: right">Code Size</th>
          <th style="text-align: right">Ratio</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChecksumEliminated</td>
          <td style="text-align: right">3.750 ms</td>
          <td style="text-align: right">21 B</td>
          <td style="text-align: right">0.17</td>
      </tr>
      <tr>
          <td>ChecksumPreserved</td>
          <td style="text-align: right">22.220 ms</td>
          <td style="text-align: right">66 B</td>
          <td style="text-align: right">1.00</td>
      </tr>
  </tbody>
</table>
<p><strong>5.9× on this hardware.</strong> The <code>DisassemblyDiagnoser</code> shows why — the actual machine code for both methods:</p>
<div class="highlight"><pre data-lang="nasm"><code>; ChecksumEliminated — 21 bytes
M00_L00:
  inc   eax          ; i&#43;&#43;
  cmp   eax, ecx     ; i &lt; N?
  jl    M00_L00      ; loop

; ChecksumPreserved — 66 bytes
M00_L00:
  mov   edx, 91A2B3C5 ; magic constant for i % 225
  imul  esi            ; compiler-generated modulo
  ; ... 8 more instructions for 32 &#43; (i % 225) ...
  add   rcx, rax      ; checksum &#43;= result
  inc   esi            ; i&#43;&#43;
  cmp   esi, edi       ; i &lt; N?
  jl    M00_L00        ; loop</code></pre></div>
<p><code>[DisassemblyDiagnoser]</code> on the class generates this — run the benchmark and check <code>BenchmarkDotNet.Artifacts/results/</code> for the full listing (HTML + Markdown).</p>
<p>The JIT determined that <code>checksum</code> is never read, so computing it has no observable effect, and stripped out the entire accumulation. What remains is <code>inc/cmp/jl</code>: the loop counter, iterating 10 million times over nothing.<sup id="fnref:9"><a href="#fn:9" class="footnote-ref" role="doc-noteref">9</a></sup> The fix is simple: <a href="https://benchmarkdotnet.org/articles/guides/good-practices.html#avoid-dead-code-elimination">always return the computed value</a> so the JIT must preserve it.</p>
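<p>A minimal sketch of the two shapes, reconstructed from the formula in the footnote rather than copied from the companion code. The class name <code>ChecksumSketch</code>, the method names, and the <code>n</code> parameter are assumptions for illustration, and <code>NoInlining</code> only stands in for the call boundary a benchmark harness would provide:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System.Runtime.CompilerServices;

// Hypothetical reconstruction, not the article&#39;s actual benchmark class.
public static class ChecksumSketch
{
    // The sum is computed but never leaves the method, so the JIT
    // is free to drop the accumulation and keep only the loop counter.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Eliminated(int n)
    {
        long checksum = 0;
        for (int i = 0; i &lt; n; i&#43;&#43;)
            checksum &#43;= 32 &#43; (i % 225);
    }

    // Returning the sum makes it observable to the caller, so the
    // arithmetic has to stay in the generated code.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static long Preserved(int n)
    {
        long checksum = 0;
        for (int i = 0; i &lt; n; i&#43;&#43;)
            checksum &#43;= 32 &#43; (i % 225);
        return checksum;
    }
}</code></pre></div>
<p>In a BenchmarkDotNet class the same two bodies would carry <code>[Benchmark]</code>; BDN consumes the value a benchmark method returns, which is why returning it is enough.</p>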
<p>Here&rsquo;s what makes this the most dangerous enemy: <strong>3.75 ms looks plausible.</strong> It&rsquo;s not zero. It&rsquo;s not suspiciously fast. It looks like a reasonable time for 10 million iterations of lightweight arithmetic. Without <code>DisassemblyDiagnoser</code>, you&rsquo;d trust it. You&rsquo;d compare it against another implementation. You&rsquo;d ship a conclusion based on a number that measures empty loop iterations.</p>
<p>21 bytes vs 66 bytes. The disassembler is the only reliable way to catch this. Because the lie that looks reasonable is worse than the lie that looks absurd.</p>
<hr>
<h2 id="know-your-enemies">Know your enemies</h2>
<table>
  <thead>
      <tr>
          <th>Enemy</th>
          <th>Effect</th>
          <th>Symptom</th>
          <th>Defense</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1. JIT Optimization Level</td>
          <td>6×†</td>
          <td>NoOptimization 6× slower (†extreme case; real Tier-0→1: 2–4×)</td>
          <td>Warmup (BDN default) + DisassemblyDiagnoser</td>
      </tr>
      <tr>
          <td>2. GC Pauses</td>
          <td>2.3×</td>
          <td>Allocation in hot path, StdDev spike</td>
          <td>MemoryDiagnoser + GcForce + pre-allocate</td>
      </tr>
      <tr>
          <td>3. System Noise</td>
          <td>3.7× StdDev</td>
          <td>Mean +46%, StdDev 3.7× under load</td>
          <td>taskset + nice + more iterations</td>
      </tr>
      <tr>
          <td>4. Cache State</td>
          <td>2.9×</td>
          <td>Working set &gt; L3</td>
          <td>Conscious choice: cold vs warm vs hot</td>
      </tr>
      <tr>
          <td>5. Branch Predictor</td>
          <td>5.0×</td>
          <td>Sorted data 5× faster</td>
          <td>Realistic (shuffled) data</td>
      </tr>
      <tr>
          <td>6. Dead Code Elimination</td>
          <td>5.9×</td>
          <td>Code Size 21 B vs 66 B</td>
          <td><a href="https://benchmarkdotnet.org/articles/guides/good-practices.html#avoid-dead-code-elimination">Return result</a> + DisassemblyDiagnoser</td>
      </tr>
  </tbody>
</table>
<p>Each enemy alone shifted the result 2–6× on this hardware. Stack three and the benchmark and production are different universes.</p>
<p>A reference checklist — not a universal shield, but a starting point that covers what BDN configuration <em>can</em> cover (enemies 1–3) and adds inspection tooling for what it can&rsquo;t (enemies 4–6). The enemy benchmarks in the companion code intentionally don&rsquo;t use it — defenses must be <em>down</em> to show the enemies in action:</p>
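<div class="highlight"><pre data-lang="csharp"><code>using System;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Jobs;

public class EnemyDefenseConfig : ManualConfig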
{
    public EnemyDefenseConfig()
    {
        AddJob(Job.Default
            .WithWarmupCount(3)               // E1: ensure Tier-1 before measurement
            .WithGcServer(true)               // E2: Server GC — fewer, larger collections
            .WithGcForce(true)                // E2: force GC between iterations
            .WithMinIterationCount(15)        // E3: average out scheduler noise
            .WithMaxIterationCount(100)       // E3: let BDN adapt when noise is present
            .WithAffinity((IntPtr)0b11));     // E3: pin to cores 0–1

        AddDiagnoser(MemoryDiagnoser.Default);              // E2: allocation pressure
        AddDiagnoser(new DisassemblyDiagnoser(              // E1&#43;E6: JIT output
            new DisassemblyDiagnoserConfig(maxDepth: 3)));

        AddColumn(StatisticColumn.StdDev);                  // E3: noise visible
    }
}</code></pre></div>
<p>Enemies 1–3: configuration. Enemies 4–6: conscious design of the benchmark itself (cache state, data order, returned results). No config setting shuffles your test data for you.</p>
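<p>Shuffling is something the benchmark has to do itself. One way to sketch it, assuming integer keys and a fixed seed so that every run sees the same order; <code>TestData</code>, <code>ShuffledKeys</code>, and the seed value are hypothetical names, not part of the companion code:</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

// Hypothetical helper, not taken from the companion code.
public static class TestData
{
    // Fisher-Yates shuffle with a fixed seed: the order is unpredictable
    // for the branch predictor but identical from run to run.
    public static int[] ShuffledKeys(int count, int seed = 42)
    {
        var keys = new int[count];
        for (int i = 0; i &lt; count; i&#43;&#43;)
            keys[i] = i;

        var rng = new Random(seed);
        for (int i = count - 1; i &gt; 0; i--)
        {
            int j = rng.Next(i &#43; 1);   // 0 &lt;= j &lt;= i
            (keys[i], keys[j]) = (keys[j], keys[i]);
        }
        return keys;
    }
}</code></pre></div>
<p>Calling it from <code>[GlobalSetup]</code> keeps the shuffle itself out of the measured time.</p>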
<h3 id="run-it-yourself">Run it yourself</h3>
<div class="highlight"><pre data-lang="bash"><code>git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/enemies-of-measurement

# All six enemies
dotnet run -c Release

# One enemy at a time
dotnet run -c Release -- --filter &#39;*E5*&#39;

# OS noise comparison (Linux) — see E3 section for full commands
trap &#39;kill $(jobs -p) 2&gt;/dev/null&#39; EXIT
for i in $(seq 1 $(nproc)); do (while true; do :; done) &amp; done
dotnet run -c Release -- --filter &#39;*E3*&#39;
kill $(jobs -p)
taskset -c 0 dotnet run -c Release -- --filter &#39;*E3*&#39;</code></pre></div>
<p>The <em>direction</em> reproduces. The exact ratios depend on your hardware.</p>
<hr>
<h2 id="benchmark-environment">Benchmark environment</h2>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CPU</td>
          <td>2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)</td>
      </tr>
      <tr>
          <td>L3 Cache</td>
          <td>30 MB per socket</td>
      </tr>
      <tr>
          <td>RAM</td>
          <td>~115 GB DDR3-1866 (quad-channel per socket)</td>
      </tr>
      <tr>
          <td>OS</td>
          <td>Fedora Linux 42 (kernel 6.17)</td>
      </tr>
      <tr>
          <td>Runtime</td>
          <td>.NET 9.0.11 (RyuJIT AVX)</td>
      </tr>
      <tr>
          <td>SDK</td>
          <td>.NET SDK 10.0.102</td>
      </tr>
      <tr>
          <td>BenchmarkDotNet</td>
          <td>v0.14.0</td>
      </tr>
      <tr>
          <td>Job</td>
          <td>DefaultJob (BDN auto-selects iteration count, typically 15+)</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>Server GC, Concurrent (BDN enables Server GC in benchmark processes by default; host process uses Workstation)</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>In-memory (no WAL) — enemies hide in the in-memory path</td>
      </tr>
      <tr>
          <td>Power</td>
          <td><code>performance</code> governor, no frequency scaling</td>
      </tr>
      <tr>
          <td>Hygiene</td>
          <td>No browser, IDE, or heavy processes during runs</td>
      </tr>
  </tbody>
</table>
<p><strong>No WAL in this post.</strong> These enemies operate in the in-memory path, where fsync can&rsquo;t drown the signal.</p>
<hr>
<h2 id="we-walked-the-same-path">We walked the same path</h2>
<p>Same storage engine. Same path. Different place.</p>
<p>Heraclitus (~500 BCE): <em>you cannot step into the same river twice.</em> JIT, GC, scheduler, cache, branch predictor, dead code — the river moved between measurements. Six enemies, each shifting the answer 2–6× on this hardware. They stack.</p>
<p>A number that survives design review but not these six enemies is a comfortable lie — it looks right, it feels reproducible, and it&rsquo;s wrong.</p>
<p>Don&rsquo;t trust a number that hasn&rsquo;t survived six enemies.</p>
<hr>
<h2 id="further-reading">Further reading</h2>
<ul>
<li>Mytkowicz et al., <a href="https://dl.acm.org/doi/10.1145/1508244.1508275">Producing Wrong Data Without Doing Anything Obviously Wrong</a>, ASPLOS 2009 — how link order, environment variable size, and filesystem layout change benchmark results by 30%+.</li>
<li>Georges, Buytaert &amp; Eeckhout, <a href="https://dl.acm.org/doi/10.1145/1297027.1297033">Statistically Rigorous Java Performance Evaluation</a>, OOPSLA 2007 — methodology: how many iterations, confidence intervals, how to report results.</li>
<li>Curtsinger &amp; Berger, <a href="https://dl.acm.org/doi/10.1145/2451116.2451141">Stabilizer: Statistically Sound Performance Evaluation</a>, ASPLOS 2013 — randomizing code/data layout to eliminate cache alignment bias.</li>
<li>Blackburn et al., <a href="https://dl.acm.org/doi/10.1145/1378704.1378723">Wake Up and Smell the Coffee</a>, CACM 2008 — GC-aware benchmarking, steady-state vs startup.</li>
<li>Fog, <a href="https://www.agner.org/optimize/microarchitecture.pdf">Microarchitecture of Intel, AMD and VIA CPUs</a>, 2024 — branch prediction, cache hierarchy, instruction latency tables.</li>
<li>Fog, <a href="https://www.agner.org/optimize/optimizing_cpp.pdf">Optimizing Software in C++</a>, 2024 — compiler optimizations, dead code elimination, benchmark pitfalls.</li>
<li>Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a>, 2007 — cache hierarchy, prefetch, NUMA, TLB. Still the definitive reference.</li>
<li>Akinshin, <a href="https://link.springer.com/book/10.1007/978-1-4842-4941-3">Pro .NET Benchmarking</a>, Apress 2019 — author of BenchmarkDotNet, comprehensive treatment of all six enemies.</li>
<li>Gregg, <a href="https://www.brendangregg.com/systems-performance-2nd-edition-book.html">Systems Performance</a>, 2nd ed., 2020 — CPU scheduler, context switches, interrupt coalescing.</li>
<li><a href="https://benchmarkdotnet.org/">BenchmarkDotNet Documentation</a> — <a href="https://benchmarkdotnet.org/articles/configs/jobs.html">Jobs</a> (WarmupCount, IterationCount defaults), <a href="https://benchmarkdotnet.org/articles/guides/good-practices.html">Good Practices</a> (dead code, setup/cleanup), <a href="https://benchmarkdotnet.org/articles/configs/diagnosers.html">MemoryDiagnoser</a>, <a href="https://benchmarkdotnet.org/articles/configs/diagnosers.html">DisassemblyDiagnoser</a>.</li>
<li>.NET Runtime, <a href="https://github.com/dotnet/runtime/blob/main/docs/design/features/tiered-compilation.md">Tiered Compilation Design Doc</a> — Tier 0 → Tier 1 → OSR.</li>
<li>.NET Runtime, <a href="https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/garbage-collection.md">GC Design Doc</a> — Gen0/1/2, Server vs Workstation GC, suspension.</li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>BenchmarkDotNet uses <code>DefaultJob</code> for all benchmarks. E2 reports a custom job name (<code>Job-XSSCPO</code>) because <code>[IterationSetup]</code> forces <code>InvocationCount=1</code> and <code>UnrollFactor=1</code> — BDN cannot batch-invoke methods that require per-iteration setup. The iteration count is still auto-selected.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>.NET&rsquo;s tiered compilation: Tier-0 (quick JIT: fast compile, slow code) → Tier-1 (optimized: slow compile, fast code). From .NET Core 3.0 through .NET 6, <em>quick JIT for loops</em> was disabled by default (<code>TC_QuickJitForLoops</code> off), so methods containing loops were compiled straight to optimized code. Since .NET 7 it is enabled by default on x64 and Arm64, and on-stack replacement (OSR) moves a long-running loop to optimized code while it is still executing. <code>NoOptimization</code> is more extreme than Tier-0: it disables <em>all</em> optimizations, not just the expensive ones. For the full pipeline, see .NET Runtime <a href="https://github.com/dotnet/runtime/blob/main/docs/design/features/tiered-compilation.md">Tiered Compilation Design Doc</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>BenchmarkDotNet <code>DefaultJob</code> settings: <code>MinWarmupIterationCount</code> = 6, <code>MaxWarmupIterationCount</code> = 50 (adaptive), <code>MinIterationCount</code> = 15, <code>MaxIterationCount</code> = 100 (adaptive). See <a href="https://benchmarkdotnet.org/articles/configs/jobs.html">BenchmarkDotNet Jobs documentation</a> and source: <a href="https://github.com/dotnet/BenchmarkDotNet/blob/master/src/BenchmarkDotNet/Jobs/JobExtensions.cs"><code>JobExtensions.cs</code></a>.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Unlike E2 (which uses <code>[IterationSetup]</code> for a fresh table per iteration — because GC pressure needs fresh allocations), E3 intentionally uses <code>[GlobalSetup]</code> with a pre-populated table. Every iteration does updates to existing keys, not inserts that grow the <code>ConcurrentDictionary</code>. Fresh-table inserts add resize variance that drowns the OS noise signal we&rsquo;re trying to isolate.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>Gregg, <em>Systems Performance</em> (2nd ed., 2020), Ch. 6. Context switch overhead varies from ~5 μs (hot cache, same core) to 100+ μs (cold cache, cross-NUMA migration). On a dual-socket system, thread migration between sockets adds memory access latency on top of the pipeline flush.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:6">
<p>Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a>, 2007, sections 3 and 6. L1: ~1 ns, L2: ~4 ns, L3: ~12 ns, DRAM: 60–100 ns. Random access to a dataset larger than L3 falls back to full DRAM latency — no prefetch, no spatial locality, every access is a cache miss.&#160;<a href="#fnref:6" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:7">
<p>Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a>, 2007, sections 3.3 and 6.2. Sequential access triggers hardware prefetch — the CPU loads cache lines before code asks for them. Random access falls back to full DRAM latency.&#160;<a href="#fnref:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:8">
<p>Fog, <a href="https://www.agner.org/optimize/microarchitecture.pdf">Microarchitecture of Intel, AMD and VIA CPUs</a>, 2024, section 3. Branch prediction uses pattern history tables. A perfectly sorted sequence is trivially predictable after the transition point. A uniformly random ~50/50 pattern achieves the worst-case misprediction rate — the predictor has no pattern to learn. Each misprediction flushes the pipeline (15–20 cycles on modern Intel).&#160;<a href="#fnref:8" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:9">
<p>The JIT&rsquo;s dead code elimination for <code>ChecksumEliminated</code> is partial: it removes the accumulation (<code>checksum += 32 + (i % 225)</code>) because the result is never observed, but retains the loop counter (<code>i++</code>, compare, branch). The method still executes 10M loop iterations — it just does nothing useful in each one. This produces a plausible-looking 3.75 ms instead of the expected ~22 ms. The <code>DisassemblyDiagnoser</code> reveals the difference: 21 bytes of machine code (inc/cmp/jl) vs 66 bytes (full arithmetic + accumulation).&#160;<a href="#fnref:9" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>]]></content:encoded>
    </item>
  </channel>
</rss>
