Same engine, different answers

Design fixed. Environment changed: cache temperature, GC pressure, data order, JIT tier. The numbers move by 2–6× without touching the algorithm.

Enemy | Effect | What it distorts
1. JIT Optimization Level | 6× | Machine code quality
2. GC Pauses | 2.3× | Allocation in hot path
3. System Noise | 3.7× σ | Measurement variance
4. Cache State | 2.9× | Memory hierarchy
5. Branch Predictor | 5.0× | Data order
6. Dead Code Elimination | 5.9× | Return type

The first three, BenchmarkDotNet defends against — if you know to look. The last three, you’re on your own. Some enemies use the storage engine directly (E2, E3, E4). Others isolate CPU-level effects using data derived from the storage engine (E1, E5, E6) — because these distortions hide in any hot path, not just Insert and Get.

All code: clone, build, run. Numbers below: dual Xeon E5-2697 v2, 48 threads, 30 MB L3 per socket, ~115 GB DDR3-1866, Fedora 42, .NET 9.0.11 (RyuJIT AVX), BenchmarkDotNet v0.14.0.[1] No WAL: these enemies hide in the in-memory path, where fsync can’t drown the signal. Different hardware, different numbers. That’s half the lesson.


Enemy 1 — JIT Optimization Level

The storage engine holds 100,000 rows (via Row.Generate). Setup extracts all payloads into a contiguous byte[] of ~14.5 MB — an integrity-check scenario. Two versions of the same loop. Same data. Same operation. One difference: [MethodImpl(MethodImplOptions.NoOptimization)] — forcing the JIT to emit completely unoptimized code (no register promotion, no SIMD, no bounds check elimination).

Descartes: de omnibus dubitandum est (doubt everything, starting with your own setup). This is not a Tier-0 vs Tier-1 comparison.[2] NoOptimization disables all optimizations, which makes it the absolute lower bound. Real Tier-0 → Tier-1 transitions (short methods without loops, where Tier-0 applies) show 2–4×. The 6× here is the extreme case, deliberately exaggerated to make the enemy visible.

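using System.Runtime.CompilerServices;   // MethodImpl, MethodImplOptions
using BenchmarkDotNet.Attributes;        // Benchmark, GlobalSetup, DisassemblyDiagnoser

[DisassemblyDiagnoser(maxDepth: 3)]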
public class E1_JitWarmup
{
    private byte[] _payload; // ~14.5 MB — all payloads from 100K rows

    [GlobalSetup]
    public void Setup()
    {
        using var table = new StripedTable<int, Row>();
        for (int i = 0; i < 100_000; i++)
            table.Insert(i, Row.Generate(i));

        // Extract all payloads into contiguous array
        // ... (full source in companion code)
    }

    [Benchmark]
    [MethodImpl(MethodImplOptions.NoOptimization)]
    public long SumPayloadCold()
    {
        long sum = 0;
        var data = _payload;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }

    [Benchmark(Baseline = true)]
    public long SumPayloadWarm()
    {
        long sum = 0;
        var data = _payload;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }
}

Identical loop. Identical data. Identical result.

Method | Mean | Code Size | Ratio
SumPayloadCold | 49.764 ms | 124 B | 6.03
SumPayloadWarm | 8.247 ms | 49 B | 1.00

6× on this hardware. The [DisassemblyDiagnoser] on the class generates full JIT output in BenchmarkDotNet.Artifacts/results/ — 124 bytes of machine code vs 49. The unoptimized path pays for stack-based locals, bounds checks on every array access, scalar arithmetic — one byte at a time. The optimized path gets register promotion, bounds check elimination, and potentially SIMD vectorization. Same source code. Different machine code. 6× gap (remember: this is the extreme case — real Tier-0 → Tier-1 deltas are smaller but still significant).

BenchmarkDotNet runs warmup iterations by default (6–50 adaptive, plus 15–100 measurement iterations),[3] which is conservative enough for Tier-0 code to be recompiled at Tier-1 before measurement begins. Defense exists. But in benchmarks where tiered compilation actually applies (short methods without loops, where Tier-0 is the first compile), setting the warmup count too low, or testing a method that is invoked too few times to cross the recompilation threshold, can let unoptimized code leak into the measurement window. The first enemy hides in the JIT pipeline, and the DisassemblyDiagnoser is the only way to see it.
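The same gap can be observed without BenchmarkDotNet. The following is a minimal Stopwatch sketch, not the companion code: SumUnoptimized, SumOptimized and TimeMs are hypothetical names, the random payload only mimics the ~14 MB size used above, and AggressiveOptimization is swapped in for warmup so that the fast variant is fully optimized on its first call.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Same loop as the benchmark, compiled with every JIT optimization disabled.
[MethodImpl(MethodImplOptions.NoOptimization)]
static long SumUnoptimized(byte[] data)
{
    long sum = 0;
    for (int i = 0; i < data.Length; i++)
        sum += data[i];
    return sum;
}

// AggressiveOptimization bypasses tiering: optimized code from the first call.
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
static long SumOptimized(byte[] data)
{
    long sum = 0;
    for (int i = 0; i < data.Length; i++)
        sum += data[i];
    return sum;
}

// Hypothetical helper: best-of-N wall-clock time in milliseconds.
static double TimeMs(Func<byte[], long> f, byte[] data, int runs)
{
    double best = double.MaxValue;
    for (int r = 0; r < runs; r++)
    {
        var sw = Stopwatch.StartNew();
        f(data);
        sw.Stop();
        best = Math.Min(best, sw.Elapsed.TotalMilliseconds);
    }
    return best;
}

var payload = new byte[14_000_000];      // assumption: stand-in for the extracted payloads
new Random(42).NextBytes(payload);

double slow = TimeMs(SumUnoptimized, payload, 5);
double fast = TimeMs(SumOptimized, payload, 5);
Console.WriteLine($"unoptimized {slow:F2} ms, optimized {fast:F2} ms, ratio {slow / fast:F1}x");
```

A hand-rolled timer like this has none of BenchmarkDotNet's statistics, so treat the printed ratio as a direction, not a measurement.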


Enemy 2 — GC Pauses

Insert 100,000 rows into StripedTable. Same keys, same table, same final state. One difference: where the Row objects come from.

[MemoryDiagnoser]
public class E2_GcPauses
{
    private const int N = 100_000;
    private ITable<int, Row> _table;
    private int[] _keys;
    private Row[] _preAllocated;

    [GlobalSetup]  // keys + rows generated once, reused across iterations
    public void Setup()
    {
        var rng = new Random(42);
        _keys = new int[N];
        _preAllocated = new Row[N];
        for (int i = 0; i < N; i++)
        {
            _keys[i] = rng.Next(0, 200_000);
            _preAllocated[i] = Row.Generate(_keys[i]);
        }
    }

    [IterationSetup]
    public void IterationSetup()
    {
        _table = new StripedTable<int, Row>(); // fresh table per iteration
    }

    [Benchmark]
    public void InsertAllocHeavy()
    {
        for (int i = 0; i < N; i++)
            _table.Insert(_keys[i], Row.Generate(_keys[i])); // new byte[] per insert
    }

    [Benchmark(Baseline = true)]
    public void InsertPreAllocated()
    {
        for (int i = 0; i < N; i++)
            _table.Insert(_keys[i], _preAllocated[i]); // no per-insert allocation
    }
}

Row.Generate(key) allocates a fresh byte[32..256] every call. 100K inserts = 100K allocations = GC pressure. The baseline pre-allocates all rows in GlobalSetup — no per-insert payload allocations in the hot path. (The 7.52 MB in the table comes from ConcurrentDictionary internal growth — both methods pay that cost.)

Method | Mean | StdDev | Allocated | Alloc Ratio | Ratio
InsertAllocHeavy | 36.81 ms | 1.744 ms | 23.9 MB | 3.18 | 2.25
InsertPreAllocated | 16.35 ms | 1.448 ms | 7.52 MB | 1.00 | 1.00

2.3× on this workload. MemoryDiagnoser shows why: 23.9 MB allocated vs 7.52 MB. Both methods grow the ConcurrentDictionary from scratch (fresh table per iteration), but AllocHeavy adds 100K Row.Generate allocations on top, each creating a new byte[]. The extra allocation pressure triggers GC collections mid-measurement, and each pause adds microseconds that accumulate into milliseconds. Look at StdDev: 1.74 ms for the allocating path and 1.45 ms for the baseline. BenchmarkDotNet also flagged PreAllocated as bimodal (mValue = 3.94), consistent with GC pauses splitting the distribution into two clusters: iterations where a collection fired and iterations where it didn’t. The baseline still allocates 7.52 MB per iteration through dictionary growth, so it is not GC-free either. GC pauses are non-deterministic: sometimes a collection lands inside the timed region, sometimes it doesn’t.

BenchmarkDotNet can force GC between iterations (GcForce) and report allocation pressure (MemoryDiagnoser). The defense exists — but you have to look. A benchmark that allocates in the hot path and doesn’t report memory is measuring GC behavior, not your algorithm. The StdDev rises and nobody knows why.
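What MemoryDiagnoser reports can be approximated by hand with two runtime counters. The sketch below assumes nothing from the storage engine: MeasureAllocations is a hypothetical helper, and the byte arrays only imitate the 32 + key % 225 payload sizes, so the totals will not match the table above.

```csharp
using System;

// Hypothetical helper: run an action and report what the GC saw on this thread.
static (long Bytes, int Gen0) MeasureAllocations(Action action)
{
    long bytesBefore = GC.GetAllocatedBytesForCurrentThread();
    int gen0Before = GC.CollectionCount(0);
    action();
    return (GC.GetAllocatedBytesForCurrentThread() - bytesBefore,
            GC.CollectionCount(0) - gen0Before);
}

const int Count = 100_000;
var sink = new byte[Count][];            // storing the arrays keeps them on the heap
var preAllocated = new byte[Count][];
for (int i = 0; i < Count; i++)
    preAllocated[i] = new byte[32 + i % 225];

// Allocating path: a fresh payload per "insert".
var allocHeavy = MeasureAllocations(() =>
{
    for (int i = 0; i < Count; i++)
        sink[i] = new byte[32 + i % 225];
});

// Pre-allocated path: only references are copied.
var reuse = MeasureAllocations(() =>
{
    for (int i = 0; i < Count; i++)
        sink[i] = preAllocated[i];
});

Console.WriteLine($"alloc-heavy:   {allocHeavy.Bytes:N0} bytes, {allocHeavy.Gen0} gen0 collections");
Console.WriteLine($"pre-allocated: {reuse.Bytes:N0} bytes, {reuse.Gen0} gen0 collections");
```

If the first line shows megabytes and a nonzero collection count while the second shows close to nothing, the timed region is measuring the allocator and the collector along with the algorithm.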


Enemy 3 — System Noise

Two identical methods. Same table. Same data. Same code, literally copy-paste. The table is pre-populated in GlobalSetup, so every Insert is an update, not a growth event. Deterministic, constant-cost work where OS noise is the only variable.[4]

public class E3_OsNoise
{
    private const int N = 100_000;
    private ITable<int, Row> _table;
    private int[] _keys;
    private Row[] _rows;

    [GlobalSetup]
    public void Setup()
    {
        _table = new StripedTable<int, Row>();
        // ... generate keys and rows ...
        for (int i = 0; i < N; i++)
            _table.Insert(_keys[i], _rows[i]);  // pre-populate
    }

    [Benchmark(Baseline = true)]
    public void InsertBaseline()
    {
        for (int i = 0; i < N; i++)
            _table.Insert(_keys[i], _rows[i]);
    }

    [Benchmark]
    public void InsertSame()
    {
        for (int i = 0; i < N; i++)
            _table.Insert(_keys[i], _rows[i]);
    }
}

The interesting number isn’t the ratio between methods — it’s the StdDev across two runs of the same benchmark under different conditions:

# Linux-only: taskset is a util-linux tool (not available on macOS/Windows)

# === Run 1: Noisy — saturate all CPU cores, then benchmark ===

# If the script exits (Ctrl-C or error), kill all background jobs automatically
trap 'kill $(jobs -p) 2>/dev/null' EXIT

# Spawn one infinite busy loop per CPU core — fills the scheduler with work
# $(nproc) returns your core count (e.g. 48), each loop burns 100% of one core
for i in $(seq 1 $(nproc)); do
  (while true; do :; done) &   # & sends each loop to background
done

# Now run the benchmark — the OS scheduler must fight for CPU time
dotnet run -c Release -- --filter '*E3*'

# Stop all busy loops
kill $(jobs -p)

# === Run 2: Isolated — pin benchmark to a single core, no contention ===

# taskset -c 0 = run only on core 0: no migration (other processes may still be scheduled on that core)
taskset -c 0 dotnet run -c Release -- --filter '*E3*'

Noisy run (all cores saturated):

Method | Mean | StdDev | Ratio
InsertBaseline | 18.95 ms | 0.945 ms | 1.00
InsertSame | 18.11 ms | 0.583 ms | 0.96

Isolated run (pinned to core 0, idle system):

Method | Mean | StdDev | Ratio
InsertBaseline | 12.98 ms | 0.252 ms | 1.00
InsertSame | 13.17 ms | 0.254 ms | 1.01

Same code. Same data. Same machine. The noisy run is 46% slower (mean) and 3.7× noisier (StdDev). The noise isn’t just the OS scheduler; it’s the entire system under contention. A thread that migrates between cores lands on cold caches. Context switches inject 10–100 μs of jitter.[5] Competing processes saturate the memory bus and evict cache lines that the benchmark needs. Interrupts and kernel work preempt the benchmark thread mid-iteration. Under CPU saturation, these effects stack: on a 13 ms insert loop, the mean shifts by 46% and the variance explodes. On a 100 μs microbenchmark, the effect is destruction, not noise.
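The two headline figures come straight from the InsertBaseline rows, and the arithmetic is worth checking by hand. A small sketch follows; Compare and StdDev are hypothetical helpers, and the five per-iteration timings at the end are invented purely to show how a sample standard deviation is computed.

```csharp
using System;
using System.Linq;

// Sample standard deviation (n - 1 in the denominator).
static double StdDev(double[] samples)
{
    double mean = samples.Average();
    double sumSq = samples.Sum(x => (x - mean) * (x - mean));
    return Math.Sqrt(sumSq / (samples.Length - 1));
}

// Relative slowdown (percent) and noise ratio between two runs of the same benchmark.
static (double MeanShiftPct, double NoiseRatio) Compare(
    double noisyMean, double noisyStdDev, double quietMean, double quietStdDev)
    => ((noisyMean / quietMean - 1) * 100, noisyStdDev / quietStdDev);

// InsertBaseline rows from the noisy and isolated tables (milliseconds).
var (shift, noise) = Compare(18.95, 0.945, 12.98, 0.252);
Console.WriteLine($"mean shift: +{shift:F0}%, StdDev ratio: {noise:F2}x");

// Made-up per-iteration timings, only to exercise StdDev.
double[] timingsMs = { 12.8, 13.1, 12.9, 13.3, 12.7 };
Console.WriteLine($"sample StdDev: {StdDev(timingsMs):F3} ms");
```

Comparing StdDev relative to the mean (about 5% noisy vs about 2% isolated here) is often more telling than the absolute values, because the mean moved too.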

The defense: taskset pins the process to a core (add nice -n -20 with root for higher priority), and more iterations average out the noise. BenchmarkDotNet’s MinIterationCount and Affinity (a CPU core mask, the in-process equivalent of taskset) settings help. But the scheduler is always there, and the smaller your operation, the larger the enemy.


Three enemies down. All three live in the execution environment — BenchmarkDotNet can detect or mitigate them because it controls the process. The next three live at the boundary between your code and the hardware. Korzybski (1933): the map is not the territory. The framework maps the process. It can’t map a dataset that fits in L3, a data order that trains the branch predictor, or a return type that lets the JIT eliminate your computation. Those are your choices — and the hardware responds to them silently.


Enemy 4 — Cache State

Random Get() on StripedTable — in-memory, no WAL (hence nanosecond latencies, not microsecond-scale numbers where fsync dominates). Same operation. Same code. One parameter: how many entries in the table.

public class E4_CacheState
{
    private const int LookupCount = 100_000;

    [Params(10_000, 2_000_000)]
    public int TableSize { get; set; }

    private ITable<int, Row> _table;
    private int[] _lookupKeys;

    [GlobalSetup]
    public void Setup()
    {
        _table = new StripedTable<int, Row>();
        for (int i = 0; i < TableSize; i++)
            _table.Insert(i, Row.Generate(i));
        // ... random lookup keys ...
    }

    [Benchmark(OperationsPerInvoke = LookupCount)]
    public Row? LookupRandom()
    {
        Row? last = default;
        var table = _table;
        var keys = _lookupKeys;
        for (int i = 0; i < LookupCount; i++)
            last = table.Get(keys[i]);
        return last;
    }
}

OperationsPerInvoke divides total time by 100K — reporting per-lookup latency. Same Get(). Same StripedTable. Different table size.

TableSize | Mean | StdDev
10,000 | 17.05 ns | 0.190 ns
2,000,000 | 50.08 ns | 1.919 ns

2.9× on this hardware. StdDev tells the rest of the story.

10K entries: the benchmark’s working set, ConcurrentDictionary bucket arrays (~80 KB) plus Node objects (~400 KB), totals ~500 KB, comfortably within the 30 MB L3 on the local socket (dual-socket NUMA: each socket has its own 30 MB L3, and the benchmark thread runs on one).[6] The Row payloads (~1.4 MB of byte[]) exist on the heap but LookupRandom never dereferences them; it returns the Row struct, not the data. So only the dictionary traversal structure needs to fit in cache. Every lookup hits cached memory. StdDev is 0.19 ns: tight, repeatable.

2M entries: the dictionary working set (bucket arrays ~32 MB + nodes ~80 MB ≈ 112 MB) exceeds L3 by a wide margin and spills to DRAM. Random access means random cache misses — each miss costs 60–100 ns instead of 4–12 ns. StdDev jumps to 1.9 ns — 10× noisier — because DRAM latency varies with access pattern, NUMA topology, and memory controller contention.

Cache doesn’t just change the speed; it changes the quality of the measurement. The hot-cache case gives tight numbers, low StdDev, and repeatable results, and it is potentially misleading. Popper (1934): a benchmark can falsify a hypothesis but never confirm one. The 2.9× gap and 10× StdDev increase point at the cache hierarchy. perf stat -e cache-misses,cache-references would corroborate that, but the measurement already suggests the answer.

Same symptom — inflated speed and false confidence. Different cause. Hot cache vs cold DRAM.
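The cache effect can be isolated from the dictionary entirely with a pointer chase: an array where each slot holds the index of the next slot to visit. This is a sketch under stated assumptions, not part of the companion code. BuildCycle and Chase are hypothetical names, the three sizes are picked to sit well below and well above a 30 MB L3, and a Stopwatch stands in for BenchmarkDotNet.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Build a single random cycle over [0, n) with Sattolo's algorithm, so that
// following next[i] visits every slot once before returning to the start.
static int[] BuildCycle(int n, int seed)
{
    var next = new int[n];
    for (int i = 0; i < n; i++) next[i] = i;
    var rng = new Random(seed);
    for (int i = n - 1; i > 0; i--)
    {
        int j = rng.Next(i);                       // 0 <= j < i, never i itself
        (next[i], next[j]) = (next[j], next[i]);
    }
    return next;
}

// Each load depends on the previous one, so the loads cannot be overlapped or prefetched.
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
static int Chase(int[] next, int steps)
{
    int p = 0;
    for (int s = 0; s < steps; s++) p = next[p];
    return p;
}

foreach (int n in new[] { 1 << 14, 1 << 20, 1 << 24 })   // 64 KB, 4 MB, 64 MB of int
{
    var cycle = BuildCycle(n, 42);
    const int Steps = 10_000_000;
    Chase(cycle, Steps);                                  // untimed pass to warm up
    var sw = Stopwatch.StartNew();
    int end = Chase(cycle, Steps);
    sw.Stop();
    double nsPerHop = sw.Elapsed.TotalMilliseconds * 1e6 / Steps;
    Console.WriteLine($"{n * 4 / 1024,8} KB: {nsPerHop:F1} ns/hop (end={end})");
}
```

Same loop, same instruction count per hop; only the working set changes. On most machines the per-hop time should climb as the array outgrows each cache level, which is the same shape as the 10K vs 2M table above.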


Enemy 5 — Branch Predictor Training

Scan the results from the storage engine. Row.Generate(key) produces payloads of 32–256 bytes (formula: 32 + key % 225). Count how many exceed a threshold. Standard aggregation — the kind you’d run after querying the table.

public class E5_BranchPredictor
{
    [Params(8_000_000)]
    public int N { get; set; }

    private int[] _sorted;  // Row sizes from Row.Generate formula, sorted
    private int[] _random;  // Same values, shuffled

    [GlobalSetup]
    public void Setup()
    {
        _sorted = new int[N];
        for (int i = 0; i < N; i++)
            _sorted[i] = 32 + (i % 225);  // Row.Generate payload formula
        Array.Sort(_sorted);

        _random = _sorted.ToArray();
        new Random(42).Shuffle(_random);
    }

    [Benchmark]
    public int ScanSorted()
    {
        int count = 0, threshold = 150;
        var data = _sorted;
        for (int i = 0; i < data.Length; i++)
            if (data[i] > threshold) count++;
        return count;
    }

    [Benchmark(Baseline = true)]
    public int ScanRandom()
    {
        int count = 0, threshold = 150;
        var data = _random;
        for (int i = 0; i < data.Length; i++)
            if (data[i] > threshold) count++;
        return count;
    }
}

Same values. Same count returned. Both arrays accessed sequentially, so the prefetcher treats them identically.[7] Same memory layout, same access pattern. Only the value order differs, which is what branch predictors respond to.

Method | Mean | Ratio
ScanSorted | 8.214 ms | 0.20
ScanRandom | 41.363 ms | 1.00

5.0× on this hardware. Same algorithm, same data, same cache behavior — different order.

Threshold 150 splits the range roughly in half: 106 out of 225 possible values exceed it (~47%). Near 50–50 is maximum branch unpredictability.[8] The sorted array presents a clean pattern: all values at or below the threshold, then all above. The branch predictor learns after a few iterations and predicts correctly for millions of subsequent elements. The shuffled array is a coin flip every iteration: the predictor guesses wrong ~47% of the time, and each misprediction costs 15–20 cycles while the pipeline flushes and refills.

Sequential keys feeding the prefetcher is a data design problem. Here the access pattern is identical in both runs and only the value order differs, yet the branch predictor still changes the result without your knowledge. You’re trying to measure the storage engine’s aggregation cost. You’re mostly measuring the CPU pipeline’s response to data order.
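One way to test the branch-predictor explanation is to remove the branch and see whether the sorted/shuffled gap goes away. The sketch below is an assumption-laden illustration, not the article's benchmark: CountBranchy and CountBranchless are hypothetical names, and the sign-bit trick is only valid because payload sizes stay in 32..256, so the subtraction cannot overflow.

```csharp
using System;

// Branchy version: one data-dependent conditional per element, as in the benchmark.
static int CountBranchy(int[] data, int threshold)
{
    int count = 0;
    for (int i = 0; i < data.Length; i++)
        if (data[i] > threshold) count++;
    return count;
}

// Branch-free version: (threshold - value) is negative exactly when value > threshold,
// so its sign bit, shifted down to bit 0, is the increment.
static int CountBranchless(int[] data, int threshold)
{
    int count = 0;
    for (int i = 0; i < data.Length; i++)
        count += (int)((uint)(threshold - data[i]) >> 31);
    return count;
}

var sizes = new int[225];
for (int i = 0; i < sizes.Length; i++) sizes[i] = 32 + i;   // every possible payload size once
new Random(42).Shuffle(sizes);                               // Random.Shuffle needs .NET 8+

Console.WriteLine(CountBranchy(sizes, 150));      // 106 of the 225 sizes exceed 150
Console.WriteLine(CountBranchless(sizes, 150));   // same count, no data-dependent branch
```

If a benchmark of the branch-free variant shows sorted and shuffled arrays running at about the same speed, that points at misprediction rather than memory as the cause of the 5× gap. Whether the JIT emits a branch for the first variant is worth checking with DisassemblyDiagnoser rather than assuming.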


Enemy 6 — Dead Code Elimination

Sum the data from Row.Generate’s formula — a checksum for integrity verification. 10 million iterations, pure arithmetic: 32 + (i % 225). No memory access. No exceptions. No side effects.

[DisassemblyDiagnoser(maxDepth: 3)]
public class E6_DeadCode
{
    [Params(10_000_000)]
    public int N { get; set; }

    [Benchmark]
    public void ChecksumEliminated()
    {
        long checksum = 0;
        for (int i = 0; i < N; i++)
            checksum += 32 + (i % 225);
        // checksum not returned — JIT drops the accumulation
    }

    [Benchmark(Baseline = true)]
    public long ChecksumPreserved()
    {
        long checksum = 0;
        for (int i = 0; i < N; i++)
            checksum += 32 + (i % 225);
        return checksum;
    }
}

Identical loop. One returns the result. One doesn’t.

Method | Mean | Code Size | Ratio
ChecksumEliminated | 3.750 ms | 21 B | 0.17
ChecksumPreserved | 22.220 ms | 66 B | 1.00

5.9× on this hardware. The DisassemblyDiagnoser shows why — the actual machine code for both methods:

; ChecksumEliminated — 21 bytes
M00_L00:
  inc   eax          ; i++
  cmp   eax, ecx     ; i < N?
  jl    M00_L00      ; loop

; ChecksumPreserved — 66 bytes
M00_L00:
  mov   edx, 91A2B3C5 ; magic constant for i % 225
  imul  esi            ; compiler-generated modulo
  ; ... 8 more instructions for 32 + (i % 225) ...
  add   rcx, rax      ; checksum += result
  inc   esi            ; i++
  cmp   esi, edi       ; i < N?
  jl    M00_L00        ; loop

[DisassemblyDiagnoser] on the class generates this — run the benchmark and check BenchmarkDotNet.Artifacts/results/ for the full listing (HTML + Markdown).

The JIT determined that checksum has no observable side effects (nobody reads it) and stripped out the entire accumulation. What remains is inc/cmp/jl: the loop counter, iterating 10 million times over nothing.[9] The fix is simple: always return the computed value so the JIT must preserve it.

Here’s what makes this the most dangerous enemy: 3.75 ms looks plausible. It’s not zero. It’s not suspiciously fast. It looks like a reasonable time for 10 million iterations of lightweight arithmetic. Without DisassemblyDiagnoser, you’d trust it. You’d compare it against another implementation. You’d ship a conclusion based on a number that measures empty loop iterations.

21 bytes vs 66 bytes. The disassembler is the only reliable way to catch this. Because the lie that looks reasonable is worse than the lie that looks absurd.
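Returning the value keeps the computation alive; checking it against an independent expectation also proves the work was done. Here is a minimal sketch of that cross-check for this particular checksum. ChecksumLoop and ChecksumClosedForm are hypothetical names, and the closed form is derived from the 32 + (i % 225) formula, not taken from the companion code.

```csharp
using System;

// The benchmarked loop, with the result returned so the JIT has to keep the arithmetic.
static long ChecksumLoop(int n)
{
    long checksum = 0;
    for (int i = 0; i < n; i++)
        checksum += 32 + (i % 225);
    return checksum;
}

// Independent closed form: 32 per element, plus full cycles of 0..224,
// plus the partial cycle 0..rest-1 at the end.
static long ChecksumClosedForm(int n)
{
    long fullCycles = n / 225;
    long rest = n % 225;
    return 32L * n + fullCycles * (224L * 225 / 2) + rest * (rest - 1) / 2;
}

int count = 10_000_000;
long measured = ChecksumLoop(count);
long expected = ChecksumClosedForm(count);
Console.WriteLine(measured == expected
    ? $"checksum verified: {measured}"
    : $"mismatch: loop={measured}, closed form={expected}");
```

A check like this belongs in setup or in a test, not inside the timed method. It does not replace the disassembly, but a benchmark whose output is verified somewhere is harder to hollow out by accident.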


Know your enemies

Enemy | Effect | Symptom | Defense
1. JIT Optimization Level | 6׆ | NoOptimization 6× slower (†extreme case; real Tier-0→1: 2–4×) | Warmup (BDN default) + DisassemblyDiagnoser
2. GC Pauses | 2.3× | Allocation in hot path, StdDev spike | MemoryDiagnoser + GcForce + pre-allocate
3. System Noise | 3.7× StdDev | Mean +46%, StdDev 3.7× under load | taskset + nice + more iterations
4. Cache State | 2.9× | Working set > L3 | Conscious choice: cold vs warm vs hot
5. Branch Predictor | 5.0× | Sorted data 5× faster | Realistic (shuffled) data
6. Dead Code Elimination | 5.9× | Code Size 21 B vs 66 B | Return result + DisassemblyDiagnoser

Each enemy alone shifted the result 2–6× on this hardware. Stack three and the benchmark and production are different universes.

A reference checklist — not a universal shield, but a starting point that covers what BDN configuration can cover (enemies 1–3) and adds inspection tooling for what it can’t (enemies 4–6). The enemy benchmarks in the companion code intentionally don’t use it — defenses must be down to show the enemies in action:

public class EnemyDefenseConfig : ManualConfig
{
    public EnemyDefenseConfig()
    {
        AddJob(Job.Default
            .WithWarmupCount(3)               // E1: ensure Tier-1 before measurement
            .WithGcServer(true)               // E2: Server GC — fewer, larger collections
            .WithGcForce(true)                // E2: force GC between iterations
            .WithMinIterationCount(15)        // E3: average out scheduler noise
            .WithMaxIterationCount(100)       // E3: let BDN adapt when noise is present
            .WithAffinity((IntPtr)0b11));     // E3: pin to cores 0–1

        AddDiagnoser(MemoryDiagnoser.Default);              // E2: allocation pressure
        AddDiagnoser(new DisassemblyDiagnoser(              // E1+E6: JIT output
            new DisassemblyDiagnoserConfig(maxDepth: 3)));

        AddColumn(StatisticColumn.StdDev);                  // E3: noise visible
    }
}

Enemies 1–3: configuration. Enemies 4–6: conscious data design. No config setting shuffles your test data for you.

Run it yourself

git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/enemies-of-measurement

# All six enemies
dotnet run -c Release

# One enemy at a time
dotnet run -c Release -- --filter '*E5*'

# OS noise comparison (Linux) — see E3 section for full commands
trap 'kill $(jobs -p) 2>/dev/null' EXIT
for i in $(seq 1 $(nproc)); do (while true; do :; done) & done
dotnet run -c Release -- --filter '*E3*'
kill $(jobs -p)
taskset -c 0 dotnet run -c Release -- --filter '*E3*'

The direction reproduces. The exact ratios depend on your hardware.


Benchmark environment

Component | Details
CPU | 2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)
L3 Cache | 30 MB per socket
RAM | ~115 GB DDR3-1866 (quad-channel per socket)
OS | Fedora Linux 42 (kernel 6.17)
Runtime | .NET 9.0.11 (RyuJIT AVX)
SDK | .NET SDK 10.0.102
BenchmarkDotNet | v0.14.0
Job | DefaultJob (BDN auto-selects iteration count, typically 15+)
GC | Server GC, Concurrent (BDN enables Server GC in benchmark processes by default; host process uses Workstation)
Storage | In-memory (no WAL); enemies hide in the in-memory path
Power | performance governor, no frequency scaling
Hygiene | No browser, IDE, or heavy processes during runs

No WAL in this post. These enemies operate in the in-memory path, where fsync can’t drown the signal.


We walked the same path

Same storage engine. Same path. Different place.

Heraclitus (~500 BCE): you cannot step into the same river twice. JIT, GC, scheduler, cache, branch predictor, dead code — the river moved between measurements. Six enemies, each shifting the answer 2–6× on this hardware. They stack.

A number that survives design review but not these six enemies is a comfortable lie — it looks right, it feels reproducible, and it’s wrong.

Don’t trust a number that hasn’t survived six enemies.


Further reading


  1. BenchmarkDotNet uses DefaultJob for all benchmarks. E2 reports a custom job name (Job-XSSCPO) because [IterationSetup] forces InvocationCount=1 and UnrollFactor=1: BDN cannot batch-invoke methods that require per-iteration setup. The iteration count is still auto-selected.

  2. .NET’s tiered compilation: Tier-0 (quick JIT: fast compile, slow code) → Tier-1 (optimized: slow compile, fast code). From .NET Core 3.0 through .NET 6, quick JIT for loops was disabled by default (TC_QuickJitForLoops off), so methods containing loops went straight to fully optimized code. .NET 7 and later enable it by default on x64 and Arm64 together with on-stack replacement (OSR), so loop methods start at Tier-0 and switch to optimized code while running. NoOptimization is more extreme than Tier-0: it disables all optimizations, not just the expensive ones. For the full pipeline, see .NET Runtime Tiered Compilation Design Doc.

  3. BenchmarkDotNet DefaultJob settings: MinWarmupIterationCount = 6, MaxWarmupIterationCount = 50 (adaptive), MinIterationCount = 15, MaxIterationCount = 100 (adaptive). See BenchmarkDotNet Jobs documentation and source: DefaultConfig.

  4. Unlike E2 (which uses [IterationSetup] for a fresh table per iteration, because GC pressure needs fresh allocations), E3 intentionally uses [GlobalSetup] with a pre-populated table. Every iteration does updates to existing keys, not inserts that grow the ConcurrentDictionary. Fresh-table inserts add resize variance that drowns the OS noise signal we’re trying to isolate.

  5. Gregg, Systems Performance (2nd ed., 2020), Ch. 6. Context switch overhead varies from ~5 μs (hot cache, same core) to 100+ μs (cold cache, cross-NUMA migration). On a dual-socket system, thread migration between sockets adds memory access latency on top of the pipeline flush.

  6. Drepper, What Every Programmer Should Know About Memory, 2007, sections 3 and 6. L1: ~1 ns, L2: ~4 ns, L3: ~12 ns, DRAM: 60–100 ns. Random access to a dataset larger than L3 falls back to full DRAM latency: no prefetch, no spatial locality, every access is a cache miss.

  7. Drepper, What Every Programmer Should Know About Memory, 2007, sections 3.3 and 6.2. Sequential access triggers hardware prefetch: the CPU loads cache lines before code asks for them. Random access falls back to full DRAM latency.

  8. Fog, Microarchitecture of Intel, AMD and VIA CPUs, 2024, section 3. Branch prediction uses pattern history tables. A perfectly sorted sequence is trivially predictable after the transition point. A uniformly random ~50/50 pattern achieves the worst-case misprediction rate because the predictor has no pattern to learn. Each misprediction flushes the pipeline (15–20 cycles on modern Intel).

  9. The JIT’s dead code elimination for ChecksumEliminated is partial: it removes the accumulation (checksum += 32 + (i % 225)) because the result is never observed, but retains the loop counter (i++, compare, branch). The method still executes 10M loop iterations; it just does nothing useful in each one. This produces a plausible-looking 3.75 ms instead of the expected ~22 ms. The DisassemblyDiagnoser reveals the difference: 21 bytes of machine code (inc/cmp/jl) vs 66 bytes (full arithmetic + accumulation).