27.2M ops/sec — and a lie

Same two classes. Same data. Same machine. One benchmark says Dictionary + lock is 2× faster. Another says ConcurrentDictionary is 17× faster. A third says it doesn’t matter — fsync buries the difference in noise. Same optimization — three verdicts.

All code in this post: clone, build, run. Numbers below: dual Xeon E5-2697 v2, 48 threads, 30 MB L3 per socket (NUMA — two sockets, two separate caches, cross-socket traffic is real), ~115 GB DDR3-1866, Fedora 42, .NET 9.0.11 (RyuJIT AVX), BenchmarkDotNet v0.14.0, ShortRun job (LaunchCount=1, WarmupCount=3, IterationCount=3). No process or thread affinity pinning — on dual-socket NUMA, thread migration and cross-socket memory access can widen variance. Different hardware, different numbers — that’s half the lesson.

Throughput in this post is reported in M ops/sec, derived from BenchmarkDotNet’s per-operation time via OperationsPerInvoke = 500_000.

Methodology details (click to expand)
  • Inserts per invocation: N = 500,000 (insert-or-update via dictionary[key] = value). BDN divides invocation time by N to report nanoseconds per insert, from which ops/sec follows directly.
  • Sequential/Narrow keys (0..N-1): every invocation after the first is pure overwrite — steady-state from the second call onward.
  • Realistic keys (random from [0..1M)): _randomKeys is pre-generated once in GlobalSetup — a deterministic multiset with ~393K unique keys (expected for 500K draws from 1M: n(1 - e^{-m/n})). Subsequent invocations replay the same keys, modeling steady-state overwrites.
  • Why GlobalSetup, not IterationSetup: the table is created once and persists across all iterations — no fresh table per run. This is deliberate. IterationSetup (reset the table before every iteration) measures cold inserts into an empty dictionary; GlobalSetup measures steady-state overwrites into a populated one. Cold inserts are dominated by dictionary resizing and memory allocation — costs that flatten the Lock vs Striped difference to ~1×, hiding the contention the post is about. Production systems don’t restart between requests. The 17× Narrow advantage and the three-verdict divergence only appear under steady-state, where lock contention — not allocation — is the bottleneck.
  • WAL storage: btrfs (/var/tmp, not tmpfs) — fsync hits a real filesystem, verified via stat -f -c %T.
  • ShortRun job: ShortRun is deliberately quick (3 measurement iterations) — enough to show relative behavior across scenarios, not enough for tight confidence intervals or absolute production throughput. BDN emits the exact job config in every run: Job=ShortRun-.NET 9.0 Runtime=.NET 9.0 IterationCount=3 LaunchCount=1 WarmupCount=3.
  • GC: BenchmarkDotNet can force a full GC between invocations (controlled by GcForce in the job config) — another laboratory condition that limits real-world representativeness.
  • For publication-grade numbers: use [SimpleJob] (15+ iterations, tighter confidence intervals / CIs); ShortRun trades precision for speed.

The anatomy of a comfortable lie

private LockTable<int, Row> _table;

[Benchmark]
public void InsertBenchmark()
{
    for (int i = 0; i < 1_000_000; i++)
        _table.Insert(i, GenerateRow(i));
}

A storage engine — the running example here because it stacks every distortion layer: disk I/O, caching, concurrency, durability, all tangled. LockTable wraps Dictionary<TKey, TValue> behind a single lock; StripedTable wraps ConcurrentDictionary (striped locks1). They exist so the examples compile and run, and the locking semantics match. Swap “storage engine” for HTTP server, JSON serializer, or crypto library and the mechanics are identical. Anything touching memory hierarchies or I/O carries this gap.

Four assumptions hiding inside that one method:

What the benchmark assumes What load exposes
Sequential keys (0, 1, 2, …) Random keys (uniform here; production often Zipfian2)
Single thread, zero contention Dozens of threads, lock pressure
Data fits in L3 cache (30 MB per socket) Working set 10–100× larger
No durability (no fsync) WAL3 + fsync per batch

The instinct is correct: fix it. Swap the lock for ConcurrentDictionary, add threads, use real data. The mistake comes next — believing the benchmark that exposed the problem can also verify the fix.


Scenario A — The flat line (Lazy)

[Benchmark(Baseline = true, OperationsPerInvoke = N)]
public void InsertSequential()
{
    for (int i = 0; i < N; i++)
        _table.Insert(i, Row.Default);
}

Sequential keys. Single method. No durability. Run against both backends — Dictionary + lock and ConcurrentDictionary — with a ThreadCount parameter set to 1, 4, 16, 32. InsertSequential ignores it. That’s the point.

| Backend              | 1 thread | 4 threads | 16 threads | 32 threads |
|----------------------|----------|-----------|------------|------------|
| Dictionary + lock    | 27.2M    | 27.6M     | 27.3M      | 24.6M      |
| ConcurrentDictionary | 12.3M    | 10.5M     | 10.4M      | 10.6M      |

These are steady-state overwrite numbers — state persists across invocations within a parameter combination, so after the first invocation the dictionary is full and every subsequent insert (including warmup and measurement) is a key update, not a growth/rehash event. Lock ranges from 24.6M to 27.6M across all thread counts — essentially flat (InsertSequential ignores ThreadCount; the spread is ShortRun noise). ConcurrentDictionary ranges from 10.4M to 12.3M. Is that variation real? Here’s the raw BenchmarkDotNet output for ConcurrentDictionary InsertSequential:

| ThreadCount | Mean      | Error      | StdDev   | → M ops/sec |
|-------------|-----------|------------|----------|-------------|
| 1           | 81.28 ns  | ±24.88 ns  | 1.36 ns  | 12.3M       |
| 4           | 95.66 ns  | ±62.53 ns  | 3.43 ns  | 10.5M       |
| 16          | 96.15 ns  | ±24.42 ns  | 1.34 ns  | 10.4M       |
| 32          | 94.09 ns  | ±67.09 ns  | 3.68 ns  | 10.6M       |

Error is half of BenchmarkDotNet’s confidence interval — throughout this post, “CI” means BDN’s default (99.9%, very conservative). At ShortRun’s 3 measurement iterations the t-distribution with 2 degrees of freedom inflates the CI further. StdDev is small (1.3–3.7 ns), meaning the individual runs were consistent, but the statistical confidence is low. The CIs overlap massively across all four rows. The variation looks like noise.

Neither backend reacts to thread count. Sequential keys feed the prefetcher4, spatial locality holds the working set in cache, no fsync means no I/O stalls.

The benchmark doesn’t react because it has nothing to react to.

Verdict: ConcurrentDictionary is a regression. Revert.


Scenario B — The better lie (Narrow)

[Benchmark(OperationsPerInvoke = N)]
public void InsertNarrow()
{
    int opsPerThread = N / ThreadCount;
    var options = new ParallelOptions { MaxDegreeOfParallelism = ThreadCount };
    Parallel.For(0, ThreadCount, options, threadIdx =>
    {
        int start = threadIdx * opsPerThread;
        for (int i = 0; i < opsPerThread; i++)
            _table.Insert(start + i, Row.Default);
    });
}

One lie removed: threads now run in parallel. Three remain — sequential keys per partition, working set still fits in L3 cache, and no durability. MaxDegreeOfParallelism caps concurrency at ThreadCount; the actual threads come from the .NET ThreadPool, which ramps up workers on demand — for long-running CPU-bound work like this, the pool usually saturates quickly.

| Backend              | 1 thread | 4 threads | 16 threads | 32 threads |
|----------------------|----------|-----------|------------|------------|
| Dictionary + lock    | 26.5M    | 5.8M      | 4.1M       | 2.8M       |
| ConcurrentDictionary | 10.4M    | 16.8M     | 40.5M      | 49.3M      |

The direction of the crossover is clear — even with ShortRun’s wide CIs, the magnitude is too large to dismiss. Raw BenchmarkDotNet output for InsertNarrow (ns per insert, via OperationsPerInvoke). Note: ns/op here is wall‑clock time divided by N, not single‑thread latency — parallel overlap makes ns/op smaller than any individual thread’s per‑insert time. Note the Error column — ShortRun (n=3) produces CIs wider than the Mean in several rows:

| Backend    | ThreadCount | Mean      | Error       | StdDev    | → M ops/sec |
|------------|-------------|-----------|-------------|-----------|-------------|
| Lock       | 1           | 37.67 ns  | ±5.63 ns    | 0.31 ns   | 26.5M       |
| Lock       | 32          | 351.35 ns | ±1,159.4 ns | 63.55 ns  | 2.8M        |
| ConcDic    | 1           | 95.79 ns  | ±60.33 ns   | 3.31 ns   | 10.4M       |
| ConcDic    | 32          | 20.28 ns  | ±35.85 ns   | 1.97 ns   | 49.3M       |

Lock collapses from 26.5M to 2.8M. ConcurrentDictionary climbs from 10.4M to 49.3M. The direction is unambiguous; the exact ratio (~17× at 32 threads) is approximate given ShortRun’s wide CIs. The verdict flips: ship the optimization.

Except — 49.3M ops/sec will never happen in production. Sequential keys per partition mean the prefetcher runs at full tilt. The working set still fits comfortably in L3 cache. No fsync means zero I/O wait. On this hardware, the gap between Narrow and Realistic is 391× (49.3M vs 126K). Fsync drowns both backends to the point where the CIs overlap completely.

Goodhart’s Law (1975), in Strathern’s phrasing (1997): “When a measure becomes a target, it ceases to be a good measure.”

Volkswagen — Deutsche Gründlichkeit — shipped ECU software that detected test conditions and switched calibration5. When the car knew it was being tested, it changed behavior. Passed every bench in Europe and the US for years. Not better engineering. Better cheating.

A benchmark that reacts in the right direction but overstates the magnitude is more dangerous than one that doesn’t react at all — it breeds the confidence to ship without measuring further.


Scenario C — Same optimization, opposite conclusions (Realistic)

[GlobalSetup]
public void Setup()
{
    var walDir = Path.Combine("/var/tmp", "bench-wal");
    Directory.CreateDirectory(walDir);
    var walPath = Path.Combine(walDir, $"wal-{BackendType}-{ThreadCount}.log");
    _table = BackendType == Backend.Lock
        ? new LockTable<int, Row>(walPath: walPath)
        : new StripedTable<int, Row>(walPath: walPath);
    var rng = new Random(42);
    _randomKeys = new int[N];
    _randomRows = new Row[N];
    for (int i = 0; i < N; i++)
    {
        _randomKeys[i] = rng.Next(0, KeySpace);
        _randomRows[i] = Row.Generate(_randomKeys[i]);
    }
}

[Benchmark(OperationsPerInvoke = N)]
public void InsertRealistic()
{
    int opsPerThread = N / ThreadCount;
    var options = new ParallelOptions { MaxDegreeOfParallelism = ThreadCount };
    Parallel.For(0, ThreadCount, options, threadIdx =>
    {
        int start = threadIdx * opsPerThread;
        for (int i = 0; i < opsPerThread; i++)
            _table.Insert(_randomKeys[start + i], _randomRows[start + i]);
    });
    _table.FlushWAL();  // one fsync per batch — group commit, not per-insert
}

All four distortions from the opening table addressed. Random keys. Parallel threads. Working set that approaches L3 capacity — at ~500K entries with variable-size Row payloads, the hash table reaches the 30 MB L3 boundary (keys + rows + bucket overhead), before counting the WAL buffer. Durability enforced — FlushWAL() calls fsync once per batch (group commit), not per insert. WAL on btrfs (/var/tmp), not tmpfs — fsync hits a real filesystem. A real OLTP system calling fsync per transaction would be slower still.

| Backend              | 1 thread | 4 threads | 16 threads | 32 threads |
|----------------------|----------|-----------|------------|------------|
| Dictionary + lock    | 125K     | 122K      | 124K       | 120K       |
| ConcurrentDictionary | 125K     | 124K      | 126K       | 126K       |

Are these differences real? Raw BenchmarkDotNet output for InsertRealistic at 32 threads:

| Backend    | Mean        | Error       | StdDev     | → K ops/sec |
|------------|-------------|-------------|------------|-------------|
| Lock       | 8,335 ns/op | ±4,166 ns   | 228 ns     | 120K        |
| ConcDic    | 7,934 ns/op | ±1,538 ns   | 84 ns      | 126K        |

The 99.9% CIs overlap (Lock: 4,169–12,501 ns; ConcDic: 6,396–9,472 ns). With only 3 iterations (ShortRun), the CIs are too wide to resolve a 5% difference — it’s indistinguishable from noise at this sample size.

The lines don’t cross. They overlap. Both backends converge to ~124K ops/sec regardless of thread count. Note the unit: ns/op here is wall‑clock time per invocation divided by N, so each insert’s share includes 1/N of the single FlushWAL() call — amortized fsync, not per‑insert fsync. In batch terms: ~8,000 ns/op × 500K = ~4 seconds per batch, one fsync per batch — a commit rate of ~0.25 batches/sec.

How do we know fsync is the bottleneck and not just noise? Remove it. A fourth benchmark, InsertRealisticNoFlush, runs the same random‑key parallel inserts without calling FlushWAL():

[Benchmark(OperationsPerInvoke = N)]
public void InsertRealisticNoFlush()
{
    int opsPerThread = N / ThreadCount;
    var options = new ParallelOptions { MaxDegreeOfParallelism = ThreadCount };
    Parallel.For(0, ThreadCount, options, threadIdx =>
    {
        int start = threadIdx * opsPerThread;
        for (int i = 0; i < opsPerThread; i++)
            _table.Insert(_randomKeys[start + i], _randomRows[start + i]);
    });
    // no FlushWAL — pure in-memory
}

Raw BenchmarkDotNet output at 32 threads:

| Method                 | Backend | Mean        | Error       | StdDev    |
|------------------------|---------|-------------|-------------|-----------|
| InsertRealistic        | Lock    | 8,335 ns/op | ±4,166 ns   | 228 ns    |
| InsertRealisticNoFlush | Lock    | 422 ns/op   | ±684 ns     | 38 ns     |
| InsertRealistic        | ConcDic | 7,934 ns/op | ±1,538 ns   | 84 ns     |
| InsertRealisticNoFlush | ConcDic | 33 ns/op    | ±15 ns      | 0.8 ns    |

Without fsync, ConcurrentDictionary appears substantially faster than lock at 32 threads (33 ns vs 422 ns, ~12.8×) — but note Lock’s Error (±684 ns) exceeds its Mean (422 ns), so the exact ratio is approximate at this sample size. The direction is consistent with the Narrow scenario: lock contention dominates when I/O is removed. With fsync, both backends are crushed to ~0.12M ops/sec. The ~7,900 ns that fsync adds per insert is the same for both backends, and it drowns everything else. The optimization that the Narrow benchmark promised at 49.3M ops/sec is unresolved at this sample size under production I/O — the CIs overlap completely.

Three benchmarks. Three verdicts.

Lazy (flat) Narrow (inflated) Realistic (fsync)
Dictionary + lock 27.2M ops/sec 2.8M ops/sec 120K ops/sec
ConcurrentDictionary 12.3M ops/sec 49.3M ops/sec 126K ops/sec
ConcDic / Lock 0.45× ~17× ~1.05×
Verdict Revert — lock is ~2× Ship — ConcDic is ~17× Irrelevant — both ≈124K

M ops/sec → K ops/sec. Korzybski (1933): the map is not the territory. Three benchmarks, three maps — each accurate within its own borders, each blind beyond them.

Run it yourself

git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/why-benchmarks-lie

# All benchmarks (Lazy, Narrow, Realistic)
dotnet run -c Release

# Two benchmark classes — WhyBenchmarksLieSequentialBenchmark (Lazy)
# and WhyBenchmarksLieBenchmark (Narrow, Realistic, NoFlush)
dotnet run -c Release -- --filter '*WhyBenchmarksLie*'

The direction reproduces; the exact magnitudes don’t.

Benchmark environment

Component Details
CPU 2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)
L3 Cache 30 MB per socket
RAM ~115 GB DDR3-1866 (quad-channel per socket)
OS Fedora Linux 42 (kernel 6.17)
Runtime .NET 9.0.11 (RyuJIT AVX)
SDK .NET SDK 10.0.102
BenchmarkDotNet v0.14.0
Job ShortRun (WarmupCount=3, IterationCount=3)
GC Server GC, Concurrent (BDN default for benchmark processes)
WAL storage btrfs (/var/tmp, not tmpfs) — fsync hits a real filesystem
NVMe Samsung SSD 970 EVO Plus 1TB (consumer, no PLP)

Limitations: Dual-socket NUMA — no process or thread affinity pinning, so thread migration and cross-socket memory access can widen variance. ShortRun (n=3) provides wide CIs — sufficient for relative comparisons, not for tight absolute numbers.


What to do starting tomorrow

Little’s Law (1961): L = λ × Win-flight requests = throughput × latency.

Throughput without latency is half the story:

Throughput Latency In-flight (L = λ × W) What it looks like
100K ops/sec 10 μs ~1 Single-threaded, no queuing
100K ops/sec 1 ms 100 Moderate concurrency
100K ops/sec 10 ms 1,000 Heavy concurrency, queuing

Same throughput. Three different systems. Same throughput can hide wildly different p99 — queueing shows up in the tails first. Reporting throughput without a latency distribution is a hospital bragging about patient volume and omitting the mortality rate.

Red flags — any of these and the numbers are suspect:

Red flag Why it matters
Sequential keys only Perfect branch prediction, zero hash collisions, sequential cache access
Single thread No lock contention, no false sharing, no NUMA effects
Data < L3 cache Measures cache speed, not the algorithm
No durability (no fsync) fsync cost spans orders of magnitude across hardware classes6
Fixed-size values No memory fragmentation, no variable-length encoding overhead
Clean state every run No WAL growth, no fragmentation, no background compaction

Each row can shift results by an order of magnitude. They don’t stack linearly — some overlap, some cancel — but they compound.

1. Compare two things. A number alone is noise.

[Benchmark(Baseline = true)]
public byte[] SerializeJson() => JsonSerializer.SerializeToUtf8Bytes(_payload);

[Benchmark]
public byte[] SerializeMessagePack() => MessagePackSerializer.Serialize(_payload);

Baseline = true tells BenchmarkDotNet: this is the reference point. Every other [Benchmark] method gets a Ratio column — how many times faster or slower relative to baseline — plus Error and StdDev so the difference is real, not noise.

“312 ns.” Fast? Slow? No way to tell. “0.43× baseline” — MessagePack is 2.3× faster than JSON. That’s a decision, not a number.

2. Realistic data, realistic scale. CPUs have a hierarchy of memory, each level larger and slower than the last. L1 cache (~32 KB) responds in ~1 ns. L2 (~256 KB) in ~4 ns. L3 (~30 MB, shared across cores) in ~12 ns. Main memory (DRAM) in ~60–100 ns. A benchmark whose dataset fits in L1 measures cache speed, not the algorithm.

[Params] tells BenchmarkDotNet to run the same benchmark at each data size — crossing every boundary in the hierarchy:

[Params(
    1_000,        // 4 KB  → fits in L1
    64_000,       // 256 KB → fits in L2
    8_000_000,    // 32 MB → near L3 boundary (30 MB)
    64_000_000    // 256 MB → spills to DRAM
)]
public int DataSize { get; set; }

If the numbers don’t change across sizes — the algorithm is cache-insensitive. If they cliff at 32 MB — the benchmark was measuring cache, and production (where the working set is 40 GB) will see a very different number.

3. Error bars. BenchmarkDotNet reports Error (half-width of the 99.9% confidence interval) and StdDev (spread of measurements) for every benchmark. Overlapping CIs mean the difference is unresolved. Popper (1934): a benchmark can falsify a hypothesis — “A is not faster than B” — but overlapping confidence intervals can never confirm one. For formal tests, BDN supports Welch’s t-test and Mann-Whitney U.7

| Method | Mean     | Error   | StdDev  |
| V1     | 152.3 ns | ±2.1 ns | 1.98 ns |
| V2     | 148.7 ns | ±3.4 ns | 3.19 ns |

V1 ranges from 150.2 to 154.4 ns. V2 ranges from 145.3 to 152.1 ns. The ranges overlap — the difference is unresolved at this confidence level. Without a formal statistical test, shipping this as a “3% win” risks shipping noise.

4. Check JIT output. .NET compiles code at runtime (Just-In-Time). The JIT compiler is smart — sometimes too smart. It might inline a method (copy its body into the caller, eliminating call overhead), eliminate a loop it can prove does nothing, auto-vectorize scalar operations to use SIMD instructions, or optimize away the very thing being benchmarked because the result is never used.

[DisassemblyDiagnoser(maxDepth: 3)]
public class MyBenchmark { ... }

DisassemblyDiagnoser dumps the actual machine code the JIT produced. If the benchmark method compiled down to three instructions — the JIT optimized away the work. The numbers are real. What they measure is not.

5. Reproducibility. A benchmark result without its environment is anecdote, not data. Minimum context: CPU model and cache sizes, RAM speed and capacity, OS and kernel version, .NET version, GC mode (Workstation vs Server), dataset size + distribution + random seed, and the exact command to run.

The environment is part of the result. Publishing numbers without it is publishing conclusions without evidence.


Benchmark design is architecture. Not metaphorically. Data distribution, concurrency model, durability semantics, warm-up strategy, statistical methodology — each choice shapes what the measurement can see and what it’s blind to.

The Lazy benchmark measured real throughput. Single-threaded sequential inserts into an in-memory structure — that’s exactly what it tested. The Narrow benchmark promised 49.3M ops/sec — production delivered 126K. On this hardware, on this workload — 391× between the map and the territory. Under production I/O, ConcurrentDictionary and Dictionary + lock converge — the CIs overlap completely. Three benchmarks, three maps. None of them the territory.

Without measurement architecture, the right optimization looks like a regression, the wrong one looks like progress, and every decision downstream inherits the distortion.


Further reading


  1. .NET ConcurrentDictionary uses conceptually striped/fine-grained locking for writes and lock-free reads on common paths like TryGetValue. The exact internal structure may evolve across .NET versions, but the design principle — partition contention across multiple locks instead of one — is stable. Source, docs↩︎

  2. Cooper et al., Benchmarking Cloud Serving Systems with YCSB, SoCC 2010. Core workloads (A/B/C) use Zipfian record selection — a small fraction of keys receives a disproportionate share of requests. ↩︎

  3. Mohan et al., ARIES: A Transaction Recovery Method, ACM TODS 1992. The WAL protocol behind PostgreSQL, MySQL, SQLite, RocksDB. ↩︎

  4. Drepper, What Every Programmer Should Know About Memory, 2007, §3.3 and §6.2. Sequential access triggers hardware prefetch — the CPU loads cache lines before code asks for them. Random access falls back to full DRAM latency (~60–100 ns). ↩︎

  5. United States EPA, Notice of Violation, Sep 18 2015 — lists the defeat device inputs: steering wheel position, vehicle speed, engine duration, and barometric pressure (p. 2). Real-world NOx emissions: up to 40× the US legal limit. Overview: EPA VW violations page↩︎

  6. fsync latency spans orders of magnitude depending on hardware class (enterprise NVMe with PLP vs consumer SSD vs spinning disk), firmware, filesystem, and write pattern. The spread between device classes alone can be 100×. Group commit amortizes single-fsync cost across many transactions — standard practice in PostgreSQL, MySQL, and most OLTP databases. See Pillai et al., All File Systems Are Not Created Equal, OSDI 2014, for how filesystem and protocol choices further multiply the difference. ↩︎

  7. CI overlap is a conservative heuristic: if 99.9% CIs overlap, the difference is likely noise. But non-overlapping CIs don’t guarantee practical significance — a full treatment requires effect size and formal statistical tests (Welch’s t-test, Mann-Whitney U). ↩︎