p99 = 1 ms — flip one switch — p99 = 195 ms
Same service. Same pause pattern. Same nominal target rate. One change in the client model — p99 jumps 182×. Not a system failure. A measurement failure.
Design can lie. The environment can lie. Fix both — the benchmark looks solid, the percentiles look clean. Too clean. The measurement method itself can lie — a systematic omission baked into how the test collects data.
All code in this post: clone, build, run. Numbers below were measured on dual Xeon E5-2697 v2 — run the companion code on your hardware for your own results. Different hardware, different numbers — that’s half the lesson.
Convention: charts use milliseconds; tables reproduce raw simulation output. Histograms are approximate visualizations of the recorded latency distribution — the percentile tables are the authoritative data.
Send, wait, measure, repeat
```csharp
public static LatencyReport Run(SimulatedService service, int ratePerSec, int durationSec)
{
    // No pacing: ratePerSec only fixes how many requests are sent in total.
    int totalRequests = ratePerSec * durationSec;
    var recorder = new LatencyRecorder();

    for (int i = 0; i < totalRequests; i++)
    {
        long start = Stopwatch.GetTimestamp();
        service.Process();                               // blocks until the response arrives
        long elapsed = Stopwatch.GetTimestamp() - start;
        recorder.Record(elapsed);
    }

    return recorder.GetReport();
}
```

Closed-loop client (full source in the companion code).
Send a request. Wait for the response. Measure the elapsed time. Send the next one. The client and the service take turns: a lockstep conversation where neither moves without the other. This pattern has a name: closed-loop [1]. Most load test frameworks default to it. Most dashboards assume it.
What does your test do when the system slows down?
The comfortable picture
The system under test: a simulated service with ~1 ms baseline latency (calibrated SpinWait) and a 200 ms pause every 500th request — modeling GC, compaction, or any periodic maintenance event. Target rate: 450 req/sec over 30 seconds (13,500 total). Average service time: (499 × 1 ms + 1 × 200 ms) / 500 = 1.4 ms. At 450 req/sec the service needs 630 ms of work per second — ~63% utilization, with headroom to spare. The pauses are the problem, not the capacity.
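The capacity arithmetic above is easy to check mechanically. A minimal sketch, assuming the service model exactly as described (1 ms baseline, one 200 ms pause per 500 requests). `ServiceModel` and its method names are illustrative, not part of the companion code.

```csharp
using System;

// Hypothetical helper, not from the companion code: reproduces the
// mean-service-time and utilization arithmetic for a periodic-pause service.
public static class ServiceModel
{
    // Mean service time in ms when every `pauseEvery`-th request costs
    // `pauseMs` and all other requests cost `baselineMs`.
    public static double MeanServiceMs(double baselineMs, double pauseMs, int pauseEvery) =>
        ((pauseEvery - 1) * baselineMs + pauseMs) / pauseEvery;

    // Fraction of each second the service spends busy at the target rate.
    public static double Utilization(int ratePerSec, double meanServiceMs) =>
        ratePerSec * meanServiceMs / 1000.0;
}
```

With the post's parameters, `MeanServiceMs(1, 200, 500)` gives 1.398 ms and `Utilization(450, 1.398)` gives about 0.63, the ~63% quoted above.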
The closed-loop client has no rate limiter, no inter-request delay — totalRequests is just a count (rate × duration) to match the open-loop’s output volume. The effective rate is whatever the service delivers. During normal processing (~1 ms per request), well above 450 req/sec. During a 200 ms pause: zero. The arrival rate follows the system. When the system slows, the test slows with it.
| Metric | Closed-loop |
|--------|-------------:|
| Count | 13,500 |
| p50 | 1.00 ms |
| p90 | 1.00 ms |
| p99 | 1.07 ms |
| p99.9 | 200.15 ms |
| max | 200.28 ms |

The dashboard looks clean. 99th percentile: 1 ms. Only p99.9 shows any trouble, and that’s 27 requests out of 13,500, the ones that directly hit a pause. Every other request: ~1 ms, tight distribution, no tail. You read the numbers and move on.
The dashboard maps what the test recorded — not what users experienced.
Hume (1739): no finite set of observations guarantees the next. A thousand closed-loop measurements say p99 = 1 ms. The thousand-and-first doesn’t have to agree. Induction from data that systematically omits the worst moments is induction from a sample that excludes its own counterexamples.
Flip one switch
Same service. Same pause injector. Same nominal target rate. One change: the client sends on a fixed schedule, regardless of whether the previous request came back.
```csharp
public static LatencyReport Run(SimulatedService service, int ratePerSec, int durationSec)
{
    var recorder = new LatencyRecorder();
    long intervalTicks = Stopwatch.Frequency / ratePerSec;
    long deadline = Stopwatch.GetTimestamp() + (long)durationSec * Stopwatch.Frequency;
    long nextSend = Stopwatch.GetTimestamp();

    while (Stopwatch.GetTimestamp() < deadline)
    {
        long intendedStart = nextSend;            // the slot the schedule assigned
        nextSend += intervalTicks;

        service.Process();

        long now = Stopwatch.GetTimestamp();
        long latency = now - intendedStart;       // ← intended, not actual
        recorder.Record(latency);

        // Idle until the next slot; returns immediately when behind schedule.
        while (Stopwatch.GetTimestamp() < nextSend)
            Thread.SpinWait(10);
    }

    return recorder.GetReport();
}
```

Open-loop client (full source in the companion code). Note: intervalTicks uses integer division, introducing sub-microsecond step quantization at 450 req/sec, which is negligible for this demonstration.
The line that matters: `now - intendedStart` instead of `now - actualStart`. The user’s clock starts when they click, not when the server gets around to processing their request. When the service pauses, requests that should have been sent during the pause pile up, and each is measured from when it was supposed to start, because that’s when the user started waiting.
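To see both clocks side by side, the loop can be replayed on a virtual clock, with no Stopwatch and no real waiting. This is a sketch under simplifying assumptions, not the companion implementation: service time is exactly 1,000 µs, every 500th request costs exactly 200,000 µs, and `VirtualOpenLoop`, `Sample`, and `Replay` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical virtual-clock replay of the open-loop client (times in microseconds).
// For every request it records latency from the actual start and from the intended start.
public static class VirtualOpenLoop
{
    public readonly record struct Sample(long FromActualStart, long FromIntendedStart);

    public static List<Sample> Replay(int ratePerSec, int durationSec,
        long baselineUs = 1_000, long pauseUs = 200_000, int pauseEvery = 500)
    {
        var samples = new List<Sample>();
        long intervalUs = 1_000_000 / ratePerSec;   // integer division, as in the real client
        long deadline = durationSec * 1_000_000L;
        long now = 0, nextSend = 0;
        int sent = 0;

        while (now < deadline)
        {
            long intendedStart = nextSend;          // when the schedule says "go"
            nextSend += intervalUs;
            long actualStart = now;                 // when the client really got to it

            sent++;
            now += (sent % pauseEvery == 0) ? pauseUs : baselineUs;  // simulated Process()

            samples.Add(new Sample(now - actualStart, now - intendedStart));

            if (now < nextSend) now = nextSend;     // idle until the next slot; never skip one
        }
        return samples;
    }
}
```

With `Replay(450, 30)`, the `FromActualStart` column exceeds 1 ms only for the requests that hit a pause directly, while `FromIntendedStart` also carries the queuing delay of every request scheduled behind one.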
Bimodal. A peak at ~1 ms and a wide spread from 50–200 ms. Two different experiences on the same chart.
| Metric | Closed-loop | Open-loop | Ratio |
|--------|-------------:|-------------:|----------:|
| Count | 13,500 | 13,500 | |
| p50 | 1.00 ms | 1.00 ms | 1.0x |
| p90 | 1.00 ms | 137.89 ms | 137.9x |
| p99 | 1.07 ms | 194.64 ms | 182.4x |
| p99.9 | 200.15 ms | 200.15 ms | 1.0x |
| max | 200.28 ms | 200.41 ms | 1.0x |

Ratios computed from raw data before rounding to displayed precision.
Same system. Same load. Same pause. One variable: whether the test waits for a response before sending the next request.
Closed-loop p99 = 1 ms. Open-loop p99 = 195 ms. 182× on this workload.
The mechanism — coordinated omission
During a 200 ms pause, the closed-loop client waits. While waiting, it sends no new requests: it moves in lockstep with the system, slowing down exactly when the system slows down. 200 ms × 450 req/sec = 90 requests that should have been sent but weren’t. They don’t appear in the histogram. They don’t exist in the data. The dashboard stays clean.
The open-loop client doesn’t coordinate. It tracks what the schedule should have been. After the pause resolves:
- Request N+1: intended at T+2 ms, completed at T+201 ms → latency = 199 ms
- Request N+2: intended at T+4 ms, completed at T+202 ms → latency = 198 ms
- Request N+3: intended at T+7 ms, completed at T+203 ms → latency = 196 ms
- …catch-up continues for ~160 requests until the schedule recovers
Each pause contaminates ~160 subsequent requests with elevated latency. 27 pauses × ~160 requests = ~4,300 requests — roughly a third of all traffic — experiencing latency between 2 ms and 200 ms. That’s why the open-loop p90 is 138 ms: the top 10% of requests (1,350 out of 13,500) fall squarely in that contaminated range.
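Both the ~160 and the 138 ms fall out of a back-of-the-envelope model. A sketch, assuming a constant 1 ms service time after the pause and queued latencies that decay linearly from the pause length down to the baseline. `CatchUpModel` and its members are illustrative names, not part of the companion code.

```csharp
using System;

// Hypothetical closed-form model of the catch-up phase after a pause.
public static class CatchUpModel
{
    // Requests needed to drain the backlog: each request gains
    // (interval - service) ms on the schedule.
    public static double CatchUpRequests(double pauseMs, int ratePerSec, double serviceMs)
    {
        double intervalMs = 1000.0 / ratePerSec;
        if (serviceMs >= intervalMs)
            throw new ArgumentException("Service time must be below the send interval, or the backlog never drains.");
        return pauseMs / (intervalMs - serviceMs);
    }

    // Approximate percentile of intended-start latency, assuming the affected
    // fraction of requests is spread evenly between 0 and pauseMs.
    public static double ApproxPercentileMs(double percentile, double pauseMs,
        double affectedFraction, double serviceMs)
    {
        double above = 1.0 - percentile / 100.0;          // share of requests slower than this percentile
        if (above >= affectedFraction) return serviceMs;  // percentile falls in the unaffected bulk
        return pauseMs * (1.0 - above / affectedFraction);
    }
}
```

For 200 ms pauses at 450 req/sec this gives roughly 164 requests per pause, about a third of the run across 27 pauses, and p90 and p99 estimates of about 139 ms and 194 ms, in line with the measured 137.89 ms and 194.64 ms.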
The closed-loop client sees 27 bad requests. The open-loop client sees 4,300. Same service. Same pauses.
The worse the failure, the more requests the closed-loop client skips, and the cleaner the dashboard looks. What the test records shrinks as the problem grows. At 450 req/sec, a 200 ms pause omits 90 measurements. A 2-second pause omits 900. A 10-second stop-the-world GC omits 4,500. The worst event your system can produce is the one your test is least likely to record.
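The scaling is plain multiplication, shown here as a one-liner for plugging in your own pause lengths and rates. `OmissionModel` is a hypothetical name used only for illustration.

```csharp
using System;

// Hypothetical helper: how many scheduled requests a blocked
// closed-loop client never sends during a pause.
public static class OmissionModel
{
    public static long OmittedRequests(TimeSpan pause, int ratePerSec) =>
        (long)Math.Round(pause.TotalSeconds * ratePerSec);
}
```

`OmittedRequests(TimeSpan.FromMilliseconds(200), 450)` returns 90; 2 seconds returns 900 and 10 seconds returns 4,500.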
Gil Tene named this Coordinated Omission: the test coordinates with the system’s failures, omitting measurements precisely when they would be most damning [2].
Baudrillard (1981): the third phase of the simulacrum — the image masks the absence of reality. The closed-loop benchmark doesn’t distort measurements. It masks their nonexistence. Those 90 requests during the pause aren’t poorly measured. They don’t exist. The dashboard is a simulacrum — it doesn’t lie about the system. It replaces it.
How to stop coordinating
| Property | Closed-loop | Open-loop |
|---|---|---|
| Request timing | After previous response | Fixed schedule, independent of response |
| What it measures | Response time of sent requests (omits unsent) | Response time from intended start (incl. queuing) |
| During a pause | Stops sending → omits measurements | Tracks intended schedule → captures queuing |
| p99 under pauses | Looks clean (only direct hits visible) | Shows full impact (queued requests visible) |
| Best for | Throughput measurement, saturation testing | Latency measurement, SLA validation |
Four rules for latency measurement:
1. Open-loop by default for latency load tests. Closed-loop is still useful for throughput and saturation testing, that is, finding the breaking point. But if your SLAs are latency percentiles, you need open-loop. Closed-loop tells you the system can handle the load; open-loop tells you what users experience while it does [1].
2. Measure from intended time, not actual time. `latency = now - intendedStart`, not `now - actualStart`. The user’s clock starts when they click, not when the server gets around to reading their request.
3. Record the full tail. p50 and p99 are not enough. Report p99.9 and max. Coordinated omission hides in the gap between p99 and p99.9, the range where closed-loop sees nothing and open-loop sees the damage.
4. Use histograms that can handle it. HdrHistogram [3] records values across a wide dynamic range with configurable precision, from sub-millisecond to multi-second latencies in the same histogram. Fixed-bucket histograms clip the tail.
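The third rule (record the full tail) can be sketched as a report that always includes p99.9 and max. It uses the nearest-rank method over a sorted in-memory array, which is fine for a few thousand samples. It is a stand-in for illustration, not how HdrHistogram or the companion `LatencyRecorder` compute percentiles, and `TailReport` is a hypothetical name.

```csharp
using System;

// Hypothetical tail-inclusive summary over raw latency samples (ms).
public static class TailReport
{
    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    public static double Percentile(double[] sortedAscending, double percentile)
    {
        if (sortedAscending.Length == 0) throw new ArgumentException("No samples recorded.");
        // The small epsilon guards against floating-point overshoot (e.g. 999.0000000000001).
        int rank = (int)Math.Ceiling(percentile / 100.0 * sortedAscending.Length - 1e-9);
        return sortedAscending[Math.Clamp(rank, 1, sortedAscending.Length) - 1];
    }

    public static (int Count, double P50, double P90, double P99, double P999, double Max)
        Summarize(double[] latenciesMs)
    {
        var sorted = (double[])latenciesMs.Clone();   // leave the caller's array untouched
        Array.Sort(sorted);
        return (sorted.Length,
                Percentile(sorted, 50),
                Percentile(sorted, 90),
                Percentile(sorted, 99),
                Percentile(sorted, 99.9),
                sorted[^1]);
    }
}
```

Fed 13,473 samples of 1 ms and 27 of 200 ms, the closed-loop shape from the first table, `P99` stays at 1 ms while `P999` and `Max` report 200 ms.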
Tools that get it right
| Tool | Open-loop | CO correction | Notes |
|---|---|---|---|
| wrk2 [4] | Yes | Built-in | Constant-rate HTTP benchmark, HdrHistogram output |
| Gatling | Yes | Configurable | Open-loop mode available, reports percentiles |
| k6 | Partial | Manual | Constant-rate via scenarios, no auto-correction |
| Custom (this post) | Yes | By design | intendedStart tracking, HdrHistogram.NET |
Capabilities and defaults vary by tool version and configuration; verify settings in your release.
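What “CO correction” means for data that was already collected closed-loop fits in a few lines: when a recorded latency exceeds the expected send interval, backfill the samples a constant-rate sender would have produced in the meantime. HdrHistogram offers this as a record-with-expected-interval call. The version below is a simplified stand-in over a plain list, and `CoCorrection` is a hypothetical name.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of expected-interval backfilling (times in microseconds).
public static class CoCorrection
{
    // Records `valueUs`, then adds the samples that requests scheduled during
    // the stall would have seen: value - interval, value - 2 * interval, ...
    // for as long as the result is still at least one interval.
    public static void RecordWithExpectedInterval(List<long> samples, long valueUs, long expectedIntervalUs)
    {
        samples.Add(valueUs);
        if (expectedIntervalUs <= 0) return;   // no schedule known, nothing to backfill

        for (long missing = valueUs - expectedIntervalUs;
             missing >= expectedIntervalUs;
             missing -= expectedIntervalUs)
        {
            samples.Add(missing);
        }
    }
}
```

A single 200,000 µs sample with a 2,222 µs expected interval expands to 90 samples, the same 90 requests the closed-loop client never sent. The backfill is an estimate built on the assumption of a steady schedule. Measuring open-loop in the first place avoids the guess.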
Run it yourself
```shell
git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/coordinated-omission
dotnet run -c Release
```

Benchmark environment
| Component | Value |
|---|---|
| CPU | 2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads) |
| RAM | ~115 GB DDR3-1866 (quad-channel per socket) |
| OS | Fedora Linux 42 (kernel 6.17) |
| Runtime | .NET 9.0.11 (RyuJIT AVX) |
| SDK | .NET SDK 10.0.102 |
| HdrHistogram | HdrHistogram.NET 2.5.0 |
| Simulation | 450 req/sec, 30 sec, 200 ms pause every 500 requests |
Not BenchmarkDotNet — this is a custom in-process simulation. SpinWait calibrated at startup for ~1 ms baseline on current hardware (binary search, 50 samples, median). Fresh SimulatedService instance per client — no counter contamination.
Limitations: In-process simulation — no HTTP, no network stack, no kernel-level queuing. The open-loop client is single-threaded and blocks on Process(), so it tracks the intended schedule rather than dispatching concurrently (a real open-loop system like wrk2 or Gatling sends requests asynchronously). These simplifications isolate the coordinated omission mechanism from transport noise — the measurement effect is the same, but absolute numbers would differ in a networked setup.
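For contrast with the single-threaded client, here is a rough sketch of what asynchronous dispatch could look like: requests start on schedule whether or not earlier ones have returned, and each is still measured from its intended start. This is not taken from wrk2, Gatling, or the companion code. `AsyncOpenLoop` and the `send` delegate are hypothetical, and the sketch assumes `send` returns promptly and completes asynchronously.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical concurrent open-loop dispatcher; latencies are in Stopwatch ticks.
public static class AsyncOpenLoop
{
    public static async Task<long[]> RunAsync(Func<Task> send, int ratePerSec, int totalRequests)
    {
        long start = Stopwatch.GetTimestamp();
        var inFlight = new Task<long>[totalRequests];

        for (int i = 0; i < totalRequests; i++)
        {
            // Slot i is fixed by the schedule alone, never by earlier responses.
            long intendedStart = start + i * Stopwatch.Frequency / ratePerSec;
            while (Stopwatch.GetTimestamp() < intendedStart)
                Thread.SpinWait(10);

            inFlight[i] = MeasureAsync(send, intendedStart);   // do not await here
        }

        return await Task.WhenAll(inFlight);
    }

    private static async Task<long> MeasureAsync(Func<Task> send, long intendedStart)
    {
        await send().ConfigureAwait(false);
        return Stopwatch.GetTimestamp() - intendedStart;       // intended, not actual
    }
}
```

If `send` blocks synchronously before its first await, the dispatch loop stalls with it and the client quietly becomes closed-loop again.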
Popper (1934): a meaningful test must be capable of producing a negative result. The closed-loop client cannot falsify the hypothesis “the system is healthy” — it hides the counterexamples. Measurements that would disprove it don’t exist. Open-loop is the falsification instrument: it doesn’t ask the system whether it’s ready. It measures regardless.
Each layer of deception sits closer to you. Design — visible in the code. Environment — visible in the configuration. The method of collection — buried in an assumption you never questioned. Data collected correctly. But what do the data mean?
A metric that looks better the worse the system performs isn’t a metric. It’s anesthesia.
Further reading
- Gil Tene, How NOT to Measure Latency (Strange Loop 2015): the definitive talk on coordinated omission, open vs closed loop, and why percentile measurements lie. [2]
- Gil Tene, How NOT to Measure Latency (QCon San Francisco 2015): recorded version of the talk, more on why averages and even p99 are insufficient without the full distribution. [5]
- Schroeder, Wierman, Harchol-Balter, Open Versus Closed: A Cautionary Tale (NSDI 2006): the formal paper showing that open-loop and closed-loop produce fundamentally different results. [1]
- Dean & Barroso, The Tail at Scale (CACM 2013): why tail latency matters in distributed systems, fan-out amplification. [6]
- Ousterhout, Always Measure One Level Deeper (CACM 2018): the general principle of measuring the layer below where you think the problem is. [7]
- HdrHistogram: high dynamic range histogram for latency recording, with coordinated omission correction. Ports: Java, C#, C, Go, Rust, JavaScript, Python, Erlang. [3]
- Gil Tene, wrk2: constant-rate HTTP benchmark with built-in coordinated omission correction and HdrHistogram output. [4]
- Brendan Gregg, Active Benchmarking: methodology and anti-patterns for honest measurement. [8]
- Martin Thompson, Mechanical Sympathy: latency-focused systems programming, false sharing, memory access patterns. [9]
- Andrey Akinshin, Pro .NET Benchmarking (Apress, 2019): comprehensive guide to .NET measurement, including percentile pitfalls. [10]

1. Schroeder, Wierman, Harchol-Balter, Open Versus Closed: A Cautionary Tale, NSDI 2006. The formal demonstration that open-loop and closed-loop benchmarks produce fundamentally different performance characteristics, even on the same system under the same nominal load.
2. Gil Tene, How NOT to Measure Latency, Strange Loop 2015. Defines coordinated omission, demonstrates the mechanism, introduces HdrHistogram. The single most important talk on latency measurement.
3. HdrHistogram by Gil Tene. Records values across a configurable dynamic range (e.g., 1 microsecond to 1 hour) with uniform precision at any percentile level. .NET port: HdrHistogram.NET on NuGet.
4. Gil Tene, wrk2. A fork of wrk that maintains a constant request rate (open-loop) and records latency from intended send time. The output includes full HdrHistogram percentile data, with no coordinated omission by construction.
5. Gil Tene, How NOT to Measure Latency, QCon San Francisco 2015. Why the mean is useless, why p99 isn’t enough, why you need the full distribution.
6. Dean & Barroso, The Tail at Scale, CACM 2013. In a fan-out architecture, the probability of hitting at least one slow backend grows with the number of backends. Tail latency isn’t a statistics curiosity; it’s the dominant user experience at scale.
7. Ousterhout, Always Measure One Level Deeper, CACM 2018. The general principle: if the numbers don’t make sense, measure the layer below. Coordinated omission is a measurement-layer problem: you have to look at how the test records latency, not just what it reports.
8. Brendan Gregg, Active Benchmarking. Methodology for honest benchmarking: verify work done, eliminate perturbation, report confidence. Includes a section on coordinated omission as a common anti-pattern.
9. Martin Thompson, Mechanical Sympathy. Blog series on latency-sensitive systems programming: false sharing, memory access patterns, lock-free data structures. Context for understanding why sub-millisecond measurement matters.
10. Andrey Akinshin, Pro .NET Benchmarking (Apress, 2019). The BenchmarkDotNet author’s comprehensive treatment of measurement in .NET: warmup, outliers, statistics, environment control, percentile reporting.