<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Perf on 0x3F</title>
    <link>https://0x3f.blog/tags/perf/</link>
    <description>Recent content in Perf on 0x3F</description>
    <generator>Hugo -- 0.152.2</generator>
    <language>en-us</language>
    <lastBuildDate>Tue, 10 Mar 2026 19:00:00 +0100</lastBuildDate>
    <atom:link href="https://0x3f.blog/tags/perf/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>First Things First: Hardware Counters</title>
      <link>https://0x3f.blog/posts/first-things-first-hardware-counters/</link>
      <pubDate>Tue, 10 Mar 2026 19:00:00 +0100</pubDate>
      <guid>https://0x3f.blog/posts/first-things-first-hardware-counters/</guid>
      <description>12.9× slower — BDN tells you that. But it cannot tell you why, or predict that the ratio will jump when the dataset outgrows the cache. Hardware counters can. Time is a result, not a cause.</description>
      <content:encoded><![CDATA[<h2 id="129-slower--and-thats-the-easy-part">12.9× slower — and that&rsquo;s the easy part</h2>
<p>Two loops over the same array. Same data. Same sum operation. One walks the array sequentially; the other uses a random permutation for indirection. BenchmarkDotNet (BDN) says SumRandom is 12.88× slower at one million elements. No surprise: random memory access is slower. Everyone knows that.</p>
<p>But <em>how much slower will it get</em> when the dataset grows 64×?</p>
<p>BDN measures time. Time compresses everything the CPU did — cache behavior, prefetch, pipeline stalls, memory latency — into a single scalar. It answers <em>how much</em>. It cannot answer <em>why</em>. And without <em>why</em>, the next question — <em>what happens when conditions change</em> — is a guess.</p>
<p>The first four posts taught doubt. Design lies through omission. Environment masks distortion. Data collection coordinates with failure. Interpretation drifts from evidence. Each layer peeled back a way the measurement could mislead, and each time the tools were doing it <em>to you</em> while appearing to work <em>with you</em>.</p>
<p>This post goes somewhere different.</p>
<p>All code in this post: clone, build, run. Numbers below were measured on dual Xeon E5-2697 v2 using BenchmarkDotNet v0.14.0. <em>Charts use milliseconds unless otherwise noted; tables reproduce BenchmarkDotNet output. BDN&rsquo;s Error column is the half-width of the 99.9% confidence interval.</em></p>
<hr>
<h2 id="the-setup--two-paths-same-operation">The setup — two paths, same operation</h2>
<div class="highlight"><pre data-lang="csharp"><code>[Benchmark(Baseline = true)]
public long SumSequential()
{
    long sum = 0;
    long[] data = _data;
    for (int i = 0; i &lt; data.Length; i&#43;&#43;)
        sum &#43;= data[i];
    return sum;
}

[Benchmark]
public long SumRandom()
{
    long sum = 0;
    long[] data = _data;
    int[] indices = _indices;
    for (int i = 0; i &lt; indices.Length; i&#43;&#43;)
        sum &#43;= data[indices[i]];
    return sum;
}</code></pre></div>
<p><small>Sequential vs random access over <code>long[]</code>. <code>_indices</code> is a Fisher-Yates shuffle of 0..N-1 — same elements, different order. Full source in companion code.</small></p>
<p>Both methods compute the same sum. Both touch every element exactly once. The only difference: the order of access. Sequential walks the array from start to end. Random jumps through a pre-shuffled index array.</p>
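<p>For readers without the companion code at hand, here is a minimal sketch of how such an index array could be built. It assumes a seeded <code>System.Random</code>; the class name <code>ShuffledIndices</code>, the method <code>Create</code>, and the <code>seed</code> parameter are illustrative and not taken from the post&rsquo;s source.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

public static class ShuffledIndices
{
    // Hypothetical helper: returns a random permutation of 0..n-1.
    public static int[] Create(int n, int seed)
    {
        var rng = new Random(seed);
        var indices = new int[n];
        for (int i = 0; i &lt; n; i++)
            indices[i] = i;

        // Fisher-Yates: swap slot i with a uniformly chosen slot j in [0, i].
        for (int i = n - 1; i &gt; 0; i--)
        {
            int j = rng.Next(i + 1); // upper bound is exclusive, so j can equal i
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }
        return indices;
    }
}</code></pre></div>
<p><small>Calling something like <code>ShuffledIndices.Create(N, seed: 42)</code> from a <code>[GlobalSetup]</code> method would keep the shuffle out of the measured region, and a fixed seed keeps the access order identical across runs.</small></p>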
<p>At one million elements (8 MB of <code>long[]</code> — plus 4 MB of <code>int[]</code> indices for the random variant — both fit comfortably in the 30 MB L3 cache on Ivy Bridge-EP):</p>
<div class="highlight"><pre data-lang=""><code>| Method        | N       | Mean       | Error    | StdDev   | Ratio | RatioSD |
|-------------- |-------- |-----------:|---------:|---------:|------:|--------:|
| SumSequential | 1000000 |   561.0 us |  1.85 us |  1.44 us |  1.00 |    0.00 |
| SumRandom     | 1000000 | 7,223.9 us | 12.74 us | 10.63 us | 12.88 |    0.04 |</code></pre></div>
<p>12.88× slower. The confidence intervals don&rsquo;t overlap. The difference is real and large. Random access is slower — water is wet. Ship the sequential version, move on.</p>
<p>BDN told you the <em>what</em>. It didn&rsquo;t tell you the <em>why</em>. And without the <em>why</em>, you can&rsquo;t predict <em>what happens next</em>.</p>
<hr>
<h2 id="level-1--perf-stat-the-vital-signs">Level 1 — perf stat: the vital signs</h2>
<p><code>perf stat</code> reads hardware performance counters — registers built into the CPU that count events like cycles, instructions, cache accesses, and cache misses. No sampling, no code instrumentation, and typically negligible overhead — the CPU increments these counters in hardware, and <code>perf stat</code> reads the registers at process start/stop. When you request more events than the CPU has physical counter registers, <code>perf</code> multiplexes (time-shares) and scales the results, which introduces estimation error — the percentages in the output below reflect this.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
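<p>The scaling step can be sketched in a few lines. When an event is multiplexed, <code>perf</code> extrapolates the raw count by the ratio of the time the event was enabled to the time it was actually counting. The three input values below are hypothetical placeholders chosen for round numbers, not measurements from this machine.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

// Hypothetical readings for one multiplexed event (illustrative, not measured).
long rawCount = 1_000_000;       // events seen while the event held a hardware counter
long timeEnabledNs = 4_000_000;  // how long the event was requested
long timeRunningNs = 1_000_000;  // how long it was actually counting

// Extrapolate the observed rate over the whole enabled window.
double scaled = rawCount * ((double)timeEnabledNs / timeRunningNs);
double coverage = (double)timeRunningNs / timeEnabledNs;

Console.WriteLine($"scaled estimate: {scaled:N0}, counted during {coverage:P0} of the run");</code></pre></div>
<p>The lower the coverage, the more the reported count is an estimate rather than a tally, which is one reason to request fewer events per run when precision matters.</p>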
<div class="highlight"><pre data-lang="bash"><code>perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter &#39;*Sequential*&#39;</code></pre></div>
<p><em>If an event is unsupported on your CPU, <code>perf stat</code> reports <code>&lt;not supported&gt;</code> for that counter; run <code>perf list</code> to see which events are available. The two events that IPC needs, <code>cycles</code> and <code>instructions</code>, are widely available on modern x86 CPUs.</em></p>
<p>Run this for both variants and you get a side-by-side comparison of what the CPU was actually doing:<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<table>
  <thead>
      <tr>
          <th>Counter</th>
          <th>Sequential</th>
          <th>Random</th>
          <th>What it means</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>IPC (instructions/cycle)</td>
          <td>1.54</td>
          <td>0.42</td>
          <td>CPU throughput — how many instructions retire per clock cycle</td>
      </tr>
      <tr>
          <td>L1 data cache miss rate</td>
          <td>11.64%</td>
          <td>24.90%</td>
          <td>Fraction of loads that miss the fastest cache (32 KB, ~4 cycle latency)</td>
      </tr>
      <tr>
          <td>LLC load miss rate</td>
          <td>53.02%*</td>
          <td>30.38%*</td>
          <td>Fraction of last-level cache loads that go to DRAM — <em>inverted due to aggregation; see note</em><sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></td>
      </tr>
      <tr>
          <td>Branch misprediction rate</td>
          <td>0.87%</td>
          <td>2.60%</td>
          <td>Fraction of branches predicted wrong — both are low</td>
      </tr>
  </tbody>
</table>
<p><em>A caveat these numbers have earned: they are aggregated across the full BDN process — warmup, pilot, and actual iterations at all three dataset sizes (1M, 8M, 64M). They diagnose the mechanism (memory-bound vs compute-bound), not behavior at any single N. The IPC gap (1.54 vs 0.42) and the L1 miss rate gap (11.64% vs 24.90%) are directionally stable across aggregation — random access is memory-bound regardless of how you slice the data. The LLC miss rates are less trustworthy: sequential appears worse (53% vs 30%) because it runs ~3× more total iterations, and the 64M dataset dominates its LLC totals — see <sup id="fnref1:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> for details. The prediction in Level 3 rests on the mechanism (memory-bound + L3-dependent), not on the exact percentages.</em></p>
<p>The first four posts taught methodical doubt. What to trust? What to distrust? How deep does the distortion go? Descartes reached for the <em>cogito</em> — the one thing doubt couldn&rsquo;t dissolve. Here the descent through software abstraction reaches something similar. <code>perf stat</code> doesn&rsquo;t measure time. It doesn&rsquo;t measure abstractions. It reads registers that the silicon increments whether anyone is watching or not. The counters exist at the boundary where software models end and physics begins. Doubt doesn&rsquo;t end in nihilism. It ends in firmer ground.</p>
<p><strong>IPC is the headline.</strong> Sequential executes 1.54 instructions per cycle. Random executes 0.42. The CPU is 3.7× more productive on sequential access on this Ivy Bridge-EP — not because it runs different instructions, but because it <em>doesn&rsquo;t stall</em>. The hardware prefetcher detects the sequential stride, fetches cache lines ahead of the loop, and the data is waiting in L1 before the load instruction executes.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<p>Random access defeats the prefetcher. Every load is a surprise. The CPU issues the load, waits 10-40 cycles for L2/L3, and the pipeline stalls. The instructions are the same — the wait is different.</p>
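<p>One way to read the IPC gap is to invert it into cycles per instruction (CPI). The sketch below uses only the two IPC values from the table above; since those are aggregated over the whole BDN process, the result is an order-of-magnitude illustration rather than a per-load latency.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

// IPC values from the perf stat table in this post (whole-process aggregates).
double ipcSequential = 1.54;
double ipcRandom = 0.42;

// Cycles per instruction is the reciprocal of IPC.
double cpiSequential = 1.0 / ipcSequential;
double cpiRandom = 1.0 / ipcRandom;

Console.WriteLine($"IPC gap: {ipcSequential / ipcRandom:F2}x");
Console.WriteLine($"CPI sequential: {cpiSequential:F2}, CPI random: {cpiRandom:F2}");
Console.WriteLine($"Extra cycles per instruction on random: {cpiRandom - cpiSequential:F2}");</code></pre></div>
<p>That works out to roughly 1.7 extra cycles for every retired instruction on the random path. The average hides the distribution: most instructions retire quickly, and a minority of loads absorb the wait.</p>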
<div class="chart-container">
  <canvas id="chart-1983876d39dfa57fd14a677d1487a52b"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-1983876d39dfa57fd14a677d1487a52b').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['IPC\n(insn/cycle)', 'L1 miss rate\n(%)', 'Branch miss rate\n(%)'],
    datasets: [
      {
        label: 'Sequential',
        data: [1.54, 11.64, 0.87],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'Random',
        data: [0.42, 24.90, 2.60],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'perf stat — same code, different microarchitecture behavior' },
      subtitle: { display: true, text: 'IPC 1.54 vs 0.42: random access retires ~3.7× fewer instructions per cycle' },
      legend: { display: true }
    },
    scales: {
      y: {
        title: { display: true, text: 'Value' }
      }
    }
  }
}
);
  })();
</script>

<p>BDN said &ldquo;12.88× slower.&rdquo; The aggregate hardware counters<sup id="fnref1:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> reveal the mechanism: the CPU is stalling on cache misses. The instructions aren&rsquo;t slower — they&rsquo;re <em>waiting</em>. And waiting scales with memory latency, which scales with working set size.</p>
<p>That&rsquo;s the basis for a prediction.</p>
<hr>
<h2 id="level-2--flame-graphs-the-shape-of-time">Level 2 — Flame graphs: the shape of time</h2>
<p>Hardware counters tell you <em>what</em> the CPU is doing — stalling on cache misses, mispredicting branches. Flame graphs tell you <em>where the cost concentrates</em> in the code path.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></p>
<p>A flame graph is a visualization of stack traces sampled by <code>perf record</code>. The x-axis is not time; it is the population of samples. A wider frame means more samples had that function on the stack, which approximates more CPU time spent in it and its callees.</p>
<div class="highlight"><pre data-lang="bash"><code># Record stack traces at 99 Hz (standalone runner, no BDN overhead)
perf record -g -F 99 --call-graph dwarf -- \
    dotnet run -c Release -- perf-sequential 8000000

# Convert to flame graph SVG
perf script | stackcollapse-perf.pl | flamegraph.pl &gt; sequential.svg</code></pre></div>
<p>Sequential — one hot column, tight loop, no stalls wide enough to sample:</p>
<div class="flamegraph-wrap">
  <div id="fg-ca5bd376abffe4f2d28d4b6288df8c15" class="flamegraph-canvas"></div>
</div>
<script>
  (function() {
    fetch('\/flamegraphs\/sequential.json')
      .then(function(r) { return r.json(); })
      .then(function(data) {
        var el = document.getElementById('fg-ca5bd376abffe4f2d28d4b6288df8c15');
        new FlameGraph(el, data, { title: 'SumSequential — 8M elements, 2000 iterations' });
      });
  })();
</script>

<p>Random — wider, flatter. The hot loop is still there, but the sampled stacks spread more broadly around it, consistent with the CPU spending more time waiting on the memory subsystem:</p>
<div class="flamegraph-wrap">
  <div id="fg-b761154c9fcbff5118ff07c2f405d7a2" class="flamegraph-canvas"></div>
</div>
<script>
  (function() {
    fetch('\/flamegraphs\/random.json')
      .then(function(r) { return r.json(); })
      .then(function(data) {
        var el = document.getElementById('fg-b761154c9fcbff5118ff07c2f405d7a2');
        new FlameGraph(el, data, { title: 'SumRandom — 8M elements, 100 iterations' });
      });
  })();
</script>

<p><code>perf stat</code> diagnosed the disease. The flame graph shows the hot path around it. Sequential&rsquo;s samples concentrate in one tight column — the loop body runs, the prefetcher feeds it, the pipeline stays full. Random spreads wider — same loop body, but more of the sampled time accumulates in and around that path while the core waits for data. The structure of time, not just the quantity of it.</p>
<p>Three tools, three levels:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Question</th>
          <th>Answer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BDN</td>
          <td>How much slower?</td>
          <td>12.88× (at 1M)</td>
      </tr>
      <tr>
          <td>perf stat</td>
          <td>Why?</td>
          <td>IPC 0.42 vs 1.54 — cache miss stalls</td>
      </tr>
      <tr>
          <td>Flame graph</td>
          <td>Where?</td>
          <td>The hot path around the inner loop</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="level-3--the-prediction">Level 3 — The prediction</h2>
<p>This is where hardware counters do something BDN cannot. They don&rsquo;t just explain the past. They make the future falsifiable.</p>
<p>At one million elements, sequential walks 8 MB of <code>long[]</code>. Random also loads a 4 MB <code>int[]</code> index array, bringing its working set to 12 MB. The L3 cache on this Xeon E5-2697 v2 is 30 MB — everything fits. Random access is slow because it misses L1 and L2 — but it hits L3. L3 latency is ~30 cycles. Bad, but bounded.</p>
<p>The hypothesis: if random access is memory-bound — the aggregate counters showed IPC 0.42 and L1 miss rate 24.90%, diagnosing the <em>mechanism</em> even though they span all dataset sizes<sup id="fnref2:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> — and the current performance depends on L3 absorbing those misses, then exceeding L3 capacity will force misses to DRAM at ~200 cycles. The ratio should jump.</p>
<p>At 8 million elements, the data array alone is 64 MB. With the 32 MB index array, random&rsquo;s working set reaches 96 MB — well beyond the 30 MB L3. At 64 million elements (512 MB data + 256 MB indices), there&rsquo;s no question. Random access now predominantly misses all cache levels and goes to DRAM.</p>
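<p>The working-set arithmetic above can be checked with a short sketch. It assumes the element sizes used in the benchmark (8-byte <code>long</code>, 4-byte <code>int</code>) and treats the 30 MB L3 as 30 × 1024 × 1024 bytes; the constant name <code>L3Bytes</code> is illustrative.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

const long L3Bytes = 30L * 1024 * 1024; // 30 MB L3 per socket, as stated in the post

foreach (long n in new long[] { 1_000_000, 8_000_000, 64_000_000 })
{
    long dataBytes = n * sizeof(long);      // long[] data, read by both variants
    long indexBytes = n * sizeof(int);      // int[] indices, read by the random variant only
    long randomSet = dataBytes + indexBytes;

    Console.WriteLine(
        $"N={n:N0}: sequential {dataBytes / 1e6:F0} MB, random {randomSet / 1e6:F0} MB, " +
        $"random fits L3: {randomSet &lt;= L3Bytes}");
}</code></pre></div>
<p>Only the 1M case fits. Swapping in your own CPU&rsquo;s last-level cache size shows where the boundary should fall on different hardware.</p>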
<p>This is a falsifiable prediction. Not a statistical extrapolation from benchmark numbers. A deduction from cache architecture, informed by hardware counters that revealed the mechanism. Run it. See what happens.</p>
<div class="highlight"><pre data-lang=""><code>| Method        | N        | Mean           | Error       | StdDev      | Ratio | RatioSD |
|-------------- |--------- |---------------:|------------:|------------:|------:|--------:|
| SumSequential | 1000000  |       561.0 us |     1.85 us |     1.44 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,223.9 us |    12.74 us |    10.63 us | 12.88 |    0.04 |
|               |          |                |             |             |       |         |
| SumSequential | 8000000  |     6,434.3 us |    51.43 us |    48.11 us |  1.00 |    0.01 |
| SumRandom     | 8000000  |   125,635.5 us | 2,319.67 us | 2,056.33 us | 19.53 |    0.34 |
|               |          |                |             |             |       |         |
| SumSequential | 64000000 |    82,974.9 us |   933.52 us |   728.83 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,613,935.5 us | 3,277.61 us | 2,736.96 us | 19.45 |    0.17 |</code></pre></div>
<div class="chart-container">
  <canvas id="chart-f454392f3be961dee1f24f4af208edff"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-f454392f3be961dee1f24f4af208edff').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['1M (8 MB data)\nfits L3', '8M (64 MB data)\nexceeds L3', '64M (512 MB data)\nDRAM only'],
    datasets: [
      {
        label: 'Sequential',
        data: [0.561, 6.434, 82.975],
        backgroundColor: '#89b4fa',
        borderColor: '#89b4fa',
        borderWidth: 1
      },
      {
        label: 'Random',
        data: [7.224, 125.636, 1613.936],
        backgroundColor: '#f38ba8',
        borderColor: '#f38ba8',
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'Sequential vs Random — scaling with dataset size' },
      subtitle: { display: true, text: 'Random: ~13× at 1M → ~20× at 8M (this run). Data array sizes shown; random also loads int[] indices.' },
      legend: { display: true }
    },
    scales: {
      y: {
        type: 'logarithmic',
        title: { display: true, text: 'Time (ms) — log scale' },
        min: 0.1,
        max: 10000
      }
    }
  }
}
);
  })();
</script>

<p><strong>Sequential scales smoothly but super-linearly.</strong> 8× more data yields ~11.5× more time; 64× more data yields ~148× more time.<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup> The extra factor comes from the L3 boundary: at 1M (8 MB), sequential reads hit L3 at ~30 cycle latency. At 8M+ (64 MB+), the prefetcher must pull from DRAM (~200 cycles). Even within the DRAM-resident range (8M to 64M), scaling is ~12.9× for 8× data — still super-linear, likely due to TLB pressure at large working sets. The prefetcher hides most of the latency increase — but not all of it.</p>
<p><strong>Random hits a cliff.</strong> In this run, the ratio jumps from ~13× at 1M to ~19.5× at 8M, about a 50% degradation on this dual-socket NUMA system.<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup> Beyond that it plateaus: 19.45× at 64M, in the same range as 8M rather than snapping back. The cliff falls between 1M and 8M, the interval in which random&rsquo;s working set (12 MB, then 96 MB) crosses the 30 MB L3 boundary. The exact ratios will differ on your hardware; the cliff at the L3 boundary won&rsquo;t.</p>
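<p>The growth factors and ratios quoted above can be recomputed directly from the table. The sketch below copies the mean times from the BenchmarkDotNet output in this post; the tuple field names are illustrative. It divides means, which agrees with BDN&rsquo;s Ratio column here to two decimals but is not necessarily how BDN derives that column.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

// Mean times in microseconds, copied from the BenchmarkDotNet table above.
(long N, double SeqUs, double RndUs)[] rows =
{
    (1_000_000, 561.0, 7_223.9),
    (8_000_000, 6_434.3, 125_635.5),
    (64_000_000, 82_974.9, 1_613_935.5),
};

var baseline = rows[0];
foreach (var row in rows)
{
    double dataGrowth = (double)row.N / baseline.N;   // how much more data than at 1M
    double seqGrowth = row.SeqUs / baseline.SeqUs;    // how much more sequential time than at 1M
    double ratio = row.RndUs / row.SeqUs;             // random vs sequential at this N

    Console.WriteLine(
        $"N={row.N:N0}: data x{dataGrowth:F0}, sequential time x{seqGrowth:F1}, random/sequential {ratio:F2}");
}</code></pre></div>
<p>Printing both growth factors side by side makes the super-linear step easy to spot: wherever time grows faster than data, something other than the instruction count changed.</p>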
<div class="chart-container">
  <canvas id="chart-d4fc857cdb14550dcc03b6ed24e25bad"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-d4fc857cdb14550dcc03b6ed24e25bad').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['1M (8 MB)', '8M (64 MB)', '64M (512 MB)'],
    datasets: [
      {
        label: 'Ratio (Random / Sequential)',
        data: [12.88, 19.53, 19.45],
        backgroundColor: ['#a6e3a1', '#fab387', '#f38ba8'],
        borderColor: ['#a6e3a1', '#fab387', '#f38ba8'],
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'The ratio cliff — where L3 runs out' },
      subtitle: { display: true, text: '~13× → ~20× when working set exceeds 30 MB L3 cache (exact ratios shift with NUMA topology)' },
      legend: { display: false }
    },
    scales: {
      y: {
        title: { display: true, text: 'Ratio (Random / Sequential)' },
        min: 0,
        max: 25
      }
    }
  }
}
);
  })();
</script>

<p>The prediction survived the test — on this hardware, on this run.</p>
<p>Popper, <em>Logik der Forschung</em> (1934)<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup>: a falsifiable prediction distinguishes science from storytelling. &ldquo;Random access is memory-bound (low IPC, high L1 miss rate). The working set fits L3 at 1M. At 8M, it won&rsquo;t. The ratio will jump.&rdquo; Run it. In this run, the ratio jumps from ~13× to ~19.5×. The exact numbers are unstable — dual-socket NUMA, thread migration, prefetcher heuristics all shift them.<sup id="fnref1:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup> The mechanism isn&rsquo;t. The theory wasn&rsquo;t adjusted after the fact. It was stated before the data, derived from the cache hierarchy, and the data confirmed the <em>shape</em> — a cliff at the L3 boundary. That&rsquo;s not extrapolation from a benchmark number. That&rsquo;s deduction from architecture.</p>
<p>Through the first four posts, every tool revealed its distortion after the damage was done. Post-factum. Reactive. Hardware counters are the first tool in this series that generates a falsifiable hypothesis <em>before</em> the benchmark runs. Not a better explanation of the past — a testable claim about the future.</p>
<hr>
<h2 id="numa--where-the-numbers-shift-and-the-shape-doesnt">NUMA — where the numbers shift and the shape doesn&rsquo;t</h2>
<p>This machine has two sockets. Two Xeon E5-2697 v2, each with its own 30 MB L3 cache, its own memory controller, its own DRAM. When a thread runs on socket 0 and accesses memory allocated on socket 1, the load crosses the QPI interconnect — ~40 ns extra latency. When the OS migrates a thread between sockets mid-benchmark, the prefetcher resets, the L1/L2 are cold, and the next few thousand loads hit DRAM instead of cache.</p>
<p>BDN doesn&rsquo;t know which socket it&rsquo;s running on. It reports a single number. On dual-socket NUMA, that number carries noise from topology that has nothing to do with the code being measured.</p>
<p>Three runs: unpinned (OS schedules freely), pinned to socket 0 (<code>taskset -c 0-11</code>), pinned to socket 1 (<code>taskset -c 12-23</code>). Same binary, same data, same benchmark. Different answers.</p>
<div class="highlight"><pre data-lang=""><code>Unpinned (OS schedules freely):
| Method        | N        | Mean           | Error        | StdDev       | Ratio | RatioSD |
|-------------- |--------- |---------------:|-------------:|-------------:|------:|--------:|
| SumSequential | 1000000  |       555.0 us |      1.26 us |      1.12 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,577.9 us |    181.88 us |    151.88 us | 13.65 |    0.26 |
| SumSequential | 8000000  |     9,146.4 us |  1,431.15 us |  1,338.70 us |  1.02 |    0.21 |
| SumRandom     | 8000000  |   127,720.5 us |  1,591.92 us |  1,329.32 us | 14.25 |    2.03 |
| SumSequential | 64000000 |    65,306.5 us |    522.56 us |    488.81 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,665,816.7 us | 35,695.93 us | 33,389.99 us | 25.51 |    0.53 |

Pinned to socket 0 (taskset -c 0-11):
| Method        | N        | Mean           | Error        | StdDev       | Ratio | RatioSD |
|-------------- |--------- |---------------:|-------------:|-------------:|------:|--------:|
| SumSequential | 1000000  |       564.9 us |      6.22 us |      5.51 us |  1.00 |    0.01 |
| SumRandom     | 1000000  |     8,199.0 us |    596.29 us |    557.77 us | 14.51 |    0.97 |
| SumSequential | 8000000  |     8,942.9 us |  1,201.30 us |  1,123.70 us |  1.02 |    0.18 |
| SumRandom     | 8000000  |   125,272.7 us |  2,566.78 us |  2,400.96 us | 14.24 |    1.93 |
| SumSequential | 64000000 |    67,722.2 us |    574.32 us |    479.59 us |  1.00 |    0.01 |
| SumRandom     | 64000000 | 1,675,995.7 us | 14,298.58 us | 11,939.97 us | 24.75 |    0.24 |

Pinned to socket 1 (taskset -c 12-23):
| Method        | N        | Mean           | Error       | StdDev      | Ratio | RatioSD |
|-------------- |--------- |---------------:|------------:|------------:|------:|--------:|
| SumSequential | 1000000  |       560.5 us |     1.47 us |     1.30 us |  1.00 |    0.00 |
| SumRandom     | 1000000  |     7,330.0 us |    40.50 us |    37.89 us | 13.08 |    0.07 |
| SumSequential | 8000000  |     6,961.7 us |   147.33 us |   137.82 us |  1.00 |    0.03 |
| SumRandom     | 8000000  |   124,440.5 us | 2,756.00 us | 2,577.97 us | 17.88 |    0.50 |
| SumSequential | 64000000 |    56,263.4 us |   685.72 us |   641.42 us |  1.00 |    0.02 |
| SumRandom     | 64000000 | 1,650,334.5 us | 4,652.78 us | 3,885.28 us | 29.34 |    0.33 |</code></pre></div>
<div class="chart-container">
  <canvas id="chart-df02df7162ee4c354c4ab6183399391c"></canvas>
</div>
<script>
  (function() {
    var ctx = document.getElementById('chart-df02df7162ee4c354c4ab6183399391c').getContext('2d');
    new Chart(ctx, 
{
  type: 'bar',
  data: {
    labels: ['1M\nUnpinned', '1M\nSocket 0', '1M\nSocket 1', '8M\nUnpinned', '8M\nSocket 0', '8M\nSocket 1', '64M\nUnpinned', '64M\nSocket 0', '64M\nSocket 1'],
    datasets: [
      {
        label: 'Ratio (Random / Sequential)',
        data: [13.65, 14.51, 13.08, 14.25, 14.24, 17.88, 25.51, 24.75, 29.34],
        backgroundColor: ['#89b4fa', '#89b4fa', '#89b4fa', '#fab387', '#fab387', '#fab387', '#f38ba8', '#f38ba8', '#f38ba8'],
        borderColor: ['#89b4fa', '#89b4fa', '#89b4fa', '#fab387', '#fab387', '#fab387', '#f38ba8', '#f38ba8', '#f38ba8'],
        borderWidth: 1
      }
    ]
  },
  options: {
    plugins: {
      title: { display: true, text: 'The ratio cliff across NUMA topologies' },
      subtitle: { display: true, text: 'Same code, same data, different thread placement — the cliff is always there' },
      legend: { display: false }
    },
    scales: {
      y: {
        title: { display: true, text: 'Ratio (Random / Sequential)' },
        min: 0,
        max: 35
      }
    }
  }
}
);
  })();
</script>

<p>The ratio at 1M is stable: 13.08–14.51×. Everything fits in L3 regardless of socket — NUMA doesn&rsquo;t matter when the prefetcher keeps the pipeline full and the working set is cache-resident.</p>
<p>At 8M, the topology starts to show. Socket 1 reports 17.88× while unpinned and socket 0 hover around 14.2×. Sequential at 8M diverges the most: unpinned reports 9,146 us (BDN flagged a bimodal distribution, consistent with thread migration mid-run), socket 0 reports 8,943 us, socket 1 reports 6,962 us. That is a 31% spread between slowest and fastest on the same sequential sum, same data, same binary. The difference is where the thread ran and whether it stayed there.</p>
<p>At 64M, the spread widens further: 24.75× (socket 0) to 29.34× (socket 1). An 18% swing in the ratio from thread placement alone. Random access times are close (~1.65–1.68s) — DRAM latency dominates and both sockets pay roughly the same price. Sequential is where the sockets diverge: socket 1 runs sequential 17% faster than socket 0 (56,263 vs 67,722 us), likely because socket 1&rsquo;s memory controller has less contention from OS and runtime threads that default to socket 0.</p>
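<p>The spreads quoted in this section follow from the three tables above. The sketch below copies the relevant means and ratios and shows one way to define each percentage; the variable names are illustrative, and other definitions of &ldquo;spread&rdquo; would give slightly different figures.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;

// Sequential means in microseconds, copied from the three NUMA tables above.
double seq8mUnpinned = 9_146.4, seq8mSocket0 = 8_942.9, seq8mSocket1 = 6_961.7;
double seq64mSocket0 = 67_722.2, seq64mSocket1 = 56_263.4;

// Spread at 8M: slowest placement relative to the fastest.
double slowest8m = Math.Max(seq8mUnpinned, Math.Max(seq8mSocket0, seq8mSocket1));
double fastest8m = Math.Min(seq8mUnpinned, Math.Min(seq8mSocket0, seq8mSocket1));
double spread8m = slowest8m / fastest8m - 1;

// Swing in the random/sequential ratio at 64M between the two pinned runs.
double ratioSwing64m = 29.34 / 24.75 - 1;

// Fraction of sequential time saved on socket 1 versus socket 0 at 64M.
double socket1Saving = 1 - seq64mSocket1 / seq64mSocket0;

Console.WriteLine($"8M sequential spread: {spread8m:P1}");
Console.WriteLine($"64M ratio swing: {ratioSwing64m:P1}");
Console.WriteLine($"64M socket 1 saving: {socket1Saving:P1}");</code></pre></div>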
<p>The exact ratios from the earlier section (12.88×, 19.53×, 19.45×) came from yet another run. They don&rsquo;t match any of these three. That&rsquo;s the point. On some runs the cliff at 8M is sharp (socket 1: 13.08× → 17.88×); on others it&rsquo;s muted (unpinned: 13.65× → 14.25×, with the full impact deferred to 64M where DRAM dominates regardless of topology). Four runs, four sets of numbers, one shape: a cliff where the working set crosses the L3 boundary. Whether it lands at 8M or spreads across 8M–64M depends on thread placement and memory allocation, not on the code.</p>
<p><code>taskset</code> and <code>numactl</code> aren&rsquo;t exotic tools. They&rsquo;re part of the measurement environment — the same environment that FTF-2 warned you about. On single-socket machines, none of this matters. On NUMA, it&rsquo;s the difference between a 24.75× and a 29.34× — same code, same data, same question, different answer depending on which socket the OS picked.</p>
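<p>Pinning can also be requested from inside a .NET process. The sketch below uses <code>Process.ProcessorAffinity</code> to mirror <code>taskset -c 0-11</code>. It is not how the runs in this post were pinned (those used <code>taskset</code>), and it assumes logical CPUs 0–11 belong to one socket, which is true of the numbering used above but should be checked with <code>lscpu</code> on any other machine.</p>
<div class="highlight"><pre data-lang="csharp"><code>using System;
using System.Diagnostics;

// Bit i set = logical CPU i allowed. Bits 0-11 mirror `taskset -c 0-11`.
// Assumption: CPUs 0-11 sit on one socket; verify the numbering on your machine.
long socket0Mask = (1L &lt;&lt; 12) - 1;

using var self = Process.GetCurrentProcess();
self.ProcessorAffinity = (IntPtr)socket0Mask; // supported on Windows and Linux

Console.WriteLine($"Affinity mask: 0x{(long)self.ProcessorAffinity:X}");</code></pre></div>
<p>Two caveats apply. How the mask reaches threads that are already running is platform-dependent, so launching the whole process under <code>taskset</code> remains the more predictable route for benchmark child processes. And affinity restricts where threads run, not where memory is allocated; binding memory to a node is what <code>numactl --membind</code> is for.</p>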
<hr>
<h2 id="the-hardware-checklist">The hardware checklist</h2>
<p>Five questions hardware counters answer that benchmarks cannot:</p>
<table>
  <thead>
      <tr>
          <th>Question</th>
          <th>Counter</th>
          <th>What to look for</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Is my code cache-efficient?</td>
          <td><code>L1-dcache-load-misses</code>, <code>LLC-load-misses</code></td>
          <td>Miss rate above ~10% on this hardware suggests an access pattern worth investigating</td>
      </tr>
      <tr>
          <td>Is the CPU pipeline efficient?</td>
          <td><code>instructions</code> / <code>cycles</code> (IPC)</td>
          <td>IPC below ~1.0 on this hardware suggests stalling on memory or branch misses</td>
      </tr>
      <tr>
          <td>Is branch prediction working?</td>
          <td><code>branch-misses</code> / <code>branch-instructions</code></td>
          <td>Miss rate above ~5% on this hardware suggests unpredictable branches</td>
      </tr>
      <tr>
          <td>Will this scale with data size?</td>
          <td>Compare cache miss rates at small vs large N</td>
          <td>Rising miss rate as N grows points toward a performance cliff</td>
      </tr>
      <tr>
          <td>Where is time spent?</td>
          <td><code>perf record</code> + flame graph</td>
          <td>Wide stacks indicate distributed stalls; narrow stacks indicate a hot loop</td>
      </tr>
  </tbody>
</table>
<p><em>These thresholds are priors, not axioms — useful starting points for investigation on this hardware, unverified on yours.</em></p>
<h3 id="when-to-use-what">When to use what</h3>
<p><strong>BDN suffices</strong> most of the time:</p>
<ul>
<li>You&rsquo;re comparing two implementations and the ratio is clear (&gt;1.5× or &lt;0.7×)</li>
<li>The result is stable across runs</li>
<li>You&rsquo;re making a ship/no-ship decision on a known bottleneck</li>
</ul>
<p><strong>Reach for perf stat</strong> when:</p>
<ul>
<li>Two variants show similar BDN times but you suspect different underlying behavior</li>
<li>The ratio changes unexpectedly across dataset sizes</li>
<li>You need to understand <em>why</em> something is slow, not just <em>how much</em></li>
<li>You want to predict scaling behavior before running the full benchmark suite</li>
</ul>
<p><strong>Use flame graphs</strong> when:</p>
<ul>
<li><code>perf stat</code> says &ldquo;cache misses&rdquo; but you don&rsquo;t know which access pattern causes them</li>
<li>A complex function is slow and you need to identify the hot path</li>
<li>You&rsquo;re profiling an entire application, not an isolated benchmark</li>
</ul>
<hr>
<h3 id="run-it-yourself">Run it yourself</h3>
<div class="highlight"><pre data-lang="bash"><code>git clone https://github.com/0x3f-blog/companion-code.git
cd companion-code/first-things-first/hardware-counters

# All benchmarks — 3 dataset sizes (~2 min)
dotnet run -c Release -- --filter &#39;*&#39;

# perf stat comparison (Linux only) — full event set matching the blog post
perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter &#39;*Sequential*&#39;

perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
    dotnet run -c Release -- --filter &#39;*Random*&#39;

# Or use the included scripts
./Scripts/perf-stat.sh
./Scripts/run-scaling.sh</code></pre></div>
<hr>
<h2 id="benchmark-environment">Benchmark environment</h2>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CPU</td>
          <td>2× Intel Xeon E5-2697 v2 @ 2.70 GHz (24 cores / 48 threads)</td>
      </tr>
      <tr>
          <td>L3 Cache</td>
          <td>30 MB per socket</td>
      </tr>
      <tr>
          <td>RAM</td>
          <td>~115 GB DDR3-1866 (quad-channel per socket)</td>
      </tr>
      <tr>
          <td>OS</td>
          <td>Fedora Linux 42 (kernel 6.17)</td>
      </tr>
      <tr>
          <td>Runtime</td>
          <td>.NET 9.0.11 (RyuJIT AVX)</td>
      </tr>
      <tr>
          <td>SDK</td>
          <td>.NET SDK 10.0.102 (targets net9.0 — SDK 10 builds 9.0 apps)</td>
      </tr>
      <tr>
          <td>BenchmarkDotNet</td>
          <td>v0.14.0</td>
      </tr>
      <tr>
          <td>perf</td>
          <td>v6.18.6, <code>perf_event_paranoid=2</code></td>
      </tr>
      <tr>
          <td>GC</td>
          <td>Server GC, Concurrent (as reported by BDN for the benchmark processes; BDN inherits the GC mode from the project or job configuration rather than enabling Server GC on its own)</td>
      </tr>
  </tbody>
</table>
<p><strong>Limitations:</strong> Different machine, different numbers. Dual-socket NUMA — thread migration can widen variance. <code>perf stat</code> numbers are aggregated over the full BDN process (warmup, pilot, actual iterations at all three dataset sizes), not isolated per-benchmark. Absolute counter values include BDN overhead; ratios between variants are meaningful. The L3 cache boundary (30 MB) is specific to Ivy Bridge-EP — your cache hierarchy will produce a cliff at a different dataset size. The IPC values reflect aggregate process behavior, not just the hot loop; isolated hot-loop IPC would be higher for sequential (~3.0+) and similar for random (~0.3-0.5).</p>
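<p>The 30 MB boundary is specific to this machine. A rough sketch for estimating where the cliff lands on other hardware, assuming 8-byte <code>long</code> elements, dataset sizes that are powers of two (1M = 2^20 elements, which matches the 8 MB / 64 MB / 512 MB figures in this post), and an L3 size you supply yourself. The function names are hypothetical, and the estimate ignores the permutation array that the random variant also touches.</p>
<div class="highlight"><pre data-lang="bash"><code># Working set in MB for a long[] of N elements (8 bytes each, 1 MB = 2^20 bytes).
# Integer division: anything below 1 MB truncates to 0.
working_set_mb() {
    echo $(( $1 * 8 / 1048576 ))
}

# Classify a dataset against an L3 size in MB.
# The default of 30 is this post's Ivy Bridge-EP value, not a universal constant.
classify() {
    local elements=$1 l3_mb=${2:-30}
    if [ "$(working_set_mb "$elements")" -le "$l3_mb" ]; then
        echo "fits-L3"
    else
        echo "exceeds-L3"
    fi
}

# Assumed dataset sizes: 1M, 8M and 64M elements as powers of two.
for n in 1048576 8388608 67108864; do
    echo "$n elements: $(working_set_mb "$n") MB, $(classify "$n")"
done</code></pre></div>
<p>With the default of 30 MB, the 8 MB dataset fits and the 64 MB and 512 MB datasets do not. Pass your own L3 size in MB as the second argument to <code>classify</code> (<code>lscpu</code> reports cache sizes on Linux) to see which of your dataset sizes should land past the cliff.</p>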
<hr>
<h2 id="piercing-through">Piercing through</h2>
<p>Five posts. Five layers.</p>
<p>Design — what you measure. Environment — what surrounds the measurement. Data collection — how you gather it. Interpretation — what you do with the numbers. Cause — why the numbers are what they are.</p>
<p>Through the first four posts, the image moved steadily away from reality. Benchmark design distorted it. The environment masked the distortion. Coordinated omission replaced absent data with comfortable silence. Statistical interpretation severed the last thread connecting numbers to the thing they claimed to represent. Baudrillard&rsquo;s phases of the simulacrum, played out in measurement: the image that distorts reality, the image that masks its absence, the image that bears no relation to reality at all.</p>
<p><code>perf stat</code> pierces through. It doesn&rsquo;t build another image. It reads registers that the silicon increments each time an event occurs: a cache miss, a branch mispredict, an instruction retired. Not a model of what happened. Not an abstraction of what happened. What happened, counted in hardware, whether anyone is watching or not. The first tool in five posts that measures the territory, not the map.</p>
<p>The series started with a lie — 27.2M ops/sec and three contradictory verdicts from the same optimization. It ends not with an answer but with a framework. Five layers, five dimensions. You don&rsquo;t need to measure all of them every time. You need to know they exist, and when to reach for which one.</p>
<p>You have the tools. You know when to reach for which one.</p>
<hr>
<h2 id="further-reading">Further reading</h2>
<ul>
<li>Brendan Gregg, <em>Systems Performance</em>, 2nd ed. (Addison-Wesley, 2020) — chapters 6 (CPU) and 7 (Memory). The definitive reference for PMU counters, <code>perf</code>, and flame graphs.<sup id="fnref:9"><a href="#fn:9" class="footnote-ref" role="doc-noteref">9</a></sup></li>
<li>Brendan Gregg, <a href="https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html">CPU Flame Graphs</a> (2016) — the original methodology for flame graph generation and interpretation.<sup id="fnref1:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></li>
<li>Ahmad Yasin, <a href="https://ieeexplore.ieee.org/document/6844459">A Top-Down Method for Performance Analysis and Counters Architecture</a> (ISPASS 2014) — the framework Intel uses: Frontend Bound, Backend Bound, Bad Speculation, Retiring.<sup id="fnref:10"><a href="#fn:10" class="footnote-ref" role="doc-noteref">10</a></sup></li>
<li>Ulrich Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a> (2007) — cache hierarchy, prefetch, TLB. The foundation for understanding cache miss counters.<sup id="fnref:11"><a href="#fn:11" class="footnote-ref" role="doc-noteref">11</a></sup></li>
<li>Agner Fog, <a href="https://www.agner.org/optimize/microarchitecture.pdf">Microarchitecture of Intel, AMD and VIA CPUs</a> (2025, continuously updated) — pipeline, execution ports, cache latencies at the microarchitecture level.<sup id="fnref:12"><a href="#fn:12" class="footnote-ref" role="doc-noteref">12</a></sup></li>
<li>Denis Bakhvalov, <a href="https://book.easyperf.net/perf_book"><em>Performance Analysis and Tuning on Modern CPUs</em></a> (easyperf.net, 2020) — practical guide to PMU, <code>perf</code>, and top-down analysis.<sup id="fnref:13"><a href="#fn:13" class="footnote-ref" role="doc-noteref">13</a></sup></li>
<li>Andi Kleen, <a href="https://github.com/andikleen/pmu-tools">pmu-tools / toplev</a> — automated top-down microarchitecture analysis using hardware counters.<sup id="fnref:14"><a href="#fn:14" class="footnote-ref" role="doc-noteref">14</a></sup></li>
<li>perf wiki, <a href="https://perfwiki.github.io/main/tutorial/">Tutorial</a> — official documentation for Linux <code>perf</code> tools.<sup id="fnref:15"><a href="#fn:15" class="footnote-ref" role="doc-noteref">15</a></sup></li>
<li>BenchmarkDotNet, <a href="https://benchmarkdotnet.org/articles/configs/diagnosers.html">Diagnosers</a> — BDN&rsquo;s built-in hardware counter collection via ETW (Windows only).<sup id="fnref:16"><a href="#fn:16" class="footnote-ref" role="doc-noteref">16</a></sup></li>
<li>Intel, <a href="https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html"><em>64 and IA-32 Architectures Optimization Reference Manual</em></a> (2025) — chapter 3: top-down analysis, performance counter event codes.<sup id="fnref:17"><a href="#fn:17" class="footnote-ref" role="doc-noteref">17</a></sup></li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Ivy Bridge-EP exposes 3 fixed counters and, with Hyper-Threading enabled as on this machine, 4 general-purpose counters per hardware thread. When you request more events than available registers (as in the 10-event <code>perf stat</code> command above), <code>perf</code> time-multiplexes: it rotates events through the available registers and scales each count by the fraction of time the event was actually scheduled. The percentages shown in the output (e.g., <code>(39.98%)</code>) indicate what fraction of runtime each counter was actually active. This introduces estimation error, but for long-running workloads like BDN benchmarks the error is typically small.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>The <code>perf stat</code> numbers shown are aggregated over the entire BDN process, which includes warmup, pilot runs, and actual iterations at all three dataset sizes. This means the absolute values include BDN framework overhead. The <em>ratios</em> between Sequential and Random are meaningful because both variants include the same overhead. For isolated hot-loop counters, run <code>perf stat</code> on a standalone loop outside BDN, or use BDN&rsquo;s <code>[HardwareCounters]</code> diagnoser on Windows (it relies on ETW and is not available on Linux).&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref2:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>The LLC miss rates appear counterintuitive: sequential (53%) is higher than random (30%). This is an artifact of aggregation: sequential runs ~3× more total iterations (being much faster per-op), and the 64M dataset (512 MB, far exceeding 30 MB L3) dominates the sequential counter totals. Random access, being slower, runs fewer iterations, so its LLC counters are weighted more toward the smaller (L3-resident) datasets. For a fair LLC comparison, you would need per-dataset-size <code>perf stat</code> runs, which requires running benchmarks outside BDN or, on Windows, using BDN&rsquo;s <code>[HardwareCounters]</code> diagnoser. The IPC and L1 miss rate comparisons are more robust to this aggregation effect.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>The Intel hardware prefetcher on Ivy Bridge-EP detects sequential and strided access patterns and prefetches cache lines into L1/L2 before the load instruction executes. For a sequential <code>long[]</code> walk with 8-byte stride, the prefetcher can stay ahead of the loop, effectively hiding memory latency. Random access has no predictable stride, so the prefetcher cannot help: nearly every load of the data array misses L1 and L2, and the CPU must wait for the line to arrive from L3 or DRAM.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>Brendan Gregg, <a href="https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html">CPU Flame Graphs</a> (2016). Flame graphs collapse stack traces into a single visualization where width = time. The x-axis is alphabetical (not temporal) — a common source of misinterpretation.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:6">
<p>Sequential access is not perfectly linear across this range. At 1M (8 MB), data fits in L3 (~30 cycle latency). At 8M (64 MB), it exceeds L3 and every cache line comes from DRAM (~200 cycle latency). The prefetcher hides most of this increase by issuing DRAM requests ahead of the loop, but the transition from L3-resident to DRAM-resident adds a constant factor. Within the DRAM-resident range (8M → 64M), scaling is closer to linear: 8× more data → ~12.9× more time. The remaining super-linearity likely reflects TLB pressure and NUMA effects at 512 MB working set on this dual-socket system.&#160;<a href="#fnref:6" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:7">
<p>Dual-socket NUMA adds variance to these ratios. Thread migration between sockets, local vs remote memory access, and OS scheduling decisions can change the ratio by a factor of up to ~2 between runs. Pinning to a single socket with <code>taskset</code> or <code>numactl</code> (which can also bind memory allocation to the local node) reduces this. The shape of the curve (a cliff at the L3 boundary) is stable; the exact height of the cliff is not.&#160;<a href="#fnref:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:7" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:8">
<p>Karl Popper, <em>Logik der Forschung</em> (1934), published in English as <em>The Logic of Scientific Discovery</em> (Hutchinson, 1959). The demarcation criterion — a theory is scientific if and only if it is falsifiable — applies directly: &ldquo;cache miss rate is high, working set fits L3, exceeding L3 will degrade the ratio&rdquo; is falsifiable. &ldquo;Random access is slow because it&rsquo;s random&rdquo; is not.&#160;<a href="#fnref:8" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:9">
<p>Brendan Gregg, <em>Systems Performance</em>, 2nd ed. (Addison-Wesley, 2020). Chapters 6 and 7 cover CPU and memory performance analysis with <code>perf</code>. The methodology sections — USE method, TSA method — apply directly to interpreting the IPC and cache miss data in this post.&#160;<a href="#fnref:9" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:10">
<p>Ahmad Yasin, <a href="https://ieeexplore.ieee.org/document/6844459">A Top-Down Method for Performance Analysis and Counters Architecture</a>, ISPASS 2014. Classifies every cycle into four categories: Frontend Bound, Backend Bound (memory/core), Bad Speculation, Retiring. The random access pattern in this post is Backend Bound (memory) — the CPU is ready to execute but waiting for data.&#160;<a href="#fnref:10" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:11">
<p>Ulrich Drepper, <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a> (2007). Sections 3 (CPU caches) and 6 (programming for performance) explain why sequential access is fast (hardware prefetch, spatial locality) and random access is slow (no predictable stride, no prefetch).&#160;<a href="#fnref:11" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:12">
<p>Agner Fog, <a href="https://www.agner.org/optimize/microarchitecture.pdf">Microarchitecture of Intel, AMD and VIA CPUs</a> (2025, continuously updated). Table of cache latencies: L1 ~4 cycles, L2 ~12 cycles, L3 ~30 cycles, DRAM ~200 cycles on Ivy Bridge-EP. These latencies explain the ~13× ratio (L3 hits) vs ~20–29× ratio (DRAM) observed across runs in the benchmark results.&#160;<a href="#fnref:12" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:13">
<p>Denis Bakhvalov, <a href="https://book.easyperf.net/perf_book"><em>Performance Analysis and Tuning on Modern CPUs</em></a> (easyperf.net, 2020). Chapters on PMU counters and <code>perf</code> provide practical workflows for exactly the kind of analysis shown in this post — from <code>perf stat</code> to diagnosis.&#160;<a href="#fnref:13" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:14">
<p>Andi Kleen, <a href="https://github.com/andikleen/pmu-tools">pmu-tools / toplev</a>. Automates the Yasin top-down analysis method. For Intel CPUs, <code>toplev</code> can classify bottlenecks without manual counter selection.&#160;<a href="#fnref:14" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:15">
<p>perf wiki, <a href="https://perfwiki.github.io/main/tutorial/">Tutorial</a>. Documents <code>perf stat</code> (counter aggregation), <code>perf record</code> (sampling), <code>perf report</code> (analysis). The <code>perf_event_paranoid</code> sysctl controls access: <code>2</code> lets unprivileged users count user-space events for their own processes, but not kernel-space events.&#160;<a href="#fnref:15" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:16">
<p>BenchmarkDotNet, <a href="https://benchmarkdotnet.org/articles/configs/diagnosers.html">Diagnosers</a>. BDN can collect hardware counters per-benchmark via ETW on Windows. The <code>[HardwareCounters]</code> attribute enables collection of specific counters (e.g., <code>InstructionRetired</code>, <code>CacheMisses</code>). On Linux, BDN does not natively collect hardware counters — use <code>perf stat</code> externally as shown in this post.&#160;<a href="#fnref:16" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:17">
<p>Intel, <a href="https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html"><em>64 and IA-32 Architectures Optimization Reference Manual</em></a> (2025), chapter 3. Defines the performance monitoring events and their architectural guarantees. <code>cycles</code> and <code>instructions</code> are the safest architectural counters — widely available on modern x86 CPUs; verify with <code>perf list</code>. <code>cache-references</code> and <code>cache-misses</code> are also architectural in the Intel PMU spec, but their mapping to physical events varies by microarchitecture (e.g., they may count LLC references on one µarch and L2 on another). On non-Intel CPUs, check <code>perf list</code> for available mappings.&#160;<a href="#fnref:17" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>]]></content:encoded>
    </item>
  </channel>
</rss>
