Wasm Performance Benchmarking

Most WebAssembly benchmarks you see online are wrong. They run a tight loop once, print the first performance.now() delta they get, and conclude that Wasm is either 50× faster than JavaScript or barely faster at all — usually both, on the same machine, depending on the day. The truth is that a microbenchmark is a measuring instrument, and an uncalibrated instrument produces noise. This guide treats benchmarking as engineering: a reproducible harness with warmup and fixed iteration counts, statistical reporting instead of single samples, and a clear separation between what you meant to measure and the overhead you accidentally measured. It also covers reading the optimizer — wasm-opt’s pass pipeline and the Binaryen text IR it emits — so you can explain why a number changed, not just observe that it did.

Prerequisites

  • [ ] binaryen 116 or newer on your PATH (wasm-opt --version prints wasm-opt version 116)
  • [ ] Node.js 20+ for process.hrtime.bigint(); --allow-natives-syntax is optional, and it is an unsupported V8 testing flag whose behaviour can change between versions
  • [ ] A Chromium-based browser (Chrome 120+) and Firefox 121+ for DevTools profiling
  • [ ] wabt for wasm-objdump and wasm-validate (verification cross-checks); wasm-dis ships with binaryen
  • [ ] A quiet machine: close other tabs, disable turbo boost if you can, and run on AC power
  • [ ] WebAssembly global available in your runtime (all the above satisfy this)

A benchmark is only reproducible if the inputs are pinned. Lock your toolchain versions in CI, the same way the guide on setting up CI/CD for Rust Wasm projects pins the Rust toolchain: a binaryen minor bump can change which passes run by default and move your numbers by several percent.

The harness as a measuring loop

Every honest microbenchmark has the same shape: drive the function untimed until the JIT and the Wasm tiering compiler have settled (warmup), then run a fixed number of timed iterations, then aggregate the per-iteration samples into robust statistics. The diagram below is the loop you are building.

flowchart LR
  A[Load & instantiate module] --> B[Warmup: run N iterations untimed]
  B --> C{Tiers settled?}
  C -- no --> B
  C -- yes --> D[Timed loop: M iterations]
  D --> E[Record per-iteration sample]
  E --> F{M reached?}
  F -- no --> D
  F -- yes --> G[Discard warmup, sort samples]
  G --> H[Report median, p95, stddev]

Two properties make this trustworthy. First, warmup is discarded, never averaged in: the first few hundred calls run in the baseline tier (Liftoff in V8, the Wasm baseline compiler in SpiderMonkey) before the optimizing tier (TurboFan in V8, Ion in SpiderMonkey) kicks in, and mixing cold and hot samples produces a meaningless mean. Second, you report the distribution, not the mean: the median is robust to the occasional GC pause or scheduler preemption, and the p95 tells you the tail. A single number hides both.

The reason warmup is not optional in WebAssembly specifically is that browsers and Node deliberately compile a module twice. The baseline compiler trades code quality for compile speed so the page can start running almost immediately; a background thread then recompiles the hot functions with the optimizing compiler and hot-swaps them in. During that window a function can run 3–10× slower than it will once tiered. If your timed loop straddles the swap, you average two different machines together and the result is reproducible only by accident. The fix is mechanical: run the function untimed long enough that every hot function has tiered up, confirm by checking that the median stops moving when you double the warmup, and only then start recording.
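The "double the warmup and check that the median stops moving" test can be automated. The following is a minimal sketch rather than a standard API: `findWarmup`, `medianNs`, the 2% tolerance, and the probe size are all hypothetical names and defaults chosen for illustration, and it assumes Node for `process.hrtime.bigint()`.

```javascript
let sink = 0; // keeps results observable so the calls are not treated as dead

// Median per-call time in nanoseconds over `calls` timed calls.
function medianNs(fn, calls) {
  const samples = new Float64Array(calls);
  for (let i = 0; i < calls; i++) {
    const t0 = process.hrtime.bigint();
    sink ^= fn(i);
    samples[i] = Number(process.hrtime.bigint() - t0);
  }
  samples.sort(); // typed arrays sort numerically by default
  return samples[Math.floor(calls / 2)];
}

// Keep doubling the cumulative untimed warmup until the probed median moves
// by less than `tolerance` (relative), or give up once `maxWarmup` is reached.
function findWarmup(fn, { start = 1000, maxWarmup = 1_000_000, probe = 2000, tolerance = 0.02 } = {}) {
  let warmup = start;
  for (let i = 0; i < warmup; i++) sink ^= fn(i);
  let previous = medianNs(fn, probe);
  while (warmup < maxWarmup) {
    for (let i = 0; i < warmup; i++) sink ^= fn(i); // same amount again: total doubles
    warmup *= 2;
    const current = medianNs(fn, probe);
    if (Math.abs(current - previous) <= tolerance * Math.max(previous, 1)) {
      return { warmup, median: current, settled: true };
    }
    previous = current;
  }
  return { warmup, median: previous, settled: false };
}
```

Passing the exported kernel as `fn` returns the warmup count at which two successive probes agreed. The probe calls themselves also warm the function, so treat the returned count as a lower bound and round it up generously. A result with `settled: false` means the machine is too noisy or the kernel has not reached steady state within `maxWarmup`.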

The choice of clock matters just as much as the warmup. In Node, process.hrtime.bigint() returns a monotonic nanosecond counter that never jumps backwards and is not affected by wall-clock adjustments — exactly what you want for differences. In the browser, performance.now() is the equivalent, but its resolution is intentionally coarsened to defend against high-resolution timing attacks, so you time a batch of iterations under one clock read and divide. Never use Date.now() for either: it is millisecond-resolution and not guaranteed monotonic, so a single NTP correction can produce a negative duration.
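To decide how large a batch needs to be, it helps to measure how coarse the clock actually is in the environment you are running in. The sketch below estimates the smallest observable step of `performance.now()` and derives a batch size from it. Both function names and the default of 100 ticks per batch are illustrative assumptions, not part of any library.

```javascript
// Estimate the smallest observable tick of a clock (in the clock's own units,
// milliseconds for performance.now()) by busy-waiting until the reading changes.
function estimateResolutionMs(now = () => performance.now(), trials = 50) {
  let smallest = Infinity;
  for (let t = 0; t < trials; t++) {
    const start = now();
    let next = now();
    while (next === start) next = now(); // spin until the clock advances
    smallest = Math.min(smallest, next - start);
  }
  return smallest;
}

// Number of calls to place under one clock read so the batch spans `ticks`
// clock ticks, which bounds the quantization error at roughly 1/ticks.
function batchSizeFor(resolutionNs, estimatedNsPerCall, ticks = 100) {
  return Math.max(1, Math.ceil((ticks * resolutionNs) / estimatedNsPerCall));
}

const resolutionNs = estimateResolutionMs() * 1e6;
console.log(`observed clock step: about ${resolutionNs.toFixed(0)} ns`);
console.log(`batch for a 5 ns kernel at 5 µs resolution: ${batchSizeFor(5000, 5)}`);
```

With a 5 µs clock and a kernel of about 5 ns, spanning 100 ticks works out to 100,000 calls per batch. The estimate of the per-call cost only needs to be right to within an order of magnitude, since it is used to size the batch rather than to report a result.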

Step-by-step workflow

The workflow below produces comparable artifacts, proves the optimizer actually changed the body, runs the harness, and localizes any regression. Each step is a single runnable command or a focused edit; run them in order, because attributing a throughput change to a flag is only meaningful when the two binaries you compare differ by exactly that flag and nothing else.

1. Build an optimized and an unoptimized artifact

Compile your kernel, then produce explicit variants so you can attribute differences to the optimizer rather than the compiler:

# baseline: whatever your toolchain emits, no post-processing
cp kernel.wasm kernel.O0.wasm

# the three levels you will actually compare
wasm-opt kernel.O0.wasm -O2 -o kernel.O2.wasm
wasm-opt kernel.O0.wasm -O3 -o kernel.O3.wasm
wasm-opt kernel.O0.wasm -Os -o kernel.Os.wasm

2. Inspect what wasm-opt did before trusting the number

Dump the Binaryen text IR so a faster run has an explanation:

wasm-opt kernel.O0.wasm -O3 --print -o /dev/null | head -n 40

wasm-opt applies passes in command-line order, so -O3 runs the full optimization pipeline first and --print then prints the optimized module in Binaryen's text format. If a function shrank from a loop to a constant, you will see it here, and you will know the optimizer folded your benchmark away. Reading that IR fluently is its own skill, covered in reading Binaryen IR from wasm-opt.

3. Measure size and instruction counts

wasm-opt kernel.O3.wasm --metrics -o /dev/null
ls -l kernel.O0.wasm kernel.O3.wasm

--metrics prints a per-category instruction census (total, binary, call, load, loop, etc.). A throughput win should correlate with fewer loop/call nodes or a tighter binary; if the metrics are identical, your “improvement” is measurement noise.

4. Run the harness

node bench.mjs kernel.O3.wasm 200000

The harness (built in the next section) instantiates the module once, warms up, runs the timed loop, and prints a stats table. Run it three times; if the medians disagree by more than a couple of percent, your machine is too noisy: pin the process to a single CPU core and quiet the system before trusting any comparison.
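The "three runs agree within a couple of percent" rule is easy to check mechanically once you have collected the medians. A small sketch, in which `medianSpread`, `isQuietEnough`, and the 2% threshold are hypothetical helpers for illustration:

```javascript
// Relative spread between the slowest and fastest run: (max - min) / min.
function medianSpread(medians) {
  const lo = Math.min(...medians);
  const hi = Math.max(...medians);
  return (hi - lo) / lo;
}

// True when repeated runs agree to within `tolerance` (2% by default).
function isQuietEnough(medians, tolerance = 0.02) {
  return medians.length >= 2 && medianSpread(medians) <= tolerance;
}

// Example with made-up medians (in ns) from three runs of the harness.
const runs = [812, 820, 815];
console.log(`spread ${(medianSpread(runs) * 100).toFixed(2)}%`, isQuietEnough(runs));
```

If the check fails, fix the environment and rerun instead of averaging the runs together: averaging noisy medians hides the noise without removing it.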

5. Profile to localize a regression

When a number moves the wrong way, open Chrome DevTools → Performance, record the timed loop, and look at the bottom-up tree. Wasm frames appear with their function index or name (if a name section survives); a hot Liftoff frame that never tiers up is a warmup bug, not a kernel problem. Firefox’s profiler shows the same with explicit baseline/ion tier annotations.

DevTools profiling adds a second dimension the harness alone cannot give you: where the time goes. A median tells you the kernel is slow; a flame chart tells you it is slow inside a single bounds-checked load in an inner loop, or that half the samples land in a memory.grow you did not expect. In Chrome, enable “Memory” in the Performance recording to overlay GC events; a sawtooth heap with frequent minor collections during the timed window explains a fat p95 immediately. In Firefox, the per-frame tier badge is the fastest way to confirm warmup: if the badge says baseline on a frame you expected to be hot, the function never tiered up, and the most likely cause is a warmup that is too short. Treat the harness and the profiler as complementary: the harness produces the number, the profiler explains it.

A reproducible harness

This is a minimal but honest Node harness. It instantiates once, warms up, times M iterations with process.hrtime.bigint() (nanosecond resolution, monotonic), consumes the result so dead-code elimination cannot delete the work, and reports median/p95/stddev.

// bench.mjs — run: node bench.mjs <file.wasm> <iterations>
import { readFile } from "node:fs/promises";

const [, , file, iterArg] = process.argv;
const M = Number(iterArg ?? 100_000);
const WARMUP = Math.max(1000, Math.floor(M / 10)); // whole number of warmup calls

const bytes = await readFile(file);
const { instance } = await WebAssembly.instantiate(bytes, {});
const kernel = instance.exports.run; // exported i32->i32 hot function

let sink = 0; // accumulator the optimizer cannot prove is dead

// warmup: drive the tiering compiler, results discarded
for (let i = 0; i < WARMUP; i++) sink ^= kernel(i);

// timed loop: one sample per call
const samples = new Float64Array(M);
for (let i = 0; i < M; i++) {
  const t0 = process.hrtime.bigint();
  sink ^= kernel(i);
  const t1 = process.hrtime.bigint();
  samples[i] = Number(t1 - t0); // nanoseconds
}

// consume sink: printing it makes the accumulated result observable,
// so the JIT cannot treat the kernel calls as dead
console.log(`sink ${sink}`);

samples.sort((a, b) => a - b);
const median = samples[Math.floor(M * 0.5)];
const p95 = samples[Math.floor(M * 0.95)];
const mean = samples.reduce((a, b) => a + b, 0) / M;
const stddev = Math.sqrt(
  samples.reduce((a, b) => a + (b - mean) ** 2, 0) / M,
);

console.log(
  `median ${median.toFixed(1)} ns  p95 ${p95.toFixed(1)} ns  ` +
    `stddev ${stddev.toFixed(1)} ns  (n=${M})`,
);

The single most important line is sink ^= kernel(i). If you call kernel(i) and throw the result away, the optimizing compiler may be able to prove the call has no observable effect (for example after inlining a small, side-effect-free function) and eliminate it, and you end up timing an empty loop at ~0.3 ns/iter. Always feed the result into something the runtime cannot prove is dead, such as an XOR accumulator you print at the end. In the browser, swap process.hrtime.bigint() for performance.now(), but be aware its resolution is clamped (see the FAQ), so prefer timing a batch of iterations and dividing.

This harness makes a deliberate tradeoff worth naming. It records one timer pair per iteration, which is sound only when the kernel runs long enough (hundreds of nanoseconds or more) that the two hrtime calls (≈30–60 ns of overhead together) are a small fraction of the sample. For a genuinely tiny kernel the balance inverts: the timer dominates, every sample is mostly clock-read cost, and the distribution is meaningless. The remedy is batch timing: wrap BATCH calls in one timer pair and divide the elapsed time by BATCH, which amortizes the timer overhead to near zero at the cost of losing per-iteration granularity. Pick per-iteration timing when you want the full distribution and the kernel is large; pick batch timing when the kernel is small and you only need a robust central estimate. The harness in building a reproducible Wasm benchmark harness shows both forms side by side and when to reach for each.
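As a rough illustration of the batch form described above, the sketch below wraps `batch` calls in one timer pair and records one per-call estimate per batch. `timeBatched` and its defaults are assumptions for this example and are not taken from the linked harness.

```javascript
// Batch timing: one clock pair around `batch` calls, repeated `batches` times.
// Each sample is the average per-call cost within one batch, in nanoseconds.
function timeBatched(fn, { batches = 200, batch = 10_000 } = {}) {
  const perCall = new Float64Array(batches);
  let sink = 0; // accumulate results so the calls stay observable
  for (let b = 0; b < batches; b++) {
    const t0 = process.hrtime.bigint();
    for (let i = 0; i < batch; i++) sink ^= fn(i);
    const t1 = process.hrtime.bigint();
    perCall[b] = Number(t1 - t0) / batch;
  }
  perCall.sort(); // numeric sort on typed arrays
  return {
    medianNs: perCall[Math.floor(batches * 0.5)],
    p95Ns: perCall[Math.floor(batches * 0.95)],
    sink,
  };
}
```

Note what the percentiles mean here: they describe the distribution of batch averages, not of individual calls, so a single slow call is diluted by the other calls in its batch. That is the granularity you give up in exchange for amortizing the timer.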

Note also that the importObject here is empty ({}). If your kernel imports host functions — a Math.random, a logging callback, a memory you supply — those imports become part of what you measure, and a slow JavaScript import called inside the hot loop will swamp the kernel. Keep the timed function pure where you can, and if it must call back into the host, measure that import’s cost separately so you know how much of the number belongs to Wasm and how much to the boundary it crosses.

Optimization flags & tradeoffs, with numbers

The three levels you compared in step 1 trade throughput against binary size. Representative figures for a numeric kernel (a SAXPY inner loop, y[i] = a*x[i] + y[i] over 1M elements) compiled from Rust and post-processed with wasm-opt:

Level        Throughput (Melem/s)   .wasm size   When to pick it
-O0 (none)   410                    12.4 KB      never ship this; baseline only
-Os          980                    7.1 KB       size-constrained bundles, cold-start sensitive
-O2          1180                   8.0 KB       the safe default: most of -O3, smaller
-O3          1240                   9.6 KB       compute-bound hot paths; aggressive inlining

The headline: -O3 buys only ~5% over -O2 here (1240 vs 1180 Melem/s) but costs 20% more bytes (9.6 KB vs 8.0 KB), because the SAXPY loop is memory-bandwidth bound and inlining cannot help bandwidth. On a branchy, call-heavy kernel the gap widens to 20–40% because -O3 inlines more aggressively than -O2 and removes call overhead the bandwidth-bound case never had. This is exactly why you measure your kernel rather than trusting a table: the right level is workload-dependent. The size side of this tradeoff, and how wasm-opt achieves it, is the focus of reducing Wasm bundle size with wasm-opt, and the full flag matrix lives in Wasm optimization flags & size reduction.
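The percentages quoted for the table can be recomputed directly from its figures, which is a useful habit when you build the same table for your own kernel. The snippet below uses the representative numbers from the table above; the `compare` helper and field names are illustrative.

```javascript
// Figures copied from the table: throughput in Melem/s, size in KB.
const table = {
  O0: { melemPerSec: 410, sizeKB: 12.4 },
  Os: { melemPerSec: 980, sizeKB: 7.1 },
  O2: { melemPerSec: 1180, sizeKB: 8.0 },
  O3: { melemPerSec: 1240, sizeKB: 9.6 },
};

// Relative change of `candidate` against `reference` (0.05 means +5%).
function compare(candidate, reference) {
  return {
    throughputChange: candidate.melemPerSec / reference.melemPerSec - 1,
    sizeChange: candidate.sizeKB / reference.sizeKB - 1,
  };
}

const pct = (x) => `${(x * 100).toFixed(1)}%`;
const o3vsO2 = compare(table.O3, table.O2);
console.log(`-O3 vs -O2: throughput ${pct(o3vsO2.throughputChange)}, size ${pct(o3vsO2.sizeChange)}`);
```

For -O3 against -O2 this gives roughly a 5% throughput gain for a 20% size increase, which matches the headline. Swapping in your own measured medians and file sizes makes the tradeoff for your kernel explicit.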

Two practical corollaries follow. First, the size column is not free even when you are chasing throughput: a larger .wasm takes longer to download and compile, so on a cold start the instantiate cost can erase a steady-state win you only realize after thousands of calls. If your module runs a kernel a handful of times per page load, -Os may beat -O3 end-to-end despite being slower per iteration. Second, --converge (repeating the pass pipeline until the output stabilizes) can squeeze another few percent of size out of any level, but it does not change throughput meaningfully and roughly doubles optimize time — reach for it on shipping artifacts, not on every benchmark build. The discipline is to benchmark the level you actually intend to ship, on the input distribution you actually expect, rather than reporting the headline peak from a synthetic loop.
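The first corollary is a break-even calculation: a faster level only pays off once its per-call saving has repaid its extra download and compile time. A minimal sketch follows; `breakEvenCalls` is a hypothetical helper, and the numbers in the example are invented purely to show the arithmetic.

```javascript
// Number of calls after which the faster-but-bigger build wins end to end.
//   extraStartupMs : additional download + compile time of the bigger build
//   slowNsPerCall  : per-call cost of the smaller build
//   fastNsPerCall  : per-call cost of the bigger build
function breakEvenCalls({ extraStartupMs, slowNsPerCall, fastNsPerCall }) {
  const savingPerCallNs = slowNsPerCall - fastNsPerCall;
  if (savingPerCallNs <= 0) return Infinity; // the bigger build never catches up
  return Math.ceil((extraStartupMs * 1e6) / savingPerCallNs);
}

// Invented example: 1 ms of extra startup, 850 ns vs 800 ns per call.
const calls = breakEvenCalls({ extraStartupMs: 1, slowNsPerCall: 850, fastNsPerCall: 800 });
console.log(`break-even after ${calls} calls`);
```

With those invented inputs the bigger build needs 20,000 calls to break even. If your page makes fewer calls than that per load, the smaller build wins end to end despite its slower steady state.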

Gotchas & failure modes

The optimizer deleted your benchmark. If a Wasm run reports sub-nanosecond per-iteration times, the optimizer constant-folded the kernel or DCE removed the loop. The fix is twofold: make the input data-dependent (read from a buffer the harness fills at runtime, not a compile-time constant) and consume the output. In Wasm specifically, wasm-opt --precompute evaluates expressions whose operands are constant at optimize time, so if the call run(42) is itself compiled into the module (for example in an exported driver function that gets inlined), the whole thing can become a single i32.const return.
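One way to make the input data-dependent is to have the harness fill a region of linear memory at runtime, so that no optimizer pass can see the values ahead of time. The sketch below assumes a kernel that reads `f32` values from a memory the host supplies; `fillInput`, the offset, and the element count are hypothetical, and the generator is a plain xorshift32, used here only because it is short.

```javascript
// Fill `count` f32 slots of a WebAssembly.Memory with pseudo-random values in
// [0, 1), seeded at runtime so the data is unknown when the module is built.
function fillInput(memory, byteOffset, count, seed = Number(process.hrtime.bigint() & 0xffffffffn)) {
  const view = new Float32Array(memory.buffer, byteOffset, count);
  let state = (seed >>> 0) || 1; // xorshift32 needs a nonzero state
  for (let i = 0; i < count; i++) {
    state ^= state << 13;
    state ^= state >>> 17;
    state ^= state << 5;
    state >>>= 0; // back to unsigned 32-bit
    view[i] = (state >>> 8) / 2 ** 24; // 24 bits fit an f32 mantissa exactly
  }
  return view;
}

// One 64 KiB page is enough for 1024 floats (4096 bytes) at offset 0.
const memory = new WebAssembly.Memory({ initial: 1 });
const input = fillInput(memory, 0, 1024);
console.log(`filled ${input.length} floats, first = ${input[0]}`);
```

Log the seed alongside the results if you need to replay a run. If the memory later grows, existing typed-array views are detached, so re-create the view after any `memory.grow`.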

You never warmed up the JIT. Timing the first 100 calls measures the baseline tier (Liftoff in V8, the baseline compiler in SpiderMonkey), which can be 3–10× slower than the optimized tier. The classic symptom is “Wasm is barely faster than JS”, because you compared cold, baseline-compiled Wasm against already-hot JS. Warm both to steady state before timing.

You measured the boundary, not the kernel. A call from JavaScript into Wasm costs a few nanoseconds of marshaling; if your kernel itself takes 5 ns, half your number is boundary overhead. Either amortize by doing more work per call (process 10,000 elements, not one), or — if the boundary is what you care about — measure it deliberately and label it as such.
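To measure the boundary deliberately, one option is to time calls to an exported Wasm function that does nothing, so everything you record is call overhead plus loop cost. The sketch below hand-assembles a minimal module exporting an empty function named `nop`. The module bytes and the `boundaryCostNs` helper are illustrative assumptions rather than part of any toolchain.

```javascript
// Minimal module: (func (export "nop")) with no params and no results.
const NOP_MODULE = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00,             // type section: () -> ()
  0x03, 0x02, 0x01, 0x00,                         // function section: func 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x6e, 0x6f, 0x70, 0x00, 0x00, // export "nop" = func 0
  0x0a, 0x04, 0x01, 0x02, 0x00, 0x0b,             // code section: empty body
]);

// Average cost of one JS -> Wasm call to an empty function, in nanoseconds.
function boundaryCostNs(calls = 1_000_000, warmup = 100_000) {
  const { nop } = new WebAssembly.Instance(new WebAssembly.Module(NOP_MODULE)).exports;
  for (let i = 0; i < warmup; i++) nop();
  const t0 = process.hrtime.bigint();
  for (let i = 0; i < calls; i++) nop();
  const t1 = process.hrtime.bigint();
  return Number(t1 - t0) / calls;
}

console.log(`approx. boundary cost: ${boundaryCostNs().toFixed(2)} ns/call`);
```

Treat the result as an estimate. The engine is free to inline or otherwise optimize a call to an empty function, so the true cost for a kernel with arguments and a return value may be higher. Comparing this figure against your kernel's per-call median tells you whether doing more work per call is worth the restructuring.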

GC noise contaminates the mean. A major GC during the timed loop adds a multi-millisecond spike. The median ignores it; the mean does not. This is the whole reason to report the median, and to look at p95 separately to see whether the tail is acceptable.
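A small synthetic example shows how far a single pause drags the mean. The sample values below are invented for illustration: 9,999 iterations at 100 ns and one iteration that absorbed a 5 ms collection.

```javascript
// Synthetic samples in nanoseconds: steady 100 ns with one 5 ms GC spike.
const samples = new Float64Array(10_000).fill(100);
samples[1234] = 5_000_000;

const sorted = Float64Array.from(samples).sort();
const median = sorted[Math.floor(sorted.length * 0.5)];
const p95 = sorted[Math.floor(sorted.length * 0.95)];
const max = sorted[sorted.length - 1];
const mean = samples.reduce((a, b) => a + b, 0) / samples.length;

console.log({ median, p95, max, mean });
// median and p95 stay at 100 ns; the mean is 599.99 ns, about 6x the typical cost
```

One sample in ten thousand inflates the mean roughly sixfold while the median and p95 do not move at all. In this example the spike only shows up in the maximum, which is a reason to glance at the maximum (or a p99.9) when tail latency matters.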

Cold vs hot tiers across runs. By default, compiled Wasm code lives only as long as the process: nothing carries over between separate node invocations. If you compare two .wasm files by running node twice, each run starts cold and pays its own warmup, so neither is penalized nor helped by the other, but the two runs may also see different machine conditions (thermal state, background load). Keep the harness in one process and benchmark both variants back to back if you want apples-to-apples.
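A same-process comparison can be sketched as follows: instantiate every variant, warm them all, then interleave timed rounds so that drift in machine state affects each variant equally. `compareVariants` and its defaults are hypothetical, and the inline `ADD_ONE` module (an exported `run` that returns its argument plus one) only stands in for your real binaries so the example runs on its own.

```javascript
// Stand-in kernel: (func (export "run") (param i32) (result i32) local.get 0, i32.const 1, i32.add)
const ADD_ONE = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,
  0x03, 0x02, 0x01, 0x00,
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x00,
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x41, 0x01, 0x6a, 0x0b,
]);

function instantiate(bytes) {
  return new WebAssembly.Instance(new WebAssembly.Module(bytes), {}).exports.run;
}

function medianNsPerCall(run, calls) {
  const samples = new Float64Array(calls);
  let sink = 0;
  for (let i = 0; i < calls; i++) {
    const t0 = process.hrtime.bigint();
    sink ^= run(i);
    samples[i] = Number(process.hrtime.bigint() - t0);
  }
  samples.sort();
  return { median: samples[Math.floor(calls / 2)], sink };
}

// variants: { name: Uint8Array }. Returns { name: median of per-round medians }.
function compareVariants(variants, { warmup = 10_000, rounds = 5, calls = 20_000 } = {}) {
  const kernels = Object.entries(variants).map(([name, bytes]) => ({
    name, run: instantiate(bytes), medians: [],
  }));
  for (const k of kernels) for (let i = 0; i < warmup; i++) k.run(i);
  for (let r = 0; r < rounds; r++) {
    for (const k of kernels) k.medians.push(medianNsPerCall(k.run, calls).median); // interleaved
  }
  return Object.fromEntries(kernels.map((k) => {
    const s = [...k.medians].sort((a, b) => a - b);
    return [k.name, s[Math.floor(s.length / 2)]];
  }));
}

console.log(compareVariants({ a: ADD_ONE, b: ADD_ONE }, { rounds: 3, calls: 5000 }));
```

In real use you would pass the bytes of kernel.O2.wasm and kernel.O3.wasm instead of the stand-in. Running the same binary under two names, as in the example, is a useful self-test: if the two medians differ noticeably, the difference is a measure of your harness noise, not of the optimizer.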

Verification

Before publishing a number, prove the artifact is what you think it is:

# instruction census — confirm -O3 actually changed the body
wasm-opt kernel.O0.wasm --metrics -o /dev/null
wasm-opt kernel.O3.wasm --metrics -o /dev/null

# disassemble the hot function and confirm the loop survived
wasm-objdump -d kernel.O3.wasm | grep -A 20 'func\[.*run'

# structural sanity
wasm-validate kernel.O3.wasm

If --metrics shows the same instruction counts for -O0 and -O3, the optimizer had nothing to do (or you optimized the wrong file). If wasm-objdump shows your loop replaced by a single constant return, DCE/precompute folded it — your benchmark is measuring nothing. These two checks catch the majority of bogus benchmark results before they reach a slide deck.

Frequently Asked Questions

Why report the median and p95 instead of the average? Microbenchmarks are contaminated by rare, large outliers — a GC pause, a scheduler preemption, a thermal throttle event. These skew the arithmetic mean by an unpredictable amount but barely move the median. The median answers “what does a typical iteration cost?” and the p95 answers “how bad is the tail?”. The mean answers neither reliably, which is why it is the wrong default for latency data.

Should I include WebAssembly.instantiate time in the benchmark? Only if startup is what you are measuring. Instantiation (compile + link) is a one-time cost that has nothing to do with steady-state throughput, so for a compute benchmark you instantiate once before the timed loop. For a cold-start benchmark — “how fast can I go from bytes to first result?” — you measure exactly that and label it separately. Conflating the two is the most common way Wasm looks artificially slow.

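Is performance.now() accurate enough for nanosecond-scale work? No. Browsers deliberately coarsen performance.now() as a Spectre mitigation: Chrome, for example, clamps it to 100 µs by default and to 5 µs in cross-origin-isolated contexts, so a single 5 ns iteration is unmeasurable. Time a batch of, say, 100,000 iterations and divide. In Node, process.hrtime.bigint() reports nanoseconds, so per-iteration timing is viable for kernels that run for hundreds of nanoseconds; for smaller kernels the cost of the two clock reads still dominates and batching remains the better choice.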

Why does my Wasm benchmark get faster the second time I run the function? Tiering. The engine first runs your module in a fast-to-compile baseline tier, then recompiles hot functions in an optimizing tier in the background. The transition can take a few hundred to a few thousand calls, which is exactly why the warmup phase exists — to reach steady state before any timed sample is recorded.

Does wasm-opt -O3 always beat -O2 on throughput? No. -O3 adds more aggressive (and slower-to-run) passes like extra inlining, but on bandwidth-bound or already-simple kernels the gain is in the noise while the binary grows. Treat -O2 as the default and only adopt -O3 when your harness shows a real, repeatable win on your kernel.
