Wasm Optimization Flags & Size Reduction
A WebAssembly module that runs fast still hurts users if it ships fat. Every kilobyte of .wasm
is bytes to download, parse, and compile before a single export runs, and an unoptimized release
build routinely carries 30–60% dead weight: DWARF debug sections, the name custom section,
unused library functions, panic-formatting machinery, and verbose instruction sequences LLVM never
cleaned up. Size reduction is a disciplined two-stage pipeline — get the compiler to emit less, then
let a Wasm-aware tool rewrite what remains — followed by transport compression. This guide walks the
whole chain with concrete byte counts at each step.
The reason this is a pipeline rather than a single switch is that each tool sees a different scope.
The compiler frontend reasons about your source and its dependencies but emits for an abstract
machine; the LLVM backend lowers to WebAssembly but optimizes conservatively because it cannot prove
what the host will and will not call; wasm-opt sees the whole finished binary and can delete
anything genuinely unreachable; and the HTTP layer knows nothing about Wasm but compresses byte
redundancy the previous stages left behind. No single stage subsumes the others, so the discipline is
to run all of them and to measure between each, attributing every saved kilobyte to a specific cause.
Below, every number is from one representative ~200 KB Rust compute module — your absolutes will
differ, but the direction and magnitude of each transform are what transfer.
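The "measure between each stage" discipline can be kept honest with a few lines of bookkeeping. The sketch below is one possible way to do it, not a prescribed tool: the function name attributeSavings and the stage labels are illustrative assumptions, and the byte counts are simply the ones quoted later in this guide.

```javascript
// Attribute every saved byte to the stage that removed it.
// Stage names are illustrative labels; sizes are the byte counts quoted in this guide.
function attributeSavings(stages) {
  const raw = stages[0].bytes;
  return stages.slice(1).map((stage, i) => {
    const previous = stages[i].bytes; // size before this stage ran
    const saved = previous - stage.bytes;
    return {
      stage: stage.name,
      bytes: stage.bytes,
      saved,
      pctOfPrevious: Number(((saved / previous) * 100).toFixed(1)),
      pctOfRaw: Number(((saved / raw) * 100).toFixed(1)),
    };
  });
}

const report = attributeSavings([
  { name: "raw release build", bytes: 198304 },
  { name: "wasm-opt -Oz --converge", bytes: 162992 },
  { name: "wasm-strip", bytes: 128784 },
  { name: "brotli -q 11", bytes: 44102 },
]);
console.table(report);
```

Reporting each saving both against the previous stage and against the raw build makes it harder to double-count a win or to credit the wrong tool for it.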
Prerequisites
- [ ] wasm-opt from Binaryen ≥ 116 (wasm-opt --version): the post-compilation optimizer
- [ ] twiggy ≥ 0.7 (cargo install twiggy): code-size profiler that attributes bytes to functions
- [ ] wasm-strip and wasm-objdump from WABT ≥ 1.0.34 (wasm-strip --version)
- [ ] A toolchain that emits raw .wasm: Rust + wasm-pack ≥ 0.13, or Emscripten ≥ 3.1
- [ ] gzip and brotli CLIs for measuring transfer size
- [ ] ls -l / wc -c for byte-accurate before/after measurement
The size-reduction pipeline
Reducing a binary is a sequence, not a single flag. The compiler produces the first artifact; a Wasm-to-Wasm optimizer rewrites it; a stripper drops metadata; and finally the HTTP layer compresses the bytes on the wire. Each stage targets a different class of waste, so skipping one leaves savings on the table.
Two tools sit beside the pipeline as measurement, not transformation: twiggy tells you what is large
before you optimize, and wasm-objdump -h confirms which sections survived afterward. Never optimize
blind: measure, change one stage, measure again.
Step-by-step workflow
The example is a Rust crate, but the post-compile stages (steps 4–7) apply identically to Emscripten
output. The toolchain that produces the raw .wasm is covered in the
Rust to Wasm compilation guide;
here we focus on shrinking the artifact it emits.
1. Configure the release profile for size
Tell LLVM to optimize for size and strip the heaviest sources of bloat at the compiler stage. In
Cargo.toml:
[profile.release]
opt-level = "z" # equivalent to clang -Oz: density over speed
lto = "thin" # cross-crate dead-code elimination, fast link
codegen-units = 1 # one unit lets LTO see everything (smaller, slower build)
panic = "abort" # drops unwinding tables + panic-fmt machinery: ~8–15 KB
strip = true # removes symbol + debug info at link time
panic = "abort" is the single biggest source-level win for typical Rust modules — the unwinding
machinery and its formatting strings are pure overhead in a sandboxed module that cannot catch a
panic anyway. opt-level = "z" is the size-first mode; "s" is a slightly larger, slightly faster
sibling discussed under optimization flags & tradeoffs.
codegen-units = 1 is subtle: splitting a crate into many codegen units lets the compiler
parallelize, but each unit optimizes in isolation, so functions duplicated across units never get
merged. Forcing a single unit makes the build slower but gives lto = "thin" a complete view, which
is exactly what you want for a release artifact you build once and ship many times.
For C and C++ the equivalent knobs live on the Emscripten link line rather than in a manifest. The
size-first level is -Oz, -flto enables link-time DCE across translation units, and the JavaScript
glue that Emscripten emits has its own size flags — -s ASSERTIONS=0 to drop runtime checks, -g0
to strip symbols, and --closure 1 to minify the wrapper. Those glue-code controls and the matching
EXPORTED_FUNCTIONS whitelist are covered in the
C/C++ to Wasm with Emscripten
guide; the .wasm-shrinking steps below (4–7) apply to its output unchanged.
2. Produce the raw release binary
wasm-pack build --target web --release --out-dir pkg
ls -l pkg/*_bg.wasm
# -rw-r--r-- 1 dev dev 198304 Jun 21 10:02 pkg/app_bg.wasm # ~194 KB raw
Record this number. Every later step is judged against it. For a non-trivial module that touches
std, 150–250 KB raw is typical even after the profile tuning above.
3. Profile where the bytes live
Before reaching for wasm-opt, find out what is large. twiggy top attributes shallow and
retained size to each function and section:
twiggy top -n 12 pkg/app_bg.wasm
Shallow Bytes │ Shallow % │ Item
───────────────┼───────────┼────────────────────────────────────
41 280 │ 20.8 % │ "function names" subsection
18 944 │ 9.5 % │ data[0]
12 110 │ 6.1 % │ core::fmt::Formatter::pad
9 633 │ 4.9 % │ ::fmt
... │ ... │ ...
That 20.8% in the name subsection is metadata you will strip in step 5. Large core::fmt entries
signal a stray format! or Debug derive pulling in formatting code — fixing the source is worth
more than any flag. This is the step beginners skip and experienced engineers never skip: a tool can only delete code
that is unreachable, but twiggy shows you code that is reachable yet should not be there. A single
println!-style debug call or a #[derive(Debug)] on a hot type can drag core::fmt into the
binary and add tens of kilobytes that no wasm-opt pass will touch, because the call site keeps it
alive. Reading the profile before optimizing turns “shrink the binary” from guesswork into a ranked
to-do list.
twiggy also exposes retained size with twiggy dominators, which attributes to each function not
just its own bytes but everything only it keeps alive. A 200-byte function that is the sole caller of
a 9 KB formatting tree shows up small under top but huge under dominators — and deleting that one
call site reclaims all 9 KB. Run both views before deciding what to cut.
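The difference between the two views is easy to see in a toy model. The sketch below is not how twiggy is implemented; it assumes the dominator tree is already known and handed in as nested objects, and the item names and sizes are hypothetical, chosen to match the 200-byte caller and 9 KB formatting tree described above.

```javascript
// Toy model of shallow vs. retained size. This is NOT twiggy's implementation:
// it assumes the dominator tree is already known and given as nested objects,
// where each child is kept alive only by its parent.
function retainedSize(node) {
  const children = node.children ?? [];
  return children.reduce((sum, child) => sum + retainedSize(child), node.shallow);
}

// Hypothetical items: a 200-byte function that is the sole caller of a 9 KB formatting tree.
const logEvent = {
  name: "app::log_event",
  shallow: 200,
  children: [
    {
      name: "core::fmt::write",
      shallow: 5120,
      children: [{ name: "core::fmt::Formatter::pad", shallow: 4096 }],
    },
  ],
};

console.log(logEvent.shallow);       // what a shallow ranking would count
console.log(retainedSize(logEvent)); // what deleting the call site would reclaim
```

Sorting by the second number rather than the first is what surfaces small functions that anchor large subtrees.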
4. Run wasm-opt for peephole optimization and DCE
Binaryen’s wasm-opt performs Wasm-specific rewrites LLVM cannot — block merging, local coalescing,
and whole-module dead-code elimination across the final binary. The detailed pass tuning lives in the
companion guide below; the canonical size invocation is:
wasm-opt pkg/app_bg.wasm \
-Oz \
--converge \
--strip-producers \
-o pkg/app.opt.wasm
ls -l pkg/app.opt.wasm
# 162992 bytes → ~18% smaller than the 198304-byte raw input
--converge re-runs the pass pipeline until size stops dropping (usually 2–3 iterations, a further
3–7% over a single pass). --strip-producers deletes the producers custom section that records
toolchain versions.
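The run-until-it-stops-shrinking behavior described for --converge is a plain fixed-point loop. The following is only a model of that idea, not Binaryen's code: optimizeOnce is a hypothetical stand-in for one full run of the pass pipeline, and the intermediate sizes in the fake pass are invented for illustration.

```javascript
// Model of a run-until-fixed-point loop, as --converge is described above.
// `optimizeOnce` is a hypothetical stand-in for one full pass pipeline:
// it takes a size in bytes and returns the size after one more run.
function converge(optimizeOnce, inputBytes, maxIterations = 20) {
  let size = inputBytes;
  let iterations = 0;
  while (iterations < maxIterations) {
    const next = optimizeOnce(size);
    iterations += 1;
    if (next >= size) break; // no further shrink: stop
    size = next;
  }
  return { size, iterations };
}

// Fake pass with invented intermediate sizes that bottoms out at 162992 bytes.
const steps = new Map([
  [198304, 170000],
  [170000, 164000],
  [164000, 162992],
]);
const fakePass = (bytes) => steps.get(bytes) ?? bytes;

console.log(converge(fakePass, 198304));
```

Note that the final iteration does no useful work: it exists only to confirm that nothing shrank, which is part of why converging costs more build time than a single pass.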
-Oz here is not the same -Oz you passed the compiler — it is Binaryen’s own meta-pass that
expands to a fixed sequence of Wasm-level transforms. You can print that sequence with
wasm-opt --print-passes -Oz to see exactly what runs and in what order, which matters when you need
to reproduce a build in CI or bisect a pass that produces a broken binary. The transforms most
responsible for the reduction are dce (whole-module dead-code elimination), vacuum (removes
no-op and unreachable code), merge-blocks (collapses redundant control flow), and coalesce-locals
(reuses local slots so the locals declaration shrinks). The deep mechanics of each pass and how to
read the resulting Binaryen IR are covered in the focused companion guide linked below.
5. Strip remaining metadata
wasm-opt keeps the name section only when asked to preserve debug info (for example with -g). If it
is still present after step 4, drop it for a production artifact:
wasm-strip pkg/app.opt.wasm # removes name + remaining custom sections in place
ls -l pkg/app.opt.wasm
# 128784 bytes → 34208 bytes smaller; the name subsection twiggy flagged is gone
Equivalently, fold it into the wasm-opt call with --strip-debug. Keep an unstripped copy in your
build artifacts so you can still symbolicate stack traces from production reports.
6. Decide on extra feature passes
If your module uses post-MVP features, wasm-opt must be told they are allowed or it will refuse to
optimize and may even error. The most common is bulk memory (memory.copy / memory.fill), which
shrinks memcpy-heavy code:
wasm-opt pkg/app_bg.wasm -Oz --enable-bulk-memory --converge -o pkg/app.opt.wasm
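The flag above only governs the build side; whether a given runtime accepts the result is a separate question. One common way to check is to validate a tiny probe module that uses a bulk-memory instruction. The sketch below hand-assembles such a probe; the function names and the fallback file name /pkg/app.mvp.opt.wasm are assumptions made for illustration, not part of any toolchain.

```javascript
// A 38-byte module whose only function executes memory.fill (opcode 0xFC 0x0B).
// Engines without bulk memory reject it at validation time.
const BULK_MEMORY_PROBE = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // "\0asm", version 1
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00,             // type section: () -> ()
  0x03, 0x02, 0x01, 0x00,                         // function section: one function of type 0
  0x05, 0x03, 0x01, 0x00, 0x01,                   // memory section: one memory, min 1 page
  0x0a, 0x0d, 0x01, 0x0b, 0x00,                   // code section: one body, 11 bytes, no locals
  0x41, 0x00, 0x41, 0x00, 0x41, 0x00,             // i32.const 0 three times (dest, value, length)
  0xfc, 0x0b, 0x00,                               // memory.fill on memory 0
  0x0b,                                           // end
]);

function supportsBulkMemory() {
  return typeof WebAssembly === "object" && WebAssembly.validate(BULK_MEMORY_PROBE);
}

// Hypothetical file names: serve the bulk-memory build only when the engine accepts it.
function pickBuild(bulkMemoryOk) {
  return bulkMemoryOk ? "/pkg/app.opt.wasm" : "/pkg/app.mvp.opt.wasm";
}

console.log(pickBuild(supportsBulkMemory()));
```

For browser-only deployments this check is usually unnecessary; it is mainly useful when the same module also targets older or embedded runtimes.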
7. Compress for transport
The bytes that hit the network are the compressed bytes. Measure both encodings — brotli at
quality 11 typically beats gzip -9 by 15–25% on Wasm:
gzip -9 -k -c pkg/app.opt.wasm | wc -c # 54213 → gzip
brotli -q 11 -c pkg/app.opt.wasm | wc -c # 44102 → brotli, ~19% smaller
Serve .wasm.br with Content-Encoding: br and the correct Content-Type: application/wasm so the
browser can still use instantiateStreaming. A misconfigured MIME type silently disables streaming
and roughly doubles startup latency.
Precompress at build time rather than per request. brotli -q 11 is slow — hundreds of milliseconds
on a large binary — but a static .wasm’s contents never change between requests, so paying that cost
once at build and serving the precomputed .br gives you brotli’s full ratio with none of its
runtime cost. On-the-fly compression at the edge almost always falls back to a lower quality level
(commonly -q 4 or -q 5) to stay fast, giving up a meaningful chunk of the savings. Treat the
compressed artifact as a build output and hash it into your asset filenames so the CDN caches it
immutably.
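Serving precompressed files means the server has to choose a variant per request. A minimal sketch of that choice follows, assuming the build wrote .br and .gz files next to the original; the function name pickWasmVariant is an illustrative assumption, and the parsing is deliberately simplified.

```javascript
// Choose a precompressed variant from the request's Accept-Encoding header.
// Simplified: it ignores q-values, which a production server should honor.
function pickWasmVariant(acceptEncoding, basePath) {
  const accepted = new Set(
    (acceptEncoding ?? "")
      .split(",")
      .map((token) => token.split(";")[0].trim().toLowerCase())
      .filter(Boolean)
  );
  // Content-Type stays application/wasm for every variant so streaming compilation keeps working.
  const headers = { "Content-Type": "application/wasm", Vary: "Accept-Encoding" };
  if (accepted.has("br")) {
    return { path: `${basePath}.br`, headers: { ...headers, "Content-Encoding": "br" } };
  }
  if (accepted.has("gzip")) {
    return { path: `${basePath}.gz`, headers: { ...headers, "Content-Encoding": "gzip" } };
  }
  return { path: basePath, headers };
}

console.log(pickWasmVariant("gzip, deflate, br", "/pkg/app.opt.wasm"));
```

Most static hosts and CDNs can do this negotiation through configuration, so hand-written logic like this is mainly relevant for custom servers.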
A binding & loading example
Optimization is worthless if the optimized binary fails to load. Stream-instantiate it with an
explicit import object and verify the export you expect survived the DCE passes:
async function loadOptimized(url, imports = {}) {
  const resp = await fetch(url, { headers: { Accept: "application/wasm" } });
  if (!resp.ok) throw new Error(`fetch failed: ${resp.status}`);
  // instantiateStreaming compiles while the body downloads: the main payoff of a small .wasm
  const { instance } = await WebAssembly.instantiateStreaming(resp, imports);
  if (typeof instance.exports.process_batch !== "function") {
    throw new Error("process_batch was stripped; mark it exported in source");
  }
  return instance.exports;
}
const wasm = await loadOptimized("/pkg/app.opt.wasm");
The guard matters: aggressive --remove-unused-module-elements and -Oz will eliminate any function
that is not reachable from an export, so a symbol you call only from JavaScript must be exported in
the source (#[wasm_bindgen] or Emscripten’s EXPORTED_FUNCTIONS) or it disappears.
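When the host calls more than one export, the single-name guard generalizes to a list. The helper below is a sketch under that assumption: missingExports and assertExports are illustrative names, process_batch comes from the loader above, and reset_state is a hypothetical second export.

```javascript
// Return the names from `required` that are missing or not callable on an exports object.
function missingExports(exportsObject, required) {
  return required.filter((name) => typeof exportsObject[name] !== "function");
}

// Throw one error that names every missing export, instead of failing on the first call.
function assertExports(exportsObject, required) {
  const missing = missingExports(exportsObject, required);
  if (missing.length > 0) {
    throw new Error(`stripped or unexported: ${missing.join(", ")}`);
  }
  return exportsObject;
}

// Stand-in for instance.exports; "reset_state" is a hypothetical second export.
const fakeExports = { process_batch: () => 0, memory: {} };
console.log(missingExports(fakeExports, ["process_batch", "reset_state"]));
```

Keeping the required list in one place also gives CI something concrete to check the optimized artifact against.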
Optimization flags & tradeoffs
The optimization level sets the balance between size and execution speed. Numbers below are representative of a ~200 KB compute module; absolute values vary, but the ordering is stable.
| Level | Intent | Relative size | Relative speed | When to use |
|---|---|---|---|---|
| -O2 | Balanced default | baseline | baseline | General release builds where you have not yet measured |
| -O3 | Max speed | +8–15% larger | fastest tight loops | Compute-bound kernels (physics, codecs, ML) where size is secondary |
| -Os | Speed, then size | −5–10% vs -O2 | within ~3% of -O2 | Frontend modules wanting smaller bytes with little speed cost |
| -Oz | Size above all | −12–20% vs -O2 | can regress hot loops 5–15% | Strict bundle budgets; disables loop unrolling and some inlining |
Beyond the level, individual passes and flags each remove a distinct class of bytes:
- --strip-debug removes DWARF sections. On a debug-info-heavy build this is the largest single reduction, often 30–50%, but it ends source-level debugging in DevTools.
- --dce / --remove-unused-module-elements drop unreachable functions, globals, and imports. Effective only for symbols not reachable from an export; anything exported is kept.
- --enable-bulk-memory lets the optimizer lower memcpy / memset to single memory.copy / memory.fill instructions, shrinking byte-shuffling code and speeding it up. Requires the target runtime to support bulk memory (all current browsers do).
- gzip vs brotli: gzip -9 is universal and fast to produce; brotli -q 11 is 15–25% smaller on Wasm but slower to compress. Precompress at build time and serve the static .br, so the encode cost is paid once, not per request.
The headline tradeoff: -Oz minimizes bytes but can slow tight loops by disabling unrolling, while
-O3 does the reverse. Pick per-module by profiling, not by reflex — measure the loop with the
Wasm performance benchmarking
harness before assuming -Oz is free.
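Once each level has been built and benchmarked, "pick per-module by profiling" reduces to a small selection rule. The sketch below shows one such rule, the fastest build that fits a compressed-size budget; the function name pickLevel is an assumption, and every number in the sample data is a made-up placeholder rather than a measurement.

```javascript
// Pick the fastest build that fits a compressed-size budget.
// All numbers below are made-up placeholders, not measurements.
function pickLevel(candidates, budgetBytes) {
  const fitting = candidates.filter((c) => c.compressedBytes <= budgetBytes);
  if (fitting.length === 0) return null; // nothing fits: the budget or the code must change
  return fitting.reduce((best, c) => (c.msPerCall < best.msPerCall ? c : best));
}

const measured = [
  { level: "-O3", compressedBytes: 52000, msPerCall: 4.1 },
  { level: "-Os", compressedBytes: 45000, msPerCall: 4.4 },
  { level: "-Oz", compressedBytes: 42000, msPerCall: 4.9 },
];

console.log(pickLevel(measured, 46000));
```

Treating size as a hard constraint and speed as the objective is a design choice; a module with a latency target instead would swap the two roles.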
Linear memory layout is a size lever too
Size is not only code. A module’s declared linear memory affects both the binary and runtime
behavior, and the knobs pull in opposite directions. A large INITIAL_MEMORY makes instantiation
predictable and avoids growth events, but the memory is described in the binary and its data segments
ship with it. A small initial memory plus growth keeps the payload lean, but every memory.grow may
copy the entire buffer to a new region — a 50–200 ms main-thread stall each time. The right answer is
to size initial memory to the steady-state working set: large enough to avoid growth in the common
path, small enough not to inflate the download. For workloads with a fixed buffer size (image tiles,
audio frames) pre-allocate exactly that and never grow at all.
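Sizing memory to a fixed working set is mostly page arithmetic, since WebAssembly memory is allocated in 64 KiB pages. The sketch below works through one hypothetical case, a single 1024x1024 RGBA tile plus an assumed 256 KiB of stack and heap; the helper name pagesFor and the workload figures are illustrative.

```javascript
const WASM_PAGE_BYTES = 65536; // one WebAssembly page is 64 KiB

// Round a byte count up to whole pages, with a floor of one page.
function pagesFor(bytes) {
  return Math.max(1, Math.ceil(bytes / WASM_PAGE_BYTES));
}

// Hypothetical workload: one 1024x1024 RGBA tile plus 256 KiB for stack and heap.
const tileBytes = 1024 * 1024 * 4;
const overheadBytes = 256 * 1024;
const pages = pagesFor(tileBytes + overheadBytes);

// initial === maximum: the buffer is allocated once and can never grow.
const memory = new WebAssembly.Memory({ initial: pages, maximum: pages });

console.log(pages, memory.buffer.byteLength);
```

Whether the module imports a memory like this or declares its own depends on the toolchain and its link flags; with a module-declared memory the same page count would be set at build time instead.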
Enforce the win with a size budget
Optimization that is not defended regresses. The cheapest guard is a byte-count check in CI that fails
a pull request when the compressed .wasm grows past a threshold, so a careless dependency bump or a
stray Debug derive is caught at review time rather than in production:
CURRENT=$(brotli -q 11 -c pkg/app.opt.wasm | wc -c)
BASELINE=$(cat .wasm-size-baseline) # committed, updated only on approved PRs
DELTA=$((CURRENT - BASELINE))
echo "compressed: $CURRENT baseline: $BASELINE delta: $DELTA"
# if/fi instead of `test && { ...; }`: a false test as the last command would itself fail the step
if [ "$DELTA" -gt 2048 ]; then
  echo "::error::wasm grew ${DELTA}B (>2KB)"
  exit 1
fi
Track the compressed number, not the raw one — it is what users actually download, and a change can shrink raw bytes while growing the compressed payload if it adds low-redundancy data. Wiring this into a Rust pipeline is the subject of setting up CI/CD for Rust + Wasm projects.
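The case where raw and compressed sizes move in opposite directions is worth flagging explicitly, because a raw-only check would report it as a win. The sketch below is one way to express that; compareBuilds is an illustrative name, the 2048-byte tolerance mirrors the shell check above, and the "after" numbers are hypothetical.

```javascript
// Compare two builds on both raw and compressed size.
// The gate is the compressed delta; `diverges` flags raw-down, compressed-up changes.
function compareBuilds(before, after, toleranceBytes = 2048) {
  const rawDelta = after.raw - before.raw;
  const compressedDelta = after.compressed - before.compressed;
  return {
    rawDelta,
    compressedDelta,
    diverges: rawDelta < 0 && compressedDelta > 0,
    pass: compressedDelta <= toleranceBytes,
  };
}

// Hypothetical change: raw shrinks by 884 bytes, but new low-redundancy data compresses badly.
const verdict = compareBuilds(
  { raw: 128784, compressed: 44102 },
  { raw: 127900, compressed: 46900 }
);
console.log(verdict);
```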
Gotchas & failure modes
- wasm-opt errors with Fatal: error in validating input. Almost always an unrecognized feature: the binary uses SIMD, threads, or bulk memory and you did not pass the matching --enable-* flag. Add it (or --all-features while diagnosing) and re-run.
- An export vanishes after optimization. -Oz removed it because nothing reachable called it. Diff the export tables: wasm-objdump -x before.wasm | grep ^Export against the optimized file. Fix by exporting the symbol in source, not by weakening the optimization.
- Streaming silently disabled. If the server sends Content-Type: application/octet-stream, WebAssembly.instantiateStreaming rejects and you fall back to the buffered path, doubling effective load time. Serve application/wasm.
- memory.grow stalls after shrinking initial memory. Trimming INITIAL_MEMORY to cut payload means each growth event copies the whole buffer, a 50–200 ms main-thread stall. Size initial memory to the steady-state working set, not the minimum.
- twiggy reports tiny functions but a huge data segment. Large static data (embedded fonts, lookup tables, include_bytes! assets) is not touched by code optimization: wasm-opt rewrites instructions, not your data. Move the asset out of the binary and fetch it separately, or compress it at the source level, so the .wasm carries logic and the bytes ride the asset pipeline.
- --converge never terminates quickly. On a very large module convergence can take many passes and minutes. That is expected for a once-per-release artifact; for fast iterative dev builds, drop --converge and accept a single -Oz pass, which already captures most of the win.
- The optimized binary is larger than the input. This happens if you re-run wasm-opt on an already-stripped binary with --debug or a feature-enabling flag that forces a more conservative lowering, or if you accidentally pass an optimization level below the compiler’s. Always start from the raw compiler output, not a previously processed file.
Verification
After every change, confirm three things: the bytes actually shrank, the structure is still valid, and the right sections survived.
# 1. Bytes — the only number that ships
ls -l pkg/app_bg.wasm pkg/app.opt.wasm
wc -c pkg/app.opt.wasm
# 2. Structure — fail fast on a corrupt rewrite
wasm-validate pkg/app.opt.wasm && echo OK
# 3. Sections — confirm name/DWARF are gone, code/data remain
wasm-objdump -h pkg/app.opt.wasm
Sections:
Type start=0x0000000b end=0x0000002f
Function start=0x00000031 end=0x00000060
Memory start=0x00000062 end=0x00000067
Export start=0x00000069 end=0x00000091
Code start=0x00000095 end=0x0001f4a1
Data start=0x0001f4a3 end=0x0001fb20
# no "name" or ".debug_*" custom sections → strip succeeded
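The same section check can be scripted without WABT by walking the binary's section headers directly: after the 8-byte preamble, each section is an id byte, a LEB128 size, and a payload, and custom sections (id 0) begin with a length-prefixed name. The sketch below implements just that walk; the function names are illustrative and the sample bytes are a hand-built fragment, not a complete valid module.

```javascript
// Read an unsigned LEB128 integer starting at `offset`.
function readU32Leb(bytes, offset) {
  let value = 0;
  let shift = 0;
  let pos = offset;
  for (;;) {
    const byte = bytes[pos++];
    value |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) break;
    shift += 7;
  }
  return { value: value >>> 0, next: pos };
}

// Walk the section headers of a .wasm binary and return the names of all custom sections (id 0).
function listCustomSections(bytes) {
  const names = [];
  let pos = 8; // skip the 4-byte magic and 4-byte version
  while (pos < bytes.length) {
    const id = bytes[pos++];
    const size = readU32Leb(bytes, pos);
    const end = size.next + size.value;
    if (id === 0) {
      const nameLen = readU32Leb(bytes, size.next);
      const nameBytes = bytes.subarray(nameLen.next, nameLen.next + nameLen.value);
      names.push(new TextDecoder().decode(nameBytes));
    }
    pos = end; // jump to the next section header
  }
  return names;
}

const header = [0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00];
const nameSection = [0x00, 0x05, 0x04, 0x6e, 0x61, 0x6d, 0x65]; // empty custom section called "name"
const typeSection = [0x01, 0x04, 0x01, 0x60, 0x00, 0x00];
console.log(listCustomSections(new Uint8Array([...header, ...nameSection, ...typeSection])));
console.log(listCustomSections(new Uint8Array([...header, ...typeSection])));
```

A CI step could fail the build when the returned list contains "name" or any entry starting with ".debug_", which turns the visual check above into an enforced one.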
Then re-run twiggy top on the optimized file to confirm the items you targeted (the name
subsection, stray fmt code) are gone, and that no new surprise dominates the budget.
The fourth check is behavioral: a smaller binary that no longer runs is a regression, not an
optimization. Load the optimized module and exercise its real exports, ideally in the same harness you
use for benchmarking, because some failures — a stripped export, a feature-flag mismatch that produced
a subtly wrong binary — only surface at instantiation or first call, not during wasm-validate. The
loading example above is the minimal form of this check; in CI, run your actual test suite against the
optimized artifact rather than the debug build, so the bytes you test are the bytes you ship.
# Compare export tables to prove nothing the host needs was eliminated
wasm-objdump -x pkg/app_bg.wasm | grep '^Export' | sort > /tmp/pre.txt
wasm-objdump -x pkg/app.opt.wasm | grep '^Export' | sort > /tmp/post.txt
diff /tmp/pre.txt /tmp/post.txt && echo "exports preserved"
In this guide
- Reducing Wasm bundle size with wasm-opt: the exact wasm-opt pass pipeline, build-system integration, and before/after byte counts.
Frequently Asked Questions
Should I optimize at the compiler or with wasm-opt?
Both, in that order. The compiler (opt-level = "z", panic = "abort", lto) removes whole
categories of code at the source level; wasm-opt then performs Wasm-specific rewrites and
whole-module DCE that LLVM cannot do because it targets many architectures. Skipping either leaves
10–20% on the table.
Does -Oz ever make a module slower?
Yes. -Oz disables loop unrolling and trims inlining to save bytes, which can regress a tight numeric
loop by 5–15%. For compute-bound kernels prefer -O3 or -Os and accept the larger binary; for glue
and UI logic -Oz is almost always the right call.
Why is my brotli file barely smaller than gzip?
Either you used a low brotli quality (use -q 11 for static assets) or the binary is already dense —
a well-optimized, stripped .wasm has little redundancy left for either compressor to exploit. The
gap is widest on debug builds full of repetitive metadata.
Do I still need wasm-strip if I pass --strip-debug to wasm-opt?
No — --strip-debug inside wasm-opt and a separate wasm-strip pass do the same job for DWARF and
the name section. Use one or the other; running both is harmless but redundant.
How do I keep source-level debugging while still shipping a small binary?
Build two artifacts from one compile: keep an unstripped .wasm with DWARF for local debugging and
crash symbolication, and ship the stripped, wasm-opt-processed binary to users. Never debug against
the size-optimized file — -Oz reorders and merges code so line numbers no longer map cleanly.
Does compressing make compiler and wasm-opt optimization pointless?
No. Compression removes byte redundancy but cannot delete code that is present; DCE and stripping
remove the code and metadata entirely, so the engine never parses them. The two stack: a stripped,
DCE’d binary compresses to fewer bytes than the fat one, and even at an equal wire size it would
parse faster, because parse time tracks the decompressed code, not the transferred bytes.
Related
- Reducing Wasm bundle size with wasm-opt: the focused wasm-opt pass pipeline and integration.
- Wasm performance benchmarking: measure the speed cost of a size flag before you ship it.
- Rust to Wasm compilation guide: producing the raw .wasm these passes shrink.
- C/C++ to Wasm with Emscripten: linker flags and glue-code stripping for C/C++ modules.