Wasm Optimization Flags & Size Reduction

A WebAssembly module that runs fast still hurts users if it ships fat. Every kilobyte of .wasm is bytes to download, parse, and compile before a single export runs, and an unoptimized release build routinely carries 30–60% dead weight: DWARF debug sections, the name custom section, unused library functions, panic-formatting machinery, and verbose instruction sequences LLVM never cleaned up. Size reduction is a disciplined two-stage pipeline — get the compiler to emit less, then let a Wasm-aware tool rewrite what remains — followed by transport compression. This guide walks the whole chain with concrete byte counts at each step.

The reason this is a pipeline rather than a single switch is that each tool sees a different scope. The compiler frontend reasons about your source and its dependencies but emits for an abstract machine; the LLVM backend lowers to WebAssembly but optimizes conservatively because it cannot prove what the host will and will not call; wasm-opt sees the whole finished binary and can delete anything genuinely unreachable; and the HTTP layer knows nothing about Wasm but compresses byte redundancy the previous stages left behind. No single stage subsumes the others, so the discipline is to run all of them and to measure between each, attributing every saved kilobyte to a specific cause. Below, every number is from one representative ~200 KB Rust compute module — your absolutes will differ, but the direction and magnitude of each transform are what transfer.

Prerequisites

  • [ ] wasm-opt from Binaryen ≥ 116 (wasm-opt --version) — the post-compilation optimizer
  • [ ] twiggy ≥ 0.7 (cargo install twiggy) — code-size profiler that attributes bytes to functions
  • [ ] wasm-strip and wasm-objdump from WABT ≥ 1.0.34 (wasm-strip --version)
  • [ ] A toolchain that emits raw .wasm: Rust + wasm-pack ≥ 0.13, or Emscripten ≥ 3.1
  • [ ] gzip and brotli CLIs for measuring transfer size
  • [ ] ls -l / wc -c for byte-accurate before/after measurement

The size-reduction pipeline

Reducing a binary is a sequence, not a single flag. The compiler produces the first artifact; a Wasm-to-Wasm optimizer rewrites it; a stripper drops metadata; and finally the HTTP layer compresses the bytes on the wire. Each stage targets a different class of waste, so skipping one leaves savings on the table.

flowchart LR
  A["source<br/>(Rust / C++)"] -->|"compile -Oz / opt-level=z"| B[".wasm<br/>raw release"]
  B -->|"wasm-opt -Oz --converge"| C[".wasm<br/>peephole + DCE"]
  C -->|"wasm-strip / --strip-debug"| D[".wasm<br/>no DWARF / name"]
  D -->|"gzip -9 / brotli -q 11"| E[".wasm.br<br/>on the wire"]
  B -.->|"twiggy top"| F["size attribution<br/>per function"]
  D -.->|"wasm-objdump -h"| G["verify sections"]

The dashed branches are measurement, not transformation: twiggy tells you what is large before you optimize, and wasm-objdump -h confirms which sections survived afterward. Never optimize blind — measure, change one stage, measure again.

Step-by-step workflow

The example is a Rust crate, but the post-compile stages (steps 4–7) apply identically to Emscripten output. The toolchain that produces the raw .wasm is covered in the Rust to Wasm compilation guide; here we focus on shrinking the artifact it emits.

1. Configure the release profile for size

Tell LLVM to optimize for size and strip the heaviest sources of bloat at the compiler stage. In Cargo.toml:

[profile.release]
opt-level = "z"      # equivalent to clang -Oz: density over speed
lto = "thin"         # cross-crate dead-code elimination, fast link
codegen-units = 1    # one unit lets LTO see everything (smaller, slower build)
panic = "abort"      # drops unwinding tables + panic-fmt machinery: ~8–15 KB
strip = true         # removes symbol + debug info at link time

panic = "abort" is the single biggest source-level win for typical Rust modules — the unwinding machinery and its formatting strings are pure overhead in a sandboxed module that cannot catch a panic anyway. opt-level = "z" is the size-first mode; "s" is a slightly larger, slightly faster sibling discussed under optimization flags & tradeoffs. codegen-units = 1 is subtle: splitting a crate into many codegen units lets the compiler parallelize, but each unit optimizes in isolation, so functions duplicated across units never get merged. Forcing a single unit makes the build slower but gives lto = "thin" a complete view, which is exactly what you want for a release artifact you build once and ship many times.

For C and C++ the equivalent knobs live on the Emscripten link line rather than in a manifest. The size-first level is -Oz, -flto enables link-time DCE across translation units, and the JavaScript glue that Emscripten emits has its own size flags — -s ASSERTIONS=0 to drop runtime checks, -g0 to strip symbols, and --closure 1 to minify the wrapper. Those glue-code controls and the matching EXPORTED_FUNCTIONS whitelist are covered in the C/C++ to Wasm with Emscripten guide; the .wasm-shrinking steps below (4–7) apply to its output unchanged.

2. Produce the raw release binary

wasm-pack build --target web --release --out-dir pkg
ls -l pkg/*_bg.wasm
# -rw-r--r-- 1 dev dev 198304 Jun 21 10:02 pkg/app_bg.wasm   # ~194 KB raw

Record this number. Every later step is judged against it. For a non-trivial module that touches std, 150–250 KB raw is typical even after the profile tuning above.

3. Profile where the bytes live

Before reaching for wasm-opt, find out what is large. twiggy top attributes shallow and retained size to each function and section:

twiggy top -n 12 pkg/app_bg.wasm
 Shallow Bytes │ Shallow % │ Item
───────────────┼───────────┼────────────────────────────────────
        41 280 │    20.8 % │ "function names" subsection
        18 944 │     9.5 % │ data[0]
        12 110 │     6.1 % │ core::fmt::Formatter::pad
         9 633 │     4.9 % │ ::fmt
         ...   │     ...   │ ...

That 20.8% in the name subsection is metadata you will strip in step 5. Large core::fmt entries signal a stray format! or Debug derive pulling in formatting code, and fixing the source is worth more than any flag. This is the step beginners skip and experienced engineers always run: a tool can only delete code that is unreachable, but twiggy finds code that is reachable yet shouldn't be there. A single println!-style debug call or a #[derive(Debug)] on a hot type can drag core::fmt into the binary and add tens of kilobytes that no wasm-opt pass will touch, because the call site keeps it alive. Reading the profile before optimizing turns "shrink the binary" from guesswork into a ranked to-do list.

twiggy also exposes retained size with twiggy dominators, which attributes to each function not just its own bytes but everything only it keeps alive. A 200-byte function that is the sole caller of a 9 KB formatting tree shows up small under top but huge under dominators — and deleting that one call site reclaims all 9 KB. Run both views before deciding what to cut.
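The difference between the two views can be sketched on a toy call graph. This is a minimal illustration of the idea behind retained size, not twiggy's implementation; the function names, sizes, and edges are hypothetical.

```javascript
// Hypothetical call graph: sizes in bytes, edges mean "calls or references".
const sizes = { process_batch: 1200, log_value: 200, fmt_pad: 6000, fmt_write: 3000, checksum: 400 };
const edges = {
  process_batch: ["log_value", "checksum"],
  log_value: ["fmt_pad"],
  fmt_pad: ["fmt_write"],
  fmt_write: [],
  checksum: [],
};

// Everything reachable from the roots, optionally pretending one node is deleted.
function reachable(roots, removed = null) {
  const seen = new Set();
  const stack = roots.filter((r) => r !== removed);
  while (stack.length > 0) {
    const node = stack.pop();
    if (seen.has(node)) continue;
    seen.add(node);
    for (const next of edges[node] ?? []) {
      if (next !== removed) stack.push(next);
    }
  }
  return seen;
}

// Retained size = bytes that become unreachable if `node` is deleted.
function retainedSize(roots, node) {
  const before = reachable(roots);
  const after = reachable(roots, node);
  let total = 0;
  for (const item of before) {
    if (!after.has(item)) total += sizes[item];
  }
  return total;
}

const roots = ["process_batch"]; // the only export in this toy module
console.log(sizes.log_value);                  // shallow size: 200
console.log(retainedSize(roots, "log_value")); // retained size: 9200
```

Here log_value is small on its own but is the only path to the formatting functions, so removing its call site frees all 9,200 bytes.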

4. Run wasm-opt for peephole optimization and DCE

Binaryen’s wasm-opt performs Wasm-specific rewrites LLVM cannot — block merging, local coalescing, and whole-module dead-code elimination across the final binary. The detailed pass tuning lives in the companion guide below; the canonical size invocation is:

wasm-opt pkg/app_bg.wasm \
  -Oz \
  --converge \
  --strip-producers \
  -o pkg/app.opt.wasm
ls -l pkg/app.opt.wasm
# 162992 bytes  → ~18% smaller than the 198304-byte raw input

--converge re-runs the pass pipeline until size stops dropping (usually 2–3 iterations, a further 3–7% over a single pass). --strip-producers deletes the producers custom section that records toolchain versions.
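The stopping rule can be pictured as a fixed-point loop. The sketch below is a conceptual model only, not Binaryen's code: runPipeline stands in for one full optimization pass, and collapseZeroPairs is a hypothetical toy pass chosen because it needs several rounds to finish.

```javascript
// Repeat the whole pipeline until an iteration no longer shrinks the output.
function optimizeToFixedPoint(bytes, runPipeline, maxIterations = 10) {
  let current = bytes;
  let iterations = 0;
  while (iterations < maxIterations) {
    const next = runPipeline(current);
    iterations += 1;
    if (next.length >= current.length) break; // no further shrinkage: stop
    current = next;
  }
  return { bytes: current, iterations };
}

// Toy pass: one sweep that collapses each adjacent pair of zero bytes into one.
function collapseZeroPairs(bytes) {
  const out = [];
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] === 0 && bytes[i + 1] === 0) i++; // skip the second zero of the pair
    out.push(bytes[i]);
  }
  return out;
}

const result = optimizeToFixedPoint([1, 0, 0, 0, 0, 0, 0, 0, 0, 2], collapseZeroPairs);
console.log(result.bytes, result.iterations); // [ 1, 0, 2 ] 4
```

The last iteration is the one that proves nothing more can be removed, which is why a converging run always costs at least one pass more than the work it found.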

-Oz here is not the same -Oz you passed the compiler. It is Binaryen's own meta-pass, which expands to a fixed sequence of Wasm-level transforms. Setting the environment variable BINARYEN_PASS_DEBUG=1 makes wasm-opt log each pass as it runs, so you can see exactly what runs and in what order, which matters when you need to reproduce a build in CI or bisect a pass that produces a broken binary. The transforms most responsible for the reduction are remove-unused-module-elements (drops functions, globals, and imports that nothing references), dce (removes unreachable code inside functions), vacuum (removes no-op code), merge-blocks (collapses redundant control flow), and coalesce-locals (reuses local slots so the locals declaration shrinks). The deep mechanics of each pass and how to read the resulting Binaryen IR are covered in the focused companion guide linked below.

5. Strip remaining metadata

wasm-opt preserves the name section only when it is run with -g (--debuginfo), which is useful while debugging. For a production artifact, make sure names and any other leftover custom sections are gone:

wasm-strip pkg/app.opt.wasm           # removes name + remaining custom sections in place
ls -l pkg/app.opt.wasm
# 128784 bytes  → ~21% smaller again; the name subsection twiggy flagged is gone

Equivalently, fold it into the wasm-opt call with --strip-debug. Keep an unstripped copy in your build artifacts so you can still symbolicate stack traces from production reports.

6. Decide on extra feature passes

If your module uses post-MVP features, wasm-opt must be told they are allowed, or it rejects the input with a validation error instead of optimizing it. The most common is bulk memory (memory.copy / memory.fill), which shrinks memcpy-heavy code:

wasm-opt pkg/app_bg.wasm -Oz --enable-bulk-memory --converge -o pkg/app.opt.wasm
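Enabling a feature at build time only pays off if the runtime accepts it. One way to check is to validate a tiny hand-assembled module that uses the instruction in question. The sketch below assumes a probe whose single function executes memory.fill; the byte layout is written out by hand for illustration and the function name is hypothetical.

```javascript
// Minimal module whose only function body executes memory.fill (opcode 0xFC 0x0B).
const bulkMemoryProbe = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00,             // type section: () -> ()
  0x03, 0x02, 0x01, 0x00,                         // function section: one func of type 0
  0x05, 0x03, 0x01, 0x00, 0x01,                   // memory section: one memory, min 1 page
  0x0a, 0x0d, 0x01, 0x0b, 0x00,                   // code section: one body, 11 bytes, no locals
  0x41, 0x00, 0x41, 0x00, 0x41, 0x00,             // i32.const 0 three times (dest, value, length)
  0xfc, 0x0b, 0x00,                               // memory.fill on memory 0
  0x0b,                                           // end
]);

// An engine without bulk memory fails validation on the 0xFC 0x0B opcode.
function supportsBulkMemory() {
  return WebAssembly.validate(bulkMemoryProbe);
}

console.log(supportsBulkMemory());
```

A loader could use the result to choose between two builds of the same module, one compiled with bulk memory and one without.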

7. Compress for transport

The bytes that hit the network are the compressed bytes. Measure both encodings — brotli at quality 11 typically beats gzip -9 by 15–25% on Wasm:

gzip -9 -k -c pkg/app.opt.wasm | wc -c     # 54213  → gzip
brotli -q 11 -c pkg/app.opt.wasm | wc -c   # 44102  → brotli, ~19% smaller

Serve .wasm.br with Content-Encoding: br and the correct Content-Type: application/wasm so the browser can still use instantiateStreaming. With any other MIME type instantiateStreaming rejects, and a loader that falls back to buffered instantiation loses the overlap of download and compile, roughly doubling startup latency.
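A loader can make the MIME dependency visible instead of failing on it. The sketch below is one possible shape, under the assumption that a warning plus a buffered fallback is acceptable; isWasmContentType and instantiateFromResponse are hypothetical helper names. The comparison is exact because the streaming API does not accept parameters after application/wasm.

```javascript
// True only for an exact, case-insensitive "application/wasm" (no parameters).
function isWasmContentType(headerValue) {
  if (!headerValue) return false;
  return headerValue.trim().toLowerCase() === "application/wasm";
}

// Takes a Response so the caller controls the fetch; reports which path was used.
async function instantiateFromResponse(resp, imports = {}) {
  const canStream =
    typeof WebAssembly.instantiateStreaming === "function" &&
    isWasmContentType(resp.headers.get("Content-Type"));
  if (canStream) {
    const { instance } = await WebAssembly.instantiateStreaming(resp, imports);
    return { instance, streamed: true };
  }
  // Buffered fallback: the whole body must arrive before compilation starts.
  console.warn("Content-Type is not application/wasm; streaming compilation skipped");
  const bytes = await resp.arrayBuffer();
  const { instance } = await WebAssembly.instantiate(bytes, imports);
  return { instance, streamed: false };
}
```

Logging the streamed flag in production telemetry is a cheap way to notice when a CDN or server change starts sending the wrong header.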

Precompress at build time rather than per request. brotli -q 11 is slow — hundreds of milliseconds on a large binary — but a static .wasm’s contents never change between requests, so paying that cost once at build and serving the precomputed .br gives you brotli’s full ratio with none of its runtime cost. On-the-fly compression at the edge almost always falls back to a lower quality level (commonly -q 4 or -q 5) to stay fast, giving up a meaningful chunk of the savings. Treat the compressed artifact as a build output and hash it into your asset filenames so the CDN caches it immutably.
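Attributing the savings to each stage is simple arithmetic on the byte counts quoted in steps 2, 4, 5, and 7. A small sketch, with percentSaved and report as illustrative helper names:

```javascript
// Byte counts quoted in steps 2, 4, 5, and 7 of this guide.
const stages = [
  { name: "raw release", bytes: 198304 },
  { name: "wasm-opt -Oz --converge", bytes: 162992 },
  { name: "wasm-strip", bytes: 128784 },
  { name: "brotli -q 11", bytes: 44102 },
];

// Percentage removed going from `before` to `after`, to one decimal place.
function percentSaved(before, after) {
  return Math.round(((before - after) / before) * 1000) / 10;
}

// One row per transform: saving against the previous stage and against the raw build.
function report(stages) {
  const rows = [];
  for (let i = 1; i < stages.length; i++) {
    rows.push({
      stage: stages[i].name,
      bytes: stages[i].bytes,
      vsPrevious: percentSaved(stages[i - 1].bytes, stages[i].bytes),
      vsRaw: percentSaved(stages[0].bytes, stages[i].bytes),
    });
  }
  return rows;
}

console.table(report(stages));
```

For these numbers the per-stage savings are 17.8%, 21.0%, and 65.8%, and the wire payload is 77.8% smaller than the raw release binary.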

A binding & loading example

Optimization is worthless if the optimized binary fails to load. Stream-instantiate it with an explicit import object and verify the export you expect survived the DCE passes:

async function loadOptimized(url, imports = {}) {
  const resp = await fetch(url, { headers: { Accept: "application/wasm" } });
  if (!resp.ok) throw new Error(`fetch failed: ${resp.status}`);
  // instantiateStreaming compiles while the body downloads — the main payoff of a small .wasm
  const { instance } = await WebAssembly.instantiateStreaming(resp, imports);
  if (typeof instance.exports.process_batch !== "function") {
    throw new Error("process_batch was stripped — mark it exported in source");
  }
  return instance.exports;
}

const wasm = await loadOptimized("/pkg/app.opt.wasm");

The guard matters: aggressive --remove-unused-module-elements and -Oz will eliminate any function that is not reachable from an export, so a symbol you call only from JavaScript must be exported in the source (#[wasm_bindgen] or Emscripten’s EXPORTED_FUNCTIONS) or it disappears.
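The same check can run at build time, before anything is deployed. The sketch below uses the standard WebAssembly.Module.exports reflection call; missingExports is a hypothetical helper, and the demo bytes are a hand-assembled module that exports a single function named f. It compiles synchronously, so it is meant for a Node build script or CI step rather than a browser main thread.

```javascript
// Names of required exports that are absent from the compiled module.
function missingExports(wasmBytes, required) {
  const module = new WebAssembly.Module(wasmBytes);
  const present = new Set(WebAssembly.Module.exports(module).map((e) => e.name));
  return required.filter((name) => !present.has(name));
}

// Tiny hand-assembled module that exports one function named "f".
const demo = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00,             // type section: () -> ()
  0x03, 0x02, 0x01, 0x00,                         // function section: one func of type 0
  0x07, 0x05, 0x01, 0x01, 0x66, 0x00, 0x00,       // export section: "f" -> func 0
  0x0a, 0x04, 0x01, 0x02, 0x00, 0x0b,             // code section: one empty body
]);

console.log(missingExports(demo, ["f", "process_batch"])); // [ 'process_batch' ]
```

In a real build the bytes would come from reading the optimized .wasm file, and a non-empty result would fail the job.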

Optimization flags & tradeoffs

The optimization level sets the balance between size and execution speed. Numbers below are representative of a ~200 KB compute module; absolute values vary, but the ordering is stable.

Level │ Intent           │ Relative size  │ Relative speed              │ When to use
──────┼──────────────────┼────────────────┼─────────────────────────────┼────────────────────────────────────────────────────────────────────
-O2   │ Balanced default │ baseline       │ baseline                    │ General release builds where you have not yet measured
-O3   │ Max speed        │ +8–15% larger  │ fastest tight loops         │ Compute-bound kernels (physics, codecs, ML) where size is secondary
-Os   │ Speed, then size │ −5–10% vs -O2  │ within ~3% of -O2           │ Frontend modules wanting smaller bytes with little speed cost
-Oz   │ Size above all   │ −12–20% vs -O2 │ can regress hot loops 5–15% │ Strict bundle budgets; disables loop unrolling and some inlining

Beyond the level, individual passes and flags each remove a distinct class of bytes:

  • --strip-debug removes DWARF sections. On a debug-info-heavy build this is the largest single reduction — often 30–50% — but it ends source-level debugging in DevTools.
  • --remove-unused-module-elements drops functions, globals, and imports that nothing reachable from an export refers to; --dce removes unreachable code inside the functions that remain. Anything exported is kept.
  • --enable-bulk-memory permits memory.copy / memory.fill in the module. The compiler is what lowers memcpy/memset to those single instructions (when the bulk-memory target feature is on), shrinking byte-shuffling code and speeding it up; wasm-opt needs the flag to accept and optimize the result. Requires the target runtime to support bulk memory (all current browsers do).
  • gzip vs brotli: gzip -9 is universal and fast to produce; brotli -q 11 is 15–25% smaller on Wasm but slower to compress. Precompress at build time and serve the static .br, so the encode cost is paid once, not per request.

The headline tradeoff: -Oz minimizes bytes but can slow tight loops by disabling unrolling, while -O3 does the reverse. Pick per-module by profiling, not by reflex — measure the loop with the Wasm performance benchmarking harness before assuming -Oz is free.

Linear memory layout is a size lever too

Size is not only code. A module's declared linear memory affects both the binary and runtime behavior, and the knobs pull in opposite directions. A large INITIAL_MEMORY makes instantiation predictable and avoids growth events, but every instance commits that memory up front, and any static data placed in it ships in the binary as data segments (the declaration itself is only a page count). A small initial memory plus growth keeps the footprint lean, but every memory.grow may copy the entire buffer to a new region, a 50–200 ms main-thread stall each time. The right answer is to size initial memory to the steady-state working set: large enough to avoid growth in the common path, small enough not to waste memory on every load. For workloads with a fixed buffer size (image tiles, audio frames), pre-allocate exactly that and never grow at all.
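Sizing to a fixed working set comes down to rounding up to whole 64 KiB pages. A minimal sketch, assuming the working set is a single 1920×1080 RGBA frame (a hypothetical figure; a real module also needs room for its stack, statics, and heap):

```javascript
const WASM_PAGE_BYTES = 65536; // Wasm linear memory is sized in 64 KiB pages

// Smallest whole number of pages that holds `bytes`.
function pagesFor(bytes) {
  return Math.ceil(bytes / WASM_PAGE_BYTES);
}

// Hypothetical fixed working set: one 1920x1080 RGBA frame.
const frameBytes = 1920 * 1080 * 4; // 8,294,400 bytes
const pages = pagesFor(frameBytes); // 127 pages

// initial === maximum: the buffer is allocated once and can never grow.
const memory = new WebAssembly.Memory({ initial: pages, maximum: pages });
console.log(pages, memory.buffer.byteLength); // 127 8323072
```

Setting maximum equal to initial turns an accidental growth into an immediate, visible failure instead of an occasional stall.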

Enforce the win with a size budget

Optimization that is not defended regresses. The cheapest guard is a byte-count check in CI that fails a pull request when the compressed .wasm grows past a threshold, so a careless dependency bump or a stray Debug derive is caught at review time rather than in production:

CURRENT=$(brotli -q 11 -c pkg/app.opt.wasm | wc -c)
BASELINE=$(cat .wasm-size-baseline)      # committed, updated only on approved PRs
DELTA=$((CURRENT - BASELINE))
echo "compressed: $CURRENT  baseline: $BASELINE  delta: $DELTA"
# Use if/fi: a bare `[ ... ] && { ... }` exits non-zero when the size is within budget
if [ "$DELTA" -gt 2048 ]; then
  echo "::error::wasm grew ${DELTA}B (>2KB)"
  exit 1
fi

Track the compressed number, not the raw one — it is what users actually download, and a change can shrink raw bytes while growing the compressed payload if it adds low-redundancy data. Wiring this into a Rust pipeline is the subject of setting up CI/CD for Rust + Wasm projects.
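The raw-versus-compressed divergence is easy to reproduce with Node's built-in zlib. In the sketch below the two buffers are synthetic stand-ins, not real modules: a repetitive 12 KiB table and 8 KiB of pseudo-random bytes. The names transferSizes and noisyBytes are illustrative.

```javascript
const zlib = require("zlib");

// Brotli at maximum quality and gzip at level 9, matching the CLI flags used in this guide.
function transferSizes(buf) {
  const brotli = zlib.brotliCompressSync(buf, {
    params: { [zlib.constants.BROTLI_PARAM_QUALITY]: zlib.constants.BROTLI_MAX_QUALITY },
  });
  const gzip = zlib.gzipSync(buf, { level: 9 });
  return { raw: buf.length, gzip: gzip.length, brotli: brotli.length };
}

// Deterministic pseudo-random bytes (a simple LCG) standing in for low-redundancy data.
function noisyBytes(length, seed = 1) {
  const out = Buffer.alloc(length);
  let state = seed >>> 0;
  for (let i = 0; i < length; i++) {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    out[i] = state >>> 24; // use the high byte, which is the least predictable
  }
  return out;
}

const before = Buffer.alloc(12 * 1024, "lookup-table-row;"); // 12 KiB, highly repetitive
const after = noisyBytes(8 * 1024);                          // 8 KiB, nearly incompressible

console.log("before:", transferSizes(before));
console.log("after: ", transferSizes(after));
```

The second buffer is a third smaller on disk yet far larger once compressed, which is exactly the regression a raw-size budget would miss.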

Gotchas & failure modes

  • wasm-opt errors with Fatal: error in validating input — almost always an unrecognized feature. The binary uses SIMD, threads, or bulk memory and you did not pass the matching --enable-* flag. Add it (or --all-features while diagnosing) and re-run.
  • A function you call from JavaScript vanishes after optimization. It was never an export, so DCE treated it as unreachable and removed it; anything that is exported is kept. List the export table with wasm-objdump -x -j Export before.wasm and compare it against the optimized file. Fix by exporting the symbol in source, not by weakening the optimization.
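  • Streaming disabled by the wrong MIME type. If the server sends Content-Type: application/octet-stream, WebAssembly.instantiateStreaming rejects with a TypeError. Loaders that catch it and fall back to the buffered path keep working but lose the overlap of download and compile, which can roughly double effective load time. Serve application/wasm.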
  • memory.grow stalls after shrinking initial memory. Trimming INITIAL_MEMORY too far forces the module to grow at runtime, and each growth event may copy the whole buffer: 50–200 ms main-thread stalls. Size initial memory to the steady-state working set, not the minimum.
  • twiggy reports tiny functions but a huge data segment. Large static data (embedded fonts, lookup tables, include_bytes! assets) is not touched by code optimization — wasm-opt rewrites instructions, not your data. Move the asset out of the binary and fetch it separately, or compress it at the source level, so the .wasm carries logic and the bytes ride the asset pipeline.
  • --converge takes minutes on a very large module. Convergence can need many iterations of the full pass pipeline. That is expected for a once-per-release artifact; for fast iterative dev builds, drop --converge and accept a single -Oz pass, which already captures most of the win.
  • The optimized binary is larger than the input. This happens if you re-run wasm-opt on an already-stripped binary with --debug or a feature-enabling flag that forces a more conservative lowering, or if you accidentally pass an optimization level below the compiler’s. Always start from the raw compiler output, not a previously processed file.

Verification

After every change, confirm three things: the bytes actually shrank, the structure is still valid, and the right sections survived.

# 1. Bytes — the only number that ships
ls -l pkg/app_bg.wasm pkg/app.opt.wasm
wc -c pkg/app.opt.wasm

# 2. Structure — fail fast on a corrupt rewrite
wasm-validate pkg/app.opt.wasm && echo OK

# 3. Sections — confirm name/DWARF are gone, code/data remain
wasm-objdump -h pkg/app.opt.wasm
Sections:
     Type start=0x0000000b end=0x0000002f
 Function start=0x00000031 end=0x00000060
   Memory start=0x00000062 end=0x00000067
   Export start=0x00000069 end=0x00000091
     Code start=0x00000095 end=0x0001f4a1
     Data start=0x0001f4a3 end=0x0001fb20
# no "name" or ".debug_*" custom sections → strip succeeded

Then re-run twiggy top on the optimized file to confirm the items you targeted (the name subsection, stray fmt code) are gone, and that no new surprise dominates the budget.
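The section check can also be automated without external tools, since a .wasm file is just an 8-byte header followed by (id, size, payload) records. The sketch below walks those records; it is a simplified reader written for illustration, not a replacement for wasm-objdump, and listSections and hasDebugSections are hypothetical names.

```javascript
const SECTION_NAMES = ["custom", "type", "import", "function", "table", "memory",
  "global", "export", "start", "element", "code", "data", "datacount"];

// Decode an unsigned LEB128 integer starting at `offset`.
function readU32(bytes, offset) {
  let result = 0;
  let shift = 0;
  let pos = offset;
  while (true) {
    const byte = bytes[pos++];
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) break;
    shift += 7;
  }
  return { value: result >>> 0, next: pos };
}

// Walk the section headers of a .wasm binary without decoding section bodies.
function listSections(bytes) {
  const sections = [];
  let pos = 8; // skip the 4-byte magic and 4-byte version
  while (pos < bytes.length) {
    const id = bytes[pos];
    const size = readU32(bytes, pos + 1);
    let name = SECTION_NAMES[id] ?? `unknown(${id})`;
    if (id === 0) {
      // Custom sections begin with a length-prefixed UTF-8 name.
      const nameLen = readU32(bytes, size.next);
      const raw = bytes.subarray(nameLen.next, nameLen.next + nameLen.value);
      name = `custom:${new TextDecoder().decode(raw)}`;
    }
    sections.push({ name, size: size.value });
    pos = size.next + size.value;
  }
  return sections;
}

// True if a name section or any .debug_* (DWARF) custom section is still present.
function hasDebugSections(bytes) {
  return listSections(bytes).some(
    (s) => s.name === "custom:name" || s.name.startsWith("custom:.debug_"));
}
```

Called on the bytes of the shipped artifact, hasDebugSections returning true is a reason to fail the build.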

The fourth check is behavioral: a smaller binary that no longer runs is a regression, not an optimization. Load the optimized module and exercise its real exports, ideally in the same harness you use for benchmarking, because some failures — a stripped export, a feature-flag mismatch that produced a subtly wrong binary — only surface at instantiation or first call, not during wasm-validate. The loading example above is the minimal form of this check; in CI, run your actual test suite against the optimized artifact rather than the debug build, so the bytes you test are the bytes you ship.

# Compare exported names to prove nothing the host needs was eliminated
# (export lines look like: - func[3] <process_batch> -> "process_batch")
wasm-objdump -x -j Export pkg/app_bg.wasm  | sed -n 's/.* -> "\(.*\)"$/\1/p' | sort > /tmp/pre.txt
wasm-objdump -x -j Export pkg/app.opt.wasm | sed -n 's/.* -> "\(.*\)"$/\1/p' | sort > /tmp/post.txt
diff /tmp/pre.txt /tmp/post.txt && echo "exports preserved"


Frequently Asked Questions

Should I optimize at the compiler or with wasm-opt? Both, in that order. The compiler (opt-level = "z", panic = "abort", lto) removes whole categories of code at the source level; wasm-opt then performs Wasm-specific rewrites and whole-module DCE that LLVM cannot do because it targets many architectures. Skipping either leaves 10–20% on the table.

Does -Oz ever make a module slower? Yes. -Oz disables loop unrolling and trims inlining to save bytes, which can regress a tight numeric loop by 5–15%. For compute-bound kernels prefer -O3 or -Os and accept the larger binary; for glue and UI logic -Oz is almost always the right call.

Why is my brotli file barely smaller than gzip? Either you used a low brotli quality (use -q 11 for static assets) or the binary is already dense — a well-optimized, stripped .wasm has little redundancy left for either compressor to exploit. The gap is widest on debug builds full of repetitive metadata.

Do I still need wasm-strip if I pass --strip-debug to wasm-opt? No — --strip-debug inside wasm-opt and a separate wasm-strip pass do the same job for DWARF and the name section. Use one or the other; running both is harmless but redundant.

How do I keep source-level debugging while still shipping a small binary? Build two artifacts from one compile: keep an unstripped .wasm with DWARF for local debugging and crash symbolication, and ship the stripped, wasm-opt-processed binary to users. Never debug against the size-optimized file — -Oz reorders and merges code so line numbers no longer map cleanly.

Does compressing make compiler and wasm-opt optimization pointless? No. Compression removes byte redundancy but cannot delete code that is present; DCE and stripping remove the code and metadata entirely, so the engine never parses them. The two stack: a stripped, DCE'd binary compresses to fewer bytes, and it would parse faster than a fat binary even at the same wire size, because parse time tracks the decompressed code, not the transferred bytes.
