Deep dives from building Tungsten — performance engineering, GPU work, and compiler internals, written while the paint was still wet.
The rung-by-rung climb of a self-hosted JSON lexer from 1.4 GB/s to 21 GB/s — SIMD classification, branch elimination, and the microarchitectural detail behind every jump.
A pure-Tungsten nvfp4 decode path for a 1.7B-parameter model, taken from 0.71× to 1.16× the speed of Apple's hand-tuned MLX on the same Apple silicon.
Getting Apple's new Metal 4 matmul2d and cooperative tensors running on the M5 Max — the headline instruction, and what it took to feed it.
Rebuilding the compiler's AST so each node reference is one machine word with zero bytes of header — the data-structure design behind a faster self-host.
Building a simdjson-class structural classifier from scratch: how 200 lines of C take a JSON lexer from 1.98 GB/s toward simdjson territory.
An 18% regression that turned out to be LTO silently refusing to inline across a target-features mismatch — and the one flag that fixed it.
ARM's Scalable Matrix Extension on Apple silicon: what SME and SME2 actually offer the CPU side, and how Tungsten reaches them.
Why the NEON shrn/xtn tail-compression trick backfires on Apple silicon's scan helpers — a benchmark-backed cautionary tale.