SME / SME2 on Apple Silicon
ARM's Scalable Matrix Extension (SME), together with its SME2 revision, is the new CPU-side matrix-multiply hardware. Apple shipped it with M4 and refined it on M5 (FEAT_SME2p1, plus FP16-accumulate and BF16-accumulate outer-product variants).
We explored SME for LLM inference on M5 Max: smoke-tested the toolchain, built an FP16 matvec kernel, and probed the SME2 LUTI4 path for NVFP4 dequant. The conclusion is that SME is not viable as a primary inference path on this workload: the CPU-side bandwidth ceiling leaves it roughly 4× (predicted NVFP4 best case) to 16× (measured FP16) slower than the GPU. The toolchain knowledge is nonetheless captured for future use.
What M5 Max has
$ sysctl hw.optional.arm | grep -iE "sme|sve"
hw.optional.arm.FEAT_SME: 1 ← base SME
hw.optional.arm.FEAT_SME2: 1
hw.optional.arm.FEAT_SME2p1: 1 ← newer than M4
hw.optional.arm.SME_F32F32: 1
hw.optional.arm.SME_F16F32: 1 ← FP16 mul → FP32 acc
hw.optional.arm.FEAT_SME_F16F16: 1 ← FP16 mul → FP16 acc (NEW vs M4)
hw.optional.arm.FEAT_SME_B16B16: 1 ← BF16 mul → BF16 acc (NEW vs M4)
hw.optional.arm.SME_I8I32: 1 ← INT8 mul → INT32 acc
hw.optional.arm.sme_max_svl_b: 64 ← SVL = 64 bytes = 16 FP32 / 32 FP16 lanes
Notably hw.optional.arm.FEAT_SVE is 0: Apple ships streaming-only SVE. SVE instructions are legal inside PSTATE.SM=1 but trap with SIGILL outside. This has consequences for how SME code can be written.
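The same keys can be read at runtime before dispatching to an SME kernel. The following is a minimal sketch, assuming the keys are readable through `sysctlbyname` as small integers; the helper names `arm_feature` and `sme_streaming_only` are hypothetical, and on non-Apple platforms the sketch simply reports "unknown".

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns 1 if the named sysctl key reads nonzero, 0 if it reads zero,
 * and -1 if the key is missing or we are not building for macOS. */
int arm_feature(const char *name) {
#ifdef __APPLE__
    uint64_t value = 0;          /* wide enough for 4- or 8-byte integer keys */
    size_t len = sizeof value;
    if (sysctlbyname(name, &value, &len, NULL, 0) != 0)
        return -1;
    return value != 0;
#else
    (void)name;
    return -1;
#endif
}

/* True when SME is present but non-streaming SVE is not, i.e. the
 * configuration described above where SVE instructions are only legal
 * inside streaming mode. */
int sme_streaming_only(void) {
    return arm_feature("hw.optional.arm.FEAT_SME") == 1 &&
           arm_feature("hw.optional.arm.FEAT_SVE") != 1;
}
```

A driver could call `sme_streaming_only()` once at startup and fall back to a non-SME path when it returns 0.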
ZA tile (FP32 case): 16×16 elements. Each fmopa za, p, p, z, z performs a full outer product = 256 FMAs = 512 FLOPS.
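The tile arithmetic follows directly from the streaming vector length. As a small sketch (the struct and function names are illustrative, and it covers only the non-widening case where inputs and accumulators share one element size):

```c
#include <stdint.h>

/* Geometry of one ZA tile for a non-widening outer product. */
typedef struct {
    uint32_t lanes;           /* elements per Z vector = tile rows = tile cols */
    uint64_t fmas_per_mopa;   /* one outer product updates every tile element */
    uint64_t flops_per_mopa;  /* each FMA counted as 2 FLOPs */
} za_tile_geometry;

za_tile_geometry za_tile_for(uint32_t svl_bytes, uint32_t elem_bytes) {
    za_tile_geometry g;
    g.lanes = svl_bytes / elem_bytes;
    g.fmas_per_mopa = (uint64_t)g.lanes * g.lanes;
    g.flops_per_mopa = 2 * g.fmas_per_mopa;
    return g;
}
```

With `svl_bytes = 64` and `elem_bytes = 4` this reproduces the 16×16 tile, 256 FMAs and 512 FLOPs quoted above.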
The toolchain trap
Apple's clang accepts the ACLE SME function attributes:
- __arm_streaming (type qualifier, after declarator): caller must enter streaming mode before calling.
- __arm_locally_streaming (declarator): function enters/exits streaming on entry/exit.
- __arm_inout("za") (type qualifier): function reads/writes ZA state.
- __arm_new("za") (attribute, before declarator): function takes fresh ZA storage.
All of them break on Apple silicon.
Clang's lowering for SME-attributed functions emits cntd / rdvl (non-streaming SVE) in the prologue/epilogue for stack management of SVE-aware spill space. On Apple silicon those instructions trap with SIGILL because non-streaming SVE isn't supported. The trap is in code the compiler emits before the smstart sm instruction.
Even functions that call SME-attributed functions inherit this — the compiler emits rdvl in their prologues for ABI compliance with SVE-aware stack alignment. With -march=armv9-a (which mandates SVE2), this is true for every function in the TU.
Fix: pure inline asm + split TUs
The only workable pattern we found:
- SME kernels are pure inline asm. No clang SME function attributes, no SME intrinsics (which use the attributes internally).
- SME kernels live in their own TU compiled with -march=armv9.2-a+sme2.
- Driver code (main, harnesses) lives in a separate TU compiled with -march=armv8.6-a (no SVE codegen at all).
- Inline asm clobber list must include d0–d31. smstart sm / smstop sm zero the entire FP/SVE register file, including callee-saved d8–d15. Without these clobbers, host-side code that uses any FP value across the asm reads garbage. (Our first FP16 matvec bench showed clock_gettime returning t1 = inf until we added d8–d15 to the clobber list.)
Minimal pattern:
// kernel.c: compiled with -march=armv9.2-a+sme2
#include <stdint.h>
void sme_matvec_f16(const __fp16 *W, const __fp16 *x, float *y,
                    uint64_t K, uint64_t N) {
    __asm__ volatile(
        "smstart sm               \n"   // enter streaming mode (zeroes FP/SVE regs)
        "ptrue   p0.h             \n"   // all-true predicate over FP16 lanes
        // ... matvec inner loops ...
        "faddv   h4, p0, z0.h     \n"   // horizontal sum of the accumulator
        "fcvt    s4, h4           \n"   // widen FP16 result to FP32
        "str     s4, [%[y]]       \n"
        "smstop  sm               \n"   // leave streaming mode (zeroes FP/SVE regs again)
        :
        : [W]"r"(W), [x]"r"(x), [y]"r"(y), [K]"r"(K), [N]"r"(N)
        : "x4","x5","x6","p0","z0","z1","z2",
          // d0-d31 must all be listed, including callee-saved d8-d15
          "d0","d1","d2","d3","d4","d5","d6","d7",
          "d8","d9","d10","d11","d12","d13","d14","d15",
          "d16","d17","d18","d19","d20","d21","d22","d23",
          "d24","d25","d26","d27","d28","d29","d30","d31",
          "memory"
    );
}
// main.c: compiled with -march=armv8.6-a
#include <stdint.h>
extern void sme_matvec_f16(const __fp16 *, const __fp16 *, float *,
                           uint64_t, uint64_t);
int main(void) { /* ... */ sme_matvec_f16(...); /* ... */ }
Numbers
FP32 outer-product (single ZA tile, latency-bound)
1 billion fmopa: 0.94 s → 0.55 TFLOPS FP32 (sustained)
Single-accumulator latency-bound at 4 cycles per fmopa. Round-robin across 4 ZA tiles would scale toward ~2 TFLOPS (M4's published peak; M5 likely similar). Two SME units chip-wide → ~4 TFLOPS theoretical chip-wide ceiling.
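The arithmetic behind these figures can be checked with a short sketch. The function names are illustrative, and `ideal_multi_tile_tflops` assumes perfect linear scaling across independent ZA tiles, which is an upper bound rather than a measurement.

```c
#include <stdint.h>

/* Sustained TFLOPS for n_fmopa outer products completed in `seconds`. */
double fmopa_tflops(uint64_t n_fmopa, uint64_t flops_per_fmopa, double seconds) {
    return (double)n_fmopa * (double)flops_per_fmopa / seconds / 1e12;
}

/* ASSUMPTION: issuing to `tiles` independent accumulators hides the
 * single-accumulator latency completely, so throughput scales linearly. */
double ideal_multi_tile_tflops(double single_tile_tflops, unsigned tiles) {
    return single_tile_tflops * (double)tiles;
}
```

Feeding in 10^9 fmopa, 512 FLOPs each, and 0.94 s gives the 0.55 TFLOPS figure; multiplying by four tiles gives an idealised bound slightly above 2 TFLOPS, which is consistent with the published per-unit peak acting as the real ceiling.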
FP16 matvec K=N=2048
SME FP16 matvec, K=2048 N=2048:
best-of-7: 0.260 ms/call
32.2 GFLOPS, 32.3 GB/s
verified correct vs scalar reference
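For context, a scalar reference of the kind used for such a check might look like the sketch below. It is not the original harness: the row-major layout (`W[n*K + k]`), the FP32 accumulation, the tolerance rule, and all function names are assumptions, and FP16 values are passed as raw `uint16_t` bit patterns so the code does not depend on `__fp16` support.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert IEEE 754 binary16 bits to float. */
float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0 && man == 0) {
        bits = sign;                                  /* signed zero */
    } else if (exp == 0) {
        int shift = 0;                                /* subnormal: normalise */
        while ((man & 0x400u) == 0) { man <<= 1; shift++; }
        man &= 0x3FFu;
        bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);      /* inf or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* y[n] = sum over k of W[n*K + k] * x[k], accumulated in FP32.
 * ASSUMPTION: W is row-major with N rows of K elements. */
void matvec_f16_ref(const uint16_t *W, const uint16_t *x, float *y,
                    size_t K, size_t N) {
    for (size_t n = 0; n < N; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < K; k++)
            acc += half_bits_to_float(W[n * K + k]) * half_bits_to_float(x[k]);
        y[n] = acc;
    }
}

/* Returns 1 if every element agrees within rel_tol * max(1, |want|). */
int outputs_match(const float *got, const float *want, size_t n, float rel_tol) {
    for (size_t i = 0; i < n; i++) {
        float diff = got[i] - want[i];
        if (diff < 0.0f) diff = -diff;
        float mag = want[i] < 0.0f ? -want[i] : want[i];
        if (mag < 1.0f) mag = 1.0f;
        if (diff > rel_tol * mag) return 0;
    }
    return 1;
}
```

A relative tolerance is used because the SME kernel and the scalar loop sum in different orders, so bit-exact agreement is not expected.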
For comparison, GPU on Lightning's path: ~16 µs/matvec. GPU is ~16× faster, even at single-batch decode where SME was supposed to win.
The bottleneck is bandwidth: 32 GB/s ≈ 16-21% of the CPU-side LPDDR5 ceiling (~150-200 GB/s sustained). The CPU-side ceiling itself is fundamentally below the GPU's 546 GB/s on Apple silicon; CPU and GPU share the memory but the GPU has wider channels.
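The GFLOPS and GB/s figures derive from the matrix shape and the per-call time. A minimal sketch, counting only weight traffic (the x and y vectors are ignored as negligible, which is an assumption) and using illustrative names:

```c
#include <stdint.h>

typedef struct {
    double gflops;        /* 2*K*N FLOPs per call */
    double gbytes_per_s;  /* weight bytes streamed per second, 1 GB = 1e9 B */
} matvec_rate;

matvec_rate matvec_rate_for(uint64_t K, uint64_t N,
                            uint32_t bytes_per_weight, double seconds) {
    matvec_rate r;
    double elems = (double)K * (double)N;
    r.gflops = 2.0 * elems / seconds / 1e9;
    r.gbytes_per_s = elems * (double)bytes_per_weight / seconds / 1e9;
    return r;
}

/* Fraction of a bandwidth ceiling actually achieved. */
double fraction_of_ceiling(double achieved_gbps, double ceiling_gbps) {
    return achieved_gbps / ceiling_gbps;
}
```

Calling `matvec_rate_for(2048, 2048, 2, 0.260e-3)` and passing the resulting bandwidth to `fraction_of_ceiling` with the 150 and 200 GB/s bounds gives the share of the CPU-side ceiling the kernel reaches.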
NVFP4 SME via LUTI4 — explored, has Apple-specific lane quirks
SME2's LUTI4 instruction uses the ZT0 register (a 64-byte lookup table, separate from ZA) for 4-bit dequant. In principle a perfect fit for NVFP4 weights:
- Load 64-byte LUT into ZT0 once: ldr zt0, [x_lut]
- Per group: load packed nibbles, run luti4 zd.h, zt0, zn[idx], multiply by E4M3 scale, accumulate into row sum.
LUTI4 raw throughput is ~136 G dequants/sec (effective ~273 GB/s of FP16 produced) — far above any bandwidth concern.
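What the per-group step is meant to compute can be stated as a scalar reference. This is a sketch under assumptions, not the kernel: it takes a group to be 16 FP4 (E2M1) codes packed two per byte with the low nibble first, and it receives the group's E4M3 scale already decoded to `float`. The name `nvfp4_group_dot` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* FP4 E2M1 code points: bit 3 is the sign, bits 2..0 select the magnitude.
 * This is the 16-entry table a LUTI4-based path would hold in ZT0. */
static const float kE2M1[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

/* Dot product of one 16-element group with 16 activations.
 * packed: 8 bytes, two 4-bit codes per byte (ASSUMED: low nibble first).
 * scale:  the group's scale, decoded from E4M3 by the caller. */
float nvfp4_group_dot(const uint8_t packed[8], float scale, const float x[16]) {
    float acc = 0.0f;
    for (size_t i = 0; i < 8; i++) {
        acc += kE2M1[packed[i] & 0x0F] * x[2 * i];
        acc += kE2M1[packed[i] >> 4]   * x[2 * i + 1];
    }
    return acc * scale;   /* the scale is shared by the group, so apply it once */
}
```

Applying the scale once per group rather than per element mirrors the "multiply by E4M3 scale, accumulate into row sum" ordering described above.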
But Apple's LUTI4 zd.h, zt0, zn[0] on M5 Max has unexpected lane semantics that diverge from the textbook ARMv9.4 SME2 spec:
- Reads only the low nibble of input bytes.
- Reads only from bytes 0-7 and 16-23 of zn (skips 8-15 and 24-31).
- Writes to lanes 0-7 and 16-23 of zd (zeros 8-15 and 24-31).
- Effective: 16 valid fp16 outputs per LUTI4, from 16 specific input nibbles, not the 32 contiguous nibbles a naive reading suggests.
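The observed pattern can be written down as a scalar model, mainly useful as a checker when comparing against hardware output in a smoke test. It models only what the list above reports, not the architectural definition of LUTI4. The one-to-one mapping from input byte i to output lane i, and the 16-entry halfword table standing in for ZT0, are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

enum { SVL_BYTES = 64, H_LANES = 32 };

/* Model of the OBSERVED single-vector behaviour on the first 32 bytes/lanes.
 * lut: 16 halfword entries standing in for ZT0 (assumption).
 * zn:  input vector bytes; only the low nibble of selected bytes is used.
 * zd:  output halfword lanes; unselected lanes are written as zero. */
void luti4_h_observed(const uint16_t lut[16],
                      const uint8_t zn[SVL_BYTES],
                      uint16_t zd[H_LANES]) {
    memset(zd, 0, H_LANES * sizeof zd[0]);
    for (int i = 0; i < 8; i++) {
        zd[i]      = lut[zn[i] & 0x0F];        /* bytes 0-7   -> lanes 0-7   */
        zd[16 + i] = lut[zn[16 + i] & 0x0F];   /* bytes 16-23 -> lanes 16-23 */
    }
}
```

Comparing this model's output with a dump of zd after the real instruction would confirm or refute the byte-to-lane mapping assumed here.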
We tried single-vector (1V), 2V, and 4V ({z0.h-z3.h}) variants; all show the same lane pattern. Either Apple's M5 implements a non-standard LUTI4, or the textbook reading of the ARM ARM is wrong for this microarchitecture. Reverse-engineering the full instruction matrix to use it correctly was more effort than the bandwidth-bound end result justified.
Why SME isn't the right tool here
The bandwidth math is the wall, not compute:
| Path | Ceiling | Reality |
|---|---|---|
| GPU FP16 matvec | 546 GB/s mem | ~12 TFLOPS measured; ~16 µs at K=N=2048 |
| SME FP16 matvec | ~150 GB/s CPU mem | 0.26 ms = 32 GB/s effective |
| SME NVFP4 matvec | ~150 GB/s mem (4× fewer weight bytes) | ~65 µs best case (predicted) |
Even with a perfect NVFP4 LUTI4 path, SME at best closes the gap to 4× slower than GPU. Multi-core SME (parallelizing matvec rows across P-cores) could bring it closer, but the chip-wide SME unit count is capped at 2 — Apple shares SME hardware between core clusters.
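The predicted figure can be reproduced with a simple streaming-time estimate. The sketch below assumes the prediction holds the measured ~32 GB/s effective bandwidth fixed and only shrinks the weight bytes (scale bytes are ignored); that reading is an inference from the numbers, and the function names are illustrative.

```c
#include <stdint.h>

/* Weight bytes for a K x N matrix at bits_per_weight (scales ignored). */
double weight_bytes(uint64_t K, uint64_t N, double bits_per_weight) {
    return (double)K * (double)N * bits_per_weight / 8.0;
}

/* Microseconds to stream `bytes` at `gbytes_per_s` (1 GB = 1e9 bytes). */
double stream_time_us(double bytes, double gbytes_per_s) {
    return bytes / (gbytes_per_s * 1e9) * 1e6;
}
```

For K=N=2048, `stream_time_us(weight_bytes(2048, 2048, 4.0), 32.3)` gives the NVFP4 estimate, and dividing it by the GPU's ~16 µs gives the roughly 4× gap; the same call with 16 bits per weight recovers the measured FP16 time.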
For a CPU-only inference target (no GPU available), SME would be the right tool. For Apple silicon with the GPU sitting right there sharing the same memory, GPU wins.
What's worth keeping
- The toolchain knowledge — how to compile SME code on macOS without hitting the streaming-SVE / SVE-attribute traps.
- Smoke files in /tmp/sme_*.c (not committed) document the working pure-asm pattern and the LUTI4 quirks.
- Confidence that Phase 3 was explored, characterized, and ruled out with measured numbers rather than theoretical predictions.
References
- <arm_sme.h>: ACLE SME intrinsics header (the intrinsic surface is unusable on Apple silicon for the reasons above; the assembly mnemonics it would emit are still useful).
- ARM Architecture Reference Manual: SME / SME2 sections (LUTI4, fmopa, smstart/smstop encodings).
- xnu-arm-sme.md: Apple's docs on macOS streaming-mode handling.
- Hello-SME microbenchmarks (Friedrich-Schiller-Univ. Jena) on M4 P-cores: 2.0 TFLOPS FP32, 4.0 TOPS INT8 per SME unit.
- Apple Developer, "Determining Instruction Set Characteristics": documents sysctl hw.optional.arm.* for runtime feature detection.