
SME / SME2 on Apple Silicon

ARM's Scalable Matrix Extension (SME) and its successor SME2 add CPU-side matrix-multiply hardware. Apple shipped it with M4 and refined it on M5 (FEAT_SME2p1, plus FP16-acc and BF16-acc outer-product variants).

We explored SME for LLM inference on M5 Max: smoke-tested the toolchain, built an FP16 matvec kernel, and probed the SME2 LUTI4 path for NVFP4 dequant. Conclusion: SME is not viable as a primary inference path on this workload, because the CPU-side bandwidth ceiling leaves it ~4-16× slower than the GPU. The toolchain knowledge is captured here for future use.

What M5 Max has

$ sysctl hw.optional.arm | grep -iE "sme|sve"
hw.optional.arm.FEAT_SME:    1     ← base SME
hw.optional.arm.FEAT_SME2:   1
hw.optional.arm.FEAT_SME2p1: 1     ← newer than M4
hw.optional.arm.SME_F32F32:  1
hw.optional.arm.SME_F16F32:  1     ← FP16 mul → FP32 acc
hw.optional.arm.FEAT_SME_F16F16: 1 ← FP16 mul → FP16 acc (NEW vs M4)
hw.optional.arm.FEAT_SME_B16B16: 1 ← BF16 mul → BF16 acc (NEW vs M4)
hw.optional.arm.SME_I8I32:   1     ← INT8 mul → INT32 acc
hw.optional.arm.sme_max_svl_b: 64  ← SVL = 64 bytes = 16 FP32 / 32 FP16 lanes

Notably hw.optional.arm.FEAT_SVE is 0: Apple ships streaming-only SVE. SVE instructions are legal inside PSTATE.SM=1 but trap with SIGILL outside. This has consequences for how SME code can be written.
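The same flags can be checked at runtime before dispatching to an SME kernel. A minimal sketch, assuming the keys shown above are readable through sysctlbyname; the helper names hw_flag and can_use_sme2 are ours, not part of any Apple API, and on non-Apple platforms the helper simply reports "unknown":

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns 1 if the named sysctl flag is set, 0 if it is clear, and -1 if
   the key cannot be read (or this is not an Apple platform).
   hw_flag is a hypothetical helper name. */
int hw_flag(const char *name) {
    assert(name != NULL);
#if defined(__APPLE__)
    uint64_t value = 0;            /* wide enough for 4- or 8-byte values */
    size_t len = sizeof value;
    if (sysctlbyname(name, &value, &len, NULL, 0) != 0) return -1;
    return value != 0;
#else
    return -1;
#endif
}

/* Assumption: a kernel in the style of these notes needs both SME and SME2. */
int can_use_sme2(void) {
    return hw_flag("hw.optional.arm.FEAT_SME") == 1 &&
           hw_flag("hw.optional.arm.FEAT_SME2") == 1;
}
```

Checking FEAT_SVE the same way is not useful as a gate here: it reads 0 on this machine even though streaming-mode SVE instructions work.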

ZA tile (FP32 case): 16×16 elements. Each fmopa za, p, p, z, z performs a full outer product: 256 FMAs, i.e. 512 FLOPs.
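For readers who have not met fmopa before, here is a scalar model of what one FP32 outer-product accumulate does when every predicate lane is active. This is a sketch for intuition only; fmopa_ref and TILE are our names, and predication is ignored:

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 16 };  /* SVL = 64 bytes -> 16 FP32 lanes per vector */

/* Scalar model of one FP32 outer-product accumulate with all lanes active:
   za[i][j] += a[i] * b[j]. That is TILE*TILE = 256 multiply-adds. */
void fmopa_ref(float za[TILE][TILE], const float a[TILE], const float b[TILE]) {
    assert(za != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < TILE; i++)
        for (size_t j = 0; j < TILE; j++)
            za[i][j] += a[i] * b[j];
}
```

The hardware does all 256 updates in one instruction, which is where the per-instruction FLOP count comes from.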

The toolchain trap

Apple's clang accepts the ACLE SME function attributes, but all of them break on Apple silicon.

Clang's lowering for SME-attributed functions emits cntd / rdvl (non-streaming SVE) in the prologue/epilogue for stack management of SVE-aware spill space. On Apple silicon those instructions trap with SIGILL because non-streaming SVE isn't supported. The trap is in code the compiler emits before the smstart sm instruction.

Even functions that call SME-attributed functions inherit this — the compiler emits rdvl in their prologues for ABI compliance with SVE-aware stack alignment. With -march=armv9-a (which mandates SVE2), this is true for every function in the TU.

Fix: pure inline asm + split TUs

The only workable pattern we found:

  1. SME kernels are pure inline asm. No clang SME function attributes, no SME intrinsics (which use the attributes internally).
  2. SME kernels live in their own TU compiled with -march=armv9.2-a+sme2.
  3. Driver code (main, harnesses) lives in a separate TU compiled with -march=armv8.6-a (no SVE codegen at all).
  4. Inline asm clobber list must include d0-d31. smstart sm / smstop sm zero the entire FP/SVE register file, including callee-saved d8-d15. Without these clobbers, host-side code that keeps any FP value live across the asm reads garbage. (Our first FP16 matvec bench showed clock_gettime returning t1 = inf until we added d8-d15 to the clobber list.)

Minimal pattern:

// kernel.c: compiled with -march=armv9.2-a+sme2
#include <stdint.h>

void sme_matvec_f16(const __fp16 *W, const __fp16 *x, float *y,
                    uint64_t K, uint64_t N) {
    __asm__ volatile(
        "smstart sm                                 \n"
        "ptrue p0.h                                 \n"
        // ... matvec inner loops ...
        "faddv h4, p0, z0.h                         \n"
        "fcvt s4, h4                                \n"
        "str s4, [%[y]]                             \n"
        "smstop sm                                  \n"
        :
        : [W]"r"(W), [x]"r"(x), [y]"r"(y), [K]"r"(K), [N]"r"(N)
        : "x4","x5","x6","p0","z0","z1","z2",
          "d0","d1","d2","d3","d4","d5","d6","d7",
          "d8","d9","d10","d11","d12","d13","d14","d15",
          "d16","d17","d18","d19","d20","d21","d22","d23",
          "d24","d25","d26","d27","d28","d29","d30","d31",
          "memory"
    );
}

// main.c: compiled with -march=armv8.6-a
#include <stdint.h>

extern void sme_matvec_f16(const __fp16 *, const __fp16 *, float *,
                           uint64_t, uint64_t);
int main(void) { /* ... */ sme_matvec_f16(...); /* ... */ }
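The driver TU is also where the timing loop lives. It holds doubles (t0, t1) live across the kernel call, which is exactly the host-side FP state that gets zeroed if the clobber list is incomplete. A minimal best-of-N sketch, assuming a POSIX clock_gettime; best_of and count_call are hypothetical names, and count_call only stands in for the real kernel call:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs fn(ctx) reps times and returns the fastest single call, in seconds. */
double best_of(int reps, void (*fn)(void *), void *ctx) {
    assert(reps > 0 && fn != NULL);
    double best = 0.0;
    for (int r = 0; r < reps; r++) {
        double t0 = now_sec();
        fn(ctx);                        /* the SME kernel call would go here */
        double dt = now_sec() - t0;
        if (r == 0 || dt < best) best = dt;
    }
    return best;
}

/* Stand-in workload: counts how many times it was invoked. */
void count_call(void *ctx) {
    ++*(int *)ctx;
}
```

A call such as best_of(7, count_call, &calls) mirrors a best-of-7 measurement; swapping count_call for a wrapper around the kernel is left to the harness.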

Numbers

FP32 outer-product (single ZA tile, latency-bound)

1 billion fmopa: 0.94 s → 0.55 TFLOPS FP32 (sustained)

This is the single-accumulator case, latency-bound at 4 cycles per fmopa. Round-robin across 4 ZA tiles would scale toward ~2 TFLOPS (M4's published peak; M5 likely similar). With two SME units chip-wide, the theoretical chip-wide ceiling is ~4 TFLOPS.
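The arithmetic behind these figures is short enough to write down. A sketch with helper names of our own choosing; the implied clock is only a sanity check derived from the numbers above, not a measured frequency:

```c
#include <assert.h>

/* FLOPs in one FP32 fmopa on a tile x tile ZA tile:
   one multiply and one add per element. */
double flops_per_fmopa(int tile) {
    assert(tile > 0);
    return 2.0 * tile * tile;
}

/* Sustained TFLOPS for n_fmopa instructions retired in the given time. */
double sustained_tflops(double n_fmopa, double seconds, int tile) {
    assert(seconds > 0.0);
    return n_fmopa * flops_per_fmopa(tile) / seconds / 1e12;
}

/* Clock rate (GHz) implied by a fixed cycles-per-instruction latency. */
double implied_clock_ghz(double n_ops, double seconds, double cycles_per_op) {
    assert(seconds > 0.0);
    return n_ops * cycles_per_op / seconds / 1e9;
}
```

sustained_tflops(1e9, 0.94, 16) comes out near 0.54, and multiplying by 4 independent tiles gives roughly 2.2, consistent with the ~2 TFLOPS estimate.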

FP16 matvec K=N=2048

SME FP16 matvec, K=2048 N=2048:
  best-of-7: 0.260 ms/call
  32.2 GFLOPS,  32.3 GB/s
  verified correct vs scalar reference
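The scalar reference itself is not shown in these notes. A plausible shape for one is sketched below, with two loud assumptions: W is row-major with N rows of K weights, and inputs are plain float rather than __fp16 so the snippet builds on any host. matvec_ref and all_close are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* y[n] = sum over k of W[n*K + k] * x[k], accumulated in double.
   Assumes row-major W: N rows of K weights each. */
void matvec_ref(const float *W, const float *x, float *y,
                uint64_t K, uint64_t N) {
    assert(W != NULL && x != NULL && y != NULL);
    for (uint64_t n = 0; n < N; n++) {
        double acc = 0.0;
        for (uint64_t k = 0; k < K; k++)
            acc += (double)W[n * K + k] * (double)x[k];
        y[n] = (float)acc;
    }
}

/* Returns 1 if |got[i] - ref[i]| <= abs_tol + rel_tol * |ref[i]| for all i. */
int all_close(const float *got, const float *ref, uint64_t n,
              float rel_tol, float abs_tol) {
    assert(got != NULL && ref != NULL);
    for (uint64_t i = 0; i < n; i++) {
        float diff = got[i] - ref[i];
        if (diff < 0.0f) diff = -diff;
        float mag = ref[i] < 0.0f ? -ref[i] : ref[i];
        if (diff > abs_tol + rel_tol * mag) return 0;
    }
    return 1;
}
```

An FP16 kernel will not match a double-accumulated reference bit for bit, so the comparison needs a tolerance; the right values depend on K and the data range.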

For comparison, GPU on Lightning's path: ~16 µs/matvec. GPU is ~16× faster, even at single-batch decode where SME was supposed to win.

The bottleneck is bandwidth: 32 GB/s ≈ 16-21% of the CPU-side LPDDR5 ceiling (~150-200 GB/s sustained). The CPU-side ceiling itself is fundamentally below the GPU's 546 GB/s on Apple silicon; CPU and GPU share the memory but the GPU has wider channels.
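The effective-bandwidth figure follows directly from the matrix size and the measured time. A small sketch of that calculation, assuming the only significant traffic is the FP16 weight matrix (the x and y vectors are negligible at this size); the function names are ours:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of weights streamed by one K x N matvec. */
double weight_bytes(uint64_t K, uint64_t N, double bytes_per_weight) {
    return (double)K * (double)N * bytes_per_weight;
}

/* Effective bandwidth in GB/s, with 1 GB = 1e9 bytes. */
double effective_gbps(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}

/* Fraction of a bandwidth ceiling actually achieved. */
double fraction_of_ceiling(double achieved_gbps, double ceiling_gbps) {
    assert(ceiling_gbps > 0.0);
    return achieved_gbps / ceiling_gbps;
}
```

For K=N=2048 at 2 bytes per weight, weight_bytes gives 8,388,608 bytes, and dividing by 0.260 ms reproduces the ~32.3 GB/s reported above.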

NVFP4 SME via LUTI4 — explored, has Apple-specific lane quirks

SME2's LUTI4 instruction uses the ZT0 register (a 64-byte lookup table, separate from ZA) for 4-bit dequant. In principle a perfect fit for NVFP4 weights:

LUTI4 raw throughput is ~136 G dequants/sec (effective ~273 GB/s of FP16 produced) — far above any bandwidth concern.
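As a reference for what a table-lookup dequant is supposed to compute, here is a scalar sketch: each 4-bit code indexes a 16-entry table of E2M1 values. The packing order (two codes per byte, low nibble first) and the single float scale are simplifying assumptions of ours; real NVFP4 carries per-block scales, and this says nothing about how Apple's LUTI4 orders its lanes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 16 values of a 4-bit E2M1 code: 1 sign, 2 exponent, 1 mantissa bit. */
static const float E2M1_LUT[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

/* Expands n_codes packed 4-bit codes: out[i] = LUT[code_i] * scale.
   Assumes two codes per byte, low nibble first (hypothetical layout). */
void dequant_fp4_lut(const uint8_t *packed, size_t n_codes,
                     float scale, float *out) {
    assert(packed != NULL && out != NULL);
    for (size_t i = 0; i < n_codes; i++) {
        uint8_t byte = packed[i / 2];
        uint8_t code = (i & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
        out[i] = E2M1_LUT[code] * scale;
    }
}
```

A scalar model like this is also a convenient oracle when probing a hardware lookup instruction: feed both the same codes and compare lane by lane.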

But Apple's LUTI4 zd.h, zt0, zn[0] on M5 Max has unexpected lane semantics that diverge from the textbook ARMv9.4 SME2 spec:

We tried single-vector (1V), 2V, and 4V ({z0.h-z3.h}) variants; all show the same lane pattern. Either Apple's M5 implements a non-standard LUTI4, or the textbook reading of the ARM ARM is wrong for this microarchitecture. Reverse-engineering the full instruction matrix to use it correctly was more effort than the bandwidth-bound end result justified.

Why SME isn't the right tool here

The bandwidth math is the wall, not compute:

| Path | Ceiling | Reality |
|---|---|---|
| GPU FP16 matvec | 546 GB/s mem | ~12 TFLOPS measured; ~16 µs at K=N=2048 |
| SME FP16 matvec | ~150 GB/s CPU mem | 0.26 ms = 32 GB/s effective |
| SME NVFP4 matvec | ~150 GB/s mem (4× less weight bytes) | ~65 µs best case (predicted) |

Even with a perfect NVFP4 LUTI4 path, SME at best closes the gap to 4× slower than GPU. Multi-core SME (parallelizing matvec rows across P-cores) could bring it closer, but the chip-wide SME unit count is capped at 2 — Apple shares SME hardware between core clusters.
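The NVFP4 prediction is a simple scaling argument: if the kernel stays bandwidth-bound at the same effective bandwidth, time shrinks in proportion to the bytes streamed. A sketch of that reasoning, with names of our own and the optimistic assumption that dequant adds no cost:

```c
#include <assert.h>

/* Predicted time when a bandwidth-bound kernel streams `shrink` times
   fewer bytes at the same effective bandwidth. */
double predicted_seconds(double measured_seconds, double shrink) {
    assert(shrink > 0.0);
    return measured_seconds / shrink;
}

/* How many times slower a is than b. */
double slowdown(double a_seconds, double b_seconds) {
    assert(b_seconds > 0.0);
    return a_seconds / b_seconds;
}
```

Starting from 0.26 ms with 4× fewer weight bytes gives 65 µs, and against ~16 µs on the GPU that is still a slowdown of about 4.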

For a CPU-only inference target (no GPU available), SME would be the right tool. For Apple silicon with the GPU sitting right there sharing the same memory, GPU wins.

What's worth keeping
