Blog | Cedric Bail

May 10, 2026 SPMD

What a Reduction Loop Reveals About SPMD vs Per-Op Intrinsics

A side-by-side disassembly of the same AVX2 reduction reveals a structural advantage of whole-loop vectorization over per-operation intrinsics

Apr 15, 2026 SPMD

We Built Cross-Lane SIMD Primitives. None of Them Helped.

The most important negative result from our SPMD-for-Go proof of concept: explicit shuffles and rotations lost to compiler pattern detection on idiomatic Go

Apr 15, 2026 SPMD

How the Compiler Knows Your Load Is Contiguous

The most important backend optimization in SPMD: recognizing contiguous memory access through ChangeType and BinOp chains

Apr 15, 2026 SPMD

16 Bytes That Saved a Thousand Branches

The cheapest optimization in our SPMD proof of concept: a WASM linear memory guard zone for safe vector overreads

Apr 15, 2026 SPMD

Byte Iteration at 32 Lanes: The Decomposed Index Path

How to iterate a []byte on AVX2 without drowning in index-register pressure

Apr 15, 2026 SPMD

Pattern Matching Outperformed Hand-Written SIMD

How compiler pattern detection on idiomatic Go outperformed explicit cross-lane SIMD builtins in our SPMD proof of concept

Apr 15, 2026 SPMD

Loop Peeling: Where Most of the Speed Comes From

How SSA-level loop peeling enables the all-ones mask fast path that delivers roughly 2x of the speedup in the SPMD benchmarks

Apr 15, 2026 SPMD

How SPMD Lives in the Compiler: Lessons from Building It

The mask-stack detour, predicated SSA, and why SPMD has to live at the heart of the compiler

Apr 15, 2026 SPMD

Writing SPMD Go: A Practical Guide

How to think about uniform vs varying, write go for loops, use reductions, and avoid the common pitfalls