Compiler on Cedric Bail

How the Compiler Knows Your Load Is Contiguous

Wed, 15 Apr 2026 10:07:00 -0700

The single most important question the SPMD backend asks is: “is this memory access contiguous?” The answer determines whether your loop runs at vector speed or crawls through gather/scatter. This article is about the compiler pass that answers that question, and why it was worth more than every other optimization we built combined.
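To make the question concrete, here is a minimal sketch in plain Go of the property being tested. It is not the compiler pass itself, which works on the compiler's intermediate form at build time. The `classify` helper and its three labels are hypothetical names chosen for illustration, and the sketch assumes at least two lanes.

```go
package main

import "fmt"

// classify inspects the per-lane indices of one vector memory access.
// Hypothetical helper for illustration only; assumes at least two lanes.
func classify(idx []int) string {
	uniform, contiguous := true, true
	for lane := 1; lane < len(idx); lane++ {
		if idx[lane] != idx[0] {
			uniform = false // some lane reads a different address
		}
		if idx[lane] != idx[0]+lane {
			contiguous = false // lanes are not base, base+1, base+2, ...
		}
	}
	switch {
	case uniform:
		return "uniform" // every lane touches the same element
	case contiguous:
		return "contiguous" // a single vector load or store suffices
	default:
		return "gather" // each lane needs its own address
	}
}

func main() {
	fmt.Println(classify([]int{8, 9, 10, 11})) // contiguous
	fmt.Println(classify([]int{5, 5, 5, 5}))   // uniform
	fmt.Println(classify([]int{0, 2, 4, 6}))   // gather
}
```

A real pass would have to prove the same property symbolically, from the shape of the index expression, since lane values are not known at compile time.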

Byte Iteration at 32 Lanes: The Decomposed Index Path

Wed, 15 Apr 2026 10:05:00 -0700

When we set out to make for i, b := range byteSlice fast on AVX2, the first thing that went wrong was the index vector. This article explains what happened, the technique we used to fix it, and the chain of bugs the fix resolved along the way.

Pattern Matching Outperformed Hand-Written SIMD

Wed, 15 Apr 2026 10:04:00 -0700

Our base64 decoder was implemented twice. Version 1 used explicit cross-lane operations: shuffles, rotations, and compact stores. It peaked at roughly 2x scalar performance. Version 2 used four plain go for loops with no cross-lane operations at all. It hit approximately 17 GB/s on AVX2, about 77% of simdutf C++ throughput and 9x faster than Go’s encoding/base64. The simpler code outperformed the clever code by a wide margin.

Loop Peeling: Where Most of the Speed Comes From

Wed, 15 Apr 2026 10:03:00 -0700

If you took every optimization in our SPMD-for-Go proof of concept and ranked them by benchmark impact, loop peeling would be at the top. Not pattern detection. Not contiguous access analysis. Not the decomposed index path. Peeling. It is the structural foundation that everything else is built on, and the reason our hot loops run at one memory operation per store instead of three.
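The general shape of loop peeling can be sketched in scalar Go. This is an illustration under stated assumptions, not the code the proof of concept generates. The `lanes` constant and the `addOnePeeled` function are hypothetical, and the inner loop stands in for a single vector operation.

```go
package main

import "fmt"

// lanes is an assumed vector width, e.g. 32 byte lanes on AVX2.
const lanes = 32

// addOnePeeled writes src[i]+1 into dst[i] for every index both slices share.
// It returns how many full chunks and how many tail elements it processed.
func addOnePeeled(dst, src []byte) (fullChunks, tail int) {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	mainEnd := n - n%lanes // largest multiple of lanes that fits in n

	i := 0
	// Main body: every lane is in bounds, so no mask or bounds check
	// is needed per element. The inner loop models one vector op.
	for ; i < mainEnd; i += lanes {
		s := src[i : i+lanes]
		d := dst[i : i+lanes]
		for j := range s {
			d[j] = s[j] + 1
		}
		fullChunks++
	}
	// Peeled tail: the leftover elements, handled separately so the
	// main body never pays for the partial chunk.
	for ; i < n; i++ {
		dst[i] = src[i] + 1
		tail++
	}
	return fullChunks, tail
}

func main() {
	src := make([]byte, 70)
	dst := make([]byte, 70)
	full, tail := addOnePeeled(dst, src)
	fmt.Println(full, tail) // 2 6
}
```

The point of the split is that the hot path carries no per-lane masking. Only the final partial chunk takes the slower route.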

How SPMD Lives in the Compiler: Lessons from Building It

Wed, 15 Apr 2026 10:02:00 -0700

We added a way to express data parallelism in idiomatic Go. Earlier discussions around this space often stalled on a simple question: how would it actually work in the compiler? A working proof of concept that compiles go for loops to WASM SIMD128, x86 SSE, and x86 AVX2, with end-to-end tests passing and a base64 decoder reaching ~77% of simdutf C++ throughput, is a better answer than another round of speculation. The goal here is to make the implementation strategy concrete. Along the way we learned one lesson the hard way: SPMD is a compiler feature that has to live at the heart of the SSA form. Everything else follows from that.

This article is for compiler engineers. If you want to see the benchmarks and the short version, read the overview. If you want to write SPMD Go code, the practical guide is for you. Here, we talk about what we built inside the compiler, what we got wrong, and what we would do differently.