Welcome to my website
Recent Blog Posts
Why a Reduction Loop Tells the Story: SPMD vs Per-Op SIMD Intrinsics
We have a small surprise from our SPMD proof of concept. On three identical AVX2 reductions over []int32 – sum, min, contains – our SPMD-compiled code is 1.8x to 2.6x faster than the same algorithms written against samber/lo/exp/simd, the experimental Go library built on Go’s new simd intrinsics package. Both run AVX2 8-wide. Both issue roughly the same number of vector ops in the loop body. The runtime gap is not about ISA choice. It is about what each compiler can see when it codegens the loop, and that turns out to be a structural property of how the intrinsic API is shaped – not a missed optimization in Go.
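For reference, these are the three kernels in question in their plain scalar form – a sketch of the algorithms being compared, not code from either library. Both compilers vectorize loops of exactly this shape 8-wide under AVX2; the function names here are mine.

```go
package main

import "fmt"

// sum: a plain add-reduction over the slice.
func sum(xs []int32) int32 {
	var s int32
	for _, x := range xs {
		s += x
	}
	return s
}

// min32: a min-reduction; assumes a non-empty slice.
func min32(xs []int32) int32 {
	m := xs[0]
	for _, x := range xs[1:] {
		if x < m {
			m = x
		}
	}
	return m
}

// contains: an any-lane-matches reduction with early exit.
func contains(xs []int32, target int32) bool {
	for _, x := range xs {
		if x == target {
			return true
		}
	}
	return false
}

func main() {
	data := []int32{5, 3, 9, 1, 7}
	fmt.Println(sum(data), min32(data), contains(data, 9)) // 25 1 true
}
```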
We Built Cross-Lane SIMD Primitives. None of Them Helped.
We built six cross-lane SIMD primitives for our Go SPMD proof of concept. We benchmarked them across ten examples. None delivered a measurable win. Every example that shipped fast shipped without them.
How the Compiler Knows Your Load Is Contiguous
The single most important question the SPMD backend asks is: “is this memory access contiguous?” The answer determines whether your loop runs at vector speed or crawls through gather/scatter. This article is about the compiler pass that answers that question, and why it was worth more than every other optimization we built combined.
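As an illustration of the distinction the pass draws (this is my sketch, not the pass itself), compare the two access shapes below: in the first loop each iteration reads the next element, so the addresses are provably consecutive and eight lanes map to one vector load; in the second the address depends on runtime data, which is gather territory.

```go
package main

import "fmt"

// sumContiguous: lane i reads a[i], so the address is base + i*4.
// A backend that can prove this emits a single wide load per 8 lanes.
func sumContiguous(a []int32) int32 {
	var s int32
	for i := 0; i < len(a); i++ {
		s += a[i] // provably contiguous access
	}
	return s
}

// sumGather: lane i reads a[idx[i]]. Without knowing idx, the backend
// must fall back to a gather (or scalarize) – the slow path the
// contiguity analysis exists to rule out.
func sumGather(a []int32, idx []int) int32 {
	var s int32
	for _, j := range idx {
		s += a[j] // data-dependent address: gather territory
	}
	return s
}

func main() {
	a := []int32{10, 20, 30, 40}
	fmt.Println(sumContiguous(a))          // 100
	fmt.Println(sumGather(a, []int{3, 0})) // 50
}
```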


