Welcome to my website
Recent Blog Posts
Why a Reduction Loop Tells the Story: SPMD vs Per-Op SIMD Intrinsics
We have a small surprise from our SPMD proof of concept. On three identical AVX2 reductions over []int32 – sum, min, contains – our SPMD-compiled code is 1.8x to 2.6x faster than the same algorithms written against samber/lo/exp/simd, the experimental Go library built on Go’s new simd intrinsics package. Both run AVX2 8-wide. Both issue roughly the same number of vector ops in the loop body. The runtime gap is not about ISA choice. It is about what each compiler can see when it codegens the loop, and that turns out to be a structural property of how the intrinsic API is shaped – not a missed optimization in Go.
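For reference, these are the three kernels in question in their plain scalar form – a sketch of the algorithms being compared, not code from either library. Both compilers vectorize loops of exactly this shape 8-wide under AVX2; the function names here are mine.

```go
package main

import "fmt"

// sum: a plain add-reduction over the slice.
func sum(xs []int32) int32 {
	var s int32
	for _, x := range xs {
		s += x
	}
	return s
}

// min32: a min-reduction; assumes a non-empty slice.
func min32(xs []int32) int32 {
	m := xs[0]
	for _, x := range xs[1:] {
		if x < m {
			m = x
		}
	}
	return m
}

// contains: an any-lane-matches reduction with early exit.
func contains(xs []int32, target int32) bool {
	for _, x := range xs {
		if x == target {
			return true
		}
	}
	return false
}

func main() {
	data := []int32{5, 3, 9, 1, 7}
	fmt.Println(sum(data), min32(data), contains(data, 9)) // 25 1 true
}
```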
We Built Cross-Lane SIMD Primitives. None of Them Helped.
We built six cross-lane SIMD primitives for our Go SPMD proof of concept. We benchmarked them across ten examples. None delivered a measurable win. Every example that shipped fast shipped without them.
How the Compiler Knows Your Load Is Contiguous
The single most important question the SPMD backend asks is: “is this memory access contiguous?” The answer determines whether your loop runs at vector speed or crawls through gather/scatter. This article is about the compiler pass that answers that question, and why it was worth more than every other optimization we built combined.
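As an illustration of the distinction the pass draws (this is my sketch, not the pass itself), compare the two access shapes below: in the first loop each iteration reads the next element, so the addresses are provably consecutive and eight lanes map to one vector load; in the second the address depends on runtime data, which is gather territory.

```go
package main

import "fmt"

// sumContiguous: lane i reads a[i], so the address is base + i*4.
// A backend that can prove this emits a single wide load per 8 lanes.
func sumContiguous(a []int32) int32 {
	var s int32
	for i := 0; i < len(a); i++ {
		s += a[i] // provably contiguous access
	}
	return s
}

// sumGather: lane i reads a[idx[i]]. Without knowing idx, the backend
// must fall back to a gather (or scalarize) – the slow path the
// contiguity analysis exists to rule out.
func sumGather(a []int32, idx []int) int32 {
	var s int32
	for _, j := range idx {
		s += a[j] // data-dependent address: gather territory
	}
	return s
}

func main() {
	a := []int32{10, 20, 30, 40}
	fmt.Println(sumContiguous(a))          // 100
	fmt.Println(sumGather(a, []int{3, 0})) // 50
}
```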


