Why a Reduction Loop Tells the Story: SPMD vs Per-Op SIMD Intrinsics
By Cedric Bail
We have a small surprise from our SPMD proof of concept. On three identical AVX2 reductions over `[]int32` – sum, min, contains – our SPMD-compiled code is 1.8x to 2.6x faster than the same algorithms written against `samber/lo/exp/simd`, the experimental Go library built on Go’s new `simd` intrinsics package. Both run AVX2 8-wide. Both issue roughly the same number of vector ops in the body. The runtime gap is not about ISA choice. It is about what each compiler can see when it codegens the loop, and that turns out to be a structural property of how the intrinsic API is shaped – not a missed optimization in `go`.
This post walks through the disassembly for two of the three kernels and explains where the cycles go. The takeaway is not “intrinsics are slow.” It is narrower and more interesting: per-operation SIMD intrinsics that return a vector through the ABI return register cannot keep a loop-carried accumulator live across a call boundary. And there are a few other places where owning the whole loop gives the compiler optimizations the intrinsic user cannot reach from the library side.
The setup
Same hardware (AMD Ryzen 7 6800U, Zen 3+). Same workload: a 1024-element `[]int32`. Three implementations of each kernel:
- scalar `samber/lo` – generic Go, our baseline.
- `samber/lo/exp/simd` – the experimental library that wraps Go’s new `simd` intrinsics. Compiled with stock `go`.
- SPMD – a `go for` loop with a `lanes.Varying[int32]` accumulator and a final `reduce.Add`/`reduce.Min`/`reduce.Any`. Compiled with our TinyGo+SPMD fork targeting `amd64-avx2`.
The numbers:
| Kernel | lo/simd (Go intrinsics) | SPMD (TinyGo+SPMD) | SPMD ratio |
|---|---|---|---|
| Sum | 329 ns/op | 181 ns/op | 1.82x |
| Min | 337 ns/op | 160 ns/op | 2.11x |
| Contains (x8) | 178 ns/op | 69 ns/op | 2.57x |
Both binaries emit AVX2. Both have hot loops that consume 8 i32 lanes per iteration. So why the gap?
Sum: the accumulator goes to the stack and back, every iteration
Here is the hot loop of lo/simd.SumInt32x8, trimmed to the essentials (about 26 instructions per 8 elements):
```asm
vmovdqu [rsp+0x38], ymm0   ; spill the accumulator
mov     ebx, 0x8
CALL    LoadInt32x8Slice   ; returns the loaded <8xi32> in ymm0
vmovdqu ymm1, [rsp+0x38]   ; reload the accumulator
vpaddd  ymm0, ymm1, ymm0   ; accumulate
mov     rcx, [rsp+0x88]    ; reload 3 loop variables the call may have clobbered
mov     rdx, [rsp+0x60]
mov     rax, [rsp+0x58]
lea/cmp/jb/cmp/jbe         ; loop control + bounds checks
```
And here is the SPMD version (about 27 instructions per 8 elements):
```asm
; ...tail-mask setup (4-6 instrs, runs every iter even on full-width chunks)...
vpand  ymm3, ymm3, [rbx]   ; masked load via memory-operand AND
vpaddd ymm0, ymm3, ymm0    ; accumulator stays in ymm0 across all iterations
add r11d, -8 / add r10, r8 / add r9, 8 / jmp
```
Same vector op count. The difference is the store-call-reload chain on ymm0. The simd intrinsic LoadInt32x8Slice returns its result through ymm0 – the standard vector return register. That is also the natural home for the running accumulator. So before every call, the accumulator gets spilled to the stack; after every call, it gets reloaded; only then can the vpaddd proceed. On top of that, the compiler reloads three loop-control registers after each call because the callee is opaque to it.
The store→call→reload pattern adds roughly six cycles of latency per iteration (store-to-load forwarding plus the call boundary). Over the 128 iterations of a 1024-element reduction, that is on the order of 770 cycles – roughly 160 ns at the 6800U’s ~4.7 GHz boost clock, which accounts for most of the 148 ns/op difference between the two Sum numbers.
SPMD avoids it entirely. Because the loop body is a single LLVM function with no opaque calls inside it, the accumulator simply stays live in ymm0 for the duration of the loop.
The horizontal reduce at the end shows the same pattern from a different angle. lo/simd stores the vector to the stack and then runs a scalar 8-element loop summing it back – 9 instructions, all scalar. SPMD does vextracti128 + 3×vphaddd + vmovd – 5 instructions, all vector. Both are correct. The latter happens because LLVM owns the reduction.
Contains: three structural wins compound
The Sum case shows the cost of the call boundary on the loop-carried accumulator. Contains shows what happens when the compiler also owns the loop shape.
lo/simd.ContainsInt32x8 hot loop, about 27 instructions per 8 elements:
```asm
...bounds checks...
CALL      LoadInt32x8Slice
vmovdqu   ymm1, [rsp+0x18]   ; reload broadcast needle
vpcmpeqd  ymm0, ymm0, ymm1
vmovmskps edx, ymm0          ; vector -> GP bitmask
test      dl, dl
je        <continue>
```
SPMD’s tight main loop, 9 instructions per 8 elements:
```asm
add r10, 0x8 / cmp r10, r8 / jge
lea r11, [r9+rcx] / sar r9, 0x1e
vpcmpeqd ymm2, ymm1, [rdi+r9]   ; load + compare fused into one instruction
vtestps  ymm2, ymm2             ; sets ZF directly from a vector
mov r9, r11
je  <found>
```
Three things compound:
- Loop peeling. When the remaining length is at least 8, no mask is needed. SPMD emits a stripped main body for the all-active case and reserves the masked path for the tail. The intrinsic version cannot peel because it does not own the loop – the user wrote it.
- Memory-operand `vpcmpeqd`. With the load fused into the compare, what was two instructions becomes one. The `simd` API exposes the load as a separate function, so the compiler never has the chance to fuse it.
- `vtestps` instead of `vmovmskps + test`. The natural Go expression for “is any lane true?” is `cmp.ToBits() != 0`, which goes through a vector→GP-register move. `vtestps` sets ZF directly from a vector register. Today there is no intrinsic that exposes it.
These are not optimizations a library author can apply. They live in the compiler, and they require the compiler to see the loop as a unit.
Min has its own small story
For brevity I am skipping the Min disassembly, but it carries a firstInitialized bool so the first iteration seeds minVec instead of comparing. That bool turns into a movzx + test + je inside the hot loop – well-predicted, but still decoded and dispatched every iteration. SPMD does not need it: the mask handles the first iteration the same way it handles every other iteration.
(The horizontal reduce shows the same vector-vs-scalar split as Sum: 7 vector instructions for SPMD versus 17 scalar instructions for lo/simd.)
What the gap is, structurally
Three root causes, in order of how much they cost:
1. The intrinsic return-value ABI forces an accumulator spill. A vector intrinsic that returns a vector lands in ymm0. Any caller-side vector that has to stay live across the call – which is exactly what a reduction’s accumulator is – gets spilled. Per iteration. For the entire loop. There is no way around it inside the current API shape.
2. Per-op intrinsics are bodyless to the compiler. The library exposes LoadInt32x8Slice and friends as assembly-backed stubs that the compiler cannot inline. If go could see through them, the load would fold into the next op exactly the way LLVM does for SPMD, the spill would vanish, and Sum’s loop would collapse to roughly 15 instructions per 8 elements – faster than what SPMD currently emits. This is not a missed go optimization. It is a consequence of how the intrinsic surface is shaped.
3. Loop-level rewrites are not available to the intrinsic user. Loop peeling, choosing between vmovmskps and vtestps, deciding whether to do a horizontal reduce in vector or scalar – these are not library decisions. They are compiler-level transformations that need to see the loop as a whole.
Could the experimental simd package narrow the gap?
In two of the three kernels, yes – partially – with API or implementation changes:
- Sum and Min could close most of the gap if the load intrinsic wrote into a caller-owned accumulator instead of returning through `ymm0`. Something shaped like `acc.AddFrom(slice)` rather than `acc = acc.Add(LoadInt32x8Slice(slice))` would let the compiler keep the accumulator live across the call. An alternative path is to make the intrinsic wrappers genuinely inlinable so `go` can see through them; that has its own tradeoffs.
- Contains is harder. Closing the gap there would need a new intrinsic exposing a `vtestps`-style direct ZF set, plus loop peeling at the `go` level. The first is a library extension; the second is a compiler change.
There is also a backend angle worth noting. If the simd package were ported to TinyGo, each intrinsic would naturally lower to LLVM vector IR rather than an opaque assembly stub, so two of the three causes above – the call-boundary accumulator spill and the missed load/compare fusion – would mostly disappear on their own: LLVM’s register allocator would keep the accumulator live across iterations, and its instruction selector would fuse memory operands and pattern-match idioms like “any-lane true” to vtestps without the library having to ask. The third cause – loop peeling and the rest of the loop-shape rewrites – would not be addressed by a backend change alone, because the compiler still does not own the loop. That residual is exactly the seam where a loop-level construct earns its keep, and it lines up with the conclusion above: backends can close the codegen gap, but loop-level transformations need a loop-level anchor.
None of this is a criticism of the simd package. It is doing exactly what an instruction-level intrinsic library is supposed to do. The point is that the API shape (per-op, vector return values) carries a structural cost on loop-carried state that no amount of library polish can remove, and a complementary loop-level approach can address that cost without competing with the intrinsic surface for control.
How this fits with archsimd
We have written before that SPMD and archsimd are complementary, and this comparison is a concrete example of where the layers live:
- `archsimd` is the right tool when you want to pick exact instructions. It is honest about what it is: vector ops, one at a time, return values, explicit width.
- A loop-level construct like `go for` is the right tool when you want the compiler to choose width, generate the masked tail, hoist invariants, and apply loop peeling – without you writing any of it.
The two are not in competition. You would expect to use both in the same program: archsimd for the handful of kernels where intrinsic-level control is genuinely worth the source cost, and a loop-level construct for the long tail of reductions, scans, byte-classification loops, and per-pixel arithmetic where readability and portability matter more than picking instructions by hand.
The relevant lesson from these three reductions is that there is real performance available at the loop level that a per-op intrinsic library structurally cannot reach. That is a useful argument for having both layers, not for replacing one with the other.
Reproducing the comparison
- SPMD binaries: `PATH=$(pwd)/go/bin:$PATH GOEXPERIMENT=spmd ./tinygo/build/tinygo build -target=amd64-avx2 -o <out> test/integration/spmd/lo-<kernel>/main.go`.
- `lo/simd` binaries: `go test -c` in `test/bench/simd/`, then `go tool objdump` on the resulting test binary.
- Three-way driver: `test/e2e/spmd-benchmark-x86.sh` (scalar `lo` vs `lo/simd` vs SPMD).
The full report with the disassembly transcripts lives in docs/spmd-vs-go-simd-intrinsics.md in the SPMD repository.
If you want the broader picture of where SPMD’s speed comes from across all the workloads we have tested, the results post gathers the numbers. The loop peeling post goes deeper on the single transformation that does most of the work, including in the Contains case above.