Cross-Lane Communication: When Lanes Need to Talk

By Cedric Bail

July 12, 2025

Warning

Historical note. This post is a thought experiment that predates the actual TinyGo SPMD compiler. The high-level point (that base64 decoding requires data exchange between lanes) survived, but the actual base64 decoder we ended up shipping does not use SwizzleWithin + RotateWithin at all. The working AVX2 decoder uses a cascading byte → int16 → int32 pipeline that triggers vpmaddubsw + vpmaddwd + vpshufb + vpermd, achieving ~17 GB/s (~9x Go stdlib). See the base64-mula-lemire example and the vectorized-table-lookup section of Writing SPMD Go for the patterns that actually work. Other drift from this post: lanes.From returns a Varying[T] of bounded width (it caps to lane count), Varying field/element addressing via r[i] = ... is not how lane writes are expressed (use lanes.Index() and arithmetic), and []byte(output) conversion from a varying value is not supported; use lanes.CompactStore or contiguous slice writes. The *Within builtins are real, but constant-index only in current codegen.

When Independent Lanes Aren’t Enough

Most SPMD examples show lanes working independently, each processing its own data without communicating with neighbors. But real-world algorithms often require cross-lane communication: lanes must exchange data to solve the problem correctly. Base64 decoding demonstrates this problem, transforming groups of 4 ASCII characters into groups of 3 bytes through coordinated lane operations.

This draws from Miguel Young de la Sota’s “Designing a SIMD Algorithm from Scratch”. Go read his article for the deep dive – this one focuses on how the same algorithm could look in Go with an SPMD extension.

The Problem: 4-to-3 Data Transformation

Base64 encoding converts every 3 bytes into 4 ASCII characters. Decoding reverses this: 4 ASCII characters become 3 bytes. This mismatch creates the core challenge: we can’t simply process each character independently because the output structure differs from the input structure.

Let’s look first at how we would iterate over all the data.

// Package main demonstrates cross-lane communication in SPMD Go
// Based on: https://github.com/mcy/vb64/blob/main/src/simd.rs#L16-L144
package main

import (
 "lanes"
 "reduce"
)

func Decode(ascii []byte) ([]byte, bool) {
 if len(ascii) == 0 {
  return nil, true
 }
 if len(ascii)%4 != 0 {
  return nil, false // Base64 requires input length multiple of 4
 }

 decoded := make([]byte, 0, len(ascii)*3/4)
 pattern := outputPattern()

 go for _, v := range ascii {
  decodedChunk, valid := decodeChunk(v, pattern)
  if !valid {
   return nil, false
  }
  decoded = append(decoded, decodedChunk...)
 }

 return decoded, true
}

The range ascii syntax processes the byte slice in SPMD fashion. The algorithm’s cross-lane operations use *Within functions with group size 4 to maintain the correct 4-to-3 byte transformation pattern regardless of the hardware’s actual SIMD width.

The Three Core Cross-Lane Operations

Base64 decoding requires three fundamental cross-lane communication patterns. Let’s understand each before seeing how they work together:

1. Swizzle: Parallel Table Lookups

// Each lane indexes into a shared lookup table within groups of 8
offsetTable := []byte{255, 16, 19, 4, 191, 191, 185, 185}
offsets := lanes.SwizzleWithin(lanes.From(offsetTable), hashes, 8)

lanes.SwizzleWithin lets each lane access any position in a shared array based on its computed index, within groups of a given size. Lane 0 might read position 3, lane 1 might read position 6, and so on.

This enables parallel table lookups where each lane accesses different data simultaneously within each group. The group size parameter ensures correct behavior regardless of hardware SIMD width.

2. Rotation Within Groups: Neighboring Data Exchange

// Lane N receives data from lane N-1 within each group of 4
decodedChunks := shiftedLo | lanes.RotateWithin(shiftedHi, 1, 4)

lanes.RotateWithin shifts data between adjacent lanes within groups of a given size. With a group size of 4 and rotation of 1, data rotates within each 4-lane group independently, letting the same algorithm work regardless of hardware SIMD width.

Base64’s 6-to-8 bit conversion creates bit patterns that span lane boundaries. Each lane needs bits from its neighbor to form complete bytes.

3. Output Pattern: Selective Extraction

func outputPattern() lanes.Varying[uint8] {
 count := lanes.Count[uint8]()
 var r lanes.Varying[uint8]
 go for i := range count {
  r[i] = uint8(i + i/3) // Creates: [0,1,2,4,5,6,8,...]
 }
 return r
}

// Use pattern to select final bytes within each group of 4
output := lanes.SwizzleWithin(decodedChunks, pattern, 4)

The pattern [0,1,2,4] selects specific positions from the decoded data, skipping every fourth element. For larger lane counts this extends to [0,1,2,4,5,6,8,9,10,12,...].

This pattern automatically handles “contamination” from rotation operations that cross group boundaries: contaminated data lands in positions that get discarded anyway.

The Complete Algorithm

These operations work together in the complete decodeChunk function:

func decodeChunk(ascii lanes.Varying[byte], pattern lanes.Varying[uint8]) ([]byte, bool) {
 // Step 1: Perfect hash function for table indexing
 hashes := lanes.ShiftRight(ascii, 4)
 if ascii == '/' {
  hashes += 1
 }

 // Step 2: Convert ASCII to 6-bit values via table lookup (Swizzle)
 offsetTable := []byte{255, 16, 19, 4, 191, 191, 185, 185}
 offsets := lanes.SwizzleWithin(lanes.From(offsetTable), hashes, 8)
 sextets := ascii + offsets

 // Step 3: Validate characters using parallel lookups (SwizzleWithin + Reduction)
 loLUT := lanes.From([]byte{
  0b10101, 0b10001, 0b10001, 0b10001, 0b10001, 0b10001, 0b10001, 0b10001,
  0b10001, 0b10001, 0b10011, 0b11010, 0b11011, 0b11011, 0b11011, 0b11010,
 })
 hiLUT := lanes.From([]byte{
  0b10000, 0b10000, 0b00001, 0b00010, 0b00100, 0b01000, 0b00100, 0b01000,
  0b10000, 0b10000, 0b10000, 0b10000, 0b10000, 0b10000, 0b10000, 0b10000,
 })

 lo := lanes.SwizzleWithin(loLUT, ascii&0x0f, 16)
 hi := lanes.SwizzleWithin(hiLUT, lanes.ShiftRight(ascii, 4), 16)
 valid := reduce.Or(lo&hi) == 0

 // Step 4: Pack 6-bit values into bytes with cross-lane coordination (RotateWithin)
 shiftPattern := lanes.From([]uint16{2, 4, 6, 8})
 shifted := lanes.ShiftLeftWithin(sextets, shiftPattern, 4)

 shiftedLo := lanes.Varying[byte](shifted)
 shiftedHi := lanes.Varying[byte](lanes.ShiftRight(shifted, 8))
 decodedChunks := shiftedLo | lanes.RotateWithin(shiftedHi, 1, 4)

 // Step 5: Extract final 3 bytes using output pattern (SwizzleWithin)
 output := lanes.SwizzleWithin(decodedChunks, pattern, 4)
 return []byte(output), valid
}

Why Within-Group Operations Matter

Notice the *Within operations with group size 4 throughout the code. Base64’s 4-to-3 conversion requires operations that respect 4-element group boundaries. The *Within cross-lane operations work correctly regardless of SIMD width because:

The Output patterns maintain the 4:3 ratio across lane groups, discarding the fourth byte that could be “contaminated” by rotation or shift operation. This pattern is what makes the algorithm work for any SIMD width.

By using lanes.RotateWithin(value, offset, 4) and lanes.SwizzleWithin(value, indices, 4), we tell the compiler: “Perform these operations within groups of 4 lanes.” If the hardware has 4 lanes, one group is processed. If it has 16 lanes, four groups are processed simultaneously. The group size parameter ensures correct behavior regardless of hardware.

This lets a single codebase support all hardware, relying on the compiler for portability, readability, and maintainability without sacrificing performance.

The Cost of Communication

Cross-lane operations are expensive compared to independent lane operations. Simple arithmetic is fastest (each lane independent), shuffles and swizzles are moderately expensive (lanes access arbitrary positions), rotations and shifts are also fast (each lane still independent, just redirecting output), and reductions can be expensive depending on the operation.

But these costs enable algorithms impossible with purely independent processing. Base64 decoding with cross-lane communication can be significantly faster than scalar alternatives. The edge case of generating code for a computer with no SIMD will be interesting to see if it impacts performance compared to current code.

The Complexity Question

Cross-lane operations can replace hand-written assembly with portable Go code. Is the added complexity worth it?

Portability across SIMD widths and architectures, speedups for data transformation work, and no platform-specific assembly to maintain. On the other hand: code that’s harder to review, developers who need to understand lane interactions, and bugs that are more subtle than simple arithmetic errors.

The Minimal Alternative

Do we need the full suite of cross-lane operations, or would reduction alone cover most practical use cases?

// Simple parallel processing with reduction
sum := reduce.Add(data * coefficients)
anyInvalid := reduce.Or(validation_results)
maximum := reduce.Max(lane_values)

Reduction is easier to understand, broadly applicable (many algorithms only need to combine results), and less error-prone than rotation or swizzle operations.

The Design Decision

The base64 example shows what’s possible with full cross-lane communication, but algorithmic complexity is real. For a Go SPMD extension, the choice might be: full suite (swizzle, rotation, reduction), reduction only, or a gradual introduction starting with reduction and adding others based on demonstrated need.

The question is whether Go developers would adopt and correctly use these more complex operations, or if the cognitive overhead outweighs the performance benefits.

View Complete Source Code - Full implementation with usage examples and detailed algorithm explanations.

Implementation inspired by Miguel Young de la Sota’s excellent analysis in “Designing a SIMD Algorithm from Scratch”, adapted for hypothetical SPMD Go extension.

Previous in series: What if? Practical parallel data. - Learn basic SPMD patterns with simple string operations.

Next in series: Putting It All Together - See SPMD concepts applied to real-world IPv4 parsing performance.