Putting It All Together

By Cedric Bail

July 13, 2025

Warning

Historical note. This post is a thought experiment that predates the actual TinyGo SPMD compiler. Most of the building blocks now exist, but several specific claims in this post no longer match the working compiler. Writing into a uniform [16]bool array from inside a go for (dotMask[i] = c == '.') is a per-lane scatter that the compiler treats as an anti-pattern (lane-count-dependent and not portable across SIMD widths). Incrementing a uniform dotMaskTotal from inside a varying conditional is also lane-count-dependent; the portable form is reduce.Add(boolToInt(varyingDot)). reduce.FindFirstSet is currently a stub. reduce.Mask returns a bitmask, but its shape (16-bit vs wider) depends on the active lane count, not the hardware register width. The IPv4 parser was implemented and benchmarked on x86-64; measured throughput plateaued at roughly 0.58x of the scalar implementation because of an inherent SPMD overhead in non-loop scenarios. For the patterns that actually deliver, see Writing SPMD Go and SPMD Results.

Network address parsing is ubiquitous in Go applications, yet the standard library implementations process strings character by character, leaving significant performance on the table. In this post, we’ll combine the SPMD concepts from our previous posts to build a high-performance IPv4 parser inspired by Wojciech Muła’s SIMD research.

This post demonstrates how SPMD Go could keep code readable while significantly improving performance by applying the techniques we’ve explored: parallel processing, reduction operations, and cross-lane communication. The example is far less complex than something like base64 and, in my opinion, shows the benefit of language-level support for parallel data manipulation.

The Research Foundation

Wojciech Muła’s work on SIMD-ized IPv4 parsing demonstrates that clever parallel algorithms can achieve 2-3x performance improvements over traditional parsing. His approach uses several key insights:

16-byte parallel processing: Loading entire IPv4 strings into SIMD registers
Dot mask generation: Using parallel comparisons to create bitmasks of dot positions
Pattern-based field extraction: Leveraging precomputed lookup tables for field boundaries
Parallel digit conversion: Processing all four octets simultaneously

Our SPMD Go implementation adapts these techniques while staying readable and idiomatic Go. It should be understandable without any knowledge of SIMD assembly instructions.

The Traditional Sequential Approach

Go’s standard library processes IPv4 addresses character by character:

func parseIPv4Fields(in string, off, end int, fields []uint8) error {
    var val, pos int
    var digLen int
    s := in[off:end]
    for i := 0; i < len(s); i++ {
        if s[i] >= '0' && s[i] <= '9' {
            if digLen == 1 && val == 0 {
                return parseAddrError{in: in, msg: "IPv4 field has octet with leading zero"}
            }
            val = val*10 + int(s[i]) - '0'
            digLen++
            if val > 255 {
                return parseAddrError{in: in, msg: "IPv4 field has value >255"}
            }
        } else if s[i] == '.' {
            // Handle dot logic...
            fields[pos] = uint8(val)
            pos++
            val = 0
            digLen = 0
        } else {
            return parseAddrError{in: in, msg: "unexpected character"}
        }
    }
    return nil
}

This sequential approach, while correct and readable, processes one character at a time and can’t leverage modern CPU parallelism.
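
As a baseline for the rest of the post, here is a minimal sketch of how the standard library parser is reached from user code. The helper name parseStd is our own, hypothetical wrapper; only netip.ParseAddr, Addr.Is4 and Addr.As4 come from net/netip.

```go
package main

import (
	"errors"
	"net/netip"
)

// parseStd (hypothetical helper) parses s with the standard library and
// returns the four octets. netip.ParseAddr also accepts IPv6 input, so
// anything that is not a plain IPv4 address is rejected here.
func parseStd(s string) ([4]byte, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return [4]byte{}, err
	}
	if !addr.Is4() {
		return [4]byte{}, errors.New("not an IPv4 address")
	}
	return addr.As4(), nil
}
```

Calling parseStd("192.168.1.1") returns the octets 192, 168, 1, 1, while inputs such as "192.168.01.1" or "256.1.1.1" come back with an error. Any SPMD version should accept and reject the same inputs.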

The SPMD Transformation

Phase 1: Parallel Character Analysis

Our SPMD approach begins by analyzing all characters simultaneously using 16-lane processing:

func parseIPv4(s string) ([4]byte, error) {
    if len(s) < 7 || len(s) > 15 {
        return [4]byte{}, parseAddrError{in: s, msg: "IPv4 address string too short or too long"}
    }

    // Pad string to 16 bytes with null terminators (like SSE register)
    input := [16]byte{}
    copy(input[:], s)

    // Process all bytes in parallel
    var dotMask [16]bool
    var dotMaskTotal lanes.Varying[uint32]

    var loop int
    go for i, c := range input {
        dotMask[i] = c == '.'
        if dotMask[i] {
            dotMaskTotal++
        }
        digitMask := (c >= '0' && c <= '9')

        // Valid if dot, digit, or null (padding)
        validChars := dotMask[i] || digitMask || c == 0
        // The loop body continues in the Phase 2 listing, which also closes the loop.

This parallel analysis validates all characters simultaneously and creates boolean masks for dots and digits, a direct adaptation of Muła’s SIMD character classification.

The key insight here is the padding strategy: since we process the fixed-size [16]byte array in SPMD fashion, we need a consistent 16-byte input. The input := [16]byte{} creates a zero-initialized array, and copy(input[:], s) fills it with the IPv4 string, leaving trailing zeros as padding. The validation logic validChars := dotMask[i] || digitMask || c == 0 explicitly accepts null padding, making shorter IPv4 addresses work seamlessly with parallel processing. Note that dotMask is a regular [16]bool array – the parallel go for loop writes to it via scatter operations, and we use it later in a second go for loop to build the dot position bitmask.
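
To make the padding and classification concrete, the following is a plain Go emulation of this phase: an ordinary loop stands in for go for, and each mask is packed directly into a uint16. The function name classify16 is hypothetical. One deliberate difference from the listing: this sketch only accepts zero bytes in the padding region (i >= len(s)), on the assumption that a NUL byte embedded in the input itself should not count as valid.

```go
package main

// classify16 (hypothetical name) emulates the Phase 1 classification.
// Bit i of each mask describes byte i of the zero-padded 16-byte input.
// It assumes len(s) <= 16; the post checks for 7..15 before this step.
func classify16(s string) (dotMask, digitMask uint16, ok bool) {
	var input [16]byte // zero-initialised, so the unused tail is the padding
	copy(input[:], s)

	ok = true
	for i, c := range input {
		isDot := c == '.'
		isDigit := c >= '0' && c <= '9'
		// Assumption: NUL is only acceptable as padding after the string.
		isPad := c == 0 && i >= len(s)

		if isDot {
			dotMask |= 1 << i
		}
		if isDigit {
			digitMask |= 1 << i
		}
		if !(isDot || isDigit || isPad) {
			ok = false
		}
	}
	return dotMask, digitMask, ok
}
```

For "192.168.1.1" this yields a dot mask of 0x288 (bits 3, 7 and 9) and a digit mask of 0x577. Together they cover the 11 input bytes, and the five padding bytes contribute to neither mask.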

After this initial parallel validation phase, the algorithm continues with the original string s for precise boundary calculations and error reporting.

Phase 2: Reduction-Based Validation

We use reduction operations to aggregate validation results across all lanes:

        // Check character validity with precise error location
        if !reduce.All(validChars) {
            return [4]byte{}, parseAddrError{in: s, at: reduce.FindFirstSet(!validChars) + loop, msg: "unexpected character"}
        }
        loop += lanes.Count(c)
    }

    // Count dots using reduction
    dotCount := reduce.Add(dotMaskTotal)
    if dotCount != 3 {
        return [4]byte{}, parseAddrError{in: s, msg: "invalid dot count"}
    }

    // Create dot position bitmask (mimics _mm_movemask_epi8)
    var mask uint16
    loop = 0
    go for _, isDot := range dotMask {
        mask |= uint16(reduce.Mask(isDot)) << loop
        loop += lanes.Count(isDot)
    }

The reduce.Mask() operation is particularly elegant: it converts the varying boolean into a bitmask, directly paralleling SSE’s _mm_movemask_epi8 instruction. The loop variable tracks the bit offset across iterations, and lanes.Count(isDot) returns the number of lanes for the element type, ensuring the bitmask is built correctly regardless of SIMD width.
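
The width-independence claim can be checked with a scalar emulation. In this sketch, buildMask and its lanes parameter are hypothetical stand-ins: the inner loop plays the role of reduce.Mask on one chunk, and lanes plays the role of lanes.Count. It assumes the lane count divides 16.

```go
package main

// buildMask (hypothetical) emulates the chunked mask construction.
// Each pass handles `lanes` flags, packs them into a small bitmask,
// and shifts that bitmask to the chunk's bit offset in the result.
// Assumption: lanes is positive and divides 16.
func buildMask(flags [16]bool, lanes int) uint16 {
	var mask uint16
	for offset := 0; offset < len(flags); offset += lanes {
		var chunk uint16
		for lane := 0; lane < lanes; lane++ {
			if flags[offset+lane] {
				chunk |= 1 << lane
			}
		}
		mask |= chunk << offset
	}
	return mask
}
```

With flags set at indices 3, 7 and 9, the result is 0x288 whether the data is processed 4, 8 or 16 lanes at a time. Only the number of passes changes.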

Note the improved error reporting: reduce.FindFirstSet(!validChars) + loop locates the exact position of the first invalid character by combining the lane index with the iteration offset, providing precise error messages instead of generic failures. This demonstrates how reduction operations can enhance not just performance, but also debugging and user experience.
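
The position arithmetic can be sketched in plain Go as well. Here firstInvalid is a hypothetical helper that emulates reduce.FindFirstSet(!validChars) + loop using bits.TrailingZeros16 from math/bits. It reuses the character test from the listing and assumes the lane count divides 16.

```go
package main

import "math/bits"

// firstInvalid (hypothetical) returns the index of the first byte that is
// not a dot, a digit, or NUL, or -1 if there is none. For each chunk it
// builds a bitmask of invalid lanes; the lowest set bit is the lane index,
// and adding the chunk offset gives the position in the input.
func firstInvalid(s string, lanes int) int {
	var input [16]byte
	copy(input[:], s)

	for offset := 0; offset < len(input); offset += lanes {
		var invalid uint16
		for lane := 0; lane < lanes; lane++ {
			c := input[offset+lane]
			valid := c == '.' || (c >= '0' && c <= '9') || c == 0
			if !valid {
				invalid |= 1 << lane
			}
		}
		if invalid != 0 {
			return offset + bits.TrailingZeros16(invalid)
		}
	}
	return -1
}
```

For "192.168x1.1" the function returns 7 for any lane count that divides 16. With 4 lanes the hit is lane 3 of the second chunk (offset 4); with 16 lanes it is lane 7 of the only chunk.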

Phase 3: Race-Free Dot Position Extraction

Here we avoid a potential race condition by using a normal for loop and bit manipulation on the mask instead of having lanes compete to write positions. Sometimes you simply can’t do things in parallel:

    // Extract dot positions using bit manipulation
    var dotPositions [3]int
    for i := 0; i < 3; i++ {
        pos := bits.TrailingZeros16(mask)
        dotPositions[i] = pos
        mask &= mask - 1  // Clear lowest set bit
    }

    // Define field boundaries as separate arrays for efficient range processing
    starts := [4]int{0, dotPositions[0], dotPositions[1], dotPositions[2]}
    ends := [4]int{dotPositions[0], dotPositions[1], dotPositions[2], len(s)}

This approach eliminates race conditions while extracting dot positions in order, exactly as Wojciech Muła’s implementation does with bit manipulation.
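
As a worked example of the two bit tricks, the sketch below wraps the loop in a hypothetical helper, extractDots, so it can be run on a concrete mask. Both bits.TrailingZeros16 and the mask &= mask - 1 idiom are used exactly as in the listing.

```go
package main

import "math/bits"

// extractDots (hypothetical) returns the indices of the three lowest set
// bits of mask in ascending order.
func extractDots(mask uint16) [3]int {
	var pos [3]int
	for i := range pos {
		pos[i] = bits.TrailingZeros16(mask) // index of the lowest set bit
		mask &= mask - 1                    // clear that bit
	}
	return pos
}
```

For "192.168.1.1" the dot mask is 0x288, and the loop walks 0x288 → 0x280 → 0x200, producing positions 3, 7 and 9. If the mask has fewer than three bits set, bits.TrailingZeros16 returns 16 for the missing entries, which is why the dot count has to be checked before this step.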

Phase 4: Parallel Field Validation and Conversion

Now we process all four IPv4 octets in parallel, with each lane handling one field. Note the use of range: it gives the compiler precise information about the iteration count, enabling better optimization:

    // Validate field lengths in parallel
    go for i, start := range starts {
        end := ends[i]
        if i > 0 {
            start++ // Skip the dot
        }
        fieldLen := end - start
        if reduce.Any(fieldLen < 1 || fieldLen > 3) {
            return [4]byte{}, parseAddrError{in: s, msg: "invalid field length"}
        }
    }

    // Process all four fields in parallel
    var ip [4]byte

    go for field, start := range starts {
        end := ends[field]

        if field > 0 {
            start++ // Skip the dot
        }

        fieldLen := end - start
        var value int
        var hasLeadingZero bool

        // Convert field using optimized digit processing
        switch fieldLen {
        case 1:
            value = int(s[start] - '0')
        case 2:
            d1 := int(s[start] - '0')
            d0 := int(s[start+1] - '0')
            value = d1*10 + d0
            hasLeadingZero = (d1 == 0)
        case 3:
            d2 := int(s[start] - '0')
            d1 := int(s[start+1] - '0')
            d0 := int(s[start+2] - '0')
            value = d2*100 + d1*10 + d0
            hasLeadingZero = (d2 == 0)
        }

        // Validation: check each error condition across all lanes
        if reduce.Any(hasLeadingZero) {
            return [4]byte{}, parseAddrError{in: s, msg: "IPv4 field has octet with leading zero"}
        }
        if reduce.Any(value > 255) {
            return [4]byte{}, parseAddrError{in: s, msg: "IPv4 field has value >255"}
        }
        ip[field] = uint8(value)
    }

    return ip, nil
}

This parallel field processing mirrors Muła’s SSE_CONVERT_MAX1/2/3 macros, handling different field lengths efficiently while maintaining full validation.
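
For readers who want to run the per-field logic today, here is a scalar sketch of the same computation, with an ordinary for loop standing in for go for so that each iteration corresponds to one lane. The names convertField and convertAll are hypothetical. The sketch assumes the input already passed character validation and that dots holds three ascending dot positions inside s.

```go
package main

// convertField (hypothetical) converts the 1-3 digit field s[start:end]
// and reports whether a multi-digit field starts with '0'.
// Assumption: the bytes were already validated as ASCII digits.
func convertField(s string, start, end int) (value int, leadingZero bool) {
	switch end - start {
	case 1:
		value = int(s[start] - '0')
	case 2:
		value = int(s[start]-'0')*10 + int(s[start+1]-'0')
		leadingZero = s[start] == '0'
	case 3:
		value = int(s[start]-'0')*100 + int(s[start+1]-'0')*10 + int(s[start+2]-'0')
		leadingZero = s[start] == '0'
	}
	return value, leadingZero
}

// convertAll (hypothetical) applies convertField to the four fields
// delimited by dots. ok is false as soon as any field fails a check.
func convertAll(s string, dots [3]int) (ip [4]byte, ok bool) {
	starts := [4]int{0, dots[0] + 1, dots[1] + 1, dots[2] + 1}
	ends := [4]int{dots[0], dots[1], dots[2], len(s)}

	for field, start := range starts {
		n := ends[field] - start
		if n < 1 || n > 3 {
			return [4]byte{}, false // empty or over-long field
		}
		value, leadingZero := convertField(s, start, ends[field])
		if leadingZero || value > 255 {
			return [4]byte{}, false
		}
		ip[field] = byte(value)
	}
	return ip, true
}
```

Unlike the listing, this sketch folds the skip-the-dot adjustment into starts, which keeps the loop body free of the field > 0 branch. Whether that helps a real SPMD compiler is untested.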

Compiler Optimization: Array Range

The use of go for field, start := range starts is a subtle but important optimization. When ranging over a fixed-size array, the compiler knows the exact iteration count at compile time, enabling:

Loop unrolling: The compiler can unroll the loop entirely, generating direct code for each iteration
Better instruction scheduling: With known bounds, the compiler can optimize instruction ordering
Eliminated bounds checks: No runtime checks needed when array size is compile-time constant
Eliminated iteration: In this case especially, the four iterations fit in a single SIMD register on most architectures, so the loop itself is unnecessary and can be removed.

This represents a key principle for SPMD Go: give the compiler as much compile-time information as possible to enable maximum optimization.

Enhanced Error Reporting Through Reduction Operations

One significant advantage of expressing SPMD in Go itself is improved error reporting. With SIMD intrinsics or assembly, it is harder to keep track of proper error handling. With this proposal, reporting errors precisely feels more natural, for example the location of an invalid character when validating the initial content of the string:

// Instead of generic "unexpected character"
if !reduce.All(validChars) {
    return [4]byte{}, parseAddrError{in: s, at: reduce.FindFirstSet(!validChars) + loop, msg: "unexpected character"}
}

// And precise field error reporting using reduce.Any
if reduce.Any(hasLeadingZero) {
    return [4]byte{}, parseAddrError{in: s, msg: "IPv4 field has octet with leading zero"}
}
if reduce.Any(value > 255) {
    return [4]byte{}, parseAddrError{in: s, msg: "IPv4 field has value >255"}
}

The reduce.Any() and reduce.FindFirstSet() operations efficiently detect error conditions across all lanes simultaneously. Since reduce.Any() returns a uniform bool, the return statement is under a uniform condition: all lanes agree on whether to return, which makes it a valid early exit in the go for loop. This demonstrates how parallel processing in Go can keep error handling simple and idiomatic.
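
The idea of a uniform decision can be illustrated without any SPMD support. In this sketch, anyLane is a hypothetical scalar stand-in for reduce.Any: it collapses one boolean per lane into a single bool, so the caller makes one decision for all four fields at once.

```go
package main

import "errors"

// anyLane (hypothetical stand-in for reduce.Any) reports whether at least
// one lane is true. The result is a single value shared by all lanes.
func anyLane(lanes [4]bool) bool {
	for _, v := range lanes {
		if v {
			return true
		}
	}
	return false
}

// checkOctets (hypothetical) computes the per-lane condition first, then
// makes a single return decision based on the reduced result.
func checkOctets(values [4]int) error {
	var tooBig [4]bool
	for lane, v := range values {
		tooBig[lane] = v > 255
	}
	if anyLane(tooBig) {
		return errors.New("IPv4 field has value >255")
	}
	return nil
}
```

The error is reported if any single field is out of range, but the code never branches per field on whether to return. That is the property that makes the early exit valid inside go for.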

Performance Implications

This SPMD approach offers several advantages over traditional parsing:

Character-level parallelism: All characters validated simultaneously
Field-level parallelism: All four octets processed in parallel
Reduced branching: Structured validation reduces conditional branches
Cache efficiency: Better memory access patterns
Instruction-level parallelism: Multiple operations execute simultaneously

Based on Muła’s research, we could expect a 2-3x performance improvement on IPv4 parsing. The code is far less complex than an intrinsics-based version would be, but in exchange we rely on the compiler to perform all of these optimizations.

ISPC and Mojo have shown that this is doable, but there is a lot of work to get there. A proof of concept could likely be built more easily with TinyGo, which uses LLVM as ISPC does, and would validate the concept.

The Complexity Trade-off

While this SPMD approach offers significant performance benefits, it also raises important questions we explored in our cross-lane communication analysis:

Increased complexity: The code is harder to understand than sequential parsing
Debugging challenges: Parallel bugs are more subtle than sequential ones
Maintenance overhead: Requires understanding of SIMD concepts

However, in this example, the resulting code is readable and maintainable. With a proper benchmark and a Go profiler able to attribute the results correctly, I think many developers could manage to write this kind of code. This is the main potential justification for such an addition: if most developers can write data-parallel code and we democratize writing high-performance code, it is worth it. If not, leaving intrinsics and assembly to the engineers who can handle them might actually be the better way forward. What do you think?

View Complete Source Code - Full implementation with usage examples and detailed comments.

References

Wojciech Muła’s SIMD IPv4 parsing research - The foundational research this implementation is based on
Muła’s IPv4 parsing implementation - Complete SSE implementation and benchmarks
Practical Vector Processing in Go - Introduction to SPMD concepts and reduce operations
Cross-Lane Communication - Deep dive into advanced SPMD patterns and race condition solutions
Go’s net/netip package - The traditional IPv4 parsing implementation

Previous in series: Cross-Lane Communication: When Lanes Need to Talk - Understanding complex cross-lane operations and their trade-offs.

This concludes our SPMD Go blog series. We’ve explored the theoretical foundations, practical applications, advanced communication patterns, and real-world performance implementations.