SIMD Myths Busted: Block Loads Steal the Show in This Mind-Blowing String Search Benchmark!
Extended Benchmarks Across x86-64 Architectures
Motivation
Imagine you’re tasked with finding a single character in a massive string—how fast can you make it? In his thought-provoking post, Daniel Lemire argued that SIMD instructions are crucial, showing how a naive search crawls at 4 GB/s while a SIMD-powered simdutf::find rockets past 110 GB/s.
But is SIMD the whole story? I expanded his benchmark, adding GLIBC’s blazing memchr, small-string tests, and custom functions loading up to 64 bytes at once, all run on an AMD ThreadRipper across Westmere, Broadwell, and Zen 2 architectures.
What I found flips the script: while SIMD shines, it’s the art of block loads, not just fancy vector tricks, that drives jaw-dropping speed. Dive in for a deep, data-driven look at what really makes string searches scream, crafted for developers who live for performance.
Introduction
In a recent blog post titled "Why do we even need SIMD instructions?", Daniel Lemire presented a strong argument for the role of Single Instruction, Multiple Data (SIMD) operations in contemporary computing.
He illustrated this through the straightforward task of locating the first occurrence of a character in a string, showing how a basic byte-by-byte search maxes out at roughly 4 GB/s on his Apple M4 processor. In contrast, a SIMD-optimized function from the simdutf library soared beyond 110 GB/s for inputs in the kilobyte range.
Lemire's takeaway was clear: SIMD is indispensable for high-performance tasks, particularly when the goal is to handle data at speeds surpassing typical disk throughput, such as over 5 GB/s on everyday hardware.
Methodology
Lemire's comparisons focused on three methods: a simple naive_find loop that checks each byte in sequence; std::find from the C++ standard library, which the compiler might refine but generally mirrors the naive approach; and simdutf::find, a specialized routine from the simdutf library that harnesses SIMD intrinsics for simultaneous byte comparisons.
// Naive implementation of find
const char* naive_find(const char* start, const char* end,
char character) {
while (start != end) {
if (*start == character) {
return start;
}
++start;
}
return end;
}
Lemire's results underscored how SIMD slashes the instruction count per byte, rendering it essential for performance-intensive code.
Yet, this raises questions about the broader picture. Standard libraries like GLIBC already embed SIMD optimizations, and handmade implementations could potentially close the gap without relying on external libraries. Performance also shifts depending on input length, hardware architecture, and vector register sizes.
To explore these aspects, I expanded Lemire's benchmark, with the updated code available on my fork on GitHub. I incorporated memchr from GLIBC, a finely tuned C library function for byte searches in memory that employs SIMD internally. I also introduced smaller input sizes—4, 16, 128, and 1024 bytes—to evaluate overhead on brief strings.
Furthermore, I developed four new search functions that load data in bigger segments rather than byte by byte: find64 handles 8 bytes at a time with 64-bit scalar loads, findvec4x64 processes 32 bytes as four 64-bit lanes using GCC vector extensions, findvec32x8 processes the same 32 bytes as thirty-two 8-bit lanes, and findvec64x8 tackles 64 bytes with 512-bit vectors.
The benchmarks were compiled and executed for three distinct architectures: Westmere as the final pre-256-bit SIMD era, Broadwell marking the end before 512-bit support, and Zen 2 representing AMD's AVX2 generation.
All executions occurred on an AMD ThreadRipper 3960X running Zen 2, using GCC's -march flag to emulate the constraints of each target architecture.
Understanding the Search Functions
Each function aims to identify the first instance of a character - in this case, '=' - within a randomly generated ASCII string, returning a pointer to its position or the end if absent.
To mimic a worst-case scenario requiring a full scan, the target character is appended at the string's conclusion. Throughput is measured in GB/s, derived from averages across multiple runs via Lemire's event counter framework.
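As a concrete illustration, a worst-case input can be generated along the lines of the sketch below; this is a simplified stand-in for the benchmark's actual setup code, and the function name, seed, and character range are my own choices, not taken from the repository.
#include <cstddef>
#include <random>
#include <string>

// Build a random ASCII haystack whose only '=' is the final byte, forcing
// every search function to scan the entire buffer before finding a match.
std::string make_worst_case_input(std::size_t size) {
    std::string s(size, 'a');
    if (size == 0) return s;
    std::mt19937 rng(42);                                  // arbitrary fixed seed
    std::uniform_int_distribution<int> letters('a', 'z');  // never produces '='
    for (std::size_t i = 0; i + 1 < size; ++i) {
        s[i] = static_cast<char>(letters(rng));
    }
    s[size - 1] = '=';                                     // the needle at the end
    return s;
}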
The baseline approaches include naive_find, the elementary loop from Lemire's article that processes one byte per iteration and demands around six instructions per byte, encompassing loads, comparisons, branches, and increments, capping throughput at about 4 GB/s on a 4 GHz processor due to instruction retirement limits.
Similarly, std::find from the algorithm header offers comparable performance, though compilers may apply loop unrolling or minor tweaks without yielding dramatic improvements.
Then there's simdutf::find from the simdutf library, which adapts platform-specific SIMD intrinsics—like NEON for ARM or AVX for x86—to evaluate 16 or more bytes concurrently, originally tailored for UTF-8/16 validation but effective here as well.
Among the additions, memchr stands out as GLIBC's POSIX-compliant implementation, optimized with architecture-tailored assembly code such as SSE2 or AVX2 on x86 platforms, managing alignment and using bitwise operations for match detection, serving as a benchmark standard especially for extended strings.
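To keep the comparison apples-to-apples, memchr presumably needs a thin adapter to the (start, end, character) interface used by the other functions; a minimal wrapper could look like the following (the wrapper name is mine, not the repository's).
#include <cstddef>
#include <cstring>

// Adapter giving memchr the same interface as the other search functions:
// return a pointer to the first match, or end when the character is absent.
const char* memchr_find(const char* start, const char* end, char character) {
    const void* hit = std::memchr(start, (unsigned char)character,
                                  static_cast<std::size_t>(end - start));
    return hit ? static_cast<const char*>(hit) : end;
}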
My custom chunked functions emphasize loading larger data units and employing bit manipulation to spot matches without per-byte loops, while manually addressing alignment for cross-platform reliability.
find64
The find64 function, implemented in the benchmark code to locate the first occurrence of a specified character within a string, operates by processing data in 8-byte (64-bit) chunks using scalar operations, offering a significant improvement over the byte-by-byte approach of naive_find.
Designed to enhance performance, it takes three parameters: a pointer to the start of the string, a pointer to its end (exclusive), and the target character. It returns a pointer to the first occurrence of the character or the end pointer if no match is found.
The function achieves efficiency through a combination of 64-bit word processing, clever bit manipulation, and careful handling of memory alignment.
The function begins by initializing a working pointer to the start of the string and setting up three 64-bit constants. The first, repeat_one, is a 64-bit value with each of its eight bytes set to 0x01, forming 0x0101010101010101ULL. The second, repeat_char, replicates the target character across all eight bytes by multiplying the character's 8-bit value by repeat_one (e.g., for character 'a' or 0x61, this yields 0x6161616161616161ULL). The third, high_bits, is a mask with the most significant bit set in each byte (0x8080808080808080ULL), used later for match detection. These constants enable the function to compare eight bytes simultaneously using bitwise operations.
To ensure safe 64-bit reads, the function first checks whether the starting pointer is aligned to an 8-byte boundary by computing the misalignment via (uintptr_t)p % 8. If the pointer is unaligned, it processes up to 8 - misalignment leading bytes individually in a scalar loop, checking each byte against the target character and advancing the pointer until either a match is found or alignment is achieved. This step is crucial for portability, as some architectures penalize unaligned memory accesses, though the x86 test platform (AMD ThreadRipper 3960X) is more forgiving.
Once aligned, the function enters its main loop, which processes 8-byte chunks as long as at least eight bytes remain before the end pointer. In each iteration, it loads eight bytes from the current pointer into a 64-bit word, interpreted as a uint64_t. For example, a little-endian load of the string "stuvwxyz" yields the word 0x7A79787776757473ULL.
The function then XORs this word with repeat_char, producing a result where a byte matching the target character becomes 0x00, while non-matching bytes yield non-zero values. For instance, if searching for 'u' (0x75), XORing with 0x7575757575757575ULL zeros the byte corresponding to 'u'. To detect these zero bytes, the function employs a bit manipulation technique: it subtracts repeat_one from the XOR result, causing matching (zero) bytes to underflow and set their most significant bit to 1 (e.g., 0x00 - 0x01 = 0xFF), while non-matching bytes keep their MSB at 0. This result is ANDed with the bitwise NOT of the XOR result and then with high_bits, isolating the MSB of each byte. A non-zero result indicates at least one match in the 8-byte chunk.
When a match is detected, a scalar loop scans the eight bytes to pinpoint the first occurrence, returning the corresponding pointer. If no match is found, the pointer advances by eight bytes for the next iteration.
After the main loop, any remaining bytes (0 to 7) are handled in a final scalar loop, checking each byte individually until a match is found or the end is reached, at which point the end pointer is returned. This tail handling ensures correctness for strings not perfectly divisible by eight.
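To see the zero-byte trick in isolation, here is a small standalone program (an illustration of the technique, not part of the benchmark) that applies the XOR, subtract, and AND steps to a single 64-bit word and reports which byte matched.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    const uint64_t repeat_one = 0x0101010101010101ULL;
    const uint64_t high_bits  = 0x8080808080808080ULL;

    // Load "stuvwxyz" into a 64-bit word; memcpy sidesteps alignment issues.
    const char bytes[8] = {'s', 't', 'u', 'v', 'w', 'x', 'y', 'z'};
    uint64_t word;
    std::memcpy(&word, bytes, sizeof word);

    // Search for 'u': the matching byte becomes 0x00 after the XOR,
    // and the subtract/AND steps turn that into an 0x80 flag in its byte.
    const uint64_t repeat_char = (uint64_t)(uint8_t)'u' * repeat_one;
    uint64_t xor_val   = word ^ repeat_char;
    uint64_t has_match = (xor_val - repeat_one) & ~xor_val & high_bits;

    for (int i = 0; i < 8; ++i) {
        if ((has_match >> (i * 8)) & 0x80) {
            // On the little-endian x86 test machine this prints "match at byte 2".
            std::printf("match at byte %d\n", i);
        }
    }
    return 0;
}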
The efficiency of find64 stems from its use of 8-byte block loads, which drastically reduce memory accesses and instruction count compared to naive_find’s per-byte approach, which incurs around six instructions per byte (load, compare, branch, increment).
By processing eight bytes with a single load and a few bitwise operations, find64 minimizes CPU cycles, achieving up to 10 GB/s in the benchmarks compared to naive_find’s 3-4 GB/s.
The bit manipulation technique, inspired by algorithms like those in memchr, allows rapid match detection without explicit per-byte comparisons, while alignment handling ensures portability. However, it remains scalar, lacking the parallelism of SIMD-based approaches like memchr or simdutf::find, which explains its performance ceiling compared to their 100+ GB/s peaks for larger inputs.
const char* find64(const char* start,
const char* end,
char character) {
const char* p = start;
// Replicated constants for the byte-wise check
const uint64_t repeat_one = 0x0101010101010101ULL;
const uint64_t repeat_char = (uint64_t)(uint8_t)character * repeat_one;
const uint64_t high_bits = 0x8080808080808080ULL;
// Handle alignment to 8-byte boundary for safe unaligned access avoidance
// (though on x86 it's fine, this ensures portability)
uintptr_t misalignment = (uintptr_t)p % 8;
if (misalignment != 0) {
size_t align_steps = 8 - misalignment;
for (size_t i = 0; i < align_steps && p < end; ++i) {
if (*p == character) {
return p;
}
++p;
}
}
// Main loop: process 8 bytes at a time
while (p + 8 <= end) {
uint64_t word = *(const uint64_t*)p;
uint64_t xor_val = word ^ repeat_char;
uint64_t has_match = (xor_val - repeat_one) &
~xor_val & high_bits;
if (has_match != 0) {
// A match in this 8-byte chunk; scan to find the first one
for (int i = 0; i < 8; ++i) {
if (p[i] == character) {
return p + i;
}
}
}
p += 8;
}
// Handle remaining bytes (0 to 7)
while (p < end) {
if (*p == character) {
return p;
}
++p;
}
return end;
}
findvec4x64
The findvec4x64 function, implemented in the benchmark code to locate the first occurrence of a specified character within a string, is designed to process data in 32-byte (256-bit) chunks using vector operations, offering a significant performance boost over scalar approaches like naive_find or even find64. It leverages GCC's vector extensions to harness SIMD capabilities, specifically targeting architectures with 256-bit vector support, such as Broadwell or Zen 2. Unlike naive_find, which processes one byte at a time, or find64, which handles 8-byte chunks, findvec4x64 processes 32 bytes per iteration using a 256-bit vector type defined as vec4x64, which holds four 64-bit integers. This approach reduces memory accesses and leverages SIMD parallelism to check multiple bytes simultaneously, with alignment handling and a scalar fallback for remaining bytes.
The function begins by initializing a working pointer p to the start of the string. It defines three 64-bit constants, similar to those in find64, to facilitate byte-wise comparisons within each 64-bit lane of the 256-bit vector: repeat_one is 0x0101010101010101ULL, with each byte set to 0x01; repeat_char replicates the target character across all eight bytes of a 64-bit value by multiplying the character's 8-bit value by repeat_one (e.g., for 'a' or 0x61, this yields 0x6161616161616161ULL); and high_bits, 0x8080808080808080ULL, has the most significant bit set in each byte for match detection. These constants are used to construct 256-bit vectors that apply the same bit manipulation technique across four 64-bit lanes.
To ensure safe 256-bit vector reads, the function checks whether the starting pointer is aligned to a 32-byte boundary by computing (uintptr_t)p % 32. If unaligned, it processes up to 32 - misalignment leading bytes one by one in a scalar loop, checking each byte against the target character and advancing the pointer until a match is found or alignment is achieved.
The main loop processes 32 bytes at a time while at least 32 bytes remain before the end pointer. It loads 32 bytes from the current pointer into a vec4x64 vector, which is a 256-bit type holding four 64-bit integers (e.g., for a string segment, it might load four 8-byte chunks like "stuvwxyz", "01234567", etc.).
The function constructs three 256-bit vectors: charvec with repeat_char in each of its four 64-bit lanes, onevec with repeat_one, and highvec with high_bits.
It then performs a vectorized XOR operation, word ^ charvec, producing a 256-bit result where bytes matching the target character become 0x00, and non-matching bytes are non-zero. For example, if searching for 'u' (0x75), XORing zeros the byte corresponding to 'u' in each 64-bit lane.
To detect matches, the function applies a bit manipulation technique across all four lanes: it subtracts onevec from the XOR result, causing matching (zero) bytes to underflow and set their most significant bit to 1 (e.g., 0x00 - 0x01 = 0xFF), while non-matching bytes retain an MSB of 0. This result is ANDed with the bitwise NOT of the XOR result and then with highvec, isolating the MSB of each byte.
The resulting has_match vector has 0x80 in bytes where matches occur within each 64-bit lane. The function aggregates a scalar mask by ORing the four 64-bit lanes of has_match. If mask is non-zero, at least one match exists in the 32-byte chunk.
When a match is detected, the function iterates over the four 64-bit lanes. For each lane where has_match[i] is non-zero, it points to the corresponding 8-byte block (p + i * 8) and scans its eight bytes individually to find the first match, returning the pointer to that position. This nested loop ensures the earliest match is returned, maintaining correctness. If no match is found, the pointer advances by 32 bytes for the next iteration.
After the main loop, any remaining bytes (0 to 31) are processed in a scalar loop, checking each byte individually until a match is found or the end is reached, returning the end pointer if no match is found. This tail handling ensures correctness for strings not divisible by 32.
Why It’s Efficient
The efficiency of findvec4x64 comes from its use of 256-bit vector loads, processing 32 bytes per iteration, which drastically reduces memory accesses and instruction count compared to naive_find's byte-by-byte approach (requiring ~6 instructions per byte) or even find64's 8-byte chunks. By leveraging SIMD operations via GCC's vector extensions, it performs four 64-bit comparisons in parallel, applying the same bit manipulation technique as find64 but across a wider data path. Benchmarks show findvec4x64 reaching roughly 50-59 GB/s on Broadwell and Zen 2 for larger inputs, compared to find64's 10 GB/s and naive_find's 3-4 GB/s.
The bit trick (XOR, subtract, AND) minimizes per-byte operations, and the vectorized approach exploits hardware parallelism on AVX2-capable architectures. However, on Westmere (128-bit SSE max), performance drops to 37-40 GB/s due to emulation of 256-bit operations, and alignment overhead impacts small inputs (1-2 GB/s for 4-16 bytes). Compared to memchr (up to 174 GB/s), findvec4x64 is less optimized but demonstrates that vector extensions can achieve competitive SIMD performance without explicit intrinsics, provided the target architecture supports 256-bit vectors natively.
// Define 256-bit vector type for four uint64_t lanes (32 bytes)
typedef uint64_t vec4x64 __attribute__((vector_size(32)));
const char* findvec4x64(const char* start,
                        const char* end,
                        char character) {
  const char* p = start;
  // Replicated 64-bit constants for the byte-wise zero-detection trick
  const uint64_t repeat_one = 0x0101010101010101ULL;
  const uint64_t repeat_char = (uint64_t)(uint8_t)character * repeat_one;
  const uint64_t high_bits = 0x8080808080808080ULL;
  // Broadcast the constants across the four 64-bit lanes
  const vec4x64 charvec = {repeat_char, repeat_char, repeat_char, repeat_char};
  const vec4x64 onevec = {repeat_one, repeat_one, repeat_one, repeat_one};
  const vec4x64 highvec = {high_bits, high_bits, high_bits, high_bits};
  // Handle alignment to 32-byte boundary for vector loads
  uintptr_t misalignment = (uintptr_t)p % 32;
  if (misalignment != 0) {
    size_t align_steps = 32 - misalignment;
    for (size_t i = 0; i < align_steps && p < end; ++i) {
      if (*p == character) {
        return p;
      }
      ++p;
    }
  }
  // Main loop: process 32 bytes (4x uint64_t) at a time
  while (p + 32 <= end) {
    // Load 32 bytes as four 64-bit lanes
    vec4x64 word = *(const vec4x64*)p;
    // Bytes equal to the target become 0x00 after the XOR
    vec4x64 xor_val = word ^ charvec;
    // Zero bytes get their most significant bit set by the subtract/AND trick
    vec4x64 has_match = (xor_val - onevec) & ~xor_val & highvec;
    // Non-zero in any lane indicates at least one matching byte
    uint64_t mask = has_match[0] | has_match[1] |
                    has_match[2] | has_match[3];
    if (mask != 0) {
      // A match in this 32-byte chunk; scan lane by lane for the first one
      for (int i = 0; i < 4; ++i) {
        if (has_match[i] != 0) {
          const char* block = p + i * 8;
          for (int j = 0; j < 8; ++j) {
            if (block[j] == character) {
              return block + j;
            }
          }
        }
      }
    }
    p += 32;
  }
  // Handle remaining bytes (0 to 31)
  while (p < end) {
    if (*p == character) {
      return p;
    }
    ++p;
  }
  return end;
}
findvec32x8
The findvec32x8 function also works on 32-byte (256-bit) chunks, but it takes a simpler, byte-oriented view of the vector than findvec4x64.
The function begins by initializing a working pointer p to the start of the string. Unlike find64 and findvec4x64, which use 64-bit constants for byte-wise comparisons within 64-bit lanes, findvec32x8 operates directly on 8-bit elements within a 256-bit vector, simplifying the comparison logic by avoiding the need for bit manipulation tricks like those used in find64. The vec32x8 type is defined as a 256-bit vector holding 32 uint8_t elements, allowing direct byte comparisons using SIMD operations.
To ensure safe 256-bit vector reads, the function checks if the starting pointer is aligned to a 32-byte boundary in the same way as findvec4x64.
The main loop processes 32 bytes at a time while at least 32 bytes remain before the end pointer. It loads 32 bytes from the current pointer into a vec32x8 vector, which holds 32 8-bit elements (e.g., for a string segment "abcdefghijklmnopqrstuvwxyzABCDEF", it loads all 32 bytes as individual uint8_t values).
The function constructs a target vector with the search character replicated across all 32 elements (e.g., for 'a' or 0x61, target is a 256-bit vector with 0x61 in each byte). It then performs a vectorized equality comparison, data == target, producing a cmp vector where each byte is 0xFF (true) if the corresponding byte in data matches the character, or 0x00 (false) otherwise. For example, if searching for 'a', a leading 'a' in the chunk yields 0xFF in cmp[0], while the other bytes remain 0x00.
To determine whether and where a match occurred, the function builds a 32-bit mask by iterating over the 32 elements of cmp, reducing each to a single bit (0xFF becomes 1, 0x00 becomes 0), and shifting it into the corresponding position (mask |= ((uint32_t)(cmp[i] & 1)) << i). If mask is non-zero, at least one match exists in the 32-byte chunk.
The function then uses __builtin_ctz(mask) (count trailing zeros) to find the index of the first set bit, which corresponds to the position of the earliest match in the 32-byte block, and returns p plus that offset. If no match is found (mask == 0), the pointer advances by 32 bytes for the next iteration.
After the main loop, any remaining bytes (0 to 31) are processed in a scalar loop.
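The mask-to-offset step is easiest to see on its own; the toy program below (illustrative only, with a hand-made comparison result) shows how the per-byte match bits map to the offset of the first match via __builtin_ctz.
#include <cstdint>
#include <cstdio>

int main() {
    // Pretend bytes 5 and 20 of a 32-byte chunk matched: the comparison
    // vector would hold 0xFF there and 0x00 everywhere else.
    uint8_t cmp[32] = {0};
    cmp[5]  = 0xFF;
    cmp[20] = 0xFF;

    // Reduce each element to one bit and pack: bit i set <=> byte i matched.
    uint32_t mask = 0;
    for (int i = 0; i < 32; ++i) {
        mask |= ((uint32_t)(cmp[i] & 1u)) << i;
    }
    if (mask != 0) {
        // The lowest set bit is the earliest match; this prints "first match at 5".
        std::printf("first match at %d\n", __builtin_ctz(mask));
    }
    return 0;
}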
Why It’s Efficient
The efficiency of findvec32x8 derives from its use of 256-bit vector loads, processing 32 bytes per iteration, which significantly reduces memory accesses and instruction count compared to naive_find's byte-by-byte approach (~6 instructions per byte) or find64's 8-byte chunks. By treating the 256-bit vector as 32 8-bit elements, it performs direct byte-wise comparisons using SIMD equality operations, avoiding the more involved bit manipulation (XOR, subtract, AND) used in find64 and findvec4x64. This direct approach simplifies the comparison logic while leveraging hardware parallelism on AVX2-capable architectures like Broadwell and Zen 2, where benchmarks show findvec32x8 reaching roughly 54-59 GB/s for larger inputs, compared to find64's 10 GB/s and naive_find's 3-4 GB/s.
The use of __builtin_ctz efficiently locates the first match, minimizing scalar post-processing. However, on Westmere (128-bit SSE max), performance drops to 41-48 GB/s due to emulation of 256-bit operations, and alignment overhead impacts small inputs (1-7 GB/s for 4-128 bytes).
Compared to memchr (up to 174 GB/s), findvec32x8 is less optimized but demonstrates that vector extensions can deliver competitive SIMD performance without explicit intrinsics, provided the architecture supports 256-bit vectors natively. Its byte-wise vector approach aligns closely with memchr’s strategy, contributing to its strong performance on compatible hardware.
// Define 256-bit vector type for 32 uint8_t (32 bytes)
typedef uint8_t vec32x8 __attribute__((vector_size(32)));
const char* findvec32x8(const char* start,
const char* end,
char character) {
const char* p = start;
// Broadcast character to all 32 bytes in a vector
vec32x8 target = {(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character};
// Handle alignment to 32-byte boundary for vector loads
uintptr_t misalignment = (uintptr_t)p % 32;
if (misalignment != 0) {
size_t align_steps = 32 - misalignment;
for (size_t i = 0; i < align_steps && p < end; ++i) {
if (*p == character) {
return p;
}
++p;
}
}
// Main loop: process 32 bytes at a time
while (p + 32 <= end) {
// Load 32 bytes as a vector
vec32x8 data = *(const vec32x8*)p;
// Compare for equality (produces 0xFF in matching bytes, 0x00 elsewhere)
vec32x8 cmp = data == target;
    // Build a 32-bit mask: bit i is set when byte i of the chunk matches
    uint32_t mask = 0;
    for (int i = 0; i < 32; ++i) {
      mask |= ((uint32_t)(cmp[i] & 1u)) << i;
    }
    if (mask != 0) {
      // The lowest set bit gives the offset of the first matching byte
      return p + __builtin_ctz(mask);
    }
p += 32;
}
// Handle remaining bytes (0 to 31)
while (p < end) {
if (*p == character) {
return p;
}
++p;
}
return end;
}
findvec64x8
The findvec64x8 function, implemented in the benchmark code to locate the first occurrence of a specified character within a string, processes data in 64-byte (512-bit) chunks using vector operations, aiming to surpass the performance of scalar methods like naive_find and find64, and even the 256-bit vector approaches findvec4x64 and findvec32x8.
It leverages GCC's vector extensions to exploit SIMD capabilities, targeting architectures with 512-bit vector support, though it was tested on an AMD ThreadRipper 3960X (Zen 2), which offers no AVX-512 at all, so the compiler lowers each 512-bit operation into multiple narrower (at most 256-bit) instructions.
By processing 64 bytes per iteration using a 512-bit vector type defined as vec64x8 (holding 64 8-bit integers), it minimizes memory accesses and maximizes SIMD parallelism, with scalar loops for alignment and tail handling to ensure correctness.
The function initializes a working pointer p to the start of the string. Unlike find64 and findvec4x64, which use 64-bit constants for byte-wise comparisons within 64-bit lanes, findvec64x8 operates directly on 8-bit elements within a 512-bit vector, treating it as 64 uint8_t elements.
This allows straightforward byte-wise comparisons using SIMD operations, similar to findvec32x8, but over a larger 64-byte chunk, avoiding the bit manipulation tricks (XOR, subtract, AND) used in find64 and findvec4x64.
Why It's NOT Efficient
However, benchmarks on the Zen 2 host (AMD ThreadRipper 3960X) show dismal performance: with the Zen 2- and Broadwell-targeted builds, findvec64x8 manages only about 0.95-1.83 GB/s across all input sizes, far below findvec32x8's 54-59 GB/s or memchr's 84-174 GB/s.
This is because Zen 2 lacks AVX-512 entirely, so the compiler splits each 512-bit operation into multiple 256-bit instructions, adding significant overhead. With the Westmere-targeted build (128-bit SSE max), performance falls further, to roughly 0.5 GB/s for larger inputs, and alignment overhead still hurts small inputs (1-2 GB/s for 4-128 bytes).
Compared to memchr, findvec64x8 is suboptimal, but it illustrates the potential of wide vectors on compatible hardware, where native 512-bit support could theoretically double the throughput of 256-bit approaches like findvec32x8, aligning closely with memchr’s byte-wise SIMD strategy if properly optimized.
// Define 512-bit vector type for 64 uint8_t (64 bytes)
typedef uint8_t vec64x8 __attribute__((vector_size(64)));
const char* findvec64x8(const char* start,
const char* end,
char character) {
const char* p = start;
// Broadcast character to all 64 bytes in a vector
vec64x8 target = {
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character,
(uint8_t)character, (uint8_t)character, (uint8_t)character, (uint8_t)character
};
// Handle alignment to 64-byte boundary for vector loads
uintptr_t misalignment = (uintptr_t)p % 64;
if (misalignment != 0) {
size_t align_steps = 64 - misalignment;
for (size_t i = 0; i < align_steps && p < end; ++i) {
if (*p == character) {
return p;
}
++p;
}
}
// Main loop: process 64 bytes at a time
while (p + 64 <= end) {
// Load 64 bytes as a vector
vec64x8 data = *(const vec64x8*)p;
// Compare for equality (produces 0xFF in matching bytes, 0x00 elsewhere)
vec64x8 cmp = data == target;
// Combine results into a mask (non-zero indicates a match)
uint64_t mask = 0;
for (int i = 0; i < 64; ++i) {
mask |= (uint64_t)cmp[i];
}
if (mask != 0) {
// Found a match; scan the 64 bytes to find the first one
for (int i = 0; i < 64; ++i) {
if (p[i] == character) {
return p + i;
}
}
}
p += 64;
}
// Handle remaining bytes (0 to 63)
while (p < end) {
if (*p == character) {
return p;
}
++p;
}
return end;
}
These implementations demonstrate how escalating vector widths—from 128-bit SSE on Westmere to 256-bit AVX2 on Broadwell and Zen 2, up to 512-bit where feasible—affect efficiency.
Benchmark Setup
The setup generates random strings ranging from 4 bytes to 2 MB, with the '=' character added at the end. Functions are timed in loops, computing throughput as size divided by the quickest elapsed nanoseconds. Outputs appear as Markdown tables.
Compilations targeted Westmere (2010, Intel's pre-AVX generation with 128-bit SSE4.2 at most, so wider vectors are synthesized from narrower operations), Broadwell (2014, AVX2 with 256-bit integer vectors, pre-AVX-512), and Znver2 (Zen 2, 2019, full AVX2 matching the test hardware; with no AVX-512, 512-bit vector code is split by the compiler into 256-bit operations).
Running everything on the same Zen 2 system ensures a uniform memory subsystem (quad-channel DDR4, roughly 100 GB/s of aggregate bandwidth) and cache hierarchy (32 KB of L1 data and 512 KB of L2 per core, with 16 MB of L3 shared per CCX, 128 MB in total).
The tests are single-threaded to highlight per-core capabilities.
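The published numbers come from Lemire's event-counter framework; as a rough stand-in, a minimal wall-clock harness along the following lines shows how the GB/s figures are derived (the helper name and repetition count are placeholders, not the repository's code).
#include <chrono>
#include <string>

// Best-of-N throughput in GB/s for one search function over one input buffer.
// Find is any callable with the (start, end, character) signature used above.
template <typename Find>
double best_gbps(Find find, const std::string& haystack, int repetitions = 50) {
    using clock = std::chrono::steady_clock;
    const char* start = haystack.data();
    const char* end = start + haystack.size();
    const char* hit = end;
    double best_ns = 1e300;
    for (int r = 0; r < repetitions; ++r) {
        auto t0 = clock::now();
        hit = find(start, end, '=');
        auto t1 = clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        if (ns > 0 && ns < best_ns) best_ns = ns;  // ignore sub-resolution timings
    }
    // Touch the result so the search cannot be optimized away entirely.
    if (hit == nullptr) return 0.0;
    return haystack.size() / best_ns;              // bytes per nanosecond equals GB/s
}
A call such as best_gbps(naive_find, make_worst_case_input(8192)), using the input generator sketched earlier, yields a figure in the same units as the tables below.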
Results
Throughput figures in GB/s are presented below for each architecture and input size, where higher values indicate superior performance. For oversized inputs exceeding L3 cache, memory bandwidth constrains results to approximately 80-100 GB/s on this setup.
Westmere-Targeted Compilation
Input(B) vec64x8 vec32x8 vec4x64 find64 memchr stdfind simdutf naive
-------- ------- ------- ------- ------ ------ ------- ------- -----
4 1.26 1.28 1.68 2.33 1.90 3.41 0.84 2.39
16 1.42 1.49 2.14 6.11 7.17 4.21 1.57 1.49
128 0.79 6.19 7.53 9.48 27.22 4.86 5.56 2.58
1024 0.50 44.75 37.42 9.54 102.42 4.91 31.76 3.67
8192 0.48 41.98 37.04 10.04 173.82 5.04 66.60 3.77
65536 0.47 48.31 40.31 10.11 120.49 5.06 64.96 3.79
524288 0.47 48.09 40.36 10.12 98.01 5.06 65.14 3.79
2097152 0.47 48.27 40.11 10.11 84.91 5.06 64.35 3.80
Broadwell-Targeted Compilation
Input(B) vec64x8 vec32x8 vec4x64 find64 memchr stdfind simdutf naive
-------- ------- ------- ------- ------ ------ ------- ------- -----
4 0.98 1.24 1.24 2.41 1.90 3.41 0.84 2.39
16 1.77 1.98 1.17 4.96 7.17 4.22 1.75 2.87
128 1.38 7.31 5.62 9.42 27.22 4.95 5.33 3.03
1024 0.98 57.05 49.88 9.54 102.43 4.91 30.30 3.67
8192 0.95 55.91 51.06 10.06 173.98 5.04 51.04 3.78
65536 0.95 59.33 58.79 10.12 120.50 5.06 58.77 3.79
524288 0.95 59.23 58.15 10.09 101.77 5.05 58.38 3.79
2097152 0.94 58.95 58.08 10.11 84.53 5.06 58.38 3.79
Zen 2-Targeted Compilation
Input(B) vec64x8 vec32x8 vec4x64 find64 memchr stdfind simdutf naive
-------- ------- ------- ------- ------ ------ ------- ------- -----
4 1.03 1.09 1.22 3.16 1.90 3.16 0.84 2.95
16 1.83 1.58 1.85 6.45 7.17 4.02 1.87 3.40
128 1.40 6.53 8.06 9.42 27.21 4.80 5.57 2.66
1024 0.98 57.06 49.51 9.58 102.40 4.91 31.38 3.69
8192 0.95 53.95 50.42 10.05 173.91 5.04 55.97 3.78
65536 0.95 59.34 58.68 10.12 120.45 5.06 65.48 3.79
524288 0.95 59.54 58.62 10.12 102.19 5.06 64.79 3.80
2097152 0.95 58.98 58.15 10.11 84.51 5.06 63.78 3.79
Analysis
Performance varies notably with input size. For very short strings of 4 to 16 bytes, scalar methods perform best, achieving 3 to 4 GB/s with std::find and naive_find, as the setup costs for alignment and SIMD outweigh the benefits. In these cases, memchr and the custom vector functions trail at 1 to 2 GB/s due to their introductory overhead, while simdutf::find lags even further at around 0.8 GB/s, possibly from its runtime architecture detection.
As inputs grow to 128 to 1024 bytes, SIMD advantages emerge more clearly. Memchr accelerates dramatically to 27 to 102 GB/s, surpassing the competition. The custom 256-bit functions, findvec32x8 and findvec4x64, climb from roughly 6 to 8 GB/s at 128 bytes to 50 to 57 GB/s at 1 KB on Broadwell and Zen 2. Simdutf::find rises to 5 to 31 GB/s, whereas scalar options remain steady at about 5 to 10 GB/s.
For medium-sized strings of 8 to 64 KB, peak efficiencies are realized. Memchr attains up to 174 GB/s, bound by L1 or L2 cache speeds and outstripping simdutf::find by two to three times, which tops out at 51 to 66 GB/s. The custom 256-bit methods approach 50 to 59 GB/s, and find64 stabilizes at 10 GB/s, better than the naive baseline but far from SIMD territory.
With larger inputs from 512 KB to 2 MB, memory constraints dominate, capping top performers at 80 to 120 GB/s due to DRAM bandwidth limitations of around 100 GB/s on the test system. Memchr settles at 84 to 102 GB/s but maintains its lead, while simdutf::find and the 256-bit customs sustain 58 to 65 GB/s, reflecting the impact of L3 cache misses.
Across functions, scalars like naive_find, std::find, and find64 deliver reliable but modest speeds of 3 to 10 GB/s, with find64's chunking providing a slight edge through bitwise optimizations, illustrating that non-SIMD techniques can still double throughput. Library-based SIMD options shine, with memchr excelling in mid-range scenarios thanks to GLIBC's meticulously crafted assembly, including unrolled AVX2 loops and prefetching, whereas simdutf::find offers robust but more versatile performance suited to broader applications.
The custom vector implementations reveal that 256-bit AVX2 yields steady gains of around 50 to 60 GB/s on compatible hardware, proving GCC extensions can rival intrinsics without them. However, findvec64x8 for 512-bit processing underperforms severely at under 1 GB/s, hampered by emulation costs on non-native platforms.
Architecture influences are pronounced. On Westmere with its 128-bit limit, wider vectors falter, as findvec64x8 slows to 0.5 GB/s under full emulation, though 256-bit functions manage 40 to 48 GB/s via SSE4 splits. Memchr and simdutf adapt seamlessly through SSE pathways. Broadwell and Zen 2, both AVX2-capable, produce akin outcomes, with 256-bit peaks slightly higher at 58 to 59 GB/s, but findvec64x8 stays sluggish at 0.95 GB/s from splitting or emulation. Zen 2 marginally outperforms Broadwell for simdutf at 64 versus 58 GB/s, potentially due to enhanced branch prediction or caching. Memchr's cross-architecture stability highlights GLIBC's runtime dispatch, selecting optimal code paths unlike the static custom implementations.
Conclusions
Lemire's core assertion holds true: SIMD plays a critical role in overcoming scalar limitations, delivering 20 to 40 times the speed in string searches. This expanded experiment, however, uncovers important subtleties for developers focused on performance.
Established libraries (e.g., GLIBC) often prove superior, as evidenced by memchr's two- to threefold advantage over simdutf::find on mid-sized strings, reaching 174 GB/s peaks through proven, portable SIMD integration without the need for custom coding.
Input scale is pivotal, with scalars preferable for strings under 128 bytes where SIMD setup drags, but wide parallelism excelling on cache-fitting data until hitting bandwidth ceilings around 100 GB/s.
Vector widths involve compromises: 256-bit AVX2 provides reliable boosts, yet 512-bit efforts falter without hardware backing - on an AVX2-only part like Zen 2 the 512-bit code gets split into narrower operations, and even on chips with AVX-512 the wider units can bring frequency throttling - so targeting the widest natively supported width is the safer deployment choice.
Compiling for legacy architectures like Westmere enforces emulation that cripples advanced SIMD, underscoring the value of intrinsics with dynamic dispatch akin to GLIBC for versatile binaries.
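For application code, GCC's __builtin_cpu_supports offers a lightweight version of that idea; the dispatcher below is a generic sketch of the pattern (not something the benchmark does), and it assumes the three routines are built with appropriate per-function target attributes rather than a single -march flag.
// Declarations of the routines shown earlier in this post.
const char* find64(const char* start, const char* end, char character);
const char* findvec32x8(const char* start, const char* end, char character);
const char* findvec64x8(const char* start, const char* end, char character);

// Hypothetical runtime dispatcher: pick the widest routine the CPU supports,
// falling back to the scalar 8-byte version otherwise.
const char* find_dispatch(const char* start, const char* end, char character) {
    if (__builtin_cpu_supports("avx512f")) {
        return findvec64x8(start, end, character);
    }
    if (__builtin_cpu_supports("avx2")) {
        return findvec32x8(start, end, character);
    }
    return find64(start, end, character);
}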
Optimal results demand attention to alignment, tails, and cache behavior, where even simple bitwise shortcuts in find64 yield easy gains, but true SIMD requires deep knowledge.
While simdutf leverages sophisticated SIMD techniques, the data suggests that the primary speed differentiator lies not in those specific tricks but in the block loads themselves - pulling in larger memory chunks at once, be it through SIMD vectors or scalar multi-byte operations, rather than plodding through data byte by byte.
For string optimization enthusiasts, experiment with the repository, profile on your hardware (I’m interested!), and adapt accordingly - the outcomes might reshape your approach. Share insights below and subscribe for further explorations in performance tuning.