Saturday, October 4, 2025

Bulk operations in Boost.Bloom

Starting in Boost 1.90, Boost.Bloom will provide so-called bulk operations, which, in general, can speed up insertion and lookup by a sizable factor. The key idea behind this optimization is to separate in time the calculation of a position in the Bloom filter's array from the actual access to that position. For instance, if this is the algorithm for regular insertion into a Bloom filter with k = 1 (all the snippets in this article are simplified, illustrative versions of the actual source code):

void insert(const value_type& x)
{
  auto h = hash(x);     // hash the value
  auto p = position(h); // map the hash to a position in the bit array
  set(p, 1);            // set the bit at that position
}

then the bulk-mode variant for insertion of a range of values would look like:

void insert(const std::array<value_type, N>& x)
{
  std::size_t positions[N];
  
  // pipeline position calculation and memory access
  
  for(std::size_t i = 0; i < N; ++i) {
    auto h = hash(x[i]);
    positions[i] = position(h);
    prefetch(positions[i]);
  }
  
  for(std::size_t i = 0; i < N; ++i) {
    set(positions[i], 1);
  }
}

By prefetching the memory at positions[i] well in advance of its actual use in set(positions[i], 1), we make sure that the latter accesses a cached value and avoid (or minimize) the CPU stalls that would otherwise result from reaching out to cold memory. We have studied this bulk optimization in more detail in the context of boost::concurrent_flat_map. You can see actual measurements of the performance gains achieved in a dedicated repo; as expected, gains are higher for larger bit arrays not fitting in the lower levels of the CPU's cache hierarchy.
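As an aside, prefetch in these snippets is just a stand-in for a platform-specific cache hint. A minimal sketch of what it could map to, assuming the filter keeps its bits in an array of 64-bit words (array below is a placeholder for that member) and relying on the GCC/Clang __builtin_prefetch intrinsic:

#include <cstddef>
#include <cstdint>

extern std::uint64_t* array; // placeholder for the filter's bit array

// Hint the CPU to bring the cache line holding the word that contains
// position pos into cache, without blocking execution.
void prefetch(std::size_t pos)
{
#if defined(__GNUC__) || defined(__clang__)
  // rw = 0 (read); for the insertion path a write prefetch (rw = 1) may be preferable
  __builtin_prefetch(&array[pos / 64], 0, 1);
#else
  (void)pos; // no prefetch intrinsic available; fall back to doing nothing
#endif
}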

From an algorithmic point of view, the most interesting case is that of lookup operations for k > 1, since the baseline non-bulk procedure is not easily amenable to pipelining:

bool may_contain(const value_type& x)
{ 
  auto h = hash(x);
  for(int i = 0; i < k; ++i) {
    auto p = position(h);
    if(check(p) == false) return false;
    if(i < k - 1) h = next_hash(h);
  }
  return true;
}

This algorithm is branchful and can take anywhere from 1 to k iterations, the latter being the case for elements present in the filter and false positives. For instance, this diagram shows the number of steps taken to look up n = 64 elements on a filter with k = 10 and FPR = 1%, where the successful lookup rate (proportion of looked up elements actually in the filter) is p = 0.1:

As can be seen, for non-successful lookups may_contain typically stops at the first few positions: the average number of positions checked (grayed cells) is \[n\left(pk +(1-p)\frac{1-p_b^{k}}{1-p_b}\right),\] where \(p_b=\sqrt[k]{\text{FPR}}\) is the probability that an arbitrary bit in the filter's array is set to 1. In the example used, this results in only 34% of the total nk = 640 positions being checked.
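As a quick sanity check, this figure can be reproduced by evaluating the formula for the example parameters; the following sketch simply plugs in n = 64, k = 10, FPR = 1% and p = 0.1:

#include <cmath>
#include <cstdio>

int main()
{
  const double n = 64, k = 10, fpr = 0.01, p = 0.1;
  const double pb = std::pow(fpr, 1.0 / k); // probability that an arbitrary bit is set
  const double checked =
    n * (p * k + (1 - p) * (1 - std::pow(pb, k)) / (1 - pb));
  std::printf("%.1f positions checked out of %.0f (%.0f%%)\n",
              checked, n * k, 100 * checked / (n * k));
  // prints approximately: 218.5 positions checked out of 640 (34%)
}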

Now, a naïve bulk-mode version could look as follows:

template<typename F>
void may_contain(
  const std::array<value_type, N>& x,
  F f) // f is fed lookup results
{ 
  std::uint64_t hashes[N];
  std::size_t   positions[N];
  bool          results[N];
  
  // initial round of hash calculation and prefetching
  
  for(std::size_t i = 0; i < N; ++i) {
    hashes[i] = hash(x[i]);
    positions[i] = position(hashes[i]);
    results[i] = true;
    prefetch(positions[i]);
  }
  
  // main loop
  
  for(int j = 0; j < k; ++j) {
    for(std::size_t i = 0; i < N; ++i) {
      if(results[i]) { // conditional branch X
        results[i] &= check(positions[i]);
        if(results[i] && j < k - 1) {
          hashes[i] = next_hash(hashes[i]);
          positions[i] = position(hashes[i]);
          prefetch(positions[i]);
        }
      }
    }
  }
  
  // feed results
  
  for(std::size_t i = 0; i < N; ++i) {
    f(results[i]);
  }
}

This simply stores partial results in an array and iterates row-first (for each position index j, across all n elements) instead of column-first (for each element, across its k positions):

The problem with this approach is that, even though it calls check exactly the same number of times as the non-bulk algorithm, the conditional branch labeled X is executed \(nk\) times, which puts a heavy strain on the CPU's branch predictor. Conditional branches could in principle be eliminated altogether:

for(int j = 0; j < k; ++j) {
  for(std::size_t i = 0; i < N; ++i) {
    results[i] &= check(positions[i]);
    if(j < k - 1) { // this check is optimized away at compile time
      hashes[i] = next_hash(hashes[i]);
      positions[i] = position(hashes[i]);
      prefetch(positions[i]);
    }
  }
}

but this would result in \(nk\) calls to check; since the non-bulk version only checks about 34% of the positions in our running example, that amounts to roughly 3 times more computational work.

The challenge then is to reduce the number of iterations on each row to only those positions that still need to be evaluated. This is the solution adopted by Boost.Bloom:

template<typename F>
void may_contain(
  const std::array<value_type, N>& x,
  F f) // f is fed lookup results
{ 
  std::uint64_t hashes[N];
  std::size_t   positions[N];
  std::uint64_t results = 0; // mask of N bits
  
  // initial round of hash calculation and prefetching
  
  for(std::size_t i = 0; i < N; ++i) {
    hashes[i] = hash(x[i]);
    positions[i] = position(hashes[i]);
    results |= 1ull << i;
    prefetch(positions[i]);
  }
  
  // main loop
  
  for(int j = 0; j < k; ++j) {
    auto mask = results;
    if(!mask) break;
    do{
      auto i = std::countr_zero(mask);
      auto b = check(positions[i]);
      results &= ~(std::uint64_t(!b) << i);
      if(j < k - 1) { // this check is optimized away at compile time
        hashes[i] = next_hash(hashes[i]);
        positions[i] = position(hashes[i]);
        prefetch(positions[i]);
      }
      mask &= mask - 1; // reset least significant 1
    } while(mask);
  }
  
  // feed results
  
  for(std::size_t i = 0; i < N; ++i) {
    f(results & 1);
    results >>= 1;
  }
}

Instead of an array of partial results, we keep them as a bitmask, so that we can skip runs of terminated columns in constant time using std::countr_zero. For instance, in the 7th row of the diagram above, the main loop does only 11 iterations instead of n = 64.
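The bit-iteration idiom used in the main loop is worth isolating; here is a minimal, self-contained sketch of visiting only the surviving columns of a 64-bit mask with std::countr_zero (requires C++20):

#include <bit>
#include <cstdint>
#include <cstdio>

int main()
{
  // columns 0, 3, 5 and 63 are still alive; all others have already failed
  std::uint64_t mask = (1ull << 0) | (1ull << 3) | (1ull << 5) | (1ull << 63);

  while(mask) {
    int i = std::countr_zero(mask); // index of the lowest surviving column
    std::printf("processing column %d\n", i);
    mask &= mask - 1;               // clear that bit and move to the next one
  }
  // exactly 4 iterations, no matter how sparse the mask is
}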

In summary, the bulk version of may_contain does only n more conditional branches than the non-bulk version, plus \(n(1-p)\) superfluous memory fetches (the latter could be omitted at the expense of \(n(1-p)\) additional conditional branches, but benchmarks showed that the version with the extra fetches is actually faster). These are measured speedups of bulk vs. non-bulk lookup for a boost::bloom::filter<int, K> containing 10M elements, compiled with GCC in 64-bit mode:

array size   K    p = 1   p = 0   p = 0.1
8M           6    0.78    2.11    1.43
12M          9    1.54    2.27    1.38
16M          11   2.08    2.45    1.46
20M          14   2.24    2.57    1.43

(More results here.)

Conclusions

Boost.Bloom will introduce bulk insertion and lookup capabilities in Boost 1.90, resulting in speedups of up to 3x, though results vary greatly depending on the filter configuration and its size, and bulk operations may even be slower than their regular counterparts in some situations. We have shown how bulk lookup is implemented for the case k > 1, where the regular, non-bulk version is branchful and thus not readily amenable to pipelining. The key technique, iteration reduction with a bitmask and std::countr_zero, can be applied outside the context of Boost.Bloom to implement efficient pipelining of early-exit operations.