Saturday, October 4, 2025

Bulk operations in Boost.Bloom

Starting in Boost 1.90, Boost.Bloom will provide so-called bulk operations, which, in general, can speed up insertion and lookup by a sizable factor. The key idea behind this optimization is to separate in time the calculation of a position in the Bloom filter's array from its actual access. For instance, if this is the algorithm for regular insertion into a Bloom filter with k = 1 (all the snippets in this article are simplified, illustrative versions of the actual source code):

void insert(const value_type& x)
{
  auto h = hash(x);
  auto p = position(h);
  set(p, 1);
}

then the bulk-mode variant for insertion of a range of values would look like:

void insert(const std::array<value_type, N>& x)
{
  std::size_t positions[N];
  
  // pipeline position calculation and memory access
  
  for(std::size_t i = 0; i < N; ++i) {
    auto h = hash(x[i]);
    positions[i] = position(h);
    prefetch(positions[i]);
  }
  
  for(std::size_t i = 0; i < N; ++i) {
    set(positions[i], 1);
  }
}

By prefetching the address of positions[i] way in advance of its actual usage in set(positions[i], 1), we make sure that the latter is accessing a cached value and avoid (or minimize) the CPU stalling that would result from reaching out to cold memory. We have studied bulk optimization in more detail in the context of boost::concurrent_flat_map. You can see actual measurements of the performance gains achieved in a dedicated repo; as expected, gains are higher for larger bit arrays not fitting in the lower levels of the CPU's cache hierarchy.
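In the snippets above, prefetch takes a position in the bit array and hints the CPU to bring the corresponding cache line in ahead of time. A minimal sketch of how such a helper can be written on top of compiler intrinsics (the wrapper name and fallback strategy here are illustrative assumptions, not the actual Boost.Bloom source):

```cpp
#include <cstddef>

// Illustrative sketch: hint the CPU to load the cache line holding
// array[pos] before it is actually read or written. The hint is
// non-binding: it affects performance, never program behavior.
inline void prefetch_element(const unsigned char* array, std::size_t pos)
{
#if defined(__GNUC__) || defined(__clang__)
  __builtin_prefetch(array + pos);
#else
  (void)array; (void)pos; // no-op fallback on other compilers
#endif
}
```

On mainstream architectures prefetch instructions do not fault, which is what makes it safe to issue the hint well before knowing whether the position will actually be accessed.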

From an algorithmic point of view, the most interesting case is that of lookup operations for k > 1, since the baseline non-bulk procedure is not easily amenable to pipelining:

bool may_contain(const value_type& x)
{ 
  auto h = hash(x);
  for(int i = 0; i < k; ++i) {
    auto p = position(h);
    if(check(p) == false) return false;
    if(i < k - 1) h = next_hash(h);
  }
  return true;
}

This algorithm is branchful and can take anywhere from 1 to k iterations, the latter being the case for elements present in the filter and false positives. For instance, this diagram shows the number of steps taken to look up n = 64 elements on a filter with k = 10 and FPR = 1%, where the successful lookup rate (proportion of looked up elements actually in the filter) is p = 0.1:

As can be seen, for non-successful lookups may_contain typically stops at the first few positions: the average number of positions checked (grayed cells) is \[n\left(pk +(1-p)\frac{1-p_b^{k}}{1-p_b}\right),\] where \(p_b=\sqrt[k]{\text{FPR}}\) is the probability that an arbitrary bit in the filter's array is set to 1. In the example used, this results in only 34% of the total nk = 640 positions being checked.

Now, a naïve bulk-mode version could look as follows:

template<typename F>
void may_contain(
  const std::array<value_type, N>& x,
  F f) // f is fed lookup results
{ 
  std::uint64_t hashes[N];
  std::size_t   positions[N];
  bool          results[N];
  
  // initial round of hash calculation and prefetching
  
  for(std::size_t i = 0; i < N; ++i) {
    hashes[i] = hash(x[i]);
    positions[i] = position(hashes[i]);
    results[i] = true;
    prefetch(positions[i]);
  }
  
  // main loop
  
  for(int j = 0; j < k; ++j) {
    for(std::size_t i = 0; i < N; ++i) {
      if(results[i]) { // conditional branch X
        results[i] &= check(positions[i]);
        if(results[i] && j < k - 1) {
          hashes[i] = next_hash(hashes[i]);
          positions[i] = position(hashes[i]);
          prefetch(positions[i]);
        }
      }
    }
  }
  
  // feed results
  
  for(std::size_t i = 0; i < N; ++i) {
    f(results[i]);
  }
}

This simply stores partial results in an array and iterates row-first instead of column-first:

The problem with this approach is that, even though it calls check exactly the same number of times as the non-bulk algorithm, the conditional branch labeled X is executed \(nk\) times, and this has a huge impact on the CPU's branch predictor. Conditional branches could in principle be eliminated altogether:

for(int j = 0; j < k; ++j) {
  for(std::size_t i = 0; i < N; ++i) {
    results[i] &= check(positions[i]);
    if(j < k - 1) { // this check is optimized away at compile time
      hashes[i] = next_hash(hashes[i]);
      positions[i] = position(hashes[i]);
      prefetch(positions[i]);
    }
  }
}

but this would result in \(nk\) calls to check, around 3 times more computational work than the non-bulk version.

The challenge then is to reduce the number of iterations on each row to only those positions that still need to be evaluated. This is the solution adopted by Boost.Bloom:

template<typename F>
void may_contain(
  const std::array<value_type, N>& x,
  F f) // f is fed lookup results
{ 
  std::uint64_t hashes[N];
  std::size_t   positions[N];
  std::uint64_t results = 0; // mask of N bits
  
  // initial round of hash calculation and prefetching
  
  for(std::size_t i = 0; i < N; ++i) {
    hashes[i] = hash(x[i]);
    positions[i] = position(hashes[i]);
    results |= 1ull << i;
    prefetch(positions[i]);
  }
  
  // main loop
  
  for(int j = 0; j < k; ++j) {
    auto mask = results;
    if(!mask) break;
    do{
      auto i = std::countr_zero(mask);
      auto b = check(positions[i]);
      results &= ~(std::uint64_t(!b) << i);
      if(j < k - 1) { // this check is optimized away at compile time
        hashes[i] = next_hash(hashes[i]);
        positions[i] = position(hashes[i]);
        prefetch(positions[i]);
      }
      mask &= mask - 1; // reset least significant 1
    } while(mask);
  }
  
  // feed results
  
  for(std::size_t i = 0; i < N; ++i) {
    f(results & 1);
    results >>= 1;
  }
}

Instead of an array of partial results, we keep these as a bitmask, so that we can skip groups of terminated columns in constant time using std::countr_zero. For instance, in the 7th row the main loop does 11 iterations instead of n = 64.

In summary, the bulk version of may_contain only performs n more conditional branches than the non-bulk version, plus \(n(1-p)\) superfluous memory fetches (the latter could be omitted at the expense of \(n(1-p)\) additional conditional branches, but benchmarks showed that the version with extra memory fetches is actually faster). These are measured speedups of bulk vs. non-bulk lookup for a boost::bloom::filter<int, K> containing 10M elements under GCC, 64-bit mode:

array size   K    p = 1   p = 0   p = 0.1
8M           6    0.78    2.11    1.43
12M          9    1.54    2.27    1.38
16M          11   2.08    2.45    1.46
20M          14   2.24    2.57    1.43

(More results here.)

Conclusions

Boost.Bloom will introduce bulk insertion and lookup capabilities in Boost 1.90, resulting in speedups of up to 3x, though results vary greatly depending on the filter configuration and its size, and in some situations bulk operations may even perform worse than their regular counterparts. We have shown how bulk lookup is implemented for the case k > 1, where the regular, non-bulk version is highly branched and so not readily amenable to pipelining. The key technique, based on iteration reduction with std::countr_zero, can be applied outside the context of Boost.Bloom to implement efficient pipelining of early-exit operations.

Sunday, July 6, 2025

Maps on chains

(From a conversation with Vassil Vassilev.) Suppose we want to have a C++ map where the keys are disjoint, integer intervals of the form [a, b]:

struct interval
{
  int min, max;
};

std::map<interval, std::string> m;

m[{0, 9}] = "ABC";
m[{10, 19}] = "DEF";

This looks easy enough: we just have to write the proper comparison operator for intervals, right?

bool operator<(const interval& x, const interval& y)
{
  return x.max < y.min;
}

But what happens if we try to insert an interval which is not disjoint with those already in the map?

m[{5, 14}] = "GHI"; // intersects both {0, 9} and {10, 19}

The short answer is that this is undefined behavior, but let's try to understand why. C++ associative containers depend on the comparison function (typically, std::less<Key>) inducing a so-called strict weak ordering on elements of Key. In layman's terms, a strict weak order < behaves as the "less than" relationship does for numbers, except that there may be incomparable elements x, y such that x ≮ y and y ≮ x; for numbers, this only happens if x = y, but in the case of a general SWO we allow for distinct, incomparable elements as long as they form equivalence classes. A convenient way to rephrase this condition is to require that incomparable elements are totally equivalent in how they compare to the rest of the elements, that is, they're truly indistinguishable from the point of view of the SWO. Getting back to our interval scenario, we have three possible cases when comparing [a, b] and [c, d]:

  • If b < c, the intervals don't overlap and [a, b] < [c, d].
  • If d < a, the intervals don't overlap and [c, d] < [a, b].
  • Otherwise, the intervals are incomparable. This can happen when [a, b] and [c, d] overlap partially or when they are exactly the same interval.

What we have described is a well known relationship called interval order. The problem is that the interval order is not a strict weak order. Let's depict a Hasse diagram for the interval order on integer intervals [a,b] between 0 and 4:

A Hasse diagram works like this: given two elements x and y, x < y iff there is a path going upwards that connects x to y. For instance, the fact that [1, 1] < [3, 4] is confirmed by the path [1, 1] → [2, 2] → [3, 4]. But the diagram also serves to show why this relationship is not a strict weak order: for it to be so, incomparable elements (those not connected) should be indistinguishable in that they are connected upwards and downwards with the same elements, and this is clearly not the case (in fact, it is not the case for any pair of incomparable elements). In mathematical terms, our relationship is of a more general type called a strict partial order.

Going back to C++, associative containers assume that the elements inserted form a linear arrangement with respect to <: when we try to insert a new element y that is incomparable with some previously inserted element x, the properties of strict weak orders allow us to determine that x and y are equivalent, so nothing breaks (the insertion fails as a duplicate for a std::map, or y is added next to x for a std::multimap).

There's a way to accommodate our interval scenario with std::map, though. As long as the elements we are inserting belong to the same connecting path or chain, std::map can't possibly "know" whether our relationship is a strict weak order or not: it certainly looks like one for the limited subset of elements it has seen so far. Implementation-wise, we just have to make sure we're not comparing partially overlapping intervals:

struct interval_overlap: std::runtime_error
{
  interval_overlap(): std::runtime_error("interval overlap"){}
};

bool operator<(const interval& x, const interval& y)
{
  if(x.min == y.min) {
    if(x.max != y.max) throw interval_overlap();
    return false;
  }
  else if(x.min < y.min) {
    if(x.max >= y.min) throw interval_overlap();
    return true;
  }
  else /* x.min > y.min */
  {
    if(x.min <= y.max) throw interval_overlap();
    return false;
  }
}

std::map<interval, std::string> m;

m[{0, 9}] = "ABC";
m[{10, 19}] = "DEF";
m[{5, 14}] = "GHI"; // throws interval_overlap

So, when we try to insert an element that would violate the strict weak ordering constraints (that is, it lies outside the chain connecting the intervals inserted so far), an exception is thrown and no undefined behavior is hit. A strict reading of the standard would not allow this workaround, as it is required that the comparison object for the map induce a strict weak ordering for all possible values of Key, not only those in the container (or that is my interpretation, at least): for all practical purposes, though, this works and will foreseeably continue to work for all future revisions of the standard.

Bonus point. Thanks to heterogeneous lookup, we can extend our use case to support lookup for integers inside the intervals:

struct less_interval
{
  using is_transparent = void;

  bool operator()(const interval& x, const interval& y) const
  {
    // as operator< before
  }

  bool operator()(int x, const interval& y) const
  {
    return x < y.min;
  }
  
  bool operator()(const interval& x, int y) const
  {
    return x.max < y; 
  }    
};

std::map<interval, std::string, less_interval> m;
  
m[{0, 9}] = "ABC";
m[{10, 19}] = "DEF";

std::cout << m.find(5)->second << "\n"; // prints "ABC"

Exercise for the reader: Can you formally prove that this works? (Hint: define a strict weak order on ℕ ∪ I, where ℕ is the set of natural numbers and I is a collection of disjoint integer intervals.)