Friday, November 14, 2025

Comparing the run-time performance of Fil-C and ASAN

After the publication of the experiments with Boost.Unordered on Fil-C, some readers asked for a comparison of the run-time performance of Fil-C and Clang's AddressSanitizer (ASAN).

Warning

Please do not construe this article as implying that Fil-C and ASAN are competing technologies within the same application space. Whereas ASAN is designed to detect bugs resulting in memory access violations, Fil-C sports a stricter notion of memory safety including UB situations where a pointer points into a valid memory region that is nonetheless out of bounds with respect to the pointer's provenance. That said, there's some overlap between the two tools, so it's only natural to ask about their relative impact on execution times.
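As a concrete illustration of that difference, consider the following deliberately UB sketch (hypothetical, for exposition only). The access targets valid, live memory, so ASAN will typically not flag it; Fil-C, on the other hand, is expected to trap because the access violates the provenance of p:

#include <cstddef>
#include <cstdlib>

int main()
{
  char* p = static_cast<char*>(std::malloc(16));
  char* q = static_cast<char*>(std::malloc(16));
  // subtracting pointers to different objects is itself UB; it is used
  // here only to guarantee that the access below lands inside q
  std::ptrdiff_t d = q - p;
  p[d] = 'x'; // writes to q[0]: valid, live memory, but out of bounds
              // with respect to p's provenance
  std::free(p);
  std::free(q);
}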

Results

Our previous benchmarking repo has been updated to include results for plain Clang 18, Clang 18 with ASAN enabled, and Fil-C v0.674, all with release mode settings. The following figures show execution times in ns per element for Clang/ASAN (solid lines) and Fil-C (dashed lines) for three Boost.Unordered containers (boost::unordered_map, boost::unordered_flat_map and boost::unordered_node_map) and four scenarios.


[Figures: execution times for running insertion, running erasure, successful lookup and unsuccessful lookup]

In summary:

  • Insertion:
    Fil-C is between 1.8x slower and 4.1x faster than ASAN (avg. 1.3x faster).
  • Erasure:
    Fil-C is between 1.3x slower and 9.2x faster than ASAN (avg. 1.9x faster).
  • Successful lookup:
    Fil-C is between 2.5x slower and 1.9x faster than ASAN (avg. 1.6x slower).
  • Unsuccessful lookup:
Fil-C is between 2.6x slower and 1.4x faster than ASAN (avg. 1.9x slower).

So, results don't allow us to establish a clear-cut "winner". When allocation/deallocation is involved, Fil-C seems to perform better (except for insertion when the memory working set gets past a certain threshold). For lookup, Fil-C is generally worse, and, again, the gap increases as more memory is used. A deeper analysis would require knowledge of the internals of both tools that I, unfortunately, lack.

(Update Nov 16, 2025) Memory usage

By request, we've measured peak memory usage in GB (as reported by time -v) for the three environments and three scenarios (insertion, erasure, combined successful and unsuccessful lookup) involving five different containers from Boost.Unordered and Abseil, each holding 10M elements. The combination of containers doesn't allow us to discern how closed vs. open addressing affects memory overhead in ASAN and Fil-C.

ASAN uses between 2.1x and 2.6x more memory than regular Clang, whereas Fil-C ranges between 1.8x and 3.9x. Results are again a mixed bag. Fil-C performs worst for the erasure scenario, perhaps because of delays in memory reclamation by this technology's embedded garbage collector.

Monday, November 10, 2025

Some experiments with Boost.Unordered on Fil-C

Fil-C is a C and C++ compiler built on top of LLVM that adds run-time memory-safety mechanisms preventing out-of-bounds and use-after-free accesses. This naturally comes at a price in execution time, so I was curious about how much of a penalty that is for a performance-oriented, relatively low-level library like Boost.Unordered.

From the user's perspective, Fil-C is basically a Clang clone, so it is fairly easy to integrate into existing toolchains. This repo shows how to plug Fil-C into Boost.Unordered's CI, which runs on GitHub Actions and is powered by Boost's own B2 build system. The most straightforward way to make B2 use Fil-C is by having a user-config.jam file like this:

using clang : : fil++ ;

which instructs B2 to use the clang toolset, with the only change being that the compiler command is fil++ instead of the default clang++.

We've encountered only minor difficulties during the process:

  • In the environment used (Linux x64), B2 automatically includes --target=x86_64-pc-linux as part of the command line, which confuses the adapted version of libc++ shipping with Fil-C. This option had to be overridden with --target=x86_64-unknown-linux-gnu (which is the default for Clang); see the sketch after this list.
  • As of this writing, Fil-C does not accept inline assembly code (asm or __asm__ blocks), which Boost.Unordered uses to provide embedded GDB pretty-printers. The feature was disabled with the macro BOOST_ALL_NO_EMBEDDED_GDB_SCRIPTS.
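For reference, one possible way to apply both tweaks from user-config.jam (a sketch; the exact option spelling may vary across B2 versions, and it relies on Clang letting the last --target flag on the command line win):

using clang : : fil++ :
  <cxxflags>--target=x86_64-unknown-linux-gnu
  <cxxflags>-DBOOST_ALL_NO_EMBEDDED_GDB_SCRIPTS
  <linkflags>--target=x86_64-unknown-linux-gnu ;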

Other than these issues, the extensive Boost.Unordered test suite compiled and ran successfully, except for some tests involving Boost.Interprocess, which uses inline assembly in some places. CI completed in around 2.5x the time it takes with a regular compiler. It is worth noting that Fil-C happily accepted the SSE2 SIMD intrinsics that Boost.Unordered crucially relies on.

We ran some performance tests compiled with Fil-C v0.674 on a Linux machine with release settings (benchmark code and setup here). The figures show execution times in ns per element for Clang 15 (solid lines) and Fil-C (dashed lines) for three containers: boost::unordered_map (closed-addressing hashmap), and boost::unordered_flat_map and boost::unordered_node_map (open addressing).


[Figures: execution times for running insertion, running erasure, successful lookup and unsuccessful lookup]

Execution with Fil-C is around 2x-4x slower, with wide variations depending on the benchmarked scenario and container of choice. Closed-addressing boost::unordered_map is the container experiencing the largest degradation, presumably because it does the most pointer chasing.

Saturday, October 4, 2025

Bulk operations in Boost.Bloom

Starting in Boost 1.90, Boost.Bloom will provide so-called bulk operations, which, in general, can speed up insertion and lookup by a sizable factor. The key idea behind this optimization is to separate in time the calculation of a position in the Bloom filter's array from its actual access. For instance, if this is the algorithm for regular insertion into a Bloom filter with k = 1 (all the snippets in this article are simplified, illustrative versions of the actual source code):

void insert(const value_type& x)
{
  auto h = hash(x);
  auto p = position(h);
  set(p, 1);
}

then the bulk-mode variant for insertion of a range of values would look like:

void insert(const std::array<value_type, N>& x)
{
  std::size_t positions[N];
  
  // pipeline position calculation and memory access
  
  for(std::size_t i = 0; i < N; ++i) {
    auto h = hash(x[i]);
    positions[i] = position(h);
    prefetch(positions[i]);
  }
  
  for(std::size_t i = 0; i < N; ++i) {
    set(positions[i], 1);
  }
}

By prefetching the memory at positions[i] well in advance of its actual usage in set(positions[i], 1), we make sure that the latter accesses a cached value, avoiding (or minimizing) the CPU stalls that would result from reaching out to cold memory. We have studied bulk optimization in more detail in the context of boost::concurrent_flat_map. You can see actual measurements of the performance gains achieved in a dedicated repo; as expected, gains are higher for larger bit arrays not fitting in the lower levels of the CPU's cache hierarchy.
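In these snippets, prefetch(positions[i]) is purely illustrative (positions are indices, not addresses). A minimal sketch of what such a helper could map to in a real implementation, assuming a hypothetical byte array data holding the filter's bits and using the GCC/Clang builtin:

#include <climits>
#include <cstddef>

inline void prefetch_bit(const unsigned char* data, std::size_t pos)
{
  // hint the CPU to load the cache line holding bit pos of the array;
  // arguments: address, 0 = read access, 3 = high temporal locality
  __builtin_prefetch(data + pos / CHAR_BIT, 0, 3);
}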

From an algorithmic point of view, the most interesting case is that of lookup operations for k > 1, since the baseline non-bulk procedure is not easily amenable to pipelining:

bool may_contain(const value_type& x)
{ 
  auto h = hash(x);
  for(int i = 0; i < k; ++i) {
    auto p = position(h);
    if(check(p) == false) return false;
    if(i < k - 1) h = next_hash(h);
  }
  return true;
}

This algorithm is branchful and can take anywhere from 1 to k iterations, the latter being the case for elements present in the filter and false positives. For instance, this diagram shows the number of steps taken to look up n = 64 elements on a filter with k = 10 and FPR = 1%, where the successful lookup rate (proportion of looked up elements actually in the filter) is p = 0.1:

As can be seen, for non-successful lookups may_contain typically stops at the first few positions: the average number of positions checked (grayed cells) is \[n\left(pk +(1-p)\frac{1-p_b^{k}}{1-p_b}\right),\] where \(p_b=\sqrt[k]{\text{FPR}}\) is the probability that an arbitrary bit in the filter's array is set to 1. In the example used, this results in only 34% of the total nk = 640 positions being checked.
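As a quick sanity check of that formula with the example's numbers: \(p_b = 0.01^{1/10} \approx 0.631\), so \((1-p_b^{k})/(1-p_b) = (1-0.01)/(1-0.631) \approx 2.68\), and the expected number of checks is \(64\,(0.1\cdot 10 + 0.9\cdot 2.68) \approx 219\), which is indeed 34% of \(nk = 640\).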

Now, a naïve bulk-mode version could look as follows:

template<typename F>
void may_contain(
  const std::array<value_type, N>& x,
  F f) // f is fed lookup results
{ 
  std::uint64_t hashes[N];
  std::size_t   positions[N];
  bool          results[N];
  
  // initial round of hash calculation and prefetching
  
  for(std::size_t i = 0; i < N; ++i) {
    hashes[i] = hash(x[i]);
    positions[i] = position(hashes[i]);
    results[i] = true;
    prefetch(positions[i]);
  }
  
  // main loop
  
  for(int j = 0; j < k; ++j) {
    for(std::size_t i = 0; i < N; ++i) {
      if(results[i]) { // conditional branch X
        results[i] &= check(positions[i]);
        if(results[i] && j < k - 1) {
          hashes[i] = next_hash(hashes[i]);
          positions[i] = position(hashes[i]);
          prefetch(positions[i]);
        }
      }
    }
  }
  
  // feed results
  
  for(std::size_t i = 0; i < N; ++i) {
    f(results[i]);
  }
}

This simply stores partial results in an array and iterates row-first instead of column-first:

The problem with this approach is that, even though it calls check exactly the same number of times as the non-bulk algorithm, the conditional branch labeled X is executed \(nk\) times, and this has a huge impact on the CPU's branch predictor. Conditional branches could in principle be eliminated altogether:

for(int j = 0; j < k; ++j) {
  for(std::size_t i = 0; i < N; ++i) {
    results[i] &= check(positions[i]);
    if(j < k - 1) { // this check is optimized away at compile time
      hashes[i] = next_hash(hashes[i]);
      positions[i] = position(hashes[i]);
      prefetch(positions[i]);
    }
  }
}

but this would result in \(nk\) calls to check, roughly 3 times the computational work of the non-bulk version (in the example above, 640 checks vs. ~219).

The challenge then is to reduce the number of iterations on each row to only those positions that still need to be evaluated. This is the solution adopted by Boost.Bloom:

template<typename F>
void may_contain(
  const std::array<value_type, N>& x,
  F f) // f is fed lookup results
{ 
  std::uint64_t hashes[N];
  std::size_t   positions[N];
  std::uint64_t results = 0; // mask of N bits
  
  // initial round of hash calculation and prefetching
  
  for(std::size_t i = 0; i < N; ++i) {
    hashes[i] = hash(x[i]);
    positions[i] = position(hashes[i]);
    results |= 1ull << i;
    prefetch(positions[i]);
  }
  
  // main loop
  
  for(int j = 0; j < k; ++j) {
    auto mask = results;
    if(!mask) break;
    do{
      auto i = std::countr_zero(mask);
      auto b = check(positions[i]);
      results &= ~(std::uint64_t(!b) << i);
      if(j < k - 1) { // this check is optimized away at compile time
        hashes[i] = next_hash(hashes[i]);
        positions[i] = position(hashes[i]);
        prefetch(positions[i]);
      }
      mask &= mask - 1; // reset least significant 1
    } while(mask);
  }
  
  // feed results
  
  for(std::size_t i = 0; i < N; ++i) {
    f(results & 1);
    results >>= 1;
  }
}

Instead of an array of partial results, we keep these as a bitmask, so that we can skip groups of terminated columns in constant time using std::countr_zero. For instance, in the 7th row the main loop does 11 iterations instead of n = 64.
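The mask-iteration idiom is worth showing on its own. A minimal, self-contained sketch (hypothetical mask value; C++20 for <bit>) that visits the indices of all 1-bits of a 64-bit mask in ascending order, skipping runs of 0-bits in constant time:

#include <bit>
#include <cstdint>
#include <iostream>

int main()
{
  std::uint64_t mask = 0b1011'0100; // surviving columns 2, 4, 5 and 7
  while(mask) {
    int i = std::countr_zero(mask); // index of least significant 1
    std::cout << i << " ";          // prints 2 4 5 7
    mask &= mask - 1;               // clear that 1
  }
}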

In summary, the bulk version of may_contain only does n more conditional branches than the non-bulk version, plus \(n(1-p)\) superfluous memory fetches (the latter could be omitted at the expense of \(n(1-p)\) additional conditional branches, but benchmarks showed that the version with extra memory fetches is actually faster). These are measured speedups of bulk vs. non-bulk lookup for a boost::bloom::filter<int, K> containing 10M elements under GCC, 64-bit mode:

array size    K     p = 1    p = 0    p = 0.1
8M            6     0.78     2.11     1.43
12M           9     1.54     2.27     1.38
16M           11    2.08     2.45     1.46
20M           14    2.24     2.57     1.43

(More results here.)

Conclusions

Boost.Bloom will introduce bulk insertion and lookup capabilities in Boost 1.90, resulting in speedups of up to 3x, though results vary greatly depending on the filter configuration and its size, and bulk operations may even perform worse than their regular counterparts in some situations. We have shown how bulk lookup is implemented for the case k > 1, where the regular, non-bulk version is highly branched and so not readily amenable to pipelining. The key technique, based on iteration reduction with std::countr_zero, can be applied outside the context of Boost.Bloom to implement efficient pipelining of early-exit operations.

Sunday, July 6, 2025

Maps on chains

(From a conversation with Vassil Vassilev.) Suppose we want to have a C++ map where the keys are disjoint integer intervals of the form [a, b]:

struct interval
{
  int min, max;
};

std::map<interval, std::string> m;

m[{0, 9}] = "ABC";
m[{10, 19}] = "DEF";

This looks easy enough: we just have to write the proper comparison operator for intervals, right?

bool operator<(const interval& x, const interval& y)
{
  return x.max < y.min;
}

But what happens if we try to insert an interval which is not disjoint with those already in the map?

m[{5, 14}] = "GHI"; // intersects both {0, 9} and {10, 19}

The short answer is that this is undefined behavior, but let's try to understand why. C++ associative containers depend on the comparison function (typically, std::less<Key>) inducing a so-called strict weak ordering on elements of Key. In layman's terms, a strict weak order < behaves as the "less than" relationship does for numbers, except that there may be incomparable elements x, y such that x ≮ y and y ≮ x; for numbers, this only happens if x = y, but in the case of a general SWO we allow for distinct, incomparable elements as long as they form equivalence classes. A convenient way to rephrase this condition is to require that incomparable elements are totally equivalent in how they compare to the rest of the elements, that is, they're truly indistinguishable from the point of view of the SWO. Getting back to our interval scenario, we have three possible cases when comparing [a, b] and [c, d]:

  • If b < c, the intervals don't overlap and [a, b] < [c, d].
  • If d < a, the intervals don't overlap and [c, d] < [a, b].
  • Otherwise, the intervals are incomparable. This can happen when [a, b] and [c, d] overlap partially or when they are exactly the same interval.

What we have described is a well-known relationship called interval order. The problem is that the interval order is not a strict weak order. Let's depict a Hasse diagram for the interval order on integer intervals [a, b] between 0 and 4:

A Hasse diagram works like this: given two elements x and y, x < y iff there is a path going upwards that connects x to y. For instance, the fact that [1, 1] < [3, 4] is confirmed by the path [1, 1] → [2, 2] → [3, 4]. But the diagram also serves to show why this relationship is not a strict weak order: for it to be so, incomparable elements (those not connected) should be indistinguishable in that they are connected upwards and downwards with the same elements, and this is clearly not the case (in fact, it is not the case for any pair of incomparable elements). In mathematical terms, our relationship is of a more general type called a strict partial order.
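The same can be checked mechanically. A minimal sketch using the naive comparison from before (renamed less here for clarity): [0, 9] and [5, 14] are incomparable, [5, 14] and [10, 19] are incomparable, yet [0, 9] < [10, 19], so incomparability is not transitive and therefore cannot be an equivalence relation:

#include <iostream>

struct interval { int min, max; };

bool less(const interval& x, const interval& y) { return x.max < y.min; }

int main()
{
  interval a{0, 9}, b{5, 14}, c{10, 19};
  std::cout << (!less(a, b) && !less(b, a)) << "\n"; // 1: a, b incomparable
  std::cout << (!less(b, c) && !less(c, b)) << "\n"; // 1: b, c incomparable
  std::cout << less(a, c) << "\n";                   // 1: but a < c
}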

Going back to C++, associative containers assume that the elements inserted form a linear arrangement with respect to <: when we try to insert a new element y that is incomparable with some previously inserted element x, the properties of strict weak orders allow us to determine that x and y are equivalent, so nothing breaks (the insertion fails as a duplicate for a std::map, or y is added next to x for a std::multimap).
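For instance, with the naive operator< above, [2, 7] is incomparable with a previously inserted [0, 9], so std::map treats both keys as equivalent (a sketch, reusing the definitions from the beginning of the article):

#include <iostream>
#include <map>
#include <string>

struct interval { int min, max; };

bool operator<(const interval& x, const interval& y) { return x.max < y.min; }

int main()
{
  std::map<interval, std::string> m;
  m[{0, 9}] = "ABC";
  auto [it, inserted] = m.insert({{2, 7}, "XYZ"});
  std::cout << inserted << " " << it->second << "\n"; // prints "0 ABC":
                                                      // insertion failed as a duplicate
}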

There's a way to accommodate our interval scenario with std::map, though. As long as the elements we are inserting belong to the same connecting path or chain, std::map can't possibly "know" whether our relationship is a strict weak order or not: it certainly looks like one for the limited subset of elements it has seen so far. Implementation-wise, we just have to make sure we're not comparing partially overlapping intervals:

struct interval_overlap: std::runtime_error
{
  interval_overlap(): std::runtime_error("interval overlap"){}
};

bool operator<(const interval& x, const interval& y)
{
  if(x.min == y.min) {
    if(x.max != y.max) throw interval_overlap();
    return false;
  }
  else if(x.min < y.min) {
    if(x.max >= y.min) throw interval_overlap();
    return true;
  }
  else /* x.min > y.min */
  {
    if(x.min <= y.max) throw interval_overlap();
    return false;
  }
}

std::map<interval, std::string> m;

m[{0, 9}] = "ABC";
m[{10, 19}] = "DEF";
m[{5, 14}] = "GHI"; // throws interval_overlap

So, when we try to insert an element that would violate the strict weak ordering constraints (that is, it lies outside the chain connecting the intervals inserted so far), an exception is thrown and no undefined behavior is hit. A strict reading of the standard would not allow this workaround, as it is required that the comparison object for the map induce a strict weak ordering for all possible values of Key, not only those in the container (or that is my interpretation, at least): for all practical purposes, though, this works and will foreseeably continue to work for all future revisions of the standard.

Bonus point. Thanks to heterogeneous lookup, we can extend our use case to support lookup for integers inside the intervals:

struct less_interval
{
  using is_transparent = void;

  bool operator()(const interval& x, const interval& y) const
  {
    // as operator< before
  }

  bool operator()(int x, const interval& y) const
  {
    return x < y.min;
  }
  
  bool operator()(const interval& x, int y) const
  {
    return x.max < y; 
  }    
};

std::map<interval, std::string, less_interval> m;
  
m[{0, 9}] = "ABC";
m[{10, 19}] = "DEF";

std::cout << m.find(5)->second << "\n"; // prints "ABC"

Exercise for the reader: Can you formally prove that this works? (Hint: define a strict weak order on ℕ ∪ I, where ℕ is the set of natural numbers and I is a collection of disjoint integer intervals.)