Saturday, June 18, 2022

Advancing the state of the art for std::unordered_map implementations


Several Boost authors have embarked on a project to improve the performance of Boost.Unordered's implementation of std::unordered_map (and multimap, set and multiset variants), and to extend its portfolio of available containers to offer faster, non-standard alternatives based on open addressing.

The first goal of the project has been completed in time for Boost 1.80 (due August 2022). We describe here the technical innovations introduced in boost::unordered_map that makes it the fastest implementation of std::unordered_map on the market.

Closed vs. open addressing

On a first approximation, hash table implementations fall on either of two general classes:

  • Closed addressing (also known as separate chaining) relies on an array of buckets, each of which points to a list of elements belonging to it. When a new element goes to an already occupied bucket, it is simply linked to the associated element list. The figure depicts what we call the textbook implementation of closed addressing, arguably the simplest layout, and among the fastest, for this type of hash tables.
textbook layout
  • Open addressing (or closed hashing) stores at most one element in each bucket (sometimes called a slot). When an element goes to an already occupied slot, some probing mechanism is used to locate an available slot, preferrably close to the original one.

Recent, high-performance hash tables use open addressing and leverage on its inherently better cache locality and on widely available SIMD operations. Closed addressing provides some functional advantages, though, and remains relevant as the required foundation for the implementation of std::unodered_map.

Restrictions on the implementation of std::unordered_map

The standardization of C++ unordered associative containers is based on Matt Austern's 2003 N1456 paper. Back in the day, open-addressing approaches were not regarded as sufficiently mature, so closed addressing was taken as the safe implementation of choice. Even though the C++ standard does not explicitly require that closed addressing must be used, the assumption that this is the case leaks through the public interface of std::unordered_map:

  • A bucket API is provided.
  • Pointer stability implies that the container is node-based. In C++17, this implication was made explicit with the introduction of extract capabilities.
  • Users can control the container load factor.
  • Requirements on the hash function are very lax (open addressing depends on high-quality hash functions with the ability to spread keys widely across the space of std::size_t values.)

As a result, all standard library implementations use some form of closed addressing for the internal structure of their std::unordered_map (and related containers).

Coming as an additional difficulty, there are two complexity requirements:

  • iterator increment must be (amortized) constant time,
  • erase must be constant time on average,

that rule out the textbook implementation of closed addressing (see N2023 for details). To cope with this problem, standard libraries depart from the textbook layout in ways that introduce speed and memory penalties: this is, for instance, how libstdc++-v3 and libc++ layouts look like:

libstdc++-v3/libc++ layout

To provide constant iterator increment, all nodes are linked together, which in its turn forces two adjustments to the data structure:

  • Buckets point to the node before the first one in the bucket so as to preserve constant-time erasure.
  • To detect the end of a bucket, the element hash value is added as a data member of the node itself (libstdc++-v3 opts for on-the-fly hash calculation under some circumstances).

Visual Studio standard library (formerly from Dinkumware) uses an entirely different approach to circumvent the problem, but the general outcome is that resulting data structures perform significantly worse than the textbook layout in terms of speed, memory consumption, or both.

Boost.Unordered 1.80 data layout

The new data layout used by Boost.Unordered goes back to the textbook approach:

Boost.Unordered layout

Unlike the rest of standard library implementations, nodes are not linked across the container but only within each bucket. This makes constant-time erase trivially implementable, but leaves unsolved the problem of constant-time iterator increment: to achieve it, we introduce so-called bucket groups (top of the diagram). Each bucket group consists of a 32/64-bit bucket occupancy mask plus next and prev pointers linking non-empty bucket groups together. Iteration across buckets resorts to a combination of bit manipulation operations on the bitmasks plus group traversal through next pointers, which is not only constant time but also very lightweight in terms of execution time and of memory overhead (4 bits per bucket).

Fast modulo

When inserting or looking for an element, hash table implementations need to map the element hash value into the array of buckets (or slots in the open-addressing case). There are two general approaches in common use:

  • Bucket array sizes follow a sequence of prime numbers p, and mapping is of the form hh mod p.
  • Bucket array sizes follow a power-of-two sequence 2n, and mapping takes n bits from h. Typically it is the n least significant bits that are used, but in some cases, like when h is postprocessed to improve its uniformity via multiplication by a well-chosen constant m (such as defined by Fibonacci hashing), it is best to take the n most significant bits, that is, h → (h × m) >> (Nn), where N is the bitwidth of std::size_t and >> is the usual C++ right shift operation.

We use the modulo by a prime approach because it produces very good spreading even if hash values are not uniformly distributed. In modern CPUs, however, modulo is an expensive operation involving integer division; compilers, on the other hand, know how to perform modulo by a constant much more efficiently, so one possible optimization is to keep a table of pointers to functions fp : hh mod p. This technique replaces expensive modulo calculation with a table jump plus a modulo-by-a-constant operation.

In Boost.Unordered 1.80, we have gone a step further. Daniel Lemire et al. show how to calculate h mod p as an operation involving some shifts and multiplications by p and a pre-computed c value acting as a sort of reciprocal of p. We have used this work to implement hash mapping as h → fastmod(h, p, c) (some details omitted). Note that, even though fastmod is generally faster than modulo by a constant, most performance gains actually come from the fact that we are eliminating the table jump needed to select fp, which prevented code inlining.

Time and memory performance of Boost 1.80 boost::unordered_map

We are providing some benchmark results of the boost::unordered_map against libstdc++-v3, libc++ and Visual Studio standard library for insertion, lookup and erasure scenarios. boost::unordered_map is mostly faster across the board, and in some cases significantly so. There are three factors contributing to this performance advantage:

  • the very reduced memory footprint improves cache utilization,
  • fast modulo is used,
  • the new layout incurs one less pointer indirection than libstdc++-v3 and libc++ to access the elements of a bucket.

As for memory consumption, let N be the number of elements in a container with B buckets: the memory overheads (that is, memory allocated minus memory used strictly for the elements themselves) of the different implementations on 64-bit architectures are:

Implementation Memory overhead (bytes)
libstdc++-v3 16 N + 8 B (hash caching)
8 N + 8 B (no hash caching)
libc++ 16 N + 8 B
Visual Studio (Dinkumware) 16 N + 16 B
Boost.Unordered 8 N + 8.5 B

Which hash container to choose

Opting for closed-addressing (which, in the realm of C++, is almost synonymous with using an implementation of std::unordered_map) or choosing a speed-oriented, open-addressing container is in practice not a clear-cut decision. Some factors favoring one or the other option are listed:

  • std::unordered_map
    • The code uses some specific parts of its API like node extraction, the bucket interface or the ability to set the maximum load factor, which are generally not available in open-addressing containers.
    • Pointer stability and/or non-moveability of values required (though some open-addressing alternatives support these at the expense of reduced performance).
    • Constant-time iterator increment required.
    • Hash functions used are only mid-quality (open addressing requires that the hash function have very good key-spreading properties).
    • Equivalent key support, ie. unordered_multimap/unordered_multiset required. We do not know of any open-addressing container supporting equivalent keys.
  • Open-addressing containers
    • Performance is the main concern.
    • Existing code can be adapted to a basically more stringent API and more demanding requirements on the element type (like moveability).
    • Hash functions are of good quality (or the default ones from the container provider are used).

If you decide to use std::unordered_map, Boost.Unordered 1.80 now gives you the fastest, fully-conformant implementation on the market.

Next steps

There are some further areas of improvement to boost::unordered_map that we will investigate post Boost 1.80:

  • Reduce the memory overhead of the new layout from 4 bits to 3 bits per bucket.
  • Speed up performance for equivalent key variants (unordered_multimap/unordered_multiset).

In parallel, we are working on the future boost::unordered_flat_map, our proposal for a top-speed, open-addressing container beyond the limitations imposed by std::unordered_map interface. Your feedback on our current and future work is much welcome.

Thursday, March 10, 2022

Emulating template named arguments in C++20

std::unordered_map is a highly configurable class template with five parameters:

    class Key,
    class Value,
    class Hash = std::hash<Key>,
    class KeyEqual = std::equal_to<Key>,
    class Allocator = std::allocator< std::pair<const Key, Value> >
> class unordered_map;

Typical usage depends on default values for most of these parameters:

using my_map=std::unordered_map<int,std::string>;

but things get cumbersome when we want to specify one of the usually defaulted types:

using my_allocator = ...;
using my_map=std::unordered_map<
int, std::string,
std::hash<int>, std::equal_to<int>,
my_allocator< std::pair<const int, std::string> >

In the example, we are forced to specify the hash and equality predicate with their default value types just to get to the allocator, which is the parameter we really wanted to specify. Ideally we would like to have a syntax like this:

// this is not actual C++
using my_map = std::unordered_map<
Key=int, Value=std::string,
Allocator=my_allocator< std::pair<const int, std::string> >

Turns out we can emulate this by resorting to designated initializers, introduced in C++20:

typename Key, typename Value,
typename Hash = std::hash<Key>,
typename Equal = std::equal_to<Key>,
typename Allocator = std::allocator< std::pair<const Key,Value> >
struct unordered_map_config
Key *key = nullptr;
Value *value = nullptr;
Hash *hash = nullptr;
Equal *equal = nullptr;
Allocator *allocator = nullptr;

using type = std::unordered_map<Key,Value,Hash,Equal,Allocator>;

template<typename T>
constexpr T *type = nullptr;

template<unordered_map_config Cfg>
using unordered_map = typename decltype(Cfg)::type;


using my_map = unordered_map<{
.key = type<int>, .value = type<std::string>,
.allocator = type< my_allocator< std::pair<const int, std::string > > >

The approach taken by the simulation is to use designated initializers to create an aggregate object consisting of dummy null pointers: the values of the pointers do not matter, but their types are captured via CTAD and used to synthesize the associated std::unordered_map instantiation. Two more C++20 features this technique depends on are:

  • Non-type template parameters have been extended to accept literal types (which include aggregate types such as unordered_map_config instantiations).
  • The class template unordered_map_config can be specified as a non-type template parameter of unordered_map. In C++17, we would have had to define unordered_map as
    template<auto Cfg>
    using unordered_map = typename decltype(Cfg)::type;
    which would force the user to explicit name unordered_map_config in
    using my_map = unordered_map<unordered_map_config{...}>;

There is still the unavoidable noise of having to use the type template alias since, of course, aggregate initialization is about values rather than types.

Another limitation of this simulation is that we cannot mix named and unnamed parameters:

// compiler error: either all initializer clauses should be designated
// or none of them should be
using my_map = unordered_map<{
type<int>, type<std::string>,
.allocator = type< my_allocator< std::pair<const int, std::string > > >

C++20 designated parameters are more restrictive than their C99 counterpart; some of the constraints (initializers cannot be specified out of order) are totally valid in the context of C++, but I personally fail to see why mixing named and unnamed parameters would pose any problem.

Monday, January 17, 2022

Start Wordle with TARES

There have been some discussions on what the best first guess is for the game Wordle, but none, to the best of my knowledge, has used the following approach. After each guess, the game answers back with a matching result like these:

■■■■■ (all letters wrong), 

■■ (two letters right, one mispositioned),

■■■■■ (all letters right).

There are 35=243 possible answers. From an information-theoretic point of view, the word we are trying to guess is a random variable (selected from a predefined dictionary), and the information we are obtaining by submitting our query is measured by the entropy formula

H(guess) = pi log2 pi bits,

where pi is the probability that the game returns the i-th answer (i = 1, ... , 243) for our particular guess. So, the best first guess is the one for which we get the most information, that is, the associated entropy is maximum. Intuitively speaking, we are going for the guess that yields the most balanced partition of the dictionary words as grouped by their matching result: entropy is maximum when all pi are equal (this is impossible for our problem, but gives an upper bound on the attainable entropy of log2(243) = 7.93 bits).

Let's compute then the best guesses. Wordle uses a dictionary of 2,315 entries which is unfortunately not disclosed; in its place we will resort to Stanford GraphBase list. I wrote a trivial C++17 program that goes through each of the 5,757 words of Stanford's list and computes its associated entropy as a first guess (see it running online). The resulting top 10 best words, along with their entropies are:

TARES    6.20918
RATES    6.11622
TALES    6.09823
TEARS    6.05801
NARES    6.01579
TIRES    6.01493
REALS    6.00117
DARES    5.99343
LORES    5.99031
TRIES    5.98875

Thursday, September 8, 2016

Global warming as falling into the Sun

This summer in Spain has been so particularly hot that people came up with graphical jokes like this:
(Cáceres is my hometown; versions of this picture for many other Spanish populations swarm the net.) Pursuing this idea half-seriously, one can reason that an increase in global temperatures due to climate change might be journalistically equated with the Earth getting closer to the Sun and thus receiving more radiation, which analogy conjures up doomy visions of our planet falling into the blazing hell of the star: let us do the calculations.
Climate sensitivity, usually denoted by λ, links changes in global surface temperature with variations of received radiative power
ΔT = λ ΔW.
The mechanism by which radiative power changes (increased albedo, greenhouse effect) results in a different associated λ parameter. For the case of power variations due to changes in solar activity, Tung et. al have calculated λs to be in the range of 0.69 to 0.97 K/(W/m2) using data from observations of 11-year solar cycles, and estimate that the stationary sensitivity (i.e. if the change in power was permanent) would be 1.5 times higher, thus in the range of 1.03 to 1.45 K/(W/m2).
Now, the Earth is D0 = 1.496 × 108 km away from the Sun, and receives an average radiation of W0 = 1366 W/m2. Assuming far-field conditions, the radiative power received at the Earth as a function of the distance D to the Sun is then
W = w / D2,
w = 3.057 × 1025 W/sr,
which allows us to calculate ΔT = λs ΔW from ΔD = D0D, as shown in the graph for the minimum and maximum estimated values of λs. Although this cannot be checked visually, the lines are not straight but include a negligible (in these distance ranges) quadratic component.
So, the estimated increase of 0.75 °C in global temperature during the 20th century is equivalent to pushing the Earth between 30 and 40 thousand kilometers towards the Sun. Each extra °C brings us 38,000-54,000 km closer to the star. For those stuck with USCS, each °F is equivalent to 13,000-18,000 miles.
As an alarmist meme, the figure works poorly since no amount of global warming will translate to anything resembling "falling into the Sun": relative changes in distance measure in the permyriads. And, yes, the joke at the beginning of this article is definitely a gross exaggeration.

Tuesday, September 6, 2016

Compile-time checking the existence of a class template

(Updated after a suggestion from bluescarni.) I recently had to use C++14's std::is_final but wanted to downgrade to boost::is_final if the former was not available. Trusting __cplusplus implies overlooking the fact that compilers never provide 100% support for any version of the language, and  Boost.Config is usually helpful with these matters, but, as of this writing, it does not provide any macro to check for the existence of std::is_final. It turns out the matter can be investigated with some compile-time manipulations. We first set up some helping machinery in a namespace of our own:
namespace std_is_final_exists_detail{
template<typename> struct is_final{};

struct helper{};

std_is_final_exists_detail::is_final has the same signature as the (possibly existing) std::is_final homonym, but need not implement any of the functionality since it will be used for detection only. The class helper is now used to write code directly into namespace std, as the rules of the language allow (and, in some cases, encourage) us to specialize standard class templates for our own types, like for instance with std::hash:
namespace std{

struct hash<std_is_final_exists_detail::helper>
  std::size_t operator()(
    const std_is_final_exists_detail::helper&)const{return 0;}
  static constexpr bool check()
    using helper=std_is_final_exists_detail::helper;
    using namespace std_is_final_exists_detail;

operator() is defined to nominally comply with the expected semantics of std::hash specialization; it is in check that the interesting work happens. By a non-obvious but totally sensible C++ rule, the directive
using namespace std_is_final_exists_detail;
makes all the symbols of the namespace (including is_final) visible as if they were declared in the nearest namespace containing both std_is_final_exists_detail and std, that is, at global namespace level. This means that the unqualified use of is_final in
resolves to std::is_final if it exists (as it is within namespace std, i.e. closer than the global level), and to std_is_final_exists_detail::is_final otherwise. We can wrap everything up in a utility class:
using std_is_final_exists=std::integral_constant<
and check with a program
#include <iostream>

int main()
  std::cout<<"std_is_final_exists: "
that dutifully ouputs
std_is_final_exists: 0
with GCC in -std=c+11 mode and
std_is_final_exists: 1
when with -std=c+14. Clang and Visual Studio also handle this code properly.
(Updated Sep 7, 2016.) The same technique can be used to walk the last mile and implement an is_final type trait class relying on std::final but falling back to boost::is_final if the former is not present. I've slightly changed naming and used std::is_void for the specialization trick as it involves a little less typing.
#include <boost/type_traits/is_final.hpp>
#include <type_traits>

namespace my_lib{
namespace is_final_fallback{

template<typename T> using is_final=boost::is_final<T>;

struct hook{};


namespace std{

struct is_void<::my_lib::is_final_fallback::hook>:
  template<typename T>
  static constexpr bool is_final_f()
    using namespace ::my_lib::is_final_fallback;
    return is_final<T>::value;

} /* namespace std */

namespace my_lib{

template<typename T>
struct is_final:std::integral_constant<
  std::is_void<is_final_fallback::hook>::template is_final_f<T>()

} /* namespace mylib */

Thursday, July 28, 2016

Passing capturing C++ lambda functions as function pointers

Suppose we have a function accepting a C-style callback function like this:
void do_something(void (*callback)())
As captureless C++ lambda functions can be cast to regular function pointers, the following works as expected:
auto callback=[](){std::cout<<"callback called\n";};

output: callback called
Unfortunately , if our callback code captures some variable from the context, we are out of luck
int num_callbacks=0;
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";

error: cannot convert 'main()::<lambda>' to 'void (*)()'
because capturing lambda functions create a closure of the used context that needs to be carried around to the point of invocation. If we are allowed to modify do_something we can easily circumvent the problem by accepting a more powerful std::function-based callback:
void do_something(std::function<void()> callback)

int num_callbacks=0;
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";

output: callback called 1 times
but we want to explore the challenge when this is not available (maybe because do_something is legacy C code, or because we do not want to incur the runtime penalty associated with std::function's usage of dynamic memory). Typically, C-style callback APIs accept an additional callback argument through a type-erased void*:
void do_something(void(*callback)(void*),void* callback_arg)
and this is actually the only bit we need to force our capturing lambda function through do_something. The gist of the trick is passing the lambda function as the callback argument and providing a captureless thunk as the callback function pointer:
int num_callbacks=0;
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";
auto thunk=[](void* arg){ // note thunk is captureless

output: callback called 1 times
Note that we are not using dynamic memory nor doing any extra copying of the captured data, since callback is accessed in the point of invocation through a pointer; so, this technique can be advantageous even if modern std::functions could be used instead. The caveat is that the user code must make sure that captured data is alive when the callback is invoked (which is not the case when execution happens after scope exit if, for instance, it is carried out in a different thread).
Tcbrindle poses the issue of lambda functions casting to function pointers with C++ linkage, where C linkage may be needed. Although this is rarely a problem in practice, it can be solved through another layer of indirection:
extern "C" void do_something(
  void(*callback)(void*),void* callback_arg)


using callback_pair=std::pair<void(*)(void*),void*>;

extern "C" void call_thunk(void * arg)
  callback_pair* p=static_cast<callback_pair*>(arg);
int num_callbacks=0;
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";
auto thunk=[](void* arg){ // note thunk is captureless
callback_pair p{thunk,&callback};

output: callback called 1 times

Sunday, February 21, 2016

A formal definition of mutation independence

Louis Dionne poses the problem of move independence in the context of C++, that is, under which conditions a sequence of operations
is sound in the sense that the first does not interfere with the second. We give here a functional definition for this property that can be applied to the case Louis discusses.
Let X be some type and functions f: X T×X and g: X Q×X. The impurity of a non-functional construct in an imperative language such as C++ is captured in this functional setting by the fact that these functions return, besides the output value itself, a new, possibly changed, value of X. We denote by fT and fX the projection of f onto T and X, respectively, and similarly for g. We say that f does not affect g if
 gQ(x) = gQ(fX(x)) ∀xX.
If we define the equivalence relationship ~g in X as
x ~g y iff gQ(x) = gQ(y),
then f does not affect g iff
fX(x) ~g xxX
fX([x]g) ⊆ [x]gxX,
where [x]g is the equivalence class of x under ~g.
We say that f and g are mutation-independent if f does not affect g and g does not affect f, that is,
fX([x]g) ⊆ [x]g and gX([x]f) ⊆ [x]fxX,
The following considers the case of f and g acting on separate components of a tuple: suppose that X = X1×X2 and f and g depend on and mutate X1 and X2 alone, respectively, or put more formally:
fT(x1,x2) = fT(x1,x'2),
fX2(x1,x2) = x2,
gQ(x1,x2) = gQ(x'1,x2),
gX1(x1,x2) = x1
for all x1, x'1X1, x2, x'2X2. Then  f and g are mutation-independent (proof trivial). Getting back to C++, given a tuple x, two operations of the form:
are mutation-independent if i!=j; this can be extended to the case where f and g read from (but not write to) any component of x except the j-th and i-th, respectively.