Bannalia: trivial notes on themes diverse: October 2013

Sunday, October 27, 2013

Insertion complexity of C++ unordered associative containers

C++ unordered associative containers make the beautiful promise of (average) constant time insertion provided hashing is done properly. Let's see the algorithmic theory hash table insertion relies on. We restrict ourselves to discussing containers not allowing duplicate elements (unordered_set and unordered_map, which, for the purpose of this analysis, are entirely equivalent).

So, we want to calculate the average complexity of this model of running insertion into a container:

void insert(container& c,unsigned int n)
{
  while(n--)c.insert(rnd());
}

where rnd is a random source generating non-repeating integers and c is empty upon invoking insert. Let B₀ be the initial number of buckets of an empty container and F_max its maximum load factor (usually 1): when there are n-1 < F_maxB₀ elements in c, the time taken to insert a new one is

t(n) = a + b(n-1)/B₀,

for some constants a and b: the first addend represents the constant insertion time of the element into the structure and the second one accounts for collisions into the destination bucket: if hashing is distributing elements uniformly, there are on average (n-1)/B₀ previous elements in each bucket. So, the time complexity of insert(n)(if n ≤ F_maxB₀) is

f(n) = Σ_i=1,..,n t(i) = an + bn(n-1)/2B₀.

Now, when the maximum load factor is hit (n = F_maxB₀), an additional insertion involves rehashing the table to a new, larger number of buckets B. In this scenario, insert(n) complexity is

f(n) = a(n-m) + b(n(n-1) - m(m-1))/2B + r(m,B) + f(m),

with

m := F_maxB₀,
r(m,B) := a'm + b'm(m-1)/2B.

This is simpler than it seems: f(m) is the time taken before rehashing, r(m,B) represents rehashing m elements from B₀ to B (for the sake of generality we allow for a', b' to be different than a, b because no new nodes are allocated, insertion can take some more time than the regular case due to the creation of the new bucket array, collisions need not be checked, etc.) and the rest of the expression models the insertion of the remaining n-m elements after rehashing. The formula we have derived applies recursively to any n provided that B is the number of buckets used by the container to hold at least n elements and B₀ is the number of buckets after the last rehashing operation (0 if no previous rehashing took place), and defining by convention r(0,0) = f(0) = 0. This is a C++ implementation of f(n) (name running_insertion) and f(n)/n (average_running_insertion):

double rehash(unsigned int m,unsigned int B)
{
  static const double aprime=1.0,bprime=1.0;
  return m==0?0:aprime*m+bprime*m*(m-1)/(2*B);
}

double running_insertion(unsigned int n,double Fmax)
{
  static const double a=1.0,b=1.0;
  static const std::array<unsigned int,18> Bs={
    0, 53, 97, 193, 389, 769,
    1543, 3079, 6151, 12289, 24593,
    49157, 98317, 196613, 393241, 786433,
    1572869, 3145739
  };

  if(n==0)return 0.0;

  auto it=std::lower_bound(
    Bs.begin(),Bs.end(),(unsigned int)std::ceil(n/Fmax));
  if(it==Bs.end())throw std::out_of_range("running_insertion");
  unsigned int B=*it,m=(unsigned int)(Fmax*(*(it-1)));
  
  return a*(n-m)+(b*n*(n-1)-b*m*(m-1))/(2*B)+
         rehash(m,B)+running_insertion(m,Fmax);
}

double average_running_insertion(unsigned int n,double Fmax=1.0)
{
  return running_insertion(n,Fmax)/n;
}

53, 97, 193,... is the sequence of bucket array sizes used by Boost.Unordered and other implementations of unordered associative containers. The graphic shows f(n)/n with F_max = a = b = a' = b' = 1 for n = 1 to 3·10⁶; the horizontal scale is logarithmic.

Before the first rehashing happens (that is, when n ≤ 53), f(n)/n grows as collisions are more frequent, but the slope of the function is negative after every subsequent rehashing because f(n)/n amortizes such steps. From 1,000 elements onwards, the average value of f(n)/n stabilizes at around 3.6 and does not grow indefinitely. This can be proved formally:

Proposition. f(n)/n is bounded if bucket array sizes follow a geometric progression.

Proof. By hypothesis, bucket array sizes form a sequence B_i = B₀Kⁱ for some K > 1. Let's call N_i := F_maxB_i. Now, when n > N₁ the recursive calculation of f(n) can be unfolded to

f(n) = a(n-N_k_-1) + b(n(n-1) - N_k_-1(N_k_-1-1))/2B_k +
+ Σ_i=0,..,k-2(a(N_i₊₁-N_i) + b(N_i₊₁(N_i₊₁-1) - N_i(N_i-1))/2B_i₊₁) +
+ aN₀ + bN₀(N₀-1)/2B₀ +
+ Σ_i=0,..,k-1(a'N_i + b'N_i(N_i-1)/2B_i₊₁),

where k = ceil(log_K n/N₀), that is, n/N₀ ≤ K^k < nK/N₀. In a [N_i + 1, N_i₊₁ + 1] interval, Δ(f(n)/n) is monotonically non-decreasing, i.e. f(n)/n is convex (proof omitted), so its maximum happens at N_i + 1 or N_i₊₁ + 1 and hence we can restrict our study of boundedness of f(n)/n to the points n = N_i + 1. Approximating x-1 to x for x ≫ 1, dropping the terms outside of the summatories (as they are constant or grow slower than n) and taking into account the relationship between bucket array sizes, we have

f(N_i + 1) ≃ Σ_i=0,..,k-2(a(N_i₊₁-N_i) + bF_max(N_i₊₁/2 - N_i/2K)) +
+ Σ_i=0,..,k-1(a'N_i + b'F_maxN_i/2K) =
= (a(K-1) + bF_max(K²-1)/2K) Σ_i=0,..,k-2N_i +
+ (a' + b'F_max/2K) Σ_i=0,..,k-1 N_i =
= (a(K-1) + bF_max(K²-1)/2K)N₀(K^k^-1-1)/(K-1) +
+ (a' + b'F_max/2K)N₀(K^k-1)/(K-1) <

< n(a(K-1) + bF_max(K²-1)/2K + a'K + b'F_max/2)/(K-1),

thus f(n)/n is bounded.

The (approximate) bounding formula yields 4.25 for the function depicted above (K ≃ 2), in excellent agreement with the graph.

There is an alternative insertion scenario where the total number of elements is known in advance and the user can reserve space to avoid rehashing:

void insert(container& c,unsigned int n)
{
  c.reserve(n);
  while(n--)c.insert(rnd());
}

The associated complexity of this operation is

f(n) = an + bn(n-1)/2B,

where B is the minimum bucket size such that n ≤ F_maxB. f(n)/n is, again, bounded, as shown in the figure.

We omit the proof, which is much simpler than the previous one. In this case, the bound on f(n)/n is a + bF_max/2. Note that this does not depend on K precisely because there is no rehashing involved.

So, the promise holds true: if the container uses bucket array sizes following a geometric progression and hashing behaves well, insertion is O(1) on average (either with or without rehashing). In a later entry we will see whether real implementations of unordered associative containers live up to theory.

Friday, October 25, 2013

Implementation of C++ unordered associative containers with duplicate elements

In a prior entry we reviewed the data structures used by some popular implementations of C++ unordered associative containers without duplicate elements (unordered_set and unordered_map) and described a new approach introduced by Boost.MultiIndex hashed indices. Let's continue with the corresponding containers allowing for duplicate elements, unordered_multiset and unordered_multimap.

Hash tables and duplicate elements do not mix well together, the two basic problems being:

Load factor control based on the total number of elements rather than the number of different elements leads to unnecessarily large and sparsely populated bucket arrays;
Algorithm complexity is not bound by the load factor but instead depends on the average length of groups of equivalent elements.

Implementations of unordered_multiset/unordered_multimap can do nothing to alleviate the difficulties with load factor management: the standard codifies its behavior very precisely with little room for reinterpretation or improvement. An observant user can remedy the situation on her own by setting FG as the load factor, where F controls the level of occupancy (i.e. it is the equivalent load factor for the case where no duplicate elements are allowed) and G is the average length of element groups. As for the second problem, an implementation may choose to enrich the internal data structure so that element groups can be skipped in constant time (the standard specifies that equivalent elements must be packed together): in this case, group length is no longer an issue and the container provides the same performance guarantees as the non-duplicate version. What do popular implementations do in this respect?

Dinkumware, libc++, libstdc++-v3

None of these libraries makes any special provision to handle groups of equivalent elements in an efficient manner: they rely on the same data structure used for non-duplicate versions of the containers.

Boost.Unordered

This library does augment its data structure to cope with equivalent elements:

The figure shows a group of five equivalent elements located at bucket b2. An additional back pointer is added to the original structure that reverse-links the elements of the group together in the way a circular doubly linked list does. This back pointer allows groups to be treated as whole units: when traversing a bucket for insertion or erasure, only the first element of each group need be inspected, and the rest can be skipped by following the back pointer to the last element. The penalty for this performance improvement is one extra pointer per node.

Boost.MultiIndex

The same data structure used for containers without duplicates can be further tweaked to efficiently handle groups of equivalent elements at no extra memory overhead:

To interpret the tangle of pointers shown in the figure we use the Xn, Xp terminology introduced in the first article. As there, the first element of each bucket is redirectioned to have Xp point to the associated bucket entry. Now consider a group of three or more equivalent elements with F, S, P, L pointers to the first, second, penultimate and last node, as marked in the figure (S = P when the group has length 3); the following redirections are applied:

Sp = L,
Pn = F.

It is not hard to prove that:

X is the first of its bucket iff Xpnn = X,
X is the last of its bucket iff Xnpn = X,
X is the first of a group (of length ≥3) iff Xnp ≠ X and Xnppn = X,
X is the second of a group (of length ≥3) iff Xpn ≠ X and Xppnn = X,
X is the penultimate of a group (of length ≥3) iff Xnp ≠ X and Xnnpp = X,
X is the last of a group (of length ≥3) iff Xpn ≠ X and Xpnnp = X,

that is, each special position can be univocally determined. Moreover, given X we can apply these properties to locate in constant time the nodes following and preceding X:

The element following X is Xnnp if X is the penultimate of a group (of length ≥3), Xn otherwise,
the element preceding X is Xpn if X is the first of its bucket, Xppn if X is the second of a group (of length ≥3), Xp otherwise.

So, even with all these redirections we can still treat the structure as a circular doubly linked list with O(1) linking/unlinking, and at the same time take benefit of group identification to speed up hash operations. The procedure for insertion follows these steps:

Use the hash function to locate the corresponding bucket entry → constant time.
Traverse the first elements of each group in the bucket looking for a group of equivalent elements → time linear on the number of groups in the bucket.
If a group was found, link into it; otherwise, link at the beginning of the bucket → constant time.
Adjust bucket entry pointers as needed → constant time.

Group skipping is constant time: if X is the first element of a group of length ≥3, the last element is reached with Xnp; if the length is 1 or 2, fast skipping is not available, but the number of elements to check against is, again, 1 or 2, i.e. bounded. The scheme for erasure looks like:

Unlink from the (doubly linked) list → constant time.
Adjust bucket entry pointers if the element is the first or the last element of its bucket, or both → constant time.

The actual implementation is more complicated than this overall description suggests, since linking and unlinking must maintain the special redirections used to identify groups, but the net result from the point of view of performance is: insertion and erasure times depend on the number of groups rather than the total number of elements and have the same complexity as the non-duplicate case, i.e. O(1) for insertion if the hash function works properly, unconditionally O(1) for erasure.

Sunday, October 20, 2013

Implementation of C++ unordered associative containers

Unordered associative container is a fancy name adopted by C++ for what the rest of the world refers to as a hash table. With the 2011 revision of the standard, the language provides four different unordered associative containers:

unordered_set and unordered_map,
unordered_multiset and unordered_multimap, allowing for duplicate elements.

Given the particular interface these containers have, their internal representation is universally based on closed addressing, of which the following is a straightforward implementation.

In this data structure, each bucket points to a null-terminated singly linked list of elements. If the provided hash function works as it should, the number of elements in each bucket only rarely exceeds by much the specified load factor, so insertion and erasure of elements is O(1) on average. In the figure above, buckets b1 and b3 have one element each, and bucket b2 holds two elements. Simple and memory-efficient as this structure is, it turns out C++ unordered associative containers cannot use it directly: containers are required to be traversable, i.e. they must provide an associated iterator type which can go through all the elements of a container c as comprised in the range [c.begin(), c.end()). So, how does an iterator go from say bucket b1 to b2? The obvious solution of scanning the bucket array until the next non-empty bucket is found does not work, since the average time this operation takes grows as the bucket array is depleted and more buckets need to be walked through to reach the following element, whereas the standard requires that iterator increment be O(1). As this has been discussed in extenso elsewhere, we'll make a long story short and simply conclude that, in order to meet the requirements of C++ unordered associative containers, all the elements must be linked together.

Let's see the resulting data structures used by some popular implementations of unordered associative containers, plus a new approach adopted by Boost.MultiIndex. We consider containers without duplicate elements fist (unordered_set and unordered_map), and study unordered_multiset and unordered_multimap in a further article. In what follows we call N the number of elements in the container and B the number of its buckets.

Dinkumware

These are the providers of the Microsoft Visual Studio C++ standard library implementation. Their hash tables are designed as shown in the figure:

(A note on this and the following diagrams: although seemingly implied by the pictures, bucket entries need not be arranged in the same order as their corresponding buckets are traversed.)

As anticipated, all the elements are linked together, although in this case they are doubly so: this allows for bidirectional iteration at the expense of one more pointer per element, and, more importantly, makes unlinking elements extremely easy. As for the bucket array, each entry holds two pointers to the first and last element of the bucket, respectively (the latter pointer is needed to determine the extent of the bucket). Insertion of a new element is simple enough:

Use the hash function to locate the corresponding bucket entry → constant time.
Look for duplicate elements in the bucket and abort insertion if found → time linear on the number of elements in the bucket.
Link into the element list at the beginning of the bucket → constant time.
Adjust bucket entry pointers as needed → constant time.

And erasure goes as follows:

Use the hash function to locate the corresponding bucket entry → constant time.
Unlink from the element list → constant time.
Adjust bucket entry pointers as needed → constant time.

The resulting complexity of these operations is then O(1) (in the case of insertion, under the assumption that the hash function is well behaved and does not overpopulate buckets) and the memory overhead is two pointers per element plus two pointers per bucket entry:

2N + 2B,

All is well then, right? Well, actually not: Dinkumware's scheme has a serious problem with erasure, as the first step of the procedure

Use the hash function to locate the corresponding bucket entry.

will fail if the user-provided hash function throws an exception, and the C++ standard requires that erasure work unconditionally and never throw. Blunt as it sounds, it is my opinion that Dinkuware's implementation of C++ unordered associative containers is broken.

Boost.Unordered, libc++, libstdc++-v3

Boost.Unordered and libc++ (Clang's standard library implementation) use basically the same data structure, whose design is a direct application of three guiding principles:

All elements are linked (here, singly linked) together.

Recognizing the fact that, in order to implement erasure without throwing, we must connect an element with its bucket entry without invoking the hash function, the associated hash value is stored as part of the element node.

This is the trickiest part: in a singly linked list, erasure of an element takes linear time unless we know the preceding element in advance. So, instead of having each bucket entry point to the first element in the bucket, we make it point to the element before that one. The figure can be more easily understood if color codes are taken into account: for instance, b2, which is yellow, points to a fuchsia element (in b1).

The algorithm for insertion is based on the same guidelines as those of Dinkumware (with the complication that the end of the bucket is located by comparing stored hash values), whereas erasure follows these steps:

Locate the associated bucket entry via the stored hash value → constant time, never throws.
Locate the preceding element → linear on the number of elements in the bucket.
Use the preceding element to unlink from the element list → constant time.
Adjust bucket entry pointers as needed → constant time.

Insertion and erasure are then O(1) for reasonably good hash distributions, which is the minimum mandated by the C++ standard. The memory overhead is, using the same terminology as before,

2N + 1B,

(here we are assuming that storing a hash value takes the same space as a pointer, namely a word.)

GNU Standard C++ Library v3, also known as libstdc++-v3, provides a memory optimization over this data structure: if the hash function is determined to not throw (by noexcept specification) and is marked as "fast" (via some dedicated type trait), hash values are not stored and the function is safely used instead. The way this optimization is currently implemented looks to me to a bit dangerous, as the __is_fast_hash type trait is true by default except for some basic types, which means that all noexcept user-provided hash functions will be deemed as fast unless the user is aware of this implementation-specific trait and states otherwise.

So, leaving libstdc++-v3 optimizations aside, 2N + 1B seems to be the minimum overhead required to implement C++ unordered associative containers based on closed addressing and meeting the complexity requirements of the standard. In fact, we can do better in the area of complexity without giving up extra memory.

A novel data structure

As of Boost 1.56, Boost.MultiIndex hashed indices internally use a new hash table design with some nice complexity characteristics. The key insight behind the approach taken is this: a circular doubly linked list has redundancies that we can take advantage of to accomodate extra information:

Given a node pointer X, let's call Xn and Xp the pointers to the next and prior node, respectively. A circular doubly-linked list maintains this invariant:

X = Xnp = Xpn for every X in the list.

With some care, we can redirect Xn or Xp (which is detected by the breach of the former invariant) provided we know how to get back to X. This is precisely what we are doing to associate elements to bucket entries in a hash table:

Think of this data structure as a simple circular doubly linked list where some redirections have been applied in the following case:

If an element X is the first of its bucket, Xp points to the bucket entry.

Besides, bucket entries point to the element preceding the first one in the bucket, just as Boost.Unordered does. Extending our terminology so that if X points to a bucket entry then Xn denotes the element the entry points to, we have these formal properties:

X is the first element of its bucket iff Xpn ≠ X,
X is the last element of its bucket iff Xnp ≠ X,
the element following X is always Xn,
the element preceding X is Xpn if X is the first of its bucket, Xp otherwise.

Note that the evaluation of all these properties can be done in constant time, so we can treat our element list as a circular doubly linked list with special operations for accessing the preceding and following nodes and use it for implementing hash table insertion (following the same general steps as with the other data structures) and erasure:

Unlink from the (doubly linked) list → constant time.
Adjust bucket entry pointers if the element is the first or the last element of its bucket, or both → constant time.

Erasure of an element does not imply traversing the bucket, so unlinking happens in constant time even if the bucket is crowded by a bad behaved hash function; we have then designed a hash table with unconditional O(1) erasure complexity at no extra cost in terms of memory overhead (2N + 1B).

Wednesday, October 16, 2013

Hash tables with duplicate elements

Let's study some statistics of a hash table based on closed addressing. We call N the number of elements inserted into the table and B the number of buckets, and define the table load factor as F = N/B. If the hash function being used behaves well, the distribution of elements across the buckets will be approximately uniform, that is, each element will have an independent probability 1/B of being located at a particular bucket. The number of elements in a bucket, denoted by S, follows then a binomial distribution:

S(n) := Pr(size of a bucket is n) =
= P₁_/_B(n | N) = C(N,n) (1/B)ⁿ(1-1/B)^N-n,

where C(N,n) is the binomial coefficient N choose n. Unsurprisingly enough, the average size of a bucket, i.e. the mean value of S, is

E[S]= N/B = F.

When N ≫ 1/B, P₁_/_B(n | N) can be approximated by a Poisson distribution

P_F(n) = Fⁿe^-F/n!,

which depends on F alone. The figure shows P_F(n) for some values of F.

Two interesting statistical descriptors are the probability of a bucket being empty and the average size of a non-empty bucket, which we denote by P_empty and S_full, respectively. Using the Poisson approximation, these can be calculated as:

P_empty = P_F(0) = e^-F,
S_full = Σ_n_≥1nS(n)/P_empty = E[S]/(1-P_empty) = F/(1-e^-F).

For a commonly used load factor F = 1, we have P_empty = 0.37, S_full = 1.58: 37% of buckets are wasted (since the hash function can't do miracles: it is well-behaved, not perfect), and the remaining 63% usually hold one or two elements, rarely more. This is how a typical distribution of elements across a hash table looks like:

Consider now a variation where duplicate elements are allowed and we insert M groups of elements (with M ≫ 1/B), each of a size following some independent probability distribution G(n) with mean value G. Again assuming well-behaved hashing, the probability distribution of bucket sizes is given by

S(n) = Σ_m≥0Pr(m groups in the bucket)Pr(sum of lenghts of m groups is n)=
= Σ_m≥0P_F'(m)G^*^m(n),

where F' = M/B and G^*^m(n) is the n-th convolution power of G(n). The mean value of S is

E[S] = Σ_n≥0 n(Σ_m≥0P_F'(m)G^*^m(n)) =
= Σ_m≥0 P_F'(m)(Σ_n≥0nG^*^m(n)) = Σ_m≥0P_F'(m)E[G^*^m] =
= Σ_m≥0P_F'(m)mE[G] = E[P_F']E[G] =
=F'G = MG/B = N/B = F,

obviously. So, average occupancy remains the same when allowing for duplicate elements. But if we look at empty and non-empty buckets separately, the situation changes dramatically:

P_empty = P_F'(0) = e^-F/G,
S_full = E[S]/(1-P_empty) = F/(1-e^-F/G)

(for the latter calculation we have relied on the assumption that G(0) = 0, i.e. there are no "ghost" groups of zero elements.) Now, empty buckets quickly approach 100% as G grows, and S_full ≈ G when G ≫ F. The figure shows P_empty and S_full as functions of G for F = 1:

And this is how bucket occupancy looks like with F = 1 and element group sizes following a normal distribution around G = 5:

What is happening here? It is instructive to consider this sparse usage of buckets in the context of a hash table with fixed maximum load factor F_max that accepts a stream of duplicate element groups and grows its bucket array ("rehashes") as F hits F_max. Since duplicate elements end at the same bucket no matter how big B, rehashing merely spreads occupied buckets across an increasingly larger and emptier array; this also has the effect of greatly diminishing the probability that two groups are inserted into the same bucket, but the average number of elements in a non-empty bucket won't ever go below G, which ultimately dominates S_full over F_max.

Hash tables were not designed with duplicate elements in mind. To accommodate them, load factor control based on N/B is an ill-advised policy leading to unnecessarily large bucket arrays: a more sensible approach involves treating groups of equivalent elements as units of occupancy and thus monitor the alternative load factor M/B. Unfortunately, the C++ programming language overlooked these problems when defining std::unordered_multiset and std::unordered_multimap. If the specification of these containers were to be revisited, it could use some improvements:

An alternative hash policy interface (say unique_load_factor, unique_max_load_factor) based on M/B rather than N/B.
Complexity requirements guaranteeing that implementations include the internal machinery necessary to efficiently skip groups of duplicate elements and provide truly O(1) operation (under good hashing conditions); in this respect, specifying equal_range complexity to be average case O(1) should suffice.

Subscribe to: Posts ( Atom )