Bannalia: trivial notes on themes diverse: Hash tables with duplicate elements

Let's study some statistics of a hash table based on closed addressing. We call N the number of elements inserted into the table and B the number of buckets, and define the table load factor as F = N/B. If the hash function being used behaves well, the distribution of elements across the buckets will be approximately uniform, that is, each element will have an independent probability 1/B of being located at a particular bucket. The number of elements in a bucket, denoted by S, follows then a binomial distribution:

S(n) := Pr(size of a bucket is n) =
= P₁_/_B(n | N) = C(N,n) (1/B)ⁿ(1-1/B)^N-n,

where C(N,n) is the binomial coefficient N choose n. Unsurprisingly enough, the average size of a bucket, i.e. the mean value of S, is

E[S]= N/B = F.

When N ≫ 1/B, P₁_/_B(n | N) can be approximated by a Poisson distribution

P_F(n) = Fⁿe^-F/n!,

which depends on F alone. The figure shows P_F(n) for some values of F.

Two interesting statistical descriptors are the probability of a bucket being empty and the average size of a non-empty bucket, which we denote by P_empty and S_full, respectively. Using the Poisson approximation, these can be calculated as:

P_empty = P_F(0) = e^-F,
S_full = Σ_n_≥1nS(n)/P_empty = E[S]/(1-P_empty) = F/(1-e^-F).

For a commonly used load factor F = 1, we have P_empty = 0.37, S_full = 1.58: 37% of buckets are wasted (since the hash function can't do miracles: it is well-behaved, not perfect), and the remaining 63% usually hold one or two elements, rarely more. This is how a typical distribution of elements across a hash table looks like:

Consider now a variation where duplicate elements are allowed and we insert M groups of elements (with M ≫ 1/B), each of a size following some independent probability distribution G(n) with mean value G. Again assuming well-behaved hashing, the probability distribution of bucket sizes is given by

S(n) = Σ_m≥0Pr(m groups in the bucket)Pr(sum of lenghts of m groups is n)=
= Σ_m≥0P_F'(m)G^*^m(n),

where F' = M/B and G^*^m(n) is the n-th convolution power of G(n). The mean value of S is

E[S] = Σ_n≥0 n(Σ_m≥0P_F'(m)G^*^m(n)) =
= Σ_m≥0 P_F'(m)(Σ_n≥0nG^*^m(n)) = Σ_m≥0P_F'(m)E[G^*^m] =
= Σ_m≥0P_F'(m)mE[G] = E[P_F']E[G] =
=F'G = MG/B = N/B = F,

obviously. So, average occupancy remains the same when allowing for duplicate elements. But if we look at empty and non-empty buckets separately, the situation changes dramatically:

P_empty = P_F'(0) = e^-F/G,
S_full = E[S]/(1-P_empty) = F/(1-e^-F/G)

(for the latter calculation we have relied on the assumption that G(0) = 0, i.e. there are no "ghost" groups of zero elements.) Now, empty buckets quickly approach 100% as G grows, and S_full ≈ G when G ≫ F. The figure shows P_empty and S_full as functions of G for F = 1:

And this is how bucket occupancy looks like with F = 1 and element group sizes following a normal distribution around G = 5:

What is happening here? It is instructive to consider this sparse usage of buckets in the context of a hash table with fixed maximum load factor F_max that accepts a stream of duplicate element groups and grows its bucket array ("rehashes") as F hits F_max. Since duplicate elements end at the same bucket no matter how big B, rehashing merely spreads occupied buckets across an increasingly larger and emptier array; this also has the effect of greatly diminishing the probability that two groups are inserted into the same bucket, but the average number of elements in a non-empty bucket won't ever go below G, which ultimately dominates S_full over F_max.

Hash tables were not designed with duplicate elements in mind. To accommodate them, load factor control based on N/B is an ill-advised policy leading to unnecessarily large bucket arrays: a more sensible approach involves treating groups of equivalent elements as units of occupancy and thus monitor the alternative load factor M/B. Unfortunately, the C++ programming language overlooked these problems when defining std::unordered_multiset and std::unordered_multimap. If the specification of these containers were to be revisited, it could use some improvements:

An alternative hash policy interface (say unique_load_factor, unique_max_load_factor) based on M/B rather than N/B.
Complexity requirements guaranteeing that implementations include the internal machinery necessary to efficiently skip groups of duplicate elements and provide truly O(1) operation (under good hashing conditions); in this respect, specifying equal_range complexity to be average case O(1) should suffice.

Bannalia: trivial notes on themes diverse

Wednesday, October 16, 2013

Hash tables with duplicate elements

No comments :

Post a Comment