Wednesday, October 16, 2013

Hash tables with duplicate elements

Let's study some statistics of a hash table based on closed addressing (i.e. separate chaining). We call N the number of elements inserted into the table and B the number of buckets, and define the table load factor as F = N/B. If the hash function being used behaves well, the distribution of elements across the buckets will be approximately uniform, that is, each element will have an independent probability 1/B of ending up in any particular bucket. The number of elements in a bucket, denoted by S, then follows a binomial distribution:
S(n) := Pr(size of a bucket is n) = P_{1/B}(n | N) = C(N,n) (1/B)^n (1 - 1/B)^{N-n},
where C(N,n) is the binomial coefficient N choose n. Unsurprisingly enough, the average size of a bucket, i.e. the mean value of S, is
E[S] = N/B = F.
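To see this in action, here is a minimal simulation sketch (with arbitrarily chosen N and B, and a uniformly random bucket choice standing in for a well-behaved hash function) that prints the observed fraction of buckets holding n elements, which should match S(n) above:

#include <cstdio>
#include <random>
#include <vector>

int main()
{
    // Illustrative parameters: F = N/B = 1.
    const std::size_t N = 1000000, B = 1000000;

    std::mt19937 gen(12345);
    std::uniform_int_distribution<std::size_t> bucket(0, B - 1);

    // A well-behaved hash is modeled as a uniformly random bucket choice.
    std::vector<unsigned> size(B, 0);
    for (std::size_t i = 0; i < N; ++i) ++size[bucket(gen)];

    // Observed fraction of buckets of size n, to be compared against S(n).
    std::vector<std::size_t> freq(16, 0);
    for (unsigned s : size) if (s < freq.size()) ++freq[s];
    for (unsigned n = 0; n <= 5; ++n)
        std::printf("S(%u) ~ %.4f\n", n, double(freq[n]) / B);
}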
When N is large and 1/B small, P_{1/B}(n | N) can be approximated by a Poisson distribution
P_F(n) = F^n e^{-F}/n!,
which depends on F alone. The figure shows P_F(n) for some values of F.
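The quality of the approximation is easy to check numerically; the following sketch (again with arbitrary N and B giving F = 1) evaluates the exact binomial P_{1/B}(n | N) side by side with the Poisson P_F(n):

#include <cmath>
#include <cstdio>

// Exact binomial P_{1/B}(n | N), computed in log space to avoid overflow.
double binomial(unsigned N, unsigned n, double p)
{
    double log_c = std::lgamma(N + 1.0) - std::lgamma(n + 1.0) - std::lgamma(N - n + 1.0);
    return std::exp(log_c + n * std::log(p) + (N - n) * std::log(1.0 - p));
}

// Poisson approximation P_F(n) = F^n e^{-F}/n!.
double poisson(double F, unsigned n)
{
    return std::exp(n * std::log(F) - F - std::lgamma(n + 1.0));
}

int main()
{
    const unsigned N = 10000, B = 10000;   // illustrative values, F = 1
    const double   F = double(N) / B;
    for (unsigned n = 0; n <= 5; ++n)
        std::printf("n=%u  binomial=%.6f  poisson=%.6f\n",
                    n, binomial(N, n, 1.0 / B), poisson(F, n));
}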
Two interesting statistical descriptors are the probability of a bucket being empty and the average size of a non-empty bucket, which we denote by P_empty and S_full, respectively. Using the Poisson approximation, these can be calculated as:
P_empty = P_F(0) = e^{-F},
S_full = (Σ_{n≥1} n S(n))/(1 - P_empty) = E[S]/(1 - P_empty) = F/(1 - e^{-F}).
For a commonly used load factor F = 1, we have P_empty = 0.37 and S_full = 1.58: 37% of the buckets are wasted (since the hash function can't do miracles: it is well-behaved, not perfect), and the remaining 63% usually hold one or two elements, rarely more. This is what a typical distribution of elements across a hash table looks like:
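These two descriptors are trivial to tabulate under the Poisson approximation; a short sketch such as the one below reproduces the 0.37 and 1.58 figures at F = 1 (the other load factors shown are arbitrary):

#include <cmath>
#include <cstdio>

int main()
{
    // Under the Poisson approximation:
    //   P_empty = e^{-F},  S_full = F/(1 - e^{-F}).
    for (double F : {0.5, 1.0, 2.0, 5.0})
    {
        double p_empty = std::exp(-F);
        double s_full  = F / (1.0 - p_empty);
        std::printf("F=%.1f  P_empty=%.2f  S_full=%.2f\n", F, p_empty, s_full);
    }
}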
Consider now a variation where duplicate elements are allowed and we insert M groups of elements (with M large and 1/B small), each of a size independently drawn from some probability distribution G(n) with mean value G. Again assuming well-behaved hashing, the probability distribution of bucket sizes is given by
S(n) = Σ_{m≥0} Pr(m groups in the bucket) · Pr(sum of lengths of m groups is n) =
= Σ_{m≥0} P_{F'}(m) G^{*m}(n),
where F' = M/B and G^{*m}(n) is the m-th convolution power of G(n). The mean value of S is
E[S] = Σ_{n≥0} n (Σ_{m≥0} P_{F'}(m) G^{*m}(n)) =
= Σ_{m≥0} P_{F'}(m) (Σ_{n≥0} n G^{*m}(n)) = Σ_{m≥0} P_{F'}(m) E[G^{*m}] =
= Σ_{m≥0} P_{F'}(m) m E[G] = E[P_{F'}] E[G] =
= F'G = MG/B = N/B = F,
obviously. So, average occupancy remains the same when allowing for duplicate elements. But if we look at empty and non-empty buckets separately, the situation changes dramatically:
P_empty = P_{F'}(0) = e^{-F'} = e^{-F/G},
S_full = E[S]/(1 - P_empty) = F/(1 - e^{-F/G})
(for the latter calculation we have relied on the assumption that G(0) = 0, i.e. there are no "ghost" groups of zero elements). Now, empty buckets quickly approach 100% as G grows, and S_full ≈ G when G ≫ F. The figure shows P_empty and S_full as functions of G for F = 1:
And this is what bucket occupancy looks like with F = 1 and element group sizes following a normal distribution around G = 5:
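That occupancy pattern is easy to reproduce by simulation; the sketch below uses illustrative parameters (B buckets, F ≈ 1, group sizes normally distributed around G = 5 with an arbitrarily chosen standard deviation of 1) and measures P_empty and S_full, which should come out close to e^{-F/G} ≈ 0.82 and F/(1 - e^{-F/G}) ≈ 5.5:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    // Illustrative parameters: M groups of ~G duplicates into B buckets, F ≈ 1.
    const std::size_t B = 1000000;
    const double      G = 5.0;
    const std::size_t M = B / 5;              // N ≈ M·G ≈ B, so F ≈ 1

    std::mt19937 gen(12345);
    std::uniform_int_distribution<std::size_t> bucket(0, B - 1);
    std::normal_distribution<double> group_size(G, 1.0);   // sizes around G (sd arbitrary)

    std::vector<std::size_t> size(B, 0);
    for (std::size_t i = 0; i < M; ++i)
    {
        // All duplicates of a group hash identically, so they share one bucket.
        std::size_t n = static_cast<std::size_t>(std::max(1.0, std::round(group_size(gen))));
        size[bucket(gen)] += n;
    }

    std::size_t empty = 0, elements = 0;
    for (std::size_t s : size) { if (s == 0) ++empty; elements += s; }

    std::cout << "P_empty ~ " << double(empty) / B
              << "   S_full ~ " << double(elements) / (B - empty) << "\n";
}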
What is happening here? It is instructive to consider this sparse usage of buckets in the context of a hash table with a fixed maximum load factor F_max that accepts a stream of duplicate element groups and grows its bucket array ("rehashes") as F hits F_max. Since duplicate elements end up in the same bucket no matter how big B is, rehashing merely spreads the occupied buckets across an increasingly larger and emptier array; this also greatly diminishes the probability that two groups land in the same bucket, but the average number of elements in a non-empty bucket never drops below G, so it is G, rather than F_max, that ultimately dominates S_full.
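This behavior can be observed directly on std::unordered_multiset; the sketch below (with illustrative parameters of my own choosing) inserts M groups of G identical keys, lets the container rehash on its own, and then inspects the buckets:

#include <iostream>
#include <unordered_set>

int main()
{
    // Illustrative parameters: M groups of G identical keys each.
    const int G = 5;
    const int M = 100000;

    std::unordered_multiset<int> s;
    for (int k = 0; k < M; ++k)
        for (int i = 0; i < G; ++i) s.insert(k);   // all copies of k share a bucket

    std::size_t used = 0, elements = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b)
        if (std::size_t n = s.bucket_size(b)) { ++used; elements += n; }

    std::cout << "bucket_count: " << s.bucket_count()      // grown to keep N/B <= max_load_factor
              << "\nload_factor (N/B): " << s.load_factor()
              << "\nfraction of empty buckets: " << 1.0 - double(used) / s.bucket_count()
              << "\naverage size of a non-empty bucket: " << double(elements) / used << "\n";
}

However large the bucket array grows in order to keep N/B under max_load_factor(), the average size of a non-empty bucket stays at or above G.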
Hash tables were not designed with duplicate elements in mind. To accommodate them, load factor control based on N/B is an ill-advised policy that leads to unnecessarily large bucket arrays: a more sensible approach is to treat groups of equivalent elements as the units of occupancy and to monitor the alternative load factor M/B instead. Unfortunately, the C++ programming language overlooked these problems when defining std::unordered_multiset and std::unordered_multimap. If the specification of these containers were to be revisited, it could use some improvements:
  • An alternative hash policy interface (say, unique_load_factor and unique_max_load_factor) based on M/B rather than N/B; a sketch of this idea is given after the list.
  • Complexity requirements guaranteeing that implementations include the internal machinery necessary to efficiently skip groups of duplicate elements and provide truly O(1) operation (under good hashing conditions); in this respect, specifying equal_range complexity to be average case O(1) should suffice.
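As a rough, user-side illustration of the first point, the sketch below layers an M/B-based policy on top of std::unordered_multiset; grouped_multiset, unique_max_load_factor and the doubling strategy are hypothetical choices of mine, not an existing or proposed interface:

#include <limits>
#include <unordered_set>

// Hypothetical sketch: a thin wrapper that controls the bucket array with an
// M/B-based policy (groups per bucket) instead of the standard N/B one.
template <typename T>
class grouped_multiset
{
public:
    grouped_multiset()
    {
        // Effectively disable the container's own N/B-based rehashing.
        set_.max_load_factor(std::numeric_limits<float>::max());
    }

    void unique_max_load_factor(float f) { unique_mlf_ = f; }

    void insert(const T& x)
    {
        if (set_.find(x) == set_.end()) ++groups_;   // x opens a new group
        set_.insert(x);
        // Rehash when M/B exceeds the chosen threshold.
        if (float(groups_) / set_.bucket_count() > unique_mlf_)
            set_.rehash(2 * set_.bucket_count());
    }

    std::size_t size() const         { return set_.size(); }
    std::size_t bucket_count() const { return set_.bucket_count(); }

private:
    std::unordered_multiset<T> set_;
    std::size_t                groups_ = 0;
    float                      unique_mlf_ = 1.0f;
};

A library implementation could maintain the group count internally and update it in constant time on insertion and erasure, avoiding the extra lookup this user-side version pays on every insert.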
