Let's study some statistics of a hash table based on *closed addressing*. We call *N* the number of elements inserted into the table and *B* the number of buckets, and define the table *load factor* as *F* = *N*/*B*. If the hash function being used behaves well, the distribution of elements across the buckets will be approximately uniform, that is, each element will have an independent probability 1/*B* of being located at a particular bucket. The number of elements in a bucket, denoted by *S*, then follows a binomial distribution:

S(n) := Pr(size of a bucket is n) = P_{1/B}(n | N) = C(N, n) (1/B)^{n} (1 - 1/B)^{N-n},

where *C*(*N*, *n*) is the binomial coefficient *N* choose *n*. Unsurprisingly enough, the average size of a bucket, i.e. the mean value of *S*, is
E[S] = N/B = F.

When *N* and *B* grow large while *F* is kept constant, *S* is well approximated by a Poisson distribution with mean *F*:

P_{F}(n) = F^{n} e^{-F} / n!,

which depends on *F* alone. The figure shows P_{F}(n) for some values of *F*.
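As a quick numerical check (a sketch of mine, not part of the original text; the parameter values are arbitrary), the binomial and Poisson expressions can be compared directly:

```cpp
#include <cmath>
#include <cstdio>

// Binomial pmf C(N, n) (1/B)^n (1 - 1/B)^(N-n), computed via lgamma
// to avoid overflow for large N.
double binomial_pmf(int N, int B, int n) {
    double log_c = std::lgamma(N + 1.0) - std::lgamma(n + 1.0) -
                   std::lgamma(N - n + 1.0);
    return std::exp(log_c + n * std::log(1.0 / B) +
                    (N - n) * std::log(1.0 - 1.0 / B));
}

// Poisson pmf F^n e^{-F} / n!
double poisson_pmf(double F, int n) {
    return std::exp(n * std::log(F) - F - std::lgamma(n + 1.0));
}

int main() {
    const int N = 10000, B = 10000;  // F = N/B = 1
    const double F = double(N) / B;
    for (int n = 0; n <= 5; ++n)
        std::printf("n=%d  binomial=%.6f  poisson=%.6f\n",
                    n, binomial_pmf(N, B, n), poisson_pmf(F, n));
}
```

Already at these sizes the two columns agree to about four decimal places.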
Two interesting statistical descriptors are the probability of a bucket being empty and the average size of a *non-empty* bucket, which we denote by *P*_{empty} and *S*_{full}, respectively. Using the Poisson approximation, these can be calculated as:

P_{empty} = P_{F}(0) = e^{-F},

S_{full} = Σ_{n≥1} n S(n) / (1 - P_{empty}) = E[S] / (1 - P_{empty}) = F / (1 - e^{-F}).

For a commonly used load factor *F* = 1, we have *P*_{empty} = 0.37, *S*_{full} = 1.58: 37% of the buckets are wasted (since the hash function can't do miracles: it is well-behaved, not perfect), and the remaining 63% usually hold one or two elements, rarely more. This is what a typical distribution of elements across a hash table looks like:
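Such a distribution is easy to reproduce empirically. The following is a minimal sketch (not from the original post; the constants are arbitrary) that models well-behaved hashing by scattering *N* elements uniformly over *B* buckets and measures both descriptors:

```cpp
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int B = 100000, N = 100000;  // F = N/B = 1
    std::mt19937 gen(12345);
    std::uniform_int_distribution<int> bucket(0, B - 1);

    // A well-behaved hash is modeled by placing each element in a
    // uniformly random bucket.
    std::vector<int> size(B, 0);
    for (int i = 0; i < N; ++i) ++size[bucket(gen)];

    int empty = 0;
    for (int s : size) if (s == 0) ++empty;

    // Every element lives in a non-empty bucket, so S_full = N / (B - empty).
    std::printf("P_empty = %.3f (predicted e^-1 ~ 0.368)\n", double(empty) / B);
    std::printf("S_full  = %.3f (predicted 1/(1-e^-1) ~ 1.582)\n",
                double(N) / (B - empty));
}
```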
Consider now a variation where duplicate elements are allowed and we insert *M* groups of elements (with *M* ≫ 1), each of a size following some independent probability distribution *G*(*n*) with mean value *G*. Again assuming well-behaved hashing, the probability distribution of bucket sizes is given by

S(n) = Σ_{m≥0} Pr(m groups in the bucket) Pr(sum of lengths of m groups is n)
     = Σ_{m≥0} P_{F'}(m) G^{*m}(n),

where *F'* := *M*/*B* is the load factor measured in groups and G^{*m} denotes the *m*-fold convolution of *G*. From this,
E[S] = Σ_{n≥0} n (Σ_{m≥0} P_{F'}(m) G^{*m}(n))
     = Σ_{m≥0} P_{F'}(m) (Σ_{n≥0} n G^{*m}(n))
     = Σ_{m≥0} P_{F'}(m) E[G^{*m}]
     = Σ_{m≥0} P_{F'}(m) m E[G]
     = E[P_{F'}] E[G] = F'G = MG/B = N/B = F,
obviously. So, average occupancy remains the same when allowing for duplicate elements. But if we look at empty and non-empty buckets separately, the situation changes dramatically:

P_{empty} = P_{F'}(0) = e^{-F'} = e^{-F/G},

S_{full} = E[S] / (1 - P_{empty}) = F / (1 - e^{-F/G})

(for the latter calculation we have relied on the assumption that *G*(0) = 0, i.e. there are no "ghost" groups of zero elements). Now, empty buckets quickly approach 100% as *G* grows, and *S*_{full} ≈ *G* when *G* ≫ *F*. The figure shows *P*_{empty} and *S*_{full} as functions of *G* for *F* = 1:
And this is what bucket occupancy looks like with *F* = 1 and element group sizes following a normal distribution around *G* = 5:
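The grouped case can be simulated in the same spirit; this sketch (again mine, with the normal distribution truncated to positive values so that *G*(0) = 0 holds) checks the closed forms for *F* = 1 and *G* = 5:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int B = 100000;
    const double F = 1.0, G = 5.0;
    const int M = int(F * B / G);  // number of groups, so that N ~ MG = FB
    std::mt19937 gen(12345);
    std::uniform_int_distribution<int> bucket(0, B - 1);
    std::normal_distribution<double> group_size(G, 1.0);

    // All elements of a group share the hash value, hence the bucket.
    std::vector<long long> size(B, 0);
    long long N = 0;
    for (int i = 0; i < M; ++i) {
        int g = std::max(1, int(std::lround(group_size(gen))));  // G(0) = 0
        size[bucket(gen)] += g;
        N += g;
    }

    long long empty = std::count(size.begin(), size.end(), 0LL);
    std::printf("P_empty = %.3f (predicted e^{-F/G} = %.3f)\n",
                double(empty) / B, std::exp(-F / G));
    std::printf("S_full  = %.3f (predicted F/(1-e^{-F/G}) = %.3f)\n",
                double(N) / (B - empty), F / (1.0 - std::exp(-F / G)));
}
```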
What is happening here? It is instructive to consider this sparse usage of buckets in the context of a hash table with a fixed maximum load factor *F*_{max} that accepts a stream of duplicate element groups and grows its bucket array ("rehashes") as *F* hits *F*_{max}. Since duplicate elements end up in the same bucket no matter how big *B* is, rehashing merely spreads the occupied buckets across an increasingly larger and emptier array; this also greatly diminishes the probability that two groups are inserted into the same bucket, but the average number of elements in a non-empty bucket won't ever go below *G*, so it is *G*, not *F*_{max}, that ultimately dominates *S*_{full}.
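This behavior can be observed directly on std::unordered_multiset, which rehashes under its default max_load_factor() of 1.0; the sketch below (illustrative, with arbitrary sizes) inserts groups of *G* = 5 duplicates and reports bucket statistics as the table grows:

```cpp
#include <cstdio>
#include <unordered_set>

int main() {
    std::unordered_multiset<int> s;  // default max_load_factor() == 1.0
    const int G = 5;                 // elements per duplicate group

    for (int key = 0; key < 200000; ++key) {
        for (int i = 0; i < G; ++i) s.insert(key);  // one group per key
        if ((key + 1) % 50000 == 0) {
            std::size_t B = s.bucket_count(), empty = 0;
            for (std::size_t b = 0; b < B; ++b)
                if (s.bucket_size(b) == 0) ++empty;
            std::printf("N=%zu B=%zu P_empty=%.3f S_full=%.2f\n",
                        s.size(), B, double(empty) / B,
                        double(s.size()) / (B - empty));
        }
    }
}
```

However large the bucket array grows, the reported *S*_{full} hovers slightly above *G*, in line with the discussion above.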
Hash tables were not designed with duplicate elements in mind. To accommodate them, load factor control based on *N*/*B* is an ill-advised policy leading to unnecessarily large bucket arrays: a more sensible approach involves treating groups of equivalent elements as units of occupancy and thus monitoring the alternative load factor *M*/*B*. Unfortunately, the C++ programming language overlooked these problems when defining std::unordered_multiset and std::unordered_multimap. If the specification of these containers were to be revisited, it could use some improvements:

- An alternative hash policy interface (say unique_load_factor, unique_max_load_factor) based on *M*/*B* rather than *N*/*B*, as sketched after this list.
- Complexity requirements guaranteeing that implementations include the internal machinery necessary to efficiently skip groups of duplicate elements and provide truly O(1) operation (under good hashing conditions); in this respect, specifying equal_range complexity to be average-case O(1) should suffice.
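To make the first suggestion concrete, here is a hypothetical sketch of what a user-side approximation of such an interface could look like; grouped_multiset and its members are invented for illustration (the standard containers offer no such policy), and *M* is tracked externally at the cost of one extra lookup per insertion:

```cpp
#include <unordered_set>

// Hypothetical wrapper modeled on the suggested unique_load_factor
// interface: it tracks M (the number of groups of equivalent elements)
// and exposes a load factor based on M/B instead of N/B.
template <typename T>
class grouped_multiset {
    std::unordered_multiset<T> impl_;
    std::size_t groups_ = 0;  // M

public:
    void insert(const T& x) {
        if (impl_.find(x) == impl_.end()) ++groups_;  // first element of a new group
        impl_.insert(x);
    }

    // M/B: the proposed unique_load_factor.
    float unique_load_factor() const {
        return float(groups_) / impl_.bucket_count();
    }

    // Approximate the proposed unique_max_load_factor by scaling the
    // underlying N/B threshold with the observed average group size,
    // so rehashing triggers on M/B rather than N/B.
    void unique_max_load_factor(float f) {
        impl_.max_load_factor(f * average_group_size());
    }

    float average_group_size() const {
        return groups_ ? float(impl_.size()) / groups_ : 1.0f;
    }

    std::size_t size() const { return impl_.size(); }
    std::size_t bucket_count() const { return impl_.bucket_count(); }
};
```

A real fix would belong in the container specification itself; the wrapper merely emulates the proposed policy from the outside.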
