Wednesday, December 4, 2013

Lookup cost in hash tables with duplicate elements

As a continuation of our previous entry on the statistical properties of hash tables with duplicate elements, let's now analyze how duplicates affect lookup times. Let N be the number of elements in the table, B the number of buckets and F = N/B the associated load factor. We have already learnt that the size S of a bucket follows (for large N) a Poisson distribution P_F(n) with mean F. Beginning with the non-duplicate case, what is the average number of elements checked when searching for a given key k? In such general terms, the question is underspecified, because lookup times differ depending on whether k is actually present in the table or not. The analysis can be done, though, by considering each case in isolation.
Key is not present. This is easy to calculate: the number of elements checked is simply the average length of a bucket, which we know to be F.
Key is present. We assume that the value of k, regarded as a random variable, is evenly distributed among the values stored in the table, i.e. each element has the same probability of being chosen for lookup. Define
B(n) := Pr(k is in a bucket of size n).
It is easy to see that B(n) can be calculated as
B(n) = n·P_F(n)·B/N = n·P_F(n)/F,
following from the simple fact that there are P_F(n)·B buckets of size n, each holding precisely n elements. In such a bucket, the number of elements checked when looking for k goes from 1 (k is the first element) to n (k is the last), and is thus on average (1 + ··· + n)/n = (n+1)/2. From this we can calculate in the general case the average number of elements checked as
Σn≥0 ((n+1)/2)·B(n) =
= Σn≥0 ((n+1)/2)·n·P_F(n)/F = (1/2F)·Σn≥0 (n² + n)·P_F(n) =
= (1/2F)·(E[S] + E[S²]) = (1/2F)·(F + F² + F) = 1 + F/2,
where we have used the fact that the second moment of a Poisson distribution with mean F is F² + F.
We are now ready to study the case where the hash table holds M groups of duplicate elements, each group with average size G.
Key is not present. There are no differences with respect to the non-duplicate case: an entire bucket, whose average size is again F, is traversed when looking up the non-present key.
Key is present. In terms of distribution of elements, the table is equivalent to another one without duplicates where each group of equal elements is replaced by just one representative. This theoretical table has a load factor F' = F/G and an associated average lookup time 1 + F'/2. Now, searching in our original table incurs the same number of operations, except that all the individual checks before getting to k now turn into checks of an average of G equivalent elements. So, the total number of steps is
1 + GF'/2 = 1 + F/2,
exactly as in the non-duplicate case. How come we see no degradation despite elements being heavily concentrated in fewer buckets? The reason is that the distribution of groups across the table is governed by an equivalent load factor F' = F/G, which is smaller than F by a factor of G, and the probability that two or more groups of elements end up in the same bucket is consequently greatly reduced.
The hash table structures used by some implementations of C++ unordered associative containers, such as Boost.Unordered, allow for skipping of groups of equivalent elements. For these special hash tables, the average number of lookup steps is reduced to F/G if the key is not present, and 1 + F/(2G) if it is.


  1. Hi Joaquín,

    I read this fascinating series a few months ago and should have congratulated you on it then.

    I was reminded of it today when I read about the changed HashMap behaviour in Java 8.

    So, in the case of heavily-occupied buckets, Java 8 is putting an order on the bucket (assuming the key supports it) and getting O(log(bucket size)). Is this something you looked at? (Sorry if I missed it)


  2. Hi Pete,

    Please note that the purpose of adding bucket sorting is not to increase performance in the general case (where hashing is working OK and buckets are consequently sparsely populated) but to defend against malicious DoS attacks via injection of values crafted to defeat hashing.

    The scenario described in the article, where we consider duplicate values, might benefit from bucket ordering, but group skipping is going to be faster: typically only ~2 nodes are checked regardless of duplicate group sizes.

  3. Fascinating. I too happened to work on this problem a few months ago! (After comparing with yours, it turns out I was off by a factor of F!)

    I'd just like to point out that you can use the binomial distribution directly (rather than approximating with a Poisson) because the moments work out to the same thing.

  4. Hi Mute Invert,

    According to Wolfram MathWorld, the second moment of a binomial distribution with parameters N and p (= 1/B in our case) is

    Np(1 - p + Np) = F + F^2 - F/B,

    so the average number of elements checked is smaller than in the Poisson case by an offset of 1/(2B).