## Wednesday, April 2, 2014

### Complexity of lexicographical comparison: part II

Let SN,L the equiprobable sample space of strings of length L 1 over an alphabet A of N ≥ 2 symbols. For instance, S2,3 with A = {a, b} (which particular symbols A consists of is immaterial to our discussion) runs over the strings
aa, ab, ba, bb,
each with probability 1/4. According to the representation model introduced in an earlier entry, SN,L is characterized by
L(n) = 0, n < L,
L(n) = 1, n L,
T(p,q) = (1 - T(p,0))/N, pA*,
and the average number of steps it takes to lexicographically compare two independent outcomes of S, which we proved for the general case to be
E[C] =  Σn ≥ 0 (1/N)n(1- L(n - 1))2,
reduces here to
E[CN,L] = Σ0 ≤ nL (1/N)n = (NL+1 - 1)/(NL+1 - NL),
tending to N/(N - 1) as L → ∞. The figure shows E[CN,L] as a function of L for various values of N.
 E[CN,L] as a function of L.
But lexicographical comparison does not perform so well in other, very common scenarios. Suppose we form a sequence s =(s1, ..., sNL) with the sorted values of SN,L and perform a binary search on it of some si extracted at random:
bs(si, s).
This operation does approximately log2 N comparisons between strings in s. We want now to calculate the average number of steps (i.e. symbols checked) these comparisons take, which we denote by E[C'N,L]. A simple C++ program exercising std::lower_bound helps us obtain the figures:
 E[C'N,L] as a function of L.
E[C'N,L] is indeed different to E[CN,L] and in fact grows linearly with the length of the strings in s. The reason is that the algorithm for searching si iteratively touches on strings more and more similar to si, that is, sharing an increasingly longer common prefix with si, which imposes a penalty on lexicographical comparison. We can make a crude estimation analysis of E[C'N,L]: as each step of binary search gains one extra bit of information, the common prefix grows by 1/(log2 N) symbols per step, yielding an average common prefix length per comparison of
1 ≤ n ≤ (log2N ) - 1 n/(log2 N))/(log2 N) =  (1/2)(L - 1/(log2 N))
to be added an additional term 1 ≤ c < N/(N - 1) accounting for the comparison of the remaining string suffixes.
Lexicographical comparison as used in binary searching is then a rather inefficient O(length of strings). We will see in a later entry how to improve complexity by incorporating contextual information to the execution of the search algorithm.