Bannalia: trivial notes on themes diverse: Complexity of lexicographical comparison: part II

Let S_N,L be the equiprobable sample space of strings of length L ≥ 1 over an alphabet A of N ≥ 2 symbols. For instance, S_2,2 with A = {a, b} (which particular symbols A consists of is immaterial to our discussion) runs over the strings

aa, ab, ba, bb,

each with probability 1/4. According to the representation model introduced in an earlier entry, S_N,L is characterized by

L(n) = 0, n < L,
L(n) = 1, n ≥ L,
T(p,q) = (1 - T(p,0))/N, p ∈ A*,

and the average number of steps it takes to lexicographically compare two independent outcomes of S, which we proved for the general case to be

E[C] = Σ_{n ≥ 0} (1/N)ⁿ(1- L(n - 1))²,

reduces here to

E[C_N,L] = Σ_{0 ≤ n ≤ L} (1/N)ⁿ = (N^(L+1) - 1)/(N^(L+1) - N^L),

tending to N/(N - 1) as L → ∞. The figure shows E[C_N,L] as a function of L for various values of N.

E[C_N,L] as a function of L.
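For small N and L the closed form can be checked by brute force. The following is a minimal sketch, assuming that comparing two strings of length L takes as many steps as the length of their common prefix plus one (the mismatching symbol, or the end-of-string check when both strings are equal); the function names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// All N^L strings of length L over the first N lowercase letters,
// generated in lexicographical order (hypothetical helper).
std::vector<std::string> all_strings(int N, int L)
{
  assert(N >= 2 && N <= 26 && L >= 1);
  std::vector<std::string> res(1); // start from the empty prefix
  for (int i = 0; i < L; ++i) {
    std::vector<std::string> next;
    for (const auto& p : res)
      for (int c = 0; c < N; ++c) next.push_back(p + char('a' + c));
    res.swap(next);
  }
  return res;
}

// Assumed cost model: common prefix length plus one.
// Both strings are expected to have the same length.
std::size_t steps(const std::string& x, const std::string& y)
{
  std::size_t n = 0;
  while (n < x.size() && x[n] == y[n]) ++n;
  return n + 1;
}

// Average of steps(x, y) over all N^L * N^L equiprobable pairs.
double average_steps_exhaustive(int N, int L)
{
  const auto s = all_strings(N, L);
  double total = 0;
  for (const auto& x : s)
    for (const auto& y : s) total += double(steps(x, y));
  return total / (double(s.size()) * double(s.size()));
}

// Closed form (N^(L+1) - 1)/(N^(L+1) - N^L).
double average_steps_closed_form(int N, int L)
{
  const double nl = std::pow(N, L);
  return (N * nl - 1) / (N * nl - nl);
}
```

Both functions should agree up to rounding error; for N = 2 and L = 3 the closed form evaluates to 15/8.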

But lexicographical comparison does not perform so well in other, very common scenarios. Suppose we form a sequence s = (s₁, ..., s_(N^L)) with the sorted values of S_N,L and perform a binary search on it for some s_i extracted at random:

bs(s_i, s).

This operation performs approximately log₂(N^L) = L·log₂ N comparisons between strings in s. We now want to calculate the average number of steps (i.e. symbols checked) these comparisons take, which we denote by E[C'_N,L]. A simple C++ program exercising std::lower_bound helps us obtain the figures:

E[C'_N,L] as a function of L.
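A minimal sketch of how such a measurement could be set up follows. It is not the program used for the figure above: the names all_strings, search_cost and measure are hypothetical, every s_i is searched once instead of sampling at random (which yields the same average under equiprobability), and a comparison is assumed to cost the common prefix length plus one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// All N^L strings of length L over the first N lowercase letters,
// generated in lexicographical order (hypothetical helper).
std::vector<std::string> all_strings(int N, int L)
{
  assert(N >= 2 && N <= 26 && L >= 1);
  std::vector<std::string> res(1); // start from the empty prefix
  for (int i = 0; i < L; ++i) {
    std::vector<std::string> next;
    for (const auto& p : res)
      for (int c = 0; c < N; ++c) next.push_back(p + char('a' + c));
    res.swap(next);
  }
  return res;
}

struct search_cost
{
  double comparisons_per_search; // string comparisons per std::lower_bound call
  double steps_per_comparison;   // symbols checked per string comparison
};

search_cost measure(int N, int L)
{
  const auto s = all_strings(N, L);
  std::size_t comparisons = 0, steps = 0;

  // Lexicographical less-than that records its own cost.
  auto less = [&](const std::string& x, const std::string& y) {
    ++comparisons;
    std::size_t n = 0;
    while (n < x.size() && n < y.size() && x[n] == y[n]) ++n;
    steps += n + 1;                   // assumed cost model
    if (n == y.size()) return false;  // y exhausted: x is not less than y
    if (n == x.size()) return true;   // x is a proper prefix of y
    return x[n] < y[n];
  };

  for (const auto& v : s) {
    auto it = std::lower_bound(s.begin(), s.end(), v, less);
    assert(it != s.end() && *it == v); // every searched value is present
    (void)it;
  }
  return {double(comparisons) / double(s.size()),
          double(steps) / double(comparisons)};
}
```

Calling measure(N, L) for increasing L and fixed N should show steps_per_comparison growing with L, while comparisons_per_search stays close to L·log₂ N.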

E[C'_N,L] is indeed different from E[C_N,L] and in fact grows linearly with the length of the strings in s. The reason is that the algorithm for searching s_i iteratively touches on strings more and more similar to s_i, that is, sharing an increasingly longer common prefix with s_i, which imposes a penalty on lexicographical comparison. We can make a crude estimate of E[C'_N,L]: as each step of binary search gains one extra bit of information, the common prefix grows by 1/(log₂ N) symbols per step, yielding an average common prefix length per comparison of

(Σ_{1 ≤ n ≤ L·log₂ N - 1} n/(log₂ N))/(L·log₂ N) = (1/2)(L - 1/(log₂ N)),

to which an additional term 1 ≤ c < N/(N - 1) must be added, accounting for the comparison of the remaining string suffixes.
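The closed form of the sum follows from the arithmetic series. As a sketch, write M = L·log₂ N for the approximate number of comparisons (M is a symbol introduced here only for brevity, and is treated as an integer):

```latex
\frac{1}{M}\sum_{n=1}^{M-1}\frac{n}{\log_2 N}
  = \frac{1}{M\log_2 N}\cdot\frac{M(M-1)}{2}
  = \frac{M-1}{2\log_2 N}
  = \frac{1}{2}\left(L-\frac{1}{\log_2 N}\right).
```

For N = 2 and L = 8, for instance, this gives an average common prefix length of (1/2)(8 - 1) = 3.5 symbols per comparison, before adding c.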

When used in binary search, lexicographical comparison is thus a rather inefficient O(L) operation, where L is the length of the strings. We will see in a later entry how to improve complexity by incorporating contextual information into the execution of the search algorithm.

Wednesday, April 2, 2014
