*S*

_{N,L}the equiprobable sample space of strings of length

*L*≥ 1 over an alphabet

*A*of

*N*≥ 2 symbols. For instance,

*S*

_{2,3}with

*A*= {a, b} (which particular symbols

*A*consists of is immaterial to our discussion) runs over the strings

aa, ab, ba, bb,

each with probability 1/4. According to the representation model introduced in an earlier entry,

*S*_{N,L}is characterized by*L*(

*n*) = 0,

*n*<

*L*,

*L*(

*n*) = 1,

*n*≥

*L*,

*T*(

*p*,

*q*) = (1 -

*T*(

*p*,

**0**))/

*N*,

*p*∈

*A**,

and the average number of steps it takes to lexicographically compare two independent outcomes of

*S*, which we proved for the general case to be
E[

*C*] = Σ_{n ≥ 0}(1/*N*)^{n}(1-*L*(*n*- 1))^{2},
reduces here to

E[

*C*_{N,L}] = Σ_{0 ≤ n ≤ L}(1/*N*)^{n}= (*N*^{L+1}- 1)/(*N*^{L+1}-*N*^{L}),
tending to

*N*/(*N*- 1) as*L*→ ∞. The figure shows E[*C*_{N,L}] as a function of*L*for various values of*N*.E[C_{N,L}] as a function of L. |

But lexicographical comparison does not perform so well in other, very common scenarios. Suppose we form a sequence

*s*=(*s*_{1}, ...,*s*_{NL}) with the sorted values of*S*_{N,L}and perform a binary search on it of some*s*extracted at random:_{i}
bs(

*s*,_{i}*s*).
This operation does approximately

*L·*log_{2}*N*comparisons between strings in*s*. We want now to calculate the average number of steps (i.e. symbols checked) these comparisons take, which we denote by E[*C'*_{N,L}]. A simple C++ program exercising std::lower_bound helps us obtain the figures:E[C'_{N,L}] as a function of L. |

E[

*C'*_{N,L}] is indeed different to E[*C*_{N,L}] and in fact grows linearly with the length of the strings in*s*. The reason is that the algorithm for searching*s*iteratively touches on strings more and more similar to_{i}*s*, that is, sharing an increasingly longer common prefix with_{i}*s*, which imposes a penalty on lexicographical comparison. We can make a crude estimation analysis of E[_{i}*C'*_{N,L}]: as each step of binary search gains one extra bit of information, the common prefix grows by 1/(log_{2}*N*) symbols per step, yielding an average common prefix length per comparison of
(Σ

_{1 ≤ n ≤ (L·log2N ) - 1}*n*/(log_{2}*N*))/(*L·*log_{2}*N*) = (1/2)(*L*- 1/(log_{2}*N*))
to be added an additional term 1 ≤

*c*<*N*/(*N*- 1) accounting for the comparison of the remaining string suffixes.
Lexicographical comparison as used in binary searching is then a rather inefficient O(length of strings). We will see in a later entry how to improve complexity by incorporating contextual information to the execution of the search algorithm.

## No comments :

Post a Comment