Given two strings s, s', determining whether s < s' under a lexicographical order takes a number of steps 1 ≤ C ≤ 1 + min{len(s),len(s')}, but the actual statistical distribution of C depends of course on those of the strings being compared: if they tend to share the same prefix, for instance, C will be typically larger than in the case where the strings are more randomly generated.
Let S be a probability distribution over A*, the set of strings composable with symbols from a finite alphabet A with N symbols. A convenient way to describe S is by way of Markov chains:
Strings are generated from the empty prefix Λ by the addition of new symbols from A, termination being conventionally marked by transitioning to some symbol 0 not in A. Each arrow has an associated transition probability T(p,q), p ∈ A*, q ∈ A ∪ {0}. The probability that a string has prefix p = p1p2...pn is then
P(p) = P(Λ)T(Λ, p1)T(p1, p2)T(p1p2, p3) ··· T(p1p2···pn-1, pn),
with P(Λ) = 1. Now, comparing two strings independently extracted from A* is equivalent to analyzing their transition paths until the first divergence is found. If we denote by C(n) the probability that such comparison takes exactly n steps it is easy to see that
C(n) =Σp ∈ An-1 P2(p)(1 - Σq ∈ A T2(p,q)),
which corresponds to the probability that the strings coincide up to the (n-1)-th symbol and then either differ on the n-th one or both terminate (that is, they don't transition to the same non-terminating symbol).
A particularly simple family of distributions for S are those where the lenghts of strings are governed by a random variable L with cumulative distribution function L(n) = Pr(L ≤ n) and non-terminating symbols occur equally likely, that is, for p ∈ An, q ∈ A we have:
T(p,0) = (L(n) - L(n - 1))/(1- L(n - 1)),
T(p,q) = (1 - T(p,0))/N = (1/N)(1 - L(n))/(1- L(n - 1)),
P(p) = Πi = 0,...,n-1 (1/N)(1 - L(i))/(1- L(i - 1)) = (1/N)n (1 - L(n - 1)),
which givesT(p,q) = (1 - T(p,0))/N = (1/N)(1 - L(n))/(1- L(n - 1)),
P(p) = Πi = 0,...,n-1 (1/N)(1 - L(i))/(1- L(i - 1)) = (1/N)n (1 - L(n - 1)),
C(n) = Nn-1·(1/N)2(n-1)(1- L(n - 2))2(1 - N·(1/N)2(1 - L(n - 1))2/(1- L(n - 2))2) =
= (1/N)n-1(1- L(n - 2))2 - (1/N)n(1- L(n - 1))2,
= (1/N)n-1(1- L(n - 2))2 - (1/N)n(1- L(n - 1))2,
or
C(n) = D(n - 1) - D(n),
D(n) = (1/N)n(1- L(n - 1))2,
D(n) = (1/N)n(1- L(n - 1))2,
resulting in an average number of comparison steps
E[C] = Σn > 0 nC(n) = Σn > 0 n(D(n - 1) - D(n)) =
= Σn ≥ 0 D(n) = Σn ≥ 0 (1/N)n(1- L(n - 1))2.
= Σn ≥ 0 D(n) = Σn ≥ 0 (1/N)n(1- L(n - 1))2.
When N ≥ 2, E[C] is dominated by the first few values of C(n); if L(n -1) ≈ 0 for those values (i.e. strings are typically larger) then
E[C] ≈ Σn ≥ 0 (1/N)n = N/(N - 1).
This analysis rests on the assumption that lexicographical comparison is done between independent outcomes of S, but there are scenarios such as binary search where this is not the case at all: we will see in a later entry how this impacts complexity.
I guess you meant
ReplyDelete"if L(n -1) ≈ 0 for those values"
insted of
"if L(n -1) ≈ 1 for those values"
Fixed, thank you!
ReplyDelete