Given two strings

*s*,*s'*, determining whether*s*<*s'*under a lexicographical order takes a number of steps 1 ≤*C*≤ 1 + min{len(*s*),len(*s'*)}, but the actual statistical distribution of*C*depends of course on those of the strings being compared: if they tend to share the same prefix, for instance,*C*will be typically larger than in the case where the strings are more randomly generated.
Let

*S*be a probability distribution over*A**, the set of strings composable with symbols from a finite alphabet*A*with*N*symbols. A convenient way to describe*S*is by way of Markov chains:
Strings are generated from the empty prefix Λ by the addition of new symbols from

*A*, termination being conventionally marked by transitioning to some symbol**0**not in*A*. Each arrow has an associated transition probability*T*(*p*,*q*),*p*∈*A**,*q*∈*A*∪ {**0**}. The probability that a string has prefix*p*=*p*_{1}*p*_{2}...*p*is then_{n}*P*(

*p*) =

*(Λ)*

*P**T*(Λ,

*p*

_{1})

*T*(

*p*

_{1},

*p*

_{2})

*T*(

*p*

_{1}

*p*

_{2},

*p*

_{3}) ···

*T*(

*p*

_{1}

*p*

_{2}

*···p*

_{n-1},

*p*

_{n}),

with

*(Λ) = 1. Now, comparing two strings independently extracted from**P**A**is equivalent to analyzing their transition paths until the first divergence is found. If we denote by*C*(*n*) the probability that such comparison takes exactly*n*steps it is easy to see that*C*(

*n*) =Σ

_{p }

_{∈ An-1}

*P*

^{2}(

*p*)(1 - Σ

_{q }

_{∈ A}

*(*

*T*^{2}*p*,

*q*)),

which corresponds to the probability that the strings coincide up to the (

A particularly simple family of distributions for *n*-1)-th symbol and then either differ on the*n*-th one or both terminate (that is, they don't transition to the same non-terminating symbol).*S*are those where the lenghts of strings are governed by a random variable

*L*with cumulative distribution function

*L*(

*n*) = Pr(

*L*≤

*n*) and non-terminating symbols occur equally likely, that is, for

*p*∈

*,*

*A*^{n}*q*∈

*A*we have:

*T*(

*p*,

**0**) = (

*L*(

*n*) -

*L*(

*n*- 1))/(1-

*L*(

*n*- 1)),

*T*(

*p*,

*q*) = (1 -

*T*(

*p*,

**0**))/

*N*= (1/

*N*)(1 -

*L*(

*n*))/(1-

*L*(

*n*- 1)),

*P*(

*p*) = Π

_{i = 0,...,n-1}(1/

*N*)(1 -

*L*(

*i*))/(1-

*L*(

*i*- 1)) = (1/

*N*)

^{n}(1 -

*L*(

*n*- 1)),

*C*(

*n*) =

*N*

^{n-1}·(1/

*N*)

^{2(n-1)}(1-

*L*(

*n*- 2))

^{2}(1 -

*N*·(1/

*N*)

^{2}(1 -

*L*(

*n*- 1))

^{2}/(1-

*L*(

*n*- 2))

^{2}) =

= (1/

*N*)

^{n-1}(1-

*L*(

*n*- 2))

^{2}- (1/

*N*)

^{n}(1-

*L*(

*n*- 1))

^{2},

or

*C*(

*n*) =

*D*(

*n*- 1) -

*D*(

*n*),

*D*(

*n*) = (1/

*N*)

^{n}(1-

*L*(

*n*- 1))

^{2},

resulting in an average number of comparison steps

E[

= Σ

*C*] = Σ_{n > 0}*nC*(*n*) = Σ_{n > 0}*n*(*D*(*n*- 1) -*D*(*n*)) == Σ

_{n ≥ 0}*D*(*n*) = Σ_{n ≥ 0}(1/*N*)^{n}(1-*L*(*n*- 1))^{2}.
When

*N*≥ 2, E[*C*] is dominated by the first few values of*C*(*n*); if*L*(*n*-1) ≈ 0 for those values (i.e. strings are typically larger) then
E[

*C*] ≈ Σ_{n }_{≥ 0}(1/*N*)^{n}=*N*/(*N*- 1).
This analysis rests on the assumption that lexicographical comparison is done between independent outcomes of

*S*, but there are scenarios such as binary search where this is not the case at all: we will see in a later entry how this impacts complexity.
I guess you meant

ReplyDelete"if L(n -1) ≈ 0 for those values"

insted of

"if L(n -1) ≈ 1 for those values"

Fixed, thank you!

ReplyDelete