Bannalia: trivial notes on themes diverse: Complexity of lexicographical comparison

Given two strings s, s', determining whether s < s' under a lexicographical order takes a number of steps 1 ≤ C ≤ 1 + min{len(s),len(s')}, but the actual statistical distribution of C depends of course on those of the strings being compared: if they tend to share the same prefix, for instance, C will be typically larger than in the case where the strings are more randomly generated.
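As a rough illustration of how C is counted, the following Python sketch charges one step per position examined and treats end-of-string as one extra terminating symbol. The function name `comparison_steps` is hypothetical, chosen here for illustration only.

```python
def comparison_steps(s: str, t: str) -> int:
    """Number of positions examined to decide how s and t compare.

    End-of-string counts as a terminating symbol, so the result always
    lies between 1 and 1 + min(len(s), len(t)).
    """
    steps = 0
    for a, b in zip(s, t):
        steps += 1
        if a != b:              # first divergence found
            return steps
    # Common prefix exhausted: one more step inspects the terminator(s).
    return steps + 1


print(comparison_steps("abc", "xbc"))   # diverge on the first symbol
print(comparison_steps("abc", "abd"))   # diverge on the last symbol
print(comparison_steps("ab", "abc"))    # one string is a prefix of the other
```

Under this convention the step count is simply one plus the length of the common prefix, which is why strings sharing long prefixes are more expensive to compare.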

Let S be a probability distribution over A*, the set of strings composable with symbols from a finite alphabet A with N symbols. A convenient way to describe S is by way of Markov chains:

Strings are generated from the empty prefix Λ by the addition of new symbols from A, termination being conventionally marked by transitioning to some symbol 0 not in A. Each arrow has an associated transition probability T(p,q), p ∈ A*, q ∈ A ∪ {0}. The probability that a string has prefix p = p₁p₂...pₙ is then

P(p) = P(Λ)T(Λ, p₁)T(p₁, p₂)T(p₁p₂, p₃) ··· T(p₁p₂···pₙ₋₁, pₙ),

with P(Λ) = 1. Now, comparing two strings independently drawn from S is equivalent to following their transition paths until the first divergence is found. If we denote by C(n) the probability that such a comparison takes exactly n steps, it is easy to see that

C(n) = Σ_{p ∈ Aⁿ⁻¹} P²(p)(1 - Σ_{q ∈ A} T²(p,q)),

which corresponds to the probability that the strings coincide up to the (n-1)-th symbol and then either differ on the n-th one or both terminate (that is, they don't transition to the same non-terminating symbol).
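The formula for C(n) can be checked by brute force on a small alphabet. The sketch below assumes the transition probabilities are supplied as a Python function `T(p, q)` taking a prefix (a tuple of symbols) and a non-terminating symbol; `T_example` is a made-up memoryless chain used only for illustration.

```python
from itertools import product


def prefix_probability(T, p):
    """P(p): product of the transition probabilities along the prefix p."""
    prob = 1.0
    for i, symbol in enumerate(p):
        prob *= T(p[:i], symbol)
    return prob


def steps_probability(T, alphabet, n):
    """C(n): both strings share a prefix of length n - 1 and then do not
    transition to the same non-terminating symbol."""
    total = 0.0
    for p in product(alphabet, repeat=n - 1):
        same_next = sum(T(p, q) ** 2 for q in alphabet)
        total += prefix_probability(T, p) ** 2 * (1.0 - same_next)
    return total


def T_example(p, q):
    """Hypothetical chain: 'a' with 0.3, 'b' with 0.2, terminate with 0.5,
    regardless of the prefix p."""
    return {"a": 0.3, "b": 0.2}[q]


for n in range(1, 5):
    print(n, steps_probability(T_example, "ab", n))
```

The enumeration visits every prefix of length n - 1, so it is only practical for small N and n; its purpose is to make the two sums in the formula explicit.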

A particularly simple family of distributions for S is the one where the lengths of strings are governed by a random variable L with cumulative distribution function L(n) = Pr(L ≤ n) (so that L(n) = 0 for n < 0) and non-terminating symbols occur equally likely, that is, for p ∈ Aⁿ, q ∈ A we have:

T(p,0) = (L(n) - L(n - 1))/(1- L(n - 1)),
T(p,q) = (1 - T(p,0))/N = (1/N)(1 - L(n))/(1- L(n - 1)),
P(p) = Π_{i = 0,...,n-1} (1/N)(1 - L(i))/(1- L(i - 1)) = (1/N)ⁿ (1 - L(n - 1)),

which gives

C(n) = Nⁿ⁻¹·(1/N)²⁽ⁿ⁻¹⁾(1 - L(n - 2))²(1 - N·(1/N)²(1 - L(n - 1))²/(1 - L(n - 2))²) =
= (1/N)ⁿ⁻¹(1 - L(n - 2))² - (1/N)ⁿ(1 - L(n - 1))²,

C(n) = D(n - 1) - D(n),
D(n) = (1/N)ⁿ(1- L(n - 1))²,

resulting in an average number of comparison steps

E[C] = Σ_{n > 0} nC(n) = Σ_{n > 0} n(D(n - 1) - D(n)) =
= Σ_{n ≥ 0} D(n) = Σ_{n ≥ 0} (1/N)ⁿ(1- L(n - 1))².
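The closed forms above lend themselves to a quick numerical sanity check. The sketch below assumes, purely as an example, a geometric length law Pr(L = k) = (1 - r)rᵏ; all function names are illustrative and not part of any library.

```python
def geometric_cdf(r):
    """L(n) for a hypothetical length law Pr(L = k) = (1 - r) r^k, k >= 0."""
    def cdf(n):
        return 1.0 - r ** (n + 1) if n >= 0 else 0.0
    return cdf


def T_end(cdf, n):
    """T(p, 0) for a prefix p of length n."""
    return (cdf(n) - cdf(n - 1)) / (1.0 - cdf(n - 1))


def T_symbol(cdf, N, n):
    """T(p, q) for a prefix p of length n and any q in A."""
    return (1.0 - T_end(cdf, n)) / N


def prefix_probability(cdf, N, n):
    """P(p) for a prefix of length n, multiplied out transition by transition."""
    prob = 1.0
    for i in range(n):
        prob *= T_symbol(cdf, N, i)
    return prob


def D(cdf, N, n):
    """D(n) = (1/N)^n (1 - L(n - 1))^2."""
    return (1.0 / N) ** n * (1.0 - cdf(n - 1)) ** 2


def C(cdf, N, n):
    """C(n) = D(n - 1) - D(n), the probability of exactly n steps."""
    return D(cdf, N, n - 1) - D(cdf, N, n)


def expected_steps(cdf, N, terms=200):
    """E[C] as the truncated series sum of D(n) over n >= 0."""
    return sum(D(cdf, N, n) for n in range(terms))


cdf = geometric_cdf(0.5)    # assumed example: mean string length 1
N = 2
# Product of transitions versus the closed form (1/N)^n (1 - L(n - 1)).
print(prefix_probability(cdf, N, 3), (1 / N) ** 3 * (1 - cdf(2)))
# Direct definition of E[C] versus the telescoped sum of D(n).
print(sum(n * C(cdf, N, n) for n in range(1, 200)), expected_steps(cdf, N))
```

For this particular choice D(n) = (r²/N)ⁿ, so both expressions for E[C] should agree with 1/(1 - r²/N), which is 8/7 for r = 0.5 and N = 2.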

When N ≥ 2, E[C] is dominated by the first few terms of the sum; if L(n - 1) ≈ 0 for those values of n (i.e. strings are typically longer than that), then

E[C] ≈ Σ_{n ≥ 0} (1/N)ⁿ = N/(N - 1).
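The approximation N/(N - 1) can also be probed empirically. The following sketch is a small seeded Monte Carlo experiment under assumptions of our own choosing: fixed-length strings of uniformly random symbols, long enough that termination rarely matters. The names `steps` and `average_steps` are hypothetical.

```python
import random


def steps(s, t):
    """One plus the length of the common prefix of the sequences s and t."""
    n = 0
    while n < len(s) and n < len(t) and s[n] == t[n]:
        n += 1
    return n + 1


def average_steps(N, length, trials, seed=0):
    """Mean number of comparison steps over independent random pairs of
    sequences with symbols drawn uniformly from range(N)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = [rng.randrange(N) for _ in range(length)]
        t = [rng.randrange(N) for _ in range(length)]
        total += steps(s, t)
    return total / trials


for N in (2, 4, 26):
    print(N, average_steps(N, length=20, trials=20000), N / (N - 1))
```

The empirical averages should land close to N/(N - 1), with the gap shrinking as the number of trials grows; for short strings or heavily skewed symbol frequencies the approximation is not expected to hold.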

This analysis rests on the assumption that lexicographical comparison is done between independent outcomes of S, but there are scenarios such as binary search where this is not the case at all: we will see in a later entry how this impacts complexity.


Friday, March 28, 2014
