Saturday, April 5, 2014

Climbing down the tree

Within the execution of binary search through a range of strings, the algorithmic complexity of lexicographical comparison, which is almost independent of the length of the strings in the general case, turns out here to degrade to O(length of strings). We want to devise a mechanism to restore the efficiency of lexicographical comparison by introducing some contextual information on the binary search execution back into the comparison routine.
Binary search can be regarded as the process of climbing down a binary tree whose nodes are the elements of the sorted range being searched through:
Say we are looking for a given value x and denote by x1, x2, ... the sequence of elements being compared against x during the process. The following is a fundamental property of partition-based algorithms such as binary search:
Property. For each xi visited during the process
  • if x < xi then xj < xi for all j > i,
  • if x > xi then xj > xi for all j > i.
So, searching zeroes in on x by fencing it into a narrowing interval [li, ui] ("l" and "u" stand for lower and upper bound, respectively) that is implicitly calculated as
  • [l0, u0] = [smallest string, largest string],
  • if x < xi then [li, ui] = [li-1, si],
  • if x > xi then [li, ui] = [si, ui-1].
The key observation here in connection with lexicographical comparison is: if, at stage i, li and ui are strings sharing a common prefix pi, then all subsequent comparisons will be done against strings beginning with this same prefix. An intelligent exploitation of this fact spares us then the need to check the prefix so that we can start the comparison at mid-string position.
Let us put the idea in C++. The canonical implementation of std::lower_bound looks like this:
template<typename ForwardIterator,typename T,typename Compare>
ForwardIterator lower_bound(
  ForwardIterator first,ForwardIterator last,
  const T& x,Compare comp)
{
  ForwardIterator it;
  typename std::iterator_traits<
    ForwardIterator>::difference_type count,step;
  count=std::distance(first,last);
  while(count>0){
    it=first;
    step=count/2;
    std::advance(it,step);
    if(comp(*it,x)){
      first=++it;
      count-=step+1;
    }
    else count=step;
  }
  return first;
}
where comp is a functor comparing T values for inequality (typically std::less<T>). This interface does not serve our purposes because, when invoking comp(x,y), we are not indicating which argument is the element being searched and which the range element compared against. So let us tweak the code a bit to make this information explicit:
{
  ...
  auto cbind=comp_bind(comp,x);
  while(count>0){
    it=first;
    step=count/2;
    std::advance(it,step);
    if(cbind.less(*it)){
      first=++it;
      count-=step+1;
    }
    else count=step;
  }
  return first;
}
We will not delve now into the implementation of the framework comp_bind relies on: suffice it to say that it passes x to a comparison object with the following interface: 
template<typename Comp,typename Value>
struct comp_binder
{
  comp_binder(const Comp& cmp,const Value& x);

  bool less(const Value& y);    // ~ cmp(y,x)
  bool greater(const Value& y); // ~ cmp(x,y)
};
This is then a mere change of interface with respect to that of std::less, but one that allows us to keep relevant contextual information on the execution of lower_bound within the comparison object itself, information that an optimized lexicographical comparison object can take advantage of:
template<typename T>
struct comp_binder<prefix_less<T>,T>
{
  comp_binder(const prefix_less<T>&,const T& x):
    x(x),pref_left(0),pref_right(0)
  {}

  bool less(const T& y)
  {
    auto n=std::min(pref_left,pref_right);
    auto m=std::min(x.size(),y.size());
    auto it1=x.begin()+n,it2=y.begin()+n;
    for(;n!=m&&*it1==*it2;++n,++it1,++it2);
    return (n!=m?*it2<*it1:n<x.size())?
      (pref_left=n,true):
      (pref_right=n,false);
  }

  bool greater(const T& y); // similar to less

  const T&              x;
  typename T::size_type pref_left,pref_right;
};
The object keeps in pref_left the length of the common prefix between the latest lower bounding element and x, and similarly with pref_right for upper bounding elements. By the fundamental properties of partition based algorithms, it is guaranteed that further comparisons will be done only against strings sharing with x a common prefix of length ≥ min{pref_left, pref_right}, a fact we can use to shorten the checking loop. 
I have put together an instrumented version of this scheme into a test program that calculates the resulting complexity of optimized lexicographical comparison for the sorted range of the NL strings of length L constructible with an alphabet of N symbols, denoted in what follows by E[CoptN,L]:
E[CoptN,L] as a function of L.
Remember that E[CN,L] is the average number of symbols checked when strings are randomly extracted from the range and E[C'N,L] is the corresponding value in the case that comparison happens within binary search. We have then
E[CN,L] ≤ E[CoptN,L] ≤ E[C'N,L],
E[CoptN,L] is bounded as N, L → ∞
(we won't prove this); so, we have restored O(1) complexity for lexicographical comparison, though the figures are still around two times higher than in the "free-floating" scenario.
In a later entry we will elaborate on the definition of comp_bind to support a backwards-compatible framework for less-based standard binary search algorithms and will also measure actual execution times with and without this optimization.

No comments :