Saturday, May 31, 2014

Indiscernible properties

Max Black's argument against the principle of identity of indiscernibles (PII for short) contends that one can conceive of a plurality of objects with exactly the same observable properties (for some notion of "observable" and "property"), and proposes the simple example of a universe exclusively composed of two exactly similar spheres some distance apart from each other: in this perfectly symmetrical setup there is no way to tell one sphere from the other except by arbitrarily picking one, a selection that cannot be based on their observable properties; so, the spheres are different (there are two of them) but all their salient properties are the same, thus breaking PII.
Let us simplify the two-sphere example even further, down to a universe with two points without any intrinsic physical property except their being at some distance from each other. In the 2P universe, formulated as a theory in second-order logic, there are no primitive properties (unary relations) to speak of and only one primitive binary relation C, defined as
C(x, y) := the distance from x to y is zero
(C stands for "colocated") satisfying the following axioms:
∀x C(x, x) (each point is colocated with itself),
∀x ∃y ¬C(x, y) (for each point there is another one not colocated with it).
Now, if we accept the axiom scheme of comprehension, we have
∀x ∃R ∀y (R(y) ↔ C(x, y)),
which, in combination with the axioms for C, implies
∀x ∃R ∃y (R(x) ∧ ¬R(y)),
that is, for each point x there is a property (namely, that of being colocated with x) that x has and some other point y (which, in 2P, is tantamount to saying any other point) does not have. This seems to reinstate PII.
To this reasoning Black could have retorted in (at least) two different ways:
  • "Being colocated with x" is just a slightly disguised form of "being identical with x", which is trivially true of x itself and trivially false of any other entity, adding nothing to our initial assumption that 2P has a population of two.
  • The reasoning involves two ill-defined properties, "being colocated with a" and "being colocated with b", where a and b are one point and the other, or the other way around: without any discernible feature to use, there is no way we can select a particular point out of the two we have in 2P.
The first objection we can easily dispose of: "being colocated with x" is a bound version (via comprehension) of C, which is a perfectly discernible, physical property of pairs of points —if I'm given two points, identical or not, I can certainly inspect whether they are colocated. The second counterargument is more interesting and, in my opinion, allows us to provide a clearer formulation of Black's thesis. The objection could be rephrased as: for the same reasons that naming the two points "a" and "b" requires that we can previously tell one point from the other, their univocally associated properties Ra and Rb are also indiscernible, even if they are different. We can formalize indiscernibility in the following way: 
Definition. A second-order theory T (with or without equality) is said to have indiscernible entities if there exists a formula φ with two free variables such that 
∃x ∃y (φ(x, x) ∧ φ(y, y) ∧ ¬φ(x, y))   (1)
is a theorem of T and, for each model M of T and a, b ∈ M satisfying (1), the structure M' obtained by swapping a and b in all the relations of M is also a model of T.
If the set of primitive relations of T is finite, indiscernibility can be expressed as a statement in T itself (sketch of proof: augment (1) with a conjunction of terms expressing swappability of x and y with respect to each position of each primitive relation of T). If T has equality, the statement
∃x ∃y (φ(x, x) ∧ φ(y, y) ∧ ¬φ(x, y) ∧ x ≠ y)   (2)
is a theorem of T (proof trivial). 2P has indiscernible entities (proof trivial).
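For the record, the trivial verification that 2P has indiscernible entities can be spelled out by taking φ to be C itself; the following sketch (mine, in LaTeX notation) just makes the two conditions of the definition explicit:
% take φ(x, y) := C(x, y); then (1) instantiates to
\exists x \exists y\,\bigl(C(x,x) \wedge C(y,y) \wedge \neg C(x,y)\bigr)
% which follows directly from the two axioms of 2P. As for swappability,
% exchanging a and b in all the relations of a model M yields a structure
% isomorphic to M (the swap is a bijection of the domain), hence again a
% model of 2P.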
In conclusion, a formal interpretation of PII resists Black's attacks, but the following, stricter version of the principle:
If two entities have the same discernible properties, they are identical,
is false, and probably what Black had originally in mind.

Friday, May 16, 2014

The perfect shape

A liquid container designed to keep its content cool as long as possible would be spherical in shape, since the sphere is the closed surface with the minimum area to enclosed volume ratio, thus minimizing convection heating. When the container is also meant to be drunk from, the sphere is no longer the optimum shape for coolness (leaving aside the fact that one would have to pierce it to access the liquid) because the surface of the remaining liquid changes as the container is being emptied.
A volume V of liquid at temperature T inside a glass is heated by convection through two different surfaces, one in direct contact with the environment and the other enclosed by the glass wall.
These two surfaces contribute to the increase in temperature, given by Newton's law of cooling (here heating):
dT = (1/(cvV))(hopen Aopen + hclosed Aclosed)(Tenv - T) dt,
where cv is the volumetric heat capacity of the liquid, hopen and hclosed the overall heat transfer coefficients of the open and closed interfaces, respectively, and Tenv the temperature of the environment. If the liquid is drunk at a constant rate of ϕ units of volume per unit of time, then dt = -dV/ϕ and we have
dT = (1/(cvϕV))(hopen Aopen + hclosed Aclosed)(T - Tenv) dV,
with Aopen and Aclosed varying as functions of V; this differential equation implicitly gives T as a function of (Aopen, Aclosed, V), although its analytical solution is only feasible for the simplest shapes. We define the optimum container as that for which the average temperature of the liquid during consumption
Tavg = (1/Vinit) ∫[0, Vinit] T dV
is minimum.
Let us calculate this shape numerically. Candidate glasses are restricted to surfaces of revolution generated by nonnegative functions y = f(x) between 0 and some maximum height h.
We fix the rest of the parameters: the contained liquid is one liter of water (or beer, for that matter) brought out of the fridge at 5 ºC and drunk in 3 minutes on a very hot summer afternoon at 40 ºC; the glass wall (made of, well, glass) is set to be 3 mm thick. All of this gives us:
Vinit = 1,000 cm³,
ϕ = 1,000/180 cm³/s,
Tinit = 5 ºC,
Tenv = 40 ºC,
cv = 4.1796 J/cm³/K,
hwater = 50 W/m²/K,
hair = 100 W/m²/K,
θglass = 3 mm,
kglass = 0.85 W/m/K,
hopen = 1/((1/hwater)+(1/hair)) = 33.3 W/m²/K,
hclosed = 1/((1/hwater)+(θglass/kglass)+(1/hair)) = 29.8 W/m²/K.
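Before describing the optimization proper, the following is a rough sketch (mine, not the actual calculation program) of how Tavg can be evaluated numerically for a candidate profile y = f(x) sampled every dx cm: the glass is sliced into thin discs, the liquid level is tracked as the content is drained at rate ϕ, Aopen is approximated by the top liquid surface, Aclosed by the wetted wall plus the bottom, and Newton's law is integrated with simple Euler steps. All names and discretization choices are hypothetical; the constants are the ones just listed, converted to cm, s, J, K.
#include <algorithm>
#include <cstddef>
#include <vector>

double average_temperature(const std::vector<double>& f,double dx)
{
  const double pi=3.14159265358979;
  const double phi=1000.0/180.0;  // drinking rate (cm^3/s)
  const double t_env=40.0,t_init=5.0;
  const double cv=4.1796;         // volumetric heat capacity (J/cm^3/K)
  const double h_open=33.3e-4;    // 33.3 W/m^2/K expressed in W/cm^2/K
  const double h_closed=29.8e-4;  // 29.8 W/m^2/K expressed in W/cm^2/K

  // volume and (slope-ignoring) lateral wall area of each slice
  std::vector<double> vol(f.size()-1),lat(f.size()-1);
  for(std::size_t i=0;i<vol.size();++i){
    double r=(f[i]+f[i+1])/2;
    vol[i]=pi*r*r*dx;
    lat[i]=2*pi*r*dx;
  }

  double v=0;for(double s:vol)v+=s;  // initial volume (~Vinit after normalization)
  double t=t_init,dt=0.01,sum_tdv=0,v_init=v;

  while(v>0){
    // locate the slice holding the liquid surface
    double rem=v;std::size_t top=0;
    while(top<vol.size()&&rem>vol[top]){rem-=vol[top];++top;}
    double r_top=top<vol.size()?(f[top]+f[top+1])/2:f.back();

    double a_open=pi*r_top*r_top;                    // top liquid surface
    double a_closed=pi*f[0]*f[0];                    // glass bottom
    for(std::size_t i=0;i<top;++i)a_closed+=lat[i];  // fully wetted wall
    if(top<vol.size())a_closed+=lat[top]*rem/vol[top];

    double k=(h_open*a_open+h_closed*a_closed)/(cv*v)*dt;
    t+=std::min(k,1.0)*(t_env-t);  // clamped Euler step of Newton's law
    double dv=std::min(phi*dt,v);
    sum_tdv+=t*dv;v-=dv;
  }
  return sum_tdv/v_init;           // (1/Vinit) * integral of T dV
}
A routine of this kind would serve as the fitness function for the genetic algorithm described next.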
We will use a genetic algorithm to solve the problem:
  • Candidate solutions are encoded as arrays with the values f(0), f(0.5 cm), ... , f(50 cm) (that is, the maximum height of the glass is h = 50 cm, much higher than really needed —excess height will simply manifest as a tiny column of radius ≈ 0).
  • The initial pool of candidates is populated with values from random Hermite interpolation polynomials of degree 3 normalized so that the enclosed volume is Vinit.
  • An individual's fitness is simply its calculated Tavg: the lower Tavg, the fitter the solution.
  • At each generation, the fittest 25% of the pool is kept and the rest replaced by random breeding of the former. Crossover of f0 and f1 is implemented by randomly selecting two cutoff points c1 and c2, putting f as:
    • f(x) = f0(x) for x < c1,
    • f(x) = f0(x)·(c2 - x)/(c2 - c1) + f1(x)·(x - c1)/(c2 - c1) for c1 ≤ x < c2,
    • f(x) = f1(x) for x ≥ c2,
    and normalizing f.
  • Bred individuals are mutated (before normalization) by multiplying some of their values (mutation rate = 1%) by a random factor in the range [0,2); both genetic operators are sketched in the code below.
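By way of illustration, the crossover and mutation operators just described might look as follows (a sketch under my own naming conventions; normalization to Vinit and the rest of the machinery are omitted, and the cutoff points are taken directly as sample indices):
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using profile=std::vector<double>; // f(0), f(0.5 cm), ..., f(50 cm)

// blend f0 and f1 between two random cutoff points c1 < c2
// (assumes f0 and f1 hold the same number of samples)
profile crossover(const profile& f0,const profile& f1,std::mt19937& rng)
{
  std::uniform_int_distribution<std::size_t> dist(0,f0.size()-1);
  std::size_t c1=dist(rng),c2=dist(rng);
  if(c2<c1)std::swap(c1,c2);
  if(c1==c2)c2=c1+1; // degenerate cutoffs: keep the blend well defined
  profile f(f0.size());
  for(std::size_t i=0;i<f.size();++i){
    double x=double(i);
    if(i<c1)      f[i]=f0[i];
    else if(i<c2) f[i]=f0[i]*(c2-x)/(c2-c1)+f1[i]*(x-c1)/(c2-c1);
    else          f[i]=f1[i];
  }
  return f;
}

// multiply ~1% of the values by a random factor in [0,2)
void mutate(profile& f,std::mt19937& rng)
{
  std::bernoulli_distribution      hit(0.01);
  std::uniform_real_distribution<> factor(0.0,2.0);
  for(double& y:f)if(hit(rng))y*=factor(rng);
}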
The calculation program has been implemented in C++ (Boost, and in particular Boost.Units, is used). Trials with different initial populations and algorithm parameters (mutation rates, pool sizes and so on) yield basically the same result, which is depicted in the figure:
Optimum glass profile.
The area of this glass is 537 cm² and its associated average temperature Tavg is 6.40 ºC (quite cool!). As predicted, the excess height from ~25 cm up shows as a tiny filament whose actual capacity is totally negligible: trimming the glass cap off at a height of 18.5 cm produces a more practical glass with an opening diameter of 4.2 cm and almost the same performance (we only drop 2% of the initial content, and the temperature of the liquid at that point, 3.6 seconds after beginning to drink, is only slightly higher than Tinit). The glass thus obtained is, perhaps surprisingly, very reminiscent of real-life glasses:
Optimum glass trimmed at 18.5 cm.
In general, trimming the optimum glass f(x) for a given Tinit at height hcut produces the optimum solution of the problem for Vinit = V(height = hcut), Tinit = T(height = hcut) with the additional constraint that the opening of the glass is precisely 2f(hcut): we can then play with this property to obtain solutions to the modified problem
minimize Tavg while maintaining a given opening diameter δ
by starting with Vinit + ΔV and Tinit - ΔT and adjusting ΔV and ΔT until the volume and temperature at the desired opening hit Vinit and Tinit simultaneously (experimenting with this is left as an exercise for the reader).
Parameter sensitivity
An informal analysis reveals that the optimum glass shape is quite independent of changes in most problem parameters. In particular:
  • Tinit does not significantly affect the solution (as long as Tinit < Tenv).
  • Changing Vinit simply produces a scaled version of the same shape.
  • Decreasing ϕ (that is, taking longer to drink the beer up) yields only slightly different shapes, with a wider lower segment.
The one aspect that truly changes the shape is the hopen/hclosed ratio. As this increases, the optimum glass gets thinner at the base and longer, approaching a cylinder as hopen/hclosed → ∞. This has practical implications: for instance, making the beer grow a thick foam topping reduces hopen, which allows for shorter, wider, rounder glasses.
Postscript
Some readers have asked why we haven't used calculus of variations to solve the problem. In its classical one-dimensional formulation (Euler-Lagrange equation), this calculus provides tools for finding the stationary values of a functional J(f) defined as
J(f) = ∫[a,b] L(f(x), f'(x), x) dx,
for some function L : ℝ³ → ℝ with appropriate smoothness properties. The important aspect to note here is that L must depend on the point values of f and f' at x alone: by contrast, in our case the integrand T(V) is a function of hopen Aopen(v) + hclosed Aclosed(v) for all values of v in [V, Vinit], so the conditions for the application of the Euler-Lagrange equation do not hold.

Wednesday, May 7, 2014

Fast polymorphic collections with devirtualization

The poly_collection template class we implemented for fast handling of polymorphic objects can potentially be sped up when some of the derived classes it manages are known to the compiler at the point of template instantiation. Let us use this auxiliary template class:
template<class Derived>
class poly_collection_static_segment
{
public:
  void insert(const Derived& x)
  {
    store.push_back(x);
  }
  
  template<typename F>
  void for_each(F& f)
  {
    std::for_each(store.begin(),store.end(),f);
  }
  
  template<typename F>
  void for_each(F& f)const
  {
    std::for_each(store.begin(),store.end(),f);
  }

private:
  std::vector<Derived> store;
};
We can add the ability for poly_collection to accept a variable number of derived class types:
template<class Base,class ...Derivedn>
class poly_collection
{
  ...
};
so that it includes poly_collection_static_segment<Derivedi> components taking care of objects whose types coincide exactly with those specified, whereas for the rest the default dynamic handler is still used.
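One possible way to arrange this (a sketch of the dispatching idea only, not the implementation in the test program; for_each and the dynamic path are elided, and <cstddef>, <tuple> and <type_traits> are assumed to be included) is to keep a tuple of static segments and try the statically specified types, in order, for an exact match on insertion:
template<class Base,class ...Derivedn>
class poly_collection
{
public:
  template<class T>
  void insert(const T& x){insert_impl<0>(x);}

  // for_each would visit the static segments and then the dynamic ones

private:
  // try the statically specified types, in order, for an exact match...
  template<std::size_t I,class T>
  typename std::enable_if<(I<sizeof...(Derivedn))>::type
  insert_impl(const T& x)
  {
    typedef typename std::tuple_element<
      I,std::tuple<Derivedn...>>::type derived;
    insert_impl2<I>(x,std::is_same<T,derived>());
  }

  template<std::size_t I,class T>
  void insert_impl2(const T& x,std::true_type)
  {std::get<I>(static_segments).insert(x);}

  template<std::size_t I,class T>
  void insert_impl2(const T& x,std::false_type)
  {insert_impl<I+1>(x);}

  // ...and fall back to the default dynamic handler if none matches
  template<std::size_t I,class T>
  typename std::enable_if<(I==sizeof...(Derivedn))>::type
  insert_impl(const T& x)
  {/* dynamic, type_index-based insertion as in the original class */}

  std::tuple<poly_collection_static_segment<Derivedn>...> static_segments;
};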
What does this gain us? Consider the example:
struct base
{
  virtual void do_stuff()=0;
};

class derived1:public base{...};
...
poly_collection<base,derived1> c;
c.insert(derived1(0));
...
c.for_each([](base& x){
  x.do_stuff();
});
For the segment handling derived1, the code in for_each resolves to:
std::for_each(store.begin(),store.end(),[](base& x){
  x.do_stuff();
});
where store is of type std::vector<derived1>: so, the compiler can know that x.do_stuff() is invoked on objects with exact type derived1 and hence omit vtable lookup and even, if the definition of derived1::do_stuff is available at instantiation time, inline the call itself.
Now, the fact that the compiler can apply these optimizations does not mean it necessarily will do so: the static analysis needed to determine the optimization opportunity is certainly not trivial. There are a number of techniques we can implement to help the compiler in the process:
  1. Make derived1::do_stuff final.
  2. Use a polymorphic lambda [](auto& x){x.do_stuff();} to omit the cast from derived1& to base&.
  3. Do both 1 and 2 (see the snippet below).
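For illustration, here is what techniques 1 and 2 look like when applied to the example above (an illustrative snippet of mine; the classes in the actual test program may differ):
struct final_derived1:public base
{
  virtual void do_stuff()final{} // 1: no further overriding is possible
};
...
poly_collection<base,final_derived1> c;
c.insert(final_derived1());
...
c.for_each([](auto& x){ // 2: generic lambda: inside the static segment x is
  x.do_stuff();         // deduced as final_derived1&, so the call can be
});                     // devirtualized and even inlined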
I have written a test program (Boost required) that implements this new version of poly_collection and measures its for_each performance for the following scenarios:
  1. poly_collection<base> (baseline)
  2. poly_collection<base,derived1>
  3. poly_collection<base,derived1,derived2>
  4. poly_collection<base,derived1,derived2,derived3>
  5. poly_collection<base,final_derived1>
  6. poly_collection<base,final_derived1,final_derived2>
  7. poly_collection<
      base,final_derived1,final_derived2,final_derived3
    >
  8. same as 1, with a polymorphic functor
  9. same as 2, with a polymorphic functor
  10. same as 3, with a polymorphic functor
  11. same as 4, with a polymorphic functor
  12. same as 5, with a polymorphic functor
  13. same as 6, with a polymorphic functor
  14. same as 7, with a polymorphic functor
with containers of several sizes ranging from n = 1,000 to 10⁷ elements (except where noted, results do not vary significantly across sizes). Values are the averages of execution times / number of elements for each scenario, normalized to the baseline = 1.
MSVC 2012
Microsoft Visual Studio 2012 using default release mode settings on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.
Normalized execution times / number of elements.
Differences in performance are not significant and can be attributed mostly to noise: in fact, a visual inspection of the generated assembly code reveals that virtual calls are never optimized away.
MSVC 2013
Results by Lars Schouw: MSVC 2013 with default release settings on a box with an Intel Core i7-920 @2.67GHz.
Normalized execution times / number of elements.
Unlike with MSVC 2012, here there is a modest but consistent improvement in performance as more derived classes are statically specified. This is probably due to tighter traversal loops rather than devirtualization, except when a polymorphic lambda is combined with final virtual functions, where reductions of up to 25% seem to indicate that inlining is taking place.
Clang on Arch Linux
Results by Theodoros Theodoridis: Clang 3.4 with settings -std=c++11 -O3, Arch Linux x64 with an Intel Core i7-950 @3.07GHz.
Normalized execution times / number of elements.
In most cases, specifying derived classes has no effect on performance or even degrades it (the reason for the degradation is not clear to me). For polymorphic lambda + final virtual functions, however, reductions in execution times are impressively large.
There is an odd effect that the aggregate figures of the graph do not show: in the last two scenarios, performance results when n = 10⁷ are still much better than the baseline but comparatively worse than for other container sizes.
Clang on Ubuntu 64
Results by plash: Compiler Clang 3.5.0 with settings -std=c++11 -stdlib=libc++ -lc++ -O3, system Ubuntu 13.04 x86_64 with an AMD Phenom II.
Normalized execution times / number of elements.
Specification of derived classes definitely improves execution times, although the resulting performance patterns and underlying factors are not easily interpretable. For polymorphic lambda + final virtual functions we have again massive reductions in execution time. Similarly to the case with Clang 3.4, we have comparatively poorer performance for n = 10⁷, although here the effect is seen across all scenarios.
GCC on Arch Linux
Results by Theodoros Theodoridis: GCC 4.9.0 with settings -std=c++11 -O3, Arch Linux x64 with an Intel Core i7-950 @3.07GHz.
Normalized execution times / number of elements.
For this compiler, passing a monomorphic or a polymorphic lambda to for_each makes all the difference, with the latter case achieving reductions in execution time of up to 60%. I do not have an explanation as to why using final virtual functions improves on the non-final case: in both situations inlining seems to be in effect, which should render final qualification irrelevant. As with Clang, some cases of relative degradation when n = 10⁷ are also seen for this compiler, here mostly concentrated in the last two scenarios.
(Update Aug 17, 2014) GCC 5 on Linux
Jan Hubička is working on integrating devirtualization capabilities into GCC, and the results in this area are improving accordingly. He ran the test benchmark against GCC mainline (as of Aug 15, 2014) with settings -O3 on a Linux machine with an Intel i5-4300U processor.
Normalized execution times / number of elements.
GCC's speculative devirtualization analysis is able to take full advantage of the optimization opportunities offered by static knowledge of the types involved, even to the point of rendering final decoration superfluous.
Conclusions
Allowing for the specification of a list of statically known derived class types as part of a poly_collection instantiation opens up possibilities for internal performance optimizations that do not impact the usage interface of this template class. The actual improvements achieved depend very much on the compiler being used, but in general are greater when passing polymorphic functors (such as generic lambdas) to for_each and qualifying virtual functions as final where possible.
Postscript: The observant reader might have noticed that poly_collection_static_segment<Derivedi> does not explicitly require that Derivedi actually derive from Base, and in fact one can pass any type as long as we use a polymorphic functor with for_each that accepts it:
poly_collection<std::string,int,char> c;
c.insert(std::string("hello"));
c.insert(0);
c.insert('a');
c.for_each([](auto& x){
  std::cout<<x<<"\n";
});

Output:
hello
0
a
This is more of an (easily fixable) happy accident than a conscious design decision, but it hints at an area of exploration away from the original purpose of poly_collection, focused on handling heterogeneous classes via visitor patterns in a manner reminiscent of the techniques used in Boost.Variant.
Update Jan 17, 2015
Fixed a serious bug in the code that, fortunately, doesn't affect the test results.

Sunday, May 4, 2014

Fast polymorphic collections

In C++, instances of classes derived from an abstract class are typically heap-allocated and managed through (smart) pointers because their exact types (and hence their sizes) are not known until run time —strictly speaking, the type of each object is known at the point of creation but then abstracted away or erased when passed along to the location of the program where it is stored and used. For this very reason, such objects can't be stored in STL collections directly but rather through an indirection. For instance, this is what the layout of a std::vector<std::unique_ptr<base>> could look like for some abstract base class:
Sometimes we are not interested in the particular order the objects are stored in as we only want to keep them together to interact with them in a uniform manner:
struct base
{
  virtual int do_stuff(int x)=0;
};
...
using base_pointer=std::unique_ptr<base>;
std::vector<base_pointer> c;
...
std::for_each(c.begin(),c.end(),[](base_pointer& p){
  p->do_stuff(1);
});
When c.size() is large the resulting performance is impacted by two negative factors:
  • Heap allocated objects might be scattered throughout the memory space, incurring many cache misses during traversal;
  • Modern architectures with branch prediction can optimize away the vtable lookup cost associated to virtual calls when these calls successively resolve to the same function, but this is not the case if objects of distinct derived classes are interspersed along the vector.
Let us explore a different structure where objects of the same derived class are held together in segments of contiguous memory:
Here, objects are appended to their associated memory segment at insertion time, for which we need to know their exact type at that point. The structure used to determine which segment each object is inserted into is then just a std::map of std::type_index's to segment handling objects. This is a minimal interface for such a data structure:
template<class Base>
class poly_collection
{
public:
  // only available if Derived is derived from Base
  template<class Derived>
  void insert(Derived& x);

  template<typename F>
  F for_each(F f);

  template<typename F>
  F for_each(F f)const;
};
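A bare-bones sketch of this interface along the lines just described could look as follows (named poly_collection_sketch here to make clear it is an illustration of mine, not the actual test program: type-erased segments behind a std::map of std::type_index, with the const overload and all optimizations omitted; <functional>, <map>, <memory>, <typeindex>, <typeinfo> and <vector> are assumed to be included):
template<class Base>
class poly_collection_sketch
{
  struct segment
  {
    virtual ~segment(){}
    virtual void do_for_each(const std::function<void(Base&)>& f)=0;
  };

  template<class Derived>
  struct segment_impl:segment
  {
    virtual void do_for_each(const std::function<void(Base&)>& f)
    {
      for(auto& x:store)f(x); // contiguous traversal, one type per segment
    }
    std::vector<Derived> store;
  };

  std::map<std::type_index,std::unique_ptr<segment>> segments;

public:
  // only meant for Derived actually derived from Base
  template<class Derived>
  void insert(const Derived& x)
  {
    auto& p=segments[typeid(Derived)];
    if(!p)p.reset(new segment_impl<Derived>());
    static_cast<segment_impl<Derived>*>(p.get())->store.push_back(x);
  }

  template<typename F>
  F for_each(F f)
  {
    // one virtual call per segment; the per-element std::function call is
    // a shortcut of this sketch, not something the real container needs
    for(auto& e:segments)e.second->do_for_each(std::ref(f));
    return f;
  }
};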
Profiling
A testing program is provided that implements poly_collection and compares the performance of poly_collection<base> with respect to std::vector<std::unique_ptr<base>> for the following scenario:
struct base
{
  virtual int f(int x)const=0;
};

struct derived1:base
{
  derived1(int n):n(n){}
  virtual int f(int x)const{return x*n;}
  
  int n;
};

struct derived2:base
{
  derived2(int n):n(n){}
  virtual int f(int x)const{return x+n;}
  
  int unused,n;
};

struct derived3:base
{
  derived3(int n):n(n){}
  virtual int f(int x)const{return x-n;}
  
  int unused,n;
};

template<typename Collection>
int run_for_each(const Collection& c)
{
  int res=0;
  c.for_each([&res](const base& x){
    res+=x.f(1);
  });
  return res;
}
where the containers are filled with an assortment of n = 1,000 to 10⁷ objects of types derived1, derived2 and derived3 and then randomly shuffled to reflect the memory status of a program where objects have heterogeneous lifetimes.
MSVC on Windows
Test built with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz. Depicted values are microseconds/n, where n is the number of elements in the container.
for_each execution times / number of elements.
poly_collection behaves excellently and is virtually unaffected by the size of the container. For n < 10⁵, the differences in performance between poly_collection and a std::vector of std::unique_ptrs are due to worse virtual call branch prediction in the latter case; when n > 10⁵, massive cache misses are added to the first degrading factor.
For MSVC 2013 on a box with an Intel Core i7-920 @2.67GHz, results (provided by Lars Schouw) are similar, though the sharp jump in execution times for std::vector<std::unique_ptr<base>> is shifted from ~10⁵ to ~3·10⁵ due to the larger memory cache of the i7-920 processor.
for_each execution times / number of elements.
Clang on Ubuntu 64
Results provided by plash: Compiler Clang 3.5.0 with settings -std=c++11 -stdlib=libc++ -lc++ -O3, system Ubuntu 13.04 x86_64 with an AMD Phenom II X6 1090T.
for_each execution times / number of elements.
Results are even more striking here: while for small values of n poly_collection is only 2x faster than a std::vector of std::unique_ptrs (3x for MSVC), the performance ratio grows to 25x for n = 10⁷ (10-12x for MSVC).
GCC on Arch Linux
Results provided by Theodoros Theodoridis: Compiler GCC 4.9.0 with settings -std=c++11 -O3, system Arch Linux x64 with an Intel Core i7-950 @3.07GHz.
for_each execution times / number of elements.
Other environments
Request for help: if you can build (with release settings) and run the test program on some other environment/compiler and post the results back I'd be happy to include them in the entry.
Conclusions
Modern architectures, heavily accelerated by CPU memory caches, penalize typical patterns of C++ run-time polymorphism based on individual heap allocation of many heterogeneous objects implementing a given abstract interface. This can be improved dramatically by using containers that gather objects of the same concrete type in segments of contiguous memory, if this type is known at the point of insertion: poly_collection provides the foundational ideas for such a container. Note that poly_collection requires the classes dealt with to be CopyConstructible and MoveInsertable, which involve operations typically not used in classic patterns of C++ dynamic polymorphism.
There is still margin for improvement: if some of the concrete types deriving from base are known at container definition time, we can use this knowledge so that poly_collection benefits from devirtualization of these types; this we will explore in a later entry.
Update Jan 17, 2015
Fixed a serious bug in the code that, fortunately, doesn't affect the test results.

Saturday, April 12, 2014

Climbing down the tree: framework and performance

Continuing with our study on optimizing lexicographical comparison as used within binary search algorithms, we will provide now a fuller description of the supporting C++ framework and measure the actual performance with several test sequences on different environments.
Let Compare be a binary relation inducing a strict weak ordering on values of a type T, cmp a value of Compare and x a value of T. A binary search binder for cmp and x is any object cbind for which the expressions cbind.less(y) and cbind.greater(y) are valid whenever cmp(y,x) and cmp(x,y) are valid, respectively, and return the same boolean value as these. Moreover, when the range being searched is sorted (i.e. not merely partitioned with respect to x) the implementation of a binary search binder can assume the following usage restrictions:
  • If either the invocation cbind.less(y) returns true or the invocation cbind.greater(y) returns false, subsequent invocations on cbind will be done against values not less than y under the ordering relation induced by Compare.
  • If either the invocation cbind.less(y) returns false or the invocation cbind.greater(y) returns true, subsequent invocations on cbind will be done against values not greater than y.
Standard C++ binary search algorithms such as lower_bound or binary_search can be implemented by replacing each usage of the given Compare object with the equivalent invocation of a single associated binder obtained through comp_bind, which is defined as:
template<typename Comp,typename T>
struct comp_binder
{
  comp_binder(Comp cmp,const T& x):cmp(cmp),x(x){}

  template<typename Q> bool less(const Q& y){return cmp(y,x);}
  template<typename Q> bool greater(const Q& y){return cmp(x,y);}

  Comp     cmp;
  const T& x;
};

template<typename Comp,typename T>
comp_binder<Comp,T> comp_bind(Comp cmp,const T& x)
{
  return comp_binder<Comp,T>(cmp,x);
}
(Note that comp_binder is compatible even with polymorphic comparison types such as C++14 transparent operator functors.) While providing backwards compatibility, this framework introduces new customization opportunities via specialization of comp_binder; this is precisely what prefix_less does:
template<typename T>
struct prefix_less:std::less<T>
{
};

template<typename T>
struct comp_binder<prefix_less<T>,T>
{
  bool less(const T& y);
  bool greater(const T& y);
};
T must be either an instantiation of std::basic_string or a container with lexicographical less-than semantics. prefix_less<T> is functionally equivalent to std::less<T>, but its associated binder takes advantage of the special binary search context where it is being used to provide better algorithmic complexity. The actual implementation can be consulted in the test program provided below. From the point of view of the end user, prefix_less works exactly as std::less:
std::vector<std::string> v;
...
auto it=std::lower_bound(
  v.begin(),v.end(),x,prefix_less<std::string>());
except that, if the algorithm internally uses comp_bind, the resulting performance might improve —or degrade: we will now see how the real thing behaves.
Measurements
I have written a program that compares the execution speeds of std::less and prefix_less with comp_bind-compliant versions of lower_bound and binary_search under the following testing scenarios:
template<typename Sequence,typename Compare>
void run_lower_bound(const Sequence& s,Compare cmp)
{
  for(const auto& x:s){
    lower_bound(s.begin(),s.end(),x,cmp);
  }
}

template<typename Sequence,typename Compare>
void run_binary_search(const Sequence& s,Compare cmp)
{
  for(const auto& x:s){
    binary_search(s.begin(),s.end(),x,cmp);
  }
}
An assortment of sequences is tested (in what follows SN,L denotes the sorted range of strings of length L composed with the N characters '0', '1', ...; for instance, S2,3 is the range ["000", "001", "010", "011", "100", "101", "110", "111"]):
  • S2,4 (size: 16), S2,10 (size: 1,024), S2,20 (size: ~1,000,000), S10,4 (size: 10,000), S10,6 (size: 10,000,000), S20,4 (size: 160,000),
  • The sorted set of ~23,000 unique words within Cervantes' Don Quijote de la Mancha
Several representation types are also used to encode the elements of these sequences:
  • std::string,
  • std::wstring,
  • std::vector<udchar>, where udchar is a simple wrapper around char (a possible definition is sketched after this list),
  • std::vector<std::string>, i.e. each character is encoded as a separate string of length 1.
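For reference, a minimal udchar along these lines could be (my reconstruction; the definition in the test program may differ):
struct udchar
{
  udchar(char c=0):c(c){}
  bool operator==(const udchar& y)const{return c==y.c;}
  bool operator<(const udchar& y)const{return c<y.c;}
  char c;
};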
MSVC
Tests were built with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz. Depicted values are microseconds/n, where n is the number of elements in the sequence tested.
Execution times, std::string.
Execution times, std::wstring.
Execution times, std::vector<udchar>.
Execution times, std::vector<std::string>.
It is hard to beat the default implementation of lexicographical comparison for such simple types as std::string, which usually reduces to a call to the extremely efficient std::memcmp, and additionally the theoretical performance gain of prefix_less is dwarfed by the constant bookkeeping cost associated to tracking the length of the discarded common prefix. With these considerations in mind, prefix_less performs only marginally better than std::less when average string length is small and the representation type is simple, and excels in the opposite situations. std::vector<udchar> is an anomaly for which I do not have a clear explanation: udchar not being a memcmp-compatible type, one would expect prefix_less to behave unconditionally better than std::less just as with std::vector<std::string>, but in fact this is not the case; maybe there is some inlining threshold met by prefix_less that the simpler code of std::less manages to avoid.
GCC
Thomas Goyne has built the tests with GCC 4.9 20140330, release settings -O3 -std=c++11, and run them on an OS X 10.9.1 box with an Intel Core i7-3720QM @2.6GHz.
Execution times, std::string.
Execution times, std::wstring.
Execution times, std::vector<udchar>.
Execution times, std::vector<std::string>.
Unlike with MSVC, here prefix_less performs consistently better than std::less except for S2,4, where string length is too small to compensate for the extra constant cost incurred by prefix_less. Reductions in execution times range from 10% to 50%.
Clang
Results also provided by Thomas Goyne: Clang 503.0.38 (3.4svn) with libc++, release settings -O3 -std=c++11, same OS X machine as with GCC.
Execution times, std::string.
Execution times, std::wstring.
Execution times, std::vector<udchar>.
Execution times, std::vector<std::string>.
Performance gains are similar to those obtained in the case of GCC except when the representation type is std::vector<udchar>, in which case std::less is generally faster, again for reasons for which I do not have a satisfactory explanation.
Conclusions
C++ binary search algorithms can be extended in a backwards compatible way to support mechanisms for lexicographical comparison with lower complexity and, generally speaking, better performance than provided by regular std::less instantiations. Although we have focused on comparison of containers and strings, there is also an additional category of entities, namely tuples and similar types, that could also benefit from this framework: this we might explore in future entries.

Saturday, April 5, 2014

Climbing down the tree

Within the execution of binary search through a range of strings, the algorithmic complexity of lexicographical comparison, which is almost independent of the length of the strings in the general case, turns out here to degrade to O(length of strings). We want to devise a mechanism to restore the efficiency of lexicographical comparison by introducing some contextual information on the binary search execution back into the comparison routine.
Binary search can be regarded as the process of climbing down a binary tree whose nodes are the elements of the sorted range being searched through:
Say we are looking for a given value x and denote by x1, x2, ... the sequence of elements being compared against x during the process. The following is a fundamental property of partition-based algorithms such as binary search:
Property. For each xi visited during the process
  • if x < xi then xj < xi for all j > i,
  • if x > xi then xj > xi for all j > i.
So, searching zeroes in on x by fencing it into a narrowing interval [li, ui] ("l" and "u" stand for lower and upper bound, respectively) that is implicitly calculated as
  • [l0, u0] = [smallest string, largest string],
  • if x < xi then [li, ui] = [li-1, xi],
  • if x > xi then [li, ui] = [xi, ui-1].
The key observation here in connection with lexicographical comparison is: if, at stage i, li and ui are strings sharing a common prefix pi, then all subsequent comparisons will be done against strings beginning with this same prefix. An intelligent exploitation of this fact then spares us the need to check the prefix again, so that we can start the comparison at mid-string position.
Let us put the idea in C++. The canonical implementation of std::lower_bound looks like this:
template<typename ForwardIterator,typename T,typename Compare>
ForwardIterator lower_bound(
  ForwardIterator first,ForwardIterator last,
  const T& x,Compare comp)
{
  ForwardIterator it;
  typename std::iterator_traits<
    ForwardIterator>::difference_type count,step;
  count=std::distance(first,last);
  while(count>0){
    it=first;
    step=count/2;
    std::advance(it,step);
    if(comp(*it,x)){
      first=++it;
      count-=step+1;
    }
    else count=step;
  }
  return first;
}
where comp is a functor comparing T values for inequality (typically std::less<T>). This interface does not serve our purposes because, when invoking comp(x,y), we are not indicating which argument is the element being searched for and which is the range element being compared against. So let us tweak the code a bit to make this information explicit:
{
  ...
  auto cbind=comp_bind(comp,x);
  while(count>0){
    it=first;
    step=count/2;
    std::advance(it,step);
    if(cbind.less(*it)){
      first=++it;
      count-=step+1;
    }
    else count=step;
  }
  return first;
}
We will not delve now into the implementation of the framework comp_bind relies on: suffice it to say that it passes x to a comparison object with the following interface: 
template<typename Comp,typename Value>
struct comp_binder
{
  comp_binder(const Comp& cmp,const Value& x);

  bool less(const Value& y);    // ~ cmp(y,x)
  bool greater(const Value& y); // ~ cmp(x,y)
};
This is then a mere change of interface with respect to that of std::less, but one that allows us to keep relevant contextual information on the execution of lower_bound within the comparison object itself, information that an optimized lexicographical comparison object can take advantage of:
template<typename T>
struct comp_binder<prefix_less<T>,T>
{
  comp_binder(const prefix_less<T>&,const T& x):
    x(x),pref_left(0),pref_right(0)
  {}

  bool less(const T& y)
  {
    auto n=std::min(pref_left,pref_right);
    auto m=std::min(x.size(),y.size());
    auto it1=x.begin()+n,it2=y.begin()+n;
    for(;n!=m&&*it1==*it2;++n,++it1,++it2);
    return (n!=m?*it2<*it1:n<x.size())?
      (pref_left=n,true):
      (pref_right=n,false);
  }

  bool greater(const T& y); // similar to less

  const T&              x;
  typename T::size_type pref_left,pref_right;
};
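For completeness, the greater member elided above ("similar to less") could be filled in as follows (my reconstruction, which simply swaps the roles of pref_left and pref_right and of x and y in the final test):
bool greater(const T& y)
{
  auto n=std::min(pref_left,pref_right);
  auto m=std::min(x.size(),y.size());
  auto it1=x.begin()+n,it2=y.begin()+n;
  for(;n!=m&&*it1==*it2;++n,++it1,++it2);
  return (n!=m?*it1<*it2:n<y.size())?
    (pref_right=n,true):
    (pref_left=n,false);
}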
The object keeps in pref_left the length of the common prefix between the latest lower bounding element and x, and similarly with pref_right for upper bounding elements. By the fundamental properties of partition based algorithms, it is guaranteed that further comparisons will be done only against strings sharing with x a common prefix of length ≥ min{pref_left, pref_right}, a fact we can use to shorten the checking loop. 
I have put together an instrumented version of this scheme into a test program that calculates the resulting complexity of optimized lexicographical comparison for the sorted range of the N^L strings of length L constructible with an alphabet of N symbols, denoted in what follows by E[CoptN,L]:
E[CoptN,L] as a function of L.
Remember that E[CN,L] is the average number of symbols checked when strings are randomly extracted from the range and E[C'N,L] is the corresponding value in the case that comparison happens within binary search. We have then
E[CN,L] ≤ E[CoptN,L] ≤ E[C'N,L],
E[CoptN,L] is bounded as N, L → ∞
(we won't prove this); so, we have restored O(1) complexity for lexicographical comparison, though the figures are still around two times higher than in the "free-floating" scenario.
In a later entry we will elaborate on the definition of comp_bind to support a backwards-compatible framework for less-based standard binary search algorithms and will also measure actual execution times with and without this optimization.