Monday, December 28, 2015

How likely?

Yesterday, the CUP political party held a general assembly to determine whether or not to support Artur Mas's candidacy for President of the Catalonian regional government. The final voting round among 3,030 representatives ended in an exact 1,515/1,515 tie, leaving the question unresolved for the time being. Such an unexpected result has prompted a flurry of Internet activity about the mathematical probability of its occurrence.
The question "how likely was this result to happen?" is of course unanswerable without a specification of the context (i.e. the probability space) we choose to frame the event. A plausible formulation is:
If a proportion p of CUP voters are pro-Mas, how likely is it that a random sample of 3,030 individuals yields a 50/50 tie?
The simple answer (assuming the number of CUP voters is much larger than 3,030) is Pp(1,515 | 3,030), where Pp(n | N) is the binomial probability that N Bernoulli trials with success probability p result in exactly n successes.
The figure shows this value for 40% ≤ p ≤ 60%. At p = 50%, which without further information is our best estimate of the proportion of pro-Mas supporters among CUP voters, the probability of a tie is 1.45%. A deviation in p of ±4% would have made this result virtually impossible.
A slightly more interesting question is the following:
If a proportion p of CUP voters are pro-Mas, how likely is a random sample of 3,030 individuals to misestimate the majority opinion?
When p is in the vicinity of 50%, there is a non-negligible probability that the assembly vote comes up with the wrong (i.e. against voters' wishes) result. This probability is
Ip(1,516, 1,515) if p < 50%,
1 − Pp(1,515 | 3,030) if p = 50%,
I1−p(1,516, 1,515) if p > 50%,
where Ip(a,b) is the regularized incomplete beta function. The figure shows the corresponding graph for 3,030 representatives and 40% ≤ p ≤ 60%.
The function shows a discontinuity at the singular (and zero-probability) event p = 50%, in which case the assembly will always yield the wrong result except in the previously studied situation of an exact tie (so, the probability of misestimation is 1 − 1.45% = 98.55%). Other than this, the likelihood of misestimation approaches 49%+ as p tends to 50%. We have learnt that CUP voters are almost evenly divided between pro- and anti-Mas: if the difference between both positions is 0.7% or less, an assembly of 3,030 representatives such as the one held yesterday will fail to reflect the party's global position in more than 1 out of 5 cases.
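The numbers above are easy to reproduce. The following is a minimal sketch (relying on Boost.Math, which is merely a convenient choice and not something used elsewhere in this entry) that evaluates both the tie probability and the misestimation probability across the plotted range of p:
#include <boost/math/distributions/binomial.hpp>
#include <boost/math/special_functions/beta.hpp>
#include <iostream>

int main()
{
  const int N=3030,n=N/2; // n = 1,515

  for(int pm=400;pm<=600;pm+=5){ // p from 40.0% to 60.0% in 0.5% steps
    double p=pm/1000.0;
    boost::math::binomial_distribution<> d(N,p);
    double tie=boost::math::pdf(d,n); // P_p(1,515 | 3,030)

    double wrong; // probability of misestimating the majority opinion
    if(pm<500)      wrong=boost::math::ibeta(n+1,n,p);     // I_p(1,516, 1,515)
    else if(pm>500) wrong=boost::math::ibeta(n+1,n,1.0-p); // I_{1-p}(1,516, 1,515)
    else            wrong=1.0-tie;                         // 1 - P_p(1,515 | 3,030)

    std::cout<<"p="<<p<<"  P(tie)="<<tie<<"  P(wrong)="<<wrong<<"\n";
  }
}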

Saturday, November 14, 2015

SOA container for encapsulated C++ DOD

In a previous entry we saw how to decouple the logic of a class from the access to its member data so that the latter can be laid out in a DOD-friendly fashion for faster sequential processing. Instead of having a std::vector of, say, particles, now we can store the different particle members (position, velocity, etc.) in separate containers. This unfortunately results in more cumbersome initialization code: whereas for the traditional, OOP approach particle creation and access is compact and nicely localized:
std::vector<plain_particle> pp_;
...
for(std::size_t i=0;i<n;++i){
  pp_.push_back(plain_particle(...));
}
...
render(pp_.begin(),pp_.end());
when using DOD, in contrast, the equivalent code grows linearly with the number of members, even if most of it is boilerplate:
std::vector<char> color_;
std::vector<int>  x_,y_,dx_,dy_;
...
for(std::size_t i=0;i<n;++i){
  color_.push_back(...);
  x_.push_back(...);
  y_.push_back(...);
  dx_.push_back(...);
  dy_.push_back(...);  
}
...
auto beg_=make_pointer<particle>(
  access<color,x,y,dx,dy>(&color_[0],&x_[0],&y_[0],&dx_[0],&dy_[0]));
auto end_=beg_+n;
render(beg_,end_);
We would like to rely on a container using SOA (structure of arrays) for its storage that allows us to retain our original OOP syntax:
using access=dod::access<color,x,y,dx,dy>;
dod::vector<particle<access>> p_;
...
for(std::size_t i=0;i<n;++i){
  p_.emplace_back(...);
}
...
render(p_.begin(),p_.end());
Note that particles are inserted into the container using emplace_back rather than push_back: this is due to the fact that a particle object (which push_back accepts as its argument) cannot be created out of the blue without its constituent members being previously stored somewhere; emplace_back, on the other hand, does not suffer from this chicken-and-egg problem.
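To make the argument concrete, here is a hypothetical insertion (the member values are invented for illustration): emplace_back forwards one argument per declared member, in declaration order, to the corresponding internal vector, whereas push_back would need a fully formed particle value that has nowhere to live:
p_.emplace_back(5,20,40,2,-1); // color, x, y, dx, dy forwarded one by one
                               // to their respective internal vectors

// p_.push_back(particle<access>(...)); // not viable: a particle<access> is a
//                                      // proxy and owns no storage of its own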
The implementation of such a container class is fairly straightforward (limited here to the operations required to make the previous code work):
namespace dod{

template<typename Access>
class vector_base;

template<>
class vector_base<access<>>
{
protected:
  access<> data(){return {};}
  void emplace_back(){}
};

template<typename Member0,typename... Members>
class vector_base<access<Member0,Members...>>:
  protected vector_base<access<Members...>>
{
  using super=vector_base<access<Members...>>;
  using type=typename Member0::type;
  using impl=std::vector<type>;
  using size_type=typename impl::size_type;
  impl v;
  
protected:
  access<Member0,Members...> data()
  {
    return {v.data(),super::data()};
  }

  size_type size()const{return v.size();}

  template<typename Arg0,typename... Args>
  void emplace_back(Arg0&& arg0,Args&&... args){
    v.emplace_back(std::forward<Arg0>(arg0));
    try{
      super::emplace_back(std::forward<Args>(args)...);
    }
    catch(...){
      v.pop_back();
      throw;
    }
  }
};
  
template<typename T> class vector;
 
template<template <typename> class Class,typename Access> 
class vector<Class<Access>>:protected vector_base<Access>
{
  using super=vector_base<Access>;
  
public:
  using iterator=pointer<Class<Access>>;
  
  iterator begin(){return super::data();}
  iterator end(){return this->begin()+super::size();}
  using super::emplace_back;
};

} // namespace dod
dod::vector<Class<Access>> derives from an implementation class that holds a std::vector for each of the Members declared in Access. Inserting elements is just a simple matter of multiplexing to these vectors, and begin and end return dod::pointers to this structure of arrays. From the point of view of the user all the necessary magic is hidden by the framework and DOD processing becomes nearly identical in syntax to OOP.
We provide a test program that exercises dod::vector against the classical OOP approach based on a std::vector of plain (i.e., non-DOD) particles. Results are the same as previously discussed when we used DOD with manual initialization, that is, there is no abstraction penalty associated with using dod::vector, so we won't present any additional figures here.
The framework we have constructed so far provides the bare minimum needed to test the ideas presented. In order to be fully usable there are various aspects that should be expanded upon:
  • access<Members...> just considers the case where each member is stored separately. Sometimes the most efficient layout will call for mixed scenarios where some of the members are grouped together. This can be modelled, for instance, by having member accept multiple pieces of data in its declaration.
  • dod::pointer does not properly implement const access, that is, pointer<const particle<...>> does not compile.
  • dod::vector should be implemented to provide the full interface of a proper vector class.
All of this can in principle be tackled without serious design difficulties.

Sunday, September 6, 2015

C++ encapsulation for Data-Oriented Design: performance

(Many thanks to Manu Sánchez for his help with running tests and analyzing results.)
In a past entry, we implemented a little C++ framework that allows us to do DOD while retaining some of the encapsulation benefits and the general look and feel of traditional object-based programming. We complete here the framework by adding a critical piece from the point of view of usability, namely the ability to process sequences of DOD entities with as terse a syntax as we would have in OOP.
To enable DOD for a particular class (like the particle we used in the previous entry), i.e., to distribute its different data members in separate memory locations, we change the class source code to turn it into a class template particle<Access>, where Access is a framework-provided entity in charge of granting access to the external data members with much the same syntax as if they were an integral part of the class itself. Now, particle<Access> is no longer a regular class with value semantics, but a mere proxy to the external data without ownership of it. Importantly, it is the members and not the particle objects that are stored: particles are constructed on the fly whenever their interface is needed to process the data. So, code like
for(const auto& p:particle_)p.render();
cannot possibly work because the application does not have any particle_ container to begin with: instead, the information is stored in separate locations:
std::vector<char> color_;
std::vector<int>  x_,y_,dx_,dy_;
and "traversing" the particles requires that we go through the associated containers in parallel and invoke render on a temporary particle object constructed out of them:
auto itc=&color_[0],ec=itc+color_.size();
auto itx=&x_[0];
auto ity=&y_[0];
auto itdx=&dx_[0];
auto itdy=&dy_[0];
  
for(;itc!=ec;++itc,++itx,++ity,++itdx,++itdy){
  auto p=make_particle(
    access<color,x,y,dx,dy>(itc,itx,ity,itdx,itdy));
  p.render();
}
Fortunately, this boilerplate code can be hidden by the framework by using these auxiliary constructs:
template<typename T> class pointer;

template<template <typename> class Class,typename Access>
class pointer<Class<Access>>
{
  // behaves as Class<Access>*
};

template<template <typename> class Class,typename Access>
pointer<Class<Access>> make_pointer(const Access& a)
{
  return pointer<Class<Access>>(a);
}
We won't delve into the implementation details of pointer (the interested reader can see the actual code in the test program given below): from the point of view of the user, this utility class accepts an access entity (a collection of pointers to the data members plus an offset member, which has been added to the former version of the framework), keeps everything in sync when doing pointer arithmetic, and dereferences to a temporary particle object. The resulting user code is as simple as it gets:
auto n=color_.size();
auto beg_=make_pointer<particle>(access<color,x,y,dx,dy>(
  &color_[0],&x_[0],&y_[0],&dx_[0],&dy_[0]));
auto end_=beg_+n;
  
for(auto it=beg_;it!=end_;++it)it->render();
Index-based traversal is also possible:
for(std::size_t i=0;i<n;++i)beg_[i].render();
Once the containers are populated and beg_ and end_ defined, user code can handle particles as if they were stored in [beg_, end_), thus effectively isolated from the fact that the actual data is scattered around different containers for maximum processing performance.
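To give a rough idea of what goes on behind the scenes, here is a minimal sketch of such a pointer proxy (not the actual code of the test program). It assumes that Access is copyable, supports += on its internal offset and can be compared for equality, which is all the pointer itself needs:
#include <cstddef>

template<typename T> class pointer;

template<template<typename> class Class,typename Access>
class pointer<Class<Access>>
{
public:
  explicit pointer(const Access& a):a(a){}

  // dereferencing constructs a temporary proxy object on the fly
  Class<Access> operator*()const{return Class<Access>(a);}
  Class<Access> operator[](std::ptrdiff_t n)const{return *(*this+n);}

  // operator-> needs a helper that keeps the temporary alive
  struct arrow_proxy
  {
    Class<Access>  obj;
    Class<Access>* operator->(){return &obj;}
  };
  arrow_proxy operator->()const{return arrow_proxy{Class<Access>(a)};}

  // pointer arithmetic reduces to advancing the access offset
  pointer& operator++(){a+=1;return *this;}
  pointer& operator+=(std::ptrdiff_t n){a+=n;return *this;}
  friend pointer operator+(pointer p,std::ptrdiff_t n){return p+=n;}

  friend bool operator==(const pointer& p,const pointer& q){return p.a==q.a;}
  friend bool operator!=(const pointer& p,const pointer& q){return !(p==q);}

private:
  Access a;
};
The essentials are just offset bookkeeping plus on-the-fly construction of the proxy object; the rest of a full random-access iterator interface follows the same lines.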
Are we paying an abstraction penalty for the convenience this framework affords? There are two sources of concern:
  • Even though traversal code is in principle equivalent to hand-written DOD code, compilers might not be able to optimize all the template scaffolding away.
  • Traversing with access<color,x,y,dx,dy> for rendering when only color, x and y are needed (because render does not access dx or dy) involves iterating over dx_ and dy_ without actually accessing either one: again, the compiler might or might not optimize this extra code.
We provide a test program (Boost required) that measures the performance of this framework against some alternatives. The looped-over render procedure simply updates a global variable so that resulting execution times are basically those of the produced iteration code. The different options compared are:
  • oop: iteration over a traditional object-based structure
  • raw: hand-written data-processing loop
  • dod: DOD framework with access<color,x,y,dx,dy>
  • render_dod: DOD framework with  access<color,x,y>
  • oop[i]: index-based access instead of iterator traversal
  • raw[i]: hand-written index-based loop
  • dod[i]: index-based with access<color,x,y,dx,dy>
  • render_dod[i]: index-based with access<color,x,y>
The difference between dod and render_dod (and the same applies to their index-based variants) is that the latter keeps access only to the data members strictly required by render: if the compiler were not able to optimize unnecessary pointer manipulations in dod, render_dod would be expected to be faster; the drawback is that this would require fine tuning the access entity for each member function.
Manu Sánchez has set up an extensive testing environment to build and run the program using different compilers and machines:
The figures show the release-mode execution times of the eight options described above when traversing sequences of n = 10^4, 10^5, 10^6 and 10^7 particles.
GCC 5.1, MinGW, Intel Core i7-4790k @4.5GHz
Execution times / number of elements.
As expected, OOP is the slowest due to cache effects. The rest of options are basically equivalent, which shows that GCC is able to entirely optimize away the syntactic niceties brought in by our DOD framework.
MSVC 14.0, Windows, Intel Core i7-4790k @4.5GHz
Execution times / number of elements.
Here, again, all DOD options are roughly equivalent, although raw (pointer-based hand-written loop) is slightly slower. Curiously enough, MSVC is much worse than GCC at optimizing DOD with respect to OOP, with execution times up to 4 times higher for n = 10^4 and 1.3 times higher for n = 10^7, the latter scenario being presumably dominated by cache effects.
GCC 5.2, Linux, AMD A6-1450 APU @1.0 GHz
Execution times / number of elements.
From a qualitative point of view, these results are in line with those obtained for GCC 5.1 under an Intel Core i7, although as the AMD A6 is a much less powerful processor execution times are higher (×8-10 for n = 10^4, ×4-5.5 for n = 10^7).
Clang 3.6, Linux, AMD A6-1450 APU @1.0 GHz
Execution times / number of elements.
As happens with the rest of the compilers, DOD options (both manual and framework-supported) perform equally well. However, the comparison with GCC 5.2 on the same machine shows important differences: iterator-based OOP is faster (×1.1-1.4) in Clang, index-based OOP yields the same results for both compilers, and the DOD options in Clang are consistently slower (×2.3-3.4) than in GCC, to the point that OOP outperforms them for low values of n. A detailed analysis of the assembly code produced would probably give us more insight into these contrasting behaviors: interested readers can access the resulting assembly listings at the associated GitHub repository.

Monday, August 31, 2015

C++ encapsulation for Data-Oriented Design

Data-Oriented Design, or DOD for short, seeks to maximize efficiency by laying out data in such a way that their processing is as streamlined as possible. This often goes against the usual object-based principles that naturally lead to grouping the information according to the user-domain entities that it models. Consider for instance a game where large quantities of particles are rendered and moved around:
class particle
{  
  char  color;
  int   x;
  int   y;
  int   dx;
  int   dy;
public:

  static const int max_x=200;
  static const int max_y=100;
    
  particle(char color_,int x_,int y_,int dx_,int dy_):
    color(color_),x(x_),y(y_),dx(dx_),dy(dy_)
  {}

  void render()const
  {
    // for explanatory purposes only: dump to std::cout
    std::cout<<"["<<x<<","<<y<<","<<int(color)<<"]\n";
  }

  void move()
  {
    x+=dx;
    if(x<0){
      x*=-1;
      dx*=-1;
    }
    else if(x>max_x){
      x=2*max_x-x;
      dx*=-1;      
    }
    
    y+=dy;
    if(y<0){
      y*=-1;
      dy*=-1;
    }
    else if(y>max_y){
      y=2*max_y-y;
      dy*=-1;      
    }
  }
};
...
// game particles
std::vector<particle> particles;
In the rendering loop, the program might do:
for(const auto& p:particles)p.render();
Trivial as it seems, the execution speed of this approach is nevertheless suboptimal. The memory layout for particles looks like:
which, when traversed in the rendering loop, results in 47% of the data cached by the CPU (the part corresponding to dx and dy, in white) not being used, or even more if padding occurs. A more intelligent layout based on 5 different vectors
allows the needed data, and only this, to be cached in three parallel cache lines, thus maximizing occupancy and minimizing misses. For the moving loop, it is a different set of data vectors that must be provided. DOD is increasingly popular, in particular in very demanding areas such as game programming. Mike Acton's presentation on DOD and C++ is an excellent introduction to the principles of data orientation.
The problem with DOD is that encapsulation is lost: rather than being nicely packed in contiguous chunks of memory whose lifetime management is heavily supported by the language rules, "objects" now live as virtual entities with disemboweled, scattered pieces of information floating around in separate data structures. Methods acting on the data need to publish the exact information they require as part of their interface, and it is the responsibility of the user to locate it and provide it. We want to explore ways to remedy this situation by allowing a modest level of object encapsulation compatible with DOD.
Roughly speaking, in C++ an object serves two different purposes:
  • Providing a public interface (a set of member functions) acting on the associated data.
  • Keeping access to the data and managing its lifetime.
Both roles are mediated by the this pointer. In fact, executing a member function on an object
x.f(args...);
is conceptually equivalent to invoking a function with an implicit extra argument
X::f(this,args...);
where the data associated to x, assumed to be contiguous, is pointed to by this. We can break this intermediation by letting objects be supplied with an access entity replacing this for the purpose of reaching out to the information. We begin with a purely syntactic device:
template<typename T,int Tag=0>
struct member
{
  using type=T;
  static const int tag=Tag;
};
member<T,Tag> will be used to specify that a given class has a piece of information with type T. Tag is needed to tell apart different members of the same type (for instance, particle has four different members of type int, namely x, y, dx and dy). Now, the following class:
template<typename Member>
class access
{
  using type=typename Member::type;
  type* p;

public:
  access(type* p):p(p){}
  
  type&       get(Member){return *p;}
  const type& get(Member)const{return *p;}
};
stores a pointer to a piece of data accessing the specified member. This can be easily expanded to accommodate more than one member:
template<typename... Members>class access;

template<typename Member>
class access<Member>
{
  using type=typename Member::type;
  type* p;

public:
  access(type* p):p(p){}
  
  type&       get(Member){return *p;}
  const type& get(Member)const{return *p;}
};

template<typename Member0,typename... Members>
class access<Member0,Members...>:
  public access<Member0>,access<Members...>
{
public:
  template<typename Arg0,typename... Args>
  access(Arg0&& arg0,Args&&... args):
    access<Member0>(std::forward<Arg0>(arg0)),
    access<Members...>(std::forward<Args>(args)...)
  {}
  
  using access<Member0>::get;
  using access<Members...>::get;
};
To access, say, the data labeled as member<int,0> we need to write get(member<int,0>()). The price we have to pay for having data scattered around memory is that the access entity holds several pointers, one per member; on the other hand, the resulting objects, as we will see, really behave as on-the-fly proxies to their associated information, so access entities will seldom be stored. particle can be rewritten so that data is accessed through a generic access object:
template<typename Access>
class particle:Access
{
  using Access::get;
  
  using color=member<char,0>;
  using x=member<int,0>;
  using y=member<int,1>;
  using dx=member<int,2>;
  using dy=member<int,3>;

public:

  static const int max_x=200;
  static const int max_y=100;

  particle(const Access& a):Access(a){}

  void render()const
  {
    std::cout<<"["<<get(x())<<","
      <<get(y())<<","<<int(get(color()))<<"]\n";
  }

  void move()
  {
    get(x())+=get(dx());
    if(get(x())<0){
      get(x())*=-1;
      get(dx())*=-1;
    }
    else if(get(x())>max_x){
      get(x())=2*max_x-get(x());
      get(dx())*=-1;      
    }
    
    get(y())+=get(dy());
    if(get(y())<0){
      get(y())*=-1;
      get(dy())*=-1;
    }
    else if(get(y())>max_y){
      get(y())=2*max_y-get(y());
      get(dy())*=-1;      
    }
  }
};

template<typename Access>
particle<Access> make_particle(Access&& a)
{
  return particle<Access>(std::forward<Access>(a));
}
The transformations that need to be done on the source code are not many:
  • Turn the class into a class template dependent on an access entity from which it derives.
  • Rather than declaring internal data members, define the corresponding member labels.
  • Delete the former OOP constructors and define just one constructor taking an access object as its only argument.
  • Replace mentions of data members by their corresponding access member function invocations (in the example, substitute get(color()) for color, get(x()) for x, etc.)
  • For convenience's sake, provide a make template function (in the example make_particle) to simplify object creation.
Observe how this works in practice:
using color=member<char,0>;
using x=member<int,0>;
using y=member<int,1>;
using dx=member<int,2>;
using dy=member<int,3>;

char color_=5;
int  x_=20,y_=40,dx_=2,dy_=-1;

auto p=make_particle(access<color,x,y>(&color_,&x_,&y_));
auto q=make_particle(access<x,y,dx,dy>(&x_,&y_,&dx_,&dy_));
p.render();
q.move();
p.render();
The particle data now lives externally as a bunch of separate variables (or, in a more real-life scenario, stored in containers). p and q act as proxies to the same information (i.e., they don't copy data internally) but other than this they provide the same interface as the OOP version of particle, and can be used similarly. Note that the two objects specify different sets of access members, as required by render and move, respectively. So, the following
q.render(); // error
would result in a compile time error as render accesses data that q does not provide. Of course we can do
auto p=make_particle(
         access<color,x,y,dx,dy>(&color_,&x_,&y_,&dx_,&dy_)),
     q=p;
so that the resulting objects can take advantage of the entire particle interface. In later entries we will see how this need not affect performance in traversal algorithms. A nice side effect of this technique is that, when extra data is added to a DOD class, former code will continue to work as long as this data is only used in new member functions of the class.
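For instance (the new member, the weigh member function and the mass_ variable below are made up purely for illustration), suppose particle is extended with a mass member used only by a new member function:
// hypothetical extension of particle<Access>:
using mass=member<int,4>;
...
void weigh()const{std::cout<<get(mass())<<"\n";}

// pre-existing user code keeps compiling and working unchanged, because
// render never mentions the new member:
auto p=make_particle(access<color,x,y>(&color_,&x_,&y_));
p.render();

// only new code calling weigh needs an access entity that includes mass:
int mass_=7;
auto q=make_particle(access<mass>(&mass_));
q.weigh();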
Implementing DOD enablement as a template policy also allows us to experiment with alternative access semantics. For instance, the tuple_storage utility
template<typename Tuple,std::size_t Index,typename... Members>
class tuple_storage_base;

template<typename Tuple,std::size_t Index>
class tuple_storage_base<Tuple,Index>:public Tuple
{
  struct inaccessible{};
public:
  using Tuple::Tuple;
  
  void get(inaccessible);
  
  Tuple&       tuple(){return *this;}
  const Tuple& tuple()const{return *this;}
};

template<
  typename Tuple,std::size_t Index,
  typename Member0,typename... Members
>
class tuple_storage_base<Tuple,Index,Member0,Members...>:
  public tuple_storage_base<Tuple,Index+1,Members...>
{
  using super=tuple_storage_base<Tuple,Index+1,Members...>;
  using type=typename Member0::type;

public:
  using super::super;
  using super::get;
  
  type&       get(Member0)
                {return std::get<Index>(this->tuple());}
  const type& get(Member0)const
                {return std::get<Index>(this->tuple());}  
};

template<typename... Members>
class tuple_storage:
  public tuple_storage_base<
    std::tuple<typename Members::type...>,0,Members...
  >
{
  using super=tuple_storage_base<
    std::tuple<typename Members::type...>,0,Members...
  >;
  
public:
  using super::super;
};
can be used to replace the external access policy with an object containing the data proper:
using storage=tuple_storage<color,x,y,dx,dy>;
auto r=make_particle(storage(3,100,10,10,-15));
auto s=r;
r.render();
r.move();
r.render();
s.render(); // different data than r
which effectively brings us back to the old OOP class with ownership semantics. (Also, it is easy to implement an access policy on top of tuple_storage that gives proxy semantics for tuple-based storage. This is left as an exercise for the reader.)
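As a hint of how the exercise might be approached (a sketch under the assumption that the tuple_storage object outlives its proxies, not a worked-out solution), such a policy can simply hold a pointer to an external tuple_storage and forward get to it:
template<typename Storage>
class tuple_access
{
  Storage* p;

public:
  tuple_access(Storage& s):p(&s){}

  // forward member access to the external storage (std::declval needs <utility>)
  template<typename Member>
  auto get(Member m)->decltype(std::declval<Storage&>().get(m))
  {
    return p->get(m);
  }
  template<typename Member>
  auto get(Member m)const->decltype(std::declval<const Storage&>().get(m))
  {
    return static_cast<const Storage&>(*p).get(m);
  }
};
...
storage st(3,100,10,10,-15);
auto t=make_particle(tuple_access<storage>(st)); // t is a proxy to st
t.move();   // modifies the data held in st
t.render();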
A C++11 example program is provided that puts to use the ideas we have presented.
Traversal is at the core of DOD, as the paradigm is oriented towards handling large numbers of like objects. In a later entry we will extend this framework to provide for easy object traversal and measure the resulting performance as compared with OOP.

Saturday, July 4, 2015

Allowing consumers into the advertising value chain

The online advertising market operates through a small number of service providers such as Google or Facebook, which aggregate consumers' attention and offer it to advertisers via a B2B model based on CPC, CPM or similar indicators. Companies determine their total expenditure in advertising A as a fixed percentage of their revenues or with some other formula that takes into account the estimated increase in revenues that ads might bring in as a result of exposure of the company's products to their potential customers. In any case, we can consider that this figure is an increasing function of consumer spending C:
A = A(C).
In the United States, digital advertising spend amounted to $49.5B or 0.3% of the country's GDP, roughly $155 per person. Figures for Spain are more modest: €1.1B or €23 per person.
Consumers receive no direct payback from this business other than access to the free services online providers offer them. What would happen if users could actually charge online service providers for their share of attention? At first blush there seem to be no changes in terms of national economy as we are merely redistributing online revenues among a different set of players. On closer analysis, though, we can predict a multiplier effect that works as follows:
  • Online providers' net income I is reduced by some quantity Ic that is directly received by users themselves.
  • Consumer income Y is increased by Ic and reduced by some unknown quantity L resulting from online providers' lower dividends (if any, as currently neither Google nor Facebook pay them), reduced market cap, adjustments in employees' salaries, etc. The key aspect is that, even if Ic = L and so Y remains the same, this redistribution will benefit lower-income residents, with a net effect of higher overall consumption, because these people pay proportionally less in taxes T (so, the national disposable income Yd grows) and have a higher marginal propensity to consume MPC.
  • Higher consumption increases demand and supply and produces a net gain in GDP (and online ad spending in its turn).
So, the entire cycle is summarized as
ΔA = (ΔA/ΔC) · MPC · Δ(Y − T).
Of course this analysis is extremely simple and there are additional factors that complicate it:
  • Tax avoidance practices by big online providers basically take all their income out of some countries (Spain, for instance) and divert it to others where taxes are lower. For the losing countries, redistribution of ad revenues among their residents is a win-win, as their national economies receive no input from online providers to begin with (that is, online income losses do not affect their GDP).
  • Global online advertising is currently a duopoly, and challenges to the existing value chain can be met with very aggressive opposition: no matter how big the multiplier effect ΔA/Ic, online providers cannot compensate their loss with increased revenues in a bigger advertising market, since ad spend is by definition a fraction of consumers' income. When French newspapers demanded a share of Google News' business, a modest agreement was settled after harsh disputes, but the same situation in Spain ended much worse for everybody.
  • In fact, threatening the very source of revenues of online service providers could damage their business to the point of making it unviable. Current operating margins for this market go from 25% to 35%, though, which seems to indicate otherwise.
Now, if users were to enter into the ad value chain, they would need some proxy on their behalf who has the power to oppose online service providers (for a share of the pie, at any rate). Two potential players come to mind:
  • ISPs,
  • Internet browser companies.
As online ad spend per person is actually not that high, we are talking here of potential revenues for the end user of a few dollars a year at most. This would be best provided by the proxy in the form of virtual goods or, in the case of ISPs, discounts in their services.

Monday, June 29, 2015

Design patterns for invariant suspension

One of the most important aspects of data encapsulation as embodied by classes and similar constructs is the ability to maintain class invariants by restricting access to the implementation data through a well defined interface. (Incidentally, a class without significant invariants is little more than a tuple of values, which is why the presence of getters and setters is a sure sign of poor design). Sometimes, however, maintaining the class invariant can lead to unacceptable costs in terms of complexity or performance. Consider two examples:
  • A sorted_vector class that maintains an internal buffer of ordered elements.
  • A spread_sheet class that implements automatic formula calculation.
These classes restrict the potential states of their implementation data so that they meet some extra requirements the interface relies on (for instance, the fact that the elements of sorted_vector are ordered allows for fast lookup operations, automatically calculated values in the spreadsheet are consistent with input data cells, etc.). State can only then be changed by using the provided interface, which takes proper care that the invariant is kept upon exit from any method —and this is all good and desirable.
The problem arises when we need to do bulk modifications on the internal data and the class interface is too rigid to support this:
  • sorted_vector typically only allows for deletion of elements or insertion of new values, and does not provide mechanisms for directly modifying the elements or temporarily rearranging them.
  • spread_sheet might only accept changing one data cell at a time so as to check which formulas are affected and recalculate those alone.
If we granted unrestricted access to the data, the class would have to renormalize (i.e. restore the invariant) inside every method. Let us see this for the case of sorted_vector:
class sorted_vector
{
public:
  value_type* data(); // unrestricted access to data
  iterator find(const key_type& x)const
  {
    sort();
    ...
  }
  ...
private:
  void sort()const; // would require the internal data to be declared mutable
};
This is obviously impracticable for efficiency reasons. Another alternative, sometimes found in questionable designs, is to put the burden of maintaining the invariant on the user herself:
class sorted_vector
{
public:
  value_type* data();
  void sort();
  iterator find(const key_type& x)const; // data must be sorted beforehand
  ...
};
The problem with this approach is that it is all too easy to forget to restore the invariant. What is worse, there is no way that the class can detect the overlook and raise some signal (an error code or an exception). A poor attempt at solving this:
class sorted_vector
{
public:
  value_type* data();
  void modifying_data() // to indicate invariant breach
  {
    sorted=false;
  }
  void sort();
  iterator find(const key_type& x)const
  {
    if(!sorted)throw std::runtime_error("not sorted");
    ...
  }
  ...
private:
  bool sorted;
  ...
};
is just as brittle as the previous situation and more cumbersome for the user. If we finally renounce giving access to the internal data, the only way to do bulk modifications of the internal state is constructing the data externally and then reassigning:
sorted_vector v;
std::vector<value_type> aux;
... // manipulate data in aux
v=sorted_vector(aux.begin(),aux.end());
This last option is not entirely bad, but incurs serious inefficiencies related to the external allocation of the auxiliary data and the recreation of a new sorted_vector object.
We look for design patterns that enable invariant suspension (i.e. unrestricted access to a class internal data) in a controlled and maximally efficient way.
Internal invariant suspension
In this pattern, we allow for suspension only within the execution of a class method:
class X
{
  template<typename F>
  void modify_data(F f); // f is invoked with X internal data
};
How does this work? The idea is that f is a user-provided piece of code that is given access to the internal data to manipulate freely. After f is done, modify_data, which issued the invocation, can renormalize the data before exiting. Let us see this in action for sorted_vector:
class sorted_vector
{
public:
  template<typename F>
  void modify_data(F f)
  {
    f(data);
    sort(); // restore invariant
  }
};
...
sorted_vector v;
...
v.modify_data([](value_type* data){
  ... // manipulate data as we see fit
});
This pattern is most suitable when modifications are relatively localized in code. When the modification takes place in a more uncontrolled setting, we need something even more flexible.
External invariant suspension
Here, we move the data out and accept it back again when the user is done with it.
class X
{
  data extract_data(); // X becomes "empty"
  void accept_data(data& d); // X should be "empty"
};
At first sight, this seems just as bad as our first example with sorted_vector::data(): as soon as we grant access to the internal state, we can no longer be sure the invariant is kept. The key aspect of the pattern, however, is that extract_data sets X to some empty state, i.e., a state without internal data where the invariant is trivially true in all cases —in other words, X gives its internal data away.
class sorted_vector
{
public:
  value_type* extract_data()
  {
    value_type* res=data;
    data=nullptr;
    return res;
  }
  void accept_data(value_type*& d)
  {
    if(data)throw std::runtime_error("not empty");
    // alternatively: delete data
    
    data=d;
    d=nullptr;
    sort(); // restore invariant
  }
};
...
sorted_vector v;
...
value_type* data=v.extract_data();
... // manipulate data
v.accept_data(data);
How does this meet our requirements of control and efficiency?
  • Control: the invariant of v is always true regardless of the user code: when invoking extract_data, v becomes empty, i.e. a vector with zero elements which is of course sorted. The methods of sorted_vector need not check for special cases of invariant suspension.
  • Efficiency: the data is never re-created, but merely moved around; no extra allocations are required.
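The same external pattern maps naturally onto standard containers and move semantics. The following sketch assumes the internal buffer is a std::vector (an assumption made for the sake of the example, not the representation used above); emptiness after extraction and ownership transfer then come essentially for free:
#include <algorithm>
#include <stdexcept>
#include <utility>
#include <vector>

using value_type=int; // stand-in element type for the sketch

class sorted_vector
{
public:
  std::vector<value_type> extract_data()
  {
    std::vector<value_type> res;
    res.swap(impl); // impl is left empty: the invariant holds trivially
    return res;
  }

  void accept_data(std::vector<value_type>&& d)
  {
    if(!impl.empty())throw std::runtime_error("not empty");
    impl=std::move(d);
    std::sort(impl.begin(),impl.end()); // restore invariant
  }

private:
  std::vector<value_type> impl;
};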
Conclusions
Class invariants are a key aspect of data encapsulation, but can lead to inefficient code when the class interface is too restrictive for data manipulation. We have described two general design patterns, internal invariant suspension and external invariant suspension, that allow for internal state modification in a controlled fashion without interfering with the class general interface or requiring any type of data copying.

Sunday, June 28, 2015

Traversing a linearized tree: cache friendliness analysis

We do a formal analysis of the cache-related behavior of traversing a levelorder_vector, a linear array of elements laid out according to the breadth-first order of a binary tree. Ignoring roundoff effects, let us assume that the array has n = 2^d − 1 elements and the CPU provides a cache system of N lines capable of holding L contiguous array elements each. We further assume that N > d.
Under these conditions, the elements of level i, being contiguous, are paged to 2^i/L lines.
The traversal of the array corresponds to the in-order sequence of the tree implicitly encoded, and it is easy to see that this meets the elements of each level i exactly in the order they are laid out in memory: so, given that N > d, i.e. the CPU can hold at least as many lines in cache as there are levels in the tree, an optimal caching scheduler will result in only 2^i/L misses per level, or n/L overall. The number of cache misses per element during traversal is then constant and equal to 1/L, which is the minimum attainable for any traversal scheme (like the simplest one consisting of walking the array linearly from first to last).
The question remains of whether we can justifiably hold the assumption that N > d in modern architectures. In the Intel Core family, for instance, the L1 data cache is 32 KB (per core) and the line size is 64 B, so N = 512. On the other hand, the maximum d possible in 64-bit architectures is precisely 64, therefore the condition N > d is indeed met by a large margin.

Saturday, June 27, 2015

Traversing a linearized tree

In a previous entry we have introduced the data structure levelorder_vector<T>, which provides faster binary search than sorted vectors (such as sported by Boost.Container flat associative containers) by laying out the elements in a linear fashion following the breadth-first or level order of the binary tree associated to the sequence. As usual with data structures, this is expected to come at a price in the form of poorer performance for other operations.
Let us begin with iteration: traversing a levelorder_vector is algorithmically equivalent to doing so in its homologous binary tree:
The picture above shows the three generic cases the algorithm for incrementing a position has to deal with, highlighted with the same color in the following pseudocode:
void increment(position& i)
{
  position j=right(i);
  if(j){                      // right child exists
    do{
      i=j;                    // take the path down
      j=left(i);              // and follow left children
    }while(j);
  }
  else if(is_last_leaf(i)){
    i=end;                    // next of last is end
  }
  else{
    while(is_right_child(i)){ // go up while the node is
      i=parent(i);            // the right child of its parent 
    }
    i=parent(i);              // and up an extra level
  }
}
In the case of levelorder_vector, positions are numerical indexes into an array, so these operations are purely arithmetic (by contrast, in a node-based tree we would need to follow pointers around, which is more expensive as we will see shortly):
  • parent(i) = (i − 1)/2,
  • right(i) = 2i +2 (if less than size),
  • left(i) = 2i + 1 (if less than size),
  • is_right_child(i) = (i mod 2 = 0),
  • is_last_leaf(i) = i does not have a right child and i + 2 is a power of 2.
The last identity might be a little trickier to understand: it simply relies on the fact that the last leaf of the tree must also be the last element of a fully occupied level, and the number of elements in a complete binary tree with d levels is 2^d − 1, hence the zero-based index of the element + 2 has to be a power of two.
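Putting the pseudocode and these identities together, a minimal sketch of the increment operation on raw indexes could look as follows (an illustration only, not the iterator code of the test program; n is the number of elements and the value n itself plays the role of the end position):
#include <cstddef>

bool is_power_of_two(std::size_t m){return (m&(m-1))==0;}

void increment(std::size_t& i,std::size_t n)
{
  std::size_t j=2*i+2;                // right child
  if(j<n){                            // right child exists
    do{
      i=j;                            // take the path down
      j=2*i+1;                        // and follow left children
    }while(j<n);
  }
  else if(is_power_of_two(i+2)){      // no right child and on the right
    i=n;                              // spine: last element, next is end
  }
  else{
    while(i%2==0)i=(i-1)/2;           // go up while i is a right child
    i=(i-1)/2;                        // and up an extra level
  }
}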
A test program is provided that implements forward iterators for levelorder_vector (backwards traversal is mostly symmetrical) and measures the time taken to traverse std::set, boost::container::flat_set and levelorder_vector instances holding n = 10,000 to 3 million integer elements.
MSVC on Windows
Test compiled with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.
Traversal time / number of elements.
As expected, boost::container::flat_set is the fastest: nothing (single threaded) can beat traversing an array of memory in sequential order. levelorder_vector is around 5x slower than boost::container::flat_set but up to 4x faster than std::set. More interestingly, its traversal time per element is virtually unaffected by the size of the container. This can be further confirmed by measuring a non-touching traversal where the range [begin, end) is walked through without actually dereferencing the iterator, and consequently without accessing the container memory at all: resulting times remain almost the same, which indicates that the traversal time is dominated by the arithmetic operations on indexes. In a later entry we provide the formal proof that traversing a levelorder_vector is indeed cache-oblivious.
GCC on Linux
Contributed by Manu Sánchez (GCC 5.1, Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz) and Laurentiu Nicola (GCC 5.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz). Rather than showing the plots, we provide the observed ranges in execution speed ratios between the three containers (for instance, the cell "10-30" means that boost::container::flat_set is between 10 and 30 times faster than levelorder_vector for Intel Core i7).
                    flat_set /            levelorder_vector /
                    levelorder_vector     std::set
i7                  10-30                 3-8
Celeron             10-12                 2.5-4.5
Clang on Linux
Clang 3.6.1 with libstdc++-v3 and libc++, Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz; Clang 3.6.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz.
                    flat_set /            levelorder_vector /
                    levelorder_vector     std::set
libstdc++-v3, i7    10-50                 2-7
libc++, i7          10-50                 2.5-8
Celeron             15-30                 2.5-4.5
Conclusions
Traversing a levelorder_vector takes a constant time per element that sits in between those of boost::container::flat_set (10-50 times slower, depending on number of elements and environment) and std::set (2-8 times faster). Users need to consider their particular execution scenarios to balance this fact vs. the superior performance offered at lookup.

Thursday, June 25, 2015

Cache-friendly binary search

High-speed memory caches present in modern computer architectures favor data structures with good locality of reference, i.e. the property by which elements accessed in sequence are located in memory addresses close to each other. This is the rationale behind classes such as Boost.Container flat associative containers, which emulate the functionality of standard C++ node-based associative containers while storing the elements contiguously (and in order). This is an example of how binary search works in a boost::container::flat_set with elements 0 through 30:
As searching narrows down towards the looked-for value, visited elements are progressively closer, so locality of reference is very good for the last part of  the process, even if the first steps hop over large distances. All in all, the process is more cache-friendly than searching in a regular std::set (where elements are totally scattered throughout memory according to the allocator's whims) and resulting lookup times are consequently much better. But we can get even more cache-friendly than that.
Consider the structure of a classical red-black tree holding values 0 to 30:
and lay out the elements in a contiguous array following the tree's breadth-first (or level) order:
We can perform binary search on this level-order vector beginning with the first element (the root) and then jumping to either the "left" or "right" child as comparisons indicate:
The hopping pattern is somewhat different from that of boost::container::flat_set: here, the first visited elements are close to each other and hops become increasingly longer. So, we get good locality of reference for the first part of the process, exactly the opposite of before. It could seem then that we are no better off with this new layout, but in fact we are. Let us have a look at the heat map of boost::container::flat_set:
This shows how frequently a given element is visited through an average binary search operation (brighter means "hotter", that is, accessed more). As every lookup begins at the mid element 15, this gets visited 100% of the times, then 7 and 23 have 50% visit frequency, etc. On the other hand, the analogous heat map for a level-order vector looks very different:
Now hotter elements are much closer: as cache management mechanisms try to keep hot areas longer in cache, we can then expect this layout to result in fewer cache misses overall. To summarize, we are improving locality of reference for hotter elements at the expense of colder areas of the array.
Let us measure this data structure in practice. I have written a prototype class template levelorder_vector<T> that stores its contents in level-order sequence and provides a member function for binary search:
const_iterator lower_bound(const T& x)const
{
  size_type n=impl.size(),i=n,j=0;
  while(j<n){
    if(impl[j]<x){
      j=2*j+2;
    }
    else{
      i=j;
      j=2*j+1;
    }
  }
  return begin()+i;
}
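The prototype has to build this layout in the first place; one possible way to do so (a sketch for illustration, not necessarily the class's actual construction code) is to fill the level-order array through an in-order walk of the implicit tree, consuming a sorted input as we go:
#include <cstddef>
#include <vector>

// copies sorted[next], sorted[next+1], ... into the subtree rooted at
// position j of the level-order array out (children of j at 2j+1 and 2j+2)
template<typename T>
void fill_levelorder(
  const std::vector<T>& sorted,std::vector<T>& out,
  std::size_t j,std::size_t& next)
{
  if(j>=out.size())return;
  fill_levelorder(sorted,out,2*j+1,next); // left subtree: smaller elements
  out[j]=sorted[next++];                  // this node
  fill_levelorder(sorted,out,2*j+2,next); // right subtree: larger elements
}

template<typename T>
std::vector<T> make_levelorder(const std::vector<T>& sorted)
{
  std::vector<T> out(sorted.size());
  std::size_t next=0;
  fill_levelorder(sorted,out,0,next);
  return out;
}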
A test program is provided that measures lower_bound execution times with a random sequence of values on containers std::set, boost::container::flat_set and levelorder_vector holding n = 10,000 to 3 million elements.
MSVC on Windows
Test built with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz. Depicted values are microseconds/n, where n is the number of elements in the container.
lower_bound execution times / number of elements.
The theoretical lookup times for the three containers are O(log n), which should show as straight lines (the horizontal scale is logarithmic), but increasing cache misses as n grows result in some upwards bending. As predicted, levelorder_vector performs better than boost::container::flat_set, with improvement factors ranging from 10% to 30%. Both containers are nevertheless much faster than std::set, and don't show much degradation until n ≃ 5·10^5 (~7·10^4 for std::set), which is a direct consequence of their superior cache friendliness.
GCC on Linux
The following people have kindly sent in test outputs for GCC on various Linux platforms: Philippe Navaux, xckd, Manu Sánchez. Results are very similar, so we just show those corresponding to GCC 5.1 on an Intel Core i7-860 @ 2.80GHz with Linux kernel 4.0 from Debian:
lower_bound execution times / number of elements.
Although levelorder_vector performs generally better than boost::container::flat_set, there is a strange phenomenon for which I don't have an explanation: improvement factors are around 40% for low values of n, but then decrease or even become negative as n approaches 3·10^6. I'd be grateful if readers can offer interpretations of this.
Laurentiu Nicola has later submitted results with -march=native -O2 settings that differ significantly from the previous ones: GCC 5.1 on an Intel Core i7-2760QM CPU @ 2.40GHz (Linux distro undisclosed):
lower_bound execution times / number of elements.
and GCC 5.1 on an Intel Celeron J1900 @ 1.99GHz:
lower_bound execution times / number of elements.
-march=native seems to make a big difference: now levelorder_vector execution times are up to 25% (i7) and 40% (Celeron) lower than those of boost::container::flat_set when n approaches 3·10^6, but there is no significant gain for low values of n.
Clang on OS X
As provided by fortch: clang-500.2.79 in -std=c++0x mode on a Darwin 13.1.0 box with an Intel Core i7-3740QM CPU @ 2.70GHz.
lower_bound execution times / number of elements.
The improvement of levelorder_vector with respect to boost::container::flat_set ranges from 10% to 20%.
Clang on Linux
Manu Sánchez has run the tests with Clang 3.6.1 on an Arch Linux 3.16 box with an Intel Core i7 4790k @4.5GHz. In the first case he has used the libstdc++-v3 standard library:
lower_bound execution times / number of elements.
and then repeated with libc++:
lower_bound execution times / number of elements.
Results are similar, with levelorder_vector execution speeds between 5% and 20% less than those of boost::container::flat_set. The very noticeable bumps on the graphs for these two containers when using libc++ are also present in the tests for Clang on OS X but do not appear if the standard library used is libstdc++-v3, which suggests that they might be due to some artifact related to the libc++ memory allocator.
Laurentiu's results for Clang 3.6.1 with -march=native -O2 are similar to Manu's for an i7 CPU, whereas for an Intel Celeron J1900 @ 1.99GHz we have:
lower_bound execution times / number of elements.
where levelorder_vector execution times are up to 35% lower than those of boost::container::flat_set.
Conclusions
Even though binary search on sorted vectors performs much better than on node-based data structures, alternative contiguous layouts can achieve even more cache friendliness. We have shown a possibility based on arranging elements according to the breadth-first order of a binary tree holding the sorted sequence. This data structure can possibly perform worse, though, in other operations like insertion and deletion: we defer the study of this matter to a future entry.

Friday, May 29, 2015

A case in political reasoning

Municipal elections in Spain have just been run with interesting results. In Barcelona, the nationalist ruling party CiU, noted for its recent shift towards unreserved support for the independence of Catalonia, has lost to Barcelona en Comú, a political alliance loosely associated with left-wing Podemos and ambivalent on these matters. In Madrid, the party in office and undisputed winner for more than twenty years, right-wing PP, has lost a great deal of support, almost yielding first place to Ahora Madrid, also linked to Podemos, which is now in a better position to form a government. Commenting on the elections, CiU's Artur Mas, president of the Catalonian regional government, says (translation mine):
Mas has discarded the possibility that CiU's defeat in Barcelona has been influenced by the party's pro-independence discourse. "Madrid might also have a mayor from Podemos, and there is no independentist movement there", he said.
Is Mr. Mas's argument sound? Let us analyze it formally.
Logical analysis
Write
i(x) = party in power in county x supports independence of the region,
p(x) = Podemos (or associated alliances) gains support in x over the party in power,
where x ranges over all counties in Spain. Then, Mas's reasoning can be formulated as
It is not the case that ∀x (i(x) → p(x)), because ¬i(Madrid) ∧ p(Madrid).
Put this way, the argument is blatantly false: a refutation of ∀x (i(x) → p(x)) would ask for a city with a pro-independence governing party where Podemos did not succeed, which is not the case of Madrid. Let us try another approach.
Probabilistic analysis
If Mas was thinking in statistical terms, a plausible formulation could be
It is not the case that P(p(x) | i(x)) > P(p(x)), because ¬i(Madrid) ∧ p(Madrid),
that is, the probability that Podemos gains power in a pro-independence context is not higher than in general, as supported by the results in Madrid. If we write
PI = number of counties with pro-independence councils where Podemos wins,
I = number of counties with pro-independence councils where Podemos runs,
P = number of counties where Podemos wins,
N = total number of counties where Podemos runs,
then Mas affirms that
PI / I ≤ P / N
or, equivalently,
D = (P / N) − (PI / I) ≥ 0.
Does the fact that ¬i(Madrid) ∧ p(Madrid) give any support to this contention? Let
D' = (P' / (N − 1)) − (PI' / I')
be the value resulting from considering every county in Spain where Podemos ran except Madrid. When taking Madrid into account, we have
D = ((P' + 1) / N) − (PI' / I')
and, consequently,
D − D' = ((P' + 1) / N) − (P' / (N − 1)) = (N − 1 − P') / (N(N − 1)),
which is always non-negative (since P' is never greater than N − 1), and furthermore strictly positive if P' < N − 1, that is, if Podemos has lost somewhere (which is certainly the case). So, knowing about the existence of counties without pro-independence councils where Podemos won increases the chances that, in effect, D ≥ 0 and, consequently, that independentism did not play a beneficial role in the success of this party. But if this knowledge is of just one county out of the hundreds where Podemos ran, the degree of extra confidence we gain is extremely small. This is the very weak sense in which Mas's argument can be held valid.