Thursday, July 28, 2016

Passing capturing C++ lambda functions as function pointers

Suppose we have a function accepting a C-style callback function like this:
void do_something(void (*callback)())
{
  ...
  callback();
}
As captureless C++ lambda functions can be cast to regular function pointers, the following works as expected:
auto callback=[](){std::cout<<"callback called\n";};
do_something(callback);

output: callback called
Unfortunately , if our callback code captures some variable from the context, we are out of luck
int num_callbacks=0;
···
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
do_something(callback);

error: cannot convert 'main()::<lambda>' to 'void (*)()'
because capturing lambda functions create a closure of the used context that needs to be carried around to the point of invocation. If we are allowed to modify do_something we can easily circumvent the problem by accepting a more powerful std::function-based callback:
void do_something(std::function<void> callback)
{
  ...
  callback();
}

int num_callbacks=0;
...
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
do_something(callback);

output: callback called 1 times
but we want to explore the challenge when this is not available (maybe because do_something is legacy C code, or because we do not want to incur the runtime penalty associated with std::function's usage of dynamic memory). Typically, C-style callback APIs accept an additional callback argument through a type-erased void*:
void do_something(void(*callback)(void*),void* callback_arg)
{
  ...
  callback(callback_arg);
}
and this is actually the only bit we need to force our capturing lambda function through do_something. The gist of the trick is passing the lambda function as the callback argument and providing a captureless thunk as the callback function pointer:
int num_callbacks=0;
...
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
auto thunk=[](void* arg){ // note thunk is captureless
  (*static_cast<decltype(callback)*>(arg))();
};
do_something(thunk,&callback);

output: callback called 1 times
Note that we are not using dynamic memory nor doing any extra copying of the captured data, since callback is accessed in the point of invocation through a pointer; so, this technique can be advantageous even if modern std::functions could be used instead. The caveat is that the user code must make sure that captured data is alive when the callback is invoked (which is not the case when execution happens after scope exit if, for instance, it is carried out in a different thread).
Postcript
Tcbrindle poses the issue of lambda functions casting to function pointers with C++ linkage, where C linkage may be needed. Although this is rarely a problem in practice, it can be solved through another layer of indirection:
extern "C" void do_something(
  void(*callback)(void*),void* callback_arg)
{
  ...
  callback(callback_arg);
}

...

using callback_pair=std::pair<void(*)(void*),void*>;

extern "C" void call_thunk(void * arg)
{
  callback_pair* p=static_cast<callback_pair*>(arg);
  p->first(p->second);
}
...
int num_callbacks=0;
...
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
auto thunk=[](void* arg){ // note thunk is captureless
  (*static_cast<decltype(callback)*>(arg))();
};
callback_pair p{thunk,&callback};
do_something(call_thunk,&p);

output: callback called 1 times

Sunday, February 21, 2016

A formal definition of mutation independence

Louis Dionne poses the problem of move independence in the context of C++, that is, under which conditions a sequence of operations
f(std::move(x));
g(std::move(x));
is sound in the sense that the first does not interfere with the second. We give here a functional definition for this property that can be applied to the case Louis discusses.
Let X be some type and functions f: X T×X and g: X Q×X. The impurity of a non-functional construct in an imperative language such as C++ is captured in this functional setting by the fact that these functions return, besides the output value itself, a new, possibly changed, value of X. We denote by fT and fX the projection of f onto T and X, respectively, and similarly for g. We say that f does not affect g if
 gQ(x) = gQ(fX(x)) ∀xX.
If we define the equivalence relationship ~g in X as
x ~g y iff gQ(x) = gQ(y),
then f does not affect g iff
fX(x) ~g xxX
or
fX([x]g) ⊆ [x]gxX,
where [x]g is the equivalence class of x under ~g.
We say that f and g are mutation-independent if f does not affect g and g does not affect f, that is,
fX([x]g) ⊆ [x]g and gX([x]f) ⊆ [x]fxX,
The following considers the case of f and g acting on separate components of a tuple: suppose that X = X1×X2 and f and g depend on and mutate X1 and X2 alone, respectively, or put more formally:
fT(x1,x2) = fT(x1,x'2),
fX2(x1,x2) = x2,
gQ(x1,x2) = gQ(x'1,x2),
gX1(x1,x2) = x1
for all x1, x'1X1, x2, x'2X2. Then  f and g are mutation-independent (proof trivial). Getting back to C++, given a tuple x, two operations of the form:
f(std::get<i>(std::move(x)));
g(std::get<j>(std::move(x)));
are mutation-independent if i!=j; this can be extended to the case where f and g read from (but not write to) any component of x except the j-th and i-th, respectively.

Monday, January 11, 2016

(Oil+tax)-free Spanish gas prices 2014-15

We use the data gathered at our hysteresis analysis of Spanish gas prices for 2014 and 2015 to gain further insight on their dynamics. This is a simple breakdown of gas (or gasoil) price:
Price = oil cost + other costs + taxes + margin.
A barrel of crude oil is refined into several final products totalling approximately the same amount of volume, that is, it takes roughly one liter of crude oil to produce one liter of gas (or gasoil). The simplest allocation model is to use market Brent prices as the oil cost for fuel production (we will see more realistic models later). If we eliminate taxes and oil cost, what remains in the fuel price is other costs plus margin. We plot this number for 95 octane gas and gasoil compared with Brent oil price, all in c€/l, for the period 2014-2015:
(Oil+tax)-free fuel price, simple cost allocation model [c€/l]
Brent oil cost [c€/l]
When we factor out crude oil cost, the remaning parts of the price increase moderately (~25% for gasoline, ~15% for gas). In a scenario of oil price reduction, oil direct costs as a percentage of tax-free fuel prices have consequently dropped from 70% to 50%:
Oil direct cost / tax-free fuel price,simple cost allocation model
Value-based cost allocation
Crude oil is refined into several final products from high-quality fuel to asphalt, plastic etc. The EIA provides typical yield data for US refineries that we can use as a reasonable approximation to the Spanish case. The volume breakdown we are interested in is roughly:
  • Gas: 45%
  • Gasoil: 30%
  • Other products: 37%
(Note that the sum is greater than 100% because additional components are mixed in the process). Now, as these products have very different prices in the market, it is natural to allocate oil costs proportionally to end-user value:
pricetotal45% pricegasoline + 30% pricegasoil + 37% priceother ,
costgasoline = costoil × pricegasoline / pricetotal ,
costgas = costoil × pricegas / pricetotal
(prices without taxes). Since it is difficult to obtain accurate data on prices for the remaining products, we consider two conventional scenarios where these products are valued at 50% and 25% of the average fuel price, respectively:
  • A: priceother = 50% (pricegasoline + pricegasoil)/2
  • B: priceother = 25% (pricegasoline + pricegasoil)/2
The figure depicts resulting prices without oil costs or taxes (i.e. other costs plus margin):
(Oil+tax)-free fuel price, value-based cost allocation [c€/l]
Brent oil cost [c€/l]
Unlike with our previous, naïve allocation model, here we see, both in scenarios A and B, that margins for gasoline and gas match very precisely almost all the time: this can be seen as further indication that value-based cost allocation is indeed the model used by gas companies themselves. Visual inspection reveals two insights:
  • Short-term, margin fluctuations are countercyclical to oil price. This might be due to an effort from companies to stabilize prices.
  • In the two-year period studied, margins grow very much, around 30% for scenario A and 60% for scenario B. This trend has been somewhat corrected in the second half of 2015, though.
The percentual contribution of oil costs to fuel prices (which is by virtue of the cost allocation model exactly the same for gasoline and gas) drops in 2014-15 from 75% to 55% (scenario A) and from 85% to 60% (scenario B).
Oil direct cost / tax-free fuel price, value-based cost allocation

Gas price hysteresis, Spain 2015

We begin the new year redoing our hysteresis analysis for Spanish gas prices with data from 2015, obtained from the usual sources:
The figure shows the weekly evolution during 2015 of prices of Brent oil and average retail prices without taxes of 95 octane gas and gasoil in Spain, all in c€ per liter.
For gasoline, the corresponding scatter plot of Δ(gasoline price before taxes) against Δ(Brent price) is
with linear regressions for the entire graph and both semiplanes Δ(Brent price) ≥ 0 and ≤ 0, given by
overall → y = f(x) = b + mx = −0.1210 + 0.2554x,
ΔBrent ≥ 0 → y = f+(x) = b+ + m+x = 0.2866 − 0.0824x,
ΔBrent ≤ 0 → y = f(x) = b + mx = 0.3552 + 0.4040x.
Due to the outlier in the right lower corner (with date August 31), positive variations in oil price don't translate, in average, as positive increments in the price of gasoline. The most worrisome aspect is the fact that b+ and are b positive, which suggests an underlying trend to increase prices when oil is stable.
For gasoil we have
with regressions
overall → y = f(x) = b + mx = −0.0672 + 0.3538x,
ΔBrent ≥ 0 → y = f+(x) = b+ + m+x = −0.2457 + 0.2013x,
ΔBrent ≤ 0 → y = f(x) = b + mx = 0.2468 + 0.3956x.
Again, no "rocket and feather" effect here (in fact,  m+ is slightly smaller than m). Variations around ΔBrent = 0 are fairly symmetrical and, seemingly, fair.

Monday, December 28, 2015

How likely?

Yesterday, CUP political party held a general assembly to determine whether to support or not Artur Mas's candidacy to President of the Catalonian regional government. The final voting round among 3,030 representatives ended up in an exact 1,515/1,515 tie, leaving the question unsolved for the moment being. Such an unexpected result has prompted a flurry of Internet activity about the mathematical probability of its occurrence.
The question "how likely was this result to happen?" is of course unanswerable without a specification of the context (i.e. the probability space) we choose to frame the event. A plausible formulation is:
If a proportion p of CUP voters are pro-Mas, how likely is it that a random sample based on 3,030 individuals yields a 50/50 tie?
The simple answer (assuming the number of CUP voters is much larger that 3,030) is Pp(1,015 | 3,030), where Pp(n | N) is the binomial distribution of N Bernouilli trials with probability p resulting in exactly n successes.
The figure shows this value for 40% ≤ p ≤ 60%. At p = 50%, which without further information is our best estimation of pro-Mas supporters among CUP voters, the probability of a tie is 1.45%. A deviation in p of ±4% would have made this result virtually impossible.
A slightly more interesting question is the following:
If a proportion p of CUP voters are pro-Mas, how likely is a random sample of 3,030 individuals to misestimate the majority opinion?
When p is in the vicinity of 50%, there is a non-negligible probability that the assembly vote come up with the wrong (i.e. against voters' wishes) result. This probability is
Ip(1,516, 1,515) if p < 50%,
1 − Pp(1,015 | 3,030) if p = 50%,
I1−p(1,516, 1,515) if p > 50%,
where Ip(a,b) is the regularized beta function. The figure shows the corresponding graph for 3,030 representatives and 40% ≤ p ≤ 60%.
The function shows a discontinuity at the singular (and zero-probability) event p = 50%, in which case the assembly will yield the wrong result always except for the previously studied situation that there is an exact tie (so, the probability of misestimation is 1 − 1.45% = 98.55 %). Other than this, the likelihood of misestimation approaches 49%+ as p tends to 50%. We have learnt that CUP voters are almost evenly divided between pro- and anti-Mas: if the difference between both positions is 0.7% or less, an assembly of 3,030 representatives such as held yesterday will fail to reflect the party's global position in more than 1 out of 5 cases.

Saturday, November 14, 2015

SOA container for encapsulated C++ DOD

In a previous entry we saw how to decouple the logic of a class from the access to its member data so that the latter can be laid out in a DOD-friendly fashion for faster sequential processing. Instead of having a std::vector of, say, particles, now we can store the different particle members (position, velocity, etc.) in separate containers. This unfortunately results in more cumbersome initialization code: whereas for the traditional, OOP approach particle creation and access is compact and nicely localized:
std::vector<plain_particle> pp_;
...
for(std::size_t i=0;i<n;++i){
  pp_.push_back(plain_particle(...));
}
...
render(pp_.begin(),pp_.end());
when using DOD, in contrast, the equivalent code grows linearly with the number of members, even if most of it is boilerplate:
std::vector<char> color_;
std::vector<int>  x_,y_,dx_,dy_;
...
for(std::size_t i=0;i<n;++i){
  color_.push_back(...);
  x_.push_back(...);
  y_.push_back(...);
  dx_.push_back(...);
  dy_.push_back(...);  
}
...
auto beg_=make_pointer<particle>(
  access(&color_[0],&x_[0],&y_[0],&dx_[0],&dy_[0])),
auto end_=beg_dod+n;
render(beg_,end_);
We would like to rely on a container using SOA (structure of arrays) for its storage that allows us to retain our original OOP syntax:
using access=dod::access<color,x,y,dx,dy>;
dod::vector<particle<access>> p_;
...
for(std::size_t i=0;i<n;++i){
  p_.emplace_back(...);
}
...
render(p_.begin(),p_.end());
Note that particles are inserted into the container using emplace_back rather than push_back: this is due to the fact that a particle object (which push_back accepts as its argument) cannot be created out of the blue without its constituent members being previously stored somewhere; emplace_back, on the other hand, does not suffer from this chicken-and-egg problem.
The implementation of such a container class is fairly straightfoward (limited here to the operations required to make the previous code work):
namespace dod{

template<typename Access>
class vector_base;

template<>
class vector_base<access<>>
{
protected:
  access<> data(){return {};}
  void emplace_back(){}
};

template<typename Member0,typename... Members>
class vector_base<access<Member0,Members...>>:
  protected vector_base<access<Members...>>
{
  using super=vector_base<access<Members...>>;
  using type=typename Member0::type;
  using impl=std::vector<type>;
  using size_type=typename impl::size_type;
  impl v;
  
protected:
  access<Member0,Members...> data()
  {
    return {v.data(),super::data()};
  }

  size_type size()const{return v.size();}

  template<typename Arg0,typename... Args>
  void emplace_back(Arg0&& arg0,Args&&... args){
    v.emplace_back(std::forward<Arg0>(arg0));
    try{
      super::emplace_back(std::forward<Args>(args)...);
    }
    catch(...){
      v.pop_back();
      throw;
    }
  }
};
  
template<typename T> class vector;
 
template<template <typename> class Class,typename Access> 
class vector<Class<Access>>:protected vector_base<Access>
{
  using super=vector_base<Access>;
  
public:
  using iterator=pointer<Class<Access>>;
  
  iterator begin(){return super::data();}
  iterator end(){return this->begin()+super::size();}
  using super::emplace_back;
};

} // namespace dod
dod::vector<Class<Members...>> derives from an implementation class that holds a std::vector for each of the Members declared. Inserting elements is just a simple matter of multiplexing to the vectors, and begin and end return dod::pointers to this structure of arrays. From the point of view of the user all the necessary magic is hidden by the framework and DOD processing becomes nearly identical in syntax to OOP.
We provide a test program that exercises dod::vector against the classical OOP approach based on a std::vector of plain (i.e., non DOD) particles. Results are the same as previously discussed when we used DOD with manual initialization, that is, there is no abstraction penalty associated to using dod::vector, so we won't present any additional figures here.
The framework we have constructed so far provides the bare minimum needed to test the ideas presented. In order to be fully usable there are various aspects that should be expanded upon:
  • access<Members...> just considers the case where each member is stored separately. Sometimes the most efficient layout will call for mixed scenarios where some of the members are grouped together. This can be modelled, for instance, by having member accept multiple pieces of data in its declaration.
  • dod::pointer does not properly implement const access, that is, pointer<const particle<...>> does not compile.
  • dod::vector should be implemented to provide the full interface of a proper vector class.
All of this can be in principle tackled without serious design dificulties.

Sunday, September 6, 2015

C++ encapsulation for Data-Oriented Design: performance

(Many thanks to Manu Sánchez for his help with running tests and analyzing results.)
In a past entry, we implemented a little C++ framework that allows us to do DOD while retaining some of the encapsulation benefits and the general look and feel of traditional object-based programming. We complete here the framework by adding a critical piece from the point of view of usability, namely the ability to process sequences of DOD entities with as terse a syntax as we would have in OOP.
To enable DOD for a particular class (like the particle we used in the previous entry), i.e., to distribute its different data members in separate memory locations, we change the class source code to turn it into a class template particle<Access> where Access is a framework-provided entity in charge of granting access to the external data members with a similar syntax as if they were an integral part of the class itself. Now, particle<Access> is no longer a regular class with value semantics, but a mere proxy to the external data without ownership to it. Importantly, it is the members and not the particle objects that are stored: particles are constructed on the fly when needed to use its interface in order to process the data. So, code like
for(const auto& p:particle_)p.render();
cannot possibly work because the application does not have any particle_ container to begin with: instead, the information is stored in separate locations:
std::vector<char> color_;
std::vector<int>  x_,y_,dx_,dy_;
and "traversing" the particles requires that we go through the associated containers in parallel and invoke render on a temporary particle object constructed out of them:
auto itc=&color_[0],ec=itc+color_.size();
auto itx=&x_[0];
auto ity=&y_[0];
auto itdx=&dx_[0];
auto itdy=&dy_[0];
  
for(;itc!=ec;++itc,++itx,++ity,++itdx,++itdy){
  auto p=make_particle(
    access<color,x,y,dx,dy>(itc,itx,ity,itdx,itdy));
  p.render();
}
Fortunately, this boilerplate code can be hidden by the framework by using these auxiliary constructs:
template<typename T> class pointer;

template<template <typename> class Class,typename Access>
class pointer<Class<Access>>
{
  // behaves as Class<Access>>*
};

template<template <typename> class Class,typename Access>
pointer<Class<Access>> make_pointer(const Access& a)
{
  return pointer<Class<Access>>(a);
}
We won't delve into the implementation details of pointer (the interested reader can see the actual code in the test program given below): from the point of view of the user, this utility class accepts an access entity, which is a collection of pointers to the data members plus an offset member (this offset has been added to the former version of the framework), it keeps everything in sync when doing pointer arithmetic and dereferences to a temporary particle object. The resulting user code is as simple as it gets:
auto n=color_.size();
auto beg_=make_pointer<particle>(access<color,x,y,dx,dy>(
  &color_[0],&x_[0],&y_[0],&dx_[0],&dy_[0]));
auto end_=beg_+n;
  
for(auto it=beg_;it!=end_;++it)it->render();
Index-based traversal is also possible:
for(std::size_t i=0;i<n;++i)beg_[i].render();
Once the containers are populated and beg_ and end_ defined, user code can handle particles as if they were stored in [beg_, end_), thus effectively isolated from the fact that the actual data is scattered around different containers for maximum processing performance.
Are we paying an abstraction penalty for the convenience this framework affords? There are two sources of concern:
  • Even though traversal code is in principle equivalent to hand-written DOD code, compilers might not be able to optimize all the template scaffolding away.
  • Traversing with access<color,x,y,dx,dy> for rendering when only color, x and y are needed (because render does not access dx or dy) involves iterating over dx_ and dy_ without actually accessing either one: again, the compiler might or might not optimize this extra code.
We provide a test program (Boost required) that measures the performance of this framework against some alternatives. The looped-over render procedure simply updates a global variable so that resulting execution times are basically those of the produced iteration code. The different options compared are:
  • oop: iteration over a traditional object-based structure
  • raw: hand-written data-processing loop
  • dod: DOD framework with access<color,x,y,dx,dy>
  • render_dod: DOD framework with  access<color,x,y>
  • oop[i]: index-based access instead of iterator traversal
  • raw[i]: hand-written index-based loop
  • dod[i]: index-based with access<color,x,y,dx,dy>
  • render_dod[i]: index-based with access<color,x,y>
The difference between dod and render_dod (and the same applies to their index-based variants) is that the latter keeps access only to the data members strictly required by render: if the compiler were not able to optimize unnecessary pointer manipulations in dod, render_dod would be expected to be faster; the drawback is that this would require fine tuning the access entity for each member function.
Manu Sánchez has set up an extensive testing environment to build and run the program using different compilers and machines:
The figures show the release-mode execution times of the eight options described above when traversing sequences of n = 104, 105, 106 and 107 particles.
GCC 5.1, MinGW, Intel Core i7-4790k @4.5GHz
Execution times / number of elements.
As expected, OOP is the slowest due to cache effects. The rest of options are basically equivalent, which shows that GCC is able to entirely optimize away the syntactic niceties brought in by our DOD framework.
MSVC 14.0, Windows, Intel Core i7-4790k @4.5GHz
Execution times / number of elements.
Here, again, all DOD options are roughly equivalent, although raw (pointer-based hand-written loop) is slightly slower. Curiously enough, MSVC is much worse at optimizing DOD with respect to OOP than GCC is, with execution times up to 4 times higher for n = 104 and 1.3 times higher for n = 107, the latter scenario being presumably dominated by cache efficiencies.
GCC 5.2, Linux, AMD A6-1450 APU @1.0 GHz
Execution times / number of elements.
From a qualitative point of view, these results are in line with those obtained for GCC 5.1 under an Intel Core i7, although as the AMD A6 is a much less powerful processor execution times are higher (×8-10 for n = 104, ×4-5.5 for n = 107).
Clang 3.6, Linux, AMD A6-1450 APU @1.0 GHz
Execution times / number of elements.
As it happens with the rest of compilers, DOD options (both manual and framework-supported) perform equally well. However, the comparison with GCC 5.2 on the same machine shows important differences: iterator-based OOP is faster (×1.1-1.4) in Clang, index-based OOP yields the same results for both compilers, and the DOD options in Clang are consistently slower (×2.3-3.4) than in GCC, to the point that OOP outperforms them for low values of n. A detailed analysis of the assembly code produced would probably gain us more insight into these contrasting behaviors: interested readers can access the resulting assembly listings at the associated GitHub repository.