Saturday, July 4, 2015

Allowing consumers into the advertising value chain

The online advertising market operates through a small number of service providers, such as Google or Facebook, which aggregate consumers' attention and offer it to advertisers via a B2B model based on CPC, CPM or similar indicators. Companies determine their total advertising expenditure A as a fixed percentage of their revenues or with some other formula that takes into account the estimated increase in revenues that ads might bring in as a result of exposing the company's products to potential customers. In any case, we can consider this figure an increasing function of consumer spending C:
A = A(C).
In the United States, digital advertising spend amounted to $49.5B, or 0.3% of the country's GDP, roughly $155 per person. Figures for Spain are more modest: €1.1B, or €23 per person.
Consumers receive no direct payback from this business other than access to the free services online providers offer them. What would happen if users could actually charge online service providers for their share of attention? At first blush there seem to be no changes in terms of national economy as we are merely redistributing online revenues among a different set of players. On closer analysis, though, we can predict a multiplier effect that works as follows:
  • Online providers' net income I is reduced by some quantity Ic that is directly received by users themselves.
  • Consumer income Y is increased by Ic and decreased by some unknown quantity L resulting from online providers' lower dividends (if any, as currently neither Google nor Facebook pays them), reduced market cap, adjustments in employees' salaries, etc. The key aspect is that, even if Ic = L and so Y remains the same, this redistribution will benefit lower-income residents, with a net effect of higher overall consumption, because these people pay proportionally less taxes T (so the national disposable income Yd grows) and have a higher marginal propensity to consume MPC.
  • Higher consumption increases demand and supply and produces a net gain in GDP (and online ad spending in its turn).
So, the entire cycle is summarized as
ΔA = (ΔA/ΔC) · MPC · Δ(Y − T).
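To get a feel for the orders of magnitude involved (with purely illustrative figures, not actual estimates): if Ic = €1B were shifted from online providers to consumers, disposable income would grow by roughly Δ(Y − T) ≈ €1B; at an MPC of 0.9 for the receiving households this yields ΔC ≈ €0.9B of extra consumption, and with ΔA/ΔC in the vicinity of the 0.3%-of-GDP share quoted above, the ad market itself would grow by only ΔA ≈ €2.7M, three orders of magnitude below the Ic given away.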
Of course this analysis is extremely simple and there are additional factors that complicate it:
  • Tax avoidance practices by big online providers basically take all their income out of some countries (Spain, for instance) and divert it to others where taxes are lower. For the losing countries, redistribution of ad revenues among their residents is a win-win, as their national economies receive no input from online providers to begin with (that is, online income losses do not affect their GDP).
  • Global online advertising is currently a duopoly, and challenges to the existing value chain can meet very aggressive opposition: no matter how big the multiplier effect ΔA/Ic, online providers cannot compensate their loss with increased revenues in a bigger advertising market, since ad spend is by definition a fraction of consumers' income. When French newspapers demanded a share of Google News' business, a modest agreement was reached after harsh disputes, but the same situation in Spain ended much worse for everybody.
  • In fact, threatening the very source of revenues of online service providers could damage their business to the point of making it unviable. Current operating margins in this market range from 25% to 35%, though, which seems to indicate otherwise.
Now, if users were to enter into the ad value chain, they would need some proxy on their behalf who has the power to oppose online service providers (for a share of the pie, at any rate). Two potential players come to mind:
  • ISPs,
  • Internet browser companies.
As online ad spend per person is actually not that high, we are talking here of potential revenues for the end user of a few dollars a year at most. This would be best provided by the proxy in the form of virtual goods or, in the case of ISPs, discounts in their services.

Monday, June 29, 2015

Design patterns for invariant suspension

One of the most important aspects of data encapsulation as embodied by classes and similar constructs is the ability to maintain class invariants by restricting access to the implementation data through a well defined interface. (Incidentally, a class without significant invariants is little more than a tuple of values, which is why the presence of getters and setters is a sure sign of poor design). Sometimes, however, maintaining the class invariant can lead to unacceptable costs in terms of complexity or performance. Consider two examples:
  • A sorted_vector class that maintains an internal buffer of ordered elements.
  • A spread_sheet class that implements automatic formula calculation.
These classes restrict the potential states of their implementation data so that they meet some extra requirements the interface relies on (for instance, the fact that the elements of sorted_vector are ordered allows for fast lookup operations, automatic values in the spread sheet are kept consistent with input data cells, etc.). State can then only be changed through the provided interface, which takes proper care that the invariant holds upon exit from any method —and this is all good and desirable.
The problem arises when we need to do bulk modifications on the internal data and the class interface is too rigid to support this:
  • sorted_vector typically only allows for deletion of elements or insertion of new values, and does not provide mechanisms for directly modifying the elements or temporarily rearranging them.
  • spread_sheet might only accept changing one data cell at a time so as to check which formulas are affected and recalculate those alone.
If we granted unrestricted access to the data, the class would have to renormalize (i.e. restore the invariant) inside every method. Let us see this for the case of sorted_vector:
class sorted_vector
{
  value_type* data(); // unrestricted access to data
  iterator find(const key_type& x)const
  {
    sort();
    ...
  }
  ...
private:
  void sort()const; // const: the internal buffer is declared mutable
};
This is obviously impracticable for efficiency reasons. Another alternative, sometimes found in questionable designs, is to put the burden of maintaining the invariant on the user herself:
class sorted_vector
{
  value_type* data();
  void sort();
  iterator find(const key_type& x)const; // must be sorted
  ...
};
The problem with this approach is that it is all too easy to forget to restore the invariant. What is worse, there is no way that the class can detect the oversight and raise some signal (an error code or an exception). A poor attempt at solving this:
class sorted_vector
{
  value_type* data();
  void modifying_data() // to indicate invariant breach
  {
    sorted=false;
  }
  void sort();
  iterator find(const key_type& x)const
  {
    if(!sorted)throw std::runtime_error("not sorted");
    ...
  }
  ...
private:
  bool sorted;
  ...
};
is just as brittle as the previous situation and more cumbersome for the user. If we finally give up on granting access to the internal data, the only way to do bulk modifications of the internal state is to construct the data externally and then reassign:
sorted_vector v;
std::vector<value_type> aux;
... // manipulate data in aux
v=sorted_vector(aux.begin(),aux.end());
This last option is not entirely bad, but incurs serious inefficiencies related to the external allocation of the auxiliary data and the recreation of a new sorted_vector object.
We look for design patterns that enable invariant suspension (i.e. unrestricted access to a class internal data) in a controlled and maximally efficient way.
Internal invariant suspension
In this pattern, we allow for suspension only within the execution of a class method:
class X
{
  template<typename F>
  void modify_data(F f); // f is invoked with X internal data
};
How does this work? The idea is that f is a user-provided piece of code that is given access to the internal data to manipulate freely. After f is done, modify_data, which issued the invocation, can renormalize the data before exiting. Let us see this in action for sorted_vector:
class sorted_vector
{
  template<typename F>
  void modify_data(F f)
  {
    f(data);
    sort(); // restore invariant
  }
};
...
sorted_vector v;
...
v.modify_data([](value_type* data){
  ... // manipulate data as we see fit
});
This pattern is most suitable when modifications are relatively localized in code. When the modification takes place in a more uncontrolled setting, we need something even more flexible.
External invariant suspension
Here, we move the data out and accept it back again when the user is done with it.
class X
{
  data extract_data(); // X becomes "empty"
  void accept_data(data& d); // X should be "empty"
};
At first sight, this seems just as bad as our first example with sorted_vector::data(): as soon as we grant access to the internal state, we can no longer be sure the invariant is kept. The key aspect of the pattern, however, is that extract_data sets X to some empty state, i.e., a state without internal data where the invariant is trivially true in all cases —in other words, X gives its internal data away.
class sorted_vector
{
  value_type* extract_data()
  {
    value_type* res=data;
    data=nullptr;
    return res;
  }
  void accept_data(value_type*& d)
  {
    if(data)throw std::runtime_error("not empty");
    // alternatively: delete data
    
    data=d;
    d=nullptr;
    sort(); // restore invariant
  }
};
...
sorted_vector v;
...
value_type* data=v.extract_data();
... // manipulate data
v.accept_data(data);
How does this meet our requirements of control and efficiency?
  • Control: the invariant of v is always true regardless of the user code: when invoking extract_data, v becomes empty, i.e. a vector with zero elements which is of course sorted. The methods of sorted_vector need not check for special cases of invariant suspension.
  • Efficiency: the data is never re-created, but merely moved around; no extra allocations are required.
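For concreteness, here is a minimal, self-contained sketch combining both patterns. It uses std::vector<T> as the internal storage and std::sort for renormalization, implementation choices made here for brevity rather than taken from the snippets above:

#include <algorithm>
#include <stdexcept>
#include <utility>
#include <vector>

template<typename T>
class sorted_vector
{
public:
  using container=std::vector<T>;
  using const_iterator=typename container::const_iterator;

  // internal invariant suspension: f manipulates the data freely,
  // the invariant is restored before returning
  template<typename F>
  void modify_data(F f)
  {
    f(impl);
    std::sort(impl.begin(),impl.end()); // restore invariant
  }

  // external invariant suspension: give the data away, become empty
  container extract_data()
  {
    container res=std::move(impl);
    impl.clear();
    return res;
  }

  void accept_data(container& d)
  {
    if(!impl.empty())throw std::runtime_error("not empty");
    impl=std::move(d);
    d.clear();
    std::sort(impl.begin(),impl.end()); // restore invariant
  }

  // regular interface relying on the invariant
  const_iterator find(const T& x)const
  {
    auto it=std::lower_bound(impl.begin(),impl.end(),x);
    return (it!=impl.end()&&!(x<*it))?it:impl.end();
  }

  const_iterator end()const{return impl.end();}

private:
  container impl; // kept sorted except within suspension points
};

Usage mirrors the examples above: v.modify_data(f); for the internal variant, or auto d=v.extract_data(); ...; v.accept_data(d); for the external one.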
Conclusions
Class invariants are a key aspect of data encapsulation, but can lead to inefficient code when the class interface is too restrictive for data manipulation. We have described two general design patterns, internal invariant suspension and external invariant suspension, that allow for internal state modification in a controlled fashion without interfering with the class general interface or requiring any type of data copying.

Sunday, June 28, 2015

Traversing a linearized tree: cache friendliness analysis

We do a formal analysis of the cache-related behavior of traversing a levelorder_vector, a linear array of elements laid out according to the breadth-first order of a binary tree. Ignoring roundoff effects, let us assume that the array has n = 2^d − 1 elements and the CPU provides a cache system of N lines capable of holding L contiguous array elements each. We further assume that N > d.
Under these conditions, the elements of level i, being contiguous, are paged to 2^i/L lines.
The traversal of the array corresponds to the in-order sequence of the tree implicitly encoded, and it is easy to see that this meets the elements of each level i exactly in the order they are laid out in memory: so, given that N > d, i.e. the CPU can hold at least as many lines in cache as there are levels in the tree, an optimal caching scheduler will result in only 2^i/L misses per level, or n/L overall. The number of cache misses per element during traversal is then constant and equal to 1/L, which is the minimum attainable for any traversal scheme (such as the simplest one consisting of walking the array linearly from first to last).
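Summing the per-level misses over the d levels makes the count explicit:

\[
\sum_{i=0}^{d-1}\frac{2^i}{L}=\frac{2^d-1}{L}=\frac{n}{L}.
\]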
The question remains of whether we can justifiably hold the assumption that N > d in modern architectures. In the Intel Core family, for instance, the data L1 cache is 32 KB (per core) and the line size is 64 B, so N = 512. On the other hand, the maximum d possible in 64-bit architectures is precisely 64, therefore the condition N > d is indeed met by a large margin.

Saturday, June 27, 2015

Traversing a linearized tree

In a previous entry we have introduced the data structure levelorder_vector<T>, which provides faster binary search than sorted vectors (such as sported by Boost.Container flat associative containers) by laying out the elements in a linear fashion following the breadth-first or level order of the binary tree associated to the sequence. As usual with data structures, this is expected to come at a price in the form of poorer performance for other operations.
Let us begin with iteration: traversing a levelorder_vector is algorithmically equivalent to doing so in its homologous binary tree:
The picture above shows the three generic cases the algorithm for incrementing a position has to deal with; they correspond to the three branches of the following pseudocode:
void increment(position& i)
{
  position j=right(i);
  if(j){                      // right child exists
    do{
      i=j;                    // take the path down
      j=left(i);              // and follow left children
    }while(j);
  }
  else if(is_last_leaf(i)){
    i=end;                    // next of last is end
  }
  else{
    while(is_right_child(i)){ // go up while the node is
      i=parent(i);            // the right child of its parent 
    }
    i=parent(i);              // and up an extra level
  }
}
In the case of levelorder_vector, positions are numerical indexes into an array, so these operations are purely arithmetic (by contrast, in a node-based tree we would need to follow pointers around, which is more expensive as we will see shortly):
  • parent(i) = (i − 1)/2,
  • right(i) = 2i + 2 (if less than size),
  • left(i) = 2i + 1 (if less than size),
  • is_right_child(i) = (i mod 2 = 0),
  • is_last_leaf(i) = i does not have a right child and i + 2 is a power of 2.
The last identity might be a little trickier to understand: it simply relies on the fact that the last leaf of the tree must also be the last element of a fully occupied level, and the number of elements in a complete binary tree with d levels is 2^d − 1, hence the zero-based index of the element + 2 has to be a power of two.
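Putting these identities together, a sketch of index-based incrementing could look like the following free function (an illustration of the arithmetic, not the actual iterator code of levelorder_vector; positions are zero-based indexes into the array and n, the number of elements, doubles as the end position):

#include <cstddef>

std::size_t levelorder_increment(std::size_t i,std::size_t n)
{
  std::size_t j=2*i+2;            // right child
  if(j<n){                        // right child exists
    do{
      i=j;                        // take the path down
      j=2*i+1;                    // and follow left children
    }while(j<n);
  }
  else if(((i+2)&(i+1))==0){      // no right child (checked above) and
    i=n;                          // i+2 a power of 2: next of last is end
  }
  else{
    while(i%2==0)i=(i-1)/2;       // go up while the node is a right child
    i=(i-1)/2;                    // and up an extra level
  }
  return i;
}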
A test program is provided that implements forward iterators for levelorder_vector (backwards traversal is mostly symmetrical) and measures the time taken to traverse std::set, boost::container::flat_set and levelorder_vector instances holding n = 10,000 to 3 million integer elements.
MSVC on Windows
Test compiled with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.
Traversal time / number of elements.
As expected, boost::container::flat_set is the fastest: nothing (single threaded) can beat traversing an array of memory in sequential order. levelorder_vector is around 5x slower than boost::container::flat_set but up to 4x faster than std::set. More interestingly, its traversal time per element is virtually unaffected by the size of the container. This can be further confirmed by measuring a non-touching traversal where the range [begin, end) is walked through without actually dereferencing the iterator, and consequently without accessing the container memory at all: resulting times remain almost the same, which indicates that the traversal time is dominated by the arithmetic operations on indexes. In a later entry we provide the formal proof that traversing a levelorder_vector is indeed cache-oblivious.
GCC on Linux
Contributed by Manu Sánchez (GCC 5.1, Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz) and Laurentiu Nicola (GCC 5.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz). Rather than showing the plots, we provide the observed ranges in execution speed ratios between the three containers (for instance, the cell "10-30" means that boost::container::flat_set is between 10 and 30 times faster than levelorder_vector for Intel Core i7).
              flat_set / levelorder_vector   levelorder_vector / std::set
  i7                    10-30                            3-8
  Celeron               10-12                            2.5-4.5
Clang on Linux
Configurations tested: Clang 3.6.1 with libstdc++-v3 and libc++ on Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz; and Clang 3.6.1 on Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz.
                      flat_set / levelorder_vector   levelorder_vector / std::set
  libstdc++-v3, i7              10-50                            2-7
  libc++, i7                    10-50                            2.5-8
  Celeron                       15-30                            2.5-4.5
Conclusions
Traversing a levelorder_vector takes a constant time per element that sits in between those of boost::container::flat_set (levelorder_vector being 10-50 times slower, depending on the number of elements and environment) and std::set (2-8 times faster). Users need to consider their particular execution scenarios to balance this fact against the superior performance offered at lookup.

Thursday, June 25, 2015

Cache-friendly binary search

High-speed memory caches present in modern computer architectures favor data structures with good locality of reference, i.e. the property by which elements accessed in sequence are located in memory addresses close to each other. This is the rationale behind classes such as Boost.Container flat associative containers, which emulate the functionality of standard C++ node-based associative containers while storing the elements contiguously (and in order). This is an example of how binary search works in a boost::container::flat_set with elements 0 through 30:
As searching narrows down towards the looked-for value, visited elements are progressively closer, so locality of reference is very good for the last part of  the process, even if the first steps hop over large distances. All in all, the process is more cache-friendly than searching in a regular std::set (where elements are totally scattered throughout memory according to the allocator's whims) and resulting lookup times are consequently much better. But we can get even more cache-friendly than that.
Consider the structure of a classical red-black tree holding values 0 to 30:
and lay out the elements in a contiguous array following the tree's breadth-first (or level) order:
We can perform binary search on this level-order vector beginning with the first element (the root) and then jumping to either the "left" or "right" child as comparisons indicate:
The hopping pattern is somewhat different than with boost::container::flat_set: here, the first visited elements are close to each other and hops become increasingly longer. So, we get good locality of reference for the first part of the process, exactly the opposite than before. It could seem then that we are no better off with this new layout, but in fact we are. Let us have a look at the heat map of boost::container::flat_set:
This shows how frequently a given element is visited through an average binary search operation (brighter means "hotter", that is, accessed more). As every lookup begins at the mid element 15, this gets visited 100% of the times, then 7 and 23 have 50% visit frequency, etc. On the other hand, the analogous heat map for a level-order vector looks very different:
Now hotter elements are much closer: as cache management mechanisms try to keep hot areas longer in cache, we can then expect this layout to result in fewer cache misses overall. To summarize, we are improving locality of reference for hotter elements at the expense of colder areas of the array.
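To put some illustrative numbers on this (assuming, for the sake of the example, 4-byte elements and 64-byte cache lines): in the level-order layout of the 31 elements above, the 15 hottest elements (the top four levels of the tree) occupy the first 60 bytes of the array and thus typically fit in a single cache line, whereas in the sorted layout those same 15 elements are spread evenly across the whole 124-byte array and necessarily span at least two lines.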
Let us measure this data structure in practice. I have written a prototype class template levelorder_vector<T> that stores its contents in level-order sequence and provides a member function for binary search:
const_iterator lower_bound(const T& x)const
{
  size_type n=impl.size(),i=n,j=0; // i: best candidate so far, j: current node
  while(j<n){
    if(impl[j]<x){
      j=2*j+2;                     // too small: descend to the right child
    }
    else{
      i=j;                         // candidate found
      j=2*j+1;                     // try to improve it in the left subtree
    }
  }
  return begin()+i;
}
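As a worked example with the values 0 to 30 laid out in level order as above, lower_bound(22) inspects positions 0, 2, 5, 12 and 26, holding the values 15, 23, 19, 21 and 22 respectively, and returns the position of 22; the first four positions visited all lie within the hot front portion of the array.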
A test program is provided that measures lower_bound execution times with a random sequence of values on containers std::set, boost::container::flat_set and levelorder_vector holding n = 10,000 to 3 million elements.
MSVC on Windows
Test built with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz. Depicted values are microseconds/n, where n is the number of elements in the container.
lower_bound execution times / number of elements.
The theoretical lookup times for the three containers are O(log n), which should show as straight lines (horizontal scale is logarithmic), but increasing cache misses as n grows result in some upwards bending. As predicted, levelorder_vector performs better than boost::container::flat_set, with improvement factors ranging from 10% to 30%. Both containers are nevertheless much faster than std::set, and don't show much degradation until n ≃ 5·10^5 (~7·10^4 for std::set), which is a direct consequence of their superior cache friendliness.
GCC on Linux
The following people have kindly sent in test outputs for GCC on various Linux platforms: Philippe Navaux, xckd, Manu Sánchez. Results are very similar, so we just show those corresponding to GCC 5.1 on an Intel Core i7-860 @ 2.80GHz with Linux kernel 4.0 from Debian:
lower_bound execution times / number of elements.
Although levelorder_vector performs generally better than boost::container::flat_set, there is a strange phenomenon for which I don't have an explanation: improvement factors are around 40% for low values of n, but then decrease or even become negative as n approaches 3·10^6. I'd be grateful if readers can offer interpretations of this.
Laurentiu Nicola has later submitted results with -march=native -O2 settings that differ significantly from the previous ones: GCC 5.1 on an Intel Core i7-2760QM CPU @ 2.40GHz (Linux distro undisclosed):
lower_bound execution times / number of elements.
and GCC 5.1 on an Intel Celeron J1900 @ 1.99GHz:
lower_bound execution times / number of elements.
-march=native seems to make a big difference: now levelorder_vector execution times are up to 25% (i7) and 40% (Celeron) lower than boost::container::flat_set when n approaches 3·10^6, but there is no significant gain for low values of n.
Clang on OS X
As provided by fortch: clang-500.2.79 in -std=c++0x mode on a Darwin 13.1.0 box with an Intel Core i7-3740QM CPU @ 2.70GHz.
lower_bound execution times / number of elements.
The improvement of levelorder_vector with respect to boost::container::flat_set ranges from 10% to 20%.
Clang on Linux
Manu Sánchez has run the tests with Clang 3.6.1 on an Arch Linux 3.16 box with an Intel Core i7 4790k @4.5GHz. In the first case he has used the libstdc++-v3 standard library:
lower_bound execution times / number of elements.
and then repeated with libc++:
lower_bound execution times / number of elements.
Results are similar, with levelorder_vector execution times between 5% and 20% lower than those of boost::container::flat_set. The very noticeable bumps on the graphs for these two containers when using libc++ are also present in the tests for Clang on OS X but do not appear if the standard library used is libstdc++-v3, which suggests that they might be due to some artifact related to the libc++ memory allocator.
Laurentiu's results for Clang 3.6.1 with -march=native -O2 are similar to Manu's for an i7 CPU, whereas for an Intel Celeron J1900 @ 1.99GHz we have:
lower_bound execution times / number of elements.
where levelorder_vector execution times are up to 35% lower than those of boost::container::flat_set.
Conclusions
Even though binary search on sorted vectors performs much better than on node-based data structures, alternative contiguous layouts can achieve even more cache friendliness. We have shown a possibility based on arranging elements according to the breadth-first order of a binary tree holding the sorted sequence. This data structure can possibly perform worse, though, in other operations like insertion and deletion: we defer the study of this matter to a future entry.

Friday, May 29, 2015

A case in political reasoning

Municipal elections in Spain have just been run with interesting results. In Barcelona, the nationalist ruling party CiU, noted for its recent shift towards unreserved support for the independence of Catalonia, has lost to Barcelona en Comú, a political alliance loosely associated with left-wing Podemos and ambivalent on these matters. In Madrid, the party in office and undisputed winner for more than twenty years, right-wing PP, has lost a great deal of support, almost yielding first place to Ahora Madrid, also linked to Podemos, which is now in a better position to form government. Commenting on the elections, CiU's Artur Mas, president of the Catalonian regional government, says (translation mine):
Mas has discarded the possibility that CiU's defeat in Barcelona has been influenced by the party's pro-independence discourse. "Madrid might also have a mayor from Podemos, and there is no independentist movement there", he said.
Is Mr. Mas's argument sound? Let us analyze it formally.
Logical analysis
Write
i(x) = party in power in county x supports independence of the region,
p(x) = Podemos (or associated alliances) gains support in x over the party in power,
where x ranges over all counties in Spain. Then, Mas's reasoning can be formulated as
It is not the case that ∀x (i(x) → p(x)), because ¬i(Madrid) ∧ p(Madrid).
Put this way, the argument is blatantly false: a refutation of ∀x (i(x) → p(x)) would ask for a city with a pro-independence governing party where Podemos did not succeed, which is not the case of Madrid. Let us try another approach.
Probabilistic analysis
If Mas was thinking in statistical terms, a plausible formulation could be
It is not the case that P(p(x) | i(x)) > P(p(x)), because ¬i(Madrid) ∧ p(Madrid),
that is, the probability that Podemos gains power in a pro-independence context is not higher than in general, as supported by the results in Madrid. If we write
PI = number of counties with pro-independence councils where Podemos wins,
I = number of counties with pro-independence councils where Podemos runs,
P = number of counties where Podemos wins,
N = total number of counties where Podemos runs,
then Mas affirms that
PI / I ≤ P / N
or, equivalently,
D = (P / N) − (PI / I) ≥ 0.
Does the fact that ¬i(Madrid) ∧ p(Madrid) give any support to this contention? Let
D' = (P' / (N − 1)) − (PI' / I')
be the value resulting from considering every county in Spain where Podemos ran except Madrid. When taking Madrid into account, we have
D = ((P' + 1) / N) − (PI' / I')
and, consequently,
D − D' = ((P' + 1) / N) − (P' / (N − 1)),
which is always non-negative (since P' is never greater than N − 1), and furthermore strictly positive if P' < N − 1, that is, if Podemos has lost somewhere (which is certainly the case). So, knowing about the existence of counties without pro-independence councils where Podemos won increases the chances that, in effect, D ≥ 0 and, consequently, that independentism does not play a beneficial role in the success of this party. But if this knowledge covers just one county out of the hundreds where Podemos ran, the degree of extra confidence we gain is extremely small. This is the very weak sense in which Mas's argument can be held valid.
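For completeness, the sign claim follows by putting the difference over a common denominator:

\[
D - D' = \frac{P'+1}{N}-\frac{P'}{N-1} = \frac{(P'+1)(N-1)-P'N}{N(N-1)} = \frac{N-1-P'}{N(N-1)} \ge 0,
\]

with strict inequality exactly when P' < N − 1.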

Sunday, January 11, 2015

Gas price hysteresis, Spain 2014

In 2008 we did a hysteresis analysis to try to understand how variations in the price of oil translate to retail prices of gasoline and gasoil (diesel) in Spain. As crude oil prices have plummeted in the last few months, spurring some interest in whether consumers have benefited from this as much as they should, we redo here the analysis with 2014 data:
The figure shows the weekly evolution during 2014 of prices of Brent oil and average retail prices without taxes of 95 octane gas and gasoil in Spain, all in c€ per liter.
Below is the scatter plot of Δ(gasoline price before taxes) against Δ(Brent price),
with linear regressions for the half-planes Δ(Brent price) ≥ 0 and Δ(Brent price) ≤ 0, given by
ΔBrent ≥ 0 → y = f+(x) = b+ + m+x = 0.0734 − 0.3685x,
ΔBrent ≤ 0 → y = f−(x) = b− + m−x = 0.2338 + 0.5447x,
respectively (shown in the plot as dotted red lines). This is entirely unexpected: retail gas prices tend to reduce even for positive variations of oil. The overall regression line for the scatter plot (dotted black line) is
y = f(x) = b + mx = −0.2489 + 0.4208x.
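To read these figures: under the overall fit, a week in which Brent drops by 1 c€/l comes with an expected gasoline price drop of about 0.2489 + 0.4208 ≈ 0.67 c€/l, while a flat week (ΔBrent = 0) still shows an average drop of roughly 0.25 c€/l.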
As for gasoil, we have
and
ΔBrent ≥ 0 → y = f+(x) = b+ + m+x = −0.1965 − 0.1115x,
ΔBrent ≤ 0 → y = f−(x) = b− + m−x = 0.16 + 0.4498x,
 overall → y = f(x) = b + mx = −0.2169 + 0.5039x,
which is qualitatively similar to the behavior for gasoline. In both cases, retail prices under constant oil prices (that is, when ΔBrent is small) have decreased by around 0.2 c€/l per week.
So, oil's fall during the second half of 2014 has of course resulted in a general reduction of gas prices, but prices were also showing a downward correction trend during the first half of the year, when oil rises were translated less strongly than drops. At least for 2014, we haven't found evidence of the "rocket and feather" effect fuel companies are usually suspected of.