Monday, June 15, 2009

60/40 split

Tim Harford recounts in his book The Logic of Life an anecdote about the authors of Freakonomics, Steven D. Levitt and Stephen J. Dubner: when discussing the terms of their collaboration for writing the book, Levitt stated that he wouldn't go for less than a 60/40 split on revenues; Dubner, on the other hand, would settle for no less than 60/40 either. The apparent confrontation vanished when they realized that each was meaning the other to get 60%, so they decided to join the venture.

Charming as it may sound, this story cannot be true: if I'm settling for 40% of the revenues and my partner proposes a 60/40 split, the proposal immediately fits my goals no matter whom I assume will get each part of the split.

Monday, March 23, 2009

About break statements

In a recent article, Andrew Koenig argues that break statements are harmful because they make it harder to deduce the program context at loop termination. For instance, given:

while (n != 0) { /* do something */ }
n can be assumed to be zero at loop termination unless there is a break within it. Koenig proposes two different approaches to alleviating this problem:

  • Include the program context at the break statement as a possible value for the program context at loop termination.
  • Force the loop termination condition just before breaking.

There is another alternative: exclude the loop termination condition from the evaluable program context. Here is a way to do that:

for (int local_n = n; local_n != 0;) { /* do something with local_n */ }
As the scope of local_n does not extend outside the loop, program context after loop termination does not depend on the termination procedure. A cruder, but also more flexible way to do the same is the following:
{
‎ int local_n = n;
‎ while (local_n != 0) { /* do something with local_n */ }
‎}
The somewhat unexpected surrounding braces are a conspicuous reminder that local_n does not form part of the global program state.

Tuesday, March 17, 2009

A curious syntactic transformation

Consider the sentence

Rationalism has many followers.

What in principle looks like a rather dull N+V+O statement does not however allow for seeminlgy innocuous variations on the object:

*Rationalism has many people.

The reason why the former is valid while the latter is not is that followers are followers of something, in this particular case rationalism:

Rationalism has many followers [of rationalism].

In fact, we can view our sentence as a mere rewording of:

There are many followers of rationalism,

which leads us to hypothetize the existence of the following syntactic transformation:

There is/are [Det] N of NP → NP has/have [Det] N.

For instance:

There were lots of fans of the BeatlesThe Beatles had lots of fans.
There are no enemies of RomeRome has no enemies.

This transformational rule explains from a purely syntactical perspective why superficially similar sentences like *Rationalism has many people are invalid --they have no "There is..." equivalent.

Interestingly, the rule seems to operate in other languages apart from English (I presume that at least in most modern European languages):

El racionalismo tiene muchos seguidores.
Le rationalisme a beaucoup d'adeptes.
Rationalismus hat viele Anhänger.

This points to some general (maybe universal?) mechanism of reification of the possesion relationship between nouns.

Tuesday, March 10, 2009

The mythical Bell Curve in Human Resources

Suppose a company's HR department wishes to establish a statistical model for their personnel performance in order to set up outcome predictions for the company's bonus system. It is all too easy to assume that performance will follow a normal distribution such as this:

Fig. 1: Performance distribution under the Bell Curve assumption.

The assumption stems from the deeply held custom in Psychology and Sociology of using the Bell Curve to model any a priori unknown human trait. Actually, if the company's hiring process is indeed efficient in selecting better than average people, this assumption is a complete contradiction.

Consider a simplistic hiring process in which performance is assesed by means of a test, so that applicants scoring some fixed minimum at the test are hired, and let us concede for the sake of the argument that the test is a perfect predictor of performance. The resulting performance distribution is depicted at the figure.

Fig. 2: Performance distribution with a test-based selection process.

The distribution has positive skewness, i.e. its right tail is longer than the left tail. So, the normal approximation with the same mean and variance (shown in dotted line) both overestimates low performers and underestimates high performers. It also underestimates low-to-normal performers and overestimates normal-to-high performers.

In other hiring process scenario, the best among N candidates is selected. The resulting distribution is depicted at the following figure for N = 10.

Fig. 3: Performance distribution with best-of-10 selection process.

We have positive skewness again, though not as marked as in the prvious case. Skewness grows as N does. Again, the normal approximation results in overestimation of low performers and underestimation of high performers.

Finally, we consider a two-stage hiring process where applicants are first filtered by a test and then the best candidate out of N is selected.

Fig. 4: Performance distribution with test prefiltering and best-of-10 selection process.

The test filter results in a slightly larger positive skewness. As in previous cases, normal approximation predicts more low performers and less high performers than the real case.

To summarize: hiring process not only results in a personnel performance distribution with a higher than average mean (which is the primary purpose of any hiring process); the distribution will also have positive skewness, with more excellent and less deficient people than predicted by the Bell Curve.

Friday, November 28, 2008

More on price gas hysteresis

We redo the hysteresis analysis we conducted on gas prices, now for a longer period of time and using much more data than before:
  • Average weekly retail prices without taxes of 95 octane gasoline and gasoil in Spain have been obtained from the European Commision Oil Bulletin (thanks to Wonka for the pointer).
  • The US Energy Information Administration provides daily spot prices of Brent oil in dollars per barrel.
  • Historical data on euro to dollar exchange rates can be retrieved from FXHistory.

Using this information we have constructed the following figure showing the evolution of Brent prices and gasoline and gasoil Spanish retail prices without taxes, all in c€ per liter, between January 2006 and October 2008.

And this is the scatter plot of Δ(gasoline price before taxes) against Δ(Brent price):

The linear regressions of this plot for the semiplanes Δ(Brent price) ≥ 0 and ≤ 0 are:

ΔBrent ≥ 0 → y = f+(x) = b+ + m+x = 0.1854 + 0.1691x,
ΔBrent ≤ 0 → y = f(x) = b + mx = 0.1273 + 0.3636x.

On one hand, m is more than twice as large as m+, which indicates that gasoline price follows oil price reductions much more strongly than oil price increases when these variations are large; when oil price variations are small, it is the fact that both b+ and b+ are positive that dominates the situation: gasoline prices have a very clear bias towards increasing then. The "consumer unfavorable" region in which f+(x) > −f(−x) is

−1.61 c€/l < ΔBrent < 1.61 c€/l.

As for gasoil, the plot look like this:

with regression lines

ΔBrent ≥ 0 → y = f+(x) = b+ + m+x = 0.2669 + 0.1269x,
ΔBrent ≤ 0 → y = f(x) = b + mx = 0.2391 + 0.4134x.

Again, m is much greater than m+ (~3.25 times), but this is masked for small variations of Brent price by the constants b+ and b+ being greater than zero. The consumer unfavorable region is

−1.77 c€/l < ΔBrent < 1.77 c€/l,

which is very similar to the consumer unfavorable region for gasoline.

So, we have a somewhat mixed scenario: when oil prices do not vary much, there is a tendency for retail gas prices to increase, whereas when oil price variations are high, reductions are tracked much more strongly than increases. We propose two explanations, both highly speculative, for this phenomenon:

  • Small variations of oil prices do not reach public attention, which allows retailers to silently increase gas prices; high variations in oil price, however, generally make it to the news, so that retailers are under greater public scrutiny when transferring these variations to the pump.
  • So as to soften the public negative reaction to price increments (and associated contractions in consumption), retailers do not reflect high increases in oil price as abruptly as oil price reductions, but opt to distribute the increment over a longer period of time; this would account both for the fact that m+ < m and for the positive bias in small price variations.

Monday, November 24, 2008

Optimizing red-black tree color bits

C++ associative containers are almost universally implemented on top of a structure known as a red-black tree. An rb tree node has approximately the following look:

‎enum color_type{red=false,black=true};

‎template<typename T>
‎struct rb_node
‎{
‎ color_type color;
‎ rb_node* parent;
‎ rb_node* left;
‎ rb_node* right;
‎ T value;
‎};

The color of a node, from which the data structure receives its name, is used to implement some algorithms to keep the tree balanced. Note that the information conveyed by color is exactly one bit; yet, this member takes at least one char of storage and, in most platforms, much more than that because of alignment issues. Increasingly more libraries implement an optimization technique that is able to get rid of the extra storage demanded by color by embedding the information bit into one of the additional node fields. As this technique does not appear to be much documented on the Internet, this entry describes it in detail. Much of the material draws from the source code of Boost.MultiIndex. Some acquantaince with Boost.MPL is welcome to understand what follows.

To embed a bit into the representation of a pointer we need to use in place of the pointer an unsigned integer type with the same size. In most 32bit platforms this type is unsigned int, but the standard does not guarantee this, or even that such an integral type exists. We assume that we have the following utilities at our disposal:

‎typedef ... has_uintptr_type;
‎typedef ... uintptr_type;

where has_uintptr_type is a Boost.MPL boolean indicating whether an unsigned integral type exists with exactly the same size as a pointer, and if so uintptr_type is such type. We will provide later an implementation for these.

Alignment considerations dictate that many types must lie in memory addresses that are multiple of some small integer greater than 1; in particular, it is extremely usual that the alignment of a type T is even (i.e. a multiple of 2), in which case the least significant bit of the representation of pointers to T is always zero. And this is precisely the situation in which we can reuse this bit to embed the color information:

‎template<typename T,bool ColorBitCompression=true>
‎class rb_node:
‎ public boost::mpl::if_c<
‎ ColorBitCompression&&
‎ has_uintptr_type::value&&
‎ (boost::alignment_of<compressed_rb_node_header>::value%2==0),
‎ compressed_rb_node_header,
‎ regular_rb_node_header
‎ >::type
‎{
‎ T value_;

‎public:
‎ T& value(){return value_;}
‎ const T& value()const{return value_;}
‎};

rb_node uses a header with the color bit impression technique when:

  1. This is requested by the user via ColorBitCompression.
  2. There is an unsigned integer with which pointers can be represented.
  3. The alignment of headers is even.

Otherwise a regular header is used. The implementation of the latter is entirely obvious:

‎class regular_rb_node_header
‎{
‎ typedef regular_rb_node_header* pointer;

‎ color_type color_;
‎ regular_rb_node_header* parent_;
‎ regular_rb_node_header* left_;
‎ regular_rb_node_header* right_;

‎public:
‎ color_type& color(){return color_;}
‎ color_type color()const{return color_;}
‎ pointer& parent(){return parent_;}
‎ pointer parent()const{return parent_;}
‎ pointer& left(){return left_;}
‎ pointer left()const{return left_;}
‎ pointer& right(){return right_;}
‎ pointer right()const{return right_;}
‎};

As for compressed_rb_node_header, the most interesting part lie in the use of proxy classes color_ref and parent_ref to isolate the user from the representation of the color information and the parent pointer as a merged uintptr_type:

‎class compressed_rb_node_header
‎{
‎ typedef compressed_rb_node_header* pointer;
‎ struct color_ref
‎ {
‎ color_ref(uintptr_type* r_):r(r_){}

‎ operator color_type()const
‎ {
‎ return color_type(*r&uintptr_type(1));
‎ }

‎ color_ref& operator=(color_type c)
‎ {
‎ *r&=~uintptr_type(1);
‎ *r|=uintptr_type(c);
‎ return *this;
‎ }

‎ color_ref& operator=(const color_ref& x)
‎ {
‎ return operator=(x.operator color_type());
‎ }

‎ private:
‎ uintptr_type* r;
‎ };
‎ struct parent_ref
‎ {
‎ parent_ref(uintptr_type* r_):r(r_){}

‎ operator pointer()const
‎ {
‎ return (pointer)(void*)(*r&~uintptr_type(1));
‎ }

‎ parent_ref& operator=(pointer p)
‎ {
‎ *r=((uintptr_type)(void*)p)|(*r&uintptr_type(1));
‎ return *this;
‎ }

‎ parent_ref& operator=(const parent_ref& x)
‎ {
‎ return operator=(x.operator pointer());
‎ }

‎ pointer operator->()const
‎ {
‎ return operator pointer();
‎ }

‎ private:
‎ uintptr_type* r;
‎ };

‎ uintptr_type parentcolor_;
‎ pointer left_;
‎ pointer right_;

‎public:
‎ color_ref color(){return color_ref(&parentcolor_);}
‎ color_type color()const{return color_type(parentcolor_&std::size_t(1ul));}
‎ parent_ref parent(){return parent_ref(&parentcolor_);}
‎ pointer parent()const
‎ {return (pointer)(void*)(parentcolor_&~uintptr_type(1));}
‎ pointer& left(){return left_;}
‎ pointer left()const{return left_;}
‎ pointer& right(){return right_;}
‎ pointer right()const{return right_;}
‎};

A complete example program is provided. In typical 32bit architectures, color bit compression makes the size of rb_node<T> reduce by 4 bytes.

This color bit compression technique introduces some penalty when accessing and modifying the color and parent information. This penalty, however, is entirely negligible. What is more, the reduction of size brought about by color bit compression usually results in a performance speedup for associative containers taking advantage of this trick: as nodes are smaller, more of them can be stored in L1 cache, which accelerates execution noticeably.

One final remark about this technique: as the parent pointer is no longer declared as a regular pointer, but rather represented by a masked integer, most C++ garbage collectors will not recognize the implicit pointer, which in principle can cause problems. All is fine, however, as it is only the parent that is masked: the right and left pointers are stored as genuine pointers, and as every node of a red/black tree is reachable only through left and right garbage collectors will not fail.

Thank you to Thorsten Ottosen for discussing this color bit compression technique with me.

Thursday, November 20, 2008

Infinitely many voters

As we have discussed before, the famous Arrow's Theorem is equivalent to this basic theorem on ultrafilters:

Every ultrafilter on a finite set is principal.

(See for instance Joel W. Robbin's paper for a proof of this equivalence.) Basically, the set on which the ultrafilter is built represents the voters, and the member sets of an ultrafilter can be equated with forcing coalitions, groups of voters which, if unanimous on a given preference, determine the overall result with respect to that preference. The theorem above says that every ultrafilter on a finite set of voters has a forcing coalition of size 1, i.e. a dictator.

If we have an inifite set, however, the Axiom of Choice implies that there are non-principal ultrafilters: so if there were infinitely many voters we could devise (alas, nonconstructively) a fair voting system free of dictatorships. It is interesting to try to take advantage of this result to somehow circumvent the limitations for the finite case, and see what fails (something must fail after all):

Let our voters be represented by the set V={v1,...,vN}: we "extend" them by assigning to each vi an enumerable set of fictitious voters vi1, vi2,... Thus the set V' of fictitious voters is infinite and we can have a non-principal ultrafilter F on it. We design the following voting system: for any given preference the vote of each vi is obtained and assigned to the associated fictitious voters vij; if the overall resulting set of fictitious voters supporting the preference is a forcing coalition according to F, the preference is approved, otherwise rejected.

We already know that this voting system cannot be fair, and in fact it is easy to uncover the underlying flaw: the range of our mapping f : VV' is not the entire powerset of V', but only a finite subset of it (with cardinality 2N); although F is non-principal, it becomes principal (streching somewhat the meaning of the concept) when restricted to range(f): F necessarily contains then a pseudodictatorial forcing coalition comprising exactly all the fictitious voters associated to a single vi, who is the resulting dictator for the so devised voting system.