Saturday, January 18, 2014

A better hash table: Clang

After Microsoft Visual Studio 2012 and GCC 4.8.2, we now provide the results of our performance test suite for C++ unordered associative containers in Clang 3.4. All the tests were compiled and run by Thomas Goyne on a Mac OS X 10.9.1 box with an Intel Core i7-3720QM @2.6GHz, Turbo Boost and SpeedStep disabled, release settings -O3 -std=c++11.
Clang can work with two different standard libraries, libstdc++-v3 and libc++: we have tried both libstdc++ 4.8.2 and libc++ 3.4.
libstdc++-v3
Insertion without rehashing, unique elements.
Insertion with rehashing, unique elements.
Insertion without rehashing, duplicate elements, Fmax = 1, G = 5.
Insertion without rehashing, duplicate elements, Fmax = 5, G = 5.
Insertion with rehashing, duplicate elements, Fmax = 1, G = 5.
Insertion with rehashing, duplicate elements, Fmax = 5, G = 5.
Erasure, unique elements.
Erasure, duplicate elements, Fmax = 1, G = 5.
Erasure, duplicate elements, Fmax = 5, G = 5.
Successful lookup, unique elements.
Unsuccessful lookup, unique elements.
Successful lookup, duplicate elements, Fmax = 1, G = 5.
Unsuccessful lookup, duplicate elements, Fmax = 1, G = 5.
Successful lookup, duplicate elements, Fmax = 5, G = 5.
Unsuccessful lookup, duplicate elements, Fmax = 5, G = 5.
As expected, the results are very similar to those obtained with GCC: the only noteworthy difference is the lookup performance of Boost.Unordered with unique elements, which is somewhat poorer here; the reasons for this discrepancy are not clear to me.
libc++
Insertion without rehashing, unique elements.
Insertion with rehashing, unique elements.
Insertion without rehashing, duplicate elements, Fmax = 1, G = 5.
Insertion without rehashing, duplicate elements, Fmax = 5, G = 5.
Insertion with rehashing, duplicate elements, Fmax = 1, G = 5.
Insertion with rehashing, duplicate elements, Fmax = 5, G = 5.
Erasure, unique elements.
Erasure, duplicate elements, Fmax = 1, G = 5.
Erasure, duplicate elements, Fmax = 5, G = 5.
Successful lookup, unique elements.
Unsuccessful lookup, unique elements.
Successful lookup, duplicate elements, Fmax = 1, G = 5.
Unsuccessful lookup, duplicate elements, Fmax = 1, G = 5.
Successful lookup, duplicate elements, Fmax = 5, G = 5.
Unsuccessful lookup, duplicate elements, Fmax = 5, G = 5.
In general, the unordered associative containers provided by libc++ are considerably slower than those in libstdc++-v3:
  • When duplicate elements are allowed, insertion of a new element in libc++ takes place at the end of the range of equivalent elements (in libstdc++-v3 it is at the beginning), which incurs the extra cost of traversing the range.
  • Erasure is unnecessarily expensive: to determine whether a given node p is located at the beginning of its bucket b, libc++ does a hash→bucket mapping on the predecessor prev of p and checks whether the resulting bucket differs from b, when it suffices to verify that b points to prev.
  • It is harder to explain why lookup (especially with unique elements) is so much slower than in libstdc++-v3: code inspection reveals that the hash→bucket mapping is less efficient in libc++ (it involves one extra logical AND operation), but this fact alone seems unlikely to account for the big differences in performance.

Friday, January 17, 2014

A better hash table: GCC

As a continuation of our previous entry on the performance of various implementations of C++ unordered associative containers, Thomas Goyne has run the measurement test suite in GCC 4.8.2 for Mac with the default standard library libstdc++-v3 and release settings -O3 -std=c++11. The machine used was an OS X 10.9.1 box with an Intel Core i7-3720QM running at 2.6GHz with Turbo Boost and SpeedStep disabled. Results are depicted below.
Insertion without rehashing, unique elements.
Insertion with rehashing, unique elements.
Insertion without rehashing, duplicate elements, Fmax = 1, G = 5.
Insertion without rehashing, duplicate elements, Fmax = 5, G = 5.
Insertion with rehashing, duplicate elements, Fmax = 1, G = 5.
Insertion with rehashing, duplicate elements, Fmax = 5, G = 5.
Erasure, unique elements.
Erasure, duplicate elements, Fmax = 1, G = 5.
Erasure, duplicate elements, Fmax = 5, G = 5.
Successful lookup, unique elements.
Unsuccessful lookup, unique elements.
Successful lookup, duplicate elements, Fmax = 1, G = 5.
Unsuccessful lookup, duplicate elements, Fmax = 1, G = 5.
Successful lookup, duplicate elements, Fmax = 5, G = 5.
Unsuccessful lookup, duplicate elements, Fmax = 5, G = 5.
Some general insights:
  • Insertion times are very similar among the three libs for unique elements. As for the case of duplicate elements, the group skipping capabilities of Boost.Unordered and Boost.MultiIndex, while greatly improving lookup performance, do have a noticeable negative impact on insertion times.
  • On the other hand, libstdc++-v3 performs surprisingly well in erasure with duplicate elements, despite its having to do bucket traversal to locate the node preceding the one to be erased. I don't have a clear explanation for this fact besides attributing it to a better CPU cache behavior due to element nodes in libstdc++-v3 being smaller (as this lib does not store hash values if these can be calculated on the fly): maybe readers can propose alternative interpretations.
  • Boost.MultiIndex's higher locality shows up very clearly in its excellent lookup performance for unique elements.
In a later entry we will study libstdc++-v3, libc++, Boost.Unordered and Boost.MultiIndex in Clang.

Thursday, January 9, 2014

A better hash table

In previous entries we have described several implementations of C++ unordered associative containers without and with duplicate elements. Subsequent performance measurements have taught us how important algorithm locality is in the presence of CPU memory caches, which favor data structures with as few node indirections as possible. This suggests an improvement over the implementation introduced by Boost.MultiIndex hashed indices for Boost 1.56. In the case of unique elements, the data structure looked like this:
Let us now twist this structure a bit:
This new scheme results from the old one by merely swapping forward and back pointers (bucket nodes are equipped then with back pointers), which at first blush seems to gain us nothing in terms of efficiency. But there is a key difference: going from a bucket node to the first bucket element is now done directly, rather than with an indirection through the preceding node. Previously, traversing a bucket of n elements required visiting 1 bucket node and n + 2 element nodes; with the new implementation, 2 bucket nodes and n element nodes, one less node in total (and visiting bucket nodes is expected to be cheaper, as they lie contiguously in an array and are thus better cached). Similarly, insertion of an element at the beginning of a nonempty bucket involves visiting 2 nodes while formerly it involved 3, and if the bucket is empty (this is harder to see) it is 4 vs. 5 nodes.
The same change can be easily applied to the case of duplicate elements, from:
to
This seemingly small modification brings about substantial improvements in performance. We have repeated the measurements done in the series of articles
under the same conditions as described there. For the reader's reference, we provide snapshots of
  • Boost.MultiIndex for Boost 1.56, prerelease 1 (the one used for the first set of measurements),
  • Boost.MultiIndex for Boost 1.56, prerelease 2 (with the changes proposed in this entry).
Besides the change in data structure, prerelease 2 of Boost.MultiIndex also includes some improvements in its rehashing procedure. The new results are depicted below.
Insertion without rehashing, unique elements.
Insertion with rehashing, unique elements.
Insertion without rehashing, duplicate elements, Fmax = 1, G = 5.
Insertion without rehashing, duplicate elements, Fmax = 5, G = 5.
Insertion with rehashing, duplicate elements, Fmax = 1, G = 5.
Insertion with rehashing, duplicate elements, Fmax = 5, G = 5.
Erasure, unique elements.
Erasure, duplicate elements, Fmax = 1, G = 5.
Erasure, duplicate elements, Fmax = 5, G = 5.
Successful lookup, unique elements.
Unsuccessful lookup, unique elements.
Successful lookup, duplicate elements, Fmax = 1, G = 5.
Unsuccessful lookup, duplicate elements, Fmax = 1, G = 5.
Successful lookup, duplicate elements, Fmax = 5, G = 5.
Unsuccessful lookup, duplicate elements, Fmax = 5, G = 5.
Note that implementations of C++ unordered associative containers based on single linking cannot apply this improvement, that is, they cannot make bucket nodes point directly to the first element in the bucket: if they did so, the node preceding the first in a bucket would be unreachable in constant time, making it impossible to efficiently implement erasure.

Wednesday, January 1, 2014

A relationship on real functions

Given two functions f, g : ℝ → ℝ, we say that f is strongly greater than g, denoted f ≫ g, iff there exists a nondecreasing function h such that f(x) > h(x) > g(x) for all x in ℝ. ≫ is obviously a strict order on real functions. The following are additional elementary facts about this relationship.
Proposition. Let f, g, f', g', f+, g-, h, r : ℝ → ℝ be such that f ≫ g, f' ≫ g', f+(x) ≥ f(x) and g(x) ≥ g-(x) for all x, h is strictly increasing and r is nondecreasing, and let k be a positive real number. We have
  • (f + f') ≫ (g + g'),
  • f+ ≫ g-,
  • h∘f ≫ h∘g,
  • k(f + r) ≫ k(g + r).
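As an illustration of how these facts follow directly from the definition, here is a one-line argument for the first item (our sketch; w and w' denote the witness functions):

```latex
% If w witnesses f >> g and w' witnesses f' >> g', then w + w' is a sum of
% nondecreasing functions, hence nondecreasing, and for all x in R:
f(x) + f'(x) \;>\; w(x) + w'(x) \;>\; g(x) + g'(x),
% so w + w' witnesses (f + f') >> (g + g').
```

The remaining items admit equally short witness constructions: f's own witness works for f+ ≫ g-, h∘w for the third item, and k(w + r) for the fourth.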
For any function f we introduce the following associated functions fmin, fmax:
fmin(x) := min{f(y) : y ≥ x},
fmax(x) := max{f(y) : y ≤ x},
provided the expressions actually exist for all x in ℝ. These auxiliary functions give us an alternative way of defining ≫.
Proposition. f ≫ g iff fmin and gmax exist and fmin(x) > gmax(x) for all x in ℝ.
Proof. If f ≫ g then there is a nondecreasing function h such that
min{f(y) : y ≥ x} > min{h(y) : y ≥ x} = h(x),
max{g(y) : y ≤ x} < max{h(y) : y ≤ x} = h(x),
hence fmin and gmax are well defined and fmin(x) > gmax(x) for all x. Conversely, if fmin(x) > gmax(x) for all x, let us take
h(x) := (fmin(x) + gmax(x))/2;
fmin and gmax are obviously nondecreasing, thus h is also nondecreasing, and for all x in ℝ
f(x) ≥ fmin(x) > h(x) > gmax(x) ≥ g(x).
As a corollary, if f ≫ g then f(x) is lower bounded as x → +∞ and g(x) is upper bounded as x → -∞.

Saturday, December 28, 2013

Objects, slots, functors, duality

C++ smart pointers rely on the ability to overload operator-> (and, less importantly, operator->*), whose invocation precedes any access to the pointee. Unfortunately, there is no such thing as overloading operator. (even though the possibility to do so has been considered by Bjarne Stroustrup and others), which would allow for the creation of "smart reference" classes. Lacking this ability, a wrapper class has to go through the painful and non-generic process of replicating the interface of the wrapped object. We present here an approach to class interface design, let's call it slot-based interfaces, that allows for easy wrapping without losing much expressivity with respect to traditional C++ interfaces. The techniques described are exercised by a program available for download (a fairly C++11-compliant compiler such as GCC 4.8 is needed).
Say we have a container class like the following:
template <class T, class Allocator = std::allocator<T>>
class vector {
  ...
  iterator       begin();
  const_iterator begin()const;
  iterator       end();
  const_iterator end()const;
  ...
  template <class... Args>
  void emplace_back(Args&&... args);
  void push_back(const T& x);
  ...
};
The interface of this vector can be changed to one based on slots by replacing each member function with an equivalent overload of operator():
template <class T, class Allocator = std::allocator<T>>
class vector {
  ...
  iterator       operator()(begin_s);
  const_iterator operator()(begin_s)const;
  iterator       operator()(end_s);
  const_iterator operator()(end_s)const;
  ...
  template <class... Args>
  void operator()(emplace_back_s,Args&&... args);
  void operator()(push_back_s,const T& x);
  ...
};
where begin_s, end_s, emplace_back_s, etc. are arbitrary, different types called slots, of each of which there is a corresponding global instance with name begin, end, emplace_back, etc. Traditional code such as
vector<int> c;
for(int i=0;i<10;++i){
  c.push_back(2*i);
  c.emplace_back(3*i);
}
for(auto it=c.begin(),it_end=c.end();it!=it_end;++it){
  std::cout<<*it<<" ";
}
looks like the following when using a slot-based interface:
vector<int> c;
for(int i=0;i<10;++i){
  c(push_back,2*i);
  c(emplace_back,3*i);
}
for(auto it=c(begin),it_end=c(end);it!=it_end;++it){
  std::cout<<*it<<" ";
}
Legibility is still excellent, if a little odd at first. The big advantage of a slot-based interface is that all of it is mediated through invocations of (overloads of) operator(), so wrapping, with the aid of C++11 perfect forwarding, is an extremely simple business:
template <typename T>
class logger
{
  T t;
public:
  template<typename ...Args>
  logger(Args&&... args):t(std::forward<Args>(args)...){}

  template<typename Q,typename ...Args>
  auto operator()(Q&& q,Args&&... args)
    ->decltype(t(std::forward<Q>(q),std::forward<Args>(args)...))
  {
     std::cout<<typeid(q).name()<<std::endl;
     return t(std::forward<Q>(q),std::forward<Args>(args)...);
  }

  template<typename Q,typename ...Args>
  auto operator()(Q&& q,Args&&... args)const
    ->decltype(t(std::forward<Q>(q),std::forward<Args>(args)...))
  {
     std::cout<<typeid(q).name()<<std::endl;
     return t(std::forward<Q>(q),std::forward<Args>(args)...);
  }
};
logger<T> merely adorns the interface of T (which is, remember, entirely comprised of overloads of operator()) by registering the name of the first argument type, i.e. the slot used. Substituting logger<T> for T in any piece of code works immediately without further modifications, regardless of the interface of T (provided, of course, it is slot-based). The following is probably more interesting:
template <typename T>
class shared_ref
{
  std::shared_ptr<T> p;
public:
  template<typename ...Args>
  shared_ref(Args&&... args):p(std::forward<Args>(args)...){}

  template<typename ...Args>
  auto operator()(Args&&... args)
    ->decltype((*p)(std::forward<Args>(args)...))
  {
     return (*p)(std::forward<Args>(args)...);
  }
};
Much as shared_ptr<T> acts as a regular T*, shared_ref<T> acts as a regular T& whose (slot-based) interface can be used unencumbered. This even works with run-time polymorphism based on virtual overloads of operator().
We can continue expanding on this new interface paradigm. So far slots have been regarded as passive types whose only mission is to adorn interfaces with convenient names, but nothing stops us from making these slots active classes in a curiously useful way:
struct begin_s
{
  template<typename T,typename ...Args>
  auto operator()(T&& t,Args&&... args)const
    ->decltype(t(*this,std::forward<Args>(args)...))
  {
    return t(*this,std::forward<Args>(args)...);
  }
};
extern begin_s begin;
We have defined begin_s as a functor performing the following transformation:
begin(t,...) → t(begin,...),
which allows us to write (assuming all the slots are defined in an equivalent way) code like this:
vector<int> c;
for(int i=0;i<10;++i){
  push_back(c,2*i);
  emplace_back(c,3*i);
}
for(auto it=begin(c),it_end=end(c);it!=it_end;++it){
  std::cout<<*it<<" ";
}
So, slots give us for free the additional possibility to use them with global function syntax. In fact, we can take this behavior as the very definition of a slot.
Definition. A basic slot is a functor S resolving each invocation of s(t,...) to an invocation of t(s,...), for any s of type S.
Those readers with a mathematical inclination will have detected a nice duality pattern going on here. On one hand, slot-based classes are functors whose overloads of operator() take slots as their first parameter, and on the other hand slots are functors whose first parameter is assumed to be a slot-based class type.
The diagram shows a number of different classes against several slot names, where successful combinations are marked by a dot: in a sense it is immaterial which axis is considered to hold classes and which holds slots, as their roles can be reversed from a syntactical point of view. One unexpected implication of this duality is that slot-based class wrappers will also work as slot wrappers. For instance, much as we can write
// register access to the interface of c
logger<vector<int>> c;
...
we can also write
// register invocations to emplace_back
logger<emplace_back_s> logged_emplace_back;
logged_emplace_back(c,100);
...
Or, if we make vector open-ended by accepting any slot, we can use logged_emplace_back with member function syntax:
template <typename T>
class open_ended
{
  T t;
public:
  template<typename ...Args>
  open_ended(Args&&... args):t(std::forward<Args>(args)...){}
      
  template<typename Slot,typename ...Args>
  auto operator()(Slot&& s,Args&&... args)
    ->decltype(s(t,std::forward<Args>(args)...))
  {
     return s(t,std::forward<Args>(args)...);
  }

  template<typename Slot,typename ...Args>
  auto operator()(Slot&& s,Args&&... args)const
    ->decltype(s(t,std::forward<Args>(args)...))
  {
     return s(t,std::forward<Args>(args)...);
  }
};
...
logger<emplace_back_s> logged_emplace_back;
open_ended<vector<int>> c;
c(logged_emplace_back,100);
...
By identifying class access with class invocation and reifying member function names into first-class functors, slot-based interfaces allow for a great deal of static composability and provide unified concepts blurring the distinction between classes and functions in ways that are of practical utility. The approach is not perfect: a non-standard syntax has to be resorted to, public inheritance needs special provisions and operators (as opposed to regular member functions, i.e. functions like operator=, operator!, etc.) are not covered. With these limitations in mind, slot-based interfaces can come in handy in specialized applications or frameworks requiring interface decoration, smart references or similar constructs.

Wednesday, December 4, 2013

Lookup cost in hash tables with duplicate elements

As a continuation of our previous entry on the statistical properties of hash tables with duplicate elements, let's now analyze how duplicates affect lookup times. Let N be the number of elements in the table, B the number of buckets and F = N/B the associated load factor. We have already learnt that the size S of a bucket follows (for large N) a Poisson distribution PF(n) with mean F. Beginning with the non-duplicate case, what is the average number of elements checked when searching for a given key k? In such general terms, the question is underspecified because lookup times differ depending on whether k is actually present in the table or not. The analysis can be done, though, considering each case in isolation.
Key is not present. This is easy to calculate: the number of elements checked is simply the average length of a bucket, which we know to be F.
Key is present. We assume that the value of k, regarded as a random variable, is evenly distributed among the values stored in the table, i.e. each element has the same probability of being chosen for lookup. Define
B(n) := Pr(k is in a bucket of size n).
It is easy to see that B(n) can be calculated as
B(n) = nPF(n)B/N = nPF(n)/F,
following from the simple fact that there are PF(n)B buckets of size n, each holding precisely n elements. In such a bucket, the number of elements checked when looking for k goes from 1 (k is the first element) to n (k is the last), and is thus in average (1 + ··· + n)/n = (n+1)/2. From this we can calculate in the general case the average number of elements checked as
Σn≥0 ((n+1)/2)B(n) =
= Σn≥0 ((n+1)/2)nPF(n)/F = (1/2F)Σn≥0 (n² + n)PF(n) =
= (1/2F)(E[S²] + E[S]) = (1/2F)(F² + F + F) = 1 + F/2,
where we have used the fact that the second moment of a Poisson distribution with mean F is F² + F.
We are now ready to study the case where the hash table holds M groups of duplicate elements, each group with average size G.
Key is not present. There are no differences with respect to the non-duplicate case; an entire bucket, whose average size is again F, is traversed when looking up the non-present key.
Key is present. In terms of distribution of elements, the table is equivalent to another one without duplicates where each group of equal elements is replaced by just one representative. This theoretical table has a load factor F' = F/G and an associated average lookup time 1 + F'/2. Now, searching in our original table incurs the same number of operations, except that all the individual checks before getting to k now turn into checks of an average of G equivalent elements. So, the total number of steps is
1 + GF'/2 = 1 + F/2,
exactly as in the non-duplicate case. How come we see no degradation despite elements being heavily concentrated in fewer buckets? The reason is that the distribution of groups in the table is governed by an equivalent load factor F' = F/G, smaller than F by a factor of G, and the probability that two or more groups of elements end up in the same bucket is consequently greatly reduced.
The hash table structures used by some implementations of C++ unordered associative containers such as Boost.Unordered allow for skipping of groups of equivalent elements. For these special hash tables, the average lookup steps are reduced to F/G, if the key is not present, and 1+F/2G, if it is.