Sunday, May 4, 2014

Fast polymorphic collections

In C++, instances of classes derived from an abstract class are typically heap-allocated and managed through (smart) pointers because their exact types (and hence their sizes) are not known until run time —strictly speaking, the type of each object is known at the point of creation but then abstracted away, or erased, when passed along to the location of the program where it is stored and used. For this very reason, such objects can't be stored in STL collections directly, but only through an indirection. For instance, this is what the layout of a std::vector<std::unique_ptr<base>> could look like for some abstract base class:
Sometimes we are not interested in the particular order the objects are stored in as we only want to keep them together to interact with them in a uniform manner:
struct base
{
  virtual int do_stuff(int x)=0;
};
...
using base_pointer=std::unique_ptr<base>;
std::vector<base_pointer> c;
...
std::for_each(c.begin(),c.end(),[](base_pointer& p){
  p->do_stuff(1);
});
When c.size() is large the resulting performance is impacted by two negative factors:
  • Heap allocated objects might be scattered throughout the memory space, incurring many cache misses during traversal;
  • Modern architectures with branch prediction can optimize away the vtable lookup cost associated with virtual calls when these calls successively resolve to the same function, but this is not the case when objects of distinct derived classes are interspersed along the vector.
Let us explore a different structure where objects of the same derived class are held together in segments of contiguous memory:
Here, objects are appended to their associated memory segment at insertion time, for which we need to know their exact type at that point. The structure used to determine which segment each object is inserted into is then just a std::map from std::type_index to segment-handling objects. This is a minimal interface for such a data structure:
template<class Base>
class poly_collection
{
public:
  // only available if Derived is derived from Base
  template<class Derived>
  void insert(Derived& x);

  template<typename F>
  F for_each(F f);

  template<typename F>
  F for_each(F f)const;
};
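As an illustration, this is a minimal sketch of how such a structure might be implemented on top of that std::map (names and details are mine, not the actual test program's; std::function is used in the type-erased segment interface for brevity, where the real thing would avoid that extra indirection):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <typeindex>
#include <vector>

template<class Base>
class poly_collection
{
  // type-erased segment interface: each segment holds objects of
  // exactly one concrete derived type in contiguous memory
  struct segment
  {
    virtual ~segment(){}
    virtual void do_for_each(const std::function<void(Base&)>& f)=0;
  };

  template<class Derived>
  struct segment_impl:segment
  {
    std::vector<Derived> store; // contiguous storage for one derived type
    void do_for_each(const std::function<void(Base&)>& f)override
    {
      for(auto& x:store)f(x);
    }
  };

  std::map<std::type_index,std::unique_ptr<segment>> segments;

public:
  template<class Derived>
  void insert(const Derived& x)
  {
    // the exact type is known here, so we can route x to its segment
    auto& p=segments[std::type_index(typeid(Derived))];
    if(!p)p.reset(new segment_impl<Derived>());
    static_cast<segment_impl<Derived>*>(p.get())->store.push_back(x);
  }

  template<typename F>
  F for_each(F f)
  {
    // traverse segment by segment: objects of the same concrete type
    // are visited together
    for(auto& e:segments)
      e.second->do_for_each([&](Base& x){f(x);});
    return f;
  }
};
```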
Profiling
A testing program is provided that implements poly_collection and compares the performance of poly_collection<base> with respect to std::vector<std::unique_ptr<base>> for the following scenario:
struct base
{
  virtual int f(int x)const=0;
};

struct derived1:base
{
  derived1(int n):n(n){}
  virtual int f(int x)const{return x*n;}
  
  int n;
};

struct derived2:base
{
  derived2(int n):n(n){}
  virtual int f(int x)const{return x+n;}
  
  int unused,n;
};

struct derived3:base
{
  derived3(int n):n(n){}
  virtual int f(int x)const{return x-n;}
  
  int unused,n;
};

template<typename Collection>
int run_for_each(const Collection& c)
{
  int res=0;
  c.for_each([&res](const base& x){
    res+=x.f(1);
  });
  return res;
}
where the containers are filled with an assortment of n = 1,000 to 10⁷ objects of types derived1, derived2 and derived3 and then randomly shuffled to reflect the memory status of a program where objects have heterogeneous lifetimes.
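Schematically, the filling step for the std::vector case might look like this (a hypothetical sketch, not the actual test program; make_test_vector and the fixed seed are mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <random>
#include <vector>

struct base{virtual ~base(){}virtual int f(int x)const=0;};
struct derived1:base{derived1(int n):n(n){}int f(int x)const override{return x*n;}int n;};
struct derived2:base{derived2(int n):n(n){}int f(int x)const override{return x+n;}int unused,n;};
struct derived3:base{derived3(int n):n(n){}int f(int x)const override{return x-n;}int unused,n;};

// fill with a round-robin assortment of the three derived types, then
// shuffle so that objects of distinct classes end up interspersed,
// mimicking heterogeneous object lifetimes
std::vector<std::unique_ptr<base>> make_test_vector(std::size_t n)
{
  std::vector<std::unique_ptr<base>> v;
  v.reserve(n);
  for(std::size_t i=0;i<n;++i){
    switch(i%3){
      case 0: v.push_back(std::unique_ptr<base>(new derived1(int(i))));break;
      case 1: v.push_back(std::unique_ptr<base>(new derived2(int(i))));break;
      default:v.push_back(std::unique_ptr<base>(new derived3(int(i))));break;
    }
  }
  std::shuffle(v.begin(),v.end(),std::mt19937{12345});
  return v;
}
```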
MSVC on Windows
Test built with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz. Depicted values are microseconds/n, where n is the number of elements in the container.
for_each execution times / number of elements.
poly_collection behaves excellently and is virtually unaffected by the size of the container. For n < 10⁵, the differences in performance between poly_collection and a std::vector of std::unique_ptrs are due to worse virtual call branch prediction in the latter case; when n > 10⁵, massive cache misses add to the first degrading factor.
For MSVC 2013 on a box with an Intel Core i7-920 @2.67GHz, results (provided by Lars Schouw) are similar, though the sharp step in execution times for std::vector<std::unique_ptr<base>> is shifted from ~10⁵ to ~3·10⁵ due to the larger memory cache of the i7-920 processor.
for_each execution times / number of elements.
Clang on Ubuntu 64
Results provided by plash: Compiler Clang 3.5.0 with settings -std=c++11 -stdlib=libc++ -lc++ -O3, system Ubuntu 13.04 x86_64 with an AMD Phenom II X6 1090T.
for_each execution times / number of elements.
Results are even more striking here: while for small values of n poly_collection is only 2x faster than a std::vector of std::unique_ptrs (3x for MSVC), the performance ratio is 25x for n = 10⁷ (10-12x for MSVC).
GCC on Arch Linux
Results provided by Theodoros Theodoridis: Compiler GCC 4.9.0 with settings -std=c++11 -O3, system Arch Linux x64 with an Intel Core i7-950 @3.07GHz.
for_each execution times / number of elements.
Other environments
Request for help: if you can build (with release settings) and run the test program on some other environment/compiler and post the results back I'd be happy to include them in the entry.
Conclusions
Modern architectures, heavily accelerated by CPU memory caches, penalize typical patterns of C++ run-time polymorphism based on individual heap allocation of many heterogeneous objects implementing a given abstract interface. This can be improved dramatically by using containers that gather objects of the same concrete type in segments of contiguous memory, if this type is known at the point of insertion: poly_collection provides the foundational ideas for such a container. Note that poly_collection requires the classes dealt with to be CopyConstructible and MoveInsertable, which involve operations typically not used in classic patterns of C++ dynamic polymorphism.
There is still margin for improvement: if some of the concrete types deriving from base are known at container definition time, we can use this knowledge so that poly_collection benefits from devirtualization of these types; this we will explore in a later entry.
Update Jan 17, 2015
Fixed a serious bug in the code that, fortunately, doesn't affect the test results.

20 comments:

  1. Nice article! Here are my results.

    Environment: Arch Linux x64

    `g++ -std=c++11 -O3 ./poly_collection.cpp -o poly_collection && ./poly_collection`

    Output:

    `
    1000;0.0639611;0.0178826
    2000;0.0667656;0.0172973
    3100;0.0666778;0.0171172
    4310;0.0683208;0.0170331
    5641;0.0684578;0.0169607
    7105;0.0698626;0.016793
    8715;0.0729074;0.0168298
    10486;0.0746891;0.0168191
    12434;0.0754995;0.0168244
    14576;0.0764422;0.0167795
    16932;0.0778012;0.0167208
    19523;0.0774298;0.016751
    22373;0.0769457;0.0167479
    25508;0.077696;0.0167359
    28956;0.0779066;0.0166767
    32748;0.0773267;0.0167178
    36919;0.0778486;0.0166682
    41507;0.0786115;0.0166409
    46553;0.0784301;0.0166811
    52103;0.0796602;0.016921
    58208;0.0797421;0.016593
    ...
    `

  2. Hi SuperV

    Alas, your post has been truncated. Would you mind resending it to joaquin at tid.es along with info on the compiler and CPU versions?

    Thank you

  3. One thing I think you didn't address too closely is the fact that if your for_each is calling a virtual function you will likely invalidate the instruction cache every time you switch to an object of a different concrete class, so your class gains iteration performance by grouping like this - you'd do that invalidation a minimal number of times rather than potentially O(N) times (depending on insertion order) as the original does.
    It would be interesting to compare to std::vector. The latter maintains the original semantics (iteration in insertion order), keeps the objects in even fewer contiguous memory segments (one dynamic block, instead of one for every concrete class). You lose convenience because you need to know all the classes when you declare the container. It might be more widely applicable because they don't have to share a base class anymore. But the performance characteristics are notably different. Insertion would be cheaper especially for very small N because there's fewer allocations (imagine if there was one object of each type), element access would be slightly faster because of the O(logM) (M=number of classes) lookup (or O(M) iteration & size check if doing position-based lookup) and dereference in your class before you get to the vector lookup, vs. just doing the O(1) pointer-arithmetic vector lookup. But yeah, doing a for_each with a virtual function would be much faster with your class (depending on insertion order), and also the memory consumption could be significantly different depending on the relative sizes of the subclasses (variant is going to be the size of the largest one potentially wasting lots of memory, but if the sizes were all the same your class would actually be larger because of the extra bookkeeping).

  4. One thing I think you didn't address too closely is the fact that if your for_each is calling a virtual function you will likely invalidate the instruction cache every time you switch to an object of a different concrete class, so your class gains iteration performance by grouping like this - you'd do that invalidation a minimal number of times rather than potentially O(N) times (depending on insertion order) as the original does.

    I think this is covered in the bullet that reads: "Modern architectures with branch prediction can optimize away the vtable lookup cost associated to virtual calls[...]"

    It would be interesting to compare to std::vector.

    I take this to mean a contiguous chunk of memory containing the derived objects one after the other, right? Many problems with this, as I discuss below.

    The latter maintains the original semantics (iteration in insertion order), keeps the objects in even fewer contiguous memory segments (one dynamic block, instead of one for every concrete class). You lose convenience because you need to know all the classes when you declare the container.

    Correct, and add also the problem that you have to keep objects aligned to the most restrictive derived class, which can potentially use more memory than having one chunk per derived class.

    It might be more widely applicable because they don't have to share a base class anymore.

    Correct. This is essentially a std::vector<boost::variant<derived1,...,derivedn>>... as you mention later in your post.

    But the performance characteristics are notably different. Insertion would be cheaper especially for very small N because there's fewer allocations (imagine if there was one object of each type), element access would be slightly faster because of the O(logM) (M=number of classes) lookup (or O(M) iteration & size check if doing position-based lookup) and dereference in your class before you get to the vector lookup, vs. just doing the O(1) pointer-arithmetic vector lookup. But yeah, doing a for_each with a virtual function would be much faster with your class (depending on insertion order), and also the memory consumption could be significantly different depending on the relative sizes of the subclasses (variant is going to be the size of the largest one potentially wasting lots of memory, but if the sizes were all the same your class would actually be larger because of the extra bookkeeping).

    At the end of the day this is an entirely different beast. Having to know all the contained classes in advance defeats the very purpose of dynamic polymorphism.

    If you can drop insertion order as a requirement for the container, a better alternative to vector<variant<class1,...,classn>> would be a class holding n vectors, one for each classi, so that for_each is going to be much faster than the visitor pattern thing implemented by variant.
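For illustration, such a class holding one vector per contained type might be sketched as follows (hypothetical code of mine, not part of the original test program; it uses C++17 fold expressions for brevity, which postdate this entry):

```cpp
#include <tuple>
#include <vector>

// one std::vector per class, all known at compile time; for_each visits
// each vector in turn and calls f with the concrete static type, so no
// virtual dispatch (and no variant visitation) is involved at all
template<typename... Ts>
class static_poly_collection
{
  std::tuple<std::vector<Ts>...> vectors;

public:
  template<typename T>
  void insert(const T& x){std::get<std::vector<T>>(vectors).push_back(x);}

  template<typename F>
  F for_each(F f)
  {
    // fold over all the vectors, applying f element by element
    std::apply([&](auto&... vs){(...,[&]{for(auto& x:vs)f(x);}());},vectors);
    return f;
  }
};
```

Note that, unlike poly_collection, the element types need not share a base class, but insertion order across types is lost, as discussed above.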

  5. Is there a way to implement this without using RTTI?

  6. Is there a way to implement this without using RTTI?

    Not that I recommend it, but one can use some sort of "first come, first served" index dispatcher for types, as exemplified in

    http://coliru.stacked-crooked.com/a/5bab7d98b583f382
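The idea boils down to something like this (a simplified sketch along the lines of the linked example, not its exact code):

```cpp
#include <cstddef>

// RTTI-free type indexing: each distinct type T gets a unique integer
// the first time type_id<T>() is called ("first come, first served")
inline std::size_t next_type_id()
{
  static std::size_t counter=0;
  return counter++;
}

template<typename T>
std::size_t type_id()
{
  // assigned exactly once per T, on first call
  static const std::size_t id=next_type_id();
  return id;
}
```

One caveat: the resulting indices depend on call order, so they are not stable across runs or across dynamically loaded modules, unlike std::type_index.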

  7. Is there a way to implement this without using RTTI?

    It can be implemented with the TypeIndex[1] library, which is almost in Boost.
    I've tested it on MSVC 2012 with /GR- and got exactly the same performance as the RTTI std version.

    [1] http://apolukhin.github.io/type_index/index.html

  8. Interesting solution, at least if you don't need to look up the objects by index, but there are cases where that is not needed.

    I tend to design my code such that virtual functions have code in them that does more than in the examples here, so the overhead of the virtual function call is not important. But I realize this will not always be possible.

  9. Nice, but I'd rename poly_container::shuffle() to shuffle_each_segment(), since the order of segments themselves is maintained.

  10. Very useful.
    There is a memory leak though: the segment destructors should be made virtual otherwise the collection data is never freed (the collection holds pointers to the base segment class only).

  11. Hi Serge, thanks for spotting the bug, I've just fixed it and uploaded the code. The lack of a virtual destructor engenders undefined behavior, although, in this particular case and for the compilers tested, no memory leaks happen because the derived classes don't allocate dynamic memory (which doesn't mean the code wasn't buggy to begin with).
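Schematically, the fix amounts to the following (the segment/segment_impl names are illustrative, not the actual code):

```cpp
#include <memory>
#include <vector>

struct segment
{
  // without this virtual destructor, deleting a derived segment through
  // a segment* is undefined behavior: ~segment_impl (and hence the
  // vector's destructor) would never run
  virtual ~segment(){}
};

template<typename T>
struct segment_impl:segment
{
  std::vector<T> store; // correctly destroyed via the virtual destructor
};
```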

  12. Nice code. I was a bit annoyed, though: somehow I couldn't implement an iterator for this without a major performance drop compared to both the original poly_collection and the simple vector_ptr.

    Replies
    1. I tried again to reach a better performance with my modification of this container, where I use an iterator and I can make this whole container compatible with STL algorithms, for example. The results became much better; here's an extract from my release build run (Windows 7 64-bit, Core i7 mobile processor clocked at 3 GHz and 16 GB RAM):

      for_each:
      vector_ptr;poly_collection
      1000;0.0732999;0.0308788
      2000;0.0749965;0.0281373
      3100;0.0778615;0.0269457
      4310;0.0793592;0.0270851
      5641;0.0786436;0.0266359
      7105;0.0787332;0.0263393
      8715;0.0802934;0.0264099
      10486;0.0808019;0.0265479
      12434;0.083963;0.0269599
      14576;0.1003;0.0262659
      16932;0.0875782;0.0263974
      19523;0.0886515;0.0260781
      22373;0.0898212;0.0260811
      25508;0.0903302;0.0260797
      28956;0.0904529;0.0259674
      32748;0.0917269;0.0260768
      36919;0.0915638;0.0264518
      41507;0.0917573;0.0258712
      46553;0.0934502;0.0262559
      52103;0.0943214;0.026344
      58208;0.0925437;0.0263018
      64923;0.092984;0.0260634
      72309;0.0933847;0.0259765
      80433;0.0953131;0.0260175
      89369;0.0960014;0.0261469
      99198;0.0956152;0.0261458
      110009;0.0944218;0.0262231
      121901;0.0976262;0.0261558
      134982;0.111935;0.0262597
      149371;0.108656;0.0261098
      165198;0.113493;0.0262204
      182607;0.126589;0.0260252
      201756;0.139701;0.0262402
      222819;0.146582;0.0261239
      245988;0.150024;0.0261742
      271473;0.160681;0.0265803
      299506;0.162183;0.0260157
      330342;0.165536;0.026288
      364261;0.174865;0.0264457
      401571;0.178111;0.0279352
      442612;0.181734;0.0267827
      487757;0.181254;0.0266632
      537416;0.184124;0.0263927
      592040;0.182943;0.0266525
      652126;0.184627;0.0261169
      718220;0.186842;0.0261232
      790923;0.215298;0.0317105
      870896;0.196685;0.0261664
      958866;0.192063;0.0263706
      1055633;0.193821;0.0261598
      1162076;0.198371;0.0261848
      1279163;0.196398;0.0262802

      Not as fast as the for_each version (considering iterator creations, etc.), but I think I've reached a fine point and I can start to use it somewhere. Thanks again for sharing this idea.

    2. Thanks for commenting on this; indeed poly_collection iterators can't be made as efficient as built-in for_each, and I'm very interested in better understanding what you got at:

      1. I understand you're providing figures for vector_ptr::iterator vs. poly_collection::iterator. Can you maybe augment the text to show a third column with poly_collection::for_each to have a feeling of the penalty introduced by iterators?
      2. Can you share the code?

      Thank you

  13. Hi! Thanks for the reply. To your questions, then. I have uploaded a zip here with the cpp and an xlsx file with the data and a generated chart.

    https://www.dropbox.com/s/l3h5leus33jmgy8/poly_test.zip?dl=0

    The code might have some flaws, I just tried to reach a good performance first, then I have to clear out some things in it. I still have some fears, I would like to use it in a code base, which must compile in C++03 and doesn't have RTTI enabled in it (so I'll use boost's typeid generation logic here). Some tests will also be needed afterwards, I am curious about the results.

    Replies
    1. Thanks for the info and code. I see poly_collection::iterator performs around 30-40% worse than poly_collection::for_each. Such is life; this is expected, as poly_collection::iterator::operator++ has to do an equality check against the end of the segment (as you've implemented in next()) and then another check is made at the std::for_each level, whereas poly_collection::for_each does only one check per step.

      The only weird thing I've seen in your code is that you've redefined poly_collection::pointer as

      typedef std::shared_ptr<segment> pointer;

      when in the original code this is a std::unique_ptr. Any reason for that? We don't need the extra semantics of shared_ptr, and this is less efficient than a simple unique_ptr (although I don't think it'll affect this particular traversal test).

      FWIW, I have the intention to turn this tiny proof of concept into a robust library for Boost submission, some time in 2016. Stay tuned.

    2. Hi! Yep, the extra tests were the beasts here, but I had a big mistake in the code since my first comment and I didn't realize it. A comment in the uploaded code shows that point.

      Oh, the shared ptr thing, just ignore it. Unique ptr is the best choice of course. When I started to work with your code, I changed it to that C++03 + boost typeid conform I wrote above and the environment I wanted to put it in didn't have unique ptr. I just played with the smart ptr types we have there and I left shared ptr in it afterwards.

      About your intention: good luck with that, I am keen to see it in boost :)

    3. I forgot to point out a major problem in the code which I sent you. Iterator ++ can cause undefined behavior if you reach the end of the container and you try to ++ it. An extra if(vecPos) check or something should be added at the beginning, so if you have already reached the end, you won't run the code below, which can cause a crash. Pro: no crash, and as far as I can see it didn't cause any performance degradation. Con: well, iterator++ will never pass end; I am not quite sure that other containers behave the same. But I think it is not such a big issue. Using an iterator past the end is also undefined behavior, so well, that's life.

    4. Standard (and other) containers don't offer any kind of guarantee on iterating past the end, and library implementations freely assume that this behavior won't occur, which dispenses them from doing extra checks etc., especially if these would result in some performance degradation --for instance, the usual implementation of std::vector::iterator::operator++ is to, well, increment the pointer that stands for the iterator without regard to whether the end has been reached or not. So I think you needn't really add the if(vecPos) check. Besides, protecting against undefined behavior can effectively *mask* it, which in the end could be worse than letting the program fail spectacularly.

    5. Thanks, I couldn't find at first whether iterating past an "end" should be safe or not, but I found that it is indeed undefined behavior.
