Sunday, May 4, 2014

Fast polymorphic collections

In C++, instances of classes derived from an abstract class are typically heap-allocated and managed through (smart) pointers because their exact types (and hence their sizes) are not known until run time —strictly speaking, the type of each object is known at the point of creation but then abstracted away, or erased, when passed along to the location of the program where it is stored and used. For this very reason, such objects can't be stored in STL collections directly, but only through an indirection. For instance, this is what the layout of a std::vector<std::unique_ptr<base>> could look like for some abstract base class:
Sometimes we are not interested in the particular order the objects are stored in as we only want to keep them together to interact with them in a uniform manner:
struct base
{
  virtual int do_stuff(int x)=0;
};
...
using base_pointer=std::unique_ptr<base>;
std::vector<base_pointer> c;
...
std::for_each(c.begin(),c.end(),[](base_pointer& p){
  p->do_stuff(1);
});
When c.size() is large the resulting performance is impacted by two negative factors:
  • Heap allocated objects might be scattered throughout the memory space, incurring many cache misses during traversal;
  • Modern architectures with branch prediction can optimize away the vtable lookup cost associated with virtual calls when these calls successively resolve to the same function, but this is not the case when objects of distinct derived classes are interspersed along the vector.
Let us explore a different structure where objects of the same derived class are held together in segments of contiguous memory:
Here, objects are appended to their associated memory segment at insertion time, for which we need to know their exact type at that point. The structure used to determine which segment each object is inserted into is then just a std::map from std::type_index to segment-handling objects. This is a minimal interface for such a data structure:
template<class Base>
class poly_collection
{
public:
  // only available if Derived is derived from Base
  template<class Derived>
  void insert(Derived& x);

  template<typename F>
  F for_each(F f);

  template<typename F>
  F for_each(F f)const;
};
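As an illustration, this is a minimal sketch of how such a structure might be implemented on top of that std::map (names and details are mine, not the actual test program's; std::function is used in the type-erased segment interface for brevity, where the real thing would avoid that extra indirection):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <typeindex>
#include <vector>

template<class Base>
class poly_collection
{
  // type-erased segment interface: each segment holds objects of
  // exactly one concrete derived type in contiguous memory
  struct segment
  {
    virtual ~segment(){}
    virtual void do_for_each(const std::function<void(Base&)>& f)=0;
  };

  template<class Derived>
  struct segment_impl:segment
  {
    std::vector<Derived> store; // contiguous storage for one derived type
    void do_for_each(const std::function<void(Base&)>& f)override
    {
      for(auto& x:store)f(x);
    }
  };

  std::map<std::type_index,std::unique_ptr<segment>> segments;

public:
  template<class Derived>
  void insert(const Derived& x)
  {
    // the exact type is known here, so we can route x to its segment
    auto& p=segments[std::type_index(typeid(Derived))];
    if(!p)p.reset(new segment_impl<Derived>());
    static_cast<segment_impl<Derived>*>(p.get())->store.push_back(x);
  }

  template<typename F>
  F for_each(F f)
  {
    // traverse segment by segment: objects of the same concrete type
    // are visited together
    for(auto& e:segments)
      e.second->do_for_each([&](Base& x){f(x);});
    return f;
  }
};
```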
Profiling
A testing program is provided that implements poly_collection and compares the performance of poly_collection<base> with respect to std::vector<std::unique_ptr<base>> for the following scenario:
struct base
{
  virtual int f(int x)const=0;
};

struct derived1:base
{
  derived1(int n):n(n){}
  virtual int f(int x)const{return x*n;}
  
  int n;
};

struct derived2:base
{
  derived2(int n):n(n){}
  virtual int f(int x)const{return x+n;}
  
  int unused,n;
};

struct derived3:base
{
  derived3(int n):n(n){}
  virtual int f(int x)const{return x-n;}
  
  int unused,n;
};

template<typename Collection>
int run_for_each(const Collection& c)
{
  int res=0;
  c.for_each([&res](const base& x){
    res+=x.f(1);
  });
  return res;
}
where the containers are filled with an assortment of n = 1,000 to 10⁷ objects of types derived1, derived2 and derived3 and then randomly shuffled to reflect the memory status of a program where objects have heterogeneous lifetimes.
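Schematically, the filling step for the std::vector case might look like this (a hypothetical sketch, not the actual test program; make_test_vector and the fixed seed are mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <random>
#include <vector>

struct base{virtual ~base(){}virtual int f(int x)const=0;};
struct derived1:base{derived1(int n):n(n){}int f(int x)const override{return x*n;}int n;};
struct derived2:base{derived2(int n):n(n){}int f(int x)const override{return x+n;}int unused,n;};
struct derived3:base{derived3(int n):n(n){}int f(int x)const override{return x-n;}int unused,n;};

// fill with a round-robin assortment of the three derived types, then
// shuffle so that objects of distinct classes end up interspersed,
// mimicking heterogeneous object lifetimes
std::vector<std::unique_ptr<base>> make_test_vector(std::size_t n)
{
  std::vector<std::unique_ptr<base>> v;
  v.reserve(n);
  for(std::size_t i=0;i<n;++i){
    switch(i%3){
      case 0: v.push_back(std::unique_ptr<base>(new derived1(int(i))));break;
      case 1: v.push_back(std::unique_ptr<base>(new derived2(int(i))));break;
      default:v.push_back(std::unique_ptr<base>(new derived3(int(i))));break;
    }
  }
  std::shuffle(v.begin(),v.end(),std::mt19937{12345});
  return v;
}
```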
MSVC on Windows
Test built with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz. Depicted values are microseconds/n, where n is the number of elements in the container.
for_each execution times / number of elements.
poly_collection behaves excellently and is virtually unaffected by the size of the container. For n < 10⁵, the differences in performance between poly_collection and a std::vector of std::unique_ptrs are due to worse virtual call branch prediction in the latter case; when n > 10⁵, massive cache misses add to the first degrading factor.
For MSVC 2013 on a box with an Intel Core i7-920 @2.67GHz, results (provided by Lars Schouw) are similar, though the sharp step in execution times for std::vector<std::unique_ptr<base>> is shifted from ~10⁵ to ~3·10⁵ due to the larger memory cache of the i7-920 processor.
for_each execution times / number of elements.
Clang on Ubuntu 64
Results provided by plash: Compiler Clang 3.5.0 with settings -std=c++11 -stdlib=libc++ -lc++ -O3, system Ubuntu 13.04 x86_64 with an AMD Phenom II X6 1090T.
for_each execution times / number of elements.
Results are even more striking here: while for small values of n poly_collection is only 2x faster than a std::vector of std::unique_ptrs (3x for MSVC), the performance ratio is 25x for n = 10⁷ (10-12x for MSVC).
GCC on Arch Linux
Results provided by Theodoros Theodoridis: Compiler GCC 4.9.0 with settings -std=c++11 -O3, system Arch Linux x64 with an Intel Core i7-950 @3.07GHz.
for_each execution times / number of elements.
Other environments
Request for help: if you can build (with release settings) and run the test program on some other environment/compiler and post the results back I'd be happy to include them in the entry.
Conclusions
Modern architectures, heavily accelerated by CPU memory caches, penalize typical patterns of C++ run-time polymorphism based on individual heap allocation of many heterogeneous objects implementing a given abstract interface. This can be improved dramatically by using containers that gather objects of the same concrete type in segments of contiguous memory, if this type is known at the point of insertion: poly_collection provides the foundational ideas for such a container. Note that poly_collection requires the classes dealt with to be CopyConstructible and MoveInsertable, which involve operations typically not used in classic patterns of C++ dynamic polymorphism.
There is still margin for improvement: if some of the concrete types deriving from base are known at container definition time, we can use this knowledge so that poly_collection benefits from devirtualization of these types; this we will explore in a later entry.
Update Jan 17, 2015
Fixed a serious bug in the code that, fortunately, doesn't affect the test results.

20 comments:

  1. Nice article! Here are my results.

    Environment: Arch Linux x64

    `g++ -std=c++11 -O3 ./poly_collection.cpp -o poly_collection && ./poly_collection`

    Output:

    `
    1000;0.0639611;0.0178826
    2000;0.0667656;0.0172973
    3100;0.0666778;0.0171172
    4310;0.0683208;0.0170331
    5641;0.0684578;0.0169607
    7105;0.0698626;0.016793
    8715;0.0729074;0.0168298
    10486;0.0746891;0.0168191
    12434;0.0754995;0.0168244
    14576;0.0764422;0.0167795
    16932;0.0778012;0.0167208
    19523;0.0774298;0.016751
    22373;0.0769457;0.0167479
    25508;0.077696;0.0167359
    28956;0.0779066;0.0166767
    32748;0.0773267;0.0167178
    36919;0.0778486;0.0166682
    41507;0.0786115;0.0166409
    46553;0.0784301;0.0166811
    52103;0.0796602;0.016921
    58208;0.0797421;0.016593
    ...
    `

  2. Hi SuperV

    Alas, your post has been truncated. Would you mind resending it to joaquin at tid.es along with info on the compiler and CPU versions?

    Thank you

  3. One thing I think you didn't address too closely is the fact that if your for_each is calling a virtual function you will likely invalidate the instruction cache every time you switch to an object of a different concrete class, so your class gains iteration performance by grouping like this - you'd do that invalidation a minimal number of times rather than potentially O(N) times (depending on insertion order) as the original does.
    It would be interesting to compare to std::vector. The latter maintains the original semantics (iteration in insertion order), keeps the objects in even fewer contiguous memory segments (one dynamic block, instead of one for every concrete class). You lose convenience because you need to know all the classes when you declare the container. It might be more widely applicable because they don't have to share a base class anymore. But the performance characteristics are notably different. Insertion would be cheaper especially for very small N because there's fewer allocations (imagine if there was one object of each type), element access would be slightly faster because of the O(logM) (M=number of classes) lookup (or O(M) iteration & size check if doing position-based lookup) and dereference in your class before you get to the vector lookup, vs. just doing the O(1) pointer-arithmetic vector lookup. But yeah, doing a for_each with a virtual function would be much faster with your class (depending on insertion order), and also the memory consumption could be significantly different depending on the relative sizes of the subclasses (variant is going to be the size of the largest one potentially wasting lots of memory, but if the sizes were all the same your class would actually be larger because of the extra bookkeeping).

  4. One thing I think you didn't address too closely is the fact that if your for_each is calling a virtual function you will likely invalidate the instruction cache every time you switch to an object of a different concrete class, so your class gains iteration performance by grouping like this - you'd do that invalidation a minimal number of times rather than potentially O(N) times (depending on insertion order) as the original does.

    I think this is covered in the bullet that reads: "Modern architectures with branch prediction can optimize away the vtable lookup cost associated to virtual calls[...]"

    It would be interesting to compare to std::vector.

    I take this to mean a contiguous chunk of memory containing the derived objects one after the other, right? Many problems with this, as I discuss below.

    The latter maintains the original semantics (iteration in insertion order), keeps the objects in even fewer contiguous memory segments (one dynamic block, instead of one for every concrete class). You lose convenience because you need to know all the classes when you declare the container.

    Correct, and add also the problem that you have to keep objects aligned to the most restrictive derived class, which can potentially use more memory than having one chunk per derived class.

    It might be more widely applicable because they don't have to share a base class anymore.

    Correct. This is essentially a std::vector<boost::variant<derived1,...,derivedn>>... as you mention later in your post.

    But the performance characteristics are notably different. Insertion would be cheaper especially for very small N because there's fewer allocations (imagine if there was one object of each type), element access would be slightly faster because of the O(logM) (M=number of classes) lookup (or O(M) iteration & size check if doing position-based lookup) and dereference in your class before you get to the vector lookup, vs. just doing the O(1) pointer-arithmetic vector lookup. But yeah, doing a for_each with a virtual function would be much faster with your class (depending on insertion order), and also the memory consumption could be significantly different depending on the relative sizes of the subclasses (variant is going to be the size of the largest one potentially wasting lots of memory, but if the sizes were all the same your class would actually be larger because of the extra bookkeeping).

    At the end of the day this is an entirely different beast. Having to know all the contained classes in advance defeats the very purpose of dynamic polymorphism.

    If you can drop insertion order as a requirement for the container, a better alternative to vector<variant<class1,...,classn>> would be a class holding n vectors, one for each classi, so that for_each is going to be much faster than the visitor pattern thing implemented by variant.
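For illustration, such a class holding one vector per contained type might be sketched as follows (hypothetical code of mine, not part of the original test program; it uses C++17 fold expressions for brevity, which postdate this entry):

```cpp
#include <tuple>
#include <vector>

// one std::vector per class, all known at compile time; for_each visits
// each vector in turn and calls f with the concrete static type, so no
// virtual dispatch (and no variant visitation) is involved at all
template<typename... Ts>
class static_poly_collection
{
  std::tuple<std::vector<Ts>...> vectors;

public:
  template<typename T>
  void insert(const T& x){std::get<std::vector<T>>(vectors).push_back(x);}

  template<typename F>
  F for_each(F f)
  {
    // fold over all the vectors, applying f element by element
    std::apply([&](auto&... vs){(...,[&]{for(auto& x:vs)f(x);}());},vectors);
    return f;
  }
};
```

Note that, unlike poly_collection, the element types need not share a base class, but insertion order across types is lost, as discussed above.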

  5. Is there a way to implement this without using RTTI?

  6. Is there a way to implement this without using RTTI?

    Not that I recommend it, but one can use some sort of "first come, first served" index dispatcher for types, as exemplified in

    http://coliru.stacked-crooked.com/a/5bab7d98b583f382
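The idea boils down to something like this (a simplified sketch along the lines of the linked example, not its exact code):

```cpp
#include <cstddef>

// RTTI-free type indexing: each distinct type T gets a unique integer
// the first time type_id<T>() is called ("first come, first served")
inline std::size_t next_type_id()
{
  static std::size_t counter=0;
  return counter++;
}

template<typename T>
std::size_t type_id()
{
  // assigned exactly once per T, on first call
  static const std::size_t id=next_type_id();
  return id;
}
```

One caveat: the resulting indices depend on call order, so they are not stable across runs or across dynamically loaded modules, unlike std::type_index.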

  7. Is there a way to implement this without using RTTI?

    It can be implemented with the TypeIndex[1] library, which is almost in Boost.
    I've tested it on MSVC 2012 with /GR- and got exactly the same performance as the RTTI std version.

    [1] http://apolukhin.github.io/type_index/index.html

  8. Interesting solution, at least if you don't need to look up the objects by index, but there are cases where that is not needed.

    I tend to design my code such that virtual functions have code in them that does more than in the examples here, so the overhead of the virtual function call is not important. But I realize this will not always be possible.

  9. Nice, but I'd rename poly_container::shuffle() to shuffle_each_segment(), since the order of segments themselves is maintained.

  10. Very useful.
    There is a memory leak though: the segment destructors should be made virtual otherwise the collection data is never freed (the collection holds pointers to the base segment class only).

  11. Hi Serge, thanks for spotting the bug, I've just fixed it and uploaded the code. The lack of a virtual destructor engenders undefined behavior, although, in this particular case and for the compilers tested, no memory leaks happen because the derived classes don't allocate dynamic memory (which doesn't mean the code wasn't buggy to begin with).
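Schematically, the fix amounts to the following (the segment/segment_impl names are illustrative, not the actual code):

```cpp
#include <memory>
#include <vector>

struct segment
{
  // without this virtual destructor, deleting a derived segment through
  // a segment* is undefined behavior: ~segment_impl (and hence the
  // vector's destructor) would never run
  virtual ~segment(){}
};

template<typename T>
struct segment_impl:segment
{
  std::vector<T> store; // correctly destroyed via the virtual destructor
};
```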

  12. Nice code. I was a bit annoyed, though: somehow I couldn't implement an iterator for this without a major performance drop compared to both the original poly_collection and the simple vector_ptr.

    Replies
    1. I tried again to reach a better performance with my modification of this container, where I use an iterator and I can make this whole container compatible with STL algorithms, for example. The results became much better; here's an extract from my release build run (Windows 7 64-bit, Core i7 mobile processor clocked at 3 GHz and 16 GB RAM):

      for_each:
      vector_ptr;poly_collection
      1000;0.0732999;0.0308788
      2000;0.0749965;0.0281373
      3100;0.0778615;0.0269457
      4310;0.0793592;0.0270851
      5641;0.0786436;0.0266359
      7105;0.0787332;0.0263393
      8715;0.0802934;0.0264099
      10486;0.0808019;0.0265479
      12434;0.083963;0.0269599
      14576;0.1003;0.0262659
      16932;0.0875782;0.0263974
      19523;0.0886515;0.0260781
      22373;0.0898212;0.0260811
      25508;0.0903302;0.0260797
      28956;0.0904529;0.0259674
      32748;0.0917269;0.0260768
      36919;0.0915638;0.0264518
      41507;0.0917573;0.0258712
      46553;0.0934502;0.0262559
      52103;0.0943214;0.026344
      58208;0.0925437;0.0263018
      64923;0.092984;0.0260634
      72309;0.0933847;0.0259765
      80433;0.0953131;0.0260175
      89369;0.0960014;0.0261469
      99198;0.0956152;0.0261458
      110009;0.0944218;0.0262231
      121901;0.0976262;0.0261558
      134982;0.111935;0.0262597
      149371;0.108656;0.0261098
      165198;0.113493;0.0262204
      182607;0.126589;0.0260252
      201756;0.139701;0.0262402
      222819;0.146582;0.0261239
      245988;0.150024;0.0261742
      271473;0.160681;0.0265803
      299506;0.162183;0.0260157
      330342;0.165536;0.026288
      364261;0.174865;0.0264457
      401571;0.178111;0.0279352
      442612;0.181734;0.0267827
      487757;0.181254;0.0266632
      537416;0.184124;0.0263927
      592040;0.182943;0.0266525
      652126;0.184627;0.0261169
      718220;0.186842;0.0261232
      790923;0.215298;0.0317105
      870896;0.196685;0.0261664
      958866;0.192063;0.0263706
      1055633;0.193821;0.0261598
      1162076;0.198371;0.0261848
      1279163;0.196398;0.0262802

      Not as fast as the for_each version (considering iterator creations, etc.), but I think I've reached a fine point and I can start to use it somewhere. Thanks again for sharing this idea.

    2. Thanks for commenting on this; indeed poly_collection iterators can't be made as efficient as built-in for_each, and I'm very interested in better understanding what you got at:

      1. I understand you're providing figures for vector_ptr::iterator vs. poly_collection::iterator. Can you maybe augment the text to show a third column with poly_collection::for_each to have a feeling of the penalty introduced by iterators?
      2. Can you share the code?

      Thank you

  13. Hi! Thanks for the reply. To your questions, then. I have uploaded a zip here with the cpp and an xlsx file with the data and a generated chart.

    https://www.dropbox.com/s/l3h5leus33jmgy8/poly_test.zip?dl=0

    The code might have some flaws, I just tried to reach a good performance first, then I have to clear out some things in it. I still have some fears, I would like to use it in a code base, which must compile in C++03 and doesn't have RTTI enabled in it (so I'll use boost's typeid generation logic here). Some tests will also be needed afterwards, I am curious about the results.

    Replies
    1. Thanks for the info and code. I see poly_collection::iterator performs around 30-40% worse than poly_collection::for_each. Such is life; this is expected, as poly_collection::iterator::operator++ has to do an equality check against the end of the segment (as you've implemented in next()) and then another check is made at the std::for_each level, whereas poly_collection::for_each does only one check per step.

      The only weird thing I've seen in your code is that you've redefined poly_collection::pointer as

      typedef std::shared_ptr<segment> pointer;

      when in the original code this is a std::unique_ptr. Any reason for that? We don't need the extra semantics of shared_ptr, and this is less efficient than a simple unique_ptr (although I don't think it'll affect this particular traversal test).

      FWIW, I have the intention to turn this tiny proof of concept into a robust library for Boost submission, some time in 2016. Stay tuned.

    2. Hi! Yep, the extra tests were the beasts here, but I had a big mistake in the code since my first comment and I didn't realize it. A comment in the uploaded code shows that point.

      Oh, the shared ptr thing, just ignore it. Unique ptr is the best choice of course. When I started to work with your code, I changed it to that C++03 + boost typeid conform I wrote above and the environment I wanted to put it in didn't have unique ptr. I just played with the smart ptr types we have there and I left shared ptr in it afterwards.

      About your intention: good luck with that, I am keen to see it in boost :)

    3. I forgot to point out a major problem in the code which I sent you. Iterator ++ can cause undefined behavior if you reach the end of the container and you try to ++ it. An extra if(vecPos) check or something should be added at the beginning, so if you have already reached the end, you won't run the code below, which can cause a crash. Pro: no crash, and as far as I can see it didn't cause any performance degradation. Con: well, iterator++ will never pass end; I am not quite sure that other containers behave the same. But I think it is not such a big issue. Using an iterator past the end is also undefined behavior, so well, that's life.

    4. Standard (and other) containers don't offer any kind of guarantee on iterating past the end, and library implementations freely assume that this behavior won't occur, which dispenses them from doing extra checks etc., especially if these would result in some performance degradation --for instance, the usual implementation of std::vector::iterator::operator++ is to, well, increment the pointer that stands for the iterator without regard to whether the end has been reached or not. So I think you needn't really add the if(vecPos) check. Besides, protecting against undefined behavior can effectively *mask* it, which in the end could be worse than letting the program fail spectacularly.

    5. Thanks, I couldn't find at first whether iterating past an "end" should be safe or not, but I found that it is indeed undefined behavior.
