Wednesday, May 7, 2014

Fast polymorphic collections with devirtualization

The poly_collection template class we implemented for fast handling of polymorphic objects can be potentially speeded up in the case where some of the derived classes it manages are known to the compiler at the point of template instantiation. Let us use this auxiliary template class:
template<class Derived>
class poly_collection_static_segment
{
public:
  void insert(const Derived& x)
  {
    store.push_back(x);
  }
  
  template<typename F>
  void for_each(F& f)
  {
    std::for_each(store.begin(),store.end(),f);
  }
  
  template<typename F>
  void for_each(F& f)const
  {
    std::for_each(store.begin(),store.end(),f);
  }

private:
  std::vector<Derived> store;
};
We can add the ability for poly_collection to accept a variable number of derived class types:
template<class Base,class ...Derivedn>
class poly_collection
{
  ...
};
so that it includes poly_collection_static_segment<Derivedi> components taking care of objects whose types concide exactly with those specified, whereas for the rest the default dynamic handler is still used.
What does this gain us? Consider the example:
class base
{
  virtual void do_stuff()=0;
};

class derived1:public base{...};
...
poly_collection<base,derived1> c;
c.insert(derived1(0));
...
c.for_each([](base& x){
  x.do_stuff();
});
For the segment handling derived1, the code in for_each resolves to:
std::for_each(store.begin(),store.end(),[](base& x){
  x.do_stuff();
});
where store is of type std::vector<derived1>: so, the compiler can know that x.do_stuff() is invoked on objects with exact type derived1 and hence omit vtable lookup and even, if the definition of derived1::do_stuff is available at instantiation time, inline the call itself.
Now, the fact that the compiler can apply these optimizations does not mean it necessarily will do so: the static analysis needed to determine the optimization opportunity is certainly not trivial. There are a number of techniques we can implement to help the compiler in the process:
  1. Make derived1::do_stuff final.
  2. Use a polymorphic lambda [](auto& x){x.do_stuff();} to omit the cast from derived1& to base&.
  3. Do both 1 and 2.
I have written a test program (Boost required) that implements this new version of poly_collection and measures its for_each performance for the following scenarios:
  1. poly_collection<base> (baseline)
  2. poly_collection<base,derived1>
  3. poly_collection<base,derived1,derived2>
  4. poly_collection<base,derived1,derived2,derived3>
  5. poly_collection<base,final_derived1>
  6. poly_collection<base,final_derived1,final_derived2>
  7. poly_collection<
      base,final_derived1,final_derived2,final_derived3
    >
  8. same as 1, with a polymorphic functor
  9. same as 2, with a polymorphic functor
  10. same as 3, with a polymorphic functor
  11. same as 4, with a polymorphic functor
  12. same as 5, with a polymorphic functor
  13. same as 6, with a polymorphic functor
  14. same as 7, with a polymorphic functor
with containers of several sizes ranging from n = 1,000 to 107 elements (except where noted, results do not vary significantly across sizes). Values are the averages of execution times / number of elements for each scenario, normalized to the baseline = 1.
MSVC 2012
Microsoft Visual Studio 2012 using default release mode settings on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.
Normalized execution times / number of elements.
Differences in performance are not significative and can be labeled as mostly noise: in fact, a visual inspection of the generated assembly code reveals that virtual calls are not optimized ever.
MSVC 2013
Results by Lars Schouw: MSVC 2013 with default release settings on an box with an Intel Core i7-920 @2.67GHz.
Normalized execution times / number of elements.
Unlike with MSVC 2012, here there is a modest but consistent improvement in performance as more derived classes are statically specified. This is probably due to tighter traversal loops rather than devirtualization, except when a polymorphic lambda is combined with final virtual functions, where reductions of up to 25% seem to indicate that inlining is taking place.
Clang on Arch Linux
Results by Theodoros Theodoridis: Clang 3.4 with settings -std=c++11 -O3, Arch Linux x64 with an Intel Core i7-950 @3.07GHz.
Normalized execution times / number of elements.
In most cases, specifying derived classes has no or a detrimental effect on performance (the reason for the degradation is not clear to me). For polymorphic lambda + final virtual functions, however, reductions in execution times are impressively large.
There is an odd effect that the aggregate figures of the graph do not show: in the last two scenarios, performance results when n = 107 are still much better than the baseline but comparatively worse than for other container sizes.
Clang on Ubuntu 64
Results by plash: Compiler Clang 3.5.0 with settings -std=c++11 -stdlib=libc++ -lc++ -O3, system Ubuntu 13.04 x86_64 with an AMD Phenom II.
Normalized execution times / number of elements.
Specification of derived classes definitely improves execution times, although the resulting performance patterns and underlying factors are not easily interpretable. For polymorphic lambda + final virtual functions we have again massive reductions in execution time. Similarly to the case with Clang 3.4, we have comparatively poorer performance for  n = 10, although here the effect is seen across all scenarios.
GCC on Arch Linux
Results by Theodoros Theodoridis: GCC 4.9.0 with settings -std=c++11 -O3, Arch Linux x64 with an Intel Core i7-950 @3.07GHz.
Normalized execution times / number of elements.
For this compiler, passing a monomorphic or a polymorpic lambda to for_each makes all the difference, with the latter case achieving reductions in execution time of up to 60%. I do not have an explanation as to why using final virtual functions improves on the non-final case: in both situations inlining seems to be in effect, which should render final qualification irrelevant. Like with Clang, some cases of relative degradation when  n = 107 are also seen for this compiler, here mostly concentrated on the last two scenarios.
(Update Aug 17, 2014) GCC 5 on Linux
Jan Hubička is working on integrating devirtualization capabilities in GCC, and the results on this area are improving accordingly. He ran the test benchmark against GCC mainline (as of Aug 15, 2014) with settings -O3 on a Linux machine with an Intel i5-4300U processor.
Normalized execution times / number of elements.
GCC's speculative devirtualization analysis is able to take full advantage of the optimization opportunities offered by static knowledge of the types involved, even to the point of rendering final decoration superfluous.
Conclusions
Allowing for the specification of a list of statically known derived class types as part of a poly_collection instantiation opens up possibilities for internal performance optimizations that do not impact the usage interface of this template class. The actual improvements achieved depend very much on the compiler being used, but in general are greater when passing polymorphic functors (such as generic lambdas) to for_each and qualifying virtual functions as final where possible.
Postscript: The observant reader might have noticed that poly_collection_static_segment<Derivedi> does not explicitly require that Derivedi be an actual derived class from Base, and in fact one can pass any type as long as we use a polymorphic functor with for_each that accepts it:
poly_collection<std::string,int,char> c;
c.insert(std::string("hello"));
c.insert(0);
c.insert('a');
c.for_each([](auto& x){
  std::cout<<x<<"\n";
});

Output:
hello
0
a
This comes more as a (easily fixable) happy accident than from a conscious design decision, but hints at an area of exploration away from the original purpose of poly_collection, focused on handling heterogeneous classes via visitor patterns in a manner reminiscent of the techniques used in Boost.Variant.
Update Jan 17, 2015
Fixed a serious bug in the code, that, fortunately, doesn't affect the test results.

5 comments :

  1. Hello,
    interesting observations - I would be curious how the performance compare between GCC and clang.

    I did not had time to look into detail of your benchmark yet, but GCC 4.9 may be inlining speculatively without the final keyword - GCC 4.9 does so in cases it finds only one possible target of the call but additional one might exist in other compilation units.

    http://hubicka.blogspot.ca/2014/02/devirtualization-in-c-part-4-analyzing.html

    ReplyDelete
  2. Hi Jan,

    FYI, results for Clang 3.4 and GCC 4.9.0 were run on the same machine by Theodoros Theodoridis, so they can be compared. As it turns out, they're very alike for the baseline case (the rest you can derive from this and the graphs in the article, but if you're interested I can send you the entire datasheet via email):

    poly_collection<base> GCC 4.9.0 (# elements: µs/element)
    1000: 0.019342
    10000: 0.018862
    100000: 0.018902
    10000000: 0.021752

    poly_collection<base> Clang 3.4 (# elements: µs/element)
    1000: 0.019369
    10000: 0.018592
    100000: 0.02207
    10000000: 0.022296

    It doesn't look to me like GCC is doing a good job at speculatively inlining, at least not in the monomorphic case, as performance is basically the same as the baseline modulo noise (even when final keywords are used).

    ReplyDelete
  3. Hi,
    I just checked with GCC baseline at i5-4300U CPU (running on battery in bar :).

    I get:
    jan@linux-ujxe:~/trunk/build/gcc> ./a.out
    for_each:
    1000;0.0198776;0.0159347;0.0129508;0.00830357;0.0182472;0.0141009;0.00812702
    10000;0.0176984;0.0149351;0.0126008;0.00854723;0.0171027;0.0129121;0.00877823
    100000;0.0174399;0.0146567;0.0130957;0.00924514;0.0169892;0.0131333;0.00931866
    10000000;0.0208471;0.0173063;0.0167201;0.0139997;0.0191265;0.0170114;0.0139088
    poly_for_each:
    1000;0.022814;0.0179252;0.0127548;0.00808398;0.0174698;0.0129332;0.00939567
    10000;0.0214426;0.0168582;0.0127193;0.00849997;0.0144839;0.0113562;0.00837254
    100000;0.0208492;0.0169348;0.0130949;0.00917836;0.0142415;0.0112331;0.0082145
    10000000;0.0213977;0.019074;0.0164221;0.014096;0.0164575;0.0150328;0.0157039

    that seems like a progress in the specialized versions... With -fno-devirtualize-speculatively the scores all:



    that is more flat.

    Indeed the extra cast into basetype in your first variant prevents front-end devirtualization from happening (since it needs to propagate the type information). I implemented some code for that in GCC in 4.9 and more for 5.0.

    jan@linux-ujxe:~/trunk/build/gcc> ./a.out

    for_each:
    1000;0.0235775;0.0228017;0.0198231;0.0204826;0.0195936;0.0194731;0.0205503
    10000;0.0213635;0.0211344;0.0187591;0.0199181;0.0177108;0.0188326;0.0199056
    100000;0.0209169;0.0209705;0.0186805;0.0198186;0.0174773;0.0187597;0.0199357
    10000000;0.0217421;0.0215433;0.0197101;0.0202344;0.0181195;0.0193275;0.0205337
    poly_for_each:
    1000;0.0205234;0.0197122;0.0202281;0.0208116;0.0186996;0.0132652;0.00884317
    10000;0.0181498;0.0178491;0.0190802;0.0202551;0.0171023;0.012599;0.00815125
    100000;0.0177592;0.0178001;0.0189025;0.0201467;0.0166985;0.012598;0.00844246
    10000000;0.0183719;0.01831;0.019438;0.020585;0.0189521;0.0161091;0.0135812

    You can also check speculative devirtualization at GCC 4.9+ with -fdump-ipa-devirt

    Your testcase is certainly interesting, it seems that everything boils down to std's vector being implemented by direct pointer arithmetic that makes compiler to think that given memory location may hold either the type or any of its derived types (thus need for final)

    I also added a warning to help with this:
    jan@linux-ujxe:~/trunk/build/gcc> ./xgcc -B ./ -O3 ~/poly.ii -march=native -S -std=c++11 -Wsuggest-final-methods
    poly_collection2.cpp:296:15: warning: Declaring method ‘virtual int derived1::f(int) const’ final would enable devirtualization of 3 calls [-Wsuggest-final-methods]
    poly_collection2.cpp:304:15: warning: Declaring method ‘virtual int derived2::f(int) const’ final would enable devirtualization of 2 calls [-Wsuggest-final-methods]
    poly_collection2.cpp:312:15: warning: Declaring method ‘virtual int derived3::f(int) const’ final would enable devirtualization of 1 calls [-Wsuggest-final-methods]

    ReplyDelete
  4. Just to clarify - the first results are from -O3 compilation, the second from -O3 -fno-devirtualize-speculatively. Seems I cut&pasted it to the wrong place.

    ReplyDelete
  5. Hi Jan

    Thanks for the results, I'll try to save some time to include them (at least those with speculative devirtualization) in the article.

    ReplyDelete