The poly_collection template class we implemented for fast handling of polymorphic objects can potentially be sped up when some of the derived classes it manages are known to the compiler at the point of template instantiation. Let us use this auxiliary template class:
template<class Derived>
class poly_collection_static_segment
{
public:
  void insert(const Derived& x)
  {
    store.push_back(x);
  }

  template<typename F>
  void for_each(F& f)
  {
    std::for_each(store.begin(),store.end(),f);
  }

  template<typename F>
  void for_each(F& f)const
  {
    std::for_each(store.begin(),store.end(),f);
  }

private:
  std::vector<Derived> store;
};
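For reference, here is a minimal usage sketch of this auxiliary class on its own (assumed includes: <algorithm>, <iostream>, <string>, <vector>; note that for_each takes its functor by lvalue reference, so the lambda is given a name):

// sketch: standalone use of poly_collection_static_segment
poly_collection_static_segment<std::string> segment;
segment.insert(std::string("hello"));
segment.insert(std::string("world"));
auto print=[](const std::string& s){std::cout<<s<<"\n";};
segment.for_each(print); // prints "hello" and "world" in insertion order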
We can add the ability for poly_collection to accept a variable number of derived class types:
template<class Base,class ...Derivedn> class poly_collection { ... };
so that it includes a poly_collection_static_segment<Derivedi> component for each specified class, taking care of objects whose types coincide exactly with those specified, whereas the default dynamic handler is still used for the rest.
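The article does not show how the variadic poly_collection is assembled internally; the following is only a hypothetical sketch (the class name poly_collection_sketch and the tuple-of-segments layout are assumptions of mine, C++14 is used for brevity, and the dynamic fallback segment for unlisted types is omitted):

#include <cstddef>
#include <tuple>
#include <utility>

template<class Base,class... Derived>
class poly_collection_sketch
{
public:
  template<typename T>
  void insert(const T& x)
  {
    // route the object to the static segment for its exact type; a real
    // implementation would fall back to the dynamic handler when T is not
    // among Derived... (omitted here)
    std::get<poly_collection_static_segment<T>>(segments).insert(x);
  }

  template<typename F>
  void for_each(F f)
  {
    for_each_impl(f,std::index_sequence_for<Derived...>{});
  }

private:
  template<typename F,std::size_t... I>
  void for_each_impl(F& f,std::index_sequence<I...>)
  {
    // visit every static segment in order of specification
    int dummy[]={0,(std::get<I>(segments).for_each(f),0)...};
    (void)dummy;
  }

  std::tuple<poly_collection_static_segment<Derived>...> segments;
};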
What does this gain us? Consider the example:
class base
{
public:
  virtual void do_stuff()=0;
};

class derived1:public base{...};

...

poly_collection<base,derived1> c;
c.insert(derived1(0));
...
c.for_each([](base& x){ x.do_stuff(); });
For the segment handling derived1, the code in for_each resolves to:
std::for_each(store.begin(),store.end(),[](base& x){ x.do_stuff(); });
where store is of type std::vector<derived1>: so the compiler knows that x.do_stuff() is invoked on objects whose exact type is derived1 and can therefore omit the vtable lookup and even, if the definition of derived1::do_stuff is available at instantiation time, inline the call itself.
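Conceptually (this is only an illustration of the transformation the optimizer is allowed to make, not the output of any particular compiler), the loop becomes equivalent to:

for(derived1& x:store){
  x.derived1::do_stuff(); // qualified call: no virtual dispatch, inlinable
}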
Now, the fact that the compiler can apply these optimizations does not mean it necessarily will do so: the static analysis needed to determine the optimization opportunity is certainly not trivial. There are a number of techniques we can implement to help the compiler in the process:
1. Make derived1::do_stuff final.
2. Use a polymorphic lambda [](auto& x){x.do_stuff();} to omit the cast from derived1& to base&.
3. Do both 1 and 2 (both techniques are sketched below).
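For illustration, the two techniques look like this (a sketch building on the classes of the previous example; the body of do_stuff and the data members of derived1 are placeholders):

// technique 1: declare the override final so the compiler knows no
// further overriding can occur in any other translation unit
class derived1:public base
{
public:
  derived1(int n):n(n){}
  void do_stuff()final{/* ... */}
private:
  int n;
};

// technique 2: pass a polymorphic (generic) lambda so that each static
// segment invokes do_stuff() directly on derived1& instead of base&
poly_collection<base,derived1> c;
c.insert(derived1(0));
c.for_each([](auto& x){x.do_stuff();});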
I have written a test program (Boost required) that implements this new version of poly_collection and measures its for_each performance for the following scenarios:
1. poly_collection<base> (baseline)
2. poly_collection<base,derived1>
3. poly_collection<base,derived1,derived2>
4. poly_collection<base,derived1,derived2,derived3>
5. poly_collection<base,final_derived1>
6. poly_collection<base,final_derived1,final_derived2>
7. poly_collection<base,final_derived1,final_derived2,final_derived3>
8. same as 1, with a polymorphic functor
9. same as 2, with a polymorphic functor
10. same as 3, with a polymorphic functor
11. same as 4, with a polymorphic functor
12. same as 5, with a polymorphic functor
13. same as 6, with a polymorphic functor
14. same as 7, with a polymorphic functor
with containers of several sizes ranging from n = 1,000 to 10^7 elements (except where noted, results do not vary significantly across sizes). Values are the averages of execution times / number of elements for each scenario, normalized to the baseline = 1.
MSVC 2012
Microsoft Visual Studio 2012 using default release mode settings on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.
Normalized execution times / number of elements.
Differences in performance are not significant and can be regarded as mostly noise: in fact, a visual inspection of the generated assembly code reveals that virtual calls are never optimized.
MSVC 2013
Results by Lars Schouw: MSVC 2013 with default release settings on a box with an Intel Core i7-920 @2.67GHz.
Normalized execution times / number of elements.
Unlike with MSVC 2012, here there is a modest but consistent improvement in performance as more derived classes are statically specified. This is probably due to tighter traversal loops rather than devirtualization, except when a polymorphic lambda is combined with final virtual functions, where reductions of up to 25% seem to indicate that inlining is taking place.
Clang on Arch Linux
Results by Theodoros Theodoridis: Clang 3.4 with settings -std=c++11 -O3, Arch Linux x64 with an Intel Core i7-950 @3.07GHz.
Normalized execution times / number of elements.
In most cases, specifying derived classes either has no effect on performance or actually degrades it (the reason for the degradation is not clear to me). For polymorphic lambda + final virtual functions, however, the reductions in execution time are impressively large.
There is an odd effect that the aggregate figures of the graph do not show: in the last two scenarios, performance when n = 10^7 is still much better than the baseline but comparatively worse than for other container sizes.
Clang on Ubuntu 64
Results by plash: Clang 3.5.0 with settings -std=c++11 -stdlib=libc++ -lc++ -O3, on Ubuntu 13.04 x86_64 with an AMD Phenom II.
Normalized execution times / number of elements.
Specifying derived classes definitely improves execution times, although the resulting performance patterns and the underlying factors are not easy to interpret. For polymorphic lambda + final virtual functions we again have massive reductions in execution time. As with Clang 3.4, performance is comparatively poorer for n = 10^7, although here the effect is seen across all scenarios.
GCC on Arch Linux
Results by Theodoros Theodoridis: GCC 4.9.0 with settings -std=c++11 -O3, Arch Linux x64 with an Intel Core i7-950 @3.07GHz.
Normalized execution times / number of elements.
For this compiler, passing a monomorphic or a polymorphic lambda to for_each makes all the difference, with the latter achieving reductions in execution time of up to 60%. I do not have an explanation as to why using final virtual functions improves on the non-final case: in both situations inlining seems to be in effect, which should render the final qualification irrelevant. As with Clang, some cases of relative degradation when n = 10^7 are also seen for this compiler, here mostly concentrated in the last two scenarios.
(Update Aug 17, 2014) GCC 5 on Linux
Jan Hubička is working on integrating devirtualization capabilities into GCC, and the results in this area are improving accordingly. He ran the test benchmark against GCC mainline (as of Aug 15, 2014) with settings -O3 on a Linux machine with an Intel i5-4300U processor.
Normalized execution times / number of elements.
GCC's speculative devirtualization analysis is able to take full advantage of the optimization opportunities offered by static knowledge of the types involved, even to the point of rendering final decoration superfluous.
Conclusions
Allowing for the specification of a list of statically known derived class types as part of a poly_collection instantiation opens up possibilities for internal performance optimizations that do not impact the usage interface of this template class. The actual improvements achieved depend very much on the compiler being used, but in general are greater when passing polymorphic functors (such as generic lambdas) to for_each and qualifying virtual functions as final where possible.
Incidentally, it turns out that the implementation also accepts types that are not related by inheritance at all:
poly_collection<std::string,int,char> c;
c.insert(std::string("hello"));
c.insert(0);
c.insert('a');
c.for_each([](auto& x){
std::cout<<x<<"\n";
});
Output:
hello
0
a
This is more a happy accident (and one easily fixed) than a conscious design decision, but it hints at an area of exploration away from the original purpose of poly_collection, focused on handling heterogeneous classes via visitor patterns in a manner reminiscent of the techniques used in Boost.Variant.
Update Jan 17, 2015
Fixed a serious bug in the code that, fortunately, doesn't affect the test results.
Hello,
Interesting observations - I would be curious how the performance compares between GCC and Clang.
I have not had time to look into the details of your benchmark yet, but GCC 4.9 may be inlining speculatively without the final keyword - GCC 4.9 does so in cases where it finds only one possible target of the call, even though additional ones might exist in other compilation units.
http://hubicka.blogspot.ca/2014/02/devirtualization-in-c-part-4-analyzing.html
Hi Jan,
FYI, results for Clang 3.4 and GCC 4.9.0 were run on the same machine by Theodoros Theodoridis, so they can be compared. As it turns out, they're very alike for the baseline case (the rest you can derive from this and the graphs in the article, but if you're interested I can send you the entire datasheet via email):
poly_collection<base> GCC 4.9.0 (# elements: µs/element)
1000: 0.019342
10000: 0.018862
100000: 0.018902
10000000: 0.021752
poly_collection<base> Clang 3.4 (# elements: µs/element)
1000: 0.019369
10000: 0.018592
100000: 0.02207
10000000: 0.022296
It doesn't look to me like GCC is doing a good job at speculatively inlining, at least not in the monomorphic case, as performance is basically the same as the baseline modulo noise (even when final keywords are used).
Hi,
I just checked with GCC mainline on an i5-4300U CPU (running on battery in a bar :).
I get:
jan@linux-ujxe:~/trunk/build/gcc> ./a.out
for_each:
1000;0.0198776;0.0159347;0.0129508;0.00830357;0.0182472;0.0141009;0.00812702
10000;0.0176984;0.0149351;0.0126008;0.00854723;0.0171027;0.0129121;0.00877823
100000;0.0174399;0.0146567;0.0130957;0.00924514;0.0169892;0.0131333;0.00931866
10000000;0.0208471;0.0173063;0.0167201;0.0139997;0.0191265;0.0170114;0.0139088
poly_for_each:
1000;0.022814;0.0179252;0.0127548;0.00808398;0.0174698;0.0129332;0.00939567
10000;0.0214426;0.0168582;0.0127193;0.00849997;0.0144839;0.0113562;0.00837254
100000;0.0208492;0.0169348;0.0130949;0.00917836;0.0142415;0.0112331;0.0082145
10000000;0.0213977;0.019074;0.0164221;0.014096;0.0164575;0.0150328;0.0157039
That seems like progress in the specialized versions... With -fno-devirtualize-speculatively the scores (the second set of numbers below) are more flat.
Indeed, the extra cast to the base type in your first variant prevents front-end devirtualization from happening (since it needs to propagate the type information). I implemented some code for that in GCC 4.9 and more for 5.0.
jan@linux-ujxe:~/trunk/build/gcc> ./a.out
for_each:
1000;0.0235775;0.0228017;0.0198231;0.0204826;0.0195936;0.0194731;0.0205503
10000;0.0213635;0.0211344;0.0187591;0.0199181;0.0177108;0.0188326;0.0199056
100000;0.0209169;0.0209705;0.0186805;0.0198186;0.0174773;0.0187597;0.0199357
10000000;0.0217421;0.0215433;0.0197101;0.0202344;0.0181195;0.0193275;0.0205337
poly_for_each:
1000;0.0205234;0.0197122;0.0202281;0.0208116;0.0186996;0.0132652;0.00884317
10000;0.0181498;0.0178491;0.0190802;0.0202551;0.0171023;0.012599;0.00815125
100000;0.0177592;0.0178001;0.0189025;0.0201467;0.0166985;0.012598;0.00844246
10000000;0.0183719;0.01831;0.019438;0.020585;0.0189521;0.0161091;0.0135812
You can also check speculative devirtualization in GCC 4.9+ with -fdump-ipa-devirt.
Your testcase is certainly interesting; it seems that everything boils down to std::vector being implemented via direct pointer arithmetic, which makes the compiler think that a given memory location may hold either the type or any of its derived types (hence the need for final).
I also added a warning to help with this:
jan@linux-ujxe:~/trunk/build/gcc> ./xgcc -B ./ -O3 ~/poly.ii -march=native -S -std=c++11 -Wsuggest-final-methods
poly_collection2.cpp:296:15: warning: Declaring method ‘virtual int derived1::f(int) const’ final would enable devirtualization of 3 calls [-Wsuggest-final-methods]
poly_collection2.cpp:304:15: warning: Declaring method ‘virtual int derived2::f(int) const’ final would enable devirtualization of 2 calls [-Wsuggest-final-methods]
poly_collection2.cpp:312:15: warning: Declaring method ‘virtual int derived3::f(int) const’ final would enable devirtualization of 1 calls [-Wsuggest-final-methods]
Just to clarify - the first results are from -O3 compilation, the second from -O3 -fno-devirtualize-speculatively. Seems I cut&pasted it to the wrong place.
Hi Jan,
Thanks for the results, I'll try to save some time to include them (at least those with speculative devirtualization) in the article.