Bannalia: trivial notes on themes diverse: Fast polymorphic collections with devirtualization

The poly_collection template class we implemented for fast handling of polymorphic objects can potentially be sped up when some of the derived classes it manages are known to the compiler at the point of template instantiation. Let us use this auxiliary template class:

template<class Derived>
class poly_collection_static_segment
{
public:
  void insert(const Derived& x)
  {
    store.push_back(x);
  }
  
  template<typename F>
  void for_each(F& f)
  {
    std::for_each(store.begin(),store.end(),f);
  }
  
  template<typename F>
  void for_each(F& f)const
  {
    std::for_each(store.begin(),store.end(),f);
  }

private:
  std::vector<Derived> store;
};

We can add the ability for poly_collection to accept a variable number of derived class types:

template<class Base,class ...Derivedn>
class poly_collection
{
  ...
};

so that it includes poly_collection_static_segment<Derivedi> components that take care of objects whose types coincide exactly with those specified, whereas the default dynamic handler is still used for the rest.
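To make the dispatching idea concrete, here is a minimal C++14 sketch of how insertion could be routed either to a static segment or to a dynamic fallback. This is not the code of the actual test program: sketch_collection, static_segment, is_one_of, the shape classes and count_sides are hypothetical names, and the dynamic part is reduced to a plain vector of std::unique_ptr<Base> just to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <tuple>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for poly_collection_static_segment.
template<class Derived>
class static_segment
{
public:
  void insert(const Derived& x){store.push_back(x);}

  template<typename F>
  void for_each(F& f){for(auto& x:store)f(x);}

private:
  std::vector<Derived> store;
};

// True if T is exactly one of Ts...
template<class T,class... Ts>
struct is_one_of:std::false_type{};

template<class T,class U,class... Ts>
struct is_one_of<T,U,Ts...>:
  std::conditional<
    std::is_same<T,U>::value,std::true_type,is_one_of<T,Ts...>>::type{};

// Assumption: Derived... are distinct types and Base has a virtual destructor.
template<class Base,class... Derived>
class sketch_collection
{
public:
  // Dispatch at compile time on whether T is one of the listed types.
  template<class T>
  void insert(const T& x){insert_impl(x,is_one_of<T,Derived...>{});}

  template<typename F>
  void for_each(F f)
  {
    for_each_static(f,std::index_sequence_for<Derived...>{});
    for(auto& p:dynamic_store)f(*p); // fallback: regular virtual dispatch
  }

private:
  template<class T>
  void insert_impl(const T& x,std::true_type)
  {
    std::get<static_segment<T>>(segments).insert(x);
  }

  template<class T>
  void insert_impl(const T& x,std::false_type)
  {
    dynamic_store.push_back(std::make_unique<T>(x));
  }

  // Visit every static segment in declaration order.
  template<typename F,std::size_t... I>
  void for_each_static(F& f,std::index_sequence<I...>)
  {
    int expand[]={0,(std::get<I>(segments).for_each(f),0)...};
    (void)expand;
  }

  std::tuple<static_segment<Derived>...> segments;
  std::vector<std::unique_ptr<Base>> dynamic_store; // simplified dynamic part
};

// Hypothetical classes used only to exercise the sketch.
struct shape
{
  virtual ~shape()=default;
  virtual int sides()const=0;
};
struct triangle:shape{int sides()const override{return 3;}};
struct square:shape{int sides()const override{return 4;}};

int count_sides()
{
  sketch_collection<shape,triangle> c;
  c.insert(triangle{}); // exact match: goes to static_segment<triangle>
  c.insert(triangle{});
  c.insert(square{});   // not listed: goes to the dynamic fallback
  int total=0;
  c.for_each([&](const shape& s){total+=s.sides();});
  assert(total==10);    // 3 + 3 + 4
  return total;
}
```

Calling count_sides() from main exercises both paths. The choice of segment depends only on the static type passed to insert, so an object inserted through a reference to its base would not reach a static segment in this sketch.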

What does this gain us? Consider the example:

class base
{
public:
  virtual void do_stuff()=0;
};

class derived1:public base{...};
...
poly_collection<base,derived1> c;
c.insert(derived1(0));
...
c.for_each([](base& x){
  x.do_stuff();
});

For the segment handling derived1, the code in for_each resolves to:

std::for_each(store.begin(),store.end(),[](base& x){
  x.do_stuff();
});

where store is of type std::vector<derived1>. The compiler can therefore know that x.do_stuff() is invoked on objects whose exact type is derived1, and hence omit the vtable lookup and even, if the definition of derived1::do_stuff is available at instantiation time, inline the call itself.

Now, the fact that the compiler can apply these optimizations does not mean it necessarily will do so: the static analysis needed to determine the optimization opportunity is certainly not trivial. There are a number of techniques we can implement to help the compiler in the process:

1. Make derived1::do_stuff final.
2. Use a polymorphic lambda [](auto& x){x.do_stuff();} to omit the cast from derived1& to base&.
3. Do both 1 and 2.
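The following C++14 sketch puts the two techniques side by side. The names final_derived1, segment_for_each and call_both_ways are hypothetical, and do_stuff returns an int here (unlike in the example above) only so that the result can be checked. With the polymorphic lambda the parameter keeps the segment's element type, so a final do_stuff can be bound statically without any further analysis; with the base& version the compiler still has to see through the conversion, and whether it does depends on the optimizer.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

struct base
{
  virtual ~base()=default;
  virtual int do_stuff()=0;
};

// Technique 1: do_stuff is final, so no further override can exist.
struct final_derived1:base
{
  int do_stuff() final{return 1;}
};

// Hypothetical stand-in for a static segment's for_each.
template<class Derived,typename F>
void segment_for_each(std::vector<Derived>& store,F f)
{
  for(auto& x:store)f(x);
}

int call_both_ways()
{
  std::vector<final_derived1> store(3);
  int sum=0;

  // Monomorphic functor: x is converted to base& before the call.
  segment_for_each(store,[&](base& x){sum+=x.do_stuff();});

  // Technique 2: polymorphic functor, x is deduced as final_derived1&.
  segment_for_each(store,[&](auto& x){
    static_assert(
      std::is_same<decltype(x),final_derived1&>::value,
      "x keeps the segment's element type");
    sum+=x.do_stuff();
  });

  assert(sum==6); // 3 elements, visited twice, 1 per call
  return sum;
}
```

Both lambdas compute the same result; any difference is in how much the compiler needs to prove before it can replace the virtual call with a direct one.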

I have written a test program (Boost required) that implements this new version of poly_collection and measures its for_each performance for the following scenarios:

1. poly_collection<base> (baseline)
2. poly_collection<base,derived1>
3. poly_collection<base,derived1,derived2>
4. poly_collection<base,derived1,derived2,derived3>
5. poly_collection<base,final_derived1>
6. poly_collection<base,final_derived1,final_derived2>
7. poly_collection<base,final_derived1,final_derived2,final_derived3>
8. same as 1, with a polymorphic functor
9. same as 2, with a polymorphic functor
10. same as 3, with a polymorphic functor
11. same as 4, with a polymorphic functor
12. same as 5, with a polymorphic functor
13. same as 6, with a polymorphic functor
14. same as 7, with a polymorphic functor

with containers of several sizes ranging from n = 1,000 to 10⁷ elements (except where noted, results do not vary significantly across sizes). Values are the averages of execution times / number of elements for each scenario, normalized to the baseline = 1.
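As a rough sketch of how such a normalized figure can be computed, the code below shows one reading of the description above: average the time per element over all container sizes, then divide by the same average for the baseline. The names run, avg_time_per_element and normalized are hypothetical and not taken from the test program, which may aggregate its timings differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One timed for_each pass: container size and total time in seconds.
struct run
{
  std::size_t n;
  double      seconds;
};

// Average of (execution time / number of elements) over all runs.
double avg_time_per_element(const std::vector<run>& runs)
{
  assert(!runs.empty());
  double sum=0.0;
  for(const run& r:runs){
    assert(r.n!=0);
    sum+=r.seconds/static_cast<double>(r.n);
  }
  return sum/static_cast<double>(runs.size());
}

// Scenario figure relative to the baseline (baseline itself yields 1).
double normalized(
  const std::vector<run>& scenario,const std::vector<run>& baseline)
{
  return avg_time_per_element(scenario)/avg_time_per_element(baseline);
}
```

Under this reading a value below 1 means the scenario spends less time per element than scenario 1, and a value above 1 means it spends more.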

MSVC 2012

Microsoft Visual Studio 2012 using default release mode settings on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.