Sunday, September 6, 2015

C++ encapsulation for Data-Oriented Design: performance

(Many thanks to Manu Sánchez for his help with running tests and analyzing results.)
In a past entry, we implemented a little C++ framework that allows us to do DOD while retaining some of the encapsulation benefits and the general look and feel of traditional object-based programming. Here we complete the framework by adding a piece that is critical for usability, namely the ability to process sequences of DOD entities with syntax as terse as we would have in OOP.
To enable DOD for a particular class (like the particle we used in the previous entry), i.e., to distribute its different data members across separate memory locations, we change the class source code to turn it into a class template particle<Access>, where Access is a framework-provided entity in charge of granting access to the external data members with a syntax similar to that of regular members of the class itself. Now, particle<Access> is no longer a regular class with value semantics, but a mere proxy to the external data without ownership of it. Importantly, it is the data members, not the particle objects, that are stored: particles are constructed on the fly whenever their interface is needed to process the data. So, code like
for(const auto& p:particle_)p.render();
cannot possibly work because the application does not have any particle_ container to begin with: instead, the information is stored in separate locations:
std::vector<char> color_;
std::vector<int>  x_,y_,dx_,dy_;
and "traversing" the particles requires that we go through the associated containers in parallel and invoke render on a temporary particle object constructed out of them:
auto itc=&color_[0],ec=itc+color_.size();
auto itx=&x_[0];
auto ity=&y_[0];
auto itdx=&dx_[0];
auto itdy=&dy_[0];
  
for(;itc!=ec;++itc,++itx,++ity,++itdx,++itdy){
  auto p=make_particle(
    access<color,x,y,dx,dy>(itc,itx,ity,itdx,itdy));
  p.render();
}
Fortunately, this boilerplate code can be hidden by the framework by using these auxiliary constructs:
template<typename T> class pointer;

template<template <typename> class Class,typename Access>
class pointer<Class<Access>>
{
  // behaves as Class<Access>*
};

template<template <typename> class Class,typename Access>
pointer<Class<Access>> make_pointer(const Access& a)
{
  return pointer<Class<Access>>(a);
}
We won't delve into the implementation details of pointer (the interested reader can see the actual code in the test program given below). From the user's point of view, this utility class accepts an access entity, which is a collection of pointers to the data members plus an offset member (this offset has been added since the previous version of the framework); it keeps everything in sync when doing pointer arithmetic and dereferences to a temporary particle object. The resulting user code is as simple as it gets:
auto n=color_.size();
auto beg_=make_pointer<particle>(access<color,x,y,dx,dy>(
  &color_[0],&x_[0],&y_[0],&dx_[0],&dy_[0]));
auto end_=beg_+n;
  
for(auto it=beg_;it!=end_;++it)it->render();
Index-based traversal is also possible:
for(std::size_t i=0;i<n;++i)beg_[i].render();
Once the containers are populated and beg_ and end_ defined, user code can handle particles as if they were stored in [beg_, end_), thus effectively isolated from the fact that the actual data is scattered around different containers for maximum processing performance.
Are we paying an abstraction penalty for the convenience this framework affords? There are two sources of concern:
  • Even though traversal code is in principle equivalent to hand-written DOD code, compilers might not be able to optimize all the template scaffolding away.
  • Traversing with access<color,x,y,dx,dy> for rendering when only color, x and y are needed (because render does not access dx or dy) involves iterating over dx_ and dy_ without actually accessing either one: again, the compiler might or might not optimize this extra code.
We provide a test program (Boost required) that measures the performance of this framework against some alternatives. The looped-over render procedure simply updates a global variable, so the resulting execution times are basically those of the produced iteration code. The options compared are:
  • oop: iteration over a traditional object-based structure
  • raw: hand-written data-processing loop
  • dod: DOD framework with access<color,x,y,dx,dy>
  • render_dod: DOD framework with access<color,x,y>
  • oop[i]: index-based access instead of iterator traversal
  • raw[i]: hand-written index-based loop
  • dod[i]: index-based with access<color,x,y,dx,dy>
  • render_dod[i]: index-based with access<color,x,y>
The difference between dod and render_dod (and the same applies to their index-based variants) is that the latter keeps access only to the data members strictly required by render: if the compiler were not able to optimize unnecessary pointer manipulations in dod, render_dod would be expected to be faster; the drawback is that this would require fine tuning the access entity for each member function.
Manu Sánchez has set up an extensive testing environment to build and run the program using different compilers and machines:
The figures show the release-mode execution times of the eight options described above when traversing sequences of n = 10⁴, 10⁵, 10⁶ and 10⁷ particles.
GCC 5.1, MinGW, Intel Core i7-4790k @4.5GHz
Execution times / number of elements.
As expected, OOP is the slowest option due to cache effects. The rest of the options are basically equivalent, which shows that GCC is able to entirely optimize away the syntactic niceties introduced by our DOD framework.
MSVC 14.0, Windows, Intel Core i7-4790k @4.5GHz
Execution times / number of elements.
Here, again, all DOD options are roughly equivalent, although raw (the pointer-based hand-written loop) is slightly slower. Curiously enough, MSVC is much worse than GCC at optimizing DOD with respect to OOP, with execution times up to 4 times higher for n = 10⁴ and 1.3 times higher for n = 10⁷, the latter scenario presumably being dominated by cache effects.
GCC 5.2, Linux, AMD A6-1450 APU @1.0 GHz
Execution times / number of elements.
From a qualitative point of view, these results are in line with those obtained for GCC 5.1 on an Intel Core i7, although, as the AMD A6 is a much less powerful processor, execution times are higher (×8-10 for n = 10⁴, ×4-5.5 for n = 10⁷).
Clang 3.6, Linux, AMD A6-1450 APU @1.0 GHz
Execution times / number of elements.
As with the rest of the compilers, the DOD options (both manual and framework-supported) perform equally well. However, the comparison with GCC 5.2 on the same machine shows important differences: iterator-based OOP is faster (×1.1-1.4) in Clang, index-based OOP yields the same results for both compilers, and the DOD options in Clang are consistently slower (×2.3-3.4) than in GCC, to the point that OOP outperforms them for low values of n. A detailed analysis of the generated assembly code would probably give us more insight into these contrasting behaviors: interested readers can find the assembly listings in the associated GitHub repository.

9 comments:

  1. I got similar results with a similar framework that I've written. I took a slightly different approach in that I wanted to allow optional fields. As a result, users are required to use my containers; it's not possible to use an arbitrary array or vector. The benefit is that the system has complete control over where things go in memory.

    At a high level it works like this:

    // Create some pool types -- these are basically an array.
    // For optional fields, consider a sparse_pool instead.
    typedef entity::component::saturated_pool position_pool_type;
    typedef entity::component::saturated_pool velocity_pool_type;
    typedef entity::component::saturated_pool accel_pool_type;

    // Create a pool of entities.
    entity::entity_pool entities;

    // Associate the component pools with the entity pool
    position_pool_type position_pool(entities);
    velocity_pool_type velocity_pool(entities);
    accel_pool_type accel_pool(entities);

    // Create an entity, creates associated components
    auto e = entities.create();

    // Iterate over all positions for all entities
    entity::for_each(entities, entity::component::tie(position_pool), [](float position)
    {
    // ...
    });

    // Calculate new position from velocity.
    entity::for_each(entities, entity::component::tie(velocity_pool, position_pool), [](float v, float& p)
    {
    // Compute new position.
    p += v * 0.016;
    });

    I'm still working on the syntax and the code and documentation need a little love, but I think the idea has merit.


    ReplyDelete
    Replies
    1. The ideas you present are indeed similar. One thing I find lacking in your approach, though, is the ability to process entities through an iterator-like object (rather than with the specialized entity::for_each), which opens up the possibility of interoperating with STL-like algorithms.

      Delete
    2. There is an iterator actually, but it can't be random access due to my desire to support optional components (forward iteration only). As a result, making an iterator involves writing two sets of bind code to make the begin and end iterators (i.e., I can't do 'end = begin + n' like you can).

      entity::for_each wraps that and then decomposes the iterator back into its components to call the functor with, so it's just syntactic sugar.

      But, I still don't like it because I don't want to have to support aliases for all algorithms.

      The ideas presented here have given me some ideas, though, so I may utilize some of them, if you don't mind.

      Delete
  2. Nice series of articles.
    I'm extremely intrigued by the Clang results... any plans to write something about them?

    ReplyDelete
    Replies
    1. Manu Sánchez, who ran the actual tests, has also produced asm listings and is taking a look at them, but I don't know yet whether he's gaining some insight that can be shared about the differences between compilers.

      Delete
  3. Just curious; what are you using to produce the charts? Is it automated?

    ReplyDelete
    Replies
    1. Just a regular Excel sheet I copy and paste the test output into.

      Delete
  4. Really impressive work!
    I'm not a very experienced programmer, but I'm heavily inspired by DOD and I hope that this paradigm will become more and more popular :)
    My question is: is it possible to make some "magic" generic class that creates and associates the specific containers on the basis of user-declared members?

    Some loose example:
    template< typename UserDefinedClass >
    class MagicClass { ... };

    class UserDefinedClass : public MagicClass< UserDefinedClass > {
    using PosX = Member< float, 0 >;
    using PosY = Member< float, 1 >;
    using Another = Member< int, 0 >;
    // etc.
    };

    And when we instantiate that class, the "MagicClass" base iterates over the members and creates the corresponding containers (a vector or something similar).

    Like I wrote at the beginning, I'm not so experienced, so: is this idea good? Is it worthwhile? Does it make sense? Could it be handy?

    Anyway, it would be great if you wrote some more posts like this about DOD.
    Greetings!

    ReplyDelete