Saturday, June 27, 2015

Traversing a linearized tree

In a previous entry we have introduced the data structure levelorder_vector<T>, which provides faster binary search than sorted vectors (such as sported by Boost.Container flat associative containers) by laying out the elements in a linear fashion following the breadth-first or level order of the binary tree associated to the sequence. As usual with data structures, this is expected to come at a price in the form of poorer performance for other operations.
Let us begin with iteration: traversing a levelorder_vector is algorithmically equivalent to doing so in its homologous binary tree:
The picture above shows the three generic cases the algorithm for incrementing a position has to deal with, highlighted with the same color in the following pseudocode:
void increment(position& i)
{
  position j=right(i);
  if(j){                      // right child exists
    do{
      i=j;                    // take the path down
      j=left(i);              // and follow left children
    }while(j);
  }
  else if(is_last_leaf(i)){
    i=end;                    // next of last is end
  }
  else{
    while(is_right_child(i)){ // go up while the node is
      i=parent(i);            // the right child of its parent 
    }
    i=parent(i);              // and up an extra level
  }
}
In the case of levelorder_vector, positions are numerical indexes into an array, so these operations are purely arithmetic (by contrast, in a node-based tree we would need to follow pointers around, which is more expensive as we will see shortly):
  • parent(i) = (i − 1)/2,
  • right(i) = 2i +2 (if less than size),
  • left(i) = 2i + 1 (if less than size),
  • is_right_child(i) = (i mod 2 = 0),
  • is_last_leaf(i) = i does not have a right child and i + 2 is a power of 2.
The last identity might be a little trickier to understand: it simply relies on the fact that the last leaf of the tree must also be the last element of a fully occupied level, and the number of elements in a complete binary tree with d levels is 2d − 1, hence the zero-based index of the element + 2 has to be a power of two.
A test program is provided that implements forward iterators for levelorder_vector (backwards traversal is mostly symmetrical) and measures the time taken to traverse std::set, boost::container::flat_set and levelorder_vector instances holding n = 10,000 to 3 million integer elements.
MSVC on Windows
Test compiled with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.
Traversal time / number of elements.
As expected, boost::container::flat_set is the fastest: nothing (single threaded) can beat traversing an array of memory in sequential order. levelorder_vector is around 5x slower than boost::container::flat_set but up to 4x faster than std::set. More interestingly, its traversal time per element is virtually non-affected by the size of the container. This can be further confirmed by measuring a non-touching traversal where the range [begin, end) is walked through without actually dereferencing the iterator, and consequently not accesing the container memory at all: resulting times remain almost the same, which indicates that the traversal time is dominated by the arithmetic operations on indexes. In a later entry we provide the formal proof that traversing a levelorder_vector is indeed cache-oblivious.
GCC on Linux
Contributed by Manu Sánchez (GCC 5.1, Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz) and Laurentiu Nicola (GCC 5.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz). Rather than showing the plots, we provide the observed ranges in execution speed ratios between the three containers (for instance, the cell "10-30" means that boost::container::flat_set is between 10 and 30 times faster than levelorder_vector for Intel Core i7).
flat_set
levelorder_vector
levelorder_vector
     std::set 
  
i710-303-8
Celeron10-122.5-4.5
Clang on Linux
Clang 3.6.1 with libstdc++-v3 and libc++ Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz.  Clang 3.6.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz.
flat_set
levelorder_vector
levelorder_vector
     std::set 
  
libstdc++-v3, i710-502-7
libc++, i710-502.5-8
Celeron15-302.5-4.5
Conclusions
Traversing a levelorder_vector takes a constant time per element that sits in between those of boost::container::flat_set (10-50 times slower, depending on number of elements and environment) and std::set (2-8 times faster). Users need to consider their particular execution scenarios to balance this fact vs. the superior performance offered at lookup.

8 comments :

  1. Here are my results: Arch Linux 3.16 x86_64 Intel Core i7 4790k @4.5GHz 16GB RAM

    GCC 5.1 -Wall -pedantic -march=native -O3

    std::set;boost::container::flat_set;levelorder_vector
    10000;0.0420321;0.000491979;0.014563
    12000;0.0421156;0.000483647;0.0144815
    14400;0.0422168;0.000480834;0.0145216
    17280;0.0428365;0.000478953;0.0145598
    20736;0.042634;0.000476514;0.0146061
    24883;0.0425655;0.000478717;0.0146198
    29859;0.0427637;0.000476505;0.0145363
    35830;0.0426727;0.000479686;0.0145594
    42995;0.0426188;0.000478767;0.0145501
    51593;0.0428331;0.000526655;0.0146135
    61910;0.0429622;0.000634265;0.0146206
    74290;0.0430819;0.000689963;0.0146423
    89146;0.0429906;0.000740952;0.0146404
    106973;0.0435249;0.000741801;0.0146144
    128365;0.0488695;0.000740464;0.0146262
    154035;0.0592191;0.000736002;0.0145821
    184839;0.0761611;0.000739294;0.0146332
    221803;0.0897213;0.000738229;0.0146319
    266159;0.0987045;0.00073572;0.0146533
    319386;0.1116;0.000740789;0.0146558
    383258;0.114196;0.0007371;0.0146298
    459904;0.116994;0.000751267;0.015051
    551879;0.115359;0.000736435;0.014638
    662249;0.119005;0.00073486;0.0147561
    794693;0.120097;0.000734039;0.0146452
    953625;0.120747;0.000733416;0.0146262
    1144343;0.118479;0.000740644;0.0146628
    1373204;0.120768;0.000775856;0.0151388
    1647837;0.120661;0.000772929;0.014735
    1977396;0.121989;0.00103971;0.0148675
    2372866;0.11892;0.00140741;0.0148716
    2847430;0.120638;0.00166705;0.0151869


    Clang 3.6.1 -Wall -pedantic -march=native -O3 (GNU stdlibc++)

    std::set;boost::container::flat_set;levelorder_vector
    10000;0.0417386;0.000341377;0.0168463
    12000;0.0417399;0.000349159;0.0169027
    14400;0.0423439;0.00031997;0.0168923
    17280;0.0430543;0.00042511;0.0167483
    20736;0.043264;0.000421114;0.0169363
    24883;0.0432674;0.000421662;0.0169812
    29859;0.0442487;0.000423517;0.0168891
    35830;0.0432414;0.000342709;0.0168915
    42995;0.0434866;0.000354699;0.0168187
    51593;0.0440113;0.000402105;0.0168149
    61910;0.0439409;0.000526146;0.0172682
    74290;0.0435135;0.000601147;0.0169024
    89146;0.0513623;0.000648336;0.0171129
    106973;0.0546859;0.000718656;0.0168484
    128365;0.0524283;0.000729523;0.0175893
    154035;0.0764826;0.000700932;0.0171135
    184839;0.0775115;0.000702296;0.0170045
    221803;0.0906357;0.000701652;0.0169417
    266159;0.0973642;0.000708999;0.0168062
    319386;0.106956;0.000712729;0.0168548
    383258;0.117187;0.000711961;0.016891
    459904;0.115757;0.000710714;0.016868
    551879;0.120767;0.000700052;0.0168105
    662249;0.117556;0.000698999;0.0168786
    794693;0.119222;0.000709851;0.0169199
    953625;0.123675;0.000723752;0.0170072
    1144343;0.120827;0.000715292;0.0168958
    1373204;0.120367;0.000713977;0.0168934
    1647837;0.120706;0.000758504;0.0169309
    1977396;0.121796;0.000938414;0.0172851
    2372866;0.120065;0.00122567;0.0174493
    2847430;0.120253;0.00167049;0.0175354


    Clang 3.6.1 -Wall -pedantic -march=native -O3 (LLVM libc++)

    std::set;boost::container::flat_set;levelorder_vector
    10000;0.0421094;0.000433572;0.0174907
    12000;0.0416049;0.000429394;0.0174393
    14400;0.0420245;0.000426524;0.0174684
    17280;0.0436509;0.000332382;0.0175928
    20736;0.0440515;0.000329182;0.0173394
    24883;0.044052;0.000342822;0.0174632
    29859;0.0438997;0.000341482;0.017644
    35830;0.0442544;0.000425812;0.0178497
    42995;0.0446459;0.000515587;0.0178085
    51593;0.0441517;0.000483189;0.0177848
    61910;0.044744;0.000567786;0.0178402
    74290;0.044896;0.000601651;0.0176016
    89146;0.0430186;0.000694245;0.017695
    106973;0.0455896;0.000706889;0.0177571
    128365;0.0526248;0.000692947;0.0175568
    154035;0.0882242;0.000721327;0.0176997
    184839;0.107886;0.000725102;0.017628
    221803;0.122777;0.000719787;0.017623
    266159;0.141874;0.000704827;0.0176322
    319386;0.139417;0.00071619;0.0180853
    383258;0.145941;0.000704596;0.0176367
    459904;0.142815;0.000704553;0.0176681
    551879;0.156851;0.000715048;0.0177817
    662249;0.145418;0.000730398;0.0177499
    794693;0.147788;0.000712339;0.0178183
    953625;0.147059;0.000716288;0.0178217
    1144343;0.157357;0.000704795;0.017893
    1373204;0.152407;0.000739034;0.0187936
    1647837;0.150415;0.000724018;0.0179466
    1977396;0.151268;0.000907862;0.0180083
    2372866;0.157685;0.00122255;0.0181203
    2847430;0.148503;0.00171368;0.018334

    ReplyDelete
  2. But, since you're developing a new set, not just a new implementation of std::set, you're free to choose the semantics of iteration. You've chosen to iterate in sorted order, as std::set does, with more or less expected results. What if you instead allow iteration in stored order, like that of unordered_set? I bet you'd see performance very close to that of boost::container_flat_set.

    ReplyDelete
    Replies
    1. Traversing in stored order would result in exactly the same performance as boost::container::flat_set, but the order produced would be of no particular use, and to the user would look as random as say that of a std::unordered_set. If we decided to do that, users would be better off using a std::unordered_set to begin with.

      Delete
    2. Still, sometimes one does not care.

      I would like to have:
      - unsorted_begin()/unsorted_end()/unsorted() to get an unsorted view of the range with the same performance as flat set.
      - unsorted insert operations that insert a single element in O(1), and m elements in O(m), that violate the invariants of the container. In debug mode the container can assert that the invariants are restored before any operation that needs them is performed.

      Delete
  3. I'm betting on better performance than unordered_set, due to better locality of reference, and also reduced memory consumption (each bucket in unordered_set must, by inference from the interface, be a doubly linked list.)

    I think performance would be slightly worse than container_flat_set. You either have to ensure perfect balance after every insertion/removal, which has a cost, or handle unoccupied slots in iteration, which would have a cost.

    ReplyDelete
    Replies
    1. I'm afraid lookup for levelorder_vector is worse than for std::unordered_set. You can compare the following
      * Cache-friendly binary search
      * A better hash table

      which are roughly comparable for MSVC. Lookup times for levelorder_vector range between 1 and 2 us, whereas for hashed sets these times are around 0.2-1 us.

      Delete
  4. Here are my results https://gist.github.com/GrayShade/a32a7e9ae777ccded639 .

    The distribution is Arch Linux x64, same as for the previous post. The i7 data might be noisy, as this is a laptop CPU and thermal throttling and/or Turbo Boost might have kicked in.

    ReplyDelete