Bannalia: trivial notes on themes diverse: Traversing a linearized tree

Saturday, June 27, 2015

Traversing a linearized tree

In a previous entry we have introduced the data structure levelorder_vector<T>, which provides faster binary search than sorted vectors (such as sported by Boost.Container flat associative containers) by laying out the elements in a linear fashion following the breadth-first or level order of the binary tree associated to the sequence. As usual with data structures, this is expected to come at a price in the form of poorer performance for other operations.

Let us begin with iteration: traversing a levelorder_vector is algorithmically equivalent to doing so in its homologous binary tree:

The picture above shows the three generic cases the algorithm for incrementing a position has to deal with, highlighted with the same color in the following pseudocode:

void increment(position& i)
{
  position j=right(i);
  if(j){                      // right child exists
    do{
      i=j;                    // take the path down
      j=left(i);              // and follow left children
    }while(j);
  }
  else if(is_last_leaf(i)){
    i=end;                    // next of last is end
  }
  else{
    while(is_right_child(i)){ // go up while the node is
      i=parent(i);            // the right child of its parent 
    }
    i=parent(i);              // and up an extra level
  }
}

In the case of levelorder_vector, positions are numerical indexes into an array, so these operations are purely arithmetic (by contrast, in a node-based tree we would need to follow pointers around, which is more expensive as we will see shortly):

parent(i) = (i − 1)/2,
right(i) = 2i +2 (if less than size),
left(i) = 2i + 1 (if less than size),
is_right_child(i) = (i mod 2 = 0),
is_last_leaf(i) = i does not have a right child and i + 2 is a power of 2.

The last identity might be a little trickier to understand: it simply relies on the fact that the last leaf of the tree must also be the last element of a fully occupied level, and the number of elements in a complete binary tree with d levels is 2^d − 1, hence the zero-based index of the element + 2 has to be a power of two.

A test program is provided that implements forward iterators for levelorder_vector (backwards traversal is mostly symmetrical) and measures the time taken to traverse std::set, boost::container::flat_set and levelorder_vector instances holding n = 10,000 to 3 million integer elements.

MSVC on Windows

Test compiled with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.

Traversal time / number of elements.

As expected, boost::container::flat_set is the fastest: nothing (single threaded) can beat traversing an array of memory in sequential order. levelorder_vector is around 5x slower than boost::container::flat_set but up to 4x faster than std::set. More interestingly, its traversal time per element is virtually non-affected by the size of the container. This can be further confirmed by measuring a non-touching traversal where the range [begin, end) is walked through without actually dereferencing the iterator, and consequently not accesing the container memory at all: resulting times remain almost the same, which indicates that the traversal time is dominated by the arithmetic operations on indexes. In a later entry we provide the formal proof that traversing a levelorder_vector is indeed cache-oblivious.

GCC on Linux

Contributed by Manu Sánchez (GCC 5.1, Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz) and Laurentiu Nicola (GCC 5.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz). Rather than showing the plots, we provide the observed ranges in execution speed ratios between the three containers (for instance, the cell "10-30" means that boost::container::flat_set is between 10 and 30 times faster than levelorder_vector for Intel Core i7).

	flat_set levelorder_vector	levelorder_vector std::set
i7	10-30	3-8
Celeron	10-12	2.5-4.5

Clang on Linux

Clang 3.6.1 with libstdc++-v3 and libc++ Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz. Clang 3.6.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz.

	flat_set levelorder_vector	levelorder_vector std::set
libstdc++-v3, i7	10-50	2-7
libc++, i7	10-50	2.5-8
Celeron	15-30	2.5-4.5

Conclusions

Traversing a levelorder_vector takes a constant time per element that sits in between those of boost::container::flat_set (10-50 times slower, depending on number of elements and environment) and std::set (2-8 times faster). Users need to consider their particular execution scenarios to balance this fact vs. the superior performance offered at lookup.

8 comments :

UnknownJune 27, 2015 at 6:39 PM
Here are my results: Arch Linux 3.16 x86_64 Intel Core i7 4790k @4.5GHz 16GB RAM

GCC 5.1 -Wall -pedantic -march=native -O3

std::set;boost::container::flat_set;levelorder_vector
10000;0.0420321;0.000491979;0.014563
12000;0.0421156;0.000483647;0.0144815
14400;0.0422168;0.000480834;0.0145216
17280;0.0428365;0.000478953;0.0145598
20736;0.042634;0.000476514;0.0146061
24883;0.0425655;0.000478717;0.0146198
29859;0.0427637;0.000476505;0.0145363
35830;0.0426727;0.000479686;0.0145594
42995;0.0426188;0.000478767;0.0145501
51593;0.0428331;0.000526655;0.0146135
61910;0.0429622;0.000634265;0.0146206
74290;0.0430819;0.000689963;0.0146423
89146;0.0429906;0.000740952;0.0146404
106973;0.0435249;0.000741801;0.0146144
128365;0.0488695;0.000740464;0.0146262
154035;0.0592191;0.000736002;0.0145821
184839;0.0761611;0.000739294;0.0146332
221803;0.0897213;0.000738229;0.0146319
266159;0.0987045;0.00073572;0.0146533
319386;0.1116;0.000740789;0.0146558
383258;0.114196;0.0007371;0.0146298
459904;0.116994;0.000751267;0.015051
551879;0.115359;0.000736435;0.014638
662249;0.119005;0.00073486;0.0147561
794693;0.120097;0.000734039;0.0146452
953625;0.120747;0.000733416;0.0146262
1144343;0.118479;0.000740644;0.0146628
1373204;0.120768;0.000775856;0.0151388
1647837;0.120661;0.000772929;0.014735
1977396;0.121989;0.00103971;0.0148675
2372866;0.11892;0.00140741;0.0148716
2847430;0.120638;0.00166705;0.0151869

Clang 3.6.1 -Wall -pedantic -march=native -O3 (GNU stdlibc++)

std::set;boost::container::flat_set;levelorder_vector
10000;0.0417386;0.000341377;0.0168463
12000;0.0417399;0.000349159;0.0169027
14400;0.0423439;0.00031997;0.0168923
17280;0.0430543;0.00042511;0.0167483
20736;0.043264;0.000421114;0.0169363
24883;0.0432674;0.000421662;0.0169812
29859;0.0442487;0.000423517;0.0168891
35830;0.0432414;0.000342709;0.0168915
42995;0.0434866;0.000354699;0.0168187
51593;0.0440113;0.000402105;0.0168149
61910;0.0439409;0.000526146;0.0172682
74290;0.0435135;0.000601147;0.0169024
89146;0.0513623;0.000648336;0.0171129
106973;0.0546859;0.000718656;0.0168484
128365;0.0524283;0.000729523;0.0175893
154035;0.0764826;0.000700932;0.0171135
184839;0.0775115;0.000702296;0.0170045
221803;0.0906357;0.000701652;0.0169417
266159;0.0973642;0.000708999;0.0168062
319386;0.106956;0.000712729;0.0168548
383258;0.117187;0.000711961;0.016891
459904;0.115757;0.000710714;0.016868
551879;0.120767;0.000700052;0.0168105
662249;0.117556;0.000698999;0.0168786
794693;0.119222;0.000709851;0.0169199
953625;0.123675;0.000723752;0.0170072
1144343;0.120827;0.000715292;0.0168958
1373204;0.120367;0.000713977;0.0168934
1647837;0.120706;0.000758504;0.0169309
1977396;0.121796;0.000938414;0.0172851
2372866;0.120065;0.00122567;0.0174493
2847430;0.120253;0.00167049;0.0175354

Clang 3.6.1 -Wall -pedantic -march=native -O3 (LLVM libc++)

std::set;boost::container::flat_set;levelorder_vector
10000;0.0421094;0.000433572;0.0174907
12000;0.0416049;0.000429394;0.0174393
14400;0.0420245;0.000426524;0.0174684
17280;0.0436509;0.000332382;0.0175928
20736;0.0440515;0.000329182;0.0173394
24883;0.044052;0.000342822;0.0174632
29859;0.0438997;0.000341482;0.017644
35830;0.0442544;0.000425812;0.0178497
42995;0.0446459;0.000515587;0.0178085
51593;0.0441517;0.000483189;0.0177848
61910;0.044744;0.000567786;0.0178402
74290;0.044896;0.000601651;0.0176016
89146;0.0430186;0.000694245;0.017695
106973;0.0455896;0.000706889;0.0177571
128365;0.0526248;0.000692947;0.0175568
154035;0.0882242;0.000721327;0.0176997
184839;0.107886;0.000725102;0.017628
221803;0.122777;0.000719787;0.017623
266159;0.141874;0.000704827;0.0176322
319386;0.139417;0.00071619;0.0180853
383258;0.145941;0.000704596;0.0176367
459904;0.142815;0.000704553;0.0176681
551879;0.156851;0.000715048;0.0177817
662249;0.145418;0.000730398;0.0177499
794693;0.147788;0.000712339;0.0178183
953625;0.147059;0.000716288;0.0178217
1144343;0.157357;0.000704795;0.017893
1373204;0.152407;0.000739034;0.0187936
1647837;0.150415;0.000724018;0.0179466
1977396;0.151268;0.000907862;0.0180083
2372866;0.157685;0.00122255;0.0181203
2847430;0.148503;0.00171368;0.018334
ReplyDelete
Replies
Björn FahllerJune 27, 2015 at 8:01 PM
But, since you're developing a new set, not just a new implementation of std::set, you're free to choose the semantics of iteration. You've chosen to iterate in sorted order, as std::set does, with more or less expected results. What if you instead allow iteration in stored order, like that of unordered_set? I bet you'd see performance very close to that of boost::container_flat_set.
ReplyDelete
Replies
Björn FahllerJune 27, 2015 at 8:40 PM
I'm betting on better performance than unordered_set, due to better locality of reference, and also reduced memory consumption (each bucket in unordered_set must, by inference from the interface, be a doubly linked list.)

I think performance would be slightly worse than container_flat_set. You either have to ensure perfect balance after every insertion/removal, which has a cost, or handle unoccupied slots in iteration, which would have a cost.
ReplyDelete
Replies
GrayShadeJune 27, 2015 at 11:24 PM
Here are my results https://gist.github.com/GrayShade/a32a7e9ae777ccded639 .

The distribution is Arch Linux x64, same as for the previous post. The i7 data might be noisy, as this is a laptop CPU and thermal throttling and/or Turbo Boost might have kicked in.
ReplyDelete
Replies
AnonymousJuly 16, 2015 at 6:32 PM
One of the good data structure and algorithm question for Interviews. Thanks
ReplyDelete
Replies

Add comment