## Saturday, June 27, 2015

### Traversing a linearized tree

In a previous entry we have introduced the data structure levelorder_vector<T>, which provides faster binary search than sorted vectors (such as sported by Boost.Container flat associative containers) by laying out the elements in a linear fashion following the breadth-first or level order of the binary tree associated to the sequence. As usual with data structures, this is expected to come at a price in the form of poorer performance for other operations.
Let us begin with iteration: traversing a levelorder_vector is algorithmically equivalent to doing so in its homologous binary tree:
The picture above shows the three generic cases the algorithm for incrementing a position has to deal with, highlighted with the same color in the following pseudocode:
```void increment(position& i)
{
position j=right(i);
if(j){                      // right child exists
do{
i=j;                    // take the path down
j=left(i);              // and follow left children
}while(j);
}
else if(is_last_leaf(i)){
i=end;                    // next of last is end
}
else{
while(is_right_child(i)){ // go up while the node is
i=parent(i);            // the right child of its parent
}
i=parent(i);              // and up an extra level
}
}```
In the case of levelorder_vector, positions are numerical indexes into an array, so these operations are purely arithmetic (by contrast, in a node-based tree we would need to follow pointers around, which is more expensive as we will see shortly):
• parent(i) = (i − 1)/2,
• right(i) = 2i +2 (if less than size),
• left(i) = 2i + 1 (if less than size),
• is_right_child(i) = (i mod 2 = 0),
• is_last_leaf(i) = i does not have a right child and i + 2 is a power of 2.
The last identity might be a little trickier to understand: it simply relies on the fact that the last leaf of the tree must also be the last element of a fully occupied level, and the number of elements in a complete binary tree with d levels is 2d − 1, hence the zero-based index of the element + 2 has to be a power of two.
A test program is provided that implements forward iterators for levelorder_vector (backwards traversal is mostly symmetrical) and measures the time taken to traverse std::set, boost::container::flat_set and levelorder_vector instances holding n = 10,000 to 3 million integer elements.
MSVC on Windows
Test compiled with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.
 Traversal time / number of elements.
As expected, boost::container::flat_set is the fastest: nothing (single threaded) can beat traversing an array of memory in sequential order. levelorder_vector is around 5x slower than boost::container::flat_set but up to 4x faster than std::set. More interestingly, its traversal time per element is virtually non-affected by the size of the container. This can be further confirmed by measuring a non-touching traversal where the range [begin, end) is walked through without actually dereferencing the iterator, and consequently not accesing the container memory at all: resulting times remain almost the same, which indicates that the traversal time is dominated by the arithmetic operations on indexes. In a later entry we provide the formal proof that traversing a levelorder_vector is indeed cache-oblivious.
GCC on Linux
Contributed by Manu Sánchez (GCC 5.1, Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz) and Laurentiu Nicola (GCC 5.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz). Rather than showing the plots, we provide the observed ranges in execution speed ratios between the three containers (for instance, the cell "10-30" means that boost::container::flat_set is between 10 and 30 times faster than levelorder_vector for Intel Core i7).
 flat_set levelorder_vector levelorder_vector      std::set i7 10-30 3-8 Celeron 10-12 2.5-4.5
Clang on Linux
Clang 3.6.1 with libstdc++-v3 and libc++ Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz.  Clang 3.6.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz.
 flat_set levelorder_vector levelorder_vector      std::set libstdc++-v3, i7 10-50 2-7 libc++, i7 10-50 2.5-8 Celeron 15-30 2.5-4.5
Conclusions
Traversing a levelorder_vector takes a constant time per element that sits in between those of boost::container::flat_set (10-50 times slower, depending on number of elements and environment) and std::set (2-8 times faster). Users need to consider their particular execution scenarios to balance this fact vs. the superior performance offered at lookup.

1. Here are my results: Arch Linux 3.16 x86_64 Intel Core i7 4790k @4.5GHz 16GB RAM

GCC 5.1 -Wall -pedantic -march=native -O3

std::set;boost::container::flat_set;levelorder_vector
10000;0.0420321;0.000491979;0.014563
12000;0.0421156;0.000483647;0.0144815
14400;0.0422168;0.000480834;0.0145216
17280;0.0428365;0.000478953;0.0145598
20736;0.042634;0.000476514;0.0146061
24883;0.0425655;0.000478717;0.0146198
29859;0.0427637;0.000476505;0.0145363
35830;0.0426727;0.000479686;0.0145594
42995;0.0426188;0.000478767;0.0145501
51593;0.0428331;0.000526655;0.0146135
61910;0.0429622;0.000634265;0.0146206
74290;0.0430819;0.000689963;0.0146423
89146;0.0429906;0.000740952;0.0146404
106973;0.0435249;0.000741801;0.0146144
128365;0.0488695;0.000740464;0.0146262
154035;0.0592191;0.000736002;0.0145821
184839;0.0761611;0.000739294;0.0146332
221803;0.0897213;0.000738229;0.0146319
266159;0.0987045;0.00073572;0.0146533
319386;0.1116;0.000740789;0.0146558
383258;0.114196;0.0007371;0.0146298
459904;0.116994;0.000751267;0.015051
551879;0.115359;0.000736435;0.014638
662249;0.119005;0.00073486;0.0147561
794693;0.120097;0.000734039;0.0146452
953625;0.120747;0.000733416;0.0146262
1144343;0.118479;0.000740644;0.0146628
1373204;0.120768;0.000775856;0.0151388
1647837;0.120661;0.000772929;0.014735
1977396;0.121989;0.00103971;0.0148675
2372866;0.11892;0.00140741;0.0148716
2847430;0.120638;0.00166705;0.0151869

Clang 3.6.1 -Wall -pedantic -march=native -O3 (GNU stdlibc++)

std::set;boost::container::flat_set;levelorder_vector
10000;0.0417386;0.000341377;0.0168463
12000;0.0417399;0.000349159;0.0169027
14400;0.0423439;0.00031997;0.0168923
17280;0.0430543;0.00042511;0.0167483
20736;0.043264;0.000421114;0.0169363
24883;0.0432674;0.000421662;0.0169812
29859;0.0442487;0.000423517;0.0168891
35830;0.0432414;0.000342709;0.0168915
42995;0.0434866;0.000354699;0.0168187
51593;0.0440113;0.000402105;0.0168149
61910;0.0439409;0.000526146;0.0172682
74290;0.0435135;0.000601147;0.0169024
89146;0.0513623;0.000648336;0.0171129
106973;0.0546859;0.000718656;0.0168484
128365;0.0524283;0.000729523;0.0175893
154035;0.0764826;0.000700932;0.0171135
184839;0.0775115;0.000702296;0.0170045
221803;0.0906357;0.000701652;0.0169417
266159;0.0973642;0.000708999;0.0168062
319386;0.106956;0.000712729;0.0168548
383258;0.117187;0.000711961;0.016891
459904;0.115757;0.000710714;0.016868
551879;0.120767;0.000700052;0.0168105
662249;0.117556;0.000698999;0.0168786
794693;0.119222;0.000709851;0.0169199
953625;0.123675;0.000723752;0.0170072
1144343;0.120827;0.000715292;0.0168958
1373204;0.120367;0.000713977;0.0168934
1647837;0.120706;0.000758504;0.0169309
1977396;0.121796;0.000938414;0.0172851
2372866;0.120065;0.00122567;0.0174493
2847430;0.120253;0.00167049;0.0175354

Clang 3.6.1 -Wall -pedantic -march=native -O3 (LLVM libc++)

std::set;boost::container::flat_set;levelorder_vector
10000;0.0421094;0.000433572;0.0174907
12000;0.0416049;0.000429394;0.0174393
14400;0.0420245;0.000426524;0.0174684
17280;0.0436509;0.000332382;0.0175928
20736;0.0440515;0.000329182;0.0173394
24883;0.044052;0.000342822;0.0174632
29859;0.0438997;0.000341482;0.017644
35830;0.0442544;0.000425812;0.0178497
42995;0.0446459;0.000515587;0.0178085
51593;0.0441517;0.000483189;0.0177848
61910;0.044744;0.000567786;0.0178402
74290;0.044896;0.000601651;0.0176016
89146;0.0430186;0.000694245;0.017695
106973;0.0455896;0.000706889;0.0177571
128365;0.0526248;0.000692947;0.0175568
154035;0.0882242;0.000721327;0.0176997
184839;0.107886;0.000725102;0.017628
221803;0.122777;0.000719787;0.017623
266159;0.141874;0.000704827;0.0176322
319386;0.139417;0.00071619;0.0180853
383258;0.145941;0.000704596;0.0176367
459904;0.142815;0.000704553;0.0176681
551879;0.156851;0.000715048;0.0177817
662249;0.145418;0.000730398;0.0177499
794693;0.147788;0.000712339;0.0178183
953625;0.147059;0.000716288;0.0178217
1144343;0.157357;0.000704795;0.017893
1373204;0.152407;0.000739034;0.0187936
1647837;0.150415;0.000724018;0.0179466
1977396;0.151268;0.000907862;0.0180083
2372866;0.157685;0.00122255;0.0181203
2847430;0.148503;0.00171368;0.018334

2. But, since you're developing a new set, not just a new implementation of std::set, you're free to choose the semantics of iteration. You've chosen to iterate in sorted order, as std::set does, with more or less expected results. What if you instead allow iteration in stored order, like that of unordered_set? I bet you'd see performance very close to that of boost::container_flat_set.

1. Traversing in stored order would result in exactly the same performance as boost::container::flat_set, but the order produced would be of no particular use, and to the user would look as random as say that of a std::unordered_set. If we decided to do that, users would be better off using a std::unordered_set to begin with.

2. Still, sometimes one does not care.

I would like to have:
- unsorted_begin()/unsorted_end()/unsorted() to get an unsorted view of the range with the same performance as flat set.
- unsorted insert operations that insert a single element in O(1), and m elements in O(m), that violate the invariants of the container. In debug mode the container can assert that the invariants are restored before any operation that needs them is performed.

3. I'm betting on better performance than unordered_set, due to better locality of reference, and also reduced memory consumption (each bucket in unordered_set must, by inference from the interface, be a doubly linked list.)

I think performance would be slightly worse than container_flat_set. You either have to ensure perfect balance after every insertion/removal, which has a cost, or handle unoccupied slots in iteration, which would have a cost.

1. I'm afraid lookup for levelorder_vector is worse than for std::unordered_set. You can compare the following
* Cache-friendly binary search
* A better hash table

which are roughly comparable for MSVC. Lookup times for levelorder_vector range between 1 and 2 us, whereas for hashed sets these times are around 0.2-1 us.

4. Here are my results https://gist.github.com/GrayShade/a32a7e9ae777ccded639 .

The distribution is Arch Linux x64, same as for the previous post. The i7 data might be noisy, as this is a laptop CPU and thermal throttling and/or Turbo Boost might have kicked in.