tag:blogger.com,1999:blog-27159684727355469622024-03-08T17:09:24.758+01:00Bannalia: trivial notes on themes diverseJoaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.comBlogger152125tag:blogger.com,1999:blog-2715968472735546962.post-7924925746818863002023-10-20T18:30:00.020+02:002023-10-22T19:44:29.653+02:00Bulk visitation in boost::concurrent_flat_map<p></p><h1 dir="auto" id="user-content-bulk-visitation-in-boostconcurrent_flat_map" tabindex="-1"></h1><ul dir="auto"><li><a href="#introduction">Introduction</a></li><li><a href="#prior-art">Prior art</a></li><li><a href="#bulk-visitation-design">Bulk visitation design</a></li><li><a href="#performance-analysis">Performance analysis</a></li><li><a href="#conclusions-and-next-steps">Conclusions and next steps</a></li></ul>
<h2 dir="auto" id="user-content-introduction" tabindex="-1"><a name="introduction"><b><span style="font-size: large;">Introduction</span></b></a></h2><p dir="auto"><a href="https://www.boost.org/doc/libs/develop/libs/unordered/doc/html/unordered.html#concurrent_flat_map" rel="nofollow"><code>boost::concurrent_flat_map</code></a>
and its <a href="https://www.boost.org/doc/libs/develop/libs/unordered/doc/html/unordered.html#concurrent_flat_set" rel="nofollow"><code>boost::concurrent_flat_set</code></a>
counterpart are Boost.Unordered's associative containers for
high-performance concurrent scenarios. These containers dispense with iterators in favor of a
<i>visitation</i>-based interface:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">boost::concurrent_flat_map<<span class="pl-k">int</span>, <span class="pl-k">int</span>> m;
...
<span class="pl-c"><span class="pl-c">//</span> find the element with key k and increment its associated value</span>
m.visit(k, [](<span class="pl-k">auto</span>& x) {
++x.<span class="pl-smi">second</span>;
});</pre></div><p dir="auto">This design choice was made because visitation is not affected by some inherent
problems afflicting iterators in multithreaded environments.</p>
<p dir="auto">Starting in Boost 1.84,
code like the following:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">std::array<<span class="pl-k">int</span>, N> keys;
...
<span class="pl-k">for</span>(<span class="pl-k">const</span> <span class="pl-k">auto</span>& key: keys) {
m.<span class="pl-c1">visit</span>(key, [](<span class="pl-k">auto</span>& x) { ++x.<span class="pl-smi">second</span>; });
}</pre></div><p dir="auto">can be written more succinctly via the so-called <i>bulk visitation</i> API:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">m.visit(keys.begin(), keys.end(), [](<span class="pl-k">auto</span>& x) { ++x.<span class="pl-smi">second</span>; });</pre></div><p dir="auto">As it happens, bulk visitation is not provided merely for syntactic convenience:
this operation is internally optimized so that it performs significantly faster
than the original for-loop. We discuss here the key ideas behind the internal design
of bulk visitation and analyze its performance.</p>
<h2 dir="auto" id="user-content-prior-art" tabindex="-1"><a name="prior-art"><b><span style="font-size: large;">Prior art</span></b></a></h2><p dir="auto">In their paper
<a href="https://dl.acm.org/doi/pdf/10.1145/3552326.3587457" rel="nofollow">"DRAMHiT: A Hash Table Architected for the Speed of DRAM"</a>,
Narayanan et al. explore some optimization techniques from the domain of
distributed systems as translated to concurrent hash tables running on modern
multi-core architectures with hierarchical caches. In particular, they note
that cache misses can be avoided by batching requests to the hash table, prefetching
the memory positions required by those requests and then completing the operations
asynchronously when enough time has passed for the data to be effectively
retrieved. Our bulk visitation implementation draws inspiration from this
technique, although in our case visitation is fully synchronous and in-order,
and it is the responsibility of the user to batch keys before calling
the bulk overload of <code>boost::concurrent_flat_map::visit</code>.</p>
<h2 dir="auto" id="user-content-bulk-visitation-design" tabindex="-1"><a name="bulk-visitation-design"><b><span style="font-size: large;">Bulk visitation design</span></b></a></h2><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijST3J7HxWSPrQwhRoXQBiNXkaZ1Q4bzMNVcgVfevMOGG5iNde5LnNo7dbk8KIc7kPg9FHWMJ0DdYP6seHktxoNevDhkuVTq9I74-cStGjZ3AeueYRf1o4MHcOJPaTwUiqz34nGlnOXnfgMt3XQ_vhK8NfU3R2sKHvTtdXEWzyn3nQiLa8LdD7C2-fRJA/s16000/data_structure.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="351" data-original-width="935" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijST3J7HxWSPrQwhRoXQBiNXkaZ1Q4bzMNVcgVfevMOGG5iNde5LnNo7dbk8KIc7kPg9FHWMJ0DdYP6seHktxoNevDhkuVTq9I74-cStGjZ3AeueYRf1o4MHcOJPaTwUiqz34nGlnOXnfgMt3XQ_vhK8NfU3R2sKHvTtdXEWzyn3nQiLa8LdD7C2-fRJA/w600/data_structure.png" width="600" /></a></div><p dir="auto"></p>
<p dir="auto">As discussed in a previous <a href="https://bannalia.blogspot.com/2023/07/inside-boostconcurrentflatmap.html" rel="nofollow">article</a>,
<code>boost::concurrent_flat_map</code> uses an open-addressing data structure comprising:</p>
<ul dir="auto"><li>A bucket array split into 2<sup><i>n</i></sup> groups of <i>N</i> = 15 slots.</li><li>A metadata array associating a 16-byte metadata word to each slot group, used for SIMD-based reduced-hash matching.</li><li>An access array with a spinlock (and some additional information) for locked access to each group.</li></ul>
<p dir="auto">The happy path for successful visitation looks like this:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgKWUR150_-IJ8AM1oo_ov5enCDojj0ZCXFL_DLhVC8ZezFngAi4it1nEn9tspznwTl5JoU2MtEzEG1PL-7hlUaoJzwo7UjPwi7dlvhiLN6v9e25xXpjTG58zMp58U9T746qZoFW9Mmi5HlYDvoYA58leZ7-yM9jgH72XgbnWyT6KMbaoSLHuy7wKf9gE/s16000/visit.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="422" data-original-width="626" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgKWUR150_-IJ8AM1oo_ov5enCDojj0ZCXFL_DLhVC8ZezFngAi4it1nEn9tspznwTl5JoU2MtEzEG1PL-7hlUaoJzwo7UjPwi7dlvhiLN6v9e25xXpjTG58zMp58U9T746qZoFW9Mmi5HlYDvoYA58leZ7-yM9jgH72XgbnWyT6KMbaoSLHuy7wKf9gE/w600/visit.png" width="600" /></a></div><p dir="auto"></p>
<ol dir="auto"><li>The hash value for the looked-up key and its mapped group position are calculated.</li><li>The metadata for the group is retrieved and matched against the hash value.</li><li>If the match is positive (which is the case for the <i>happy</i> path), the group is locked for access
and the element indicated by the matching mask is retrieved and compared with the key. Again, in the
happy path this comparison is positive (the element is found); in the unhappy path, more elements
(within this group or beyond) need be checked.</li></ol>
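<p dir="auto">As a rough illustration of the three steps above, here is a minimal toy sketch. It is not the Boost.Unordered implementation: the names <code>toy_table</code>, <code>mix</code>, <code>group_of</code>, <code>reduced</code> and <code>match</code> are hypothetical, metadata matching is scalar rather than SIMD-based, a <code>std::mutex</code> stands in for the spinlock, and lookup never probes beyond the home group.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Toy model for illustration only; NOT Boost.Unordered's implementation.
// Simplifications: scalar metadata matching, no probing beyond the home group,
// std::mutex instead of a spinlock, unsynchronized metadata reads (so this
// sketch must not actually be used from several threads at once).
struct toy_table {
    static constexpr std::size_t slots_per_group = 15;

    struct group {
        std::array<std::uint8_t, 16> metadata{};  // one reduced-hash byte per slot, 0 = empty
        std::array<std::pair<int, int>, slots_per_group> slots{};
    };

    std::vector<group> groups;      // 2^n groups
    std::vector<std::mutex> locks;  // one lock per group

    explicit toy_table(std::size_t log2_groups)
        : groups(std::size_t{1} << log2_groups), locks(groups.size()) {
        assert(log2_groups < 32);
    }

    static std::uint64_t mix(int key) {  // hypothetical hash mixer
        std::uint64_t h = static_cast<std::uint32_t>(key);
        h *= 0x9E3779B97F4A7C15ull;
        return h ^ (h >> 32);
    }
    std::size_t group_of(std::uint64_t h) const { return (h >> 8) & (groups.size() - 1); }
    static std::uint8_t reduced(std::uint64_t h) {
        return static_cast<std::uint8_t>(h % 255 + 1);  // in [1, 255], never the empty marker
    }

    // Bit i of the result is set if slot i stores this reduced hash.
    static unsigned match(const group& g, std::uint8_t rh) {
        unsigned mask = 0;
        for (std::size_t i = 0; i < slots_per_group; ++i)
            if (g.metadata[i] == rh) mask |= 1u << i;
        return mask;
    }

    // Inserts into the first empty slot of the home group (no duplicate check).
    bool insert(int key, int value) {
        const std::uint64_t h = mix(key);
        const std::size_t g = group_of(h);
        std::lock_guard<std::mutex> lock(locks[g]);
        for (std::size_t i = 0; i < slots_per_group; ++i) {
            if (groups[g].metadata[i] == 0) {
                groups[g].slots[i] = {key, value};
                groups[g].metadata[i] = reduced(h);
                return true;
            }
        }
        return false;  // home group full: a real table would keep probing
    }

    template <typename F>
    bool visit(int key, F f) {
        // Step 1: hash value and group position.
        const std::uint64_t h = mix(key);
        const std::size_t g = group_of(h);
        // Step 2: match the group metadata against the reduced hash.
        const unsigned mask = match(groups[g], reduced(h));
        if (mask == 0) return false;  // unsuccessful visitation ends here
        // Step 3: lock the group and compare the candidate elements with the key.
        std::lock_guard<std::mutex> lock(locks[g]);
        for (std::size_t i = 0; i < slots_per_group; ++i) {
            if (((mask >> i) & 1u) && groups[g].slots[i].first == key) {
                f(groups[g].slots[i]);
                return true;
            }
        }
        return false;  // reduced-hash collision within the group
    }
};
```

<p dir="auto">In this sketch, each of the accesses to <code>groups[g].metadata</code>, <code>locks[g]</code> and <code>groups[g].slots</code> touches a different memory region, which is where the waits between steps come from.</p>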
<p dir="auto">(Note that happy <i>unsuccessful</i> visitation simply terminates at step 2, so we focus our
analysis on successful visitation.) As the diagram shows, the CPU has to
wait for memory retrieval between steps 1 and 2 and between steps 2 and 3 (in the latter case,
retrievals of mutex and element are parallelized through manual prefetching). A key insight
is that, under normal circumstances, these memory accesses will almost always be cache misses:
successive visitation operations, unless they target the very same key, won't have any cache
locality. In bulk visitation, the stages of the algorithm are <i>pipelined</i> as follows
(the diagram shows the case of three operations in the bulk batch):</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeD4U6u3q2NHFiBm4EHKHwnhpygNIZT8efRg665Qa8PsbAGfM7-nWpZTQP38lhWx7Kr1p5LassPsYeuzakkMtPxdvbsQQs0NNIC3cAaVwlXkk3VDHd5GUMTxsA9OXzToHLAw3hJA4I8Me7dXYXUvG5lAyFzvfN6f_1C-ZAhh-3LlciomyCLtYXM93VEAk/s16000/bulk_visit.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="536" data-original-width="626" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeD4U6u3q2NHFiBm4EHKHwnhpygNIZT8efRg665Qa8PsbAGfM7-nWpZTQP38lhWx7Kr1p5LassPsYeuzakkMtPxdvbsQQs0NNIC3cAaVwlXkk3VDHd5GUMTxsA9OXzToHLAw3hJA4I8Me7dXYXUvG5lAyFzvfN6f_1C-ZAhh-3LlciomyCLtYXM93VEAk/w600/bulk_visit.png" width="600" /></a></div><p dir="auto"></p>
<p dir="auto">The data required at step <i>N</i>+1 is prefetched at the end of step <i>N</i>. Now,
if a sufficiently large number of operations are pipelined, we can effectively eliminate
cache miss stalls: all memory addresses will be already cached by the time they are used.</p>
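<p dir="auto">The following toy sketch shows one way to arrange a batch of lookups so that each stage prefetches the data needed by the next one. It is a simplified, single-threaded, stage-by-stage approximation and not the actual Boost.Unordered code: <code>toy_flat_map</code>, <code>visit_chunk</code> and <code>TOY_PREFETCH</code> are hypothetical names, the table uses plain linear probing with one metadata byte per slot, and prefetching relies on the GCC/Clang builtin <code>__builtin_prefetch</code> (it becomes a no-op on other compilers).</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
#define TOY_PREFETCH(p) __builtin_prefetch(p)
#else
#define TOY_PREFETCH(p) ((void)0)  // no prefetching available: sketch still works
#endif

// Toy single-threaded open-addressing map (illustration only).
struct toy_flat_map {
    std::vector<std::uint8_t> meta;          // 0 = empty, otherwise reduced hash
    std::vector<std::pair<int, int>> slots;  // same indexing as meta
    std::size_t size = 0;
    static constexpr std::size_t chunk_size = 16;

    explicit toy_flat_map(std::size_t log2_capacity)
        : meta(std::size_t{1} << log2_capacity, 0), slots(meta.size()) {}

    static std::uint64_t mix(int key) {  // hypothetical hash mixer
        std::uint64_t h = static_cast<std::uint32_t>(key);
        h *= 0x9E3779B97F4A7C15ull;
        return h ^ (h >> 32);
    }
    std::size_t home(std::uint64_t h) const { return (h >> 8) & (meta.size() - 1); }
    static std::uint8_t reduced(std::uint64_t h) {
        return static_cast<std::uint8_t>(h % 255 + 1);
    }

    bool insert(int key, int value) {
        assert(size + 1 < meta.size());  // keep an empty slot so probing terminates
        const std::uint64_t h = mix(key);
        for (std::size_t pos = home(h);; pos = (pos + 1) & (meta.size() - 1)) {
            if (meta[pos] == 0) {
                meta[pos] = reduced(h);
                slots[pos] = {key, value};
                ++size;
                return true;
            }
            if (meta[pos] == reduced(h) && slots[pos].first == key) return false;
        }
    }

    // Regular lookup by linear probing, starting at pos.
    template <typename F>
    bool visit_from(std::size_t pos, std::uint64_t h, int key, F& f) {
        for (;; pos = (pos + 1) & (meta.size() - 1)) {
            if (meta[pos] == 0) return false;
            if (meta[pos] == reduced(h) && slots[pos].first == key) {
                f(slots[pos]);
                return true;
            }
        }
    }

    template <typename F>
    bool visit(int key, F f) {
        const std::uint64_t h = mix(key);
        return visit_from(home(h), h, key, f);
    }

    // Visits up to chunk_size keys; returns the number of elements visited.
    template <typename F>
    std::size_t visit_chunk(const int* keys, std::size_t n, F f) {
        assert(n <= chunk_size);
        std::uint64_t hashes[chunk_size];
        std::size_t positions[chunk_size];
        // Stage 1: hash every key and prefetch its metadata.
        for (std::size_t i = 0; i < n; ++i) {
            hashes[i] = mix(keys[i]);
            positions[i] = home(hashes[i]);
            TOY_PREFETCH(&meta[positions[i]]);
        }
        // Stage 2: metadata is hopefully cached by now; on a match, prefetch the element.
        for (std::size_t i = 0; i < n; ++i) {
            if (meta[positions[i]] == reduced(hashes[i])) TOY_PREFETCH(&slots[positions[i]]);
        }
        // Stage 3: compare and visit; collisions fall back to regular probing.
        std::size_t visited = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (visit_from(positions[i], hashes[i], keys[i], f)) ++visited;
        }
        return visited;
    }
};
```

<p dir="auto">The results of <code>visit_chunk</code> are the same as those of calling <code>visit</code> key by key, in the same order; only the order of the memory accesses changes.</p>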
<p dir="auto">The operation <code>visit(first, last, f)</code> internally splits [<code>first</code>, <code>last</code>) into chunks
of
<a href="https://www.boost.org/doc/libs/develop/libs/unordered/doc/html/unordered.html#concurrent_flat_map_constants" rel="nofollow"><code>bulk_visit_size</code></a>
elements that are then processed as described above.
This chunk size has to be sufficiently large to give time for memory to be
actually cached at the point of usage. On the upper side, the chunk size is limited
by the number of outstanding memory requests that the CPU can handle at a time: in
Intel architectures, this is limited by the size of the
<a href="https://cdrdv2-public.intel.com/671488/248966-046A-software-optimization-manual.pdf" rel="nofollow"><i>line fill buffer</i></a>,
which typically holds 10-12 entries. We have empirically confirmed that bulk visitation performance peaks at around
<code>bulk_visit_size</code> = 16, and plateaus beyond that.</p>
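<p dir="auto">The chunking step can be sketched as follows. This is a hypothetical helper (<code>for_each_chunk</code> is not part of Boost.Unordered, and the library's internal splitting logic may differ); it assumes random-access iterators and simply hands consecutive subranges of at most <code>chunk_size</code> elements to a callback, with a shorter last chunk when the range size is not a multiple of <code>chunk_size</code>.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical helper: calls process_chunk(chunk_first, chunk_last) on
// consecutive subranges of [first, last) with at most chunk_size elements each.
// Returns the number of chunks processed.
template <typename RandomIt, typename ChunkFn>
std::size_t for_each_chunk(RandomIt first, RandomIt last,
                           std::size_t chunk_size, ChunkFn process_chunk) {
    using diff_t = typename std::iterator_traits<RandomIt>::difference_type;
    assert(chunk_size > 0);
    std::size_t chunks = 0;
    while (first != last) {
        const auto remaining = static_cast<std::size_t>(std::distance(first, last));
        const std::size_t n = std::min(remaining, chunk_size);
        RandomIt chunk_last = first + static_cast<diff_t>(n);
        process_chunk(first, chunk_last);  // e.g. a pipelined lookup over the chunk
        first = chunk_last;
        ++chunks;
    }
    return chunks;
}
```

<p dir="auto">With 40 keys and a chunk size of 16, for instance, the callback is invoked three times, on subranges of 16, 16 and 8 keys.</p>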
<h2 dir="auto" id="user-content-performance-analysis" tabindex="-1"><a name="performance-analysis"><b><span style="font-size: large;">Performance analysis</span></b></a></h2><p dir="auto">For our study of bulk visitation performance, we have used a computer with
a <a href="https://www.7-cpu.com/cpu/Skylake.html" rel="nofollow">Skylake</a>-based Intel Core i5-8265U CPU:</p>
<table border="1" style="border-collapse: collapse; height: 201px; margin-left: auto; margin-right: auto; padding: 0pt; text-align: center; width: 80%;">
<thead>
<tr>
<th align="left">Memory level</th>
<th align="center">Size/core</th>
<th align="center">Latency [ns]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" style="padding-left: 5px;">L1 data cache</td>
<td align="center">32 KB</td>
<td align="center">3.13</td>
</tr>
<tr>
<td align="left" style="padding-left: 5px;">L2 cache</td>
<td align="center">256 KB</td>
<td align="center">6.88</td>
</tr>
<tr>
<td align="left" style="padding-left: 5px;">L3 cache</td>
<td align="center">6 MB</td>
<td align="center">25.00</td>
</tr>
<tr>
<td align="left" style="padding-left: 5px;">DDR4 RAM</td>
<td align="center"><br /></td>
<td align="center">77.25</td>
</tr>
</tbody>
</table>
<p dir="auto">We measure the throughput in Mops/sec of single-threaded lookup (50/50 successful/unsuccessful)
for both regular and bulk visitation on a <code>boost::concurrent_flat_map<int, int></code> with sizes
<i>N</i> = 3k, 25k, 600k, and 10M: for the first three values, the container fits entirely into
L1, L2 and L3, respectively. The <a href="https://github.com/joaquintides/bulk_visit_performance/blob/main/bulk_visit_performance.cpp">test program</a>
has been compiled with <code>clang-cl</code> for Visual Studio 2022 in release mode.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSAQ1wKDfyS7KD0CqWW3vTTlPdjv2h9zjQI_IMSi85AgUR2frnm40h6zjhKlmwTodv8HFEII5T9G7KeJRE5DmPOS0VFSAGnIhuTQx-baUZjFD9qNZBDbTm588aZ3ydBsYIzrsP3tMb0idW6DIz363VJuU8WlCP0uYB9fQsJJUWe_BibHvV58GszWJo0cQ/s16000/performance.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="446" data-original-width="714" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSAQ1wKDfyS7KD0CqWW3vTTlPdjv2h9zjQI_IMSi85AgUR2frnm40h6zjhKlmwTodv8HFEII5T9G7KeJRE5DmPOS0VFSAGnIhuTQx-baUZjFD9qNZBDbTm588aZ3ydBsYIzrsP3tMb0idW6DIz363VJuU8WlCP0uYB9fQsJJUWe_BibHvV58GszWJo0cQ/w600/performance.png" width="600" /></a></div><p dir="auto"></p>
<p dir="auto">As expected, the relative performance of bulk vs. regular visitation grows as
data is fetched from a slower cache (or from RAM in the last case). The theoretical
throughput achievable by bulk visitation has been estimated from regular visitation
by subtracting memory retrieval times as calculated with the following model:</p>
<ul dir="auto"><li>If the container fits in L<i>n</i> (L4 = RAM), L<i>n</i>−1 is entirely occupied
by metadata and access objects (and some of this data spills over to L<i>n</i>).</li><li>Mutex and element retrieval times (which only apply to successful visitation)
are dominated by the latter.</li></ul>
<p dir="auto">Actual and theoretical figures match quite well, which suggests that the
algorithmic overhead imposed by bulk visitation is negligible.</p>
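<p dir="auto">One possible reading of this model, written out as arithmetic: every lookup waits once for metadata, and successful lookups additionally wait for the element. The sketch below encodes that reading; the function names are hypothetical, the spillover of metadata into the next cache level is ignored, and this is our interpretation for illustration rather than the exact calculation behind the chart.</p>

```cpp
#include <cassert>

// Expected memory stall per lookup, in ns (assumed model): metadata retrieval
// is paid by every lookup, element retrieval only by successful ones.
inline double expected_stall_ns(double metadata_latency_ns,
                                double element_latency_ns,
                                double success_ratio) {
    assert(success_ratio >= 0.0 && success_ratio <= 1.0);
    return metadata_latency_ns + success_ratio * element_latency_ns;
}

// Theoretical bulk throughput (Mops/s) obtained by subtracting the stall
// time from the per-operation time of regular visitation.
inline double theoretical_bulk_mops(double regular_mops, double stall_ns) {
    assert(regular_mops > 0.0);
    const double regular_ns_per_op = 1000.0 / regular_mops;  // 1 Mops/s = 1000 ns per op
    const double bulk_ns_per_op = regular_ns_per_op - stall_ns;
    assert(bulk_ns_per_op > 0.0);
    return 1000.0 / bulk_ns_per_op;
}
```

<p dir="auto">For a container that fits in L3 with 50/50 successful/unsuccessful lookups, the latencies in the table give <code>expected_stall_ns(6.88, 25.00, 0.5)</code> = 19.38 ns. If regular visitation ran at, say, 20 Mops/sec (a made-up figure used only for this example, not a measurement), the model would predict roughly 32.7 Mops/sec for bulk visitation.</p>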
<p dir="auto">We have also run
<a href="https://github.com/boostorg/boost_unordered_benchmarks/tree/boost_concurrent_flat_map">benchmarks</a>
under conditions more similar to real-life for <code>boost::concurrent_flat_map</code>,
with and without bulk visitation, and other concurrent containers, using different
compilers and architectures. As an example, these are the results for a workload of 50M
insert/lookup mixed operations distributed across several concurrent threads for different
data distributions with Clang 12 on an ARM64 computer:</p>
<table>
<thead>
<tr>
<th align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg66THzURjiy1yHmoRUdbZVLDi57GaDCeA1755_Sx5jdbZVwL-0SyN6rr1dYu6V5icrJPpD9Uw0j5wMd21eGvZRxKjHf4Yu4xFMQSBUaKxVMR_lCz0lQM1HxLIwx5o1VFzya4nbd8twV0C8ucbNhnFdTZi-H86y8dn-wDDPhDoOA2-m941ufa1pPAVPjXY/s698/Parallel%20workload.xlsx.5M,%200.01.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg66THzURjiy1yHmoRUdbZVLDi57GaDCeA1755_Sx5jdbZVwL-0SyN6rr1dYu6V5icrJPpD9Uw0j5wMd21eGvZRxKjHf4Yu4xFMQSBUaKxVMR_lCz0lQM1HxLIwx5o1VFzya4nbd8twV0C8ucbNhnFdTZi-H86y8dn-wDDPhDoOA2-m941ufa1pPAVPjXY/w200-h129/Parallel%20workload.xlsx.5M,%200.01.png" width="180" /></a>
</th>
<th align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioy6059L3HFaVpX5dgOJibmTo6NYqahfxfzP-kDsBUMgXroAoyfWX-s8jHRzCe8Yz4-K3JmUeIVPW6hd5GySNWZtPB79vYKhfzHinfNlFsMjN20TSFqpPDV7IEpaeAYtD2A5Wjxmt73A_AGshvF2V2-CCgn-8AmNP18lnpVrTqbIsA1sW0HvVkjBtyMwY/s698/Parallel%20workload.xlsx.5M,%200.5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioy6059L3HFaVpX5dgOJibmTo6NYqahfxfzP-kDsBUMgXroAoyfWX-s8jHRzCe8Yz4-K3JmUeIVPW6hd5GySNWZtPB79vYKhfzHinfNlFsMjN20TSFqpPDV7IEpaeAYtD2A5Wjxmt73A_AGshvF2V2-CCgn-8AmNP18lnpVrTqbIsA1sW0HvVkjBtyMwY/w200-h129/Parallel%20workload.xlsx.5M,%200.5.png" width="180" /></a>
</th>
<th align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjk6rxpKAZw_5IVAVIdzGfBGahg0NLbxctPtUsvExLJPFDleXYAaqePi6a89imdguwfaVXZCXCk7xoMTPKq058eIC69kzbKxhgRNLrPzU-1zt_tcM_wq2oZLOQNrxOdCU9TfniNvt0TW64KnpOydHgFW36y7IbQAc5-AKBu99tfY0PIuzq5jRLIfyHUarg/s698/Parallel%20workload.xlsx.5M,%200.99.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjk6rxpKAZw_5IVAVIdzGfBGahg0NLbxctPtUsvExLJPFDleXYAaqePi6a89imdguwfaVXZCXCk7xoMTPKq058eIC69kzbKxhgRNLrPzU-1zt_tcM_wq2oZLOQNrxOdCU9TfniNvt0TW64KnpOydHgFW36y7IbQAc5-AKBu99tfY0PIuzq5jRLIfyHUarg/w200-h129/Parallel%20workload.xlsx.5M,%200.99.png" width="180" /></a>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">5M updates, 45M lookups<br />skew=0.01</td>
<td align="center">5M updates, 45M lookups<br />skew=0.5</td>
<td align="center">5M updates, 45M lookups<br />skew=0.99</td>
</tr>
</tbody>
</table>
<p dir="auto">Again, bulk visitation increases performance noticeably. Please refer to the
<a href="https://github.com/boostorg/boost_unordered_benchmarks/tree/boost_concurrent_flat_map">benchmark site</a>
for further information and results.</p>
<h2 dir="auto" id="user-content-conclusions-and-next-steps" tabindex="-1"><a name="conclusions-and-next-steps"><div style="text-align: left;"><b><span style="font-size: large;">Conclusions and next steps</span></b></div></a></h2><p dir="auto">Bulk visitation is an addition to the interface of <code>boost::concurrent_flat_map</code>
and <code>boost::concurrent_flat_set</code> that improves lookup performance by pipelining
the internal visitation operations for chunked groups of keys. The tradeoff
for this increased throughput is higher latency, as keys need to be batched
by the user code before issuing the <code>visit</code> operation.</p>
<p dir="auto">The insights we have gained with bulk visitation for concurrent containers
can be leveraged for future Boost.Unordered features:</p>
<ul dir="auto"><li>In principle, insertion can also be made to operate in bulk mode, although
the resulting pipelined algorithm is likely more complex than in the
visitation case, and thus performance increases are expected to be lower.</li><li>Bulk visitation (and insertion) is directly applicable to non-concurrent
containers such as
<a href="https://www.boost.org/doc/libs/develop/libs/unordered/doc/html/unordered.html#unordered_flat_map" rel="nofollow"><code>boost::unordered_flat_map</code></a>:
the main problem for this is one of interface design because we are not using
visitation here as the default lookup API (classical iterator-based
lookup is provided instead). Some possible options are:
<ol dir="auto"><li>Use visitation as in the concurrent case.</li><li>Use an iterator-based lookup API that outputs the resulting iterators to
some user-provided buffer (probably modelled as an output "meta" iterator taking
container iterators).</li></ol>
</li></ul>
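<p dir="auto">To make the second option more concrete, here is a minimal sketch of what such an interface could look like. <code>bulk_find</code> is a hypothetical free function, not an existing or planned Boost.Unordered API, and this version just loops over <code>find</code> on a <code>std::unordered_map</code> without any pipelining: it only illustrates the shape of an iterator-based bulk lookup that writes its results to a user-provided output iterator.</p>

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical interface sketch: looks up every key in [first, last) and
// writes the resulting container iterators (m.end() for missing keys) to out.
template <typename Map, typename InputIt, typename OutputIt>
OutputIt bulk_find(Map& m, InputIt first, InputIt last, OutputIt out) {
    for (; first != last; ++first) {
        *out++ = m.find(*first);
    }
    return out;
}

// Small usage demonstration with a fixed-size result buffer.
inline void bulk_find_demo() {
    std::unordered_map<int, int> m{{1, 10}, {2, 20}};
    const int keys[3] = {2, 7, 1};
    std::unordered_map<int, int>::iterator results[3];
    bulk_find(m, keys, keys + 3, results);
    assert(results[0]->second == 20);  // key 2 found
    assert(results[1] == m.end());     // key 7 not found
    assert(results[2]->second == 10);  // key 1 found
}
```

<p dir="auto">A <code>std::back_inserter</code> over a vector of iterators would work equally well as the output argument.</p>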
<p dir="auto">Bulk visitation will be officially shipping in Boost 1.84 (December 2023) but is already
available by checking out the
<a href="https://github.com/boostorg/unordered/">Boost.Unordered repo</a>. If you are interested
in this feature, please try it and report your local results and suggestions for
improvement. Your feedback on our current and future work is much welcome.</p><p></p>Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-75878048418699769562023-08-18T18:27:00.003+02:002023-08-18T18:27:48.501+02:00User-defined class qualifiers in C++23<p>It is generally known that type qualifiers (such as <code>const</code> and <code>volatile</code> in C++) can be regarded as <a href="https://dl.acm.org/doi/pdf/10.1145/301618.301665">a form of subtyping</a>: for instance, <code>const T</code> is a <i>supertype</i> of <code>T</code> because the interface (available operations) of <code>T</code> is strictly wider than that of <code>const T</code>. Foster et al. call a qualifier <b>q</b> <i>positive</i> if <b>q</b> <code>T</code> is a supertype of <code>T</code>, and <i>negative</i> if it is the other way around. Without real loss of generality, in what follows we only consider negative qualifiers, where <b>q</b> <code>T</code> is a <i>subtype</i> of (extends the interface of) <code>T</code>.</p><p>C++23 <a href="https://en.cppreference.com/w/cpp/language/member_functions#Explicit_object_parameter">explicit object parameters</a> (colloquially known as "deducing <code>this</code>") allow for a particularly concise and effective realization of user-defined qualifiers for class types beyond what the language provides natively. For instance, this is a syntactically complete implementation of qualifier <code>mut</code>, the dual/inverse of <code>const</code> (not to be confused with <code>mutable</code>):<br /></p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">template<typename T><br />struct mut: T<br />{<br /> using T::T;<br />};<br /><br />template<typename T><br />T& as_const(T& x) { return x;}<br /><br />template<typename T><br />T& as_const(mut<T>& x) { return x;}<br /><br />struct X<br />{<br /> void foo() {}<br /> void bar(this mut<X>&) {}<br />};<br /><br />int main()<br />{<br /> mut<X> x;<br /> x.foo();<br /> x.bar();<br /><br /> auto& y = as_const(x);<br /> y.foo();<br /> y.bar(); // <span class="linked-compiler-output-line">error: cannot convert argument 1 from 'X' to 'mut<X> &'<br /></span><br /> X& z = x;<br /> z.foo();<br /> z.bar(); // <span class="linked-compiler-output-line">error: cannot convert argument 1 from 'X' to 'mut<X> &'<br /></span>}</pre></div><p>The class <code>X</code> has a regular (generally accessible) member function <code>foo</code> and then <code>bar</code>, which is only accessible to instances of the form <code>mut<X></code>. Access checking and implicit and explicit conversion between subtype <code>mut<X></code> and <code>X</code> work as expected.</p><p>With some help from <a href="https://boost.org/libs/mp11">Boost.Mp11</a>, the idiom can be generalized to the case of several qualifiers:<br /></p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">#include <boost/mp11/algorithm.hpp><br />#include <boost/mp11/list.hpp><br />#include <type_traits><br /><br />template<typename T,typename... Qualifiers><br />struct access: T<br />{<br /> using qualifier_list=boost::mp11::mp_list<Qualifiers...>;<br /><br /> using T::T;<br />};<br /><br />template<typename T, typename... Qualifiers><br />concept qualified =<br /> (boost::mp11::mp_contains<<br /> typename std::remove_cvref_t<T>::qualifier_list,<br /> Qualifiers>::value && ...);<br /><br />// some qualifiers<br />struct mut;<br />struct synchronized;<br /><br />template<typename T><br />concept is_mut = qualified<T, mut>;<br /><br />template<typename T><br />concept is_synchronized = qualified<T, synchronized>;<br /><br />struct X<br />{<br /> void foo() {}<br /><br /> template<is_mut Self><br /> void bar(this Self&&) {} <br /><br /> template<is_synchronized Self><br /> void baz(this Self&&) {}<br /><br /> template<typename Self><br /> void qux(this Self&&)<br /> requires qualified<Self, mut, synchronized><br /> {}<br />};<br /><br />int main()<br />{<br /> access<X, mut> x;<br /><br /> x.foo();<br /> x.bar();<br /> x.baz(); // error: associated constraints are not satisfied<br /> x.qux(); // error: associated constraints are not satisfied<br /><br /> X y;<br /> y.foo();<br /> y.bar(); // error: associated constraints are not satisfied<br /><br /> access<X, mut, synchronized> z;<br /> z.bar();<br /> z.baz();<br /> z.qux();<br />}</pre></div>One difficulty remains, though:<br />
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">int main()<br />{<br /> access<X, mut, synchronized> z;<br /> //...<br /> access<X, mut>& w=z; // error: <span class="linked-compiler-output-line">cannot convert from<br /> // 'access<X,mut,synchronized>'<br /> // to 'access<X,mut> &'</span><br />}</pre></div><code>access<T,Qualifiers...>&</code> converts to <code>T&</code>, but not to <code>access<T,Qualifiers2...>&</code> where <code>Qualifiers2</code> is a subset of <code>Qualifiers</code> (for the mathematically inclined, qualifiers <b>q</b><sub><b>1</b></sub>, ... , <b>q</b><sub><b><i>N</i></b></sub> over a type <code>T</code> induce a <a href="https://en.wikipedia.org/wiki/Lattice_(order)">lattice</a> of subtypes <b>Q</b> <code>T</code>, <b>Q</b> ⊆ {<b>q</b><sub><b>1</b></sub>, ... , <b>q</b><sub><b><i>N</i></b></sub>}, ordered by qualifier inclusion). Incurring undefined behavior, we could do the following:<br />
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">template<typename T,typename... Qualifiers><br />struct access: T<br />{<br /> using qualifier_list=boost::mp11::mp_list<Qualifiers...>;<br /><br /> using T::T;<br /><br /> template<typename... Qualifiers2><br /> operator access<T, Qualifiers2...>&()<br /> requires qualified<access, Qualifiers2...><br /> {<br /> return reinterpret_cast<access<T, Qualifiers2...>&>(*this);<br /> }<br />};</pre></div>A more interesting challenge is the following: As laid out, this technique implements <i>syntactic</i> qualifier subtyping, but does not do anything towards enforcing the semantics associated with each qualifier: for instance, <code>synchronized</code> should lock a mutex automatically, and a qualifier associated with some particular invariant should assert it after each invocation of a qualifier-constrained member function. I don't know if this functionality can be more or less easily integrated into the presented framework: feedback on the matter is much welcome.<br />Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-17717052044500273432023-07-07T12:47:00.005+02:002023-07-07T22:05:21.979+02:00Inside boost::concurrent_flat_map<p></p><h1 style="text-align: left;" tabindex="-1"></h1><ul style="text-align: left;"><li><a href="#introduction">Introduction</a></li><li><a href="#state-of-the-art">State of the art</a></li><li><a href="#design-principles">Design principles</a></li><li><a href="#data-structure">Data structure</a></li><li><a href="#algorithms">Algorithms</a></li><ul><li><a href="#lookup">Lookup</a></li><li><a href="#insertion">Insertion</a></li></ul><li><a href="#visitation-api">Visitation API</a></li><li><a href="#benchmarks">Benchmarks</a></li><li><a href="#conclusions-and-next-steps">Conclusions and next steps</a><br /></li></ul><a 
name="introduction"><div style="text-align: left;"><b><span style="font-size: large;">Introduction</span></b></div></a>
<p dir="auto">Starting in Boost 1.83, <a href="https://www.boost.org/doc/libs/release/libs/unordered/" rel="nofollow">Boost.Unordered</a> provides
<code>boost::concurrent_flat_map</code>, an associative container suitable for high-load parallel scenarios.
<code>boost::concurrent_flat_map</code> leverages much of the work done for
<a href="https://bannalia.blogspot.com/2022/11/inside-boostunorderedflatmap.html" rel="nofollow"><code>boost::unordered_flat_map</code></a>,
but also introduces innovations, particularly in the areas of low-contention
operation and API design, that we find worth discussing.</p>
<a name="state-of-the-art"><div style="text-align: left;"><b><span style="font-size: large;">State of the art</span></b></div></a>
<p dir="auto">The space of C++ concurrent hashmaps spans a diversity of competing techniques, from
traditional ones such as lock-based structures or sharding, to very specialized approaches relying
on CAS instructions, hazard pointers, Read-Copy-Update (RCU), etc. We list some
prominent examples:</p>
<ul dir="auto"><li><a href="https://spec.oneapi.io/versions/latest/elements/oneTBB/source/containers/concurrent_hash_map_cls.html" rel="nofollow"><code>tbb::concurrent_hash_map</code></a>
uses closed addressing combined with bucket-level read-write locking. The bucket array is split
in a number of <i>segments</i> to allow for incremental rehashing without locking the entire table.
Concurrent insertion, lookup and erasure are supported, but iterators are not thread safe.
Locked access to elements is done via so-called <i>accessors</i>.</li><li><a href="https://spec.oneapi.io/versions/latest/elements/oneTBB/source/containers/concurrent_unordered_map_cls.html" rel="nofollow"><code>tbb::concurrent_unordered_map</code></a>
also uses closed addressing, but buckets are organized into lock-free
<a href="https://dl.acm.org/doi/10.1145/1147954.1147958" rel="nofollow"><i>split-ordered lists</i></a>.
Concurrent insertion, lookup, and traversal are supported, whereas erasure is not
thread safe. Element access via iterators is not protected against data races.</li><li><i>Sharding</i> consists in dividing the hashmap into a fixed number <i>N</i> of submaps indexed
by hash (typically, the element <i>x</i> goes into the submap with index hash(<i>x</i>) mod <i>N</i>).
Sharding is extremely easy to implement starting from a non-concurrent hashmap and provides
incremental rehashing, but the degree of concurrency is limited by <i>N</i>.
As an example,
<a href="https://github.com/greg7mdp/gtl/blob/main/docs/phmap.md"><code>gtl::parallel_flat_hash_map</code></a> uses sharding
with submaps essentially derived from <a href="https://abseil.io/docs/cpp/guides/container" rel="nofollow"><code>absl::flat_hash_map</code></a>,
and inherits the excellent performance of this base container.</li><li><a href="https://efficient.github.io/libcuckoo/classlibcuckoo_1_1cuckoohash__map.html" rel="nofollow"><code>libcuckoo::cuckoohash_map</code></a>
adds efficient thread safety to classical
<a href="https://en.wikipedia.org/wiki/Cuckoo_hashing" rel="nofollow">cuckoo hashing</a> by means of a number of carefully
engineered <a href="https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf" rel="nofollow">techniques</a>
including fine-grained locking of slot groups or "strips" (of size 4 by default),
optimistic insertion and data prefetching.</li><li>Meta's <a href="https://github.com/facebook/folly/blob/main/folly/concurrency/ConcurrentHashMap.h"><code>folly::ConcurrentHashMap</code></a>
combines closed addressing, sharding and <a href="https://en.wikipedia.org/wiki/Hazard_pointer" rel="nofollow"><i>hazard pointers</i></a>
to elements to achieve lock-free lookup (modifying operations such as insertion
and erasure lock the affected shard). Iterators, which internally hold a hazard
pointer to the element, can be validly dereferenced even after the element
has been erased from the map; access, on the other hand, is constant and
elements are basically treated as immutable.</li><li><a href="https://github.com/facebook/folly/blob/main/folly/docs/AtomicHashMap.md"><code>folly::AtomicHashMap</code></a>
is a very specialized hashmap that imposes severe usage restrictions in exchange for
very high time and space performance. Keys must be trivially copyable and 32 or 64 bits in size so that
they can be handled internally by means of atomic instructions; also, some key values
must be reserved to mark empty slots, tombstones and locked elements, so that
no extra memory is required for bookkeeping information and locks. The internal
data structure is based on open addressing with linear probing. Non-modifying
operations are lock-free. Rehashing is not provided: instead, extra bucket arrays are appended
when the map becomes full, the expectation being that the user provide the estimated final size at
construction time to avoid this rather inefficient growth mechanism. Element access
is not protected against data races.</li><li>On a more experimental/academic note, we can mention initiatives such as
<a href="https://preshing.com/20160201/new-concurrent-hash-maps-for-cpp/" rel="nofollow">Junction</a>,
<a href="https://arxiv.org/pdf/1601.04017.pdf" rel="nofollow">Folklore</a> and
<a href="https://dl.acm.org/doi/pdf/10.1145/3552326.3587457" rel="nofollow">DRAMHiT</a>. In general, these
do not provide industry-grade container implementations but explore interesting
ideas that could eventually be adopted by mainstream libraries, such as
<a href="https://en.wikipedia.org/wiki/Read-copy-update" rel="nofollow">RCU</a>-based data structures,
lock-free algorithms relying on
<a href="https://en.wikipedia.org/wiki/Compare-and-swap" rel="nofollow">CAS</a> and/or
<a href="https://en.wikipedia.org/wiki/Transactional_memory" rel="nofollow">transactional memory</a>,
parallel rehashing and operation batching.</li></ul>
<a name="design-principles"><div style="text-align: left;"><b><span style="font-size: large;">Design principles</span></b></div></a>
<p dir="auto">Unlike non-concurrent C++ containers, where the STL acts as a sort of
reference interface, concurrent hashmaps on the market
differ wildly in terms of requirements, API and provided functionality. When
designing <code>boost::concurrent_flat_map</code>, we have aimed for a general-purpose
container</p>
<ul dir="auto"><li>with no special restrictions on key and mapped types,</li><li>providing full thread safety without external synchronization mechanisms,</li><li>and disrupting as little as possible the conceptual and operational model
of "traditional" containers.</li></ul>
<p dir="auto">These principles rule out some scenarios such as requiring that
keys be of an integral type or putting an extra burden on the user in terms
of access synchronization or active garbage collection. They also inform
concrete design decisions:</p>
<ul dir="auto"><li><code>boost::concurrent_flat_map<Key, T, Hash, Pred, Allocator></code> must be a valid
instantiation in all practical cases where
<code>boost::unordered_flat_map<Key, T, Hash, Pred, Allocator></code> is.</li><li>Thread-safe value semantics are provided (including copy construction, assignment,
swap, etc.)</li><li>All member functions in <code>boost::unordered_flat_map</code> are provided by
<code>boost::concurrent_flat_map</code> except if there's a fundamental reason why
they can't work safely or efficiently in a concurrent setting.</li></ul>
<p dir="auto">The last guideline has the most impact on API design. In particular, we have
decided <i>not to provide iterators</i>, whether blocking or non-blocking: non-blocking
iterators are unsafe, while blocking iterators increase contention when not properly used
and can very easily lead to deadlocks:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint"><span class="pl-c"><span class="pl-c">//</span> Thread 1</span>
map_type::iterator it1=map.find(x1), it2=map.find(x2);
<span class="pl-c"><span class="pl-c">//</span> Thread 2</span>
map_type::iterator it2=map.find(x2), it1=map.find(x1);</pre></div><p dir="auto">In place of iterators, <code>boost::concurrent_flat_map</code> offers an access API
based on internal visitation, as described in a later <a href="#visitation-api">section</a>.</p>
<a name="data-structure"><div style="text-align: left;"><b><span style="font-size: large;">Data structure</span></b></div></a>
<p dir="auto"><code>boost::concurrent_flat_map</code> uses the same
<a href="https://bannalia.blogspot.com/#boostunordered_flat_map-data-structure" rel="nofollow">open-addressing layout</a>
as <code>boost::unordered_flat_map</code>, where the bucket array is split into
2<sup><i>n</i></sup> groups of <i>N</i> = 15 slots and each group has an associated
16-byte metadata word for SIMD-based reduced-hash matching and insertion overflow
control.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgqQsUWKZyjAuhnZkvVV6DvZQ6ytlHWwT-uo9jSonjlpvFeK0pFu8Lkn-Awvr2n0c8BSQqD_tyPxjTTW_ybWxbFMc-nyUF_MfehkQ4es7pVwyH_GIws29indL8Nzsjiu67qdJSROpd7Di4eVgx9reLqZBrRFFLf8CNjFk03nMVyH7Ze6mdiUAfvFzFN3E/s16000/data_structure.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="351" data-original-width="931" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgqQsUWKZyjAuhnZkvVV6DvZQ6ytlHWwT-uo9jSonjlpvFeK0pFu8Lkn-Awvr2n0c8BSQqD_tyPxjTTW_ybWxbFMc-nyUF_MfehkQ4es7pVwyH_GIws29indL8Nzsjiu67qdJSROpd7Di4eVgx9reLqZBrRFFLf8CNjFk03nMVyH7Ze6mdiUAfvFzFN3E/w600/data_structure.png" width="600" /></a></div><p dir="auto"></p>
<p dir="auto">On top of this layout, two synchronization levels are added:</p>
<ul dir="auto"><li>Container level: A read-write mutex is used to control access from any operation to the
container. This access is always requested in read mode (i.e. shared) except for operations
that require that the whole bucket array be replaced, like rehashing, swapping,
assignment, etc. This means that, in practice, this level of synchronization does
not cause any contention at all, even for modifying operations like insertion and
erasure. To reduce cache coherence traffic, the mutex is implemented as an array
of read-write spinlocks occupying separate cache lines, and each thread is
assigned one spinlock in a round-robin fashion at <code>thread_local</code> construction time:
read/shared access involves only the assigned spinlock, whereas write/exclusive
access, which is comparatively much rarer, requires that all spinlocks be locked.</li><li>Group level: Each group has a dedicated read-write spinlock to control access to its
slots, plus an atomic <i>insertion counter</i> used for transactional optimistic insertion
as described below.</li></ul>
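<p dir="auto">The container-level mutex described above can be sketched with the standard library alone. The following is a minimal illustration, not Boost's actual implementation: the names <code>rw_spinlock</code> and <code>multi_rw_mutex</code>, the number of spinlocks (8) and the cache line size (64 bytes) are all assumptions made for the example.</p>

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Minimal read-write spinlock: state >= 0 counts readers, -1 marks a writer.
class rw_spinlock {
  std::atomic<int> state_{0};

public:
  bool try_lock_shared() {
    int s = state_.load(std::memory_order_relaxed);
    return s >= 0 &&
           state_.compare_exchange_strong(s, s + 1, std::memory_order_acquire);
  }
  bool try_lock() {
    int expected = 0;
    return state_.compare_exchange_strong(expected, -1, std::memory_order_acquire);
  }
  void lock_shared() { while (!try_lock_shared()) std::this_thread::yield(); }
  void lock() { while (!try_lock()) std::this_thread::yield(); }
  void unlock_shared() { state_.fetch_sub(1, std::memory_order_release); }
  void unlock() {
    assert(state_.load(std::memory_order_relaxed) == -1);  // must be write-locked
    state_.store(0, std::memory_order_release);
  }
};

// Read-write mutex made of several spinlocks living in separate cache lines.
class multi_rw_mutex {
  static constexpr std::size_t kLocks = 8;               // hypothetical array size
  struct alignas(64) padded_lock { rw_spinlock lock; };  // 64 = assumed cache line size
  std::array<padded_lock, kLocks> locks_;

  // Each thread is assigned one spinlock, round-robin, on first use.
  static std::size_t my_index() {
    static std::atomic<std::size_t> next{0};
    thread_local const std::size_t idx = next.fetch_add(1) % kLocks;
    return idx;
  }

public:
  // Shared access touches only the calling thread's spinlock.
  void lock_shared() { locks_[my_index()].lock.lock_shared(); }
  void unlock_shared() { locks_[my_index()].lock.unlock_shared(); }

  // Exclusive access must acquire every spinlock (always in the same order).
  void lock() { for (auto& p : locks_) p.lock.lock(); }
  void unlock() { for (auto& p : locks_) p.lock.unlock(); }

  bool try_lock() {
    for (std::size_t i = 0; i < kLocks; ++i) {
      if (!locks_[i].lock.try_lock()) {
        while (i > 0) locks_[--i].lock.unlock();  // roll back what we acquired
        return false;
      }
    }
    return true;
  }
};
```

<p dir="auto">In this sketch, <code>unlock_shared</code> has to be called from the same thread that called <code>lock_shared</code>, since the spinlock is selected per thread. Readers running on different threads write to different cache lines, which is the point of the design.</p>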
<a name="algorithms"><div style="text-align: left;"><b><span style="font-size: large;">Algorithms</span></b></div></a>
<p dir="auto">The core algorithms of <code>boost::concurrent_flat_map</code> are variations of those of
<code>boost::unordered_flat_map</code> with minimal changes to prevent data races while keeping
group-level contention to a minimum.</p>
<p dir="auto">In the following diagrams, white boxes represent lock-free steps, while gray boxes
are executed within the scope of a group lock. Metadata is handled atomically both
in locked and lock-free scenarios.</p><p>
<a name="lookup"></a></p><div style="text-align: left;"><a name="lookup"><span style="font-size: medium;"><b>Lookup <br /></b></span></a></div><p></p><p>
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiD1yg-ZTZW9jAZEnAvGVb8MBF-GDLjjGNzL_H7oEqPU0qv-f3yeOaAlAl-mf5yRFnEQEY3iaq1Ms7Jg-nmQf8ZeFqxWC2DOsKgUFqfiIuHN-2nDhSs2Oz6l9JT4RroxxcYJnfg558CsPlYFWfi7yOi4QFGTLxKS6Ng8Zmmp-a9Ve7Ad2jRzESIyXo6cjY/s1222/lookup.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="605" data-original-width="1222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiD1yg-ZTZW9jAZEnAvGVb8MBF-GDLjjGNzL_H7oEqPU0qv-f3yeOaAlAl-mf5yRFnEQEY3iaq1Ms7Jg-nmQf8ZeFqxWC2DOsKgUFqfiIuHN-2nDhSs2Oz6l9JT4RroxxcYJnfg558CsPlYFWfi7yOi4QFGTLxKS6Ng8Zmmp-a9Ve7Ad2jRzESIyXo6cjY/w600/lookup.png" width="600" /></a></div><p dir="auto"></p>
<p dir="auto">Most steps of the lookup algorithm (hash calculation, probing, element
pre-checking via SIMD matching with the value's reduced hash) are lock-free and
do not synchronize with any operation on the metadata. When SIMD matching detects
a potential candidate, double-checking for slot occupancy and the
actual comparison with the element are done within the group lock; note
that the occupancy double check is necessary precisely because SIMD matching
is lock-free and the status of the identified slot may have changed before
group locking.</p>
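<p dir="auto">The interplay between the lock-free pre-check and the locked double check can be illustrated on a single group. This is a simplified sketch under stated assumptions: a scalar loop stands in for SIMD matching, a <code>std::mutex</code> stands in for the group's read-write spinlock, and <code>group</code>, <code>reduced_hash</code>, <code>find_in_group</code> and <code>store</code> are hypothetical names, not Boost internals.</p>

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <optional>
#include <utility>

// Toy group of 15 slots. A metadata byte of 0 means "empty"; any other value
// is the reduced hash of the element stored in the matching slot.
struct group {
  static constexpr int N = 15;
  std::array<std::atomic<std::uint8_t>, N> meta{};
  std::array<std::pair<int, int>, N> slots{};
  std::mutex lock;  // stand-in for the group's read-write spinlock
};

// Hypothetical reduced hash: maps the key into [1, 255] so 0 stays free.
inline std::uint8_t reduced_hash(int key) {
  return static_cast<std::uint8_t>(static_cast<unsigned>(key) % 255u + 1u);
}

// Looks up `key` within one group.
std::optional<int> find_in_group(group& g, int key) {
  const std::uint8_t rh = reduced_hash(key);
  for (int n = 0; n < group::N; ++n) {
    // Lock-free pre-check on the metadata (scalar stand-in for SIMD matching).
    if (g.meta[n].load(std::memory_order_relaxed) != rh) continue;
    std::lock_guard<std::mutex> guard(g.lock);
    // Double check: the slot may have been erased or reused before we locked.
    if (g.meta[n].load(std::memory_order_relaxed) != rh) continue;
    // Only now is it safe to touch the element itself.
    if (g.slots[n].first == key) return g.slots[n].second;
  }
  return std::nullopt;
}

// Helper for the example: writes an element into slot n under the group lock.
void store(group& g, int n, int key, int value) {
  assert(n >= 0 && n < group::N);
  std::lock_guard<std::mutex> guard(g.lock);
  g.slots[n] = {key, value};
  g.meta[n].store(reduced_hash(key));
}
```

<p dir="auto">Two keys with the same reduced hash (42 and 297 with the toy function above) both pass the pre-check, and the full key comparison under the lock tells them apart.</p>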
<a name="insertion"><div style="text-align: left;"><span style="font-size: medium;"><b>Insertion</b></span></div></a>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBF4Si7cQVms5W1QzREohqt632hKN8zSbvlUwvo-7mZuFNl7jbsdy526MnuX_WmZBlpQeouNtQveo7cTL9wj2S_I6IlAW-4UIZ36g4V6-A8nuN1PTFMU4wz5--FoVkCP9LBz3Hiq_DKVVqjySwomj6C8a01iGvv2-oS6Q8ia4l4yijILeTQDNXNUZYwzA/s1222/insertion.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="616" data-original-width="1222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBF4Si7cQVms5W1QzREohqt632hKN8zSbvlUwvo-7mZuFNl7jbsdy526MnuX_WmZBlpQeouNtQveo7cTL9wj2S_I6IlAW-4UIZ36g4V6-A8nuN1PTFMU4wz5--FoVkCP9LBz3Hiq_DKVVqjySwomj6C8a01iGvv2-oS6Q8ia4l4yijILeTQDNXNUZYwzA/w600/insertion.png" width="600" /></a></div><p dir="auto"></p>
<p dir="auto">The main challenge of any concurrent insertion algorithm is to prevent an
element <i>x</i> from being inserted twice by different threads running at the
same time. As open-addressing probing starts at a position <i>p</i><sub>0</sub>
uniquely determined by the hash value of <i>x</i>, a naïve (and flawed) approach
is to lock <i>p</i><sub>0</sub> for the entire duration of the insertion
procedure: this leads to deadlocking if the probing sequences of two
different elements intersect.</p>
<p dir="auto">We have implemented the following transactional optimistic insertion algorithm:
At the beginning of insertion, the value of the insertion counter for
the group at position <i>p</i><sub>0</sub> is saved locally and insertion
proceeds normally, first checking that an element equivalent to <i>x</i> does
not exist and then looking for available slots starting at <i>p</i><sub>0</sub>
and locking only one group of the probing sequence at a time; when
an available slot is found, the associated metadata is updated,
the insertion counter at <i>p</i><sub>0</sub> is incremented, and:</p>
<ul dir="auto"><li>If no other thread got in the way (i.e. if the pre-increment value of the counter
coincides with the local value stored at the beginning), then the transaction
is successful and insertion can be finished by storing the element into
the slot before releasing the group lock.</li><li>Otherwise, metadata changes are rolled back and the entire insertion process is
started over.</li></ul>
<p dir="auto">Our measurements indicate that, even under adversarial situations, the
ratio of start-overs to successful insertions is on the order of parts per million.</p>
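<p dir="auto">The transactional optimistic insertion algorithm can be condensed into a toy table. The sketch below follows the steps described above but simplifies everything else, and all names are hypothetical: groups are probed linearly, lookup never terminates early, there is no erasure or rehashing, and a <code>std::mutex</code> replaces the group spinlock.</p>

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>

class toy_concurrent_map {
  static constexpr std::size_t G = 8;   // number of groups (hypothetical)
  static constexpr std::size_t N = 15;  // slots per group
  struct group {
    std::mutex lock;  // stand-in for the group spinlock
    std::atomic<std::uint32_t> insertion_counter{0};
    std::array<std::atomic<std::uint8_t>, N> meta{};  // 0 = empty slot
    std::array<std::pair<int, int>, N> slots{};
  };
  std::array<group, G> groups_;

  static std::size_t hash(int key) { return std::hash<int>{}(key); }
  static std::uint8_t reduced(std::size_t h) {
    return static_cast<std::uint8_t>(h % 255 + 1);
  }

public:
  // Toy lookup: probes every group starting at p0, with no early termination.
  std::optional<int> find(int key) {
    const std::size_t h = hash(key), p0 = h % G;
    for (std::size_t i = 0; i < G; ++i) {
      group& g = groups_[(p0 + i) % G];
      for (std::size_t n = 0; n < N; ++n) {
        if (g.meta[n].load() != reduced(h)) continue;  // lock-free pre-check
        std::lock_guard<std::mutex> guard(g.lock);
        if (g.meta[n].load() == reduced(h) && g.slots[n].first == key)
          return g.slots[n].second;
      }
    }
    return std::nullopt;
  }

  // Returns true if inserted, false if the key exists or the table is full.
  bool insert(int key, int value) {
    const std::size_t h = hash(key), p0 = h % G;
    for (;;) {  // each iteration is one transaction attempt
      const std::uint32_t saved = groups_[p0].insertion_counter.load();
      if (find(key)) return false;  // an equivalent element already exists
      bool restart = false;
      for (std::size_t i = 0; i < G && !restart; ++i) {
        group& g = groups_[(p0 + i) % G];
        std::lock_guard<std::mutex> guard(g.lock);  // one group at a time
        for (std::size_t n = 0; n < N; ++n) {
          if (g.meta[n].load() != 0) continue;
          g.meta[n].store(reduced(h));  // tentatively claim the slot
          if (groups_[p0].insertion_counter.fetch_add(1) == saved) {
            g.slots[n] = {key, value};  // commit before releasing the lock
            return true;
          }
          g.meta[n].store(0);  // another insertion got in the way: roll back
          restart = true;
          break;
        }
      }
      if (!restart) return false;  // no free slot anywhere
    }
  }
};
```

<p dir="auto">If two threads race to insert the same key, both may claim a slot, but only the first to increment the counter at <i>p</i><sub>0</sub> sees the value it saved; the other rolls back, restarts, and then finds the element during the initial lookup.</p>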
<a name="visitation-api"><div style="text-align: left;"><b><span style="font-size: large;">Visitation API</span></b></div></a>
<p dir="auto">From an operational point of view, container iterators serve two main purposes:
combining lookup/insertion with further access to the relevant element:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint"><span class="pl-k">auto</span> it = m.find(k);
<span class="pl-k">if</span> (it != m.end()) {
it-><span class="pl-smi">second</span> = <span class="pl-c1">0</span>;
}</pre></div><p dir="auto">and container traversal:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint"><span class="pl-c"><span class="pl-c">//</span> iterators used internally by range-for</span>
<span class="pl-k">for</span>(<span class="pl-k">auto</span>& x: m) {
x.<span class="pl-smi">second</span> = <span class="pl-c1">0</span>;
}</pre></div><p dir="auto">As <code>boost::concurrent_flat_map</code> does not rely on iterators
due to their inherent concurrency problems, a design alternative is to move element
access into the container operations themselves, where it can be done in a
thread-safe manner. This is just a form of the familiar visitation pattern:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">m.visit(k, [](<span class="pl-k">auto</span>& x) {
x.<span class="pl-smi">second</span> = <span class="pl-c1">0</span>;
});
m.visit_all([](<span class="pl-k">auto</span>& x) {
x.<span class="pl-smi">second</span> = <span class="pl-c1">0</span>;
});</pre></div><p dir="auto"><code>boost::concurrent_flat_map</code> provides visitation-enabled variations
of classical map operations wherever it makes sense:</p>
<ul dir="auto"><li><code>visit</code>, <code>cvisit</code> (in place of <code>find</code>)</li><li><code>visit_all</code>, <code>cvisit_all</code> (as a substitute for container traversal)</li><li><code>emplace_or_visit</code>, <code>emplace_or_cvisit</code></li><li><code>insert_or_visit</code>, <code>insert_or_cvisit</code></li><li><code>try_emplace_or_visit</code>, <code>try_emplace_or_cvisit</code></li></ul>
<p dir="auto"><code>cvisit</code> stands for constant visitation, that is, the visitation function
is granted read-only access to the element, which has less contention than
write access.</p>
<p dir="auto">Traversal functions <code>[c]visit_all</code> and <code>erase_if</code> also have parallel versions:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">m.visit_all(std::execution::par, [](<span class="pl-k">auto</span>& x) {
x.<span class="pl-smi">second</span> = <span class="pl-c1">0</span>;
});</pre></div>
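<p dir="auto">To see why visitation makes element access safe, it may help to look at a deliberately naive model of the pattern. The class below is a hypothetical <code>toy_visit_map</code> built on <code>std::unordered_map</code> and a single <code>std::mutex</code>; it only mirrors the shape of a few operations and has none of the fine-grained locking of <code>boost::concurrent_flat_map</code>.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>
#include <utility>

// Toy visitation-based map. Elements are only ever touched inside member
// functions, while the lock is held, so no reference escapes to the caller.
template <class Key, class T>
class toy_visit_map {
public:
  using value_type = std::pair<const Key, T>;

  // Calls f on the element with key k, if any; returns the number visited.
  template <class F>
  std::size_t visit(const Key& k, F f) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = map_.find(k);
    if (it == map_.end()) return 0;
    f(*it);
    return 1;
  }

  // Constant visitation: f receives a const reference. A real implementation
  // would take a shared (read) lock here; this toy has only one lock kind.
  template <class F>
  std::size_t cvisit(const Key& k, F f) const {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = map_.find(k);
    if (it == map_.end()) return 0;
    f(*it);
    return 1;
  }

  // Inserts x and returns true, or visits the existing element and returns false.
  template <class F>
  bool insert_or_visit(const value_type& x, F f) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto result = map_.insert(x);
    if (!result.second) f(*result.first);
    return result.second;
  }

  // Visits every element; returns the number of elements visited.
  template <class F>
  std::size_t visit_all(F f) {
    std::lock_guard<std::mutex> guard(mutex_);
    for (auto& x : map_) f(x);
    return map_.size();
  }

private:
  std::unordered_map<Key, T> map_;
  mutable std::mutex mutex_;
};
```

<p dir="auto">A typical use is counting occurrences: <code>m.insert_or_visit({k, 1}, [](auto& x) { ++x.second; })</code> either inserts the pair or increments the existing counter, atomically with respect to other callers.</p>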
<a name="benchmarks"><div style="text-align: left;"><b><span style="font-size: large;">Benchmarks</span></b></div></a>
<p dir="auto">We've tested <code>boost::concurrent_flat_map</code> against
<a href="https://spec.oneapi.io/versions/latest/elements/oneTBB/source/containers/concurrent_hash_map_cls.html" rel="nofollow"><code>tbb::concurrent_hash_map</code></a>
and <a href="https://github.com/greg7mdp/gtl/blob/main/docs/phmap.md"><code>gtl::parallel_flat_hash_map</code></a>
for the following synthetic scenario:
<i>T</i> threads concurrently perform <i>N</i> operations <b>update</b>, <b>successful lookup</b>
and <b>unsuccessful lookup</b>, randomly chosen with probabilities 10%, 45% and 45%, respectively,
on a concurrent map of (<code>int</code>, <code>int</code>) pairs.
The keys used by all operations are also random, where <b>update</b> and <b>successful lookup</b> follow a
<a href="https://en.wikipedia.org/wiki/Zipf%27s_law#Formal_definition" rel="nofollow">Zipf distribution</a> over [1, <i>N</i>/10]
with skew exponent <i>s</i>, and <b>unsuccessful lookup</b> follows a Zipf distribution
with the same skew <i>s</i> over an interval not overlapping with the former.</p>
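<p dir="auto">As a rough illustration of this workload (the actual benchmark code lives in the repository linked below and may be organized quite differently), the following sketch draws operations with the stated 10%/45%/45% mix and Zipf-distributed keys. The names <code>zipf_sampler</code> and <code>next_operation</code>, and the choice of (<i>n</i>, 2<i>n</i>] as the non-overlapping interval for unsuccessful lookups, are assumptions made for the example.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

enum class op { update, successful_lookup, unsuccessful_lookup };

// Zipf sampler over [1, n] with skew exponent s: P(k) is proportional to 1/k^s.
class zipf_sampler {
  std::discrete_distribution<std::size_t> dist_;

public:
  zipf_sampler(std::size_t n, double s) {
    assert(n > 0);
    std::vector<double> weights(n);
    for (std::size_t k = 1; k <= n; ++k)
      weights[k - 1] = 1.0 / std::pow(static_cast<double>(k), s);
    dist_ = std::discrete_distribution<std::size_t>(weights.begin(), weights.end());
  }

  template <class Rng>
  std::size_t operator()(Rng& rng) { return dist_(rng) + 1; }
};

struct operation {
  op kind;
  std::size_t key;
};

// Draws one operation: 10% update, 45% successful and 45% unsuccessful lookup.
// Keys of unsuccessful lookups are shifted into (n_keys, 2 * n_keys], an
// interval that never overlaps with the keys used by the other two operations.
template <class Rng>
operation next_operation(Rng& rng, zipf_sampler& zipf, std::size_t n_keys) {
  std::discrete_distribution<int> pick({10.0, 45.0, 45.0});
  const std::size_t k = zipf(rng);
  switch (pick(rng)) {
    case 0: return {op::update, k};
    case 1: return {op::successful_lookup, k};
    default: return {op::unsuccessful_lookup, k + n_keys};
  }
}
```

<p dir="auto">With a skew close to 0 the keys are nearly uniform, while with a skew close to 1 a handful of small keys receive most of the traffic, which is what stresses lock granularity.</p>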
<p dir="auto">We provide the full benchmark code and results for different 64- and 32-bit architectures in a
<a href="https://github.com/boostorg/boost_unordered_benchmarks/tree/boost_concurrent_flat_map">dedicated repository</a>;
here, we just show as an example the plots for Visual Studio 2022 in x64 mode on an
AMD Ryzen 5 3600 6-Core @ 3.60 GHz without hyperthreading and 64 GB of RAM.</p>
<table>
<thead>
<tr>
<th align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWafV9y6Ty-jtfoVHdezLwz6GVeyRq6xS5MlVJwdtnhOSgHUOsfYrPL-Rm8PRWD1glgeKX3CD0zTIlIiDScwkYxrm1R5EiqlepLdCBacSzQDRcqf1D4ZgQj9x8AL794xGCC0Wf0XVHHq0moqtSt89ap6eqOAqTgw1ELz2Lgsu1V9cJ0vKYJzbgvj8yCHQ/s698/Parallel%20workload.xlsx.500k,%200.01.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWafV9y6Ty-jtfoVHdezLwz6GVeyRq6xS5MlVJwdtnhOSgHUOsfYrPL-Rm8PRWD1glgeKX3CD0zTIlIiDScwkYxrm1R5EiqlepLdCBacSzQDRcqf1D4ZgQj9x8AL794xGCC0Wf0XVHHq0moqtSt89ap6eqOAqTgw1ELz2Lgsu1V9cJ0vKYJzbgvj8yCHQ/w200-h129/Parallel%20workload.xlsx.500k,%200.01.png" style="max-width: 100%;" width="180" /></a></th>
<th align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRgzObgcq-_Ji3rF2TKCd2kKqghItn9JDq011zt6TaGVkRnrA1b7n8tZENJ9aEMmdFJuQ7L3_il71JkcCSQr09l4T0aDsr4K_LsNTiWOVUB-nR5WYuq2HYzcOBdJzi0p7gsFRW1q-Xs8zOIlDt_mIs7W5FlyMxNBK8ywci0yXDOhlaXf3ItHbYyp6y4as/s698/Parallel%20workload.xlsx.500k,%200.5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRgzObgcq-_Ji3rF2TKCd2kKqghItn9JDq011zt6TaGVkRnrA1b7n8tZENJ9aEMmdFJuQ7L3_il71JkcCSQr09l4T0aDsr4K_LsNTiWOVUB-nR5WYuq2HYzcOBdJzi0p7gsFRW1q-Xs8zOIlDt_mIs7W5FlyMxNBK8ywci0yXDOhlaXf3ItHbYyp6y4as/w200-h129/Parallel%20workload.xlsx.500k,%200.5.png" style="max-width: 100%;" width="180" /></a></th>
<th align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZiMciXCLSSNoNiNMDwxnGETmO9E7nYYqqzy9i0tY5yAUSzaqwh_OaBWpTLqK6yzQ8SaQiU5ZbdskiKRLFDqgQuTG2KiOVJPIshDf0-ptvabsCQ6KT1i7ahHkZZWOrWSSeAoGSHvaFX-AaIr-N9QcsK0MnQeRqzu05wAn7uHv_hA9AcHArvHUbdQ9d3_0/s698/Parallel%20workload.xlsx.500k,%200.99.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZiMciXCLSSNoNiNMDwxnGETmO9E7nYYqqzy9i0tY5yAUSzaqwh_OaBWpTLqK6yzQ8SaQiU5ZbdskiKRLFDqgQuTG2KiOVJPIshDf0-ptvabsCQ6KT1i7ahHkZZWOrWSSeAoGSHvaFX-AaIr-N9QcsK0MnQeRqzu05wAn7uHv_hA9AcHArvHUbdQ9d3_0/w200-h129/Parallel%20workload.xlsx.500k,%200.99.png" style="max-width: 100%;" width="180" /></a></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">500k updates, 4.5M lookups<br />skew=0.01</td>
<td align="center">500k updates, 4.5M lookups<br />skew=0.5</td>
<td align="center">500k updates, 4.5M lookups<br />skew=0.99</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWtuJ4mMtBuI8fl0MfuQFFGRsMsuobAvxilnZ_TiUYWMtL9WnMgVbpFyNm7Mp24AjjXbK0yhwJ4FoAq4WW7amKiEMW-ftzJPBsQ6v2GSsiMhbC1mHuwEUEDnBHJjZQXuC8aKCRIve6IpZWvp0UH9AfHfTb9ZrhKY_6Sj-3OnzWc-WpsGUFrYsuLQjfzuw/s698/Parallel%20workload.xlsx.5M,%200.01.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWtuJ4mMtBuI8fl0MfuQFFGRsMsuobAvxilnZ_TiUYWMtL9WnMgVbpFyNm7Mp24AjjXbK0yhwJ4FoAq4WW7amKiEMW-ftzJPBsQ6v2GSsiMhbC1mHuwEUEDnBHJjZQXuC8aKCRIve6IpZWvp0UH9AfHfTb9ZrhKY_6Sj-3OnzWc-WpsGUFrYsuLQjfzuw/s320/Parallel%20workload.xlsx.5M,%200.01.png" style="max-width: 100%;" width="180" /></a></th>
<th align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9hNnnSQsMyXv0VtgW9BFmmqQavv8m8ooiSmrUhhAwSDwvoWxb-dap4rR1Oj8ZDM7ODFBFB5ITHN84u9kD1929XTleQRY55LtdkhCziEA_p0p0i6VI0IFYlWnAZmzNHD8c_4nGPfYkPbnYauye-zc8fMT9HSqMKvaa28ezWiQ6mmQPUfaF7ttvj6rYYgQ/s698/Parallel%20workload.xlsx.5M,%200.5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9hNnnSQsMyXv0VtgW9BFmmqQavv8m8ooiSmrUhhAwSDwvoWxb-dap4rR1Oj8ZDM7ODFBFB5ITHN84u9kD1929XTleQRY55LtdkhCziEA_p0p0i6VI0IFYlWnAZmzNHD8c_4nGPfYkPbnYauye-zc8fMT9HSqMKvaa28ezWiQ6mmQPUfaF7ttvj6rYYgQ/s320/Parallel%20workload.xlsx.5M,%200.5.png" style="max-width: 100%;" width="180" /></a></th>
<th align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYsHan-Z-IYPffC6TEPPYFG-6C-taBuBo-UAEFOKFNrDoKj5L--STowUp9c7QeokmpeEj8eiQSTg_drKj-RckuVeuBhQM5B9E6Yc4pO5NlgCZAaNItENbi2p7iRkbEY-td8bzIwFPxSLZEnzJiqWMrFRyBgk1sVBPXPSI-gHbjTM3KDPt5IyjfDKih9a0/s698/Parallel%20workload.xlsx.5M,%200.99.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYsHan-Z-IYPffC6TEPPYFG-6C-taBuBo-UAEFOKFNrDoKj5L--STowUp9c7QeokmpeEj8eiQSTg_drKj-RckuVeuBhQM5B9E6Yc4pO5NlgCZAaNItENbi2p7iRkbEY-td8bzIwFPxSLZEnzJiqWMrFRyBgk1sVBPXPSI-gHbjTM3KDPt5IyjfDKih9a0/s320/Parallel%20workload.xlsx.5M,%200.99.png" style="max-width: 100%;" width="180" /></a></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">5M updates, 45M lookups<br />skew=0.01</td>
<td align="center">5M updates, 45M lookups<br />skew=0.5</td>
<td align="center">5M updates, 45M lookups<br />skew=0.99</td>
</tr>
</tbody>
</table>
<p></p><p dir="auto">Note that, for the scenario with 500k updates, <code>boost::concurrent_flat_map</code>
continues to improve after the number of threads exceeds the number of cores (6),
a phenomenon for which we don't have a ready explanation: we could hypothesize
that execution is limited by memory latency, but the behavior does
not reproduce in the scenario with 5M updates, where the cache miss ratio is
necessarily higher. Note also that <code>gtl::parallel_flat_hash_map</code> performs
comparatively worse for high-skew scenarios where the load is concentrated on
a very small number of keys: this may be due to <code>gtl::parallel_flat_hash_map</code>
having a much coarser lock granularity (256 shards in the configuration used) than
the other two containers.</p>
<p dir="auto">In general, results are very dependent on the particular CPU and memory system used;
you are welcome to try out the benchmark in your architecture of interest and
report back.</p>
<a name="conclusions-and-next-steps"><div style="text-align: left;"><b><span style="font-size: large;">Conclusions and next steps</span></b></div></a>
<p dir="auto"><code>boost::concurrent_flat_map</code> is a new, general-purpose concurrent hashmap that
leverages the very performant open-addressing techniques of <code>boost::unordered_flat_map</code>
and provides a fully thread-safe, iterator-free API we hope future users will
find flexible and convenient.</p>
<p dir="auto">We are considering a number of new functionalities for upcoming releases:</p>
<ul dir="auto"><li>As <code>boost::concurrent_flat_map</code> and <code>boost::unordered_flat_map</code> basically
share the same data layout, it's possible to efficiently implement move
construction from one to another by simply transferring the internal structure.
There are scenarios where this feature can lead to more performant
execution, like, for instance, multithreaded population of a
<code>boost::concurrent_flat_map</code> followed by single- or multithreaded
read-only lookup on a <code>boost::unordered_flat_map</code> move-constructed from
the former.</li><li><a href="https://dl.acm.org/doi/pdf/10.1145/3552326.3587457" rel="nofollow">DRAMHiT</a> shows
that pipelining/batching several map operations on the same thread
in combination with heavy memory prefetching can reduce or eliminate
waiting CPU cycles. We have conducted some preliminary experiments
using this idea for a feature we dubbed <i>bulk lookup</i>
(providing an array of keys to look for at once), with promising results.</li></ul>
<p dir="auto">We're launching this new container with trepidation: we cannot possibly
try the vast array of different CPU architectures and scenarios
where concurrent hashmaps are used, and we don't yet have field data on
the suitability of the novel API we're proposing for
<code>boost::concurrent_flat_map</code>. For these reasons, your feedback
and proposals for improvement are most welcome.</p>Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-62769045014226530742022-11-18T13:23:00.002+01:002022-12-19T09:50:41.851+01:00Inside boost::unordered_flat_map<ul dir="auto"><li><a href="#introduction">Introduction</a></li><li><a href="#the-case-for-open-addressing">The case for open addressing</a>
<ul dir="auto"><li><a href="#simd-accelerated-lookup">SIMD-accelerated lookup</a></li></ul>
</li><li><a href="#boostunordered_flat_map-data-structure"><code>boost::unordered_flat_map</code> data structure</a></li><li><a href="#rehashing">Rehashing</a></li><li><a href="#hash-post-mixing">Hash post-mixing</a></li><li><a href="#statistical-properties-of-boostunordered_flat_map">Statistical properties of <code>boost::unordered_flat_map</code></a></li><li><a href="#benchmarks">Benchmarks</a>
<ul dir="auto"><li><a href="#running-n-plots">Running-<i>n</i> plots</a></li><li><a href="#aggregate-performance">Aggregate performance</a></li></ul>
</li><li><a href="#deviations-from-the-standard">Deviations from the standard</a></li><li><a href="#conclusions-and-next-steps">Conclusions and next steps</a></li></ul>
<a name="introduction"><div style="text-align: left;"><b><span style="font-size: large;">Introduction</span></b></div></a>
<p dir="auto">Starting in Boost 1.81 (December 2022), <a href="https://www.boost.org/doc/libs/release/libs/unordered/" rel="nofollow">Boost.Unordered</a>
provides, in addition to its previous implementations of C++ unordered associative containers,
the new containers <code>boost::unordered_flat_map</code> and <code>boost::unordered_flat_set</code> (for the sake
of brevity, we will only refer to the former in the remainder of this article).
Whereas <code>boost::unordered_map</code> strictly adheres to the C++ specification for <code>std::unordered_map</code>,
<code>boost::unordered_flat_map</code> deviates in a number of ways from the standard
to offer dramatic performance improvements in exchange; in fact, <code>boost::unordered_flat_map</code>
ranks amongst the fastest hash containers currently available to C++ users.</p>
<p dir="auto">We describe the internal structure of <code>boost::unordered_flat_map</code> and provide
theoretical analyses and benchmarking data to help readers gain insights into
the key design elements behind this container's excellent performance.
Interface and behavioral differences with the standard are also discussed.</p>
<a name="the-case-for-open-addressing"><div style="text-align: left;"><b><span style="font-size: large;">The case for open addressing</span></b></div></a>
<p dir="auto">We have <a href="https://bannalia.blogspot.com/2022/06/advancing-state-of-art-for.html" rel="nofollow">previously discussed</a> why
<i>closed addressing</i> was chosen back in 2003 as the implicit layout for <code>std::unordered_map</code>.
Twenty years later, <a href="https://en.wikipedia.org/wiki/Hash_table#Open_addressing" rel="nofollow"><i>open addressing</i></a>
techniques have taken the lead in terms of performance, and the fastest hash containers on the market
all rely on some variation of open addressing, even if that means that some deviations have to be introduced
from the baseline interface of <code>std::unordered_map</code>.</p>
<p dir="auto">The defining aspect of open addressing is that elements are stored directly within the
bucket array (as opposed to closed addressing, where multiple elements can be held in the same
bucket, usually by means of a linked list of nodes). In modern CPU architectures, this layout
is extremely cache friendly:</p>
<ul dir="auto"><li>There's no indirection needed to go from the bucket position to the element contained.</li><li>Buckets are stored contiguously in memory, which improves cache locality.</li></ul>
<p dir="auto">The main technical challenge introduced by open addressing is what to do when elements
are mapped into the same bucket, i.e. when a <i>collision</i> happens: in fact, all open-addressing
variations are basically characterized by their collision management techniques.
We can divide these techniques into two broad classes:</p>
<ul dir="auto"><li><b>Non-relocating:</b> if an element is mapped to an occupied bucket, a <i>probing sequence</i> is
started from that position until a vacant bucket is located, and the element is inserted
there <i>permanently</i> (except, of course, if the element is deleted or if the bucket array is grown and elements <i>rehashed</i>).
Popular probing mechanisms are <i>linear probing</i> (buckets inspected at regular intervals),
<i>quadratic probing</i> and <a href="https://en.wikipedia.org/wiki/Double_hashing" rel="nofollow"><i>double hashing</i></a>.
There is a tradeoff between cache locality, which is better when the buckets probed are close
to each other, and <i>average probe length</i> (the expected number of buckets probed until a
vacant one is located), which grows larger (worse) precisely when probed buckets
are close, since elements then tend to form clusters instead of spreading uniformly throughout the bucket
array.</li><li><b>Relocating:</b> as part of the search process for a vacant bucket, elements can be
moved from their position to make room for the new element. This is done in order
to improve cache locality by keeping elements close to their "natural" location
(that indicated by the hash → bucket mapping). Well known relocating algorithms are
<a href="https://en.wikipedia.org/wiki/Cuckoo_hashing" rel="nofollow"><i>cuckoo hashing</i></a>,
<a href="https://en.wikipedia.org/wiki/Hopscotch_hashing" rel="nofollow"><i>hopscotch hashing</i></a> and
<a href="https://en.wikipedia.org/wiki/Hash_table#Robin_Hood_hashing" rel="nofollow"><i>Robin Hood hashing</i></a>.</li></ul>
<p dir="auto">If we take it as an important consideration to stay reasonably close to the original behavior
of <code>std::unordered_map</code>, relocating techniques pose the problem that <code>insert</code> may invalidate
iterators to other elements (so, they work more like <code>std::vector::insert</code>).</p>
<p dir="auto">On the other hand, non-relocating open addressing faces issues on deletion: lookup
starts at the original hash → bucket position and then keeps probing till the element is found
<i>or probing terminates</i>, which is signalled by the presence of a vacant bucket:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCm7Mjoizsg_qsNOTa1nU3YIjPDdarUGBnuxC9eiMMXtR4zWnnWXLQLE3RGgXN203SLJcIJfGM7a25uOLapGYJtmcOIeU8yXkrhDM1bK1LmxYESoY_fPohrGtPQuggjHgivzVWDVfqND1ZvYv1MSBexyBMyZFFyN7VPsRt5opZfLlo1m2rZyLEpQ0_/s484/probe.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="62" data-original-width="484" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCm7Mjoizsg_qsNOTa1nU3YIjPDdarUGBnuxC9eiMMXtR4zWnnWXLQLE3RGgXN203SLJcIJfGM7a25uOLapGYJtmcOIeU8yXkrhDM1bK1LmxYESoY_fPohrGtPQuggjHgivzVWDVfqND1ZvYv1MSBexyBMyZFFyN7VPsRt5opZfLlo1m2rZyLEpQ0_/s16000/probe.png" /></a></div><p dir="auto"></p>
<p dir="auto">So, erasing an element can't just restore its holding bucket as vacant, since that would preclude
lookup from reaching elements further down the probe sequence:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_UsPi-cCOsXgja14-LV4lFFY6tKVIq2XB_gB94qhDpZleeQh4fV7M02HgW72oGTFdjWxDbPSqgq65TkP9mY9KAz1OhQTcnDOE6k89eoHQucMK1vVmBoTi71pxyXm1GJlqGOzeTCUAcPllFUWdMvbXUkiT--s2drQO9NSJQITkhaI6aBkB5vhTvC89/s484/probe_interrupted.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="62" data-original-width="484" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_UsPi-cCOsXgja14-LV4lFFY6tKVIq2XB_gB94qhDpZleeQh4fV7M02HgW72oGTFdjWxDbPSqgq65TkP9mY9KAz1OhQTcnDOE6k89eoHQucMK1vVmBoTi71pxyXm1GJlqGOzeTCUAcPllFUWdMvbXUkiT--s2drQO9NSJQITkhaI6aBkB5vhTvC89/s16000/probe_interrupted.png" /></a></div><p dir="auto"></p>
<p dir="auto">A common technique to deal with this problem is to label buckets previously containing an
element with a <i>tombstone</i> marker: tombstones are good for inserting new elements but do not
stop probing on lookup:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwq8kJr-DhGz29OqD_6PBYRQ5AIHa3x3_B3ZqWA09HRs6hKfp8pC46WjUAIbrMcMB5SU9hvfXW2G2h8ui8IcEY72fhNljgDgJx_VPKi3ZmWKQ7vDAzfD07sMrqF78XiV7MDH_LbAJDyVxq655gqA5Td1RJVQV1LWoUVbFkhIHB6GEwDCo0Cc5E3SMs/s484/probe_tombstone.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="62" data-original-width="484" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwq8kJr-DhGz29OqD_6PBYRQ5AIHa3x3_B3ZqWA09HRs6hKfp8pC46WjUAIbrMcMB5SU9hvfXW2G2h8ui8IcEY72fhNljgDgJx_VPKi3ZmWKQ7vDAzfD07sMrqF78XiV7MDH_LbAJDyVxq655gqA5Td1RJVQV1LWoUVbFkhIHB6GEwDCo0Cc5E3SMs/s16000/probe_tombstone.png" /></a></div><p dir="auto"></p>
<p dir="auto">Note that the introduction of tombstones implies that the average lookup probe length of the
container won't decrease on deletion —again, special measures can be taken to counter this.</p>
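<p dir="auto">The tombstone technique can be illustrated with a minimal sketch. The class
<code>tombstone_set</code> below is hypothetical (it is not part of any library): a fixed-capacity
set of integers using linear probing and the identity as hash function, written only to show how
lookup skips over tombstones while insertion reuses them.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity set of ints: linear probing with tombstones.
class tombstone_set
{
  enum class state { vacant, occupied, tombstone };
  struct bucket { state st = state::vacant; int value = 0; };
  std::vector<bucket> buckets_;

  // Identity hash reduced modulo the bucket count (illustration only).
  std::size_t home(int x) const
  {
    return static_cast<std::size_t>(static_cast<unsigned>(x)) % buckets_.size();
  }

public:
  explicit tombstone_set(std::size_t capacity) : buckets_(capacity)
  {
    assert(capacity > 0);
  }

  bool contains(int x) const
  {
    std::size_t pos = home(x);
    for (std::size_t i = 0; i < buckets_.size(); ++i) {
      const bucket& b = buckets_[(pos + i) % buckets_.size()];
      if (b.st == state::vacant) return false;  // probe sequence ends here
      if (b.st == state::occupied && b.value == x) return true;
      // tombstone or non-matching element: keep probing
    }
    return false;
  }

  bool insert(int x)
  {
    if (contains(x)) return false;
    std::size_t pos = home(x);
    for (std::size_t i = 0; i < buckets_.size(); ++i) {
      bucket& b = buckets_[(pos + i) % buckets_.size()];
      if (b.st != state::occupied) {            // vacant or tombstone: reuse it
        b.st = state::occupied;
        b.value = x;
        return true;
      }
    }
    return false;                               // table is full
  }

  bool erase(int x)
  {
    std::size_t pos = home(x);
    for (std::size_t i = 0; i < buckets_.size(); ++i) {
      bucket& b = buckets_[(pos + i) % buckets_.size()];
      if (b.st == state::vacant) return false;
      if (b.st == state::occupied && b.value == x) {
        b.st = state::tombstone;                // do not reset to vacant
        return true;
      }
    }
    return false;
  }
};
```

<p dir="auto">With capacity 8, the keys 1, 9 and 17 all map to the same home bucket; after erasing 9,
a lookup for 17 still succeeds because the tombstone left behind does not stop probing.</p>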
<a name="simd-accelerated-lookup"><div style="text-align: left;"><span style="font-size: medium;"><b>SIMD-accelerated lookup</b></span></div></a>
<p dir="auto">SIMD technologies, such as <a href="https://en.wikipedia.org/wiki/SSE2" rel="nofollow">SSE2</a> and
<a href="https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(Neon)" rel="nofollow">Neon</a>,
provide advanced CPU instructions for parallel arithmetic and logical operations
on groups of contiguous data values: for instance, SSE2 <a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_cmpeq_epi8" rel="nofollow"><code>_mm_cmpeq_epi8</code></a> takes two packs of 16 bytes and compares
them for equality <i>pointwise</i>, returning the result as another pack of bytes. Although
SIMD was originally meant for acceleration of multimedia processing applications,
the implementors of some unordered containers, notably Google's
<a href="https://abseil.io/about/design/swisstables" rel="nofollow">Abseil's Swiss tables</a> and
Meta's <a href="https://engineering.fb.com/2019/04/25/developer-tools/f14/" rel="nofollow">F14</a>, realized
they could leverage this technology to improve lookup times in hash tables.</p>
<p dir="auto">The key idea is to maintain, in addition to the bucket array itself, a separate <i>metadata
array</i> holding <i>reduced hash values</i> (usually one byte in size)
obtained from the hash values of the elements stored in the corresponding buckets.
When looking up an element, SIMD can be used on a pack of contiguous reduced
hash values to quickly discard non-matching buckets and move on to full comparison
for matching positions. This technique effectively checks a moderate number
of buckets (16 for Abseil, 14 for F14) in constant time. Another beneficial effect
of this approach is that special bucket markers (vacant, tombstone, etc.) can be
moved to the metadata array —otherwise, these markers would take up extra space in
the bucket itself, or else some representation values of the elements would have
to be restricted from user code and reserved for marking purposes.</p>
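<p dir="auto">As a rough illustration of this idea, the following sketch matches a reduced hash value
against 16 contiguous metadata bytes and returns a bitmask of candidate positions. The function name
<code>match_mask</code> and the macro <code>SKETCH_HAS_SSE2</code> are hypothetical; the SSE2 path
uses standard intrinsics from <code>&lt;emmintrin.h&gt;</code>, and a scalar loop is provided as a
fallback so that the code also builds on targets without SSE2.</p>

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Returns a 16-bit mask whose i-th bit is set when meta[i] == reduced_hash.
// meta must point to 16 readable bytes of metadata.
inline unsigned match_mask(const unsigned char* meta, unsigned char reduced_hash)
{
  assert(meta != nullptr);
#ifdef SKETCH_HAS_SSE2
  // Load 16 metadata bytes, broadcast the needle and compare pointwise.
  __m128i pack   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(meta));
  __m128i needle = _mm_set1_epi8(static_cast<char>(reduced_hash));
  // movemask gathers the top bit of each comparison result byte.
  return static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(pack, needle)));
#else
  unsigned mask = 0;
  for (int i = 0; i < 16; ++i)
    if (meta[i] == reduced_hash) mask |= 1u << i;
  return mask;
#endif
}
```

<p dir="auto">Only the positions whose bit is set in the returned mask need a full key comparison;
all other buckets in the pack are discarded at once.</p>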
<a name="boostunordered_flat_map-data-structure"><div style="text-align: left;"><b><span style="font-size: large;"><code>boost::unordered_flat_map</code> data structure</span></b></div></a>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2oLcYKndxyhp0OW5b3xdoptzjKHjyLp_udDkmFb94SZzgpWPJqEUrad-unp_PNsrfKEkRQGapNWd3qxxzF8_s1bAEr4Rx4vKC2o9e-RxyBHwCeM7YIUALAHxMuOqr72kXWs-79J2lzc27B2op7-hawdTOrTfOoOB_c-TjEidw2pUvs3Es7btmNS29/s935/data_structure.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="124" data-original-width="935" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2oLcYKndxyhp0OW5b3xdoptzjKHjyLp_udDkmFb94SZzgpWPJqEUrad-unp_PNsrfKEkRQGapNWd3qxxzF8_s1bAEr4Rx4vKC2o9e-RxyBHwCeM7YIUALAHxMuOqr72kXWs-79J2lzc27B2op7-hawdTOrTfOoOB_c-TjEidw2pUvs3Es7btmNS29/w600/data_structure.png" width="600" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><br /></td></tr></tbody></table><p dir="auto"></p>
<p dir="auto"><code>boost::unordered_flat_map</code>'s bucket array is logically split into 2<sup><i>n</i></sup>
groups of <i>N</i> = 15 buckets, and has a companion metadata array consisting of
2<sup><i>n</i></sup> 16-byte words. Hash mapping is done at the group level rather than
on individual buckets: so, to insert an element with hash value <i>h</i>, the group
at position <i>h</i> / 2<sup><i>W</i> − <i>n</i></sup> is selected and its first available bucket
used (<i>W</i> is 64 or 32 depending on whether the CPU architecture is 64- or 32-bit,
respectively); if the group is full, further groups are checked using a quadratic
probing sequence.</p>
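<p dir="auto">The group selection formula amounts to taking the <i>n</i> most significant bits of the
hash value. A minimal sketch, assuming a 64-bit architecture (<i>W</i> = 64); the function name
<code>group_position</code> is hypothetical, and the special case for <i>n</i> = 0 is our own
addition to avoid an undefined shift:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Position of the initial group for hash value h in a table of 2^n groups,
// computed as h / 2^(W - n), that is, the n most significant bits of h.
inline std::size_t group_position(std::uint64_t h, unsigned n)
{
  constexpr unsigned W = 64;  // assumes a 64-bit architecture
  assert(n <= W);
  if (n == 0) return 0;       // single group; shifting by W bits is undefined
  return static_cast<std::size_t>(h >> (W - n));
}
```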
<p dir="auto">The associated metadata is organized as follows (least significant byte depicted
rightmost):</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRBJiNHDjc3rW42atOBgkc2HL8zJ7gif-HzCENLnxIdvU8Fi6-RuA-8dcj-ZtCM8H13CiKZH6e2xkq7EoEqNGiDTFGyXNldDOGJKxBQOk9X6G40DxKwQpzWypY0BCFBRPSDI3O0HZ1Gfya33hegTrJuHOyopSYlrWM3aAcNU8-JsJCyyRrCoDtGBm9/s550/metadata.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="188" data-original-width="550" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRBJiNHDjc3rW42atOBgkc2HL8zJ7gif-HzCENLnxIdvU8Fi6-RuA-8dcj-ZtCM8H13CiKZH6e2xkq7EoEqNGiDTFGyXNldDOGJKxBQOk9X6G40DxKwQpzWypY0BCFBRPSDI3O0HZ1Gfya33hegTrJuHOyopSYlrWM3aAcNU8-JsJCyyRrCoDtGBm9/s16000/metadata.png" /></a></div><p dir="auto"></p>
<p dir="auto"><i>h</i><sub><i>i</i></sub> holds information about the <i>i</i>-th bucket of the group:</p>
<ul dir="auto"><li>0 if the bucket is empty,</li><li>1 to signal a <i>sentinel</i> (a special value at the end of the bucket array used to
finish container iteration),</li><li>otherwise, a reduced hash value in the range [2, 255] obtained from the least
significant byte of the element's hash value.</li></ul>
<p dir="auto">When looking up within a group for an element with hash value <i>h</i>, SIMD operations,
if available, are used to match the reduced value of <i>h</i> against the pack of
values {<i>h</i><sub>0</sub>, <i>h</i><sub>1</sub>, ... , <i>h</i><sub>14</sub>}. Locating
an empty bucket for insertion is equivalent to matching for 0.</p>
<p dir="auto"><i>ofw</i> is the so-called <i>overflow byte</i>: when inserting an element with hash value <i>h</i>,
if the group is full then the (<i>h</i> mod 8)-th bit of <i>ofw</i> is set to 1 before
moving to the next group in the probing sequence. Lookup probing can then terminate
when the corresponding overflow bit is 0. Note that this procedure removes the need
to use tombstones.</p>
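<p dir="auto">The overflow byte logic can be sketched as follows. The type <code>overflow_byte</code>
and its member names are hypothetical and only model the behavior described above; they do not
reproduce the library's internal code.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a group's overflow byte.
struct overflow_byte
{
  std::uint8_t bits = 0;

  // Called on insertion when the group is full and probing moves to the next group.
  void mark_overflow(std::uint64_t h)
  {
    bits |= static_cast<std::uint8_t>(1u << (h % 8));
  }

  // If this returns false, lookup for hash value h can terminate at this group.
  bool may_have_overflowed(std::uint64_t h) const
  {
    return ((bits >> (h % 8)) & 1u) != 0;
  }
};
```

<p dir="auto">Note that two hash values with the same residue modulo 8 share a bit, so a set bit only
means that probing <i>may</i> need to continue, never that it must stop early.</p>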
<p dir="auto">If neither SSE2 nor Neon is available on the target architecture, the logical
organization of metadata stays the same, but information is mapped to two physical
64-bit words using <i>bit interleaving</i> as shown in the figure:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJISrbaZdbPuhdt62zJsqEtrcpkKg_QvBgLgziJu9GWy7lm2G3P_RSMV2oExl90Fk_d9w-Lt_00Rce_0OL_3QWd1SQXqOdl4Flc8ODvGSB4WTL6GXAuXJcplFittqFRKhgi1NBhz5mO6o6VUgLSmKyMQRNb65QplSg6tvzdiTYdQmsMKrTfnNOS-rR/s560/metadata_interleaving.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="140" data-original-width="560" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJISrbaZdbPuhdt62zJsqEtrcpkKg_QvBgLgziJu9GWy7lm2G3P_RSMV2oExl90Fk_d9w-Lt_00Rce_0OL_3QWd1SQXqOdl4Flc8ODvGSB4WTL6GXAuXJcplFittqFRKhgi1NBhz5mO6o6VUgLSmKyMQRNb65QplSg6tvzdiTYdQmsMKrTfnNOS-rR/s16000/metadata_interleaving.png" /></a></div><p dir="auto"></p>
<p dir="auto">Bit interleaving allows for a reasonably fast implementation of matching operations
in the absence of SIMD.</p>
<a name="rehashing"><div style="text-align: left;"><b><span style="font-size: large;">Rehashing</span></b></div></a>
<p dir="auto">The maximum load factor of <code>boost::unordered_flat_map</code> is 0.875 and can't be changed
by the user. As discussed previously, non-relocating open addressing has the problem
that average probe length doesn't decrease on deletion when the erased elements
are in mid-sequence: so, continuously inserting and erasing elements without triggering
a rehash will slowly degrade the container's performance; we call this phenomenon
<i>drifting</i>. <code>boost::unordered_flat_map</code> introduces the following anti-drift mechanism:
rehashing is controlled by the container's <i>maximum load</i>, initially 0.875 times the
size of the bucket array; when erasing an element whose associated overflow bit is not
zero, the maximum load is decreased by one. Anti-drift guarantees that rehashing
will be eventually triggered in a scenario of repeated insertions and deletions.</p>
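<p dir="auto">A minimal sketch of the anti-drift bookkeeping, under the assumption that a rehash is
requested whenever the element count reaches the maximum load. The type <code>load_control</code>
and its members are hypothetical names used for illustration only.</p>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping for the anti-drift mechanism.
struct load_control
{
  std::size_t size = 0;   // number of elements currently stored
  std::size_t max_load;   // rehash threshold

  explicit load_control(std::size_t bucket_count)
    : max_load(bucket_count * 7 / 8) {}  // 0.875 times the bucket array size

  // True if the next insertion has to rehash first.
  bool insertion_needs_rehash() const { return size >= max_load; }

  void on_insert()
  {
    assert(size < max_load);
    ++size;
  }

  // overflow_bit_set: whether the erased element's associated overflow bit is 1.
  void on_erase(bool overflow_bit_set)
  {
    assert(size > 0);
    --size;
    if (overflow_bit_set && max_load > 0) --max_load;  // anti-drift step
  }
};
```

<p dir="auto">Each erasure in mid-sequence lowers the threshold by one, so a long run of alternating
insertions and deletions cannot postpone rehashing indefinitely.</p>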
<a name="hash-post-mixing"><div style="text-align: left;"><b><span style="font-size: large;">Hash post-mixing</span></b></div></a>
<p dir="auto">It is well known that open-addressing containers require that the hash function be
of good quality, in the sense that close input values (for some natural notion of
closeness) are mapped to distant hash values. In particular, a hash function is
said to have the <a href="https://en.wikipedia.org/wiki/Avalanche_effect" rel="nofollow"><i>avalanching property</i></a>
if flipping a bit in the physical representation of the input flips each bit of the
output value with probability 50%. Note that avalanching hash functions are extremely well
behaved, and less stringent behaviors are generally good enough in most open-addressing
scenarios.</p>
<p dir="auto">Being a general-purpose container, <code>boost::unordered_flat_map</code> does not impose any
condition on the user-provided hash function beyond what is required by the C++
standard for unordered associative containers. In order to cope with poor-quality
hash functions (such as the identity for integral types), an automatic bit-mixing stage
is added to hash values:</p>
<ul dir="auto"><li>64-bit architectures: we use the <code>xmx</code> function defined in Jon Maiga's
<a href="http://jonkagstrom.com/bit-mixer-construction/index.html" rel="nofollow">"The construct of a bit mixer"</a>.</li><li>32-bit architectures: the chosen mixer has been automatically generated by
<a href="https://github.com/skeeto/hash-prospector">Hash Function Prospector</a> and selected as the
best overall performer in internal benchmarks. Score assigned by Hash Prospector: 333.7934929677524.</li></ul>
<p dir="auto">There's an opt-out mechanism available to end users so that avalanching hash functions
can be marked as such and thus be used without post-mixing. In particular,
the specializations of <a href="https://www.boost.org/doc/libs/release/libs/container_hash/" rel="nofollow"><code>boost::hash</code></a>
for string types are marked as avalanching.</p>
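<p dir="auto">To give an idea of what a post-mixing stage looks like, here is a sketch of a
xorshift-multiply-xorshift (xmx-style) mixer for 64-bit values applied on top of a user-provided hash
function. The shift amounts and the multiplier below are illustrative assumptions and should be
checked against the library source before being relied upon; the wrapper <code>mixed_hash</code> is a
hypothetical name.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// xorshift-multiply-xorshift mixer for 64-bit hash values.
// Shift amounts and multiplier are illustrative assumptions.
inline std::uint64_t xmx(std::uint64_t x) noexcept
{
  x ^= x >> 23;
  x *= 0xff51afd7ed558ccdull;
  x ^= x >> 23;
  return x;
}

// Hypothetical wrapper that post-mixes the result of a user-provided hash function.
template<typename Hash>
struct mixed_hash
{
  Hash h;

  template<typename T>
  std::uint64_t operator()(const T& x) const
  {
    return xmx(static_cast<std::uint64_t>(h(x)));
  }
};
```

<p dir="auto">Each of the three steps is invertible, so the mixer maps distinct hash values to
distinct results: it spreads input bits across the word without introducing new collisions.</p>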
<a name="statistical-properties-of-boostunordered_flat_map"><div style="text-align: left;"><b><span style="font-size: large;">Statistical properties of <code>boost::unordered_flat_map</code></span></b></div></a>
<p dir="auto">We have written a <a href="https://github.com/joaquintides/boost_unordered_flat_map_stats">simulation program</a>
to calculate some statistical properties
of <code>boost::unordered_flat_map</code> as compared with Abseil's <code>absl::flat_hash_map</code>,
which is generally regarded as one of the fastest hash containers available.
For the purposes of this analysis, the main design characteristics of
<code>absl::flat_hash_map</code> are:</p>
<ul dir="auto"><li>Bucket array sizes are of the form 2<i><sup>n</sup></i>, <i>n</i> ≥ 4.</li><li>Hash mapping is done at the bucket level (rather than at the group
level as in <code>boost::unordered_flat_map</code>).</li><li>Metadata consists of one byte per bucket, where the most significant bit is set to 1
if the bucket is empty, deleted (tombstone) or a sentinel. The remaining 7 bits
hold the reduced hash value for occupied buckets.</li><li>Lookup/insertion uses SIMD to inspect the 16 contiguous buckets beginning at the
hash-mapped position, and then continues with further 16-bucket groups using
quadratic probing. Probing ends when a non-full group is found.
Note that the start positions of these groups are not aligned modulo 16.</li></ul>
<p dir="auto">The figure shows:</p>
<ul dir="auto"><li>the probability that a randomly selected group is full,</li><li>the average number of hops (i.e. the average probe length minus one) for
successful and unsuccessful lookup</li></ul>
<p dir="auto">as functions of the load factor, with perfectly random input and without intervening
deletions. Solid line is <code>boost::unordered_flat_map</code>, dashed line is
<code>absl::flat_hash_map</code>.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiclPJEg-PekZgcegdboCcZmMwYx4C6PrYqxwdqJVhfPcRj1Hf6ozHivEcKiVt3T6-jpfJu7ugSJQMrQfb6jVwElWdQueN9GJuwRnQq2i_TibmIL5mAD8x6UhA0Flru5rT7wvhRvWaOYO3VCoF5fFYoEr7f4KE1ZsHDjuGpnn1bq_DqOB9gv3dVLYQB/s804/stats1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="536" data-original-width="804" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiclPJEg-PekZgcegdboCcZmMwYx4C6PrYqxwdqJVhfPcRj1Hf6ozHivEcKiVt3T6-jpfJu7ugSJQMrQfb6jVwElWdQueN9GJuwRnQq2i_TibmIL5mAD8x6UhA0Flru5rT7wvhRvWaOYO3VCoF5fFYoEr7f4KE1ZsHDjuGpnn1bq_DqOB9gv3dVLYQB/w500/stats1.png" width="500" /></a></div><p dir="auto"></p>
<p dir="auto">Some observations:</p>
<ul dir="auto"><li><i>Pr</i>(group is full) is higher for <code>boost::unordered_flat_map</code>. This follows
from the fact that free buckets cluster at the end of 15-aligned groups,
whereas for <code>absl::flat_hash_map</code> free buckets are uniformly distributed across
the array, which increases the probability that a contiguous 16-bucket chunk
contains at least one free position. Consequently, <i>E</i>(num hops) for successful
lookup is also higher in <code>boost::unordered_flat_map</code>.</li><li>By contrast, <i>E</i>(num hops) for <i>unsuccessful</i> lookup is considerably lower
in <code>boost::unordered_flat_map</code>: <code>absl::flat_hash_map</code> uses an all-or-nothing
condition for probe termination (group is non-full/full), whereas
<code>boost::unordered_flat_map</code> uses the 8 bits of information in the overflow byte
to allow for more finely-grained termination —effectively, making probe termination
~1.75 times more likely. The overflow byte acts as a sort of
<a href="https://en.wikipedia.org/wiki/Bloom_filter" rel="nofollow">Bloom filter</a> to check for probe
termination based on reduced hash value.</li></ul>
<p dir="auto">The next figure shows the average number of actual comparisons (i.e. when
the reduced hash value matched) for successful and unsuccessful lookup.
Again, solid line is <code>boost::unordered_flat_map</code> and
dashed line is <code>absl::flat_hash_map</code>.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYNDl36h_80BxAs3kOussDg1fzg9iDmU7F5HH4JM0nqyY-lX29osMs2H-aqMOkOeirbLGiyjttIQVGeFRVQcJIQOXDInqHfo1zJlUBf5NI8sGrujOxdj3QIYZ60-5Pl9rnXiYjOpUXBvDtmnCvE_zzngPFW8Aq_gfDUfWzCvH7Qc_vzamP5F-ktpFD/s804/stats2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="536" data-original-width="804" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYNDl36h_80BxAs3kOussDg1fzg9iDmU7F5HH4JM0nqyY-lX29osMs2H-aqMOkOeirbLGiyjttIQVGeFRVQcJIQOXDInqHfo1zJlUBf5NI8sGrujOxdj3QIYZ60-5Pl9rnXiYjOpUXBvDtmnCvE_zzngPFW8Aq_gfDUfWzCvH7Qc_vzamP5F-ktpFD/w500/stats2.png" width="500" /></a></div><p dir="auto"></p>
<p dir="auto"><i>E</i>(num cmps) is a function of:</p>
<ul dir="auto"><li><i>E</i>(num hops) (lower better),</li><li>the size of the group (lower better),</li><li>the number of bits of the reduced hash value (higher better).</li></ul>
<p dir="auto">We see then that <code>boost::unordered_flat_map</code> approaches <code>absl::flat_hash_map</code> on
<i>E</i>(num cmps) for successful lookup (1% higher or less), despite its poorer
<i>E</i>(num hops) figures: this is so because <code>boost::unordered_flat_map</code>
uses smaller groups (15 vs. 16) and, most importantly, because its reduced
hash values contain log<sub>2</sub>(254) = 7.99 bits vs. 7 bits in <code>absl::flat_hash_map</code>,
and each additional bit in the hash reduced value decreases the number of negative
comparisons roughly by half. In the case of <i>E</i>(num cmps) for unsuccessful lookup,
<code>boost::unordered_flat_map</code> figures are up to 3.2 times lower under
high-load conditions.</p>
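<p dir="auto">The effect of group size and reduced hash width can be estimated with a
back-of-the-envelope calculation. The sketch below assumes that reduced hash values are uniformly
distributed (an idealization) and computes the expected number of spurious matches when scanning one
full group; the function and constant names are hypothetical.</p>

```cpp
#include <cassert>

// Expected number of spurious reduced-hash matches when scanning one full group,
// assuming uniformly distributed reduced hash values.
constexpr double expected_false_matches(int group_size, int distinct_reduced_values)
{
  return static_cast<double>(group_size) / distinct_reduced_values;
}

// 15 buckets per group, reduced hash values in [2, 255] (254 distinct values).
constexpr double boost_flat = expected_false_matches(15, 254);  // about 0.059
// 16 buckets per group, 7-bit reduced hash values (128 distinct values).
constexpr double absl_flat  = expected_false_matches(16, 128);  // 0.125
```

<p dir="auto">Under these assumptions, a scanned group yields roughly half as many negative
comparisons in <code>boost::unordered_flat_map</code>, which offsets part of its higher number of hops
for successful lookup.</p>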
<a name="benchmarks"><div style="text-align: left;"><b><span style="font-size: large;">Benchmarks</span></b></div></a>
<a name="running-n-plots"><div style="text-align: left;"><span style="font-size: medium;"><b>Running-<i>n</i> plots</b></span></div></a>
<p dir="auto">We have measured the execution times of <code>boost::unordered_flat_map</code> against
<code>absl::flat_hash_map</code> and <code>boost::unordered_map</code> for basic operations
(insertion, erasure during iteration, successful lookup, unsuccessful lookup) with
container size <i>n</i> ranging from 10,000 to 10M. We provide the full benchmark code and
results for different 64- and 32-bit architectures in a
<a href="https://github.com/boostorg/boost_unordered_benchmarks/tree/boost_unordered_flat_map">dedicated repository</a>;
here, we just show the plots for GCC 11 in x64 mode on an
AMD EPYC Rome 7302P @ 3.0GHz.
Please note that each container uses its own default hash function, so a direct
comparison of execution times may be slightly biased.</p>
<table>
<thead>
<tr>
<th align="center">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgba397_JysbdpRTRYxpr47OWLiQzZ6bYCprlnKynq0GDYBkZJfYnxuMeHnVgW8js_QbAhpll9NBUsxjiNji9oNLEhnYf3A33GpUGKaWeH_H-OR-vHfPM_6pheB_f2inwWY0vF5G8RhvhGIuErkUwkuVuXBXRkOLbP8Sift7bVXUXjVbEADAI3R9bi2/s698/Running%20insertion.xlsx.plot.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgba397_JysbdpRTRYxpr47OWLiQzZ6bYCprlnKynq0GDYBkZJfYnxuMeHnVgW8js_QbAhpll9NBUsxjiNji9oNLEhnYf3A33GpUGKaWeH_H-OR-vHfPM_6pheB_f2inwWY0vF5G8RhvhGIuErkUwkuVuXBXRkOLbP8Sift7bVXUXjVbEADAI3R9bi2/w275/Running%20insertion.xlsx.plot.png" width="275" /></a>
</th>
<th align="center">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif6tMoSlJUh0w5nP3YmBuQSIFmANcwqQfWihGKIea7Zu7F9YCIpGDWzuaqDvqR2ZYgOkRqy9a5H9qww3zCanRAR4D62voz7xHaLYsb6K6CYbsfUHZGfCHsdjXQ15HLo9Jox7QHHSYI621AT8ZyhBI4YtRVKSJ-bpStbCP37hnpypT0jbMogHV0Hqxf/s698/Running%20erasure.xlsx.plot.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif6tMoSlJUh0w5nP3YmBuQSIFmANcwqQfWihGKIea7Zu7F9YCIpGDWzuaqDvqR2ZYgOkRqy9a5H9qww3zCanRAR4D62voz7xHaLYsb6K6CYbsfUHZGfCHsdjXQ15HLo9Jox7QHHSYI621AT8ZyhBI4YtRVKSJ-bpStbCP37hnpypT0jbMogHV0Hqxf/w275/Running%20erasure.xlsx.plot.png" width="275" /></a>
</th>
</tr></thead>
<tbody>
<tr>
<td align="center"><b>Running insertion</b></td>
<td align="center"><b>Running erasure</b></td>
</tr>
</tbody>
</table>
<br /><table>
<thead>
<tr>
<th align="center">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtH4Vayx5IYtLfGfVAwmvb8P6RIeHkmA5aXSSfdqbUHqycThNY-BAZrWhEZDCr4fRjB0pJtcrkTF4soDBomMJvkTHokHOzVNRoPVtMBcadBJrXUSRdPacjdoUHKmKSFkTPNPHc-FHgMFBQZIz80EnQ0lCOhhbx1mz5BLtWtyCCbDES_AncbtJCN-Kh/s698/Scattered%20successful%20looukp.xlsx.plot.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtH4Vayx5IYtLfGfVAwmvb8P6RIeHkmA5aXSSfdqbUHqycThNY-BAZrWhEZDCr4fRjB0pJtcrkTF4soDBomMJvkTHokHOzVNRoPVtMBcadBJrXUSRdPacjdoUHKmKSFkTPNPHc-FHgMFBQZIz80EnQ0lCOhhbx1mz5BLtWtyCCbDES_AncbtJCN-Kh/w275/Scattered%20successful%20looukp.xlsx.plot.png" width="275" /></a>
</th>
<th align="center">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6WPYYIje2lL47xId8ysqYk2KBYPQYGbO0ro4VXFdh1zqFr3ZNJQEzHmW2gBbjKhTCUon8vAb1l92_906ZwqWgWNpNZhAWJrlz4HWIgqCZFkFNqx0qDtuwBjbcdBkEx0ZeNs6mhW4yJ_1NVQepkFEegmwd-6vTf8xON5iIGQr-tH5jCb1jh6iHIaAc/s698/Scattered%20unsuccessful%20looukp.xlsx.plot.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="698" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6WPYYIje2lL47xId8ysqYk2KBYPQYGbO0ro4VXFdh1zqFr3ZNJQEzHmW2gBbjKhTCUon8vAb1l92_906ZwqWgWNpNZhAWJrlz4HWIgqCZFkFNqx0qDtuwBjbcdBkEx0ZeNs6mhW4yJ_1NVQepkFEegmwd-6vTf8xON5iIGQr-tH5jCb1jh6iHIaAc/w275/Scattered%20unsuccessful%20looukp.xlsx.plot.png" width="275" /></a></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center"><b>Successful lookup</b></td>
<td align="center"><b>Unsuccessful lookup</b></td>
</tr>
</tbody>
</table>
<p dir="auto">As predicted by our statistical analysis, <code>boost::unordered_flat_map</code> is
considerably faster than <code>absl::flat_hash_map</code> for unsuccessful lookup
because the average probe length and number of (negative) comparisons are
much lower; this effect translates also to insertion, since <code>insert</code> needs
to first check that the element is not present, so it internally performs an
unsuccessful lookup. Note how the performance of <code>boost::unordered_flat_map</code>
is less impacted (stays flatter) as the load factor increases.</p>
<p dir="auto">As for successful lookup, <code>boost::unordered_flat_map</code> is still faster, which may be
due to its better cache locality, particularly for low load factors: in this
situation, elements are clustered at the beginning portion of each group, while for
<code>absl::flat_hash_map</code> they are uniformly distributed with more empty space
in between.</p>
<p dir="auto"><code>boost::unordered_flat_map</code> is slower than <code>absl::flat_hash_map</code> for running
erasure (erasure of some elements during container traversal). The actual culprit
here is iteration, which is particularly slow; this is a collateral effect
of having SIMD operations work only on 16-aligned metadata words, while
<code>absl::flat_hash_map</code> iteration looks ahead 16 metadata bytes beyond the current
iterator position.</p>
<a name="aggregate-performance"><div style="text-align: left;"><b><span style="font-size: medium;">Aggregate performance</span></b></div></a>
<p dir="auto">Boost.Unordered provides a series of <a href="https://github.com/boostorg/unordered/tree/develop/benchmark">benchmarks</a>
emulating real-life scenarios combining several operations for a number of
hash containers and key types (<code>std::string</code>, <code>std::string_view</code>, <code>std::uint32_t</code>,
<code>std::uint64_t</code> and a UUID class of size 16). The interested reader can
build and run the benchmarks on her environment of choice; as an example, these are
the results for GCC 11 in x64 mode on an Intel Xeon E5-2683 @ 2.10GHz:</p>
<pre><span style="font-size: small;"><b>std::string</b>
std::unordered_map: 38021 ms, 175723032 bytes in 3999509 allocations
boost::unordered_map: 30785 ms, 149465712 bytes in 3999510 allocations
boost::unordered_flat_map: 14486 ms, 134217728 bytes in 1 allocations
multi_index_map: 30162 ms, 178316048 bytes in 3999510 allocations
absl::node_hash_map: 15403 ms, 139489608 bytes in 3999509 allocations
absl::flat_hash_map: 13018 ms, 142606336 bytes in 1 allocations
std::unordered_map, FNV-1a: 43893 ms, 175723032 bytes in 3999509 allocations
boost::unordered_map, FNV-1a: 33730 ms, 149465712 bytes in 3999510 allocations
boost::unordered_flat_map, FNV-1a: 15541 ms, 134217728 bytes in 1 allocations
multi_index_map, FNV-1a: 33915 ms, 178316048 bytes in 3999510 allocations
absl::node_hash_map, FNV-1a: 20701 ms, 139489608 bytes in 3999509 allocations
absl::flat_hash_map, FNV-1a: 18234 ms, 142606336 bytes in 1 allocations
</span><hr /><span style="font-size: small;"><b>std::string_view</b>
std::unordered_map: 38481 ms, 207719096 bytes in 3999509 allocations
boost::unordered_map: 26066 ms, 181461776 bytes in 3999510 allocations
boost::unordered_flat_map: 14923 ms, 197132280 bytes in 1 allocations
multi_index_map: 27582 ms, 210312120 bytes in 3999510 allocations
absl::node_hash_map: 14670 ms, 171485672 bytes in 3999509 allocations
absl::flat_hash_map: 12966 ms, 209715192 bytes in 1 allocations
std::unordered_map, FNV-1a: 45070 ms, 207719096 bytes in 3999509 allocations
boost::unordered_map, FNV-1a: 29148 ms, 181461776 bytes in 3999510 allocations
boost::unordered_flat_map, FNV-1a: 15397 ms, 197132280 bytes in 1 allocations
multi_index_map, FNV-1a: 30371 ms, 210312120 bytes in 3999510 allocations
absl::node_hash_map, FNV-1a: 19251 ms, 171485672 bytes in 3999509 allocations
absl::flat_hash_map, FNV-1a: 17622 ms, 209715192 bytes in 1 allocations
</span><hr /><span style="font-size: small;"><b>std::uint32_t</b>
std::unordered_map: 21297 ms, 192888392 bytes in 5996681 allocations
boost::unordered_map: 9423 ms, 149424400 bytes in 5996682 allocations
boost::unordered_flat_map: 4974 ms, 71303176 bytes in 1 allocations
multi_index_map: 10543 ms, 194252104 bytes in 5996682 allocations
absl::node_hash_map: 10653 ms, 123470920 bytes in 5996681 allocations
absl::flat_hash_map: 6400 ms, 75497480 bytes in 1 allocations
</span><hr /><span style="font-size: small;"><b>std::uint64_t</b>
std::unordered_map: 21463 ms, 240941512 bytes in 6000001 allocations
boost::unordered_map: 10320 ms, 197477520 bytes in 6000002 allocations
boost::unordered_flat_map: 5447 ms, 134217728 bytes in 1 allocations
multi_index_map: 13267 ms, 242331792 bytes in 6000002 allocations
absl::node_hash_map: 10260 ms, 171497480 bytes in 6000001 allocations
absl::flat_hash_map: 6530 ms, 142606336 bytes in 1 allocations
</span><hr /><span style="font-size: small;"><b>uuid</b>
std::unordered_map: 37338 ms, 288941512 bytes in 6000001 allocations
boost::unordered_map: 24638 ms, 245477520 bytes in 6000002 allocations
boost::unordered_flat_map: 9223 ms, 197132280 bytes in 1 allocations
multi_index_map: 25062 ms, 290331800 bytes in 6000002 allocations
absl::node_hash_map: 14005 ms, 219497480 bytes in 6000001 allocations
absl::flat_hash_map: 10559 ms, 209715192 bytes in 1 allocations</span>
</pre>
<p dir="auto">Each container uses its own default hash function, except the entries labeled
<code>FNV-1a</code> in <code>std::string</code> and <code>std::string_view</code>, which use the same
implementation of
<a href="https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1a_hash" rel="nofollow">Fowler–Noll–Vo hash, version 1a</a>,
and <code>uuid</code>, where all containers use the same user-provided function based on
<a href="https://www.boost.org/libs/container_hash/doc/html/hash.html#ref_hash_combine" rel="nofollow"><code>boost::hash_combine</code></a>.</p>
<a name="deviations-from-the-standard"><div style="text-align: left;"><b><span style="font-size: large;">Deviations from the standard</span></b></div></a>
<p dir="auto">The adoption of open addressing imposes a number of deviations from the C++
standard for unordered associative containers. Users should keep them in mind
when migrating to <code>boost::unordered_flat_map</code> from <code>boost::unordered_map</code> (or
from any other implementation of <code>std::unordered_map</code>):</p>
<ul dir="auto"><li>Both <code>Key</code> and <code>T</code> in <code>boost::unordered_flat_map<Key,T></code> must be
<a href="https://en.cppreference.com/w/cpp/named_req/MoveConstructible" rel="nofollow">MoveConstructible</a>.
This is due to the fact that elements are stored directly into the bucket array and
have to be transferred to a new block of memory on rehashing; by contrast,
<code>boost::unordered_map</code> is a <i>node-based</i> container and elements are never moved
once constructed.</li><li>For the same reason, pointers and references to elements become invalid after
rehashing (<code>boost::unordered_map</code> only invalidates iterators).</li><li><code>begin()</code> is not constant-time (the bucket array is traversed till the first
non-empty bucket is found).</li><li><code>erase(iterator)</code> returns <code>void</code> rather than an iterator to the element
after the erased one. This is done to maximize performance, as locating the
next element requires traversing the bucket array; if that element is absolutely
required, the <code>erase(iterator++)</code> idiom can be used. This performance issue
is not exclusive to open addressing, and has been
<a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2023.pdf" rel="nofollow">discussed</a>
in the context of the C++ standard too.</li><li>The maximum load factor can't be changed by the user (<code>max_load_factor(z)</code> is
provided for backwards compatibility reasons, but does nothing). Rehashing
can occur <i>before</i> the load reaches <code>max_load_factor() * bucket_count()</code> due
to the anti-drift mechanism described previously.</li><li>There is no bucket API (<code>bucket_size</code>, <code>begin(n)</code>, etc.) save <code>bucket_count</code>.</li><li>There are no node handling facilities (<a href="https://en.cppreference.com/w/cpp/container/unordered_map/extract" rel="nofollow"><code>extract</code></a>, etc.).
Such functionality makes no sense here as open-addressing containers are precisely
<i>not</i> node-based. <a href="https://en.cppreference.com/w/cpp/container/unordered_map/merge" rel="nofollow"><code>merge</code></a>
is provided, but the implementation relies on element movement rather than node
transferring.</li></ul>
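<p dir="auto">The <code>erase(iterator++)</code> idiom mentioned above can be sketched as follows. To
keep the example self-contained it is exercised with <code>std::unordered_map</code>, where the idiom
is equally valid; the helper name <code>erase_odd_keys</code> is hypothetical. Since the return value
of <code>erase</code> is not used, the same loop is expected to work unchanged with a container whose
<code>erase(iterator)</code> returns <code>void</code>.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Erases all elements with odd keys during traversal; returns the number erased.
template<typename Map>
std::size_t erase_odd_keys(Map& m)
{
  std::size_t erased = 0;
  for (auto it = m.begin(); it != m.end(); ) {
    if (it->first % 2 != 0) {
      m.erase(it++);  // it is advanced first; the old position is then erased
      ++erased;
    }
    else {
      ++it;
    }
  }
  return erased;
}
```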
<a name="conclusions-and-next-steps"><div style="text-align: left;"><b><span style="font-size: large;">Conclusions and next steps</span></b></div></a>
<p dir="auto"><code>boost::unordered_flat_map</code> and <code>boost::unordered_flat_set</code> are the new
open-addressing containers in Boost.Unordered providing top speed
in exchange for some interface and behavioral deviations from the standards-compliant
<code>boost::unordered_map</code> and <code>boost::unordered_set</code>. We have analyzed their
internal data structure and provided some theoretical and practical evidence
for their excellent performance. As of this writing, we claim
<code>boost::unordered_flat_map</code>/<code>boost::unordered_flat_set</code>
to rank among the fastest hash containers available to C++ programmers.</p>
<p dir="auto">With this work, we have reached an important milestone in the ongoing
<a href="https://pdimov.github.io/articles/unordered_dev_plan.html" rel="nofollow">Development Plan for Boost.Unordered</a>.
After Boost 1.81, we will continue improving the functionality
and performance of existing containers and will possibly augment the available
container catalog to offer greater freedom of choice to Boost users.
Your feedback on our current and future work is most welcome.</p>
<p dir="auto"><b><span style="font-size: large;">Deferred argument evaluation</span></b><br />Joaquín M López Muñoz, 2022-10-02 (updated 2022-10-04)</p>
<p dir="auto">Suppose our program deals with heavy entities of some type <code>object</code> which are
uniquely identified by an integer ID. The following is a possible implementation
of a function that controls ID-constrained creation of such objects:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">object* <span class="pl-en">retrieve_or_create</span>(<span class="pl-k">int</span> id)
{
  <span class="pl-k">static</span> std::unordered_map<<span class="pl-k">int</span>, std::unique_ptr<object>> m;

  <span class="pl-c"><span class="pl-c">//</span> see if the object is already in the map</span>
  <span class="pl-k">auto</span> [it,b] = m.<span class="pl-c1">emplace</span>(id, <span class="pl-c1">nullptr</span>);
  <span class="pl-c"><span class="pl-c">//</span> create it otherwise</span>
  <span class="pl-k">if</span>(b) it-><span class="pl-smi">second</span> = std::make_unique<object>(id);
  <span class="pl-k">return</span> it-><span class="pl-smi">second</span>.<span class="pl-c1">get</span>();
}</pre></div>
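<p dir="auto">Note that the code is careful not to create a spurious object if
an equivalent one already exists; but in doing so, we have introduced a
potential inconsistency in the internal map if object creation throws: the entry
inserted by <code>emplace</code> is left behind holding a null pointer. The following version takes care of that:</p>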
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint"><span class="pl-c"><span class="pl-c">//</span> fixed version</span>
object* <span class="pl-en">retrieve_or_create</span>(<span class="pl-k">int</span> id)
{
<span class="pl-k">static</span> std::unordered_map<<span class="pl-k">int</span>, std::unique_ptr<object>> m;
<span class="pl-c"><span class="pl-c">//</span> see if the object is already in the map<br /> </span><span class="pl-k">auto</span> [it,b] = m.<span class="pl-c1">emplace</span>(id, <span class="pl-c1">nullptr</span>);
<span class="pl-c"><span class="pl-c">//</span> create it otherwise<br /> </span><span class="pl-k">if</span>(b){
<span class="pl-k">try</span>{
it-><span class="pl-smi">second</span> = std::make_unique<object>(id);
}
<span class="pl-k">catch</span>(...){<br /> <span class="pl-c"><span class="pl-c">//</span> we can get here when running out of memory, for instance</span>
m.<span class="pl-c1">erase</span>(it);
<span class="pl-k">throw</span>;
}
}
<span class="pl-k">return</span> it-><span class="pl-smi">second</span>.<span class="pl-c1">get</span>();
}</pre></div>
<p dir="auto">This fixed version is a little cumbersome, to say the least. Starting in C++17,
we can use <code>try_emplace</code> to rewrite <code>retrieve_or_create</code> as follows:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint">object* <span class="pl-en">retrieve_or_create</span>(<span class="pl-k">int</span> id)
{
<span class="pl-k">static</span> std::unordered_map<<span class="pl-k">int</span>, std::unique_ptr<object>> m;
<span class="pl-k">auto</span> [it,b] = m.<span class="pl-c1">try_emplace</span>(id, std::make_unique<object>(id));
<span class="pl-k">return</span> it-><span class="pl-smi">second</span>.<span class="pl-c1">get</span>();
}</pre></div>
<p dir="auto">But then we've introduced the problem of spurious object creation we strived to
avoid. Ideally, we'd like for <code>try_emplace</code> to <b>not</b> create the object except when really needed.
What we're effectively asking for is some sort of technique for
<i>deferred argument evaluation</i>. As it happens, it is very easy to devise our own:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint"><span class="pl-k">template</span><<span class="pl-k">typename</span> F>
<span class="pl-k">struct</span> <span class="pl-en">deferred_call</span>
{
<span class="pl-k">using</span> result_type=decltype(std::declval<<span class="pl-k">const</span> F>()());
<span class="pl-k">operator</span> <span class="pl-en">result_type</span>() <span class="pl-k">const</span> { <span class="pl-k">return</span> <span class="pl-c1">f</span>(); }
F f;
};
object* <span class="pl-en">retrieve_or_create</span>(<span class="pl-k">int</span> id)
{
<span class="pl-k">static</span> std::unordered_map<<span class="pl-k">int</span>, std::unique_ptr<object>> m;
<span class="pl-k">auto</span> [it,b] = m.<span class="pl-c1">try_emplace</span>(
id,
<span class="pl-c1">deferred_call</span>([&]{ <span class="pl-k">return</span> std::make_unique<object>(id); }));
<span class="pl-k">return</span> it-><span class="pl-smi">second</span>.<span class="pl-c1">get</span>();
}</pre></div>
<p dir="auto"><code>deferred_call</code> is a small utility that computes a value upon request of
conversion to <code>deferred_call::result_type</code>. In the example, such conversion will only happen if
<code>try_emplace</code> really needs to create a <code>std::pair<const int, std::unique_ptr<object>></code>, that is,
if no equivalent object was already present in the map.</p>
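<p dir="auto">To see the deferral at work, here is a minimal self-contained sketch. It makes two assumptions that are not part of the code above: <code>object</code> is a hypothetical stand-in that merely counts how many times it is constructed, and <code>deferred_call</code> gets an explicit deduction guide and is brace-initialized so that the example also builds as C++17 (writing <code>deferred_call(...)</code> for an aggregate, as above, relies on C++20).</p>

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>
#include <utility>

template<typename F>
struct deferred_call
{
  using result_type = decltype(std::declval<const F>()());
  operator result_type() const { return f(); }
  F f;
};

// Deduction guide (an addition of this sketch) so that deferred_call{...}
// works in C++17, where class template argument deduction does not yet
// cover aggregates.
template<typename F> deferred_call(F) -> deferred_call<F>;

// Hypothetical "heavy" type: it just counts how many times it is constructed.
struct object
{
  explicit object(int id_) : id(id_) { ++constructions; }
  int id;
  static inline int constructions = 0;
};

object* retrieve_or_create(int id)
{
  static std::unordered_map<int, std::unique_ptr<object>> m;
  // The lambda only runs if try_emplace has to build a new element.
  auto [it, b] = m.try_emplace(
    id,
    deferred_call{[&]{ return std::make_unique<object>(id); }});
  assert(it->second != nullptr); // the map never holds a null pointer
  return it->second.get();
}
```

<p dir="auto">Calling <code>retrieve_or_create(42)</code> twice should return the same pointer and leave <code>object::constructions</code> at 1: the second call finds the key and never converts its <code>deferred_call</code> argument.</p>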
<p dir="auto">In a general setting, for <code>deferred_call</code> to work as expected, that is, to delay producing
the value until the point of actual usage, the following conditions must be met:</p>
<ol dir="auto"><li>The <code>deferred_call</code> object is passed to a function/constructor template
accepting generic, unconstrained parameters.</li><li>All internal intermediate interfaces are also generic.</li><li>The final function/constructor where actual usage happens asks exactly for a
<code>deferred_call::result_type</code> value or reference.</li></ol>
<p dir="auto">It is the last condition that can be the most problematic:</p>
<div class="highlight highlight-source-c++ notranslate position-relative overflow-auto" dir="auto"><pre class="prettyprint"><span class="pl-k">void</span> <span class="pl-en">f</span>(std::string);
<span class="pl-c"><span class="pl-c">//</span> error: deferred_call not convertible to std::string<br /></span><span class="pl-en">f</span>(deferred_call([]{ <span class="pl-k">return</span> <span class="pl-s"><span class="pl-pds">"</span>hello<span class="pl-pds">"</span></span>; })); </pre></div>
<p dir="auto">C++ conversion rules allow at most <b>one</b> user-defined conversion to take place,
and here we are calling for the sequence <code>deferred_call</code> → <code>const char*</code> → <code>std::string</code>.
In this case, however, the fix is trivial:</p>
<pre class="prettyprint"><span class="pl-k">void</span> <span class="pl-en">f</span>(std::string);
<span class="pl-en">f</span>(deferred_call([]{ <span class="pl-k">return</span> <span class="pl-c1">std::string</span>(<span class="pl-s"><span class="pl-pds">"</span>hello<span class="pl-pds">"</span></span>); })); </pre><p><b>Update Oct 4</b></p><p>Jessy De Lannoit <a href="https://www.reddit.com/r/cpp/comments/xtsre3/deferred_argument_evaluation/iqvn8ag/">proposes</a> a variation on <code>deferred_call</code> that solves the problem of producing a value that is one user-defined conversion away from the target type:<br /></p>
<pre class="prettyprint">template<typename F><br />struct deferred_call<br />{<br /> using result_type=decltype(std::declval<const F>()());<br /> operator result_type() const { return f(); }<br /><br /> template<typename T><br /> requires (std::is_constructible_v<T, result_type>)<br /> constexpr operator T() const { return {f()}; }<br /> <br /> F f;<br />};<br /><br /><span class="pl-k">void</span> <span class="pl-en">f</span>(std::string);
<span class="pl-c"><span class="pl-c">//</span> works ok: deferred_call converts to std::string<br /></span><span class="pl-en">f</span>(deferred_call([]{ <span class="pl-k">return</span> <span class="pl-s"><span class="pl-pds">"</span>hello<span class="pl-pds">"</span></span>; }));
</pre><p>
This version of <code>deferred_call</code> has an eager conversion operator producing any requested value as long as it is constructible from <code>deferred_call::result_type</code>. The solution comes with a different set of problems, though: </p><pre class="prettyprint"><span class="pl-k">void</span> <span class="pl-en">f</span>(std::string);<br />void f(const char*);
<span class="pl-c"><span class="pl-c">//</span> ambiguous call to f<br /></span><span class="pl-en">f</span>(deferred_call([]{ <span class="pl-k">return</span> <span class="pl-s"><span class="pl-pds">"</span>hello<span class="pl-pds">"</span></span>; }));
</pre>There is probably little more we can do without language support. One can imagine some sort of "silent" conversion operator that does not add to the cap on user-defined conversions allowed by the rules of C++:<br /><pre class="prettyprint">template<typename F><br />struct deferred_call<br />{<br /> using result_type=decltype(std::declval<const F>()());<br /> operator result_type() const { return f(); }<br /><br /> // "silent" conversion operator marked with ~explicit<br /> // (not actual C++)<br /> template<typename T><br /> requires (std::is_constructible_v<T, result_type>)<br /> ~explicit constexpr operator T() const { return {f()}; }<br /> <br /> F f;<br />};</pre>Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-53293029772454743272022-06-18T21:16:00.009+02:002022-06-20T13:43:47.618+02:00Advancing the state of the art for std::unordered_map implementations<h2 style="text-align: left;">Introduction</h2>
<p style="text-align: left;">Several Boost authors have embarked on a <a href="https://pdimov.github.io/articles/unordered_dev_plan.html" rel="nofollow">project</a>
to improve the performance of <a href="https://www.boost.org/doc/libs/release/libs/unordered/" rel="nofollow">Boost.Unordered</a>'s
implementation of <code>std::unordered_map</code> (and <code>multimap</code>, <code>set</code> and <code>multiset</code> variants),
and to extend its portfolio of available containers to offer faster, non-standard
alternatives based on open addressing.</p>
<p style="text-align: left;">The first goal of the project has been completed in time for Boost 1.80 (due August 2022). We
describe here the technical innovations introduced in <code>boost::unordered_map</code>
that make it the fastest implementation of <code>std::unordered_map</code> on the market.</p>
<h2 style="text-align: left;">Closed vs. open addressing</h2>
<p style="text-align: left;">To a first approximation, hash table implementations fall into one of two general classes:</p>
<ul style="text-align: left;"><li><i>Closed addressing</i> (also known as <a href="https://en.wikipedia.org/wiki/Hash_table#Separate_chaining" rel="nofollow"><i>separate chaining</i></a>)
relies on an array of <i>buckets</i>, each of which points to a list of elements belonging to it.
When a new element goes to an already occupied bucket, it is simply linked to the
associated element list.
The figure depicts what we call the <i>textbook implementation</i> of closed addressing, arguably
the simplest layout, and among the fastest, for this type of hash table.</li></ul>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDSQQoqH5c7LuFi_SzM70cCioPxDCUmtuUyb0aPkcjDCfMg_65498faqsJtxZ2mBlzpmdFwowHxXdTUzjqEtbK-fIeaZR9y26CXD8zXE4V89VJDUjZG9cRRhGyxrWiEKYa29qU78_zVa9wKD60tyGUIwImqKxkkYLXfjuio87uU3-fJwxXN9WuFuiH/s713/bucket-groups.png" style="margin-left: 1em; margin-right: 1em;"><img alt="textbook layout" border="0" data-original-height="221" data-original-width="713" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDSQQoqH5c7LuFi_SzM70cCioPxDCUmtuUyb0aPkcjDCfMg_65498faqsJtxZ2mBlzpmdFwowHxXdTUzjqEtbK-fIeaZR9y26CXD8zXE4V89VJDUjZG9cRRhGyxrWiEKYa29qU78_zVa9wKD60tyGUIwImqKxkkYLXfjuio87uU3-fJwxXN9WuFuiH/s713/bucket-groups.png" width="100%" /></a></div>
<ul style="text-align: left;"><li><a href="https://en.wikipedia.org/wiki/Hash_table#Open_addressing" rel="nofollow"><i>Open addressing</i></a>
(or <i>closed hashing</i>) stores at most one element in each bucket (sometimes called a <i>slot</i>).
When an element goes to an already occupied slot, some
<i>probing</i> mechanism is used to locate an available slot, preferably close to the original one.</li></ul>
<p style="text-align: left;">Recent, high-performance hash tables use open addressing and leverage
its inherently better cache locality as well as widely available
<a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data" rel="nofollow">SIMD</a> operations.
Closed addressing provides some functional advantages, though, and
remains relevant as the required foundation for the implementation
of <code>std::unordered_map</code>.</p>
<h2 style="text-align: left;">Restrictions on the implementation of <code>std::unordered_map</code></h2>
<p style="text-align: left;">The standardization of C++ unordered associative containers is based on Matt Austern's 2003
<a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1456.html" rel="nofollow">N1456</a> paper.
Back in the day, open-addressing approaches were not regarded as sufficiently mature,
so closed addressing was taken as the safe implementation of choice. Even though the
C++ standard does not explicitly require that closed addressing must be used, the
assumption that this is the case leaks through the public interface of <code>std::unordered_map</code>:</p>
<ul style="text-align: left;"><li>A bucket API is provided.</li><li>Pointer stability implies that the container is node-based. In C++17, this implication
was made explicit with the introduction of <code>extract</code> capabilities.</li><li>Users can control the container load factor.</li><li>Requirements on the hash function are very lax (open addressing depends on high-quality
hash functions with the ability to spread keys widely across the space of <code>std::size_t</code>
values.)</li></ul>
<p style="text-align: left;">As a result, all standard library implementations use some form of closed addressing
for the internal structure of their <code>std::unordered_map</code> (and related containers).</p>
<p style="text-align: left;">Coming as an additional difficulty, there are two complexity requirements:</p>
<ul style="text-align: left;"><li>iterator increment must be (amortized) constant time,</li><li><code>erase</code> must be constant time on average,</li></ul>
<p style="text-align: left;">that rule out the textbook implementation of closed addressing (see <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2023.pdf" rel="nofollow">N2023</a>
for details). To cope with this problem,
standard libraries depart from the textbook layout in ways that introduce speed and memory
penalties: this is, for instance, what the libstdc++-v3 and libc++ layouts look like:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhLPOMW-s6sNwyOo0lHoDCEEicJ78kiN5ZvR82GSYL8HiRwYgtXYd_nAjLpM4DZo16EH2WRP1ApbrYAU5C0LsmJfGL2beVtoTnMLjxyfv8erOGWY6lUDfUPwc93ie9gW6s4VPST-Tk3wkkZSMcYpIVToxrbhe1-F5LsXLet2h14KfvCKWLpzikS6Po/s481/singly-linked.png" style="margin-left: 1em; margin-right: 1em;"><img alt="libstdc++-v3/libc++ layout" border="0" data-original-height="252" data-original-width="481" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhLPOMW-s6sNwyOo0lHoDCEEicJ78kiN5ZvR82GSYL8HiRwYgtXYd_nAjLpM4DZo16EH2WRP1ApbrYAU5C0LsmJfGL2beVtoTnMLjxyfv8erOGWY6lUDfUPwc93ie9gW6s4VPST-Tk3wkkZSMcYpIVToxrbhe1-F5LsXLet2h14KfvCKWLpzikS6Po/s481/singly-linked.png" width="70%" /></a></div><p style="text-align: center;"></p>
<p style="text-align: left;">To provide constant iterator increment, all nodes are linked together, which in turn
forces two adjustments to the data structure:</p>
<ul style="text-align: left;"><li>Buckets point to the node <i>before</i> the first one
in the bucket so as to preserve constant-time erasure.</li><li>To detect the end of a bucket, the element hash value is added as a data member of
the node itself (libstdc++-v3 opts for on-the-fly hash calculation under some
circumstances).</li></ul>
<p style="text-align: left;">The Visual Studio standard library (formerly from Dinkumware) uses an entirely different
approach to circumvent the problem, but the general outcome is that the resulting data
structures perform significantly worse than the textbook layout in terms of speed,
memory consumption, or both.</p>
<h2 style="text-align: left;">Boost.Unordered 1.80 data layout</h2>
<p style="text-align: left;">The new data layout used by Boost.Unordered goes back to the textbook approach:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdpstbRFaPO9Xe_kWyiVK_cRRiAnOPwkb5t-WlgV8lqP86CBDBvedNogygLsojv_5rERgE1YZ31wmTjp9tJKy3oXcmN9AVTZwrhg1wPg4EEkTt51KF0CJ7bvXeaSFufSQkuQcX-N_3byNVIhXgnXnP1wmAXCY71FOlkRuJGQbfdSYN1QsfuJTdpMrw/s771/fca.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Boost.Unordered layout" border="0" data-original-height="326" data-original-width="771" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdpstbRFaPO9Xe_kWyiVK_cRRiAnOPwkb5t-WlgV8lqP86CBDBvedNogygLsojv_5rERgE1YZ31wmTjp9tJKy3oXcmN9AVTZwrhg1wPg4EEkTt51KF0CJ7bvXeaSFufSQkuQcX-N_3byNVIhXgnXnP1wmAXCY71FOlkRuJGQbfdSYN1QsfuJTdpMrw/s771/fca.png" width="100%" /></a></div><p style="text-align: center;"></p>
<p style="text-align: left;">Unlike in standard library implementations, nodes are not linked across the
container but only within each bucket. This makes constant-time <code>erase</code> trivially
implementable, but leaves unsolved the problem of constant-time iterator increment: to
achieve it, we introduce so-called <i>bucket groups</i> (top of the diagram). Each bucket
group consists of a 32/64-bit bucket occupancy mask plus <code>next</code> and <code>prev</code> pointers linking non-empty
bucket groups together. Iteration across buckets resorts to a
combination of bit manipulation operations on the bitmasks plus group traversal through
<code>next</code> pointers, which is not only constant time but also very lightweight in terms
of execution time and of memory overhead (4 bits per bucket).</p>
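<p style="text-align: left;">The traversal just described can be sketched as follows. This is only an illustration under simplifying assumptions, not Boost.Unordered's actual code: <code>bucket_group</code>, <code>bucket_pos</code>, <code>lowest_set_bit</code> and <code>next_occupied</code> are hypothetical names, groups are fixed at 64 buckets, and the bit scan is a bounded loop where real code would more likely use <code>std::countr_zero</code> (C++20) or a compiler intrinsic.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified bucket group: one occupancy bit per bucket plus
// links between non-empty groups.
struct bucket_group
{
  std::uint64_t mask = 0;        // bit i set <=> bucket i of the group is non-empty
  bucket_group* next = nullptr;  // next non-empty group
  bucket_group* prev = nullptr;  // previous non-empty group
};

// A bucket is identified by its group and its index (0..63) within the group.
struct bucket_pos
{
  bucket_group* group;
  unsigned      index;
};

// Index of the lowest set bit of a non-zero mask (at most 63 iterations).
inline unsigned lowest_set_bit(std::uint64_t x)
{
  assert(x != 0);
  unsigned n = 0;
  while(!(x & 1)) { x >>= 1; ++n; }
  return n;
}

// Advance to the next non-empty bucket; returns {nullptr, 0} at the end.
inline bucket_pos next_occupied(bucket_pos p)
{
  // Keep only the occupancy bits located after the current bucket.
  const std::uint64_t rest =
    p.index >= 63 ? 0 : p.group->mask & (~std::uint64_t(0) << (p.index + 1));
  if(rest) return {p.group, lowest_set_bit(rest)};

  // Nothing left in this group: hop to the next non-empty group, whose
  // mask is non-zero by construction.
  bucket_group* g = p.group->next;
  if(!g) return {nullptr, 0};
  return {g, lowest_set_bit(g->mask)};
}
```

<p style="text-align: left;">Each step costs a mask, a bit scan and at most one pointer hop, regardless of how many empty buckets lie in between.</p>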
<h2 style="text-align: left;">Fast modulo</h2>
<p style="text-align: left;">When inserting or looking for an element, hash table implementations need to map the element hash
value into the array of buckets (or slots in the open-addressing case). There
are two general approaches in common use:</p>
<ul style="text-align: left;"><li>Bucket array sizes follow a sequence of prime numbers <i>p</i>, and mapping is of the form
<i>h</i> → <i>h</i> mod <i>p</i>.</li><li>Bucket array sizes follow a power-of-two sequence 2<i><sup>n</sup></i>, and mapping takes
<i>n</i> bits from <i>h</i>. Typically it is the <i>n</i> least significant bits that are used,
but in some cases, like when <i>h</i> is postprocessed to improve its uniformity
via multiplication by a well-chosen constant <i>m</i> (such as defined by
<a href="https://en.wikipedia.org/wiki/Hash_function#Fibonacci_hashing" rel="nofollow">Fibonacci hashing</a>),
it is best to take the <i>n</i> <i>most</i> significant bits, that is,
<i>h</i> → (<i>h</i> × <i>m</i>) >> (<i>N</i> − <i>n</i>), where <i>N</i> is the bitwidth of <code>std::size_t</code>
and >> is the usual C++ right shift operation.</li></ul>
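<p style="text-align: left;">As a rough sketch of the mappings just listed (function names are made up for illustration, and a 64-bit hash value is assumed, so <i>N</i> = 64):</p>

```cpp
#include <cassert>
#include <cstdint>

// Prime-sized bucket array: h -> h mod p.
inline std::uint64_t position_mod_prime(std::uint64_t h, std::uint64_t p)
{
  assert(p != 0);
  return h % p;
}

// Power-of-two bucket array (2^n buckets): take the n least significant bits.
inline std::uint64_t position_low_bits(std::uint64_t h, unsigned n)
{
  assert(n < 64);
  return h & ((std::uint64_t(1) << n) - 1);
}

// Fibonacci hashing: multiply by m (about 2^64 divided by the golden ratio)
// and take the n most significant bits of the 64-bit product.
inline std::uint64_t position_fibonacci(std::uint64_t h, unsigned n)
{
  assert(n >= 1 && n <= 64);
  const std::uint64_t m = 0x9E3779B97F4A7C15ull;
  return (h * m) >> (64 - n); // the product wraps modulo 2^64 on purpose
}
```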
<p style="text-align: left;">We use the modulo by a prime approach because it produces very good spreading even if
hash values are not uniformly distributed. In modern CPUs, however, modulo is an expensive
operation involving integer division; compilers, on the other hand, know how to perform
modulo <i>by a constant</i> much more efficiently, so one possible optimization is to keep a
table of pointers to functions <i>f</i><sub><i>p</i></sub> : <i>h</i> → <i>h</i> mod <i>p</i>. This technique
replaces expensive modulo calculation with a table jump plus a modulo-by-a-constant operation.</p>
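<p style="text-align: left;">A minimal sketch of such a table is shown next. The names (<code>mod_by</code>, <code>mod_table</code>, <code>position</code>) and the very short list of primes are assumptions made for illustration only.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One function per supported bucket count: since P is a compile-time
// constant, the compiler can turn h % P into cheaper multiply/shift code.
template<std::uint64_t P>
std::uint64_t mod_by(std::uint64_t h) { return h % P; }

using mod_fn = std::uint64_t (*)(std::uint64_t);

// Illustrative list of prime sizes; a real container would have many more.
constexpr mod_fn mod_table[] = {
  &mod_by<13>, &mod_by<29>, &mod_by<53>, &mod_by<97>
};
constexpr std::size_t mod_table_size = sizeof(mod_table) / sizeof(mod_table[0]);

// size_index selects the current bucket array size.
inline std::uint64_t position(std::uint64_t h, std::size_t size_index)
{
  assert(size_index < mod_table_size);
  return mod_table[size_index](h); // indirect call through the table
}
```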
<p style="text-align: left;">In Boost.Unordered 1.80, we have gone a step further.
<a href="https://arxiv.org/abs/1902.01961" rel="nofollow">Daniel Lemire et al.</a> show how to calculate
<i>h</i> mod <i>p</i> as an operation involving some shifts and multiplications by <i>p</i> and
a pre-computed <i>c</i> value acting as a sort of reciprocal of <i>p</i>. We have used this work
to implement hash mapping as <i>h</i> → fastmod(<i>h</i>, <i>p</i>, <i>c</i>) (some details omitted).
Note that, even though fastmod is generally faster than modulo by a constant,
most performance gains actually come from the fact that we are eliminating the
table jump needed to select <i>f</i><sub><i>p</i></sub>, which prevented code inlining.</p>
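<p style="text-align: left;">To give an idea of what fastmod looks like, here is a sketch of the 32-bit variant described in the linked paper. It is not Boost.Unordered's actual code (which, as noted, involves further details); the function names are ours, <i>h</i> and <i>p</i> are assumed to fit in 32 bits, and the high part of the product is computed with plain 64-bit arithmetic rather than a 128-bit type to keep the example portable.</p>

```cpp
#include <cassert>
#include <cstdint>

// Precomputed "reciprocal" of p: c = floor((2^64 - 1) / p) + 1.
inline std::uint64_t fastmod_constant(std::uint32_t p)
{
  assert(p != 0);
  return UINT64_MAX / p + 1;
}

// h mod p for 32-bit h and p, where c = fastmod_constant(p).
inline std::uint32_t fastmod(std::uint32_t h, std::uint32_t p, std::uint64_t c)
{
  const std::uint64_t lowbits = c * h; // wraps modulo 2^64 on purpose
  // The result is the high 64 bits of the 96-bit product lowbits * p,
  // obtained here from the two 32-bit halves of lowbits.
  const std::uint64_t hi = lowbits >> 32;
  const std::uint64_t lo = lowbits & 0xFFFFFFFFu;
  return static_cast<std::uint32_t>((hi * p + ((lo * p) >> 32)) >> 32);
}
```

<p style="text-align: left;">Since <i>p</i> and <i>c</i> are ordinary values rather than functions, the whole computation can be inlined at the call site.</p>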
<h2 style="text-align: left;">Time and memory performance of Boost 1.80 <code>boost::unordered_map</code></h2>
<p style="text-align: left;">We are providing some <a href="https://www.boost.org/doc/libs/develop/libs/unordered/doc/html/unordered.html#benchmarks" rel="nofollow">benchmark results</a>
of <code>boost::unordered_map</code> against the libstdc++-v3, libc++ and Visual Studio standard libraries
for insertion, lookup and erasure scenarios. <code>boost::unordered_map</code> is mostly
faster across the board, and in some cases significantly so. There are three factors
contributing to this performance advantage:</p>
<ul style="text-align: left;"><li>the much reduced memory footprint improves cache utilization,</li><li>fast modulo is used,</li><li>the new layout incurs one less pointer indirection than libstdc++-v3 and libc++
to access the elements of a bucket.</li></ul>
<p style="text-align: left;">As for memory consumption, let <i>N</i> be the number of elements in a container with
<i>B</i> buckets: the memory overheads (that is, memory allocated minus memory used
strictly for the elements themselves) of the different implementations on 64-bit
architectures are:</p>
<table style="margin-left: auto; margin-right: auto; text-align: left;">
<thead>
<tr>
<th>Implementation</th>
<th align="center">Memory overhead (bytes)</th>
</tr>
</thead>
<tbody>
<tr>
<td>libstdc++-v3</td>
<td align="center">16 <i>N</i> + 8 <i>B</i> (<a href="https://gcc.gnu.org/onlinedocs/libstdc++/manual/unordered_associative.html#containers.unordered.cache" rel="nofollow">hash caching</a>)<br />8 <i>N</i> + 8 <i>B</i> (no hash caching)</td>
</tr>
<tr>
<td>libc++</td>
<td align="center">16 <i>N</i> + 8 <i>B</i></td>
</tr>
<tr>
<td>Visual Studio (Dinkumware)</td>
<td align="center">16 <i>N</i> + 16 <i>B</i></td>
</tr>
<tr>
<td>Boost.Unordered</td>
<td align="center">8 <i>N</i> + 8.5 <i>B</i></td>
</tr>
</tbody>
</table>
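<p style="text-align: left;">To put numbers on these formulas, the following helper (a hypothetical convenience, not part of any library) simply evaluates the table for given <i>N</i> and <i>B</i>:</p>

```cpp
#include <cassert>

// Memory overhead in bytes, one field per row of the table above.
struct overhead_bytes
{
  double libstdcxx_hash_caching;
  double libstdcxx_no_hash_caching;
  double libcxx;
  double visual_studio;
  double boost_unordered;
};

// n = number of elements, b = number of buckets (64-bit architectures).
inline overhead_bytes memory_overhead(double n, double b)
{
  assert(n >= 0 && b >= 0);
  return {
    16 * n + 8 * b,   // libstdc++-v3, hash caching
    8 * n + 8 * b,    // libstdc++-v3, no hash caching
    16 * n + 8 * b,   // libc++
    16 * n + 16 * b,  // Visual Studio (Dinkumware)
    8 * n + 8.5 * b   // Boost.Unordered
  };
}
```

<p style="text-align: left;">For instance, with one million elements in one million buckets the formulas give 16.5 million bytes of overhead for Boost.Unordered, against 24 million for libc++ and 32 million for Visual Studio.</p>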
<div style="text-align: left;"> </div><h2 style="text-align: left;">Which hash container to choose</h2>
<p style="text-align: left;">Opting for closed-addressing (which, in the realm of C++, is almost
synonymous with using an implementation of <code>std::unordered_map</code>) or choosing a
speed-oriented, open-addressing container is in practice not a clear-cut decision.
Some factors favoring one or the other option are listed below:</p>
<ul style="text-align: left;"><li><code>std::unordered_map</code>
<ul><li>The code uses some specific parts of its API like node extraction, the bucket interface
or the ability to set the maximum load factor, which are generally not available
in open-addressing containers.</li><li>Pointer stability and/or non-moveability of values required (though some open-addressing alternatives
support these at the expense of reduced performance).</li><li>Constant-time iterator increment required.</li><li>Hash functions used are only mid-quality (open addressing requires that the hash
function have very good key-spreading properties).</li><li>Equivalent key support, i.e. <code>unordered_multimap</code>/<code>unordered_multiset</code> required.
We do not know of any open-addressing container supporting equivalent keys.</li></ul>
</li><li>Open-addressing containers
<ul><li>Performance is the main concern.</li><li>Existing code can be adapted to a basically more stringent API and more demanding requirements
on the element type (like moveability).</li><li>Hash functions are of good quality (or the default ones from the container provider are used).</li></ul>
</li></ul>
<p style="text-align: left;">If you decide to use <code>std::unordered_map</code>, Boost.Unordered 1.80 now gives you the fastest,
fully-conformant implementation on the market.</p>
<h2 style="text-align: left;">Next steps</h2>
<p style="text-align: left;">There are some further areas of improvement to <code>boost::unordered_map</code> that we will
investigate post Boost 1.80:</p>
<ul style="text-align: left;"><li>Reduce the memory overhead of the new layout from 4 bits to 3 bits per bucket.</li><li>Speed up performance for equivalent key variants (<code>unordered_multimap</code>/<code>unordered_multiset</code>).</li></ul>
<p style="text-align: left;">In parallel, we are working on the future <code>boost::unordered_flat_map</code>, our proposal for
a top-speed, open-addressing container beyond the limitations imposed by <code>std::unordered_map</code>
interface. Your feedback on our current and future work is very welcome.</p>Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-84097737567298754232022-03-10T11:30:00.004+01:002023-01-18T10:32:44.748+01:00Emulating template named arguments in C++20<p style="text-align: justify;"><code>std::unordered_map</code> is a highly configurable class template with five parameters:</p>
<pre class="prettyprint">template<
class Key,
class Value,
class Hash = std::hash<Key>,
class KeyEqual = std::equal_to<Key>,
class Allocator = std::allocator< std::pair<const Key, Value> >
> class unordered_map;
</pre>
<p style="text-align: justify;">Typical usage depends on default values for most of these parameters:</p>
<pre class="prettyprint">using my_map=std::unordered_map<int,std::string>;
</pre>
<p style="text-align: justify;">but things get cumbersome when we want to specify one of the usually defaulted types:<br /></p>
<pre class="prettyprint">template<typename T> class my_allocator{ ... };<br />using my_map=std::unordered_map<<br /> int, std::string,<br /> std::hash<int>, std::equal_to<int>,<br /> my_allocator< std::pair<const int, std::string> ><br />>;
</pre>
<p>In the example, we are forced to specify the hash and equality predicate with their default value types just to get to the allocator, which is the parameter we really wanted to specify. Ideally we would like to have a syntax like this:<br /></p>
<pre class="prettyprint">// this is not actual C++<br />using my_map = std::unordered_map<<br /> Key=int, Value=std::string,<br /> Allocator=my_allocator< std::pair<const int, std::string> ><br />>;
</pre>
<p>Turns out we can emulate this by resorting to <a href="https://en.cppreference.com/w/cpp/language/aggregate_initialization#Designated_initializers"><i>designated initializers</i></a>, introduced in C++20:<br /></p>
<pre class="prettyprint">template<<br /> typename Key, typename Value,<br /> typename Hash = std::hash<Key>,<br /> typename Equal = std::equal_to<Key>,<br /> typename Allocator = std::allocator< std::pair<const Key,Value> ><br />><br />struct unordered_map_config<br />{<br /> Key *key = nullptr;<br /> Value *value = nullptr;<br /> Hash *hash = nullptr;<br /> Equal *equal = nullptr;<br /> Allocator *allocator = nullptr;<br /><br /> using type = std::unordered_map<Key,Value,Hash,Equal,Allocator>;<br />};<br /><br />template<typename T><br />constexpr T *type = nullptr;<br /><br />template<unordered_map_config Cfg><br />using unordered_map = typename decltype(Cfg)::type;<br /><br />...<br /><br />using my_map = unordered_map<{<br /> .key = type<int>, .value = type<std::string>,<br /> .allocator = type< my_allocator< std::pair<const int, std::string > > ><br />}>;
</pre>
<p>The approach taken by the simulation is to use designated initializers to create an aggregate object consisting of dummy null pointers: the values of the pointers do not matter, but their types are captured via <a href="https://en.cppreference.com/w/cpp/language/class_template_argument_deduction">CTAD</a> and used to synthesize the associated <code>std::unordered_map</code> instantiation. Two more C++20 features this technique depends on are:</p><ul style="text-align: left;"><li>Non-type template parameters have been extended to accept <a href="https://en.cppreference.com/w/cpp/named_req/LiteralType"><i>literal types</i></a> (which include aggregate types such as <code>unordered_map_config</code> instantiations).<br /></li><li>The class template <code>unordered_map_config</code> can be specified as a non-type template parameter of <code>unordered_map</code>. In C++17, we would have had to define <code>unordered_map</code> as <br />
<pre class="prettyprint">template<auto Cfg><br />using unordered_map = typename decltype(Cfg)::type;
</pre>which would force the user to explicitly name <code>unordered_map_config</code> in<br /><pre class="prettyprint">using my_map = unordered_map<<code>unordered_map_config</code>{...}>;</pre></li></ul><p>There is still the unavoidable noise of having to use the <code>type</code> variable template since, of course, aggregate initialization is about values rather than types.</p><p>Another limitation of this simulation is that we cannot mix named and unnamed parameters:<br /></p>
<pre class="prettyprint">// compiler error: either all initializer clauses should be designated<br />// or none of them should be<br />using my_map = unordered_map<{<br /> type<int>, type<std::string>,<br /> .allocator = type< my_allocator< std::pair<const int, std::string > > ><br />}>;
</pre>
<p>C++20 designated initializers are more restrictive than their C99 counterpart; some of the constraints (initializers cannot be specified out of order) are totally valid in the context of C++, but I personally fail to see why mixing named and unnamed parameters would pose any problem.<br /></p>Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-82766733012089991912022-01-17T14:34:00.001+01:002022-01-17T14:57:36.137+01:00Start Wordle with TARES<p>There have been some discussions on what the best first guess is for the game <a href="https://www.powerlanguage.co.uk/wordle/" target="_blank">Wordle</a>, but none, to the best of my knowledge, has used the following approach. After each guess, the game answers back with a matching result like these:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-size: x-large;"><span style="color: #666666;">■■■■■</span></span> (all letters wrong),<span style="font-size: x-large;"><span style="color: #666666;"> </span></span></p><p style="margin-left: 40px; text-align: left;"><span style="font-size: x-large;"><span style="color: #666666;">■<span style="color: #04ff00;">■■</span><span style="color: #bf9000;">■</span>■</span></span> (two letters right, one mispositioned),</p><p style="margin-left: 40px; text-align: left;"><span style="font-size: x-large;"><span style="color: #04ff00;">■■■■■</span></span> (all letters right).</p>There are 3<sup>5</sup>=243 possible answers. 
From an information-theoretic point of view, the word we are trying to guess is a random variable (selected from a predefined dictionary), and the information we are obtaining by submitting our query is measured by the <a href="https://mathworld.wolfram.com/Entropy.html" target="_blank">entropy</a> formula<br /><div><p style="text-align: center;"><i>H</i>(guess) = <span>− </span> ∑ <i>p<sub>i</sub></i> log<sub>2</sub> <i>p<sub>i</sub></i> bits,<br /></p><p>where <i>p<sub>i</sub></i> is the probability that the game returns the <i>i</i>-th answer (<i>i</i> = 1, ... , 243) for our particular guess. So, the best first guess is the one for which we get the most information, that is, the associated entropy is maximum. Intuitively speaking, we are going for the guess that yields the most balanced partition of the dictionary words as grouped by their matching result: entropy is maximum when all <i>p<sub>i </sub></i>are equal (this is impossible for our problem, but gives an upper bound on the attainable entropy of log<sub>2</sub>(243) = 7.93 bits).<br /></p><p>Let's compute then the best guesses. Wordle uses a dictionary of 2,315 entries which is unfortunately not disclosed; in its place we will resort to <a href="https://www-cs-faculty.stanford.edu/%7Eknuth/sgb.html" target="_blank">Stanford GraphBase list</a>. I wrote a trivial <a href="https://github.com/joaquintides/bannalia/blob/master/wordle.cpp" target="_blank">C++17 program</a> that goes through each of the 5,757 words of Stanford's list and computes its associated entropy as a first guess (see it <a href="http://coliru.stacked-crooked.com/a/8867de8adef60b13" target="_blank">running online</a>). 
The resulting top 10 best words, along with their entropies are:</p><p style="text-align: center;">TARES 6.20918<br />RATES 6.11622<br />TALES 6.09823<br />TEARS 6.05801<br />NARES 6.01579<br />TIRES 6.01493<br />REALS 6.00117<br />DARES 5.99343<br />LORES 5.99031<br />TRIES 5.98875<br /></p><p><br /></p></div>Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com2tag:blogger.com,1999:blog-2715968472735546962.post-50864797273564364542016-09-08T00:01:00.000+02:002016-09-08T18:19:35.407+02:00Global warming as falling into the Sun<div style="text-align: left;">
This summer in Spain has been so particularly hot that people came up with graphical jokes like this:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-ur_br5qTrDU/V9AtFdqfgNI/AAAAAAAABeE/KnZtV6tFTIclRwgb4pl1UdUpGqlQsRHzgCLcB/s1600/c%25C3%25A1ceres.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="250" src="https://3.bp.blogspot.com/-ur_br5qTrDU/V9AtFdqfgNI/AAAAAAAABeE/KnZtV6tFTIclRwgb4pl1UdUpGqlQsRHzgCLcB/s1600/c%25C3%25A1ceres.jpg" width="400" /></a></div>
<div style="text-align: left;">
(<a href="https://en.wikipedia.org/wiki/C%C3%A1ceres,_Spain">Cáceres</a> is my hometown; versions of this picture for many other Spanish populations swarm the net.) Pursuing this idea half-seriously, one can reason that an increase in global temperatures due to climate change might be journalistically equated with the Earth getting closer to the Sun and thus receiving more radiation, which analogy conjures up doomy visions of our planet falling into the blazing hell of the star: let us do the calculations.</div>
<div style="text-align: left;">
<a href="https://en.wikipedia.org/wiki/Climate_sensitivity"><i>Climate sensitivity</i></a>, usually denoted by <i>λ</i>, links changes in global surface temperature with variations of received radiative power</div>
<div style="text-align: center;">
Δ<i>T</i> = <i>λ</i> Δ<i>W</i>.</div>
<div style="text-align: left;">
The mechanism by which radiative power changes (increased albedo, greenhouse effect) results in a different associated <i>λ</i> parameter. For the case of power variations due to changes in solar activity, <a href="http://www.calpoly.edu/~camp/Publications/Tung_etal_GRL_2008.pdf">Tung et al.</a> have calculated <i>λ</i><sub><i>s</i></sub> to be in the range of 0.69 to 0.97 K/(W/m<sup>2</sup>) using data from observations of 11-year solar cycles, and estimate that the stationary sensitivity (i.e. if the change in power was permanent) would be 1.5 times higher, thus in the range of 1.03 to 1.45 K/(W/m<sup>2</sup>).</div>
<div style="text-align: left;">
Now, the Earth is <i>D</i><sub>0</sub> = <a href="https://en.wikipedia.org/wiki/Earth%27s_orbit">1.496 × 10<sup>8</sup> km</a> away from the Sun, and receives an average radiation of <i>W</i><sub>0</sub> = <a href="https://en.wikipedia.org/wiki/Solar_irradiance#Earth">1366 W/m<sup>2</sup></a>. Assuming <a href="https://en.wikipedia.org/wiki/Near_and_far_field">far-field conditions</a>, the radiative power received at the Earth as a function of the distance <i>D</i> to the Sun is then</div>
<div style="text-align: center;">
<i>W</i> = <i>w</i> / <i>D</i><sup>2</sup>,<br />
<i>w</i> = 3.057 × 10<sup>25</sup> W/sr,</div>
<div style="text-align: left;">
which allows us to calculate Δ<i>T</i> = <i>λ</i><sub><i>s</i></sub> Δ<i>W</i> from Δ<i>D</i> = <i>D</i><sub>0</sub> − <i>D</i>, as shown in the graph for the minimum and maximum estimated values of <i>λ</i><sub><i>s</i></sub>. Although this cannot be checked visually, the lines are not straight but include a negligible (in these distance ranges) quadratic component.<i> </i></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-Lpg5rUSC9uU/V9CHJhzjWoI/AAAAAAAABeg/UiBJVgM17HA4MZhPt_ScSsCq-mkgAWb2wCLcB/s1600/deltat_deltad.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="296" src="https://1.bp.blogspot.com/-Lpg5rUSC9uU/V9CHJhzjWoI/AAAAAAAABeg/UiBJVgM17HA4MZhPt_ScSsCq-mkgAWb2wCLcB/s1600/deltat_deltad.png" width="400" /></a></div>
So, the estimated increase of <a href="https://en.wikipedia.org/wiki/Global_warming#Observed_temperature_changes">0.75 °C in global temperature during the 20th century</a> is equivalent to pushing the Earth between 30 and 40 thousand kilometers towards the Sun. Each extra °C brings us 38,000-54,000 km closer to the star. For those stuck with <a href="https://en.wikipedia.org/wiki/United_States_customary_units">USCS</a>, each °F is equivalent to 13,000-18,000 miles.<br />
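<div style="text-align: left;">
As a rough numerical check of these figures, here is a minimal C++ sketch. It assumes the stationary sensitivity range of 1.03 to 1.45 K/(W/m<sup>2</sup>) and the inverse-square law stated earlier; the function name distance_shift_km is ours, chosen for illustration only.</div>

```cpp
#include <cassert>
#include <cmath>

// Constants quoted in the post.
constexpr double D0 = 1.496e8; // Earth-Sun distance [km]
constexpr double W0 = 1366.0;  // average received radiation [W/m^2]

// Distance reduction [km] equivalent to a warming of delta_t [K] for a
// sensitivity lambda [K/(W/m^2)]. Uses W = w / D^2 with w = W0 * D0^2,
// so the new distance is D = D0 * sqrt(W0 / W).
double distance_shift_km(double delta_t, double lambda)
{
  const double delta_w = delta_t / lambda;              // extra power required
  const double d = D0 * std::sqrt(W0 / (W0 + delta_w)); // distance giving W0 + delta_w
  return D0 - d;
}
```

<div style="text-align: left;">
Under these assumptions, distance_shift_km(1.0, 1.45) and distance_shift_km(1.0, 1.03) come out at roughly 38,000 and 53,000 km respectively, in line with the per-degree range quoted above.</div>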
<div style="text-align: left;">
</div>
<div style="text-align: left;">
As an alarmist meme, the figure works poorly since no amount of global warming will translate to anything resembling "falling into the Sun": relative changes in distance measure in the <a href="https://en.wikipedia.org/wiki/Basis_point#Permyriad">permyriads</a>. And, yes, the joke at the beginning of this article is definitely a gross exaggeration.</div>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-41865411280351914852016-09-06T21:24:00.000+02:002016-09-07T12:44:59.047+02:00Compile-time checking the existence of a class template<div style="text-align: left;">
(Updated after a suggestion from <a href="https://www.reddit.com/r/cpp/comments/51gxb3/compiletime_checking_the_existence_of_a_class/d7bvgij">bluescarni</a>.) I recently had to use C++14's <a href="http://en.cppreference.com/w/cpp/types/is_final"><span style="font-family: "courier new" , "courier" , monospace;">std::is_final</span></a> but wanted to downgrade to <a href="http://www.boost.org/libs/type_traits/doc/html/boost_typetraits/reference/is_final.html"><span style="font-family: "courier new" , "courier" , monospace;">boost::is_final</span></a> if the former was not available. Trusting <span style="font-family: "courier new" , "courier" , monospace;">__cplusplus</span> implies overlooking the fact that compilers never provide 100% support for any version of the language, and <a href="http://www.boost.org/libs/config/doc/html/index.html">Boost.Config</a> is usually helpful with these matters, but, as of this writing, it does not provide any macro to check for the existence of <span style="font-family: "courier new" , "courier" , monospace;">std::is_final</span>. It turns out the matter can be investigated with some compile-time manipulations. We first set up some helping machinery in a namespace of our own:</div>
<pre class="prettyprint">#include <cstddef>
#include <functional>
#include <type_traits>
namespace std_is_final_exists_detail{
template<typename> struct is_final{};
struct helper{};
}</pre>
<div style="text-align: left;">
<span style="font-family: "courier new" , "courier" , monospace;">std_is_final_exists_detail::is_final</span> has the same signature as the (possibly existing) <span style="font-family: "courier new" , "courier" , monospace;">std::is_final</span> homonym, but need not implement any of the functionality since it will be used for detection only. The class <span style="font-family: "courier new" , "courier" , monospace;">helper</span> is now used to write code directly into <span style="font-family: "courier new" , "courier" , monospace;">namespace std</span>, as the rules of the language allow (and, in some cases, encourage) us to specialize standard class templates for our own types, like for instance with <span style="font-family: "courier new" , "courier" , monospace;">std::hash</span>:</div>
<pre class="prettyprint">namespace std{
template<>
struct hash<std_is_final_exists_detail::helper>
{
std::size_t operator()(
const std_is_final_exists_detail::helper&)const{return 0;}
static constexpr bool check()
{
using helper=std_is_final_exists_detail::helper;
using namespace std_is_final_exists_detail;
return
!std::is_same<
is_final<helper>,
std_is_final_exists_detail::is_final<helper>>::value;
}
};
}
</pre>
<div style="text-align: left;">
<span style="font-family: "courier new" , "courier" , monospace;">operator()</span> is defined to nominally comply with the expected semantics of <span style="font-family: "courier new" , "courier" , monospace;">std::hash</span> specialization; it is in <span style="font-family: "courier new" , "courier" , monospace;">check</span> that the interesting work happens. By a non-obvious but totally sensible C++ rule, the directive</div>
<pre class="prettyprint">using namespace std_is_final_exists_detail;
</pre>
<div style="text-align: left;">
makes all the symbols of the namespace (including <span style="font-family: "courier new" , "courier" , monospace;">is_final</span>) visible as if they were declared in the <i>nearest namespace containing both <span style="font-family: "courier new" , "courier" , monospace;">std_is_final_exists_detail</span> and <span style="font-family: "courier new" , "courier" , monospace;">std</span></i>, that is, at global namespace level. This means that the unqualified use of <span style="font-family: "courier new" , "courier" , monospace;">is_final</span> in</div>
<pre class="prettyprint">!std::is_same<
is_final<helper>,...
</pre>
<div style="text-align: left;">
resolves to <span style="font-family: "courier new" , "courier" , monospace;">std::is_final</span> if it exists (as it is within namespace <span style="font-family: "courier new" , "courier" , monospace;">std</span>, i.e. closer than the global level), and to <span style="font-family: "courier new" , "courier" , monospace;">std_is_final_exists_detail::is_final</span> otherwise. We can wrap everything up in a utility class:</div>
<pre class="prettyprint">using std_is_final_exists=std::integral_constant<
bool,
std::hash<std_is_final_exists_detail::helper>::check()
>;
</pre>
<div style="text-align: left;">
and check with a program</div>
<pre class="prettyprint">#include <iostream>
int main()
{
std::cout<<"std_is_final_exists: "
<<std_is_final_exists::value<<"\n";
}
</pre>
<div style="text-align: left;">
that dutifully outputs</div>
<pre>std_is_final_exists: 0
</pre>
<div style="text-align: left;">
with <a href="http://coliru.stacked-crooked.com/a/697c9c7ead1c4711">GCC in <span style="font-family: "courier new" , "courier" , monospace;">-std=c++11</span></a> mode and</div>
<pre>std_is_final_exists: 1
</pre>
<div style="text-align: left;">
with <a href="http://coliru.stacked-crooked.com/a/22416d0e7347a51c"><span style="font-family: "courier new" , "courier" , monospace;">-std=c++14</span></a>. Clang and Visual Studio also handle this code properly.</div>
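<div style="text-align: left;">
The lookup rule behind the trick can be seen in isolation with ordinary functions. The following is a minimal sketch with made-up namespaces (fallback, lib_with and lib_without are hypothetical names, standing in for the detail namespace and for a standard library with and without the sought-after symbol):</div>

```cpp
#include <cassert>

namespace fallback{
constexpr int which(){return 0;} // stand-in for the fallback definition
}

namespace lib_with{
constexpr int which(){return 1;} // plays the role of an existing std::is_final
constexpr int probe()
{
  // names of fallback become visible as if declared at global level
  using namespace fallback;
  return which(); // lib_with::which is found first
}
}

namespace lib_without{
constexpr int probe()
{
  using namespace fallback;
  return which(); // nothing in lib_without: lookup reaches fallback::which
}
}

static_assert(lib_with::probe()==1,"own symbol wins over the fallback");
static_assert(lib_without::probe()==0,"fallback is picked when nothing closer exists");
```

<div style="text-align: left;">
No ambiguity arises in lib_with because unqualified lookup stops at the first enclosing scope where the name is found, and the names brought in by the using directive only show up one level further out.</div>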
<div style="text-align: left;">
(Updated Sep 7, 2016.) The same technique can be used to walk the last mile and implement an <span style="font-family: "courier new" , "courier" , monospace;">is_final</span> type trait class relying on <span style="font-family: "courier new" , "courier" , monospace;">std::is_final</span> but falling back to <span style="font-family: "courier new" , "courier" , monospace;">boost::is_final</span> if the former is not present. I've slightly changed naming and used <span style="font-family: "courier new" , "courier" , monospace;">std::is_void</span> for the specialization trick as it involves a little less typing.</div>
<pre class="prettyprint">#include <boost/type_traits/is_final.hpp>
#include <type_traits>
namespace my_lib{
namespace is_final_fallback{
template<typename T> using is_final=boost::is_final<T>;
struct hook{};
}}
namespace std{
template<>
struct is_void<::my_lib::is_final_fallback::hook>:
std::false_type
{
template<typename T>
static constexpr bool is_final_f()
{
using namespace ::my_lib::is_final_fallback;
return is_final<T>::value;
}
};
} /* namespace std */
namespace my_lib{
template<typename T>
struct is_final:std::integral_constant<
bool,
std::is_void<is_final_fallback::hook>::template is_final_f<T>()
>{};
} /* namespace my_lib */
</pre>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-40270230044903284172016-07-28T13:17:00.000+02:002018-01-26T10:29:42.481+01:00Passing capturing C++ lambda functions as function pointers<div style="text-align: left;">
Suppose we have a function accepting a C-style callback function like this:</div>
<pre class="prettyprint">void do_something(void (*callback)())
{
...
callback();
}
</pre>
<div style="text-align: left;">
As <i>captureless</i> <a href="http://en.cppreference.com/w/cpp/language/lambda">C++ lambda functions</a> can be cast to regular function pointers, the following works as expected:</div>
<pre class="prettyprint">auto callback=[](){std::cout<<"callback called\n";};
do_something(callback);
<span class="nocode"><b>output: callback called</b></span>
</pre>
<div style="text-align: left;">
Unfortunately, if our callback code captures some variable from the context, we are out of luck:</div>
<pre class="prettyprint">int num_callbacks=0;
...
auto callback=[&](){
std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
do_something(callback);
<span class="nocode"><b>error: cannot convert 'main()::<lambda>' to 'void (*)()'</b></span>
</pre>
<div style="text-align: left;">
because capturing lambda functions create a <i>closure</i> of the used context that needs to be carried around to the point of invocation. If we are allowed to modify <span style="font-family: "courier new" , "courier" , monospace;">do_something</span> we can easily circumvent the problem by accepting a more powerful <span style="font-family: "courier new" , "courier" , monospace;">std::function</span>-based callback: </div>
<pre class="prettyprint">void do_something(std::function<void()> callback)
{
...
callback();
}
int num_callbacks=0;
...
auto callback=[&](){
std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
do_something(callback);
<span class="nocode"><b>output: callback called 1 times</b></span>
</pre>
<div style="text-align: left;">
but we want to explore the challenge when this is not available (maybe because <span style="font-family: "courier new" , "courier" , monospace;">do_something</span> is legacy C code, or because we do not want to incur the runtime penalty associated with <span style="font-family: "courier new" , "courier" , monospace;">std::function</span>'s usage of dynamic memory). Typically, C-style callback APIs accept an additional callback argument through a type-erased <span style="font-family: "courier new" , "courier" , monospace;">void*</span>:</div>
<pre class="prettyprint">void do_something(void(*callback)(void*),void* callback_arg)
{
...
callback(callback_arg);
}
</pre>
<div style="text-align: left;">
and this is actually the only bit we need to force our capturing lambda function through <span style="font-family: "courier new" , "courier" , monospace;">do_something</span>. The gist of the trick is passing the lambda function as the callback argument and providing a captureless <a href="https://en.wikipedia.org/wiki/Thunk"><i>thunk</i></a> as the callback function pointer:</div>
<pre class="prettyprint">int num_callbacks=0;
...
auto callback=[&](){
std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
auto thunk=[](void* arg){ // note thunk is captureless
(*static_cast<decltype(callback)*>(arg))();
};
do_something(thunk,&callback);
<span class="nocode"><b>output: callback called 1 times</b></span>
</pre>
<div style="text-align: left;">
Note that we are not using dynamic memory nor doing any extra copying of the captured data, since <span style="font-family: "courier new" , "courier" , monospace;">callback</span> is accessed at the point of invocation through a pointer; so, this technique can be advantageous even if modern <span style="font-family: "courier new" , "courier" , monospace;">std::function</span>s could be used instead. The caveat is that the user code must make sure that captured data is still alive when the callback is invoked (which is not the case when execution happens after scope exit if, for instance, it is carried out in a different thread).</div>
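<div style="text-align: left;">
If the pattern is needed in several places, the thunk can be generated by a function template rather than written by hand each time. This is only a sketch: invoke_thunk, do_something_with and count_calls are hypothetical helper names, and do_something is reduced to a bare call for the sake of a self-contained example.</div>

```cpp
#include <cassert>

// C-style API as in the post: callback plus type-erased argument.
inline void do_something(void(*callback)(void*),void* callback_arg)
{
  callback(callback_arg);
}

// Each instantiation is a plain function, hence usable as a function pointer.
template<typename F>
void invoke_thunk(void* arg)
{
  (*static_cast<F*>(arg))(); // recover the callable and invoke it
}

// Passes any callable (capturing lambdas included) through do_something.
// f is taken by reference: no copy is made, and f must outlive the call.
template<typename F>
void do_something_with(F& f)
{
  do_something(&invoke_thunk<F>,&f);
}

// Small usage example: returns how many times the lambda was invoked.
inline int count_calls(int n)
{
  int num_callbacks=0;
  auto callback=[&](){++num_callbacks;};
  for(int i=0;i<n;++i)do_something_with(callback);
  return num_callbacks;
}
```

<div style="text-align: left;">
The same lifetime caveat applies: the helper only stores a pointer to the callable, so it is not suitable when the callback may run after the callable has gone out of scope.</div>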
<div style="text-align: left;">
<b>Postscript</b> </div>
<div style="text-align: left;">
<a href="https://www.reddit.com/r/cpp/comments/4v06wu/passing_capturing_c_lambda_functions_as_function/d5uce2g">Tcbrindle poses the issue</a> of lambda functions casting to function pointers with C++ linkage, where C linkage may be needed. Although this is rarely a problem in practice, it can be solved through another layer of indirection:</div>
<pre class="prettyprint">extern "C" void do_something(
void(*callback)(void*),void* callback_arg)
{
...
callback(callback_arg);
}
...
using callback_pair=std::pair<void(*)(void*),void*>;
extern "C" void call_thunk(void * arg)
{
callback_pair* p=static_cast<callback_pair*>(arg);
p->first(p->second);
}
...
int num_callbacks=0;
...
auto callback=[&](){
std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
auto thunk=[](void* arg){ // note thunk is captureless
(*static_cast<decltype(callback)*>(arg))();
};
callback_pair p{thunk,&callback};
do_something(call_thunk,&p);
<span class="nocode"><b>output: callback called 1 times</b></span>
</pre>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com8tag:blogger.com,1999:blog-2715968472735546962.post-40976698392033642902016-02-21T20:21:00.000+01:002016-02-22T21:26:31.615+01:00A formal definition of mutation independence<div style="text-align: left;">
Louis Dionne poses the problem of <a href="http://ldionne.com//2016/02/17/a-tentative-notion-of-move-independence/"><i>move independence</i></a> in the context of C++, that is, under which conditions a sequence of operations</div>
<pre class="prettyprint">f(std::move(x));
g(std::move(x));
</pre>
<div style="text-align: left;">
is sound in the sense that the first does not interfere with the second. We give here a functional definition for this property that can be applied to the case Louis discusses.</div>
<div style="text-align: left;">
Let <i>X</i> be some type and functions <i>f</i>: <i>X</i> <span class="st" data-hveid="49">→ <i>T</i></span><span class="_Tgc">×<i>X</i> and <i>g</i></span><span class="_Tgc">: <i>X</i> <span class="st" data-hveid="49">→ <i>Q</i></span><span class="_Tgc">×<i>X</i></span>. The impurity of a non-functional construct in an imperative language such as C++ is captured in this functional setting by the fact that these functions return, besides the output value itself, a new, possibly changed, value of <i>X</i>. We denote by <i>f<sub>T</sub></i> and <i>f<sub>X</sub></i> the projection of <i>f</i> onto <i>T</i> and <i>X</i>, respectively, and similarly for <i>g</i>. We say that <i>f does not affect g</i> if </span></div>
<div style="text-align: center;">
<i>g</i><span class="_Tgc"><i><sub>Q</sub></i></span>(<i>x</i>) = <i>g</i><span class="_Tgc"><i><sub>Q</sub></i></span>(<span class="_Tgc"><i>f<sub>X</sub></i></span>(<i>x</i>)) ∀<i>x</i>∈<i>X</i>.</div>
<div style="text-align: left;">
If we define the equivalence relation <span class="_Tgc">~<i><sub>g</sub></i></span> in <i>X</i> as</div>
<div style="text-align: center;">
<i>x</i> <span class="_Tgc">~<i><sub>g</sub></i></span> <i>y</i> iff <i>g</i><span class="_Tgc"><i><sub>Q</sub></i></span>(<i>x</i>) = <i>g</i><span class="_Tgc"><i><sub>Q</sub></i></span>(<i>y</i>),</div>
<div style="text-align: left;">
then <i>f</i> does not affect <i>g</i> iff</div>
<div style="text-align: center;">
<span class="_Tgc"><i>f<sub>X</sub></i></span>(<i>x</i>) <span class="_Tgc">~<i><sub>g</sub></i></span> <i>x</i> ∀<i>x</i>∈<i>X</i></div>
<div style="text-align: left;">
or</div>
<div style="text-align: left;">
<div style="text-align: center;">
<span class="_Tgc"><i>f<sub>X</sub></i></span>([<i>x</i>]<span class="_Tgc"><i><sub>g</sub></i></span>) ⊆ [<i>x</i>]<span class="_Tgc"><i><sub>g</sub></i></span> ∀<i>x</i>∈<i>X</i>,</div>
</div>
<div style="text-align: left;">
where [<i>x</i>]<span class="_Tgc"><i><sub>g</sub></i></span> is the equivalence class of <i>x</i> under <span class="_Tgc">~<i><sub>g</sub></i></span>.</div>
<div style="text-align: left;">
We say that <i>f</i> and <i>g</i> are <i>mutation-independent</i> if <i>f</i> does not affect <i>g</i> and <i>g</i> does not affect <i>f</i>, that is,</div>
<div style="text-align: center;">
<span class="_Tgc"><i>f<sub>X</sub></i></span>([<i>x</i>]<span class="_Tgc"><i><sub>g</sub></i></span>) ⊆ [<i>x</i>]<span class="_Tgc"><i><sub>g</sub></i></span> and <span class="_Tgc"><i>g<sub>X</sub></i></span>([<i>x</i>]<sub><i>f</i></sub>) ⊆ [<i>x</i>]<sub><i>f</i></sub> ∀<i>x</i>∈<i>X</i>.</div>
<div style="text-align: left;">
The following considers the case of <i>f</i> and <i>g</i> acting on separate components of a tuple: suppose that <i>X</i> = <i>X</i><span class="_Tgc"><sub>1</sub>×<i>X</i></span><sub><i>2</i></sub> and <i>f</i> and <i>g</i> depend on and mutate <i>X</i><span class="_Tgc"><sub>1</sub></span> and <i>X</i><span class="_Tgc"><sub>2</sub></span> alone, respectively, or put more formally:</div>
<div style="text-align: center;">
<span class="_Tgc"><i>f<sub>T</sub></i></span>(<i>x</i><span class="_Tgc"><sub>1</sub></span>,<i>x</i><span class="_Tgc"><sub>2</sub></span>) = <span class="_Tgc"><i>f<sub>T</sub></i></span>(<i>x</i><span class="_Tgc"><sub>1</sub></span>,<i>x'</i><span class="_Tgc"><sub>2</sub></span>),<br />
<span class="_Tgc"><i>f</i><sub><i>X</i><sub>2</sub></sub></span>(<i>x</i><span class="_Tgc"><sub>1</sub></span>,<i>x</i><span class="_Tgc"><sub>2</sub></span>) = <i>x</i><sub>2</sub>,<br />
<i>g</i><span class="_Tgc"><i><sub>Q</sub></i></span>(<i>x</i><span class="_Tgc"><sub>1</sub></span>,<i>x</i><span class="_Tgc"><sub>2</sub></span>) = <i>g</i><span class="_Tgc"><i><sub>Q</sub></i></span>(<i>x'</i><span class="_Tgc"><sub>1</sub></span>,<i>x</i><span class="_Tgc"><sub>2</sub></span>),<br />
<span class="_Tgc"><i>g</i><sub><i>X</i><sub>1</sub></sub></span>(<i>x</i><span class="_Tgc"><sub>1</sub></span>,<i>x</i><span class="_Tgc"><sub>2</sub></span>) = <i><i>x</i></i><sub>1</sub> </div>
<div style="text-align: left;">
for all <i><i>x</i></i><sub>1</sub>, <i><i>x'</i></i><sub>1</sub> ∈ <i>X</i><span class="_Tgc"><sub>1</sub></span>, <i><i>x</i></i><sub>2</sub>, <i><i>x'</i></i><sub>2</sub> ∈ <i>X</i><span class="_Tgc"><sub>2</sub></span>. Then <i>f</i> and <i>g</i> are mutation-independent (proof trivial). Getting back to C++, given a tuple <span style="font-family: "courier new" , "courier" , monospace;">x</span>, two operations of the form:</div>
<pre class="prettyprint">f(std::get<i>(std::move(x)));
g(std::get<j>(std::move(x)));
</pre>
<div style="text-align: left;">
are mutation-independent if <span style="font-family: "courier new" , "courier" , monospace;">i!=j</span>; this can be extended to the case where <span style="font-family: "courier new" , "courier" , monospace;">f</span> and <span style="font-family: "courier new" , "courier" , monospace;">g</span> read from (but not write to) any component of <span style="font-family: "courier new" , "courier" , monospace;">x</span> except the <span style="font-family: "courier new" , "courier" , monospace;">j</span>-th and <span style="font-family: "courier new" , "courier" , monospace;">i</span>-th, respectively.</div>
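<div style="text-align: left;">
A concrete instance of the tuple case might look as follows. This is an illustrative sketch only: consume_first, consume_second and run are hypothetical names, each consumer moves from (and therefore mutates) just the element it is handed, and the other element is left untouched.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Consumes the string it is given and returns its length.
std::size_t consume_first(std::string&& s)
{
  std::string local=std::move(s); // mutates only the first component
  return local.size();
}

// Consumes the vector it is given and returns the sum of its elements.
int consume_second(std::vector<int>&& v)
{
  std::vector<int> local=std::move(v); // mutates only the second component
  int sum=0;
  for(int n:local)sum+=n;
  return sum;
}

std::pair<std::size_t,int> run()
{
  std::tuple<std::string,std::vector<int>> x(
    std::string("hello"),std::vector<int>{1,2,3});
  // x is "moved" twice, but each call touches a different component
  std::size_t len=consume_first(std::get<0>(std::move(x)));
  int sum=consume_second(std::get<1>(std::move(x)));
  return {len,sum};
}
```

<div style="text-align: left;">
Swapping the two calls in run yields the same pair of results, which is what mutation independence amounts to in this case.</div>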
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-17608491840301879022016-01-11T22:58:00.002+01:002016-01-13T11:47:38.515+01:00(Oil+tax)-free Spanish gas prices 2014-15<div style="text-align: left;">
We use the data gathered for our hysteresis analysis of Spanish gas prices for <a href="http://bannalia.blogspot.com/2015/01/gas-price-hysteresis-spain-2014.html">2014</a> and <a href="http://bannalia.blogspot.com/2016/01/gas-price-hysteresis-spain-2015.html">2015</a> to gain further insight into their dynamics. This is a simple breakdown of gas (or gasoil) price:</div>
<div style="text-align: center;">
Price = oil cost + other costs + taxes + margin.</div>
<div style="text-align: left;">
A barrel of crude oil is refined into several final products totalling approximately the same amount of volume, that is, it takes roughly one liter of crude oil to produce one liter of gas (or gasoil). The simplest allocation model is to use market Brent prices as the oil cost for fuel production (we will see more realistic models later). If we eliminate taxes and oil cost, what remains in the fuel price is other costs plus margin. We plot this number for 95 octane gas and gasoil compared with Brent oil price, all in c€/l, for the period 2014-2015:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-Jh0bIRWyrmc/VpQ08QoWaMI/AAAAAAAABZQ/sdpNcVhb4Rg/s1600/oil_tax_free_prices.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="265" src="http://3.bp.blogspot.com/-Jh0bIRWyrmc/VpQ08QoWaMI/AAAAAAAABZQ/sdpNcVhb4Rg/s1600/oil_tax_free_prices.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>(Oil+tax)-free fuel price, simple cost allocation model [c€/l]<br />
Brent oil cost [c€/l]</b></td></tr>
</tbody></table>
<div style="text-align: left;">
When we factor out crude oil cost, the remaining parts of the price increase moderately (~25% for gasoline, ~15% for gasoil). In a scenario of oil price reduction, oil direct costs as a percentage of tax-free fuel prices have consequently dropped from 70% to 50%:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-jHwqyidgnVQ/VpQ0PkZyV7I/AAAAAAAABZI/7oMst7pfN8A/s1600/oil_cost_contribution.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="265" src="http://2.bp.blogspot.com/-jHwqyidgnVQ/VpQ0PkZyV7I/AAAAAAAABZI/7oMst7pfN8A/s1600/oil_cost_contribution.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>Oil direct cost / tax-free fuel price, </b><b>simple cost allocation model</b></td></tr>
</tbody></table>
<div style="text-align: left;">
<b>Value-based cost allocation</b></div>
<div style="text-align: left;">
Crude oil is refined into several final products from high-quality fuel to asphalt, plastic etc. The <a href="http://www.eia.gov/">EIA</a> provides typical <a href="http://www.eia.gov/dnav/pet/pet_pnp_pct_dc_nus_pct_a.htm">yield data for US refineries</a> that we can use as a reasonable approximation to the Spanish case. The volume breakdown we are interested in is roughly:</div>
<div style="text-align: left;">
<ul>
<li>Gasoline: 45%</li>
<li>Gasoil: 30%</li>
<li>Other products: 37% </li>
</ul>
</div>
<div style="text-align: left;">
(Note that the sum is greater than 100% because additional components are mixed in the process). Now, as these products have very different prices in the market, it is natural to allocate oil costs proportionally to end-user value:</div>
<div style="text-align: center;">
price<sub>total</sub> = 45% price<sub>gasoline</sub> + 30% price<sub>gasoil</sub> + 37% price<sub>other</sub>,<br />
cost<sub>gasoline</sub> = cost<sub>oil</sub> × price<sub>gasoline</sub> / price<sub>total</sub>,<br />
cost<sub>gasoil</sub> = cost<sub>oil</sub> × price<sub>gasoil</sub> / price<sub>total</sub></div>
<div style="text-align: left;">
(prices without taxes). Since it is difficult to obtain accurate data on prices for the remaining products, we consider two conventional scenarios where these products are valued at 50% and 25% of the average fuel price, respectively:</div>
<div style="text-align: left;">
<ul>
<li>A: price<sub>other</sub> = 50% (price<sub>gasoline</sub> + price<sub>gasoil</sub>)/2</li>
<li>B: price<sub>other</sub> = 25% (price<sub>gasoline</sub> + price<sub>gasoil</sub>)/2</li>
</ul>
</div>
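<div style="text-align: left;">
The allocation formulas and the two scenarios translate directly into code. The sketch below is illustrative only: the struct and function names are ours, and any prices fed to it are placeholders rather than data from the sources used in this article.</div>

```cpp
#include <cassert>
#include <cmath>

// Oil cost allocated to one liter of each fuel [c€/l].
struct allocation
{
  double cost_gasoline;
  double cost_gasoil;
};

// cost_oil: crude oil cost [c€/l]; prices are tax-free [c€/l];
// other_ratio: value of other products relative to the average fuel price
// (0.50 for scenario A, 0.25 for scenario B).
allocation allocate_oil_cost(
  double cost_oil,double price_gasoline,double price_gasoil,double other_ratio)
{
  const double price_other=other_ratio*(price_gasoline+price_gasoil)/2;
  // value obtained from one liter of crude, using the yield figures above
  const double price_total=
    0.45*price_gasoline+0.30*price_gasoil+0.37*price_other;
  return {
    cost_oil*price_gasoline/price_total,
    cost_oil*price_gasoil/price_total
  };
}
```

<div style="text-align: left;">
By construction, the ratio of allocated oil cost to price is the same for both fuels, whatever the input values.</div>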
<div style="text-align: left;">
The figure depicts resulting prices without oil costs or taxes (i.e. other costs plus margin):</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-C3vWSuFch8M/VpVIO5NYCQI/AAAAAAAABZk/xNjKi1-sXvA/s1600/oil_tax_free_prices_value_cost_allocation.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="260" src="http://1.bp.blogspot.com/-C3vWSuFch8M/VpVIO5NYCQI/AAAAAAAABZk/xNjKi1-sXvA/s1600/oil_tax_free_prices_value_cost_allocation.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>(Oil+tax)-free fuel price, value-based cost allocation [c€/l]<br />
Brent oil cost [c€/l]</b></td></tr>
</tbody></table>
<div style="text-align: left;">
Unlike with our previous, naïve allocation model, here we see, both in scenarios A and B, that margins for gasoline and gasoil match very precisely almost all the time: this can be seen as further indication that value-based cost allocation is indeed the model used by gas companies themselves. Visual inspection reveals two insights:</div>
<div style="text-align: left;">
<ul>
<li>Short-term, margin fluctuations are countercyclical to oil price. This might be due to an effort from companies to stabilize prices.</li>
<li>In the two-year period studied, margins grow <i>very much</i>, around 30% for scenario A and 60% for scenario B. This trend has been somewhat corrected in the second half of 2015, though.</li>
</ul>
</div>
<div style="text-align: left;">
The percentage contribution of oil costs to fuel prices (which is by virtue of the cost allocation model exactly the same for gasoline and gasoil) drops in 2014-15 from 75% to 55% (scenario A) and from 85% to 60% (scenario B).</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-1lHZBlXpDBg/VpVPsH3jd0I/AAAAAAAABZ0/TAXgCKwJZ6I/s1600/oil_cost_contribution_value_cost_allocation.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="http://4.bp.blogspot.com/-1lHZBlXpDBg/VpVPsH3jd0I/AAAAAAAABZ0/TAXgCKwJZ6I/s1600/oil_cost_contribution_value_cost_allocation.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>Oil direct cost / tax-free fuel price, </b><b>value-based cost allocation</b></td></tr>
</tbody></table>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-59554011954049385532016-01-11T20:53:00.000+01:002016-01-11T20:53:58.985+01:00Gas price hysteresis, Spain 2015<div style="text-align: left;">
We begin the new year redoing our <a href="http://bannalia.blogspot.com/2015/01/gas-price-hysteresis-spain-2014.html">hysteresis analysis</a> for Spanish gas prices with data from 2015, obtained from the usual sources:</div>
<div style="text-align: left;">
<ul>
<li>Retail gasoline and gasoil prices from the <a href="http://ec.europa.eu/energy/observatory/oil/bulletin_en.htm">European Commission Oil Bulletin</a>.</li>
<li>Brent oil spot prices from the <a href="http://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RBRTE&f=D">US Energy Information Administration</a>.</li>
<li>Euro to dollar exchange rates from <a href="http://www.oanda.com/convert/fxhistory">FXHistory</a>.</li>
</ul>
</div>
<div style="text-align: left;">
The figure shows the weekly evolution during 2015 of prices of Brent oil and average retail prices without taxes of 95 octane gas and gasoil in Spain, all in c€ per liter.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-OOmXzMf8LoY/VpO2gyrsAvI/AAAAAAAABXo/t4br53xOHsc/s1600/gas_prices.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="270" src="http://1.bp.blogspot.com/-OOmXzMf8LoY/VpO2gyrsAvI/AAAAAAAABXo/t4br53xOHsc/s1600/gas_prices.png" width="400" /></a></div>
<div style="text-align: left;">
For gasoline, the corresponding scatter plot of Δ(gasoline price before taxes) against Δ(Brent price) is</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-qMxCibRf2c8/VpO3JFn-umI/AAAAAAAABXw/_0_H5-FphF8/s1600/dgasoline_dbrent.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="http://1.bp.blogspot.com/-qMxCibRf2c8/VpO3JFn-umI/AAAAAAAABXw/_0_H5-FphF8/s1600/dgasoline_dbrent.png" width="393" /></a></div>
<div style="text-align: left;">
with linear regressions for the entire graph and both semiplanes Δ(Brent price) ≥ 0 and ≤ 0, given by</div>
<div style="text-align: center;">
overall → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span><sub></sub>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span><sub></sub> + <span style="font-style: italic;">m</span><span style="font-style: italic;">x</span> = −0.1210 + 0.2554<span style="font-style: italic;">x,</span><br />
ΔBrent ≥ 0 → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span><sub>+</sub>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span><sub>+</sub> + <span style="font-style: italic;">m</span><sub>+</sub><span style="font-style: italic;">x</span> = 0.2866 − 0.0824<span style="font-style: italic;">x</span>,<br />
ΔBrent ≤ 0 → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span><sub>−</sub>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span><sub>−</sub> + <span style="font-style: italic;">m</span><sub>−</sub><span style="font-style: italic;">x</span> = 0.3552 + 0.4040<span style="font-style: italic;">x</span>.</div>
<div style="text-align: left;">
Due to the outlier in the lower right corner (dated August 31), positive variations in oil price don't translate, on average, into positive increments in the price of gasoline. The most worrisome aspect is the fact that <span style="font-style: italic;">b</span><sub>+</sub> and <span style="font-style: italic;">b</span><sub>−</sub> are positive, which suggests an underlying trend to increase prices when oil is stable.</div>
<div style="text-align: left;">
For gasoil we have</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-b_B46jw-FDM/VpPBMeWbCYI/AAAAAAAABYE/J5vK2k0jATE/s1600/dgasoil_dbrent.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="http://1.bp.blogspot.com/-b_B46jw-FDM/VpPBMeWbCYI/AAAAAAAABYE/J5vK2k0jATE/s1600/dgasoil_dbrent.png" width="393" /></a></div>
<div style="text-align: left;">
with regressions</div>
<div style="text-align: center;">
overall → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span> + <span style="font-style: italic;">m</span><span style="font-style: italic;">x</span> = −0.0672 + 0.3538<span style="font-style: italic;">x,</span><br />
ΔBrent ≥ 0 → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span><sub>+</sub>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span><sub>+</sub> + <span style="font-style: italic;">m</span><sub>+</sub><span style="font-style: italic;">x</span> = −0.2457 + 0.2013<span style="font-style: italic;">x</span>,<br />
ΔBrent ≤ 0 → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span><sub>−</sub>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span><sub>−</sub> + <span style="font-style: italic;">m</span><sub>−</sub><span style="font-style: italic;">x</span> = 0.2468 + 0.3956<span style="font-style: italic;">x</span>.</div>
<div style="text-align: left;">
Again, no "rocket and feather" effect here (in fact, <span style="font-style: italic;">m</span><sub>+</sub> is slightly smaller than <span style="font-style: italic;">m</span><sub>−</sub>). Variations around ΔBrent = 0 are fairly symmetrical and, seemingly, fair.</div>
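<div style="text-align: left;">
The regressions per semiplane can be reproduced with an ordinary least squares fit restricted to the points with ΔBrent ≥ 0 or ΔBrent ≤ 0. The following is only a minimal sketch, not the code used to produce the figures above: the names <code>line</code> and <code>fit</code> are made up for illustration, and the input is assumed to be a list of (ΔBrent, Δprice) pairs.</div>

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Regression line y = b + m*x.
struct line{double b,m;};

// Ordinary least squares fit over the points whose x value is accepted by
// pred (for instance, x >= 0 for the "ΔBrent ≥ 0" semiplane).
template<typename Pred>
line fit(const std::vector<std::pair<double,double>>& pts,Pred pred)
{
  double n=0,sx=0,sy=0,sxx=0,sxy=0;
  for(const auto& p:pts){
    if(!pred(p.first))continue;
    n+=1;
    sx+=p.first;
    sy+=p.second;
    sxx+=p.first*p.first;
    sxy+=p.first*p.second;
  }
  double den=n*sxx-sx*sx;
  assert(den!=0); // at least two distinct x values are needed
  double m=(n*sxy-sx*sy)/den;
  return {(sy-m*sx)/n,m};
}
```

<div style="text-align: left;">
Calling <code>fit</code> three times with predicates that accept every <i>x</i>, only <i>x</i> ≥ 0 and only <i>x</i> ≤ 0 yields the overall, <i>f</i><sub>+</sub> and <i>f</i><sub>−</sub> lines, respectively.</div>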
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-51922032840858645462015-12-28T19:54:00.000+01:002015-12-29T10:12:39.306+01:00How likely?<div style="text-align: left;">
Yesterday, the <a href="http://cup.cat/">CUP</a> political party held a general assembly to determine whether or not to support <a href="https://en.wikipedia.org/wiki/Artur_Mas_i_Gavarr%C3%B3">Artur Mas's</a> candidacy for President of the Catalonian regional government. The final voting round among 3,030 representatives ended in an <a href="http://www.catalannewsagency.com/politics/item/cup-s-base-fails-to-reach-decision-on-mas-investiture">exact 1,515/1,515 tie</a>, leaving the question unresolved for the time being. Such an unexpected result has prompted a flurry of Internet activity about the mathematical probability of its occurrence.</div>
<div style="text-align: left;">
The question "how likely was this result to happen?" is of course unanswerable without a specification of the context (i.e. the probability space) we choose to frame the event. A plausible formulation is:</div>
<div style="text-align: center;">
If a proportion <i>p</i> of CUP voters are pro-Mas, how likely is it that a random sample based on 3,030 individuals yields a 50/50 tie?</div>
<div style="text-align: left;">
The simple answer (assuming the number of CUP voters is much larger than 3,030) is <i>P<sub>p</sub></i>(1,515 | 3,030), where <i>P<sub>p</sub></i>(<i>n</i> | <i>N</i>) is the <a href="http://mathworld.wolfram.com/BinomialDistribution.html">binomial distribution</a> of <i>N</i> Bernoulli trials with probability <i>p</i> resulting in exactly <i>n</i> successes.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-S7dWIHkbz-Y/VoF7IzeaSBI/AAAAAAAABWo/rr-QJYncKFI/s1600/prob_tie.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="293" src="http://1.bp.blogspot.com/-S7dWIHkbz-Y/VoF7IzeaSBI/AAAAAAAABWo/rr-QJYncKFI/s1600/prob_tie.png" width="400" /></a></div>
<div style="text-align: left;">
The figure shows this value for 40% ≤ <i>p</i> ≤ 60%. At <i>p</i> = 50%, which without further information is our best estimation of pro-Mas supporters among CUP voters, the probability of a tie is 1.45%. A deviation in <i>p</i> of ±4% would have made this result virtually impossible.</div>
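<div style="text-align: left;">
The tie probability can be checked numerically. The sketch below is an assumption about how one might compute it, not the procedure used for the figure: the hypothetical helper <code>binomial_pmf</code> evaluates the binomial probability in log space (via <code>std::lgamma</code>) so that the huge binomial coefficient does not overflow.</div>

```cpp
#include <cassert>
#include <cmath>

// Probability of exactly n successes in N Bernoulli trials with success
// probability p (0 < p < 1), computed in log space to avoid overflow.
inline double binomial_pmf(int n,int N,double p)
{
  assert(0<=n&&n<=N&&0.0<p&&p<1.0);
  double log_coeff=
    std::lgamma(N+1.0)-std::lgamma(n+1.0)-std::lgamma(N-n+1.0);
  return std::exp(log_coeff+n*std::log(p)+(N-n)*std::log(1.0-p));
}
```

<div style="text-align: left;">
With this helper, <code>binomial_pmf(1515,3030,0.5)</code> gives the probability of a tie at <i>p</i> = 50%, and evaluating it at <i>p</i> = 46% or 54% shows how quickly the value collapses away from the center.</div>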
<div style="text-align: left;">
A slightly more interesting question is the following:</div>
<div style="text-align: center;">
If a proportion <i>p</i> of CUP voters are pro-Mas, how likely is a random sample of 3,030 individuals to misestimate the majority opinion? </div>
<div style="text-align: left;">
When <i>p</i> is in the vicinity of 50%, there is a non-negligible probability that the assembly vote come up with the wrong (i.e. against voters' wishes) result. This probability is</div>
<div style="text-align: center;">
<i>I<sub>p</sub></i>(1,516, 1,515) if <i>p</i> < 50%,<br />
1 − <i>P<sub>p</sub></i>(1,515 | 3,030) if <i>p</i> = 50%, <br />
<i>I</i><sub>1−</sub><i><sub>p</sub></i>(1,516, 1,515) if <i>p</i> > 50%,</div>
<div style="text-align: left;">
where <i>I<sub>p</sub></i>(<i>a</i>,<i>b</i>) is the <a href="http://mathworld.wolfram.com/RegularizedBetaFunction.html">regularized beta function</a>. The figure shows the corresponding graph for 3,030 representatives and 40% ≤ <i>p</i> ≤ 60%.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-upeGciahvjU/VoGEu0ixmdI/AAAAAAAABXE/r_QZwmG_GqU/s1600/prob_misestimation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="293" src="http://1.bp.blogspot.com/-upeGciahvjU/VoGEu0ixmdI/AAAAAAAABXE/r_QZwmG_GqU/s1600/prob_misestimation.png" width="400" /></a></div>
<div style="text-align: left;">
The function shows a discontinuity at the singular (and zero-probability) event <i>p</i> = 50%, in which case the assembly will yield the wrong result always except for the previously studied situation that there is an exact tie (so, the probability of misestimation is 1 − 1.45% = 98.55 %). Other than this, the likelihood of misestimation approaches 49%+ as <i>p</i> tends to 50%. We have learnt that CUP voters are almost evenly divided between pro- and anti-Mas: if the difference between both positions is 0.7% or less, an assembly of 3,030 representatives such as held yesterday will fail to reflect the party's global position in more than 1 out of 5 cases.</div>
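<div style="text-align: left;">
The misestimation probability can also be obtained without the regularized beta function by summing binomial probabilities directly, since <i>I<sub>p</sub></i>(1,516, 1,515) is the probability of 1,516 or more successes in 3,030 trials. The code below is a sketch under that assumption; <code>binomial_pmf</code> and <code>misestimation</code> are hypothetical names, and <i>N</i> is assumed to be even.</div>

```cpp
#include <cassert>
#include <cmath>

// Probability of exactly n successes in N Bernoulli trials with success
// probability p (0 < p < 1), computed in log space to avoid overflow.
inline double binomial_pmf(int n,int N,double p)
{
  assert(0<=n&&n<=N&&0.0<p&&p<1.0);
  double log_coeff=
    std::lgamma(N+1.0)-std::lgamma(n+1.0)-std::lgamma(N-n+1.0);
  return std::exp(log_coeff+n*std::log(p)+(N-n)*std::log(1.0-p));
}

// Probability that an assembly of N members (N even) fails to reflect the
// majority opinion when a proportion p of voters is in favor.
inline double misestimation(int N,double p)
{
  assert(N%2==0);
  // p = 50%: anything other than an exact tie is a wrong result.
  if(p==0.5)return 1.0-binomial_pmf(N/2,N,p);
  // By symmetry, work with the minority proportion q: the assembly is
  // wrong when the minority side gets N/2+1 votes or more.
  double q=p<0.5?p:1.0-p;
  double res=0.0;
  for(int n=N/2+1;n<=N;++n)res+=binomial_pmf(n,N,q);
  return res;
}
```

<div style="text-align: left;">
For instance, <code>misestimation(3030,0.493)</code> evaluates the case where <i>p</i> is 0.7% below 50%.</div>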
<div style="text-align: left;">
</div>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-27759243578485569492015-11-14T22:02:00.001+01:002015-11-15T14:36:44.153+01:00SOA container for encapsulated C++ DOD<div style="text-align: left;">
In a <a href="http://bannalia.blogspot.com/2015/09/c-encapsulation-for-data-oriented.html">previous entry</a> we saw how to decouple the logic of a class from the access to its member data so that the latter can be laid out in a <a href="https://www.youtube.com/watch?v=rX0ItVEVjHc">DOD</a>-friendly fashion for faster sequential processing. Instead of having a <span style="font-family: "courier new" , "courier" , monospace;">std::vector</span> of, say, particles, now we can store the different particle members (position, velocity, etc.) in separate containers. This unfortunately results in more cumbersome initialization code: whereas for the traditional, OOP approach particle creation and access is compact and nicely localized:</div>
<pre class="prettyprint">std::vector<plain_particle> pp_;
...
for(std::size_t i=0;i<n;++i){
pp_.push_back(plain_particle(...));
}
...
render(pp_.begin(),pp_.end());
</pre>
<div style="text-align: left;">
when using DOD, in contrast, the equivalent code grows linearly with the number of members, even if most of it is boilerplate: </div>
<pre class="prettyprint">std::vector<char> color_;
std::vector<int> x_,y_,dx_,dy_;
...
for(std::size_t i=0;i<n;++i){
color_.push_back(...);
x_.push_back(...);
y_.push_back(...);
dx_.push_back(...);
dy_.push_back(...);
}
...
auto beg_=make_pointer<particle>(
access<color,x,y,dx,dy>(&color_[0],&x_[0],&y_[0],&dx_[0],&dy_[0]));
auto end_=beg_+n;
render(beg_,end_);
</pre>
<div style="text-align: left;">
We would like to rely on a container using SOA (<i>structure of arrays</i>) for its storage that allows us to retain our original OOP syntax:</div>
<pre class="prettyprint">using access=dod::access<color,x,y,dx,dy>;
dod::vector<particle<access>> p_;
...
for(std::size_t i=0;i<n;++i){
p_.emplace_back(...);
}
...
render(p_.begin(),p_.end());
</pre>
<div style="text-align: left;">
Note that particles are inserted into the container using <span style="font-family: "courier new" , "courier" , monospace;">emplace_back</span> rather than <span style="font-family: "courier new" , "courier" , monospace;">push_back</span>: this is due to the fact that a <span style="font-family: "courier new" , "courier" , monospace;">particle</span> object (which <span style="font-family: "courier new" , "courier" , monospace;">push_back</span> accepts as its argument) cannot be created out of the blue without its constituent members being previously stored somewhere; <span style="font-family: "courier new" , "courier" , monospace;">emplace_back</span>, on the other hand, does not suffer from this chicken-and-egg problem.</div>
<div style="text-align: left;">
The implementation of such a container class is fairly straightforward (limited here to the operations required to make the previous code work):</div>
<pre class="prettyprint">namespace dod{
template<typename Access>
class vector_base;
template<>
class vector_base<access<>>
{
protected:
access<> data(){return {};}
void emplace_back(){}
};
template<typename Member0,typename... Members>
class vector_base<access<Member0,Members...>>:
protected vector_base<access<Members...>>
{
using super=vector_base<access<Members...>>;
using type=typename Member0::type;
using impl=std::vector<type>;
using size_type=typename impl::size_type;
impl v;
protected:
access<Member0,Members...> data()
{
return {v.data(),super::data()};
}
size_type size()const{return v.size();}
template<typename Arg0,typename... Args>
void emplace_back(Arg0&& arg0,Args&&... args){
v.emplace_back(std::forward<Arg0>(arg0));
try{
super::emplace_back(std::forward<Args>(args)...);
}
catch(...){
v.pop_back();
throw;
}
}
};
template<typename T> class vector;
template<template <typename> class Class,typename Access>
class vector<Class<Access>>:protected vector_base<Access>
{
using super=vector_base<Access>;
public:
using iterator=pointer<Class<Access>>;
iterator begin(){return super::data();}
iterator end(){return this->begin()+super::size();}
using super::emplace_back;
};
} // namespace dod
</pre>
<div style="text-align: left;">
<span style="font-family: "courier new" , "courier" , monospace;">dod::vector<Class<Members...>></span> derives from an implementation class that holds a <span style="font-family: "courier new" , "courier" , monospace;">std::vector</span> for each of the <span style="font-family: "courier new" , "courier" , monospace;">Members</span> declared. Inserting elements is just a simple matter of multiplexing to the vectors, and <span style="font-family: "courier new" , "courier" , monospace;">begin</span> and <span style="font-family: "courier new" , "courier" , monospace;">end</span> return <span style="font-family: "courier new" , "courier" , monospace;">dod::pointer</span>s to this structure of arrays. From the point of view of the user all the necessary magic is hidden by the framework and DOD processing becomes nearly identical in syntax to OOP.</div>
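<div style="text-align: left;">
To make the multiplexing idea concrete outside of the framework, here is a much simplified and purely illustrative variant (C++17) that keeps one <code>std::vector</code> per member inside a <code>std::tuple</code>. The class <code>soa_vector</code> and its <code>get</code> member are hypothetical and not part of the code above; unlike <code>dod::vector</code>, this sketch neither returns proxy pointers nor rolls back when an insertion throws halfway.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical, simplified SOA container: one std::vector per member type.
template<typename... Ts>
class soa_vector
{
  std::tuple<std::vector<Ts>...> cols;

public:
  std::size_t size()const{return std::get<0>(cols).size();}

  // Multiplexes one argument to each underlying vector. No rollback is
  // attempted if one of the insertions throws.
  template<typename... Args>
  void emplace_back(Args&&... args)
  {
    static_assert(sizeof...(Args)==sizeof...(Ts),"one argument per member");
    std::apply([&](auto&... v){
      (v.emplace_back(std::forward<Args>(args)),...);
    },cols);
  }

  // Accesses the I-th member of the i-th element.
  template<std::size_t I>
  auto& get(std::size_t i)
  {
    assert(i<size());
    return std::get<I>(cols)[i];
  }
};
```

<div style="text-align: left;">
A <code>soa_vector<char,int,int></code> can then be filled with <code>emplace_back('a',1,2)</code> and queried with <code>get<1>(0)</code>, while each member stays contiguous in its own vector.</div>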
<div style="text-align: left;">
We provide a <a href="https://www.dropbox.com/s/q4zbrvtxpymi8sk/dod_vector.cpp?dl=0">test program</a> that exercises <span style="font-family: "courier new" , "courier" , monospace;">dod::vector</span> against the classical OOP approach based on a <span style="font-family: "courier new" , "courier" , monospace;">std::vector</span> of plain (i.e., non DOD) particles. Results are the same as <a href="http://bannalia.blogspot.com/2015/09/c-encapsulation-for-data-oriented.html">previously discussed</a> when we used DOD with manual initialization, that is, there is no abstraction penalty associated to using <span style="font-family: "courier new" , "courier" , monospace;">dod::vector</span>, so we won't present any additional figures here. </div>
<div style="text-align: left;">
The framework we have constructed so far provides the bare minimum needed to test the ideas presented. In order to be fully usable there are various aspects that should be expanded upon:</div>
<div style="text-align: left;">
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">access<Members...></span> just considers the case where each member is stored separately. Sometimes the most efficient layout will call for mixed scenarios where some of the members are grouped together. This can be modelled, for instance, by having <span style="font-family: "courier new" , "courier" , monospace;">member</span> accept multiple pieces of data in its declaration.</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">dod::pointer</span> does not properly implement const access, that is, <span style="font-family: "courier new" , "courier" , monospace;">pointer<const particle<...>></span> does not compile.</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">dod::vector</span> should be implemented to provide the full interface of a proper vector class. </li>
</ul>
</div>
<div style="text-align: left;">
All of this can in principle be tackled without serious design difficulties.</div>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com3tag:blogger.com,1999:blog-2715968472735546962.post-71643048462615774532015-09-06T12:13:00.000+02:002015-09-09T08:27:20.819+02:00C++ encapsulation for Data-Oriented Design: performance<div style="text-align: left;">
(Many thanks to <a href="https://plus.google.com/+ManuS%C3%A1nchezManu343726/posts">Manu Sánchez</a> for his help with running tests and analyzing results.)</div>
<div style="text-align: left;">
In a <a href="http://bannalia.blogspot.com/2015/08/c-encapsulation-for-data-oriented-design.html">past entry</a>, we implemented a little C++ framework that allows us to do <a href="https://www.youtube.com/watch?v=rX0ItVEVjHc">DOD</a> while retaining some of the encapsulation benefits and the general look and feel of traditional object-based programming. We complete here the framework by adding a critical piece from the point of view of usability, namely the ability to process sequences of DOD entities with as terse a syntax as we would have in OOP.</div>
<div style="text-align: left;">
To enable DOD for a particular class (like the <span style="font-family: "Courier New",Courier,monospace;">particle</span> we used in the previous entry), i.e., to distribute its different data members in separate memory locations, we change the class source code to turn it into a class template <span style="font-family: "Courier New",Courier,monospace;">particle<Access></span> where <span style="font-family: "Courier New",Courier,monospace;">Access</span> is a framework-provided entity in charge of granting access to the external data members with a similar syntax as if they were an integral part of the class itself. Now, <span style="font-family: "Courier New",Courier,monospace;">particle<Access></span> is no longer a regular class with <a href="https://en.wikipedia.org/wiki/Value_semantics"><i>value semantics</i></a>, but a mere proxy to the external data without ownership to it. Importantly, it is the members and not the <span style="font-family: "Courier New",Courier,monospace;">particle</span> objects that are stored: <span style="font-family: "Courier New",Courier,monospace;">particle</span>s are constructed on the fly when needed to use its interface in order to process the data. So, code like</div>
<pre class="prettyprint">for(const auto& p:particle_)p.render();
</pre>
<div style="text-align: left;">
cannot possibly work because the application does not have any <span style="font-family: "Courier New",Courier,monospace;">particle_</span> container to begin with: instead, the information is stored in separate locations:</div>
<pre class="prettyprint">std::vector<char> color_;
std::vector<int> x_,y_,dx_,dy_;
</pre>
<div style="text-align: left;">
and "traversing" the particles requires that we go through the associated containers in parallel and invoke <span style="font-family: "Courier New",Courier,monospace;">render</span> on a temporary <span style="font-family: "Courier New",Courier,monospace;">particle</span> object constructed out of them:</div>
<pre class="prettyprint">auto itc=&color_[0],ec=itc+color_.size();
auto itx=&x_[0];
auto ity=&y_[0];
auto itdx=&dx_[0];
auto itdy=&dy_[0];
for(;itc!=ec;++itc,++itx,++ity,++itdx,++itdy){
auto p=make_particle(
access<color,x,y,dx,dy>(itc,itx,ity,itdx,itdy));
p.render();
}
</pre>
<div style="text-align: left;">
Fortunately, this boilerplate code can be hidden by the framework by using these auxiliary constructs:</div>
<pre class="prettyprint">template<typename T> class pointer;
template<template <typename> class Class,typename Access>
class pointer<Class<Access>>
{
// behaves as Class<Access>*
};
template<template <typename> class Class,typename Access>
pointer<Class<Access>> make_pointer(const Access& a)
{
return pointer<Class<Access>>(a);
}
</pre>
<div style="text-align: left;">
We won't delve into the implementation details of <span style="font-family: "Courier New",Courier,monospace;">pointer</span> (the interested reader can see the actual code in the test program given below): from the point of view of the user, this utility class accepts an access entity, which is a collection of pointers to the data members plus an offset member (this offset has been added to the former version of the framework), it keeps everything in sync when doing pointer arithmetic and dereferences to a temporary <span style="font-family: "Courier New",Courier,monospace;">particle</span> object. The resulting user code is as simple as it gets:</div>
<pre class="prettyprint">auto n=color_.size();
auto beg_=make_pointer<particle>(access<color,x,y,dx,dy>(
&color_[0],&x_[0],&y_[0],&dx_[0],&dy_[0]));
auto end_=beg_+n;
for(auto it=beg_;it!=end_;++it)it->render();
</pre>
<div style="text-align: left;">
Index-based traversal is also possible:</div>
<pre class="prettyprint">for(std::size_t i=0;i<n;++i)beg_[i].render();
</pre>
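<div style="text-align: left;">
For readers curious about what a pointer-like class of this kind involves, the following is a reduced sketch for a two-member entity. It is not the implementation from the test program: all names (<code>xy_access</code>, <code>point</code>, <code>soa_pointer</code>, <code>total</code>) are hypothetical, and the member pointers are advanced directly instead of being combined with an offset as the framework does.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical access entity: raw pointers into two separate arrays.
struct xy_access
{
  const int* x;
  const int* y;
};

// Proxy class in the style of particle<Access>: it owns no data.
template<typename Access>
class point
{
  Access a;
public:
  explicit point(const Access& a_):a(a_){}
  int sum()const{return *a.x+*a.y;}
};

// Pointer-like object that moves all member pointers in sync and
// dereferences to a temporary proxy.
template<typename Access>
class soa_pointer
{
  Access a;

  // Keeps the temporary proxy alive for the duration of it->f().
  struct arrow_proxy
  {
    point<Access> p;
    const point<Access>* operator->()const{return &p;}
  };

public:
  explicit soa_pointer(const Access& a_):a(a_){}
  point<Access> operator*()const{return point<Access>(a);}
  arrow_proxy operator->()const{return arrow_proxy{point<Access>(a)};}
  point<Access> operator[](std::size_t n)const{return *(*this+n);}
  soa_pointer& operator++(){++a.x;++a.y;return *this;}
  friend soa_pointer operator+(soa_pointer p,std::size_t n)
  {
    p.a.x+=n;
    p.a.y+=n;
    return p;
  }
  friend bool operator!=(const soa_pointer& l,const soa_pointer& r)
  {
    return l.a.x!=r.a.x;
  }
};

// Sums x+y over elements stored as two separate arrays of equal size.
inline int total(const std::vector<int>& xs,const std::vector<int>& ys)
{
  assert(xs.size()==ys.size());
  soa_pointer<xy_access> beg(xy_access{xs.data(),ys.data()});
  auto end=beg+xs.size();
  int res=0;
  for(auto it=beg;it!=end;++it)res+=it->sum();
  return res;
}
```

<div style="text-align: left;">
The nested <code>arrow_proxy</code> is one common way of supporting <code>operator-></code> when dereferencing produces a temporary rather than a reference to a stored object.</div>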
<div style="text-align: left;">
Once the containers are populated and <span style="font-family: "Courier New",Courier,monospace;">beg_</span> and <span style="font-family: "Courier New",Courier,monospace;">end_</span> defined, user code can handle <span style="font-family: "Courier New",Courier,monospace;">particle</span>s as if they were stored in [<span style="font-family: "Courier New",Courier,monospace;">beg_</span>, <span style="font-family: "Courier New",Courier,monospace;">end_</span>), thus effectively isolated from the fact that the actual data is scattered around different containers for maximum processing performance.</div>
<div style="text-align: left;">
Are we paying an abstraction penalty for the convenience this framework affords? There are two sources of concern:</div>
<div style="text-align: left;">
<ul>
<li>Even though traversal code is <i>in principle</i> equivalent to hand-written DOD code, compilers might not be able to optimize all the template scaffolding away.</li>
<li>Traversing with <span style="font-family: "Courier New",Courier,monospace;">access<color,x,y,dx,dy></span> for rendering when only <span style="font-family: "Courier New",Courier,monospace;">color</span>, <span style="font-family: "Courier New",Courier,monospace;">x</span> and <span style="font-family: "Courier New",Courier,monospace;">y</span> are needed (because <span style="font-family: "Courier New",Courier,monospace;">render</span> does not access <span style="font-family: "Courier New",Courier,monospace;">dx</span> or <span style="font-family: "Courier New",Courier,monospace;">dy</span>) involves iterating over <span style="font-family: "Courier New",Courier,monospace;">dx_</span> and <span style="font-family: "Courier New",Courier,monospace;">dy_</span> without actually accessing either one: again, the compiler might or might not optimize this extra code.</li>
</ul>
</div>
<div style="text-align: left;">
We provide a <a href="https://www.dropbox.com/s/o3arzt9kbyhor4t/dod_perf.cpp?dl=0">test program</a> (Boost required) that measures the performance of this framework against some alternatives. The looped-over <span style="font-family: "Courier New",Courier,monospace;">render</span> procedure simply updates a global variable so that resulting execution times are basically those of the produced iteration code. The different options compared are:</div>
<div style="text-align: left;">
<ul style="list-style-type: none;">
<li><span style="color: #010101;">⬛</span> <span style="font-family: "Courier New",Courier,monospace;">oop</span>: iteration over a traditional object-based structure</li>
<li><span style="color: #ed2e2d;">⬛</span> <span style="font-family: "Courier New",Courier,monospace;">raw</span>: hand-written data-processing loop</li>
<li><span style="color: #008c47;">⬛</span> <span style="font-family: "Courier New",Courier,monospace;">dod</span>: DOD framework with <span style="font-family: "Courier New",Courier,monospace;">access<color,x,y,dx,dy></span></li>
<li><span style="color: #1859a9;">⬛</span> <span style="font-family: "Courier New",Courier,monospace;">render_dod</span>: DOD framework with <span style="font-family: "Courier New",Courier,monospace;">access<color,x,y></span></li>
<li><span style="color: #f37d22;">⬛</span> <span style="font-family: "Courier New",Courier,monospace;">oop[i]</span>: index-based access instead of iterator traversal </li>
<li><span style="color: #662c91;">⬛</span> <span style="font-family: "Courier New",Courier,monospace;">raw[i]</span>: hand-written index-based loop</li>
<li><span style="color: #a11d20;">⬛</span> <span style="font-family: "Courier New",Courier,monospace;">dod[i]</span>: index-based with <span style="font-family: "Courier New",Courier,monospace;">access<color,x,y,dx,dy></span></li>
<li><span style="color: #b33893;">⬛</span> <span style="font-family: "Courier New",Courier,monospace;">render_dod[i]</span>: index-based with <span style="font-family: "Courier New",Courier,monospace;">access<color,x,y></span></li>
</ul>
</div>
<div style="text-align: left;">
The difference between <span style="font-family: "Courier New",Courier,monospace;">dod</span> and <span style="font-family: "Courier New",Courier,monospace;">render_dod</span> (and the same applies to their index-based variants) is that the latter keeps access only to the data members strictly required by <span style="font-family: "Courier New",Courier,monospace;">render</span>: if the compiler were not able to optimize unnecessary pointer manipulations in <span style="font-family: "Courier New",Courier,monospace;">dod</span>, <span style="font-family: "Courier New",Courier,monospace;">render_dod</span> would be expected to be faster; the drawback is that this would require fine tuning the access entity for each member function. </div>
<div style="text-align: left;">
</div>
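<div style="text-align: left;">
The general shape of such a measurement can be sketched as follows. This is not the actual test program: the types and functions below (<code>plain_particle</code>, <code>soa_particles</code>, <code>render_oop</code>, <code>render_raw</code>, <code>measure</code>) are stand-ins invented for illustration, and only the <code>oop</code> and <code>raw[i]</code> options are represented.</div>

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the particle data used in the tests.
struct plain_particle{char color;int x,y,dx,dy;};

struct soa_particles
{
  std::vector<char> color;
  std::vector<int>  x,y,dx,dy;
};

// "Rendering" merely accumulates into a result, so that the measured time
// is essentially that of the iteration code.
inline long render_oop(const std::vector<plain_particle>& v)
{
  long res=0;
  for(const auto& p:v)res+=p.color+p.x+p.y;
  return res;
}

inline long render_raw(const soa_particles& s)
{
  assert(s.color.size()==s.x.size()&&s.x.size()==s.y.size());
  long res=0;
  for(std::size_t i=0,n=s.color.size();i<n;++i){
    res+=s.color[i]+s.x[i]+s.y[i];
  }
  return res;
}

// Runs f() once, stores its return value in result and returns the
// elapsed wall-clock time in seconds.
template<typename F>
double measure(F f,long& result)
{
  auto t0=std::chrono::steady_clock::now();
  result=f();
  auto t1=std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1-t0).count();
}
```

<div style="text-align: left;">
Storing the accumulated value in <code>result</code> keeps the compiler from discarding the loop altogether; a serious benchmark would also repeat each measurement several times and discard outliers.</div>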
<div style="text-align: left;">
<a href="https://plus.google.com/+ManuS%C3%A1nchezManu343726/posts">Manu Sánchez</a> has set up an extensive testing environment to build and run the program using different compilers and machines:</div>
<div style="text-align: left;">
<ul>
<li><a href="https://github.com/Manu343726/cpp-dod-tests">GitHub repository</a></li>
</ul>
</div>
<div style="text-align: left;">
The figures show the release-mode execution times of the eight options described above when traversing sequences of <i>n</i> = 10<sup>4</sup>, 10<sup>5</sup>, 10<sup>6</sup> and 10<sup>7</sup> particles.</div>
<div style="text-align: left;">
<b>GCC 5.1, MinGW, Intel Core i7-4790k @4.5GHz</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-pyVnu2CrXIo/VewAlh-_5hI/AAAAAAAABS8/ix48mK6Wz3M/s1600/gcc5.1-i7.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="258" src="http://4.bp.blogspot.com/-pyVnu2CrXIo/VewAlh-_5hI/AAAAAAAABS8/ix48mK6Wz3M/s1600/gcc5.1-i7.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>Execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
As expected, OOP is the slowest due to cache effects. The rest of the options are basically equivalent, which shows that GCC is able to entirely optimize away the syntactic niceties brought in by our DOD framework.</div>
<div style="text-align: left;">
<b>MSVC 14.0, Windows, Intel Core i7-4790k @4.5GHz</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-YSZRDEmWuhY/VewEEisBNbI/AAAAAAAABTI/gH3WZHFp_Lw/s1600/msvc14-i7.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="258" src="http://4.bp.blogspot.com/-YSZRDEmWuhY/VewEEisBNbI/AAAAAAAABTI/gH3WZHFp_Lw/s1600/msvc14-i7.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>Execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
Here, again, all DOD options are roughly equivalent, although <span style="font-family: "Courier New",Courier,monospace;">raw</span> (pointer-based hand-written loop) is slightly slower. Curiously enough, MSVC is much worse at optimizing DOD with respect to OOP than GCC is, with execution times up to 4 times higher for <i>n</i> = 10<sup>4</sup> and 1.3 times higher for <i>n</i> = 10<sup>7</sup>, the latter scenario being presumably dominated by cache efficiencies. </div>
<div style="text-align: left;">
<b>GCC 5.2, Linux, AMD A6-1450 APU @1.0 GHz</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-ollNWD_ufJU/VewQngv-7iI/AAAAAAAABT4/iXNawfk6lyM/s1600/gcc5.2-a6.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="258" src="http://4.bp.blogspot.com/-ollNWD_ufJU/VewQngv-7iI/AAAAAAAABT4/iXNawfk6lyM/s1600/gcc5.2-a6.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>Execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
From a qualitative point of view, these results are in line with those obtained for GCC 5.1 under an Intel Core i7, although as the AMD A6 is a much less powerful processor execution times are higher (×8-10 for <i>n</i> = 10<sup>4</sup>, ×4-5.5 for <i>n</i> = 10<sup>7</sup>).</div>
<div style="text-align: left;">
<b>Clang 3.6, Linux, AMD A6-1450 APU @1.0 GHz</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-IQNQTGydo0c/VewMcE7LXWI/AAAAAAAABTs/EN3fG2U8k6w/s1600/clang3.6-a6.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="258" src="http://3.bp.blogspot.com/-IQNQTGydo0c/VewMcE7LXWI/AAAAAAAABTs/EN3fG2U8k6w/s1600/clang3.6-a6.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>Execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
As happens with the rest of the compilers, DOD options (both manual and framework-supported) perform equally well. However, the comparison with GCC 5.2 on the same machine shows important differences: iterator-based OOP is faster (×1.1-1.4) in Clang, index-based OOP yields the same results for both compilers, and the DOD options in Clang are consistently slower (×2.3-3.4) than in GCC, to the point that OOP outperforms them for low values of <i>n</i>. A detailed analysis of the assembly code produced would probably give us more insight into these contrasting behaviors: interested readers can access the resulting assembly listings at the associated <a href="https://github.com/Manu343726/cpp-dod-tests">GitHub repository</a>.</div>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com9tag:blogger.com,1999:blog-2715968472735546962.post-88940645432656545032015-08-31T22:03:00.000+02:002015-12-11T18:57:41.989+01:00C++ encapsulation for Data-Oriented Design<div style="text-align: left;">
Data-Oriented Design, or DOD for short, seeks to maximize efficiency by laying out data in such a way that their processing is as streamlined as possible. This often goes against the usual object-based principles, which naturally lead to grouping the information according to the user-domain entities that it models. Consider for instance a game where large quantities of particles are rendered and moved around:</div>
<pre class="prettyprint">class particle
{
char color;
int x;
int y;
int dx;
int dy;
public:
static const int max_x=200;
static const int max_y=100;
particle(char color_,int x_,int y_,int dx_,int dy_):
color(color_),x(x_),y(y_),dx(dx_),dy(dy_)
{}
void render()const
{
// for explanatory purposes only: dump to std::cout
std::cout<<"["<<x<<","<<y<<","<<int(color)<<"]\n";
}
void move()
{
x+=dx;
if(x<0){
x*=-1;
dx*=-1;
}
else if(x>max_x){
x=2*max_x-x;
dx*=-1;
}
y+=dy;
if(y<0){
y*=-1;
dy*=-1;
}
else if(y>max_y){
y=2*max_y-y;
dy*=-1;
}
}
};
...
// game particles
std::vector<particle> particles;
</pre>
<div style="text-align: left;">
In the rendering loop, the program might do:</div>
<pre class="prettyprint">for(const auto& p:particles)p.render();
</pre>
<div style="text-align: left;">
Trivial as it seems, the execution speed of this approach is nevertheless suboptimal. The memory layout for <span style="font-family: "courier new" , "courier" , monospace;">particles</span> looks like: </div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-Cb9LNLCHZSo/VeSUE3hKaxI/AAAAAAAABSY/Q6beWGhmt_Q/s1600/particles.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="13" src="http://4.bp.blogspot.com/-Cb9LNLCHZSo/VeSUE3hKaxI/AAAAAAAABSY/Q6beWGhmt_Q/s1600/particles.png" width="400" /></a></div>
<div style="text-align: left;">
which, when traversed in the rendering loop, results in 47% of the data cached by the CPU (the part corresponding to <span style="font-family: "courier new" , "courier" , monospace;">dx</span> and <span style="font-family: "courier new" , "courier" , monospace;">dy</span>, in white) not being used, or even more if padding occurs. A more intelligent layout based on 5 different vectors</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-nBIURcKw91o/VeSV4QG3bFI/AAAAAAAABSk/R_ShmVrkjqk/s1600/particles%2Binterlaced.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="131" src="http://2.bp.blogspot.com/-nBIURcKw91o/VeSV4QG3bFI/AAAAAAAABSk/R_ShmVrkjqk/s200/particles%2Binterlaced.png" width="200" /></a></div>
<div style="text-align: left;">
allows the needed data, and only this, to be cached in three parallel cache lines, thus maximizing occupancy and minimizing misses. For the moving loop, it is a different set of data vectors that must be provided. DOD is increasingly popular, in particular in very demanding areas such as game programming. Mike Acton's <a href="https://www.youtube.com/watch?v=rX0ItVEVjHc">presentation</a> on DOD and C++ is an excellent introduction to the principles of data orientation.</div>
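<div style="text-align: left;">
As a rough illustration of the layout just described, here is a minimal sketch that keeps one vector per field. The names <code>particle_soa</code>, <code>add</code> and <code>render_all</code> are hypothetical and not part of the original program, and rendering writes to a string rather than <code>std::cout</code> so that the result can be inspected:</div>

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical structure-of-arrays layout: one vector per particle field.
struct particle_soa
{
  std::vector<char> color;
  std::vector<int>  x,y,dx,dy;

  // Appends one particle by pushing each field to its own vector.
  void add(char color_,int x_,int y_,int dx_,int dy_)
  {
    color.push_back(color_);
    x.push_back(x_);
    y.push_back(y_);
    dx.push_back(dx_);
    dy.push_back(dy_);
  }

  std::size_t size()const{return x.size();}
};

// Rendering reads color, x and y only: dx and dy are never touched,
// so they do not compete for cache space during this loop.
std::string render_all(const particle_soa& ps)
{
  std::ostringstream os;
  for(std::size_t i=0;i<ps.size();++i){
    os<<"["<<ps.x[i]<<","<<ps.y[i]<<","<<int(ps.color[i])<<"]\n";
  }
  return os.str();
}
```

<div style="text-align: left;">
Note how the rendering function has to name the individual vectors it needs: this is exactly the loss of encapsulation discussed next.</div>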
<div style="text-align: left;">
The problem with DOD is that encapsulation is lost: rather than being nicely packed in contiguous chunks of memory whose lifetime management is heavily supported by the language rules, "objects" now live as virtual entities with disemboweled, scattered pieces of information floating around in separate data structures. Methods acting on the data need to publish the exact information they require as part of their interface, and it is the responsibility of the user to locate it and provide it. We want to explore ways to remedy this situation by allowing a modest level of object encapsulation compatible with DOD. </div>
<div style="text-align: left;">
Roughly speaking, in C++ an object serves two different purposes:</div>
<div style="text-align: left;">
</div>
<div style="text-align: left;">
<ul>
<li>Providing a public interface (a set of member functions) acting on the associated data.</li>
<li>Keeping access to the data and managing its lifetime.</li>
</ul>
</div>
<div style="text-align: left;">
Both roles are mediated by the <span style="font-family: "courier new" , "courier" , monospace;">this</span> pointer. In fact, executing a member function on an object</div>
<pre class="prettyprint">x.f(args...);
</pre>
<div style="text-align: left;">
is conceptually equivalent to invoking a function with an implicit extra argument</div>
<pre class="prettyprint">X::f(this,args...);
</pre>
<div style="text-align: left;">
where the data associated to <span style="font-family: "courier new" , "courier" , monospace;">x</span>, assumed to be contiguous, is pointed to by <span style="font-family: "courier new" , "courier" , monospace;">this</span>. We can break this intermediation by letting objects be supplied with an <i>access</i> entity replacing <span style="font-family: "courier new" , "courier" , monospace;">this</span> for the purpose of reaching out to the information. We begin with a purely syntactic device:</div>
<pre class="prettyprint">template<typename T,int Tag=0>
struct member
{
using type=T;
static const int tag=Tag;
};
</pre>
<div style="text-align: left;">
<span style="font-family: "courier new" , "courier" , monospace;">member<T,Tag></span> will be used to specify that a given class has a piece of information with type <span style="font-family: "courier new" , "courier" , monospace;">T</span>. <span style="font-family: "courier new" , "courier" , monospace;">Tag</span> is needed to tell apart different members of the same type (for instance, <span style="font-family: "courier new" , "courier" , monospace;">particle</span> has four different members of type <span style="font-family: "courier new" , "courier" , monospace;">int</span>, namely <span style="font-family: "courier new" , "courier" , monospace;">x</span>, <span style="font-family: "courier new" , "courier" , monospace;">y</span>, <span style="font-family: "courier new" , "courier" , monospace;">dx</span> and <span style="font-family: "courier new" , "courier" , monospace;">dy</span>). Now, the following class:</div>
<pre class="prettyprint">template<typename Member>
class access
{
using type=typename Member::type;
type* p;
public:
access(type* p):p(p){}
type& get(Member){return *p;}
const type& get(Member)const{return *p;}
};
</pre>
<div style="text-align: left;">
stores a pointer to the piece of data that plays the role of the specified member. This can be easily extended to accommodate more than one member:</div>
<pre class="prettyprint">template<typename... Members>class access;
template<typename Member>
class access<Member>
{
using type=typename Member::type;
type* p;
public:
access(type* p):p(p){}
type& get(Member){return *p;}
const type& get(Member)const{return *p;}
};
template<typename Member0,typename... Members>
class access<Member0,Members...>:
public access<Member0>,access<Members...>
{
public:
template<typename Arg0,typename... Args>
access(Arg0&& arg0,Args&&... args):
access<Member0>(std::forward<Arg0>(arg0)),
access<Members...>(std::forward<Args>(args)...)
{}
using access<Member0>::get;
using access<Members...>::get;
};
</pre>
<div style="text-align: left;">
To access, say, the data labeled as <span style="font-family: "courier new" , "courier" , monospace;">member<int,0></span> we need to write <span style="font-family: "courier new" , "courier" , monospace;">get(member<int,0>())</span>. The price we have to pay for having data scattered around memory is that the access entity holds several pointers, one per member; on the other hand, the resulting objects, as we will see, really behave as on-the-fly proxies to their associated information, so access entities will seldom be stored. <span style="font-family: "courier new" , "courier" , monospace;">particle</span> can be rewritten so that data is accessed through a generic access object:</div>
<pre class="prettyprint">template<typename Access>
class particle:Access
{
using Access::get;
using color=member<char,0>;
using x=member<int,0>;
using y=member<int,1>;
using dx=member<int,2>;
using dy=member<int,3>;
public:
static const int max_x=200;
static const int max_y=100;
particle(const Access& a):Access(a){}
void render()const
{
std::cout<<"["<<get(x())<<","
<<get(y())<<","<<int(get(color()))<<"]\n";
}
void move()
{
get(x())+=get(dx());
if(get(x())<0){
get(x())*=-1;
get(dx())*=-1;
}
else if(get(x())>max_x){
get(x())=2*max_x-get(x());
get(dx())*=-1;
}
get(y())+=get(dy());
if(get(y())<0){
get(y())*=-1;
get(dy())*=-1;
}
else if(get(y())>max_y){
get(y())=2*max_y-get(y());
get(dy())*=-1;
}
}
};
template<typename Access>
particle<Access> make_particle(Access&& a)
{
return particle<Access>(std::forward<Access>(a));
}
</pre>
<div style="text-align: left;">
The transformations that need be done on the source code are not many:</div>
<div style="text-align: left;">
<ul>
<li>Turn the class into a class template dependent on an access entity from which it derives.</li>
<li>Rather than declaring internal data members, define the corresponding <span style="font-family: "courier new" , "courier" , monospace;">member</span> labels.</li>
<li>Delete former OOP constructors and define just one constructor taking an access object as its only parameter.</li>
<li>Replace mentions of data members with their corresponding access member function invocation (in the example, substitute <span style="font-family: "courier new" , "courier" , monospace;">get(color())</span> for <span style="font-family: "courier new" , "courier" , monospace;">color</span>, <span style="font-family: "courier new" , "courier" , monospace;">get(x())</span> for <span style="font-family: "courier new" , "courier" , monospace;">x</span>, etc.).</li>
<li>For convenience's sake, provide a <span style="font-family: "courier new" , "courier" , monospace;">make</span> template function (in the example <span style="font-family: "courier new" , "courier" , monospace;">make_particle</span>) to simplify object creation.</li>
</ul>
</div>
<div style="text-align: left;">
Observe how this works in practice:</div>
<pre class="prettyprint">using color=member<char,0>;
using x=member<int,0>;
using y=member<int,1>;
using dx=member<int,2>;
using dy=member<int,3>;
char color_=5;
int x_=20,y_=40,dx_=2,dy_=-1;
auto p=make_particle(access<color,x,y>(&color_,&x_,&y_));
auto q=make_particle(access<x,y,dx,dy>(&x_,&y_,&dx_,&dy_));
p.render();
q.move();
p.render();
</pre>
<div style="text-align: left;">
The particle data now lives externally as a bunch of separate variables (or, in a more real-life scenario, stored in containers). <span style="font-family: "courier new" , "courier" , monospace;">p</span> and <span style="font-family: "courier new" , "courier" , monospace;">q</span> act as proxies to the same information (i.e., they don't copy data internally) but other than this they provide the same interface as the OOP version of <span style="font-family: "courier new" , "courier" , monospace;">particle</span>, and can be used similarly. Note that the two objects specify different sets of access members, as required by <span style="font-family: "courier new" , "courier" , monospace;">render</span> and <span style="font-family: "courier new" , "courier" , monospace;">move</span>, respectively. So, the following</div>
<pre class="prettyprint">q.render(); // error
</pre>
<div style="text-align: left;">
would result in a compile-time error as <span style="font-family: "courier new" , "courier" , monospace;">render</span> accesses data that <span style="font-family: "courier new" , "courier" , monospace;">q</span> does not provide. Of course we can do</div>
<pre class="prettyprint">auto p=make_particle(
access<color,x,y,dx,dy>(&color_,&x_,&y_,&dx_,&dy_)),
q=p;
</pre>
<div style="text-align: left;">
so that the resulting objects can take advantage of the entire <span style="font-family: "courier new" , "courier" , monospace;">particle</span> interface. In later entries we will see how this need not affect performance in traversal algorithms. A nice side effect of this technique is that, when extra data is added to a DOD class, former code will continue to work as long as this data is only used in new member functions of the class.</div>
Implementing DOD enablement as a template policy also allows us to experiment with alternative access semantics. For instance, the <span style="font-family: "courier new" , "courier" , monospace;">tuple_storage</span> utility<br />
<pre class="prettyprint">template<typename Tuple,std::size_t Index,typename... Members>
class tuple_storage_base;
template<typename Tuple,std::size_t Index>
class tuple_storage_base<Tuple,Index>:public Tuple
{
struct inaccessible{};
public:
using Tuple::Tuple;
void get(inaccessible);
Tuple& tuple(){return *this;}
const Tuple& tuple()const{return *this;}
};
template<
typename Tuple,std::size_t Index,
typename Member0,typename... Members
>
class tuple_storage_base<Tuple,Index,Member0,Members...>:
public tuple_storage_base<Tuple,Index+1,Members...>
{
using super=tuple_storage_base<Tuple,Index+1,Members...>;
using type=typename Member0::type;
public:
using super::super;
using super::get;
type& get(Member0)
{return std::get<Index>(this->tuple());}
const type& get(Member0)const
{return std::get<Index>(this->tuple());}
};
template<typename... Members>
class tuple_storage:
public tuple_storage_base<
std::tuple<typename Members::type...>,0,Members...
>
{
using super=tuple_storage_base<
std::tuple<typename Members::type...>,0,Members...
>;
public:
using super::super;
};
</pre>
<div style="text-align: left;">
can be used to replace the external access policy with an object containing the data proper:</div>
<pre class="prettyprint">using storage=tuple_storage<color,x,y,dx,dy>;
auto r=make_particle(storage(3,100,10,10,-15));
auto s=r;
r.render();
r.move();
r.render();
s.render(); // different data than r
</pre>
<div style="text-align: left;">
which effectively brings us back the old OOP class with ownership semantics. (Also, it is easy to implement an access policy on top of <span style="font-family: "courier new" , "courier" , monospace;">tuple_storage</span> that gives proxy semantics for tuple-based storage. This is left as an exercise for the reader.) </div>
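<div style="text-align: left;">
One possible take on that exercise is sketched below. It is only an illustration: <code>tuple_proxy</code> and <code>tuple_proxy_base</code> are hypothetical names, the structure simply mirrors <code>tuple_storage</code> with a pointer to an external tuple in place of the tuple itself, and <code>member</code> is repeated so that the snippet compiles on its own:</div>

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Same syntactic device as in the article, repeated for self-containment.
template<typename T,int Tag=0>
struct member
{
  using type=T;
  static const int tag=Tag;
};

template<typename Tuple,std::size_t Index,typename... Members>
class tuple_proxy_base;

// Terminal case: holds a pointer to a tuple that lives elsewhere.
template<typename Tuple,std::size_t Index>
class tuple_proxy_base<Tuple,Index>
{
  struct inaccessible{};
  Tuple* p;
public:
  explicit tuple_proxy_base(Tuple& t):p(&t){}
  void get(inaccessible); // only there so that derived classes can say "using super::get"
  Tuple& tuple()const{return *p;}
};

// Recursive case: maps Member0 to the Index-th element of the tuple.
template<
  typename Tuple,std::size_t Index,
  typename Member0,typename... Members
>
class tuple_proxy_base<Tuple,Index,Member0,Members...>:
  public tuple_proxy_base<Tuple,Index+1,Members...>
{
  using super=tuple_proxy_base<Tuple,Index+1,Members...>;
  using type=typename Member0::type;
public:
  explicit tuple_proxy_base(Tuple& t):super(t){}
  using super::get;
  // Proxy semantics: constness of the proxy does not protect the data.
  type& get(Member0)const{return std::get<Index>(this->tuple());}
};

template<typename... Members>
class tuple_proxy:
  public tuple_proxy_base<
    std::tuple<typename Members::type...>,0,Members...
  >
{
  using tuple_type=std::tuple<typename Members::type...>;
  using super=tuple_proxy_base<tuple_type,0,Members...>;
public:
  explicit tuple_proxy(tuple_type& t):super(t){}
};
```

<div style="text-align: left;">
Copies of a <code>tuple_proxy</code> refer to the same underlying tuple, so a change made through one copy is visible through the others, unlike the owning <code>tuple_storage</code>.</div>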
<div style="text-align: left;">
A C++11 <a href="https://www.dropbox.com/s/b64983xsf10842y/dod.cpp?dl=0">example program</a> is provided that puts to use the ideas we have presented.</div>
<div style="text-align: left;">
Traversal is at the core of DOD, as the paradigm is oriented towards handling large numbers of like objects. In a later entry we will extend this framework to provide for easy object traversal and measure the resulting performance as compared with OOP.</div>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com4tag:blogger.com,1999:blog-2715968472735546962.post-55453719442119122352015-07-04T04:21:00.001+02:002015-07-04T04:27:09.688+02:00Allowing consumers into the advertising value chain<div style="text-align: left;">
The online advertising market operates through a small number of service providers such as Google or Facebook, which aggregate consumers' attention and offer it to advertisers via a B2B model based on <a href="https://en.wikipedia.org/wiki/Pay_per_click">CPC</a>, <a href="https://en.wikipedia.org/wiki/Cost_per_impression">CPM</a> or similar indicators. Companies determine their total advertising expenditure <i>A</i> as a fixed percentage of their revenues or with some other formula that takes into account the estimated increase in revenues that ads might bring in as a result of exposing the company's products to their potential customers. In any case, we can consider that this figure is an increasing function of consumer spending <i>C</i>:</div>
<div style="text-align: center;">
<i>A</i> = <i>A</i>(<i>C</i>).</div>
<div style="text-align: left;">
In the United States, digital advertising spend amounted to <a href="http://marketingland.com/digital-ad-spend-2014-iab-126020">$49.5B</a> or 0.3% of the country's GDP, roughly $155 per person. Figures for Spain are more modest: <a href="http://www.statista.com/statistics/406668/digital-advertising-expenditure-in-spain-by-medium/">€1.1B</a> or €23 per person.</div>
<div style="text-align: left;">
Consumers receive no direct payback from this business other than access to the free services online providers offer them. What would happen if users could actually charge online service providers for their share of attention? At first blush there seem to be no changes in terms of national economy as we are merely redistributing online revenues among a different set of players. On closer analysis, though, we can predict a multiplier effect that works as follows:</div>
<div style="text-align: left;">
<ul>
<li>Online providers' net income <i>I</i> is reduced by some quantity <i>I<sub>c</sub></i> that is directly received by users themselves.</li>
<li>Consumer income <i>Y</i> increases by <i>I<sub>c</sub></i> and decreases by some unknown quantity <i>L</i> resulting from online providers' lower dividends (if any, as currently neither Google nor Facebook pay them), reduced market cap, adjustments in employees' salaries, etc. The key aspect is that, even if <i>I<sub>c</sub></i> = <i>L</i> and so <i>Y</i> remains the same, this redistribution will benefit lower-income residents, with a net effect of higher overall consumption because these people pay proportionally lower taxes <i>T</i> (so, the national disposable income <i>Y<sub>d</sub></i> grows) and have a higher <a href="https://en.wikipedia.org/wiki/Marginal_propensity_to_consume">marginal propensity to consume</a> <i>MPC</i>.</li>
<li>Higher consumption increases demand and supply and produces a net gain in GDP (and online ad spending in its turn).</li>
</ul>
</div>
<div style="text-align: left;">
So, the entire cycle is summarized as</div>
<div style="text-align: center;">
Δ<i>A</i> = (Δ<i>A</i>/Δ<i>C</i>) · Δ<i>MPC</i> · Δ(<i>Y</i> − <i>T</i>).</div>
<div style="text-align: left;">
Of course this analysis is extremely simple and there are additional factors that complicate it:</div>
<div style="text-align: left;">
<ul>
<li><a href="https://www.google.com/search?q=google+tax+avoidance&ie=utf-8&oe=utf-8#q=google+%22tax+avoidance%22+overseas">Tax avoidance practices by big online providers</a> basically take all their income out of some countries (<a href="http://www.eleconomista.es/interstitial/volver/299341462/telecomunicaciones-tecnologia/noticias/4973761/07/13/Google-Espana-tan-solo-paga-33000-euros-al-fisco-Como-es-posible.html#.Kku87IjZiBxRgYa">Spain</a>, for instance) to derive it to others where taxes are lower. For the losing countries, redistribution of ad revenues among their residents is a win-win as their national economies receive no input from online providers to begin with (that is, online income losses do not affect their GDP).</li>
<li>Global online advertising is currently a duopoly and challenges to the existing value chain can confront very aggressive opposition: no matter how big the multiplier effect Δ<i>A</i>/<i>I<sub>c</sub></i>, online providers cannot compensate their loss with increased revenues in a bigger advertisement market, since ad spend is by definition a fraction of consumers' income. When French newspapers demanded a share of Google News' business, a <a href="http://www.bbc.com/news/technology-21302168">modest agreement</a> was settled after harsh disputes, but the same situation in Spain <a href="http://www.wired.com/2014/12/google-news-shutdown-spain-empty-victory-publishers/">ended much worse</a> for everybody.</li>
<li>In fact, threatening the very source of revenues of online service providers could damage their business to the point of making it unviable. Current operating margins for this market go from <a href="http://finance.yahoo.com/q/ks?s=GOOG+Key+Statistics">25%</a> to <a href="http://finance.yahoo.com/q/ks?s=FB+Key+Statistics">35%</a>, though, which seems to indicate otherwise. </li>
</ul>
</div>
<div style="text-align: left;">
Now, if users were to enter into the ad value chain, they would need some proxy on their behalf who has the power to oppose online service providers (for a share of the pie, at any rate). Two potential players come to mind:</div>
<div style="text-align: left;">
<ul>
<li>ISPs,</li>
<li>Internet browser companies.</li>
</ul>
</div>
<div style="text-align: left;">
As online ad spend per person is actually not that high, we are talking here of potential revenues for the end user of a few dollars a year at most. This would be best provided by the proxy in the form of virtual goods or, in the case of ISPs, discounts in their services.</div>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-23903514119103106452015-06-29T22:29:00.000+02:002015-07-03T08:23:11.219+02:00Design patterns for invariant suspension<div style="text-align: left;">
One of the most important aspects of <a href="https://en.wikipedia.org/wiki/Encapsulation_%28computer_programming%29">data encapsulation</a> as embodied by classes and similar constructs is the ability to maintain <a href="https://en.wikipedia.org/wiki/Class_invariant">class invariants</a> by restricting access to the implementation data through a well defined interface. (Incidentally, a class without significant invariants is little more than a tuple of values, which is why the presence of getters and setters is a sure sign of poor design). Sometimes, however, maintaining the class invariant can lead to unacceptable costs in terms of complexity or performance. Consider two examples: </div>
<div style="text-align: left;">
<ul>
<li>A <span style="font-family: "Courier New",Courier,monospace;">sorted_vector</span> class that maintains an internal buffer of ordered elements.</li>
<li>A <span style="font-family: "Courier New",Courier,monospace;">spread_sheet</span> class that implements automatic formula calculation.</li>
</ul>
</div>
<div style="text-align: left;">
These classes restrict the potential states of their implementation data so that they meet some extra requirements the interface relies on (for instance, the fact that the elements of <span style="font-family: "Courier New",Courier,monospace;">sorted_vector</span> are ordered allows for fast lookup operations, automatic values in the spread sheet are consistent with input data cells, etc.). State can then only be changed by using the provided interface, which takes proper care that the invariant is kept upon exit from any method, and this is all good and desirable.</div>
<div style="text-align: left;">
The problem arises when we need to do bulk modifications on the internal data and the class interface is too rigid to support this:</div>
<div style="text-align: left;">
<ul>
<li><span style="font-family: "Courier New",Courier,monospace;">sorted_vector</span> typically only allows for deletion of elements or insertion of new values, and does not provide mechanisms for directly modifying the elements or temporarily rearranging them.</li>
<li><span style="font-family: "Courier New",Courier,monospace;">spread_sheet</span> might only accept changing one data cell at a time so as to check which formulas are affected and recalculate those alone.</li>
</ul>
</div>
<div style="text-align: left;">
If we granted unrestricted access to the data, the class would have to renormalize (i.e. restore the invariant) inside every method. Let us see this for the case of <span style="font-family: "Courier New",Courier,monospace;">sorted_vector</span>:</div>
<pre class="prettyprint">class sorted_vector
{
value_type* data(); // unrestricted access to data
iterator find(const key_type& x)const
{
sort();
...
}
...
private:
void sort()const; // called from const find, so the internal data must be mutable
};
</pre>
<div style="text-align: left;">
This is obviously impracticable for efficiency reasons. Another alternative, sometimes found in questionable designs, is to put the burden of maintaining the invariant on the user herself: </div>
<pre class="prettyprint">class sorted_vector
{
value_type* data();
void sort();
iterator find(const key_type& x)const; // must be sorted
...
};
</pre>
<div style="text-align: left;">
The problem with this approach is that it is all too easy to forget to restore the invariant. What is worse, there is no way that the class can detect the oversight and raise some signal (an error code or an exception). A poor attempt at solving this:</div>
<pre class="prettyprint">class sorted_vector
{
value_type* data();
void modifying_data() // to indicate invariant breach
{
sorted=false;
}
void sort();
iterator find(const key_type& x)const
{
if(!sorted)throw std::runtime_error("not sorted");
...
}
...
private:
bool sorted;
...
};
</pre>
<div style="text-align: left;">
is just as brittle as the previous situation and more cumbersome for the user. If we finally give up on granting access to the internal data, the only way to do bulk modifications of the internal state is constructing the data externally and then reassigning:</div>
<pre class="prettyprint">sorted_vector v;
vector aux;
... // manipulate data in aux
v=sorted_vector(aux.begin(),aux.end());</pre>
<div style="text-align: left;">
This last option is not entirely bad, but incurs serious inefficiencies related to the external allocation of the auxiliary data and the recreation of a new <span style="font-family: "Courier New",Courier,monospace;">sorted_vector</span> object. </div>
<div style="text-align: left;">
We look for design patterns that enable <i>invariant suspension</i> (i.e. unrestricted access to a class internal data) in a controlled and maximally efficient way.</div>
<div style="text-align: left;">
<b>Internal invariant suspension</b></div>
<div style="text-align: left;">
In this pattern, we allow for suspension only within the execution of a class method: </div>
<pre class="prettyprint">class X
{
template<typename F>
void modify_data(F f); // f is invoked with X internal data
};</pre>
<div style="text-align: left;">
How does this work? The idea is that <span style="font-family: "Courier New",Courier,monospace;">f</span> is a user-provided piece of code that is given access to the internal data to manipulate freely. After <span style="font-family: "Courier New",Courier,monospace;">f</span> is done, <span style="font-family: "Courier New",Courier,monospace;">modify_data</span>, which issued the invocation, can renormalize the data before exiting. Let us see this in action for <span style="font-family: "Courier New",Courier,monospace;">sorted_vector</span>:</div>
<pre class="prettyprint">class sorted_vector
{
template<typename F>
void modify_data(F f)
{
f(data);
sort(); // restore invariant
}
};
...
sorted_vector v;
...
v.modify_data([](value_type* data){
... // manipulate data as we see fit
});
</pre>
<div style="text-align: left;">
This pattern is most suitable when modifications are relatively localized in code. When the modification takes place in a more uncontrolled setting, we need something even more flexible.</div>
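<div style="text-align: left;">
For readers who want something they can compile, the following is a minimal sketch of internal invariant suspension. It assumes, purely for illustration, that the internal data is a <code>std::vector&lt;int&gt;</code>; the <code>contains</code> and <code>size</code> members are hypothetical additions used to observe the result:</div>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

class sorted_vector
{
  std::vector<int> data; // invariant: sorted outside of modify_data
public:
  // f gets unrestricted access to the internal data; the invariant is
  // suspended while f runs and restored before modify_data returns.
  template<typename F>
  void modify_data(F f)
  {
    f(data);
    std::sort(data.begin(),data.end()); // restore invariant
  }

  // Relies on the invariant: binary search requires sorted data.
  bool contains(int x)const
  {
    return std::binary_search(data.begin(),data.end(),x);
  }

  std::size_t size()const{return data.size();}
};
```

<div style="text-align: left;">
This sketch does not restore the invariant if <code>f</code> throws; a production version would have to decide what to do in that case.</div>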
<div style="text-align: left;">
<b>External invariant suspension</b></div>
<div style="text-align: left;">
Here, we move the data out and accept it back again when the user is done with it.</div>
<pre class="prettyprint">class X
{
data extract_data(); // X becomes "empty"
void accept_data(data& d); // X should be "empty"
};
</pre>
<div style="text-align: left;">
At first sight, this seems just as bad as our first example with <span style="font-family: "Courier New",Courier,monospace;">sorted_vector::data()</span>: as soon as we grant access to the internal state, we can no longer be sure the invariant is kept. The key aspect of the pattern, however, is that <span style="font-family: "Courier New",Courier,monospace;">extract_data</span> sets <span style="font-family: "Courier New",Courier,monospace;">X</span> to some <i>empty state</i>, i.e., a state without internal data where the invariant is trivially true in all cases —in other words, <span style="font-family: "Courier New",Courier,monospace;">X</span> gives its internal data away.</div>
<pre class="prettyprint">class sorted_vector
{
value_type* extract_data()
{
value_type* res=data;
data=nullptr;
return res;
}
void accept_data(value_type*& d)
{
if(data)throw std::runtime_error("not empty");
// alternatively: delete data
data=d;
d=nullptr;
sort(); // restore invariant
}
};
...
sorted_vector v;
...
value_type* data=v.extract_data();
... // manipulate data
v.accept_data(data);
</pre>
<div style="text-align: left;">
How does this meet our requirements of control and efficiency?</div>
<div style="text-align: left;">
<ul>
<li>Control: the invariant of <span style="font-family: "Courier New",Courier,monospace;">v</span> is always true regardless of the user code: when invoking <span style="font-family: "Courier New",Courier,monospace;">extract_data</span>, <span style="font-family: "Courier New",Courier,monospace;">v</span> becomes empty, i.e. a vector with zero elements which is of course sorted. The methods of <span style="font-family: "Courier New",Courier,monospace;">sorted_vector</span> need not check for special cases of invariant suspension.</li>
<li>Efficiency: the data is never re-created, but merely <i>moved</i> around; no extra allocations are required. </li>
</ul>
</div>
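<div style="text-align: left;">
The same pattern can be sketched with move semantics instead of raw pointers. This is only one possible rendering, assuming the internal data is a <code>std::vector&lt;int&gt;</code>; <code>empty</code> and <code>contains</code> are hypothetical helpers added to observe the state:</div>

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <utility>
#include <vector>

class sorted_vector
{
  std::vector<int> data; // invariant: always sorted
public:
  // Gives the internal data away; the object becomes empty,
  // a state in which the invariant holds trivially.
  std::vector<int> extract_data()
  {
    std::vector<int> res=std::move(data);
    data.clear(); // a moved-from vector is valid but unspecified: make it empty
    return res;
  }

  // Takes ownership of d without copying and renormalizes.
  void accept_data(std::vector<int>&& d)
  {
    if(!data.empty())throw std::runtime_error("not empty");
    data=std::move(d);
    d.clear();
    std::sort(data.begin(),data.end()); // restore invariant
  }

  bool empty()const{return data.empty();}
  bool contains(int x)const
  {
    return std::binary_search(data.begin(),data.end(),x);
  }
};
```

<div style="text-align: left;">
Between <code>extract_data</code> and <code>accept_data</code> the object is a perfectly valid empty container, so none of its other member functions needs to know that a suspension is in progress.</div>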
<div style="text-align: left;">
<b>Conclusions</b></div>
<div style="text-align: left;">
Class invariants are a key aspect of data encapsulation, but can lead to inefficient code when the class interface is too restrictive for data manipulation. We have described two general design patterns, <b>internal invariant suspension</b> and <b>external invariant suspension</b>, that allow for internal state modification in a controlled fashion without interfering with the class general interface or requiring any type of data copying.</div>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com1tag:blogger.com,1999:blog-2715968472735546962.post-88403453474030227592015-06-28T13:17:00.000+02:002015-06-29T01:09:52.699+02:00Traversing a linearized tree: cache friendliness analysis<div style="text-align: left;">
We do a formal analysis of the cache-related behavior of <a href="http://bannalia.blogspot.com/2015/06/traversing-linearized-tree.html">traversing a <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span></a>, a linear array of elements laid out according to the breadth-first order of a binary tree. Ignoring roundoff effects, let us assume that the array has <i>n</i> = 2<sup><i>d</i></sup> − 1 elements and the CPU provides a cache system of <i>N</i> lines capable of holding <i>L</i> contiguous array elements each. We further assume that <i>N</i> > <i>d</i>.</div>
<div style="text-align: left;">
Under these conditions, the elements of level <i>i</i>, being contiguous, are paged to 2<sup><i>i</i></sup>/<i>L</i> lines.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-BaDekTvtlOU/VY_XLbhxkaI/AAAAAAAABRU/Hq9YHZxm4uI/s1600/vector_levels.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="60" src="http://1.bp.blogspot.com/-BaDekTvtlOU/VY_XLbhxkaI/AAAAAAAABRU/Hq9YHZxm4uI/s1600/vector_levels.png" width="400" /></a></div>
<div style="text-align: left;">
The traversal of the array corresponds to the in-order sequence of the implicitly encoded tree, and it is easy to see that this visits the elements of each level <i>i</i> exactly in the order they are laid out in memory: so, given that <i>N</i> > <i>d</i>, i.e. the CPU can hold at least as many lines in cache as there are levels in the tree, an optimal caching scheduler will result in only 2<sup><i>i</i></sup>/<i>L</i> misses per level, or <i>n</i>/<i>L</i> overall. The number of cache misses <i>per element</i> during traversal is then constant and equal to 1/<i>L</i>, which is the minimum attainable for <i>any</i> traversal scheme (like the simplest one consisting of walking the array linearly from first to last).</div>
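<div style="text-align: left;">
Spelled out, and still ignoring roundoff, the overall figure is the sum of the per-level misses over the <i>d</i> levels <i>i</i> = 0, ..., <i>d</i> − 1 of the tree:</div>

```latex
\sum_{i=0}^{d-1}\frac{2^{i}}{L}=\frac{2^{d}-1}{L}=\frac{n}{L},
\qquad
\frac{\text{misses}}{\text{elements}}=\frac{n/L}{n}=\frac{1}{L}.
```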
<div style="text-align: left;">
The question remains of whether we can justifiably hold the assumption that <i>N</i> > <i>d</i> in modern architectures. In the <a href="https://en.wikipedia.org/wiki/Sandy_Bridge#Technology">Intel Core</a> family, for instance, the data L1 cache is 32 KB (per core) and line size is 64 B, so <i>N</i> = 512. On the other hand, the maximum <i>d</i> possible in 64-bit architectures is precisely 64, therefore the condition <i>N</i> > <i>d</i> is met indeed by a large margin.</div>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0tag:blogger.com,1999:blog-2715968472735546962.post-20400475689047434652015-06-27T16:35:00.000+02:002015-06-28T13:18:44.095+02:00Traversing a linearized tree<div style="text-align: left;">
In a <a href="http://bannalia.blogspot.com/2015/06/cache-friendly-binary-search.html">previous entry</a> we have introduced the data structure <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector<T></span>, which provides faster binary search than sorted vectors (such as sported by <a href="http://www.boost.org/doc/html/container/non_standard_containers.html#container.non_standard_containers.flat_xxx">Boost.Container flat associative containers</a>) by laying out the elements in a linear fashion following the <a href="https://en.wikipedia.org/wiki/Tree_traversal#Breadth-first">breadth-first</a> or level order of the binary tree associated to the sequence. As usual with data structures, this is expected to come at a price in the form of poorer performance for other operations.</div>
<div style="text-align: left;">
Let us begin with iteration: traversing a <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> is algorithmically equivalent to doing so in its homologous binary tree:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-MpE0_x1E8qo/VY6c1bSP8rI/AAAAAAAABQQ/asnRVKysD2Q/s1600/increment_tree.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="127" src="http://1.bp.blogspot.com/-MpE0_x1E8qo/VY6c1bSP8rI/AAAAAAAABQQ/asnRVKysD2Q/s1600/increment_tree.png" width="400" /></a></div>
<div style="text-align: left;">
The picture above shows the three generic cases the algorithm for incrementing a position has to deal with, highlighted with the same color in the following pseudocode:</div>
<pre>void increment(position& i)
{
<span style="color: red;"> position j=right(i);
if(j){ // right child exists
do{
i=j; // take the path down
j=left(i); // and follow left children
}while(j);
}</span>
<span style="color: #00b050;">else if(is_last_leaf(i)){
i=end; // next of last is end
}</span>
<span style="color: #0070c0;">else{
while(is_right_child(i)){ // go up while the node is
i=parent(i); // the right child of its parent
}
i=parent(i); // and up an extra level
}</span>
}</pre>
<div style="text-align: left;">
In the case of <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span>, positions are numerical indexes into an array, so these operations are purely arithmetic (by contrast, in a node-based tree we would need to follow pointers around, which is more expensive as we will see shortly):</div>
<div style="text-align: left;">
<ul>
<li>parent(<i>i</i>) = (<i>i</i> − 1)/2,</li>
<li>right(<i>i</i>) = 2<i>i</i> + 2 (if less than size),</li>
<li>left(<i>i</i>) = 2<i>i</i> + 1 (if less than size),</li>
<li>is_right_child(<i>i</i>) = (<i>i</i> mod 2 = 0),</li>
<li>is_last_leaf(<i>i</i>) = <i>i</i> does not have a right child and <i>i</i> + 2 is a power of 2. </li>
</ul>
</div>
<div style="text-align: left;">
The last identity might be a little trickier to understand: it simply relies on the fact that the last leaf of the tree must also be the last element of a fully occupied level, and the number of elements in a <i>complete</i> binary tree with <i>d</i> levels is 2<sup><i>d</i></sup> − 1, hence the zero-based index of the element + 2 has to be a power of two.</div>
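<div style="text-align: left;">
Putting the pseudocode and the identities together, a minimal sketch of the traversal over a plain <span style="font-family: "Courier New",Courier,monospace;">std::vector</span> could look as follows. This is not the code of the test program: the names <span style="font-family: "Courier New",Courier,monospace;">first</span> and <span style="font-family: "Courier New",Courier,monospace;">traverse</span>, the explicit <span style="font-family: "Courier New",Courier,monospace;">size</span> argument and the use of <span style="font-family: "Courier New",Courier,monospace;">size</span> as the end position are assumptions made for illustration.</div>

```cpp
#include <cstddef>
#include <vector>

// Positions are zero-based indexes into the level-order array.
// In this sketch the end position is represented by the container size.
using position = std::size_t;

inline position parent(position i) { return (i - 1) / 2; }
inline position left(position i) { return 2 * i + 1; }
inline position right(position i) { return 2 * i + 2; }

// The root (index 0) is never tested here: the upward walk in increment()
// always stops at a left child before reaching it.
inline bool is_right_child(position i) { return i % 2 == 0; }

inline bool is_power_of_2(std::size_t x) { return x != 0 && (x & (x - 1)) == 0; }

// First position in traversal order: follow left children from the root.
// For an empty container this returns 0, which equals the end position.
inline position first(std::size_t size)
{
  position i = 0;
  while (left(i) < size) i = left(i);
  return i;
}

inline void increment(position& i, std::size_t size)
{
  position j = right(i);
  if (j < size) {                      // right child exists
    do {
      i = j;                           // take the path down
      j = left(i);                     // and follow left children
    } while (j < size);
  }
  else if (is_power_of_2(i + 2)) {     // no right child and last of a full level
    i = size;                          // next of last is end
  }
  else {
    while (is_right_child(i)) i = parent(i);  // go up while a right child
    i = parent(i);                            // and up an extra level
  }
}

// Collect the elements in traversal order.
template <typename T>
std::vector<T> traverse(const std::vector<T>& v)
{
  std::vector<T> out;
  for (position i = first(v.size()); i != v.size(); increment(i, v.size()))
    out.push_back(v[i]);
  return out;
}
```

For instance, the level-order array {3, 1, 5, 0, 2, 4, 6} is traversed as 0, 1, 2, 3, 4, 5, 6.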
<div style="text-align: left;">
A <a href="https://www.dropbox.com/s/kfg5opacgomgzi1/levelorder_vector_traverse.cpp?dl=0">test program</a> is provided that implements forward iterators for <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> (backwards traversal is mostly symmetrical) and measures the time taken to traverse <span style="font-family: "Courier New",Courier,monospace;">std::set</span>, <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span> and <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> instances holding <i>n</i> = 10,000 to 3 million integer elements.</div>
<div style="text-align: left;">
<b>MSVC on Windows </b></div>
<div style="text-align: left;">
Test compiled with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-BQQtR3b2C_o/VY6pps-95YI/AAAAAAAABQg/DNSPTJi_jos/s1600/traverse_msvc.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="http://1.bp.blogspot.com/-BQQtR3b2C_o/VY6pps-95YI/AAAAAAAABQg/DNSPTJi_jos/s1600/traverse_msvc.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>Traversal time / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
As expected, <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span> is the fastest: nothing (single-threaded) can beat traversing an array of memory in sequential order. <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> is around 5x slower than <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span> but up to 4x faster than <span style="font-family: "Courier New",Courier,monospace;">std::set</span>. More interestingly, its traversal time per element is virtually unaffected by the size of the container. This can be further confirmed by measuring a <i>non-touching</i> traversal where the range [<span style="font-family: "Courier New",Courier,monospace;">begin</span>, <span style="font-family: "Courier New",Courier,monospace;">end</span>) is walked through without actually dereferencing the iterator, and consequently not accessing the container memory at all: resulting times remain almost the same, which indicates that the traversal time is dominated by the arithmetic operations on indexes. In a <a href="http://bannalia.blogspot.com/2015/06/traversing-linearized-tree-cache.html">later entry</a> we provide the formal proof that traversing a <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> is indeed <a href="https://en.wikipedia.org/wiki/Cache-oblivious_algorithm"><i>cache-oblivious</i></a>. </div>
<div style="text-align: left;">
<b>GCC on Linux</b></div>
<div style="text-align: left;">
Contributed by <a href="https://plus.google.com/+ManuS%C3%A1nchezManu343726/posts">Manu Sánchez</a> (GCC 5.1, Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz) and <a href="https://github.com/GrayShade">Laurentiu Nicola</a> (GCC 5.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz). Rather than showing the plots, we provide the observed ranges in execution speed ratios between the three containers (for instance, the cell "10-30" means that <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span> is between 10 and 30 times faster than <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> for Intel Core i7).</div>
<div style="text-align: center;">
<table border="1" style="margin-left: auto; margin-right: auto; padding: 0pt; text-align: center;"><tbody>
<tr><td></td><td><span style="font-family: "Courier New",Courier,monospace; font-size: 85%;">flat_set</span><br />
<span style="font-size: 85%;"><span style="font-family: "Courier New",Courier,monospace; text-decoration: overline;">levelorder_vector</span></span></td><td><span style="font-size: 85%;"><span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span><br />
<span style="font-family: "Courier New",Courier,monospace; text-decoration: overline;"> std::set </span></span><span style="font-size: 85%;"><span style="font-family: "Courier New",Courier,monospace; text-decoration: overline;"> </span></span></td></tr>
<tr><td>i7</td><td>10-30</td><td>3-8</td></tr>
<tr><td>Celeron</td><td>10-12</td><td>2.5-4.5</td></tr>
</tbody></table>
</div>
<div style="text-align: left;">
<b>Clang on Linux</b></div>
<div style="text-align: left;">
Clang 3.6.1 with libstdc++-v3 and libc++, Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz. Clang 3.6.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz.</div>
<div style="text-align: center;">
<table border="1" style="margin-left: auto; margin-right: auto; padding: 0pt; text-align: center;"><tbody>
<tr><td></td><td><span style="font-family: "Courier New",Courier,monospace; font-size: 85%;">flat_set</span><br />
<span style="font-size: 85%;"><span style="font-family: "Courier New",Courier,monospace; text-decoration: overline;">levelorder_vector</span></span></td><td><span style="font-size: 85%;"><span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span><br />
<span style="font-family: "Courier New",Courier,monospace; text-decoration: overline;"> std::set </span></span><span style="font-size: 85%;"><span style="font-family: "Courier New",Courier,monospace; text-decoration: overline;"> </span></span></td></tr>
<tr><td>libstdc++-v3, i7</td><td>10-50</td><td>2-7</td></tr>
<tr><td>libc++, i7</td><td>10-50</td><td>2.5-8</td></tr>
<tr><td>Celeron</td><td>15-30</td><td>2.5-4.5</td></tr>
</tbody></table>
</div>
<div style="text-align: left;">
<b>Conclusions</b></div>
<div style="text-align: left;">
Traversing a <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> takes a constant time per element that sits in between those of <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span> (10-50 times slower, depending on number of elements and environment) and <span style="font-family: "Courier New",Courier,monospace;">std::set</span> (2-8 times faster). Users need to consider their particular execution scenarios to balance this fact vs. the superior performance offered at lookup.</div>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com8tag:blogger.com,1999:blog-2715968472735546962.post-41810178576106565472015-06-25T00:48:00.002+02:002015-07-07T14:21:09.113+02:00Cache-friendly binary search<div style="text-align: left;">
High-speed memory caches present in modern computer architectures favor data structures with good <i>locality of reference</i>, i.e. the property by which elements accessed in sequence are located in memory addresses close to each other. This is the rationale behind classes such as <a href="http://www.boost.org/doc/html/container/non_standard_containers.html#container.non_standard_containers.flat_xxx">Boost.Container flat associative containers</a>, that emulate the functionality of standard C++ node-based associative containers while storing the elements contiguously (and in order). This is an example of how <a href="https://en.wikipedia.org/wiki/Binary_search_algorithm">binary search</a> works in a <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span> with elements 0 through 30:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-g7a7m_42A5w/VYs05n5fWmI/AAAAAAAABNY/CSxDV5K5Ujc/s1600/flat_set.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="48" src="http://3.bp.blogspot.com/-g7a7m_42A5w/VYs05n5fWmI/AAAAAAAABNY/CSxDV5K5Ujc/s1600/flat_set.png" width="400" /></a></div>
<div style="text-align: left;">
As searching narrows down towards the looked-for value, visited elements are progressively closer, so locality of reference is very good <i>for the last part of the process</i>, even if the first steps hop over large distances. All in all, the process is more cache-friendly than searching in a regular <span style="font-family: "Courier New",Courier,monospace;">std::set</span> (where elements are totally scattered throughout memory according to the allocator's whims) and resulting lookup times are consequently much better. But we can get even more cache-friendly than that.</div>
<div style="text-align: left;">
Consider the structure of a classical <a href="https://en.wikipedia.org/wiki/Red%E2%80%93black_tree">red-black tree</a> holding values 0 to 30:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-OeUzPe7-O_4/VYsSqDJZV0I/AAAAAAAABMQ/IZfwXIqcogg/s1600/rb_tree.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="122" src="http://1.bp.blogspot.com/-OeUzPe7-O_4/VYsSqDJZV0I/AAAAAAAABMQ/IZfwXIqcogg/s1600/rb_tree.png" width="400" /></a></div>
<div style="text-align: left;">
and lay out the elements in a contiguous array following the tree's <a href="https://en.wikipedia.org/wiki/Tree_traversal#Breadth-first">breadth-first</a> (or level) order:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-9gD_pbaWDyA/VYsTasCqpiI/AAAAAAAABMY/vwCuXYvHF_c/s1600/preorder_vector.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="15" src="http://1.bp.blogspot.com/-9gD_pbaWDyA/VYsTasCqpiI/AAAAAAAABMY/vwCuXYvHF_c/s1600/preorder_vector.png" width="400" /></a></div>
<div style="text-align: left;">
We can perform binary search on this <i>level-order vector</i> beginning with the first element (the root) and then jumping to either the "left" or "right" child as comparisons indicate:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-1QQEaqsOvzA/VYsUZL084uI/AAAAAAAABMg/KYsITID56P4/s1600/preorder_vector_binary_search.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="30" src="http://1.bp.blogspot.com/-1QQEaqsOvzA/VYsUZL084uI/AAAAAAAABMg/KYsITID56P4/s1600/preorder_vector_binary_search.png" width="400" /></a></div>
<div style="text-align: left;">
The hopping pattern is somewhat different than with <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span>: here, the first visited elements are close to each other and hops become increasingly longer. So, we get good locality of reference <i>for the first part of the process</i>, exactly the opposite of before. It could seem then that we are no better off with this new layout, but in fact we are. Let us have a look at the <i>heat map</i> of <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span>:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-eJgJ0Uconcg/VYsWBWJ2m0I/AAAAAAAABMs/2FoX_FIIjbQ/s1600/flat_set_heat.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="15" src="http://4.bp.blogspot.com/-eJgJ0Uconcg/VYsWBWJ2m0I/AAAAAAAABMs/2FoX_FIIjbQ/s1600/flat_set_heat.png" width="400" /></a></div>
<div style="text-align: left;">
This shows how frequently a given element is visited through an average binary search operation (brighter means "hotter", that is, accessed more). As every lookup begins at the mid element 15, this gets visited 100% of the time, then 7 and 23 have 50% visit frequency, etc. On the other hand, the analogous heat map for a level-order vector looks very different:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-VWfoaB0Y5FU/VYsXNoJyg2I/AAAAAAAABM4/elr4n1IyQaY/s1600/preorder_vector_heat.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="15" src="http://3.bp.blogspot.com/-VWfoaB0Y5FU/VYsXNoJyg2I/AAAAAAAABM4/elr4n1IyQaY/s1600/preorder_vector_heat.png" width="400" /></a></div>
<div style="text-align: left;">
Now hotter elements are much closer: as cache management mechanisms try to keep hot areas longer in cache, we can then expect this layout to result in fewer cache misses overall. To summarize, we are improving locality of reference for hotter elements at the expense of colder areas of the array.</div>
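<div style="text-align: left;">
The two heat maps can be reproduced with a small counting experiment. The sketch below is an illustration only (the names <span style="font-family: "Courier New",Courier,monospace;">heat_maps</span> and <span style="font-family: "Courier New",Courier,monospace;">count_visits</span> are hypothetical): it looks up every key once and assumes <i>n</i> = 2<sup><i>d</i></sup> − 1, so that the midpoints probed by binary search on the sorted array are exactly the nodes of the perfect binary tree, which lets us track the corresponding level-order index alongside without building the second array.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct heat_maps
{
  std::vector<std::size_t> flat;   // visits per position in the sorted array
  std::vector<std::size_t> level;  // visits per position in the level-order array
};

// Count how often each position is probed when every key 0..n-1 is looked
// up once with a lower_bound-style binary search.
// Assumption: n + 1 is a power of two (perfect binary tree).
inline heat_maps count_visits(std::size_t n)
{
  assert(((n + 1) & n) == 0);
  heat_maps h{std::vector<std::size_t>(n, 0), std::vector<std::size_t>(n, 0)};
  for (std::size_t x = 0; x < n; ++x) {   // the sorted array holds a[k] == k
    std::size_t lo = 0, hi = n;           // search range in the sorted array
    std::size_t j = 0;                    // same node, as a level-order index
    while (lo < hi) {
      std::size_t mid = lo + (hi - lo) / 2;
      ++h.flat[mid];
      ++h.level[j];
      if (mid < x) {                      // a[mid] < x: go right
        lo = mid + 1;
        j = 2 * j + 2;
      }
      else {                              // go left
        hi = mid;
        j = 2 * j + 1;
      }
    }
  }
  return h;
}
```

With <span style="font-family: "Courier New",Courier,monospace;">count_visits(31)</span>, positions 15, 7 and 23 of the sorted array receive 31, 16 and 15 visits respectively, and the same counts land on positions 0, 1 and 2 of the level-order array: the hot elements are the same, but in the second layout they sit next to each other.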
<div style="text-align: left;">
Let us measure this data structure in practice. I have written a prototype class template <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector<T></span> that stores its contents in level-order sequence and provides a member function for binary search:</div>
<pre class="prettyprint">const_iterator lower_bound(const T& x)const
{
size_type n=impl.size(),i=n,j=0;
while(j<n){
if(impl[j]<x){
j=2*j+2;
}
else{
i=j;
j=2*j+1;
}
}
return begin()+i;
}
</pre>
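<div style="text-align: left;">
For readers who want to experiment without the full class, here is a self-contained sketch of the same idea on top of <span style="font-family: "Courier New",Courier,monospace;">std::vector</span>. The free functions <span style="font-family: "Courier New",Courier,monospace;">fill_level_order</span>, <span style="font-family: "Courier New",Courier,monospace;">make_level_order</span> and <span style="font-family: "Courier New",Courier,monospace;">level_order_lower_bound</span> are names invented for this example and are not necessarily how the prototype is organized; the search returns an index rather than an iterator.</div>

```cpp
#include <cstddef>
#include <vector>

// Copy a sorted sequence into level order: an in-order walk of the implicit
// tree rooted at index i consumes the sorted values one by one.
template <typename T>
void fill_level_order(const std::vector<T>& sorted, std::vector<T>& out,
                      std::size_t i, std::size_t& next)
{
  if (i >= out.size()) return;
  fill_level_order(sorted, out, 2 * i + 1, next);  // left subtree first
  out[i] = sorted[next++];                         // then the node itself
  fill_level_order(sorted, out, 2 * i + 2, next);  // then the right subtree
}

// Build the level-order layout of an already sorted vector.
template <typename T>
std::vector<T> make_level_order(const std::vector<T>& sorted)
{
  std::vector<T> out(sorted.size());
  std::size_t next = 0;
  fill_level_order(sorted, out, 0, next);
  return out;
}

// Same loop as the member function above: returns the index of the first
// element not less than x, or impl.size() if there is no such element.
template <typename T>
std::size_t level_order_lower_bound(const std::vector<T>& impl, const T& x)
{
  std::size_t n = impl.size(), i = n, j = 0;
  while (j < n) {
    if (impl[j] < x) {
      j = 2 * j + 2;   // everything here and to the left is too small
    }
    else {
      i = j;           // candidate result; look for a smaller one on the left
      j = 2 * j + 1;
    }
  }
  return i;
}
```

As an example, the sorted values 0 to 6 are laid out as {3, 1, 5, 0, 2, 4, 6}, the root 3 coming first, followed by the second level 1, 5 and then the leaves.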
<div style="text-align: left;">
A <a href="https://www.dropbox.com/s/gn5mmfqn47edlm5/levelorder_vector.cpp?dl=0">test program</a> is provided that measures <span style="font-family: "Courier New",Courier,monospace;">lower_bound</span> execution times with a random sequence of values on containers <span style="font-family: "Courier New",Courier,monospace;">std::set</span>, <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span> and <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> holding <i>n</i> = 10,000 to 3 million elements.</div>
<div style="text-align: left;">
<b>MSVC on Windows</b></div>
<div style="text-align: left;">
Test built with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz. Depicted values are microseconds/<i>n</i>, where <i>n</i> is the number of elements in the container.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-bceJYXENYLw/VYukuNbvwRI/AAAAAAAABOE/xWsY03Lx6ME/s1600/binary_lookup_msvc.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="http://1.bp.blogspot.com/-bceJYXENYLw/VYukuNbvwRI/AAAAAAAABOE/xWsY03Lx6ME/s1600/binary_lookup_msvc.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><span style="font-family: "Courier New",Courier,monospace;">lower_bound</span> execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
The theoretical lookup times for the three containers are O(log <i>n</i>), which should show as straight lines (horizontal scale is logarithmic), but increasing cache misses as <i>n</i> grows result in some upwards bending. As predicted, <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> performs better than <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span>, with improvement factors ranging from 10% to 30%. Both containers are nevertheless much faster than <span style="font-family: "Courier New",Courier,monospace;">std::set</span>, and don't show much degradation until <i>n</i> ≃ 5·10<sup>5</sup> (~7·10<sup>4</sup> for <span style="font-family: "Courier New",Courier,monospace;">std::set</span>), which is a direct consequence of their superior cache friendliness.</div>
<div style="text-align: left;">
<b>GCC on Linux</b></div>
<div style="text-align: left;">
The following people have kindly sent in test outputs for GCC on various Linux platforms: Philippe Navaux, <a href="https://xckd.wordpress.com/">xckd</a>, <a href="https://plus.google.com/+ManuS%C3%A1nchezManu343726/posts">Manu Sánchez</a>. Results are very similar, so we just show those corresponding to GCC 5.1 on an Intel Core i7-860 @ 2.80GHz with Linux kernel 4.0 from Debian:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-fN6zjHMeGIY/VYxO3KbG1eI/AAAAAAAABOY/03gQuJV59rY/s1600/binary_lookup_gcc51.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="http://3.bp.blogspot.com/-fN6zjHMeGIY/VYxO3KbG1eI/AAAAAAAABOY/03gQuJV59rY/s1600/binary_lookup_gcc51.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><span style="font-family: "Courier New",Courier,monospace;">lower_bound</span> execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
Although <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> performs generally better than <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span>, there is a strange phenomenon for which I don't have an explanation: improvement factors are around 40% for low values of <i>n</i>, but then decrease or even become negative as <i>n</i> approaches 3·10<sup>6</sup>. I'd be grateful if readers can offer interpretations of this. </div>
<div style="text-align: left;">
<a href="https://github.com/GrayShade">Laurentiu Nicola</a> has later submitted results with <span style="font-family: "Courier New",Courier,monospace;">-march=native -O2</span> settings that differ significantly from the previous ones: GCC 5.1 on an Intel Core i7-2760QM CPU @ 2.40GHz (Linux distro undisclosed): </div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-bslLMBZtZFw/VY3nPe0-5ZI/AAAAAAAABPg/PXo1gzt5LAA/s1600/binary_lookup_gcc51_i7_native.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="http://4.bp.blogspot.com/-bslLMBZtZFw/VY3nPe0-5ZI/AAAAAAAABPg/PXo1gzt5LAA/s1600/binary_lookup_gcc51_i7_native.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><span style="font-family: "Courier New",Courier,monospace;">lower_bound</span> execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
and GCC 5.1 on an Intel Celeron J1900 @ 1.99GHz:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-aaRnTnjd5CI/VY3nzLtcY3I/AAAAAAAABPo/DlRqmtvJcvk/s1600/binary_lookup_gcc51_celeron_native.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="http://4.bp.blogspot.com/-aaRnTnjd5CI/VY3nzLtcY3I/AAAAAAAABPo/DlRqmtvJcvk/s1600/binary_lookup_gcc51_celeron_native.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><span style="font-family: "Courier New",Courier,monospace;">lower_bound</span> execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
<span style="font-family: "Courier New",Courier,monospace;">-march=native</span> seems to make a big difference: now <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> execution times are up to 25% (i7) and 40% (Celeron) lower than <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span> when <i>n</i> approaches 3·10<sup>6</sup>, but there is no significant gain for low values of <i>n</i>.</div>
<div style="text-align: left;">
<b>Clang on OS X</b></div>
<div style="text-align: left;">
As provided by <a href="https://www.blogger.com/profile/16286395755438527078">fortch</a>: clang-500.2.79 in <span style="font-family: "Courier New",Courier,monospace;">-std=c++0x</span> mode on a Darwin 13.1.0 box with an Intel Core i7-3740QM CPU @ 2.70GHz.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-5X5A65Akvgk/VYxS-4x7bJI/AAAAAAAABOk/29dXzjxRZ9Q/s1600/binary_lookup_clang500279_darwin.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="http://2.bp.blogspot.com/-5X5A65Akvgk/VYxS-4x7bJI/AAAAAAAABOk/29dXzjxRZ9Q/s1600/binary_lookup_clang500279_darwin.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><span style="font-family: "Courier New",Courier,monospace;">lower_bound</span> execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
The improvement of <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> with respect to <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span> ranges from 10% to 20%.</div>
<b>Clang on Linux</b><br />
<div style="text-align: left;">
Manu Sánchez has run the tests with Clang 3.6.1 on an Arch Linux 3.16 box with an Intel Core i7 4790k @4.5GHz. In the first case he has used the <a href="https://gcc.gnu.org/libstdc++/">libstdc++-v3</a> standard library:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-JkZOfjxZIzA/VYxVTILGU5I/AAAAAAAABOw/CeHqB2_g7_Y/s1600/binary_lookup_clang361_libstdc%252B%252B_arch.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="http://4.bp.blogspot.com/-JkZOfjxZIzA/VYxVTILGU5I/AAAAAAAABOw/CeHqB2_g7_Y/s1600/binary_lookup_clang361_libstdc%252B%252B_arch.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><span style="font-family: "Courier New",Courier,monospace;">lower_bound</span> execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
and then repeated with <a href="http://libcxx.llvm.org/">libc++</a>:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-M_uKlIY8i9A/VYxVn0eHjvI/AAAAAAAABO4/c1aWd-ljIvk/s1600/binary_lookup_clang361_libc%252B%252B_arch.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="http://1.bp.blogspot.com/-M_uKlIY8i9A/VYxVn0eHjvI/AAAAAAAABO4/c1aWd-ljIvk/s1600/binary_lookup_clang361_libc%252B%252B_arch.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><span style="font-family: "Courier New",Courier,monospace;">lower_bound</span> execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
Results are similar, with <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> execution times between 5% and 20% lower than those of <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span>. The very noticeable bumps on the graphs for these two containers when using libc++ are also present on the tests for Clang on OS X but do not appear if the standard library used is libstdc++-v3, which suggests that they might be due to some artifact related to the libc++ memory allocator.</div>
<div style="text-align: left;">
Laurentiu's results for Clang 3.6.1 with <span style="font-family: "Courier New",Courier,monospace;">-march=native -O2</span> are similar to Manu's for an i7 CPU, whereas for an Intel Celeron J1900 @ 1.99GHz we have:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-oUHuVq3Wp-A/VY3qdfS8DUI/AAAAAAAABP8/nKBvdJtq9Ts/s1600/binary_lookup_clang361_celeron_native.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="http://4.bp.blogspot.com/-oUHuVq3Wp-A/VY3qdfS8DUI/AAAAAAAABP8/nKBvdJtq9Ts/s1600/binary_lookup_clang361_celeron_native.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><span style="font-family: "Courier New",Courier,monospace;">lower_bound</span> execution times / number of elements.</b></td></tr>
</tbody></table>
<div style="text-align: left;">
where <span style="font-family: "Courier New",Courier,monospace;">levelorder_vector</span> execution times are up to 35% lower than those of <span style="font-family: "Courier New",Courier,monospace;">boost::container::flat_set</span>.</div>
<div style="text-align: left;">
<b>Conclusions</b></div>
<div style="text-align: left;">
Even though binary search on sorted vectors performs much better than node-based data structures, alternative contiguous layouts can achieve even more cache friendliness. We have shown a possibility based on arranging elements according to the breadth-first order of a binary tree holding the sorted sequence. This data structure can possibly perform worse, though, in other operations like insertion and deletion: we defer the study of this matter to a future entry.</div>
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com21tag:blogger.com,1999:blog-2715968472735546962.post-33638973580945547752015-05-29T00:33:00.000+02:002015-05-30T17:49:27.081+02:00A case in political reasoning<div style="text-align: left;">
Municipal elections in Spain have just been held with interesting results. In Barcelona, the nationalist ruling party <a href="http://www.ciu.cat/">CiU</a>, noted for its recent shift towards unreserved support for the independence of Catalonia, has lost to <a href="https://barcelonaencomu.cat/">Barcelona en Comú</a>, a political alliance loosely associated with left-wing <a href="http://podemos.info/">Podemos</a>, ambivalent on these matters. In Madrid, the party in office and undisputed winner for more than twenty years, right-wing <a href="http://www.pp.es/">PP</a>, has lost a great deal of support, almost yielding first place to <a href="http://programa.ahoramadrid.org/">Ahora Madrid</a>, also linked to Podemos, which is now in a better position to form a government. Commenting on the elections, CiU's Artur Mas, president of the <a href="http://web.gencat.cat/ca/inici/">Catalonian regional government</a>, <a href="http://ccaa.elpais.com/ccaa/2015/05/28/catalunya/1432835149_216735.html">says</a> (translation mine):</div>
<div style="text-align: left;">
<i>Mas has discarded the possibility that CiU's defeat in Barcelona has been influenced by the party's pro-independence discourse. "Madrid might also have a mayor from Podemos, and there is no independentist movement there", he said.</i></div>
<div style="text-align: left;">
Is Mr. Mas's argument sound? Let us analyze it formally.</div>
<div style="text-align: left;">
<b>Logical analysis</b></div>
<div style="text-align: left;">
Write</div>
<div style="text-align: center;">
<i>i</i>(<i>x</i>) = party in power in county <i>x</i> supports independence of the region,<br />
<i>p</i>(<i>x</i>) = Podemos (or associated alliances) gains support in <i>x</i> over the party in power,</div>
<div style="text-align: left;">
where <i>x</i> ranges over all counties in Spain. Then, Mas's reasoning can be formulated as</div>
<div style="text-align: center;">
It is not the case that <span class="st">∀<i>x</i> <i>i</i>(<i>x</i>) </span><span class="_Tgc">→ <i>p</i>(<i>x</i>) because </span><span class="_Tgc">¬<i>i</i>(Madrid) </span>∧ <i>p</i>(Madrid).</div>
<div style="text-align: left;">
Put this way, the argument is blatantly invalid: a refutation of <span class="st">∀<i>x</i> <i>i</i>(<i>x</i>) </span><span class="_Tgc">→ <i>p</i>(<i>x</i>)</span> would require a county with a pro-independence governing party where Podemos did not succeed, which is not the case for Madrid. Let us try another approach.</div>
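<div style="text-align: left;">
As a minimal sketch of this point, the snippet below searches for counterexamples to the universal statement. The truth values assigned to each county are hypothetical placeholders chosen to mirror the discussion, not electoral data, and the helper names <code>implies</code> and <code>counterexamples</code> are introduced only for illustration.</div>

```python
# Hypothetical truth values, for illustration only (not real electoral data).
# Each county maps to (i, p):
#   i = party in power supports independence,
#   p = Podemos (or an associated alliance) gains support over that party.
counties = {
    "Barcelona": (True, True),
    "Madrid": (False, True),
}


def implies(a, b):
    """Material implication a -> b."""
    return (not a) or b


def counterexamples(data):
    """Counties that refute 'for all x, i(x) -> p(x)': i is true and p is false."""
    return [name for name, (i, p) in data.items() if not implies(i, p)]


# Madrid cannot appear in this list because i(Madrid) is false.
print(counterexamples(counties))
```

<div style="text-align: left;">
Only a county with <i>i</i>(<i>x</i>) true and <i>p</i>(<i>x</i>) false would show up in the result, so an observation of the form ¬<i>i</i>(<i>x</i>) ∧ <i>p</i>(<i>x</i>) leaves the universal statement untouched.</div>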
<div style="text-align: left;">
<b>Probabilistic analysis</b></div>
<div style="text-align: left;">
If Mas was thinking in statistical terms, a plausible formulation could be</div>
<div style="text-align: center;">
It is not the case that <i>P</i>(<i>p</i>(<i>x</i>) | <i>i</i>(<i>x</i>)) > <i>P</i>(<i>p</i>(<i>x</i>)) because <span class="_Tgc">¬<i>i</i>(Madrid) </span>∧ <i>p</i>(Madrid),</div>
<div style="text-align: left;">
that is, the probability that Podemos gains power in a pro-independence context is not higher than in general, as supported by the results in Madrid. If we write</div>
<div style="text-align: center;">
<i>PI</i> = number of counties with pro-independence councils where Podemos wins,<br />
<i>I</i> = number of counties with pro-independence councils where Podemos runs,<br />
<i>P</i> = number of counties where Podemos wins,<br />
<i>N</i> = total number of counties where Podemos runs,</div>
<div style="text-align: left;">
then Mas affirms that</div>
<div style="text-align: center;">
<i>PI</i> / <i>I</i> <span class="_Tgc">≤<i> P </i>/ <i>N</i></span></div>
<div style="text-align: left;">
or, equivalently,</div>
<div style="text-align: center;">
<i>D</i> = (<span class="_Tgc"><i>P </i>/ <i>N</i>)<i> </i></span><span class="_Tgc">−</span> (<i>PI</i> / <i>I</i>) ≥ 0.</div>
Does the fact that <span class="_Tgc">¬<i>i</i>(Madrid) </span>∧ <i>p</i>(Madrid) lend any support to this contention? Let<br />
<div style="text-align: center;">
<i>D'</i> = (<span class="_Tgc"><i>P</i></span><span class="_Tgc"><i><i>'</i> </i>/ (<i>N</i></span><span class="_Tgc"><span class="_Tgc"> − 1</span>))<i> </i></span><span class="_Tgc">−</span> (<i>PI</i><i>'</i> / <i>I</i><i>'</i>)</div>
<div style="text-align: left;">
be the value resulting from considering every county in Spain where Podemos ran except Madrid. When taking Madrid into account, we have</div>
<div style="text-align: center;">
<i>D</i> = ((<span class="_Tgc"><i>P</i></span><span class="_Tgc"><i><i>'</i></i></span> + 1) / <i>N</i>) <span class="_Tgc">−</span> (<i>PI</i><i>'</i> / <i>I</i><i>'</i>)</div>
<div style="text-align: left;">
</div>
<div style="text-align: left;">
and, consequently,</div>
<div style="text-align: center;">
<i>D</i> <span class="_Tgc">−</span> <i>D'</i> = ((<span class="_Tgc"><i>P</i></span><span class="_Tgc"><i><i>'</i></i></span> + 1) / <i>N</i>) <span class="_Tgc">− </span><span class="_Tgc">(<span class="_Tgc"><i>P</i></span><span class="_Tgc"><i><i>'</i> </i>/ (<i>N</i></span><span class="_Tgc"><span class="_Tgc"> − 1</span>)),</span> </span> </div>
<div style="text-align: left;">
which is always non-negative (since <span class="_Tgc"><span class="_Tgc"><i>P</i></span><span class="_Tgc"><i><i>'</i> </i></span></span> is never greater than <span class="_Tgc"><span class="_Tgc"><i>N</i></span><span class="_Tgc"><span class="_Tgc"> − 1</span></span></span>), and furthermore strictly positive if <span class="_Tgc"><span class="_Tgc"><i>P</i></span><span class="_Tgc"><i><i>'</i> </i>< <i>N</i></span><span class="_Tgc"><span class="_Tgc"> − 1</span>, that is, if Podemos has lost somewhere (which is certainly the case). So, knowing about the existence of counties </span></span><span class="_Tgc"><span class="_Tgc">without pro-independence councils where Podemos won increases the chances that, in effect, <i>D</i> </span></span>≥ 0 and, consequently, that independentism does not play a beneficial role in the success of this party. But if this knowledge concerns just one county out of the hundreds where Podemos ran, the degree of extra confidence we gain is extremely small. This is the very weak sense in which Mas's argument can be held valid.</div>
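<div style="text-align: left;">
Putting both terms over the common denominator <i>N</i>(<i>N</i> − 1) gives <i>D</i> − <i>D'</i> = (<i>N</i> − 1 − <i>P'</i>) / (<i>N</i>(<i>N</i> − 1)), which makes the sign and the smallness of the shift explicit. The sketch below checks this identity with exact rational arithmetic. The function names and the sizes <i>N</i> = 500, <i>P'</i> = 100 are assumptions made only to show the order of magnitude; they are not actual counts.</div>

```python
from fractions import Fraction


def shift(n, p_prime):
    """D - D' = (P' + 1)/N - P'/(N - 1), computed exactly."""
    return Fraction(p_prime + 1, n) - Fraction(p_prime, n - 1)


def closed_form(n, p_prime):
    """The same quantity after putting both terms over N(N - 1)."""
    return Fraction(n - 1 - p_prime, n * (n - 1))


# Both expressions agree for every admissible pair (N >= 2, 0 <= P' <= N - 1).
assert all(
    shift(n, p) == closed_form(n, p)
    for n in range(2, 60)
    for p in range(0, n)
)

# Hypothetical sizes, used only to show how small the shift is.
n, p_prime = 500, 100
print(float(shift(n, p_prime)))
```

<div style="text-align: left;">
The shift is bounded above by 1 / <i>N</i>, so with hundreds of counties a single observation moves <i>D</i> by a fraction of a percentage point at most.</div>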
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com1tag:blogger.com,1999:blog-2715968472735546962.post-90502035488553269592015-01-11T18:39:00.000+01:002016-01-11T12:40:17.838+01:00Gas price hysteresis, Spain 2014<div style="text-align: left;">
In 2008 we did a <a href="http://bannalia.blogspot.com/2008/11/more-on-price-gas-hysteresis.html">hysteresis analysis</a> to try to understand how variations in the price of oil translate to retail prices of gasoline and gasoil (diesel) in Spain. As crude oil prices have plummeted in the last few months, spurring some interest in whether consumers have benefited from this as much as they should, we redo the analysis here with 2014 data:</div>
<div style="text-align: left;">
<ul>
<li>Retail gasoline and gasoil prices from the <a href="http://ec.europa.eu/energy/observatory/oil/bulletin_en.htm">European Commission Oil Bulletin</a>.</li>
<li>Brent oil spot prices from the <a href="http://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RBRTE&f=D">US Energy Information Administration</a>.</li>
<li>Euro to dollar exchange rates from <a href="http://www.oanda.com/convert/fxhistory">FXHistory</a>.</li>
</ul>
</div>
<div style="text-align: left;">
The figure shows the weekly evolution during 2014 of Brent oil prices and of average pre-tax retail prices of 95-octane gasoline and gasoil in Spain, all in c€ per liter.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-IVHjbqQpFEM/VLKkX9wY2uI/AAAAAAAABLA/pG6EBxLmM04/s1600/gas_prices.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="272" src="http://1.bp.blogspot.com/-IVHjbqQpFEM/VLKkX9wY2uI/AAAAAAAABLA/pG6EBxLmM04/s1600/gas_prices.png" width="400" /></a></div>
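<div style="text-align: left;">
Putting Brent on the same scale as retail prices requires converting dollars per barrel into c€ per liter. The sketch below shows one way to do it, assuming the exchange rate is expressed as dollars per euro and taking a barrel as 42 US gallons (about 158.987 liters). The function name and the input values are hypothetical, and the exact processing applied to the data sets listed above is not described here.</div>

```python
LITERS_PER_BARREL = 158.987  # one oil barrel is 42 US gallons


def brent_cents_eur_per_liter(usd_per_barrel, usd_per_eur):
    """Convert a Brent spot price in $/barrel to c€/liter.

    usd_per_eur is assumed to be the number of dollars one euro buys.
    """
    eur_per_barrel = usd_per_barrel / usd_per_eur
    return 100.0 * eur_per_barrel / LITERS_PER_BARREL


# Hypothetical inputs, not taken from the 2014 data set.
print(round(brent_cents_eur_per_liter(100.0, 1.25), 2))
```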
<div style="text-align: left;">
Below is the scatter plot of Δ(gasoline price before taxes) against Δ(Brent price),</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-7ndrvRsJXn8/VLKsWP8L7vI/AAAAAAAABLY/wUG3vZfA7ps/s1600/dgasoline_dbrent.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="http://1.bp.blogspot.com/-7ndrvRsJXn8/VLKsWP8L7vI/AAAAAAAABLY/wUG3vZfA7ps/s1600/dgasoline_dbrent.png" width="393" /></a></div>
<div style="text-align: left;">
with linear regressions for the half-planes Δ(Brent price) ≥ 0 and ≤ 0, given by</div>
<div style="text-align: center;">
ΔBrent ≥ 0 → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span><sub>+</sub>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span><sub>+</sub> + <span style="font-style: italic;">m</span><sub>+</sub><span style="font-style: italic;">x</span> = 0.0734 − 0.3685<span style="font-style: italic;">x</span>,<br />
ΔBrent ≤ 0 → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span><sub>−</sub>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span><sub>−</sub> + <span style="font-style: italic;">m</span><sub>−</sub><span style="font-style: italic;">x</span> = 0.2338 + 0.5447<span style="font-style: italic;">x</span>,</div>
<div style="text-align: left;">
respectively (shown in the plot as dotted red lines). This is entirely unexpected: the negative slope <span style="font-style: italic;">m</span><sub>+</sub> means that retail gasoline prices tend to fall even when oil prices rise. The overall regression line for the scatter plot (dotted black line) is </div>
<div style="text-align: center;">
<span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span> + <span style="font-style: italic;">m</span><span style="font-style: italic;">x</span> = −0.2489 + 0.4208<span style="font-style: italic;">x</span>.</div>
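<div style="text-align: left;">
The three fits can be reproduced along the following lines: an ordinary least-squares line for the points with ΔBrent ≥ 0, another for ΔBrent ≤ 0, and one for all points. This is a sketch rather than the exact procedure used for the plots; the helper names are illustrative and the sample pairs are made up, not the 2014 data.</div>

```python
def fit_line(points):
    """Ordinary least-squares fit y = b + m*x; returns (b, m)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = sxy / sxx
    return mean_y - m * mean_x, m


def split_fits(points):
    """Fit the half-planes x >= 0 and x <= 0 separately, then all points."""
    plus = fit_line([(x, y) for x, y in points if x >= 0])
    minus = fit_line([(x, y) for x, y in points if x <= 0])
    return plus, minus, fit_line(points)


# Made-up weekly (ΔBrent, Δretail price) pairs, not the 2014 data.
sample = [(-2.0, -1.0), (-1.0, -0.5), (1.0, 0.25), (2.0, 0.5)]
(b_plus, m_plus), (b_minus, m_minus), (b, m) = split_fits(sample)
print(m_plus, m_minus, m)
```

<div style="text-align: left;">
Points with ΔBrent = 0 satisfy both conditions and therefore take part in both half-plane fits, which matches the ≥ and ≤ used in the definitions.</div>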
<div style="text-align: left;">
As for gasoil, we have</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-XQO3MBExjwU/VLKt7dQjphI/AAAAAAAABLk/kYgo7IX8kKg/s1600/dgasoil_dbrent.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="http://1.bp.blogspot.com/-XQO3MBExjwU/VLKt7dQjphI/AAAAAAAABLk/kYgo7IX8kKg/s1600/dgasoil_dbrent.png" width="393" /></a></div>
<div style="text-align: left;">
and</div>
<div style="text-align: left;">
<div style="text-align: center;">
ΔBrent ≥ 0 → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span><sub>+</sub>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span><sub>+</sub> + <span style="font-style: italic;">m</span><sub>+</sub><span style="font-style: italic;">x</span> = −0.1965 − 0.1115<span style="font-style: italic;">x</span>,<br />
ΔBrent ≤ 0 → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span><sub>−</sub>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span><sub>−</sub> + <span style="font-style: italic;">m</span><sub>−</sub><span style="font-style: italic;">x</span> = 0.16 + 0.4498<span style="font-style: italic;">x</span>,<br />
<span style="font-style: italic;"> </span>overall → <span style="font-style: italic;">y</span> = <span style="font-style: italic;">f</span>(<span style="font-style: italic;">x</span>) = <span style="font-style: italic;">b</span> + <span style="font-style: italic;">m</span><span style="font-style: italic;">x</span> = −0.2169 + 0.5039<span style="font-style: italic;">x</span>, </div>
which is qualitatively similar to the behavior for gasoline. In both cases, retail prices under constant oil prices (that is, when ΔBrent is small) have decreased by around 0.2 c€/l per week.</div>
<div style="text-align: left;">
So, oil's fall during the second half of 2014 has of course resulted in a general reduction of gas prices, but these were also showing a downward correction trend during the first half of the year, when oil rises were translated less strongly than drops. At least for 2014, we haven't found evidence of the "rocket and feather" effect fuel companies are usually suspected of.</div>
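<div style="text-align: left;">
One simple way to read the "rocket and feather" effect off the fits is to compare slopes: the effect would show up as rises passing through more strongly than drops, that is, <span style="font-style: italic;">m</span><sub>+</sub> > <span style="font-style: italic;">m</span><sub>−</sub>. The sketch below applies this simplified criterion, which is an assumption made for illustration and not a formal test, to the slopes reported above.</div>

```python
# Slopes reported above for 2014 (m_plus: oil rising, m_minus: oil falling).
reported = {
    "gasoline": {"m_plus": -0.3685, "m_minus": 0.5447},
    "gasoil": {"m_plus": -0.1115, "m_minus": 0.4498},
}


def rocket_and_feather(m_plus, m_minus):
    """Simplified criterion: rises pass through more strongly than drops."""
    return m_plus > m_minus


for fuel, slopes in reported.items():
    print(fuel, rocket_and_feather(**slopes))
```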
Joaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.com0