tag:blogger.com,1999:blog-2715968472735546962.post4181017857610656547..comments2023-12-14T02:21:18.222+01:00Comments on Bannalia: trivial notes on themes diverse: Cache-friendly binary searchJoaquín M López Muñozhttp://www.blogger.com/profile/08579853272674211100noreply@blogger.comBlogger21125tag:blogger.com,1999:blog-2715968472735546962.post-43008389343114793052017-03-07T08:15:43.308+01:002017-03-07T08:15:43.308+01:00Thanks for a great blogpost!
I've implemented ...Thanks for a great blogpost!<br />I've implemented this layout as a generic STL-like associative container:<br />https://github.com/mikekazakov/eytzinger/<br />https://kazakov.life/2017/03/06/cache-friendly-associative-container/Anonymoushttps://www.blogger.com/profile/12014080405101263023noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-22742255365363132482016-10-26T10:10:37.769+02:002016-10-26T10:10:37.769+02:00Hi Fabio,
Yes, this arrangement is called "E...Hi Fabio,<br /><br />Yes, this arrangement is called "Eytzinger layout" in the academic literature. You can, for instance, follow the link below for more info and additional references:<br /><br />https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdfJoaquín M López Muñozhttps://www.blogger.com/profile/08579853272674211100noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-28908536733368915592016-10-25T17:52:22.912+02:002016-10-25T17:52:22.912+02:00Hi. I am writing a paper on vectorization (SIMD) o...Hi. I am writing a paper on vectorization (SIMD) of various algorithms to search in a sorted array. I like this approach and I would like to add it to the set of algorithm I am testing. Is this algorithm described in any journal publication? FabAnonymoushttps://www.blogger.com/profile/15104005203920816939noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-45203817795599141522015-07-14T03:30:10.458+02:002015-07-14T03:30:10.458+02:00Wow, a great Interview question for senior develop...Wow, a great Interview question for senior developers ... Thanks <a href="http://java67.blogspot.com" rel="nofollow">JP</a>Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-16949482275824910222015-07-07T14:25:34.052+02:002015-07-07T14:25:34.052+02:00"less cache misses" -> "fewer ca...<i>"less cache misses" -> "fewer cache misses"</i><br /><br />Corrected, thank you.<br /><br /><i>And now we need to combine the two algorithms [...]</i><br /><br />Yes, that would make for an interesting variation, I might try if time permits.Joaquín M López Muñozhttps://www.blogger.com/profile/08579853272674211100noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-69611383867194522652015-07-07T09:11:21.202+02:002015-07-07T09:11:21.202+02:00"less cache misses" -> "fewer ca..."less cache misses" -> "fewer cache misses"<br /><br />And now we need to combine the two algorithms: levelorder_vector to start the search then switch to Ye Olde plain vector when the search radius falls below the cache size (or something like that, to be profiled).Geraldhttps://www.blogger.com/profile/12980942597800162053noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-80244540080538039602015-06-27T11:01:26.540+02:002015-06-27T11:01:26.540+02:00> Seems like Blogger does not have that option....> Seems like Blogger does not have that option. Do you think otherwise?<br /><br />Nope. Seems the only way is to delete and rewrite, as I'm doing for this one...Anonymoushttps://www.blogger.com/profile/05885986452155138099noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-17469773477810193802015-06-27T10:59:36.207+02:002015-06-27T10:59:36.207+02:00This comment has been removed by the author.Anonymoushttps://www.blogger.com/profile/05885986452155138099noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-47808075721804632782015-06-27T02:39:29.477+02:002015-06-27T02:39:29.477+02:00Hi Chris, you're right traversing this structu...Hi Chris, you're right traversing this structure is bound to be slower than boost::container::flat_set, and I'd like to study this in detail in a future entry, if time permits. As for hash tables, they're definitely faster for lookup (check for instance <a href="http://bannalia.blogspot.com/2014/01/a-better-hash-table.html" rel="nofollow">this entry</a>, where lookup times for MSVC are around 2-3x faster than shown here), but then you lose the ability to traverse the elements in ordered fashion. At the end of the day, data structures are about balancing trade-offs among the different usage scenarios one is interested in.Joaquín M López Muñozhttps://www.blogger.com/profile/08579853272674211100noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-12352989323329661652015-06-27T02:27:49.089+02:002015-06-27T02:27:49.089+02:00Thank you for the submission, just included in the...Thank you for the submission, just included in the article. I don't know what happened to your previous comment, it does't appear anywhere on my control panel (spam, waiting moderation, etc.) Joaquín M López Muñozhttps://www.blogger.com/profile/08579853272674211100noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-83473653360017742532015-06-26T18:59:49.625+02:002015-06-26T18:59:49.625+02:00I don't know if my previous comment made it, b...I don't know if my previous comment made it, but here are my results: https://gist.github.com/GrayShade/807e1efd4a42a47e1506 .GrayShadehttps://www.blogger.com/profile/08614811410089147863noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-5620200455054997472015-06-26T14:49:09.308+02:002015-06-26T14:49:09.308+02:00This comes at the cost of decreased in-order trave...This comes at the cost of decreased in-order traversal. The nice thing about flat_map is that it is ordered such that in-order traversal moves sequentially in memory. Not only cache friendly, but prefetchers love that.<br /><br />That being said, maybe you don't need to do that often and you spend more time searching, but if that's the case, one should consider if ordering is needed at all, and if not, consider a hash table instead.Anonymoushttps://www.blogger.com/profile/14731979716846301849noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-28484880619968911652015-06-25T21:57:53.123+02:002015-06-25T21:57:53.123+02:00Are threaded comments enabled Joaquin?
I've j...<i>Are threaded comments enabled Joaquin?</i><br /><br />I've just done so.<br /><br /><i>Also, it's possible to edit comments?</i><br /><br />Seems like Blogger does not have that option. Do you think otherwise?Joaquín M López Muñozhttps://www.blogger.com/profile/08579853272674211100noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-79031959204692691222015-06-25T18:53:11.770+02:002015-06-25T18:53:11.770+02:00Someone else started looking into memory layouts f...Someone else started looking into <a href="http://cglab.ca/~morin/misc/arraylayout/" rel="nofollow">memory layouts for binary search</a> not too long ago.Ryanhttps://www.blogger.com/profile/17574626231451600275noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-91707064123906458042015-06-25T18:39:28.937+02:002015-06-25T18:39:28.937+02:00Because virtual memory is a hardware feature. So t...Because virtual memory is a hardware feature. So the prefetching, caching, etc, works in virtual address space, not physical address space. Anonymoushttps://www.blogger.com/profile/14731979716846301849noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-30891626113734147922015-06-25T17:23:06.409+02:002015-06-25T17:23:06.409+02:00Are threaded comments enabled Joaquin? I'm not...<i>Are threaded comments enabled Joaquin? I'm not sure how I can directly answer to Sudsy. Also, it's possible to edit comments?</i><br /><br />@Sudsy typical L1 caches have lines 64 bytes wide. On most OSs VM pages are about 4KB. These are completely different beasts. Note the scale, VM locality does not affect cache since VM works in units orders of magnitude bigger.Anonymoushttps://www.blogger.com/profile/05885986452155138099noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-41836897530819648912015-06-25T16:03:40.579+02:002015-06-25T16:03:40.579+02:00One thing I've never understood about algorith...One thing I've never understood about algorithms that attempt to optimize for cache effects is how you could expect these algorithms to work when they take place in virtual memory locations, which can be mapped to unpredictable physical memory locations. Since the cache works on physical memory, I would think that many of the desirable performance improvements in such algorithms would go away if contiguous virtual memory is mapped to non-contiguous physical memory.Sudsyhttps://www.blogger.com/profile/06005665577735911530noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-475516773540752602015-06-25T14:51:04.038+02:002015-06-25T14:51:04.038+02:00Arch Linux 3.16 x86_64 Intel Core i7 4790k @4.5GHz...Arch Linux 3.16 x86_64 Intel Core i7 4790k @4.5GHz 16GB RAM<br /><br /><b>GCC 5.1.0 -Wall -O3</b><br /><br />std::set;boost::container::flat_set;levelorder_vector<br />10000;0.942178;0.649732;0.405604<br />12000;1.0163;0.661393;0.426551<br />14400;1.09829;0.673412;0.394561<br />17280;1.1708;0.68414;0.397729<br />20736;1.22752;0.696192;0.4574<br />24883;1.3168;0.70917;0.445425<br />29859;1.39845;0.719152;0.417932<br />35830;1.52342;0.732551;0.42978<br />42995;1.54287;0.74962;0.486241<br />51593;1.61979;0.758761;0.480406<br />61910;1.68448;0.776161;0.444349<br />74290;1.76935;0.809774;0.4911<br />89146;1.8292;0.832541;0.562394<br />106973;1.93549;0.865164;0.544891<br />128365;2.07622;0.891607;0.50524<br />154035;2.23172;0.924304;0.584387<br />184839;2.354;0.949735;0.678893<br />221803;2.66305;0.983547;0.641758<br />266159;2.94522;1.01617;0.635775<br />319386;3.49813;1.03448;0.771336<br />383258;3.94718;1.05866;0.847865<br />459904;4.07259;1.08642;0.929145<br />551879;4.16393;1.12095;0.917717<br />662249;5.14943;1.18708;0.857538<br />794693;5.98453;1.12341;0.848175<br />953625;6.41291;1.1347;0.830826<br />1144343;6.82237;1.15969;0.964873<br />1373204;7.1026;1.20294;1.03471<br />1647837;7.53148;1.21857;0.967981<br />1977396;8.0361;1.32692;1.0474<br />2372866;8.09712;1.41156;1.25784<br />2847430;8.53848;1.57566;1.74904<br /><br /><b>Clang 3.6.1 -Wall -O3 (gnu stdlibc++)</b>:<br /><br />std::set;boost::container::flat_set;levelorder_vector<br />10000;0.993368;0.666357;0.630939<br />12000;1.07131;0.673273;0.635633<br />14400;1.13079;0.683148;0.643084<br />17280;1.213;0.693476;0.658646<br />20736;1.33367;0.705337;0.656572<br />24883;1.3532;0.7131;0.66242<br />29859;1.41948;0.720393;0.671182<br />35830;1.53041;0.734108;0.680093<br />42995;1.63618;0.744925;0.686951<br />51593;1.73344;0.757208;0.693145<br />61910;1.74293;0.775989;0.70711<br />74290;1.84671;0.818686;0.733187<br />89146;1.87382;0.839728;0.748626<br />106973;1.96041;0.859623;0.765006<br />128365;2.06532;0.901873;0.782139<br />154035;2.212;0.923579;0.79369<br />184839;2.36325;0.950378;0.804458<br />221803;2.63924;0.988126;0.831224<br />266159;3.43489;1.01654;0.838551<br />319386;3.42463;1.03174;0.847947<br />383258;3.94939;1.06131;0.86878<br />459904;4.35884;1.08742;0.875501<br />551879;4.77623;1.13165;0.891176<br />662249;5.46352;1.15979;0.901064<br />794693;6.04875;1.13259;0.914436<br />953625;6.60138;1.15176;0.925438<br />1144343;6.8115;1.16985;0.943092<br />1373204;7.16612;1.1905;0.958147<br />1647837;7.74718;1.22039;0.994402<br />1977396;8.00013;1.27861;1.03661<br />2372866;8.50959;1.41785;1.14635<br />2847430;9.1114;1.56756;1.26845<br /><br /><b>Clang 3.6.1 -Wall -O3 (LLVM libc++)</b><br /><br />std::set;boost::container::flat_set;levelorder_vector<br />10000;1.20145;0.900542;0.858861<br />12000;1.21311;0.866906;0.82188<br />14400;1.24415;0.838789;0.79369<br />17280;1.46327;0.974148;0.935117<br />20736;1.47669;0.939277;0.888318<br />B24883;1.488;0.904221;0.848018<br />29859;1.53827;0.87751;0.821355<br />35830;1.78498;1.0131;0.949232<br />42995;1.72539;0.981349;0.913598<br />51593;1.76445;0.946652;0.886394<br />61910;1.7986;0.922748;0.864523<br />74290;2.04623;1.08704;0.994173<br />89146;2.15265;1.0754;0.961312<br />106973;2.14005;1.04145;0.934215<br />128365;2.25706;1.03695;0.9399<br />154035;3.4363;1.2096;1.06582<br />184839;3.12646;1.1852;1.0347<br />221803;3.5032;1.18023;1.00237<br />266159;3.7558;1.30898;1.12387<br />319386;3.51973;1.28644;1.0925<br />383258;3.62738;1.26987;1.06401<br />459904;4.03144;1.25995;1.07837<br />551879;4.82011;1.40394;1.18071<br />662249;4.98677;1.39467;1.13598<br />794693;5.61304;1.32911;1.10182<br />953625;6.1442;1.3126;1.07715<br />1144343;7.16668;1.44368;1.20948<br />1373204;6.92954;1.46197;1.23476<br />1647837;7.50992;1.41724;1.30037<br />1977396;7.88137;1.43963;1.17976<br />2372866;8.63361;1.69802;1.46556<br />2847430;8.58553;1.78567;1.52624<br /><br /><i>Interesting. GCC clearly wins the race, but note how Clang with stdlibc++ runs faster at the beginning, times get closer as the input grows, then libc++ wins at huge input sets.</i>Anonymoushttps://www.blogger.com/profile/05885986452155138099noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-28213298599395866222015-06-25T10:15:53.473+02:002015-06-25T10:15:53.473+02:00Compiled with g++ 4.9.2 on Debian:
g++ -Ofast -Wal...Compiled with g++ 4.9.2 on Debian:<br />g++ -Ofast -Wall -Wextra -Werror -std=c++11 -o test levelorder_vector.cpp<br /><br />The machine is an Intel(R) Core(TM) i7-4790T CPU @ 2.70GHz with 16GB RAM<br /><br />Binary search:<br />std::set;boost::container::flat_set;levelorder_vector<br />10000;1.10664;0.774499;0.533323<br />12000;1.18407;0.789164;0.555826<br />14400;1.27004;0.799802;0.54171<br />17280;1.34722;0.81454;0.530249<br />20736;1.43015;0.830834;0.594138<br />24883;1.50467;0.842233;0.596146<br />29859;1.5937;0.868157;0.566802<br />35830;1.75523;0.874971;0.5836<br />42995;1.83315;0.881442;0.647675<br />51593;1.89309;0.919203;0.639589<br />61910;1.94726;0.936893;0.614695<br />74290;2.19871;0.948165;0.651943<br />89146;2.07348;0.979815;0.735306<br />106973;2.221;1.01198;0.712115<br />128365;2.3946;1.058;0.692551<br />154035;2.4648;1.07647;0.851917<br />184839;2.80888;1.1326;0.908473<br />221803;3.40933;1.171;0.866847<br />266159;3.28181;1.1883;0.843519<br />319386;4.14999;1.2394;0.951587<br />383258;4.23496;1.27694;1.00523<br />459904;4.8146;1.28158;1.1303<br />551879;5.07588;1.37047;1.06402<br />662249;5.67157;1.40763;1.13232<br />794693;6.52337;1.426;1.52635<br />953625;6.65547;1.37775;1.16741<br />1144343;7.1171;1.4559;1.24296<br />1373204;7.27995;1.47189;1.37101<br />1647837;8.07396;1.88349;1.58652<br />1977396;8.25508;1.75467;1.60646<br />2372866;8.49498;1.84866;1.80135<br />2847430;9.44917;2.07216;2.32611<br />Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-64963539534128754622015-06-25T09:48:14.306+02:002015-06-25T09:48:14.306+02:00It has been a bit since I have updated my c++ libr...It has been a bit since I have updated my c++ libraries, don't mind the 0x heh<br /><br />$ clang++ -v<br />Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)<br />Target: x86_64-apple-darwin13.1.0<br />Thread model: posix<br /><br />$ sysctl -n machdep.cpu.brand_string<br />Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz<br /><br />$ clang++ preorder_vector.cpp -std=c++11 -O3<br /><a href="http://pastebin.com/dapmy9HZ" rel="nofollow">output_clang_11.txt</a><br /><br />$ clang++ preorder_vector.cpp -std=c++0x -O3<br /><a href="http://pastebin.com/10nihnGX" rel="nofollow">output_clang_0x.txt</a>fortchhttps://www.blogger.com/profile/16286395755438527078noreply@blogger.comtag:blogger.com,1999:blog-2715968472735546962.post-35614731897246940562015-06-25T03:35:29.740+02:002015-06-25T03:35:29.740+02:00I did a series of runs with g++ 4.{6789} and 5.1 -...I did a series of runs with g++ 4.{6789} and 5.1 -O3 on a Intel Core i7 860 @ 2.80GHz with Linux kernel 4.0 from Debian.<br /><br />The results can be found at http://charm.cs.illinois.edu/~phil/flat-btree/Unknownhttps://www.blogger.com/profile/10074148119346343309noreply@blogger.com