Monday, August 25, 2008

Cpp bignum arithmetic with memoization

The multiplication algorithm we implemented for C preprocessor bignums incurs a certain amount of repetitive work: multiplying bn1 by bn2 = bn21 bn22···bn2m, involves calculating the set of intermediate products bn1·bn21, ... ,bn1·bn2m, which includes as many repeated values as equal digits are in the decimal expansion of bn2. We can avoid recalculating those by applying memoization to the BN_DIGIT_MUL macro:

‎#define BN_CACHED_DIGIT_MUL(bn,y,cache) \
‎BOOST_PP_IF( \
‎ BOOST_PP_EQUAL( \
‎ 1, \
‎ BOOST_PP_SEQ_SIZE(BOOST_PP_SEQ_ELEM(y,cache))), \
‎ BN_CACHED_DIGIT_MUL_CASE1, \
‎ BN_CACHED_DIGIT_MUL_CASE2)(bn,y,cache)

‎#define BN_CACHED_DIGIT_MUL_CASE1(bn,y,cache) \
‎BN_CACHED_DIGIT_MUL_CASE1_IMPL(BN_DIGIT_MUL(bn,y),y,cache)

‎#define BN_CACHED_DIGIT_MUL_CASE1_IMPL(bny,y,cache) \
‎(bny)(BOOST_PP_SEQ_REPLACE(cache,y,(~)bny))

‎#define BN_CACHED_DIGIT_MUL_CASE2(bn,y,cache) \
‎(BOOST_PP_SEQ_TAIL(BOOST_PP_SEQ_ELEM(y,cache)))(cache)

BN_CACHED_DIGIT_MUL(bn,y,cahe) accepts a cache memory in the form of a Boost.Preprocessor sequence storing those of the values bn·0, bn·1, ..., bn·9 calculated so far (with a special format not worth discussing in detail here). If the cache already has the requested multiplication, then this is used, otherwise we calculate the value using BN_DIGIT_MUL and update the cache accordingly. The return value is a pair containing both the result and the updated cache, which can be passed later to subsequent invocations to BN_CACHED_DIGIT_MUL. To make BN_MUL benefit from the usage of BN_CACHED_DIGIT_MUL we have to add the corresponding cache as an additional member of the state structure that constitutes the core of the multiplication algorithm.

‎‎#define BN_MUL(bn1,bn2) \
‎BN_MUL_STATE_RESULT( \
‎ BOOST_PP_SEQ_FOLD_LEFT( \
‎ BN_MUL_OP, \
‎ ((~))(bn2) \
(((~)(0))((~)bn2)((~))((~))((~))((~))((~))((~))((~))((~))) \
‎ ((0)), \
‎ BOOST_PP_SEQ_REVERSE(bn1)))

‎#define BN_MUL_OP(s,state,x) \
‎BN_MUL_OP_IMPL( \
‎ x, \
‎ BN_MUL_STATE_SHIFT(state), \
‎ BN_MUL_STATE_BN2(state), \
‎ BN_MUL_STATE_CACHE(state), \
‎ BN_MUL_STATE_RESULT(state))

‎#define BN_MUL_OP_IMPL(x,shift,bn2,cache,result) \
‎BN_MUL_OP_IMPL_RETURN( \
‎ BN_CACHED_DIGIT_MUL(bn2,x,cache),shift,bn2,result)

‎#define BN_MUL_OP_IMPL_RETURN(bn2x_cache,shift,bn2,result) \
‎(shift(0)) \
‎(bn2) \
‎(BOOST_PP_SEQ_ELEM(1,bn2x_cache)) \
‎(BN_ADD( \
‎ result, \
‎ BOOST_PP_SEQ_ELEM(0,bn2x_cache)BOOST_PP_SEQ_TAIL(shift)))

‎#define BN_MUL_STATE_SHIFT(state) BOOST_PP_SEQ_ELEM(0,state)
‎#define BN_MUL_STATE_BN2(state) BOOST_PP_SEQ_ELEM(1,state)
‎#define BN_MUL_STATE_CACHE(state) BOOST_PP_SEQ_ELEM(2,state)
‎#define BN_MUL_STATE_RESULT(state) BOOST_PP_SEQ_ELEM(3,state)

The initial cache value is marked in red above, note that it already contains precomputed values for the easy cases bn2·0 and bn2·1.

We can also apply memoization to the implementation of BN_DIV: the macro BN_QDIGIT_REMAINDER(bn1,bn2), which returns the largest decimal digit q such that bn2·qbn1, along with bn1bn2·q, can be supplemented with a cache to store the values bn2·0, bn2·1, ..., bn2·9. The implementation details are entirely similar to those involved in memoizing BN_MUL. A full implementation of the memoized bignum arithmetic operations is provided.

Let us measure the improvement brought about by memoization. These 10-, 20- and 30-digit random numbers were selected with the aid of random.org:

n1 = 8579726378,
n2 = 88242457695260718048,
n3 = 7278842148,
n4 = 17478148223952531376,
n5 = 22737602045089519821,
n6 = 859621743146723715335628844223,
n7 = 97249055978154178611,
n8 = 502003375419877211491785910042,

and the following operations were timed for the plain and memoized versions of bignum multiplication and division macros; figures are averages of several individual measurements, tests were run with GCC 3.4.4 for Cygwin in a Windows box with a 1.66 GHz Intel Core 2 Duo and 1 GB RAM:

operationplain version
(s)
memoized version
(s)
speedup
factor
n1 · n211.2
9.1
19%
n2 · n113.3
9.7
27%
n3 · n49.8
7.0
28%
n4 · n312.0
9.7
19%
n5 · n627.2
20.6
24%
n6 · n530.0
22.3
25%
n7 · n827.9
19.9
28%
n8 · n731.7
22.0
30%
n2 / n18.9
7.8
13%
n4 / n35.9
4.9
17%
n6 / n511.7
9.5
19%
n8 / n711.2
9.8
12%

The speedup factor is defined as (pm)/p, where p is the time taken by the plain version and m the time with memoization. Improvements are noticeable though not dramatic. Multiplication could be further accelerated by using alternative algorithms like for instance Karatsuba's, but we decide to leave it at this for the moment.

No comments :

Post a Comment