Sunday, June 29, 2008

Cpp bignum comparison

Equality of two C preprocessor bignums is trivially established (negatively) when the lengths of their representative sequences differ:

#define BN_EQUAL(bn1,bn2) \
BOOST_PP_IF( \
  BOOST_PP_NOT_EQUAL( \
    BOOST_PP_SEQ_SIZE(bn1),BOOST_PP_SEQ_SIZE(bn2)), \
  BN_EQUAL_CASE1, \
  BN_EQUAL_CASE2)(bn1,bn2)

#define BN_EQUAL_CASE1(bn1,bn2) 0

When bn1 and bn2 have the same length, testing for equality involves doing the comparison digitwise. Boost.Preprocessor provides the macro BOOST_PP_SEQ_FOLD_LEFT to traverse the elements of a sequence. For instance, if bn1 is the sequence (bn1_1)...(bn1_m), the call

BOOST_PP_SEQ_FOLD_LEFT(OP,state_0,bn1)
expands to
OP(s,···OP(s,OP(s,state_0,bn1_1),bn1_2)···,bn1_m)
where the result of an invocation of OP is provided as a state argument to the next OP invocation (s is a leftover parameter from older versions of Boost.Preprocessor and currently has no use). To perform digitwise equality comparison of bn1 and bn2 folding over bn1, we can use a state structure that contains the part of bn2 still to be processed and the intermediate result of the comparison so far:
state_i = (seq_i,res_i);
state_0 = (bn2,true),
seq_{i+1} = tail(seq_i),
res_{i+1} =
 · if not res_i, false,
 · else (head(seq_i) == bn1_{i+1}),
where head and tail return the first element and the rest of a sequence, respectively (implemented in Boost.Preprocessor by BOOST_PP_SEQ_HEAD and BOOST_PP_SEQ_TAIL). When the folding over bn1 is complete, the result member of the returned state structure contains the equality comparison output. In our implementation of this algorithm we choose to represent the state structure as a Boost.Preprocessor sequence:
#define BN_EQUAL_CASE2(bn1,bn2) \
BN_EQUAL_STATE_RESULT( \
  BOOST_PP_SEQ_FOLD_LEFT(BN_EQUAL_CASE2_OP,(bn2)(1),bn1))

#define BN_EQUAL_CASE2_OP(s,state,x) \
BN_EQUAL_CASE2_OP_IMPL( \
  x, \
  BOOST_PP_SEQ_HEAD(BN_EQUAL_STATE_BN2(state)), \
  BN_EQUAL_STATE_BN2(state), \
  BN_EQUAL_STATE_RESULT(state))

#define BN_EQUAL_CASE2_OP_IMPL(x,y,bn2,result) \
(BOOST_PP_SEQ_TAIL(bn2(0))) \
(BOOST_PP_IF(result,BOOST_PP_EQUAL(x,y),0))

#define BN_EQUAL_STATE_BN2(state) BOOST_PP_SEQ_ELEM(0,state)
#define BN_EQUAL_STATE_RESULT(state) BOOST_PP_SEQ_ELEM(1,state)

Note the appending of (0) to bn2 before taking its tail in BN_EQUAL_CASE2_OP_IMPL: this prevents the production of an empty sequence (which Boost.Preprocessor does not allow) at the last folding step, when only one digit of bn2 remains.

"Less than" comparison can be implemented along a similar pattern:

#define BN_LESS(bn1,bn2) \
BOOST_PP_IF( \
  BOOST_PP_LESS(BOOST_PP_SEQ_SIZE(bn1),BOOST_PP_SEQ_SIZE(bn2)), \
  BN_LESS_CASE1, \
  BOOST_PP_IF( \
    BOOST_PP_LESS(BOOST_PP_SEQ_SIZE(bn2),BOOST_PP_SEQ_SIZE(bn1)), \
    BN_LESS_CASE2, \
    BN_LESS_CASE3))(bn1,bn2)

#define BN_LESS_CASE1(bn1,bn2) 1
#define BN_LESS_CASE2(bn1,bn2) 0
The remaining BN_LESS_CASE3, applicable when bn1 and bn2 are of the same length, can again be implemented by folding over bn1 using the following recursive state definition:
state_i = (seq_i,res_i), with res_i one of
 · −1 (bn1 less),
 · 0 (bn1 == bn2),
 · +1 (bn1 greater);
state_0 = (bn2,0),
seq_{i+1} = tail(seq_i),
res_{i+1} =
 · if res_i ≠ 0, res_i,
 · else if bn1_{i+1} < head(seq_i), −1,
 · else if bn1_{i+1} > head(seq_i), +1,
 · else 0.
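The entry defers BN_LESS_CASE3 to the complete implementation; the following is a minimal sketch of ours along the same lines as BN_EQUAL_CASE2 (macro names and encoding are our choices). Since the preprocessor cannot represent −1, res is encoded as 0 (equal so far), 1 (bn1 less) and 2 (bn1 greater), and the final result is whether res == 1:

#define BN_LESS_CASE3(bn1,bn2) \
BOOST_PP_EQUAL( \
  BN_LESS_STATE_RESULT( \
    BOOST_PP_SEQ_FOLD_LEFT(BN_LESS_CASE3_OP,(bn2)(0),bn1)),1)

#define BN_LESS_CASE3_OP(s,state,x) \
BN_LESS_CASE3_OP_IMPL( \
  x, \
  BOOST_PP_SEQ_HEAD(BN_LESS_STATE_BN2(state)), \
  BN_LESS_STATE_BN2(state), \
  BN_LESS_STATE_RESULT(state))

#define BN_LESS_CASE3_OP_IMPL(x,y,bn2,result) \
(BOOST_PP_SEQ_TAIL(bn2(0))) \
(BOOST_PP_IF( \
  result, \
  result, \
  BOOST_PP_IF( \
    BOOST_PP_LESS(x,y), \
    1, \
    BOOST_PP_IF(BOOST_PP_LESS(y,x),2,0))))

#define BN_LESS_STATE_BN2(state) BOOST_PP_SEQ_ELEM(0,state)
#define BN_LESS_STATE_RESULT(state) BOOST_PP_SEQ_ELEM(1,state)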
Having implemented BN_EQUAL and BN_LESS, the rest of the comparison operations are trivially derived:
#define BN_NOT_EQUAL(bn1,bn2) BOOST_PP_NOT(BN_EQUAL(bn1,bn2))
#define BN_GREATER(bn1,bn2) BN_LESS(bn2,bn1)
#define BN_GREATER_EQUAL(bn1,bn2) BOOST_PP_NOT(BN_LESS(bn1,bn2))
#define BN_LESS_EQUAL(bn1,bn2) BOOST_PP_NOT(BN_LESS(bn2,bn1))
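For illustration (examples of ours), these macros expand to preprocessor literals as follows:

BN_EQUAL((1)(0)(2)(4),(1)(0)(2)(4))   // expands to 1
BN_LESS((9)(9)(9),(1)(0)(2)(4))       // expands to 1 (999 < 1024)
BN_GREATER_EQUAL((2)(5)(6),(2)(5)(6)) // expands to 1
BN_LESS_EQUAL((1)(0)(2)(4),(9)(9)(9)) // expands to 0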

A complete implementation of these macros is provided for your preprocessing pleasure. In a later entry we will continue with bignum arithmetic operations.

Wednesday, June 25, 2008

Back to the future

Future-based concurrent programming can be unexpectedly inefficient when some futures depend on others. Consider the following code (although C++ syntax is used, the discussion is not particularly tied to this language):

int constant(int x)
{
  return x;
}

int add(future<int> x,future<int> y)
{
  return x.get()+y.get(); // blocks on x and y
}

int main()
{
  // (1+2)+(3+4)
  future<int> g=
    schedule(add,
      schedule(add,schedule(constant,1),schedule(constant,2)),
      schedule(add,schedule(constant,3),schedule(constant,4)));
}

where schedule takes a function along with a set of arguments and sends the job of evaluating the function to some internal task processor running concurrently with the program's main thread of execution. A future object is provided as a token for later retrieval of the execution's return value. Calling get() on a future object returns the associated value, blocking if necessary until the asynchronous calculation of the value completes. The future values generated by the code above can be more easily seen in the following equivalent rewriting:

int main()
{
  future<int> a=schedule(constant,1),
              b=schedule(constant,2),
              c=schedule(constant,3),
              d=schedule(constant,4),
              e=schedule(add,a,b),
              f=schedule(add,c,d),
              g=schedule(add,e,f);
}
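Since schedule and future are hypothetical facilities here, readers wanting to experiment can approximate the same dependency graph with standard components; the following sketch is ours, not part of the original discussion, and uses std::async and std::shared_future (a later C++11 facility) purely for illustration:

#include <future>
#include <iostream>

int add(std::shared_future<int> x,std::shared_future<int> y)
{
  return x.get()+y.get(); // blocks on x and y
}

int main()
{
  auto constant=[](int x){
    return std::async(std::launch::async,[x]{return x;}).share();
  };
  auto a=constant(1),b=constant(2),c=constant(3),d=constant(4);
  auto e=std::async(std::launch::async,add,a,b).share();
  auto f=std::async(std::launch::async,add,c,d).share();
  auto g=std::async(std::launch::async,add,e,f).share();
  std::cout<<g.get()<<"\n"; // prints 10
}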

Future-based programming can achieve better performance than sequential computation because execution is delegated to a thread pool, taking advantage of multicore and multiprocessor architectures. The task queue can be shared among all the threads of the pool (so that when a thread completes a task it goes to the common queue for another), or we can equip each thread with its own task queue, so that when a new future task is generated it gets added to the least busy queue. Although from the point of view of queueing theory this second configuration is less efficient (see for instance the discussion in section 2.4 of Andreas Willig's A Short Introduction to Queueing Theory), it also has benefits generally not considered in basic queueing theory, namely reduced locking contention when accessing the task queues (queue reading by the pool threads needs no locking).

Now, let us suppose we have a pool with two threads, each with its own task queue; on a two-core machine this can in theory double the performance of a sequential computation. But consider now the case where the seven future values of the example program get assigned to the two threads of the pool in the following manner:

thread 1 ← a b c d e f,
thread 2 ← g.

This seems an unlikely distribution (one would expect a more or less balanced assignment of tasks among the threads), but it is a consistent situation and could happen if, say, thread 2 was already overloaded with prior tasks. The scenario is extremely bad performance-wise: thread 2 blocks processing g until the future values it depends on, e and f, are completed by thread 1, so basically we get no improvement over the sequential case.

We could remedy this contention scenario if g were rescheduled rather than allowed to block on e and f. For instance, if future<T>::get exhibited the following behavior:

  • Return the future value if it has already been calculated;
  • otherwise, block waiting for the value calculation to complete if in a user thread,
  • or throw a future_pending exception if in a future pool thread (a sketch of this behavior follows the list of issues below).

This would prevent g from blocking; thread 2 could then catch the future_pending exception and reschedule the task, becoming available for further processing. There are two important issues that have to be taken into account for this approach to be feasible:

  • Scheduled functions must be written with the knowledge that future<T>::get can throw and, when this happens, the function will be called again later, which could pose a problem if the computation produces side effects. No problems of this kind arise in the special case when the scheduled code is purely functional.
  • Rescheduling a task to the end of the queue might be too drastic and introduce excessive delay in the computation of the task: ideally, the task should be rescheduled right after the last future it depends on, but that is information the scheduler does not have (it is implicitly present in the task code itself). An intermediate compromise is to reschedule the task right after the future that threw the future_pending exception. More intrusive approaches could require that the list of futures a given future depends on be explicitly given by the programmer to the scheduler.
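Here is a minimal sketch of the shared state behind a future with the proposed get() behavior. It is our own illustration, not the original design: future_pending, in_pool_thread and future_value are invented names, and modern C++ facilities are used for brevity (a real scheduler would set the in_pool_thread flag when launching its worker threads):

#include <condition_variable>
#include <exception>
#include <mutex>
#include <optional>

// Thrown by get() when a pool thread asks for a pending value, so that
// the pool can catch it and reschedule the task instead of blocking.
struct future_pending:std::exception{};

// Hypothetical query: is the calling thread one of the pool's threads?
inline bool& in_pool_thread(){static thread_local bool b=false;return b;}

template<typename T>
class future_value
{
public:
  void set(T v)
  {
    {
      std::lock_guard<std::mutex> lk(m_);
      v_=std::move(v);
    }
    cv_.notify_all();
  }

  T get()
  {
    std::unique_lock<std::mutex> lk(m_);
    if(v_)return *v_;                           // value already calculated
    if(in_pool_thread())throw future_pending{}; // pool threads must not block
    cv_.wait(lk,[&]{return v_.has_value();});   // user threads block
    return *v_;
  }

private:
  std::mutex              m_;
  std::condition_variable cv_;
  std::optional<T>        v_;
};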

Saturday, June 21, 2008

Making choices: adaptive strategy

We revisit the problem of selecting one among N candidates based on the evaluation of M independent normally distributed random candidate traits, with the constraint that at most N traits (out of the total NM) can be evaluated: our goal is to improve the simple and two-stage strategies we had previously analyzed.

The idea behind an adaptive strategy is to make at each step the decision that maximizes the expected result given the information obtained so far. We study this approach for a simplified scenario: suppose we have only three candidates c1, c2, c3 with (unknown to the strategy) scores s1, s2, s3 and current partial scores (known to the strategy) s'1 > s'2 > s'3, respectively. In this situation, the best strategy is to choose c1, and the average result is precisely s'1, since s1 − s'1 is a normal random variable with mean 0. Now, if we are allowed to evaluate one more trait, and the three candidates still have unevaluated traits, how do we best spend this extra resource? Let us examine the result for each possible decision:

Evaluate c1. Let us call t1 the newly evaluated trait of c1. If s'1 + t1 > s'2, we stick to c1; otherwise, we change our selection to c2. The average score we obtain can then be calculated as

S1 = (s'1 + E(t1 | t1 > s'2 − s'1))·P(t1 > s'2 − s'1) + s'2·P(t1 < s'2 − s'1),

where t1 follows a normal distribution with mean 0 and variance 1. Note that the unevaluated traits do not affect the expected score because their mean is 0.

Evaluate c2. If we call t2 the extra trait evaluated for c2, the analysis is similar to that of the previous case and yields

S2 = s'1·P(t2 < s'1 − s'2) + (s'2 + E(t2 | t2 > s'1 − s'2))·P(t2 > s'1 − s'2).

Using the fact that the statistical distribution of t2 is identical to that of t1 and the symmetry of this distribution around 0, elementary manipulations on the expressions above lead us to conclude that

S1 = S2.
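To see this (a quick check of ours, not in the original): write d = s'1 − s'2 ≥ 0 and let t stand for t1 or t2, which are identically distributed. Using E(t | t > x)·P(t > x) = E(t·1{t > x}), the two expressions become

S1 = s'1·P(t > −d) + E(t·1{t > −d}) + s'2·P(t < −d),
S2 = s'1·P(t < d) + s'2·P(t > d) + E(t·1{t > d}),

and the symmetry of the distribution of t around 0 gives P(t > −d) = P(t < d), P(t < −d) = P(t > d) and, since E(t) = 0, E(t·1{t > −d}) = −E(t·1{t ≤ −d}) = E(t·1{t ≥ d}); the corresponding terms thus match pairwise.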

Evaluate c3. Calling t3 the extra trait evaluated for c3, we have

S3 = s'1·P(t3 < s'1 − s'3) + (s'3 + E(t3 | t3 > s'1 − s'3))·P(t3 > s'1 − s'3).

Again, t3 has the same statistical distribution as t2; but since s'3 < s'2 it is not hard to prove that

S3 < S2.

So, the most efficient way to spend the extra trait evaluation is using it with either c1 or c2. This result readily extends to the general case:

adaptive strategy(c1,...,cN)
{s holds the partial scores of c1,...,cN}
s ← {0,...,0}
for i = 1 to N
····select some cj with unevaluated traits and maximum sj
····evaluate new trait t on cj
····sj ← sj + t
end for
select a candidate ck with maximum sk
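For concreteness, here is a minimal standalone C++ translation of this pseudocode, written by us as an illustrative sketch (the program mentioned below is Boost-based; the parameters follow the N = 60, M = 5 setting):

#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

int main()
{
  const int N=60,M=5,runs=10000;
  std::mt19937 gen(12345);
  std::normal_distribution<double> trait(0.0,1.0);

  double total=0.0;
  for(int r=0;r<runs;++r){
    std::vector<std::vector<double>> traits(N,std::vector<double>(M));
    for(auto& ts:traits)for(auto& t:ts)t=trait(gen);

    std::vector<double> score(N,0.0); // partial scores sj
    std::vector<int>    used(N,0);    // traits evaluated per candidate
    for(int i=0;i<N;++i){
      int j=-1;                       // cj with unevaluated traits, max sj
      for(int k=0;k<N;++k){
        if(used[k]<M&&(j<0||score[k]>score[j]))j=k;
      }
      score[j]+=traits[j][used[j]++]; // evaluate one more trait on cj
    }
    int best=int(std::max_element(score.begin(),score.end())-score.begin());
    double full=0.0;                  // true score of the chosen candidate
    for(double t:traits[best])full+=t;
    total+=full;
  }
  std::cout<<"average score: "<<total/runs<<"\n";
}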

This extremely simple strategy has been implemented in C++ (Boost is needed to compile the program) and exercised for the same settings as in previous cases, N = 60, M = 5. The figure shows the performance of the adaptive strategy compared with the best simple and two-stage strategies and the optimum case (which is not attainable as it requires knowledge of all the NM traits of the candidate pool):

Note that we have not proved that this adaptive strategy is optimum, i.e. a strategy with the best possible expected score. That would be the case if the following optimal substructure property held true: if S is an optimum strategy for n trait evaluations and a strategy S' is identical to S except that it omits the last trait evaluation, then S' is an optimum strategy for n − 1 trait evaluations.

Tuesday, June 17, 2008

C preprocessor bignums

Consider the following Boost.Preprocessor-based snippet:

// ARRAY defines a stack-based array or a dynamically
// allocated array when the size goes beyond some threshold

#define ARRAY(name,type,n) \
BOOST_PP_IF( \
  n<=128, \
  type name[n], \
  type* name=(type *)malloc(n*sizeof(type)))

ARRAY(x1,char,64);
ARRAY(x2,char,200);
This does not produce the desired result, but instead expands to
BOOST_PP_IIF_1<=128(char x1[64], char* x1=(char *)malloc(64*sizeof(char)));
BOOST_PP_IIF_1<=128(char x2[200], char* x2=(char *)malloc(200*sizeof(char)));

which does not seem to make much sense. What happened? The error comes from the fact that the condition in the BOOST_PP_IF clause must evaluate to 0 or 1 in the preprocessor realm: the text 64<=128, for instance, is a compile-time constant expression in C, but certainly does not coincide with the preprocessor tokens 0 or 1.

To remedy this, Boost.Preprocessor provides arithmetic macros that perform numerical manipulation at the preprocessor level, so that we can write:

#define ARRAY(name,type,n) \
BOOST_PP_IF( \
  BOOST_PP_LESS_EQUAL(n,128), \
  type name[n], \
  type* name=(type *)malloc(n*sizeof(type)))

ARRAY(x1,char,64);
ARRAY(x2,char,200);
which expands to
char x1[64];
char* x2=(char *)malloc(200*sizeof(char));
as intended. But now, if we decide to set the threshold in ARRAY to 1024 we get an error again:
BOOST_PP_IIF_BOOST_PP_BOOL_BOOST_PP_COMPL_BOOST_PP_BOOL_BOOST_PP_TUPLE_ELEM_2_0BOOST_PP_IIF_BOOST_PP_BOOL_1024(BOOST_PP_WHILE_2, (64, 1024) BOOST_PP_TUPLE_EAT_3)(BOOST_PP_SUB_P, BOOST_PP_SUB_O, BOOST_PP_IIF_BOOST_PP_BOOL_1024(BOOST_PP_SUB_O, BOOST_PP_NIL BOOST_PP_TUPLE_EAT_2)(2, (64, 1024)))(char x1[64], char* x1=(char *)malloc(64*sizeof(char)));
BOOST_PP_IIF_BOOST_PP_BOOL_BOOST_PP_COMPL_BOOST_PP_BOOL_BOOST_PP_TUPLE_ELEM_2_0BOOST_PP_IIF_BOOST_PP_BOOL_1024(BOOST_PP_WHILE_2, (200, 1024) BOOST_PP_TUPLE_EAT_3)(BOOST_PP_SUB_P, BOOST_PP_SUB_O, BOOST_PP_IIF_BOOST_PP_BOOL_1024(BOOST_PP_SUB_O, BOOST_PP_NIL BOOST_PP_TUPLE_EAT_2)(2, (200, 1024)))(char x2[200], char* x2=(char *)malloc(200*sizeof(char)));

We have hit a limitation of preprocessor-based arithmetic: the only way to implement BOOST_PP_LESS_EQUAL(x,y) consists in precalculating the result for all pairs (x,y) with x and y between 0 and some internal limit, which Boost.Preprocessor sets at 256; so BOOST_PP_LESS_EQUAL(x,1024) simply does not compute. Raising the internal limit does not scale, as the size of the implementation headers grows linearly with this limit.

A way to overcome these limitations in preprocessor-based arithmetic is to replace simple numerical literals with some other representation more amenable to preprocessor handling. This is the way arbitrary-precision arithmetic support is provided in Paul Mensonides's Chaos Library, which can be regarded as an evolution of Boost.Preprocessor for very conformant C preprocessors. For the fun of it, we will develop our solution within Boost.Preprocessor. This library makes it very easy to work with so-called sequences, lists of adjacent parenthesized elements:

// sequence a, b, c
(a)(b)(c)

We define a preprocessor bignum simply as a sequence whose elements are digits between 0 and 9:

// bignum 1024
(1)(0)(2)(4)
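Nothing prevents a bignum from exceeding the range of every built-in integer type; for instance (an example of ours), 2^64, which is one past the largest 64-bit unsigned value, is written as:

// bignum 2^64 = 18446744073709551616
(1)(8)(4)(4)(6)(7)(4)(4)(0)(7)(3)(7)(0)(9)(5)(5)(1)(6)(1)(6)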

Elements are assumed to be laid out from the most to the least significant digit, though the reverse order would probably be a little more natural for computational purposes. The largest bignum we can handle with Boost.Preprocessor is 10^256 − 1, as BOOST_PP_LIMIT_MAG = 256 is the maximum sequence length supported by the library. This magnitude is much larger than what is usually representable by integer types in C; when a bignum is sufficiently small to be represented by a C integer, we can devise a macro to convert the bignum to a literal:

#define BN_TO_LITERAL(bn) \
BOOST_PP_IF( \
  BOOST_PP_GREATER(BOOST_PP_SEQ_SIZE(bn),1), \
  BN_TO_LITERAL_CASE1, \
  BN_TO_LITERAL_CASE2)(bn)

#define BN_TO_LITERAL_CASE1(bn) \
BOOST_PP_SEQ_FOLD_LEFT( \
  BN_TO_LITERAL_CASE1_OP, \
  BOOST_PP_SEQ_HEAD(bn), \
  BOOST_PP_SEQ_TAIL(bn))

#define BN_TO_LITERAL_CASE1_OP(s,state,x) BOOST_PP_CAT(state,x)

#define BN_TO_LITERAL_CASE2(bn) BOOST_PP_SEQ_HEAD(bn)

BN_TO_LITERAL((1)(0)(2)(4)) // expands to 1024
Note that we had to treat bignums of just one digit separately because the general branch would produce intermediate "empty" sequences, which are not allowed.

Now, if we had an appropriate BN_LESS_EQUAL macro to compare bignums, we could rewrite our original example like this and forget about the BOOST_PP_LIMIT_MAG limit of Boost.Preprocessor:

#define ARRAY(name,type,bn) \
BOOST_PP_IF( \
  BN_LESS_EQUAL(bn,(1)(0)(2)(4)), \
  type name[BN_TO_LITERAL(bn)], \
  type* name=(type *)malloc(BN_TO_LITERAL(bn)*sizeof(type)))

ARRAY(x1,char,(6)(4));       // stack-based
ARRAY(x2,char,(2)(0)(0));    // stack-based
ARRAY(x3,char,(1)(5)(0)(0)); // dynamic
We will show in later entries how to implement bignum comparison macros as well as common arithmetic operations.

Thursday, June 12, 2008

A complex monad in metaC++

In a prior entry we proposed a Boost.MPL-based representation of monads in C++ template metaprogramming. Let us exercise this framework by translating the resource monad R described in section 9.3 of Hudak et al.'s A Gentle Introduction to Haskell 98. You are advised to familiarize yourself with the original formulation of the resource monad in Haskell before reading the following. Good knowledge of Boost.MPL is also required.

Resource monads store functions from some Resource type to (Resource, Either a (R a)); such a function expresses a computation that consumes resources and returns the remaining resource supply along with either the result of the computation or, if resources were exhausted, a "suspended" computation in the form of an unevaluated resource monad.

data R a = R (Resource -> (Resource, Either a (R a)))
This translates to C++ TMP simply as
template<typename C>
struct RMonad
{
  typedef C value_type;
};

We cannot narrow down the kind of arguments accepted by RMonad because C++ TMP is basically untyped, so it is assumed that C is a proper metafunction class. The nested value_type definition allows us to access C without resorting to pattern matching via partial specialization, which simplifies some of the code below. Haskell pairs can be modeled by boost::mpl::pair; as for Either, we can simply translate their Left and Right data constructors:

template<typename T>struct left;
template<typename T>struct right;
The monad operations of R are defined like this:
instance Monad R where
  R c1 >>= fc2  = R (\r -> case c1 r of
                     (r', Left v)    -> let R c2 = fc2 v in
                                          c2 r'
                     (r', Right pc1) -> (r', Right (pc1 >>= fc2)))
  return v      = R (\r -> (r, (Left v)))

The semantics of these constructs is explained in the original source. The translation to C++ TMP consists in partially specializing the metafunctions mbind and mreturn as follows:

template<typename C1,typename FC2>
struct mbind<RMonad<C1>,FC2>
{
  struct C
  {
    template<typename R>
    struct apply
    {
      template<typename _>
      struct impl;

      template<typename R1,typename V>
      struct impl<pair<R1,left<V> > >
      {
        typedef typename boost::mpl::apply<
          typename boost::mpl::apply<FC2,V>::type::value_type,
          R1
        >::type type;
      };

      template<typename R1,typename PC1>
      struct impl<pair<R1,right<PC1> > >
      {
        typedef pair<
          R1,
          right<typename mbind<PC1,FC2>::type>
        > type;
      };

      typedef typename impl<
        typename boost::mpl::apply<C1,R>::type
      >::type type;
    };
  };

  typedef RMonad<C> type;
};

template<typename _,typename V>
struct mreturn<RMonad<_>,V>
{
  struct C
  {
    template<typename R>
    struct apply
    {
      typedef pair<R,left<V> > type;
    };
  };

  typedef RMonad<C> type;
};

Despite the imposing look of the code above, the translation is straightforward once one grasps the overall structure of the definitions: so, in the mapping of >>= to C++ TMP, C corresponds to the anonymous function in the Haskell definition; the case construct on c1 r is translated by means of the impl auxiliary class template, which is partially specialized according to the patterns (r', Left v) and (r', Right pc1). The names of the variables involved are kept consistent to aid the reader: c1 becomes C1, fc2 becomes FC2 and so on.

In line with the original exposition, we assume that the type Resource is simply an integer: each computation decreases the resource count by one until 0 is reached. This process is performed by the step function:

step   :: a -> R a
step v = c where
  c = R (\r -> if r /= 0 then (r-1, Left v)
               else (r, Right c))
In C++ TMP we will model integers by means of boost::mpl::int_ and implement step as the following metafunction:
template<typename V>
struct step
{
  struct C
  {
    template<typename R>
    struct apply:
      if_c<
        R::value!=0,
        pair<int_<R::value-1>,left<V> >,
        pair<R,right<RMonad<C> > >
      >
    {};
  };

  typedef RMonad<C> type;
};

Whereas in the Haskell definition c is the local name for the returned monad, which recursively mentions itself, in C++ TMP it is simpler to implement this recursion by letting the local struct C refer to the metafunction contained inside RMonad rather than to the RMonad instantiation proper. Other than this, the translation is direct.

step is used to lift unary and binary functions from regular types to their mapped types in the resource monad world, so that the resulting lifted functions operate analogously but add resource consumption to the computation:

lift1   :: (a -> b) -> (R a -> R b)
lift1 f = \ra1 -> do a1 <- ra1
                     step (f a1)

lift2 :: (a -> b -> c) -> (R a -> R b -> R c)
lift2 f = \ra1 ra2 -> do a1 <- ra1
                         a2 <- ra2
                         step (f a1 a2)
How do we simulate the do notation in C++ TMP? Fortunately, we need not do it, since do is just syntactic sugar and can be dispensed with using the transformation:

do p <- e1; e2   =   e1 >>= \p -> e2

So the definitions of lift1 and lift2 can be rewritten as:

lift1   :: (a -> b) -> (R a -> R b)
lift1 f = \ra1 -> (ra1 >>= \a1 -> step (f a1))

lift2 :: (a -> b -> c) -> (R a -> R b -> R c)
lift2 f = \ra1 ra2 -> (ra1 >>= \a1 -> (ra2 >>= \a2 -> step (f a1 a2)))

Using this formulation, the translation to C++ TMP poses no particular problems:

template<typename F>
struct lift1
{
  struct type
  {
    template<typename RA1>
    struct apply
    {
      struct FC
      {
        template<typename A1>
        struct apply
        {
          typedef typename step<
            typename boost::mpl::apply<F,A1>::type
          >::type type;
        };
      };

      typedef typename mbind<RA1,FC>::type type;
    };
  };
};

template<typename F>
struct lift2
{
  struct type
  {
    template<typename RA1,typename RA2>
    struct apply
    {
      struct FC1
      {
        template<typename A1>
        struct apply
        {
          struct FC2
          {
            template<typename A2>
            struct apply
            {
              typedef typename step<
                typename boost::mpl::apply<F,A1,A2>::type
              >::type type;
            };
          };

          typedef typename mbind<RA2,FC2>::type type;
        };
      };

      typedef typename mbind<RA1,FC1>::type type;
    };
  };
};

Arithmetic operations are lifted to the resource monad like this:

instance Num a => Num (R a) where
  (+)         = lift2 (+)
  (-)         = lift2 (-)
  negate      = lift1 negate
  (*)         = lift2 (*)
  abs         = lift1 abs
  fromInteger = return . fromInteger

On the other hand, == cannot be overloaded due to signature incompatibilities (the return type has to be R Bool rather than Bool), whereas if is not a function at all and cannot be used directly, as its condition has to be a Bool when we have an R Bool. Consequently, alternatives are provided:

(==*)           :: Ord a => R a -> R a -> R Bool
(==*) = lift2 (==)

ifR :: R Bool -> R a -> R a -> R a
ifR tst thn els = do t <- tst
                     if t then thn else els

Curiously, as C++ TMP is untyped and boost::mpl::if_ is not a special construct but a regular metafunction, we can partially specialize everything without the problems found in Haskell:

#define LIFT1(f)                          \
template<typename C>                      \
struct f<RMonad<C> >                      \
{                                         \
  struct F                                \
  {                                       \
    template<typename T>                  \
    struct apply                          \
    {                                     \
      typedef typename f<T>::type type;   \
    };                                    \
  };                                      \
                                          \
  typedef typename apply<                 \
    typename lift1<F>::type,              \
    RMonad<C>                             \
  >::type type;                           \
};

#define LIFT2(f)                          \
template<typename C1,typename C2>         \
struct f<RMonad<C1>,RMonad<C2> >          \
{                                         \
  struct F                                \
  {                                       \
    template<typename T1,typename T2>     \
    struct apply                          \
    {                                     \
      typedef typename f<T1,T2>::type type; \
    };                                    \
  };                                      \
                                          \
  typedef typename apply<                 \
    typename lift2<F>::type,              \
    RMonad<C1>,RMonad<C2>                 \
  >::type type;                           \
};

namespace boost{
namespace mpl{

LIFT1(negate)
LIFT1(next)
LIFT1(prior)
LIFT2(equal_to)
LIFT2(plus)
LIFT2(minus)
LIFT2(times)
LIFT2(divides)
LIFT2(modulus)

template<typename C,typename T1,typename T2>
struct if_<RMonad<C>,T1,T2>
{
  struct impl
  {
    template<typename CC>
    struct apply:if_<CC,T1,T2>{};
  };

  typedef typename mbind<RMonad<C>,impl>::type type;
};

template<typename C,typename T1,typename T2>
struct eval_if<RMonad<C>,T1,T2>
{
  struct impl
  {
    template<typename CC>
    struct apply:eval_if<CC,T1,T2>{};
  };

  typedef typename mbind<RMonad<C>,impl>::type type;
};

} // namespace mpl
} // namespace boost

template<int N>
struct from_int
{
  typedef typename mreturn<RMonad<void>,int_<N> >::type type;
};

Note that lift1 and lift2 accept metafunction classes rather than plain metafunctions, hence the local F adaptor inside the implementation of the LIFT macros. from_int is the C++ TMP translation of the fromInteger Haskell function; unlike in Haskell, we will have to use this conversion function explicitly when needed.

We now have all the tools to implement a factorial function on RMonad<boost::mpl::int_>:

template<typename X>
struct fact
{
  struct else_:
    times<
      X,
      typename fact<typename prior<X>::type>::type
    >
  {};

  typedef typename eval_if<
    typename equal_to<X,from_int<0>::type>::type,
    from_int<1>,
    else_
  >::type type;
};

This looks remarkably close to an implementation of the function for boost::mpl::int_. If you are an experienced Boost.MPL programmer you might have noticed there are a few ::types that could be omitted in the regular case: here they are needed so that the proper RMonad specializations of the arithmetic operations are invoked.

Now, these resource monads encapsulate computation, but do not automatically execute it: for instance, fact<from_int<3>::type>::type is not from_int<6>::type, but rather can be thought of as a program that will produce from_int<6>::type when fed with enough resources. This is what the original source provides the run function for:

run         :: Resource -> R a -> Maybe a
run s (R p) = case (p s) of
                (_, Left v) -> Just v
                _           -> Nothing

run accepts a maximum number of computation steps and a resource monad; if the computation completes within the limit, run returns the output, otherwise it returns Nothing. We opt for a slightly different treatment of the output in our C++ TMP translation:

template<int N,typename RM>
struct run
{
  template<typename _>
  struct impl:int_<0>
  {
    enum{steps=-1};
  };

  template<typename R,typename V>
  struct impl<pair<R,left<V> > >:V
  {
    enum{steps=N-R::value};
  };

  typedef impl<
    typename apply<typename RM::value_type,int_<N> >::type
  > type;
};
If the computation succeeds, the metafunction run returns the result, and the nested value ::steps additionally informs of the number of computational steps taken; if the computation runs out of resources, int_<0> is returned by convention and ::steps is set to −1.
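For instance, computing the factorial of 3 under a budget of 100 steps would look like this (a hypothetical usage snippet of ours, assuming the definitions above are in scope):

typedef fact<from_int<3>::type>::type fact3; // a suspended computation
typedef run<100,fact3>::type          r;     // execute it with 100 steps
// r::value == 6; r::steps == 16, in line with the (n+1)²
// step count discussed at the end of the entry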

A complete program is provided that exercises the ideas we have developed. GCC copes fine with the metacomputations involved; MSVC++ 8.0, on the other hand, seems to take inordinate amounts of memory resources during compilation: adding a couple of additional run instantiations to the program can easily bring a powerful machine to its knees.

One final note: if you play a little with fact monads you may be surprised to learn that computing the factorial of n takes (n+1)² steps. This quadratic performance is due to the fact that intermediate values are never reduced; instead, they are re-executed every time they are needed: remember that resource monads represent computations, not final values.

Monday, June 9, 2008

Making choices: two-stage strategies

Let us consider more sophisticated selection strategies than those studied in a prior entry for the problem of selecting one among N candidates based on the evaluation of M independent normally distributed random candidate traits, with the constraint that at most N traits (out of the total NM) can be evaluated.

We consider the following two-stage scheme: first sort n1 randomly chosen candidates according to the first t1 traits; from these, pick the n2 best ones (n2 < n1) and make the choice among these based on their first t2 traits (t2 > t1). The idea is that making a preselection on fewer traits improves the quality of the candidates used at the final stage, although of course preselection incurs a cost in terms of trait evaluations consumed. The kind of selection strategies studied in our prior entry can be considered a degenerate instance of this with t1 = 0. Leaving this case aside, t1 can range between 1 and M − 1 and t2 between t1 + 1 and M. The acceptable values for n1 and n2 are governed by the following inequalities:

2 ≤ n1,
1 ≤ n2 ≤ n1 − 1,
t1·n1 + (t2 − t1)·n2 ≤ N.

(If you wonder why the last inequality has (t2 − t1)·n2 instead of t2·n2, notice that we need not re-evaluate the first t1 traits in the second round.) We are only interested in maximally efficient strategies that evaluate as many traits as allowed, i.e. those for which t1·n1 + (t2 − t1)·n2 is as close as possible to N. If n1 and n2 could take fractional values, the points (n1,n2) associated to maximally efficient strategies would lie in the segment

(((N + (t2 − t1))/t2 , (N − t1)/t2) , ((N − (t2 − t1))/t1 , 1)).

The actual integer pairs (n1,n2) are obtained by approximating this segment with discrete values. In the particular case we had previously studied for simple strategies, N = 60, M = 5, there are 124 maximally efficient two-stage strategies grouped like this:

t1   t2   no. of solutions
1    2    29
1    3    19
1    4    14
1    5    11
2    3    10
2    4    14
2    5    11
3    4     5
3    5     8
4    5     3

A simulation program has been run to measure the effectiveness of all these 124 strategies. The figure shows the average score of the chosen candidate for the best strategy of each (t1,t2) group; the legend of each bar is the tuple (t1,n1,t2,n2).

The most efficient two-stage strategy is (t1,n1,t2,n2) = (1,28,5,8): pick 28 candidates at random, sort them according to their first trait and then fully evaluate the best 8 (note that 1·28 + (5 − 1)·8 = 60 = N, so the evaluation budget is exactly consumed). The resulting average score 4.11 is notably higher than the average score 3.65 reached by the best one-stage strategy. These results seem to justify the common practice of conducting a preselection round in real-life evaluation processes.

Although two-stage strategies improve upon previous one-stage schemes, they are by no means the most effective selection mechanism. We will try to develop better selection strategies in a later entry.

Thursday, June 5, 2008

Rotating bunny

Isometric perspective and a little animation can be used to conveniently visualize intricate 3D models:
The 1,889 points defining the surface of the bunny statuette have been obtained from the Stanford 3D Scanning Repository at the Stanford University Computer Graphics Laboratory. We have used the same Filmation technique exercised in a previous entry, with one important simplification: if we ignore the (rare) antitransitive cycles of the "is behind" relationship and consider a situation where all the elements in the scene have the same dimensions (as in our current setting), we can approximate the "is behind" relationship by any lexicographical order on the (x,y,z) coordinates or, if we let the approximation be even coarser, by any order on only one of the coordinates.

Since our statuette rotates around the z axis, the elements' z coordinates remain constant, so we can sort the elements by their z coordinate at the beginning of the program and use this ordering to assign a fixed drawing priority, thus sparing us the recalculation of the drawing order at each animation step. The resulting rendering anomalies (elements appearing in front of others when they shouldn't) are hardly noticeable.

Sunday, June 1, 2008

Monads in C++ template metaprogramming

In the context of functional programming, a monad M is a type constructor (a mapping from arbitrary type a to type Ma) along with a pair of (polymorphic) functions

>>=M : Ma → (a → Mb) → Mb,
returnM : a → Ma,

(the first function is usually pronounced "bind") which are assumed to satisfy certain laws regarding the associativity of >>=M and the identity role played by returnM. Intuitively, monads of type Ma internally carry zero, one or more values of type a, along with some monad-specific information; the expression

m >>= n

is to be understood as: monad m executes some specific computation and then uses its internal a contents to feed n, whose results are possibly postprocessed and returned in a single value of type Mb. return x is expected to produce a "fresh" monad initialized with x. In a sense, monad binding performs a computation from a to b in a sequenced manner, with some additional contextual information carried along the way. For this reason, monads are used in functional languages like Haskell to model imperative constructs like those involved in IO interaction.

How can we implement monads in C++ template metaprogramming? In the following we use the conventions of Boost.MPL. As C++ TMP lacks proper types we can define global metafunctions for >>= and return and let each monad M provide the appropriate associated partial specializations for the values under its control. The following is a possible formulation of this idea:

template<typename M,typename F>
struct mbind;

template<typename M,typename T>
struct mreturn;

The signature of mbind is clear enough, but mreturn seems to accept a superfluous argument M: this extra argument is meant to disambiguate the specializations for different monads which otherwise share their input types, much as we cannot overload a C++ function only on the return type. This is how an implementation of the Maybe monad looks in this framework:

template<typename T> struct maybe;

struct nothing_arg;
typedef maybe<nothing_arg> nothing;

template<typename T,typename F>
struct mbind<maybe<T>,F>:boost::mpl::apply<F,T>{};

template<typename F>
struct mbind<nothing,F>
{
  typedef nothing type;
};

template<typename _,typename T>
struct mreturn<maybe<_>,T>
{
  typedef maybe<T> type;
};
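As a quick illustration (our own example, not part of the original entry), here is a metafunction class safe_prior that decrements a nonnegative integer and fails on 0; chaining it through mbind exercises both specializations:

// safe_prior: N -> maybe<N-1>, or nothing when N==0
struct safe_prior
{
  template<typename N>
  struct apply
  {
    typedef typename boost::mpl::if_c<
      N::value==0,
      nothing,
      maybe<boost::mpl::int_<N::value-1> >
    >::type type;
  };
};

typedef mbind<maybe<boost::mpl::int_<1> >,safe_prior>::type m1; // maybe<int_<0> >
typedef mbind<m1,safe_prior>::type m2;                          // nothing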
We will implement more complicated monads in later entries.