Thanks for tracking this down and pointing out where
the error is.  There are actually a handful of errors
in that code.

Assume a source value sv, source alpha sa,
mask alpha ma, destination value dv, and
destination alpha da.  The values sv and dv
are stored premultiplied by their corresponding
alphas (the one true way).

Given those values, the correct new values
for the destination pixel in an S over D op are:

    dv = (sv*ma + dv*(255-sa*ma)) / 255
    da = (sa*ma + da*(255-sa*ma)) / 255

Bug #1: The current draw.c does the division
separately on the two halves:

    dv = (sv*ma)/255 + (dv*(255-sa*ma))/255

This can be off by one if the remainders from the
two divisions sum to >= 255.

Bug #2: The MUL0123 macro assumes four
values are packed into a 32-bit int and runs
them in simultaneous pairs as 32-bit operations
(MUL02 and MUL13) operating on 16-bit halves
of the word.  Those two don't use the right rounding
for the bitwise op implementation of /255.  On
a single value, the implementation is

    x / 255 == (t = x+1, (t+(t>>8))>>8)
    (x+127) / 255 == (t = x+128, (t+(t>>8))>>8)

These calculations only need 16 bits so you can
run two of them in the two halves of a 32-bit word.
The second implements round-to-nearest and
is what the draw code tries to do in this case.
But it only adds 128 (0x80), so it only rounds
the bottom half correctly.  It needs to add 0x00800080,
which would round both of them.  This explains:

src  rFF gFF bFF αFF
mask  kFF αFF
dst  r00 g00 b00 αFF
dst after calc  rFE gFE bFF αFF

Bug #3: MUL0123 is enabled whenever the src
and dst both have 32-bit pixel width, but there is
no check that the sub-channels are in the same
order.  You don't say what the image chans were
in your test, but this:

src  rFF g00 b00 αFF
mask  kFF αFF
dst  r00 g00 b00 αFF
dst after calc  rFE gFE b00 α00

would be explained by, say, src==ARGB and
dst==RGBA.  The A and R values in src became
the R and G chans in dst.  (In fact, since the dst
R and G are FE not FF, that's almost certainly
the scenario, modulo the little-endian draw names.)

This one is similarly explained:

src  rCC g00 b00 αCC
mask  kFF αFF
dst  r00 g00 b00 αFF
dst after calc  rCB gCB b00 α33

The destination got scaled by 0x33/0xFF,
leaving 00 00 00 33, and then the source,
cc 00 00 cc, was added in the wrong place,
using the incorrect rounding, to produce
cb cb 00 33.

I have corrected these bugs in the plan9port
copy of src/libmemdraw/draw.c and finally
get the right answer for Andrey's test (attached).
I leave it as an exercise to the interested reader to
port the changes to the other dozen copies
of libmemdraw that are floating around,
or to unify them all.

Russ