On Thu, Aug 09, 2012 at 02:53:48PM +0400, Solar Designer wrote:
> Rich -
> 
> On Thu, Aug 09, 2012 at 11:29:40AM +0400, Solar Designer wrote:
> > Attached is the smaller and faster code, as discussed on IRC.
> > 
> > This is under 8 KB.  The speed is similar to the original, I measured
> > -3% to +2% on different systems/builds.
> 
> Here's an even smaller version.

I've taken this version and made some minimal changes based on my
version, mainly for integration with musl, where I'm testing it. I also
think we've reached the final word on loop unrolling:

Just For Fun, I tried replacing your unrolled BF_ROUND loop with a for
loop and compiling with -O3 on gcc 4.6.3. After noticing the
performance numbers were coming out near-identical, and that the .o
sizes were mysteriously identical, I decided, Just For Fun, to
disassemble both versions with objdump and diff them. They are
identical. That is, modern gcc generates byte-for-byte identical code
with -O3 for the manually unrolled loop and the for loop.
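
For anyone who wants to repeat the comparison on something smaller than
the real file, here is a toy sketch of the two forms. It is not the
BF_ROUND code from the attachment: toy_f, TOY_ROUND, N_ROUNDS and the
function names are all made up for illustration, and the round function
is deliberately not Blowfish's. Both variants should compute the same
result; whether a given compiler emits identical machine code for them
is something to check with objdump rather than assume.

```c
#include <stdint.h>

#define N_ROUNDS 16

/* Toy mixing function; an assumption for illustration, NOT Blowfish's F. */
static uint32_t toy_f(uint32_t x)
{
    return (x << 3 | x >> 29) ^ 0x9e3779b9u;
}

/* One Feistel-style round: mix the key into A, then fold F(A) into B. */
#define TOY_ROUND(A, B, k) do { (A) ^= (k); (B) ^= toy_f(A); } while (0)

/* Variant 1: plain for loop, two rounds per iteration to avoid a swap. */
void rounds_loop(uint32_t *l, uint32_t *r, const uint32_t key[N_ROUNDS])
{
    uint32_t L = *l, R = *r;
    for (int i = 0; i < N_ROUNDS; i += 2) {
        TOY_ROUND(L, R, key[i]);
        TOY_ROUND(R, L, key[i + 1]);
    }
    *l = L;
    *r = R;
}

/* Variant 2: the same 16 rounds written out by hand. */
void rounds_unrolled(uint32_t *l, uint32_t *r, const uint32_t key[N_ROUNDS])
{
    uint32_t L = *l, R = *r;
    TOY_ROUND(L, R, key[0]);  TOY_ROUND(R, L, key[1]);
    TOY_ROUND(L, R, key[2]);  TOY_ROUND(R, L, key[3]);
    TOY_ROUND(L, R, key[4]);  TOY_ROUND(R, L, key[5]);
    TOY_ROUND(L, R, key[6]);  TOY_ROUND(R, L, key[7]);
    TOY_ROUND(L, R, key[8]);  TOY_ROUND(R, L, key[9]);
    TOY_ROUND(L, R, key[10]); TOY_ROUND(R, L, key[11]);
    TOY_ROUND(L, R, key[12]); TOY_ROUND(R, L, key[13]);
    TOY_ROUND(L, R, key[14]); TOY_ROUND(R, L, key[15]);
    *l = L;
    *r = R;
}

/* Returns 1 if both variants produce the same output for a sample input. */
int toy_variants_agree(void)
{
    uint32_t key[N_ROUNDS];
    for (int i = 0; i < N_ROUNDS; i++)
        key[i] = 0x01010101u * (uint32_t)(i + 1);

    uint32_t l1 = 0x01234567u, r1 = 0x89abcdefu;
    uint32_t l2 = l1, r2 = r1;
    rounds_loop(&l1, &r1, key);
    rounds_unrolled(&l2, &r2, key);
    return l1 == l2 && r1 == r2;
}
```

To compare the generated code, put each variant in its own file, compile
both with the same flags (e.g. gcc -O3 -c), and diff the objdump -d
output, ignoring the symbol names.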

I'm leaving both versions of the code in the attached file so that you
or anyone else interested can try this. This is not the version I
intend to commit; I want to add back some of my size optimizations in
encode/decode and possibly compare how the compiler does if I add back
my non-hand-scheduled version of the BF_ROUND code. There are also
issues (in crypt_*, not just blowfish) that I want to address by
returning unmatchable hashes instead of NULL on failure; this should
further reduce the code size by eliminating all the errno accesses,
etc.
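
As a rough sketch of what I mean by an unmatchable hash (not the code I
will commit): on failure, return a short string that can never be a
valid hash and that never equals the setting passed in, so a caller
doing strcmp(crypt(pw, stored), stored) fails safely without a NULL
check. The names toy_hash, toy_crypt and failure_token are hypothetical,
toy_hash is only a stand-in for a real backend, and the "*0"/"*1" token
choice is an assumption for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Pick a failure string that differs from the setting, so comparing the
 * result against the stored hash can never succeed. */
static const char *failure_token(const char *setting)
{
    if (setting && setting[0] == '*' && setting[1] == '0')
        return "*1";
    return "*0";
}

/* Hypothetical stand-in for a real hashing backend: returns NULL when
 * the setting is missing or does not start with "$2". The returned
 * string is a placeholder, not a real hash. */
static const char *toy_hash(const char *key, const char *setting)
{
    (void)key;
    if (!setting || strncmp(setting, "$2", 2) != 0)
        return NULL;
    return "$2a$toy-placeholder-not-a-real-hash";
}

/* Wrapper that never returns NULL and never touches errno. */
const char *toy_crypt(const char *key, const char *setting)
{
    const char *h = toy_hash(key, setting);
    return h ? h : failure_token(setting);
}
```

With this shape, toy_crypt("pw", "bogus") yields "*0", and passing "*0"
itself as the setting yields "*1", so the result never compares equal
to the input.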

Rich