On Thu, Aug 09, 2012 at 02:53:48PM +0400, Solar Designer wrote:
> Rich -
>
> On Thu, Aug 09, 2012 at 11:29:40AM +0400, Solar Designer wrote:
> > Attached is the smaller and faster code, as discussed on IRC.
> >
> > This is under 8 KB.  The speed is similar to the original, I measured
> > -3% to +2% on different systems/builds.
>
> Here's an even smaller version.

I've taken this version and made some minimum changes based on my
version, mainly for integration with musl where I'm testing it. I also
think we've reached the final word on loop unrolling:

Just For Fun, I tried replacing your unrolled BF_ROUND loop with a for
loop and compiling with -O3 on gcc 4.6.3. After noticing the
performance numbers were coming out near-identical, and that the .o
sizes were mysteriously identical, I decided, Just For Fun, to
disassemble both versions with objdump and diff them. They are
identical. That is, modern gcc generates byte-for-byte identical code
with -O3 for the manually unrolled loop and the for loop.

I'm leaving both versions of the code in the attached file so that you
or anyone else interested can try this. This is not the version I
intend to commit; I want to add back some of my size optimizations in
encode/decode and possibly compare how the compiler does if I add back
my non-hand-scheduled version of the BF_ROUND code. There are also
issues (in crypt_*, not just blowfish) I want to address with returning
unmatchable hashes instead of NULL on failure; this should further
reduce the code size by eliminating all the errno accesses, etc.

Rich
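
For anyone who wants to repeat the unrolled-versus-loop comparison without
the attached file, here is a minimal sketch of the two forms. It is not the
Blowfish code under discussion: toy_f, TOY_ROUND, the key layout and both
function names are hypothetical stand-ins, and the round function is a toy
with no cryptographic value. It only shows the shape of the experiment.

```c
#include <stdint.h>

/* Toy 16-round Feistel-style network. Everything here is illustrative:
 * the names, the round function and the 18-word key are assumptions,
 * not taken from the real crypt_blowfish code. */
#define TOY_ROUNDS 16

static uint32_t toy_f(uint32_t x)
{
    return (x * 0x9E3779B1u) ^ (x >> 15);
}

/* One round: mix key word N into L, then fold toy_f(L) into R. */
#define TOY_ROUND(L, R, N) \
    do { (L) ^= key[(N)]; (R) ^= toy_f(L); } while (0)

/* Manually unrolled: sixteen rounds written out, halves alternating. */
void toy_encrypt_unrolled(uint32_t *l, uint32_t *r,
                          const uint32_t key[TOY_ROUNDS + 2])
{
    uint32_t L = *l, R = *r;
    TOY_ROUND(L, R, 0);  TOY_ROUND(R, L, 1);
    TOY_ROUND(L, R, 2);  TOY_ROUND(R, L, 3);
    TOY_ROUND(L, R, 4);  TOY_ROUND(R, L, 5);
    TOY_ROUND(L, R, 6);  TOY_ROUND(R, L, 7);
    TOY_ROUND(L, R, 8);  TOY_ROUND(R, L, 9);
    TOY_ROUND(L, R, 10); TOY_ROUND(R, L, 11);
    TOY_ROUND(L, R, 12); TOY_ROUND(R, L, 13);
    TOY_ROUND(L, R, 14); TOY_ROUND(R, L, 15);
    *l = L ^ key[TOY_ROUNDS];
    *r = R ^ key[TOY_ROUNDS + 1];
}

/* Same sixteen rounds as a for loop, two rounds per iteration so the
 * halves alternate exactly as in the unrolled version. */
void toy_encrypt_loop(uint32_t *l, uint32_t *r,
                      const uint32_t key[TOY_ROUNDS + 2])
{
    uint32_t L = *l, R = *r;
    for (int i = 0; i < TOY_ROUNDS; i += 2) {
        TOY_ROUND(L, R, i);
        TOY_ROUND(R, L, i + 1);
    }
    *l = L ^ key[TOY_ROUNDS];
    *r = R ^ key[TOY_ROUNDS + 1];
}
```

Putting each function in its own file, compiling both with -O3 and
diffing the objdump -d output is one way to check what a given compiler
does. Whether the output comes out identical will depend on the compiler
version and flags; the sketch only guarantees that the two functions
compute the same result.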