Doug Bagley's Great Language Shootout says that O'Caml and GCC get about
the same performance for Matrix-Matrix Multiplication (MMM). I'm seeing
somewhat different results for my code. In particular, on my machine
(1.2GHz PIIIM, 256M), the O'Caml version of the code is 6 times slower
than the C version.

The code is attached. A few of the differences between my code and
Doug's are that I have tiled the loops and that I use constants for the
matrix sizes and the block size. I have not (yet) hoisted the
loop-invariant expressions.

Looking at the assembly produced by O'Caml and GCC, it appears that GCC
is performing loop unrolling (as requested with -funroll-loops) and
strength reduction in the inner loops. I can easily see why these two
optimizations can result in such a tremendous performance difference.
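
To make clear what I mean by strength reduction, here is a minimal
sketch. It is not the attached code: the function names are made up, and
it assumes n x n matrices stored row-major in flat float arrays. The
first version recomputes both indices with a multiplication on every
iteration; the second carries the indices along and updates them by
addition, which is roughly the transformation I believe GCC applies.

```ocaml
(* Hypothetical illustration, not the attached benchmark code.
   Assumes n x n matrices stored row-major in flat float arrays. *)

(* Naive inner loop: i*n+k and k*n+j are recomputed on every iteration. *)
let dot_naive a b n i j =
  let sum = ref 0.0 in
  for k = 0 to n - 1 do
    sum := !sum +. a.(i * n + k) *. b.(k * n + j)
  done;
  !sum

(* Strength-reduced: the two indices are kept in running counters
   and advanced by addition instead of being rebuilt by multiplication. *)
let dot_reduced a b n i j =
  let sum = ref 0.0 in
  let ia = ref (i * n) in  (* index of element (i, k) in a *)
  let ib = ref j in        (* index of element (k, j) in b *)
  for _k = 0 to n - 1 do
    sum := !sum +. a.(!ia) *. b.(!ib);
    incr ia;               (* next column of a *)
    ib := !ib + n          (* next row of b *)
  done;
  !sum
```

Both functions add the products in the same order, so they should return
identical results for the same inputs.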

My question is this: I can obviously perform loop unrolling myself
by hand, but does ocamlopt perform strength reduction? Is there any way
that I can get the O'Caml code close to the performance of the C code?
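
For reference, this is the kind of hand transformation I have in mind: a
sketch only, with a made-up function name, again assuming row-major flat
float arrays rather than the layout in the attached code. The inner loop
is unrolled by four with running indices, and a second loop handles the
leftover iterations when n is not a multiple of four. I have not measured
whether this closes the gap.

```ocaml
(* Hypothetical illustration: inner product of row i of a and column j
   of b, unrolled by 4 by hand, with indices updated by addition.
   Assumes n x n matrices stored row-major in flat float arrays. *)
let dot_unrolled a b n i j =
  let sum = ref 0.0 in
  let ia = ref (i * n) in  (* index of element (i, k) in a *)
  let ib = ref j in        (* index of element (k, j) in b *)
  let k = ref 0 in
  (* Main loop: four multiply-adds per iteration. *)
  while !k + 3 < n do
    sum := !sum
           +. a.(!ia)     *. b.(!ib)
           +. a.(!ia + 1) *. b.(!ib + n)
           +. a.(!ia + 2) *. b.(!ib + 2 * n)
           +. a.(!ia + 3) *. b.(!ib + 3 * n);
    ia := !ia + 4;
    ib := !ib + 4 * n;
    k := !k + 4
  done;
  (* Cleanup loop for the remaining n mod 4 elements. *)
  while !k < n do
    sum := !sum +. a.(!ia) *. b.(!ib);
    incr ia;
    ib := !ib + n;
    incr k
  done;
  !sum
```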

I can provide additional information about my setup if that would help.

Thanks.


quimby-xp$ uname -a
CYGWIN_NT-5.1 QUIMBY-XP 1.3.13(0.62/3/2) 2002-10-13 23:15 i686 unknown
quimby-xp$ ocamlopt.opt -v
The Objective Caml native-code compiler, version 3.06
Standard library directory: /usr/local/ocaml/lib/ocaml
quimby-xp$ gcc --version
gcc (GCC) 3.2 20020818 (prerelease)
Copyright (C) 2002 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

quimby-xp$ ocamlopt.opt -S -o mmm_ml.exe unix.cmxa mmm_ml.ml
quimby-xp$ gcc -O4 -funroll-loops -o mmm_c.exe mmm_c.c
quimby-xp$ ./mmm_ml
(* 86.83 mflops *)
quimby-xp$ ./mmm_ml
(* 88.16 mflops *)
quimby-xp$ ./mmm_ml
(* 89.07 mflops *)
quimby-xp$ ./mmm_c
/* 523.64 mflops */
quimby-xp$ ./mmm_c
/* 523.64 mflops */
quimby-xp$ ./mmm_c
/* 523.64 mflops */
quimby-xp$