Doug Bagley's Great Language Shootout says that O'Caml and GCC get about the same performance for Matrix-Matrix Multiplication (MMM). I'm seeing somewhat different results for my code. In particular, on my machine (1.2GHz PIIIM, 256M), the O'Caml version of the code is about 6 times slower than the C version. The codes are attached.

A few of the differences between my code and Doug's code are that I have tiled the loops and that I use constants for the matrices and the blocksize. I have not (yet) hoisted the invariant expressions.

Looking at the assembly produced by O'Caml and GCC, it appears that GCC is performing loop unrolling (as requested with -funroll-loops) and strength reduction in the inner loops. I can easily see why these two optimizations can result in such a tremendous performance difference.

My question is this: I can obviously perform loop unrolling myself by hand, but does ocamlopt perform strength reduction? Is there any way that I can get the O'Caml code close to the performance of the C code?

I can provide additional information about my setup if that would help.

Thanks.

quimby-xp$ uname -a
CYGWIN_NT-5.1 QUIMBY-XP 1.3.13(0.62/3/2) 2002-10-13 23:15 i686 unknown

quimby-xp$ ocamlopt.opt -v
The Objective Caml native-code compiler, version 3.06
Standard library directory: /usr/local/ocaml/lib/ocaml

quimby-xp$ gcc --version
gcc (GCC) 3.2 20020818 (prerelease)
Copyright (C) 2002 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

quimby-xp$ ocamlopt.opt -S -o mmm_ml.exe unix.cmxa mmm_ml.ml
quimby-xp$ gcc -O4 -funroll-loops -o mmm_c.exe mmm_c.c

quimby-xp$ ./mmm_ml
(* 86.83 mflops *)
quimby-xp$ ./mmm_ml
(* 88.16 mflops *)
quimby-xp$ ./mmm_ml
(* 89.07 mflops *)

quimby-xp$ ./mmm_c
/* 523.64 mflops */
quimby-xp$ ./mmm_c
/* 523.64 mflops */
quimby-xp$ ./mmm_c
/* 523.64 mflops */
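
For illustration only, here is a minimal sketch of what hand-applied strength reduction plus 2x unrolling could look like on an O'Caml inner loop. It is not the attached mmm_ml.ml: the flat row-major array layout, the size n = 4, and the names dot_naive and dot_reduced are all assumptions made up for this example, and whether the rewrite actually helps under ocamlopt 3.06 would have to be measured.

```ocaml
(* Hypothetical setup: n x n matrices stored row-major in flat float
   arrays.  n is assumed to be even so the loop can be unrolled by 2. *)
let n = 4

(* Straightforward inner product of row i of a with column j of b.
   The index expressions i * n + k and k * n + j each cost a multiply
   on every iteration. *)
let dot_naive a b i j =
  let sum = ref 0.0 in
  for k = 0 to n - 1 do
    sum := !sum +. a.(i * n + k) *. b.(k * n + j)
  done;
  !sum

(* Same computation with the index multiplies replaced by running
   offsets (strength reduction) and the body unrolled by hand, twice.
   The additions happen in the same order as in dot_naive, so the
   floating-point result is identical. *)
let dot_reduced a b i j =
  let sum = ref 0.0 in
  let ai = ref (i * n) in   (* current index into row i of a *)
  let bj = ref j in         (* current index into column j of b *)
  for _k = 1 to n / 2 do
    sum := !sum +. a.(!ai) *. b.(!bj)
                +. a.(!ai + 1) *. b.(!bj + n);
    ai := !ai + 2;
    bj := !bj + 2 * n
  done;
  !sum
```

The two functions can be checked against each other on small inputs before timing them, which makes it easier to trust the hand-optimized version when it is dropped into the tiled loops.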