Hi Christoph,
The problem is that your function definitions, like `loop` and `rk4_step`, have too many parameters
and OCaml is not able to eliminate them and is actually not trying. It was always a feature of OCaml
that poorly written code will be compiled into a poorly written program. OCaml never tries to fix
programmer's errors. It will try to minimize the abstraction overhead (often to the zero). But if the abstractions,
on the first hand, were chosen incorrectly, then it won't fix the code.
In your particular example, the C compiler was quite aggressive and was capable of eliminating unnecessary computations.
I wouldn't, however, assume that the C compiler will be always able to do this for you.
Let's go through the code:
let rec loop steps h n y t =
if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else
Here variables `steps` and `h` are loop invariants, so you shouldn't put it into the loop function parameter list.
Well yes, a compiler can find out loop invariants and remove them for you. But in our case, to remove them,
it will need to change the type of the function. The compiler will not go that far for us. It respects a programmer, and if a
programmer decided to make his loop function to depend on `6` arguments, then it means that the computation is actually depending
on 6 arguments. So, it will allocate 6 registers to hold loop variables, with all the consequences.
Now, let's take a look at the `rk4_step` function:
let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in
let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in
let k4 = h *. y' (t +. h) (y +. k3) in
y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0
This function, is, in fact, a body of the loop, and everything except t is loop invariant here. Moreover,
function `y'` is defined as:
I.e., it doesn't really use it the second argument. Probably, a compiler should inline the call, and eliminate lots of unecessary computations, and thus free a few registers, but, apparently, OCaml doesn't do this
(even in 4.03+flambda).
So we should do this manually:
let rk4_step y t =
let k1 = h *. y' t in
let k2 = h *. y' (t +. 0.5*.h) in
let k3 = h *. y' (t +. 0.5*.h) in
let k4 = h *. y' (t +. h) (y +. k3) in
y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0
We can even see, that `k3` and `k2` are equal now, so we can eliminate them:
let k2 = h *. y' (t +. 0.5*.h) in
let k4 = h *. y' (t +. h) (y +. k3) in
y +. (k1+.k4)/.6.0 +. k2 *. 1.5
Finally, we don't want to pass `y` into the `rk4_step` every time, as we don't want to require an extra register for it.
After all these manual optimizations, we have the following program:
let h = 0.1
let exact t = sin t
let rk4_step t =
let k1 = h *. cos t in
let k2 = h *. cos (t +. 0.5*.h) in
let k4 = h *. cos (t +. h) in
(k1+.k4)/.6.0 +. k2*.1.5
let compute steps =
let rec loop n y t =
if n < steps
then loop (n+1) (y +. rk4_step t) (t +. h)
else y in
loop 1 1.0 0.0
let () =
let y = compute 102 in
let err = abs_float (y -. (exact ((float_of_int 102) *. h))) in
let large = 50000000 in
let y = compute large in
Printf.printf "%b\n"
(abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err)
This program has the same performance as the C one... unless I pass really aggressive optimization options
to the C compiler, that will emit a platform specific code, e.g.,
gcc rk.c -lm -O3 -march=corei7-avx -o rksse
These options basically double the performance of the C version, leaving OCaml lagging behind. That is because,
OCaml, obviously, cannot follow the latest developments of intel CPU, especially in the field of SSE.
The fun part is that when I've tried to compile the same file with clang, the resulting program was even slower
than the original non-optimized OCaml. But this is all micro benchmarking of course, so don't jump to fast conclusions
(although I like to think that OCaml is faster than Clang :))
As a final remark, my experience in HPC shows that in general you should not really rely on compiler optimizations and hope
that the compiler will do the magic for you. Even the GCC compiler. It would be very easy to accidentally amend the above program
in a such way, that the optimizations will no longer fire in. Of course, writing in assembly is also not a choice. If you really need
to optimize, then you should find out the performance bottleneck and then optimize it manually until you get an expected performance.
Alternatively, you can use plain old Fortran to get the reliable performance. And then call it from C or OCaml.
Best wishes,
Ivan