Hi Alexey,
The recursion version is slower because you're allocating a tuple on each recursive call. Try to use a curried form of a function, and you will a performance that is even better than the for-loop (10% better in my case). E.g.,
let loop high =
let rec loop i = function
| t when i > high -> t
| t when i mod 2 = 0 -> loop (i + 1) (t + i * 2)
| t -> loop (i + 1) t in
loop 1 0
The for-loop version is indeed slower than the C version because of tagging, c.f., the OCaml implementation:
shr movq $1, %rbx
movq $3, %rdi
cmpq %rax, %rdi
jg .L100
.L101:
movq %rdi, %rsi
sarq $1, %rsi
moq $63, %rdx
leaq -2(%rbx,%rdi,2), %rbx
With the GCC output:
Notice this `sar` and `shr` instructions, they are here due to the tagging. The GCC compiler was also able to leverage the conditional move instruction `cmove`, though it looks nice I'm not sure how much it actually won. There is no need to use flambda for the recursive and for-loop as they are already optimized to the maximum.
Concerning your stream implementation, you may compare it to the Core Kernel's Sequence. It is basically the same, so you can play the benchmarking game :)
The general way to implement a stream with a zero performance penalty (i.e., as fast as a for loop) is to use staging. This topic is explored in detail in the Oleg Kiselyov et alas Strymonad [1] paper. The implementation, as well as other interesting discussions about streams, are available on his homepage [2] (there are tons of very interesting stuff there!).
Best,
Ivan