Hi Alexey,

The recursion version is slower because you're allocating a tuple on each recursive call. Try to use a curried form of a function, and you will a performance that is even better than the for-loop (10% better in my case). E.g., 


let loop high =
  let rec loop i = function
    | t when i > high -> t
    | t when i mod 2 = 0 -> loop (i + 1) (t + i * 2)
    | t -> loop (i + 1) t in
  loop 1 0

The for-loop version is indeed slower than the C version because of tagging, c.f., the OCaml implementation:

movq $1, %rbx
movq $3, %rdi
cmpq %rax, %rdi
jg .L100

.L101:
movq %rdi, %rsi
sarq $1, %rsi
movq %rsi, %rdx
shr movq $1, %rbx
movq $3, %rdi
cmpq %rax, %rdi
jg .L100

.L101:
movq %rdi, %rsi
sarq $1, %rsi
moq $63, %rdx
movq %rsi, %rcx
addq %rdx, %rcx
andq $-2, %rcx
subq %rcx, %rsi
leaq 1(%rsi,%rsi), %rsi
cmpq $1, %rsi
jne .L102
leaq -2(%rbx,%rdi,2), %rbx
.L102:
movq %rdi, %rsi
addq $2, %rdi
cmpq %rax, %rsi
jne .L101
.L100:
movq %rbx, %rax
ret

With the GCC output:


testl %edi, %edi
jle .L6
addl $1, %edi
movl $4, %ecx
movl $1, %edx
xorl %eax, %eax
jmp .L3
.L5:
leaq (%rax,%rcx), %rsi
testb $1, %dl
cmove %rsi, %rax
addq $2, %rcx
.L3:
addl $1, %edx
cmpl %edi, %edx
jne .L5
rep ret
.L6:
xorl %eax, %eax
ret

Notice this `sar` and `shr` instructions, they are here due to the tagging. The GCC compiler was also able to leverage the conditional move instruction `cmove`, though it looks nice I'm not sure how much it actually won. There is no need to use flambda for the recursive and for-loop as they are already optimized to the maximum. 

Concerning your stream implementation, you may compare it to the Core Kernel's Sequence. It is basically the same, so you can play the benchmarking game :)

The general way to implement a stream with a zero performance penalty (i.e., as fast as a for loop) is to use staging. This topic is explored in detail in the Oleg Kiselyov et alas Strymonad [1] paper. The implementation, as well as other interesting discussions about streams, are available on his homepage [2] (there are tons of very interesting stuff there!).

Best,
Ivan

[1]: http://okmij.org/ftp/meta-programming/strymonas.pdf
[2]: http://okmij.org/ftp/Streams.html





On Mon, Jul 10, 2017 at 4:26 PM, Alexey Egorov <alex.only.d@gmail.com> wrote:
Hello,

Michael Snoyman recently posted very interesting article about Haskell
streams and Rust iterators (
https://www.fpcomplete.com/blog/2017/07/iterators-streams-rust-haskell
) and I tried to do the same in OCaml.

You can find my code here -
https://gist.github.com/anonymous/127e9116b362d561c5dfa9cce6b453f3

There is four different versions:

1) c_loop.c - plain C loop
2) f_loop.ml - for-loop in OCaml
3) loop.ml - OCaml loop using recursion
4) stream.ml - OCaml loop using streams

Here is the results:

c_loop.c:   0m0.105s
f_loop.ml:  0m0.184s
loop.ml:     0m0.207s
stream.ml: 0m0.605s

It's not suprise that C version is fastest - at least, it uses untagged ints.
What surprises me is that recursion is slower than for-loop - can't
OCaml compile it down to the same imperative loop?

And it's very dissapointing that stream version is 3 times slower than
recursion-based.
The problem, I believe, is that OCaml is unable to optimize
intermediate stream constructors away (as GHC do).

I tried to add inline annotations, which helps a little - is there
something I can do to improve it further?

--
Caml-list mailing list.  Subscription management and archives:
https://sympa.inria.fr/sympa/arc/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs