[Caml-list] Optimizing pure-functional streams

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

* [Caml-list] Optimizing pure-functional streams
@ 2017-07-10 20:26 Alexey Egorov
  2017-07-10 21:29 ` Ivan Gotovchits
  2017-07-11 12:54 ` Simon Cruanes
  0 siblings, 2 replies; 10+ messages in thread
From: Alexey Egorov @ 2017-07-10 20:26 UTC (permalink / raw)
  To: caml users

Hello,

Michael Snoyman recently posted very interesting article about Haskell
streams and Rust iterators (
https://www.fpcomplete.com/blog/2017/07/iterators-streams-rust-haskell
) and I tried to do the same in OCaml.

You can find my code here -
https://gist.github.com/anonymous/127e9116b362d561c5dfa9cce6b453f3

There is four different versions:

1) c_loop.c - plain C loop
2) f_loop.ml - for-loop in OCaml
3) loop.ml - OCaml loop using recursion
4) stream.ml - OCaml loop using streams

Here is the results:

c_loop.c:   0m0.105s
f_loop.ml:  0m0.184s
loop.ml:     0m0.207s
stream.ml: 0m0.605s

It's not suprise that C version is fastest - at least, it uses untagged ints.
What surprises me is that recursion is slower than for-loop - can't
OCaml compile it down to the same imperative loop?

And it's very dissapointing that stream version is 3 times slower than
recursion-based.
The problem, I believe, is that OCaml is unable to optimize
intermediate stream constructors away (as GHC do).

I tried to add inline annotations, which helps a little - is there
something I can do to improve it further?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Optimizing pure-functional streams
  2017-07-10 20:26 [Caml-list] Optimizing pure-functional streams Alexey Egorov
@ 2017-07-10 21:29 ` Ivan Gotovchits
  2017-07-10 22:28   ` Alexey Egorov
  2017-07-11 12:54 ` Simon Cruanes
  1 sibling, 1 reply; 10+ messages in thread
From: Ivan Gotovchits @ 2017-07-10 21:29 UTC (permalink / raw)
  To: Alexey Egorov; +Cc: caml users

[-- Attachment #1: Type: text/plain, Size: 3676 bytes --]

Hi Alexey,

The recursion version is slower because you're allocating a tuple on each
recursive call. Try to use a curried form of a function, and you will a
performance that is even better than the for-loop (10% better in my case).
E.g.,

let loop high =
  let rec loop i = function
    | t when i > high -> t
    | t when i mod 2 = 0 -> loop (i + 1) (t + i * 2)
    | t -> loop (i + 1) t in
  loop 1 0

The for-loop version is indeed slower than the C version because of
tagging, c.f., the OCaml implementation:

movq $1, %rbx
movq $3, %rdi
cmpq %rax, %rdi
jg .L100

.L101:
movq %rdi, %rsi
sarq $1, %rsi
movq %rsi, %rdx
shr movq $1, %rbx
movq $3, %rdi
cmpq %rax, %rdi
jg .L100

.L101:
movq %rdi, %rsi
sarq $1, %rsi
moq $63, %rdx
movq %rsi, %rcx
addq %rdx, %rcx
andq $-2, %rcx
subq %rcx, %rsi
leaq 1(%rsi,%rsi), %rsi
cmpq $1, %rsi
jne .L102
leaq -2(%rbx,%rdi,2), %rbx
.L102:
movq %rdi, %rsi
addq $2, %rdi
cmpq %rax, %rsi
jne .L101
.L100:
movq %rbx, %rax
ret

With the GCC output:

testl %edi, %edi
jle .L6
addl $1, %edi
movl $4, %ecx
movl $1, %edx
xorl %eax, %eax
jmp .L3
.L5:
leaq (%rax,%rcx), %rsi
testb $1, %dl
cmove %rsi, %rax
addq $2, %rcx
.L3:
addl $1, %edx
cmpl %edi, %edx
jne .L5
rep ret
.L6:
xorl %eax, %eax
ret

Notice this `sar` and `shr` instructions, they are here due to the tagging.
The GCC compiler was also able to leverage the conditional move instruction
`cmove`, though it looks nice I'm not sure how much it actually won. There
is no need to use flambda for the recursive and for-loop as they are
already optimized to the maximum.

Concerning your stream implementation, you may compare it to the Core
Kernel's Sequence. It is basically the same, so you can play the
benchmarking game :)

The general way to implement a stream with a zero performance penalty
(i.e., as fast as a for loop) is to use staging. This topic is explored in
detail in the Oleg Kiselyov et alas Strymonad [1] paper. The
implementation, as well as other interesting discussions about streams, are
available on his homepage [2] (there are tons of very interesting stuff
there!).

Best,
Ivan

[1]: http://okmij.org/ftp/meta-programming/strymonas.pdf
[2]: http://okmij.org/ftp/Streams.html

On Mon, Jul 10, 2017 at 4:26 PM, Alexey Egorov <alex.only.d@gmail.com>
wrote:

> Hello,
>
> Michael Snoyman recently posted very interesting article about Haskell
> streams and Rust iterators (
> https://www.fpcomplete.com/blog/2017/07/iterators-streams-rust-haskell
> ) and I tried to do the same in OCaml.
>
> You can find my code here -
> https://gist.github.com/anonymous/127e9116b362d561c5dfa9cce6b453f3
>
> There is four different versions:
>
> 1) c_loop.c - plain C loop
> 2) f_loop.ml - for-loop in OCaml
> 3) loop.ml - OCaml loop using recursion
> 4) stream.ml - OCaml loop using streams
>
> Here is the results:
>
> c_loop.c:   0m0.105s
> f_loop.ml:  0m0.184s
> loop.ml:     0m0.207s
> stream.ml: 0m0.605s
>
> It's not suprise that C version is fastest - at least, it uses untagged
> ints.
> What surprises me is that recursion is slower than for-loop - can't
> OCaml compile it down to the same imperative loop?
>
> And it's very dissapointing that stream version is 3 times slower than
> recursion-based.
> The problem, I believe, is that OCaml is unable to optimize
> intermediate stream constructors away (as GHC do).
>
> I tried to add inline annotations, which helps a little - is there
> something I can do to improve it further?
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

[-- Attachment #2: Type: text/html, Size: 12776 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Optimizing pure-functional streams
  2017-07-10 21:29 ` Ivan Gotovchits
@ 2017-07-10 22:28   ` Alexey Egorov
  2017-07-11 12:40     ` Ivan Gotovchits
  0 siblings, 1 reply; 10+ messages in thread
From: Alexey Egorov @ 2017-07-10 22:28 UTC (permalink / raw)
  To: Ivan Gotovchits; +Cc: caml users

Hi Ivan,

thanks for your analysis!

Which version of OCaml you are using? I'm tried on 4.04.1 and trunk
(with and without flambda) and still getting the same (poor)
performance for recursive version, no matter if curried or uncurried
loop is used.

2017-07-11 2:29 GMT+05:00 Ivan Gotovchits <ivg@ieee.org>:
> Hi Alexey,
>
> The recursion version is slower because you're allocating a tuple on each
> recursive call. Try to use a curried form of a function, and you will a
> performance that is even better than the for-loop (10% better in my case).
> E.g.,
>
>
>
> let loop high =
>   let rec loop i = function
>     | t when i > high -> t
>     | t when i mod 2 = 0 -> loop (i + 1) (t + i * 2)
>     | t -> loop (i + 1) t in
>   loop 1 0
>
>
> The for-loop version is indeed slower than the C version because of tagging,
> c.f., the OCaml implementation:
>
> movq $1, %rbx
> movq $3, %rdi
> cmpq %rax, %rdi
> jg .L100
>
>
> .L101:
> movq %rdi, %rsi
> sarq $1, %rsi
> movq %rsi, %rdx
> shr movq $1, %rbx
> movq $3, %rdi
> cmpq %rax, %rdi
> jg .L100
>
> .L101:
> movq %rdi, %rsi
> sarq $1, %rsi
> moq $63, %rdx
> movq %rsi, %rcx
> addq %rdx, %rcx
> andq $-2, %rcx
> subq %rcx, %rsi
> leaq 1(%rsi,%rsi), %rsi
> cmpq $1, %rsi
> jne .L102
> leaq -2(%rbx,%rdi,2), %rbx
> .L102:
> movq %rdi, %rsi
> addq $2, %rdi
> cmpq %rax, %rsi
> jne .L101
> .L100:
> movq %rbx, %rax
> ret
>
>
> With the GCC output:
>
>
> testl %edi, %edi
> jle .L6
> addl $1, %edi
> movl $4, %ecx
> movl $1, %edx
> xorl %eax, %eax
> jmp .L3
> .L5:
> leaq (%rax,%rcx), %rsi
> testb $1, %dl
> cmove %rsi, %rax
> addq $2, %rcx
> .L3:
> addl $1, %edx
> cmpl %edi, %edx
> jne .L5
> rep ret
> .L6:
> xorl %eax, %eax
> ret
>
>
> Notice this `sar` and `shr` instructions, they are here due to the tagging.
> The GCC compiler was also able to leverage the conditional move instruction
> `cmove`, though it looks nice I'm not sure how much it actually won. There
> is no need to use flambda for the recursive and for-loop as they are already
> optimized to the maximum.
>
> Concerning your stream implementation, you may compare it to the Core
> Kernel's Sequence. It is basically the same, so you can play the
> benchmarking game :)
>
> The general way to implement a stream with a zero performance penalty (i.e.,
> as fast as a for loop) is to use staging. This topic is explored in detail
> in the Oleg Kiselyov et alas Strymonad [1] paper. The implementation, as
> well as other interesting discussions about streams, are available on his
> homepage [2] (there are tons of very interesting stuff there!).
>
> Best,
> Ivan
>
> [1]: http://okmij.org/ftp/meta-programming/strymonas.pdf
> [2]: http://okmij.org/ftp/Streams.html
>
>
>
>
>
> On Mon, Jul 10, 2017 at 4:26 PM, Alexey Egorov <alex.only.d@gmail.com>
> wrote:
>>
>> Hello,
>>
>> Michael Snoyman recently posted very interesting article about Haskell
>> streams and Rust iterators (
>> https://www.fpcomplete.com/blog/2017/07/iterators-streams-rust-haskell
>> ) and I tried to do the same in OCaml.
>>
>> You can find my code here -
>> https://gist.github.com/anonymous/127e9116b362d561c5dfa9cce6b453f3
>>
>> There is four different versions:
>>
>> 1) c_loop.c - plain C loop
>> 2) f_loop.ml - for-loop in OCaml
>> 3) loop.ml - OCaml loop using recursion
>> 4) stream.ml - OCaml loop using streams
>>
>> Here is the results:
>>
>> c_loop.c:   0m0.105s
>> f_loop.ml:  0m0.184s
>> loop.ml:     0m0.207s
>> stream.ml: 0m0.605s
>>
>> It's not suprise that C version is fastest - at least, it uses untagged
>> ints.
>> What surprises me is that recursion is slower than for-loop - can't
>> OCaml compile it down to the same imperative loop?
>>
>> And it's very dissapointing that stream version is 3 times slower than
>> recursion-based.
>> The problem, I believe, is that OCaml is unable to optimize
>> intermediate stream constructors away (as GHC do).
>>
>> I tried to add inline annotations, which helps a little - is there
>> something I can do to improve it further?
>>
>> --
>> Caml-list mailing list.  Subscription management and archives:
>> https://sympa.inria.fr/sympa/arc/caml-list
>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>> Bug reports: http://caml.inria.fr/bin/caml-bugs
>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Optimizing pure-functional streams
  2017-07-10 22:28   ` Alexey Egorov
@ 2017-07-11 12:40     ` Ivan Gotovchits
  0 siblings, 0 replies; 10+ messages in thread
From: Ivan Gotovchits @ 2017-07-11 12:40 UTC (permalink / raw)
  To: Alexey Egorov; +Cc: caml users

[-- Attachment #1: Type: text/plain, Size: 4958 bytes --]

On Mon, Jul 10, 2017 at 6:28 PM, Alexey Egorov <alex.only.d@gmail.com>
wrote:

> Hi Ivan,
>
> thanks for your analysis!
>
> Which version of OCaml you are using? I'm tried on 4.04.1 and trunk
> (with and without flambda) and still getting the same (poor)
> performance for recursive version, no matter if curried or uncurried
> loop is used.
>

It works the same on any version of OCaml. Are you sure, that you're
actually recompiling the code?
But if you insist, then I was trying on 4.03.0 and 4.04.0+flambda.

P.S. I would like to retract my statement that the recursive version is 10%
faster, in fact it is the same, it was a measurement error.


>
> 2017-07-11 2:29 GMT+05:00 Ivan Gotovchits <ivg@ieee.org>:
> > Hi Alexey,
> >
> > The recursion version is slower because you're allocating a tuple on each
> > recursive call. Try to use a curried form of a function, and you will a
> > performance that is even better than the for-loop (10% better in my
> case).
> > E.g.,
> >
> >
> >
> > let loop high =
> >   let rec loop i = function
> >     | t when i > high -> t
> >     | t when i mod 2 = 0 -> loop (i + 1) (t + i * 2)
> >     | t -> loop (i + 1) t in
> >   loop 1 0
> >
> >
> > The for-loop version is indeed slower than the C version because of
> tagging,
> > c.f., the OCaml implementation:
> >
> > movq $1, %rbx
> > movq $3, %rdi
> > cmpq %rax, %rdi
> > jg .L100
> >
> >
> > .L101:
> > movq %rdi, %rsi
> > sarq $1, %rsi
> > movq %rsi, %rdx
> > shr movq $1, %rbx
> > movq $3, %rdi
> > cmpq %rax, %rdi
> > jg .L100
> >
> > .L101:
> > movq %rdi, %rsi
> > sarq $1, %rsi
> > moq $63, %rdx
> > movq %rsi, %rcx
> > addq %rdx, %rcx
> > andq $-2, %rcx
> > subq %rcx, %rsi
> > leaq 1(%rsi,%rsi), %rsi
> > cmpq $1, %rsi
> > jne .L102
> > leaq -2(%rbx,%rdi,2), %rbx
> > .L102:
> > movq %rdi, %rsi
> > addq $2, %rdi
> > cmpq %rax, %rsi
> > jne .L101
> > .L100:
> > movq %rbx, %rax
> > ret
> >
> >
> > With the GCC output:
> >
> >
> > testl %edi, %edi
> > jle .L6
> > addl $1, %edi
> > movl $4, %ecx
> > movl $1, %edx
> > xorl %eax, %eax
> > jmp .L3
> > .L5:
> > leaq (%rax,%rcx), %rsi
> > testb $1, %dl
> > cmove %rsi, %rax
> > addq $2, %rcx
> > .L3:
> > addl $1, %edx
> > cmpl %edi, %edx
> > jne .L5
> > rep ret
> > .L6:
> > xorl %eax, %eax
> > ret
> >
> >
> > Notice this `sar` and `shr` instructions, they are here due to the
> tagging.
> > The GCC compiler was also able to leverage the conditional move
> instruction
> > `cmove`, though it looks nice I'm not sure how much it actually won.
> There
> > is no need to use flambda for the recursive and for-loop as they are
> already
> > optimized to the maximum.
> >
> > Concerning your stream implementation, you may compare it to the Core
> > Kernel's Sequence. It is basically the same, so you can play the
> > benchmarking game :)
> >
> > The general way to implement a stream with a zero performance penalty
> (i.e.,
> > as fast as a for loop) is to use staging. This topic is explored in
> detail
> > in the Oleg Kiselyov et alas Strymonad [1] paper. The implementation, as
> > well as other interesting discussions about streams, are available on his
> > homepage [2] (there are tons of very interesting stuff there!).
> >
> > Best,
> > Ivan
> >
> > [1]: http://okmij.org/ftp/meta-programming/strymonas.pdf
> > [2]: http://okmij.org/ftp/Streams.html
> >
> >
> >
> >
> >
> > On Mon, Jul 10, 2017 at 4:26 PM, Alexey Egorov <alex.only.d@gmail.com>
> > wrote:
> >>
> >> Hello,
> >>
> >> Michael Snoyman recently posted very interesting article about Haskell
> >> streams and Rust iterators (
> >> https://www.fpcomplete.com/blog/2017/07/iterators-streams-rust-haskell
> >> ) and I tried to do the same in OCaml.
> >>
> >> You can find my code here -
> >> https://gist.github.com/anonymous/127e9116b362d561c5dfa9cce6b453f3
> >>
> >> There is four different versions:
> >>
> >> 1) c_loop.c - plain C loop
> >> 2) f_loop.ml - for-loop in OCaml
> >> 3) loop.ml - OCaml loop using recursion
> >> 4) stream.ml - OCaml loop using streams
> >>
> >> Here is the results:
> >>
> >> c_loop.c:   0m0.105s
> >> f_loop.ml:  0m0.184s
> >> loop.ml:     0m0.207s
> >> stream.ml: 0m0.605s
> >>
> >> It's not suprise that C version is fastest - at least, it uses untagged
> >> ints.
> >> What surprises me is that recursion is slower than for-loop - can't
> >> OCaml compile it down to the same imperative loop?
> >>
> >> And it's very dissapointing that stream version is 3 times slower than
> >> recursion-based.
> >> The problem, I believe, is that OCaml is unable to optimize
> >> intermediate stream constructors away (as GHC do).
> >>
> >> I tried to add inline annotations, which helps a little - is there
> >> something I can do to improve it further?
> >>
> >> --
> >> Caml-list mailing list.  Subscription management and archives:
> >> https://sympa.inria.fr/sympa/arc/caml-list
> >> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> >> Bug reports: http://caml.inria.fr/bin/caml-bugs
> >
> >
>

[-- Attachment #2: Type: text/html, Size: 7888 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Optimizing pure-functional streams
  2017-07-10 20:26 [Caml-list] Optimizing pure-functional streams Alexey Egorov
  2017-07-10 21:29 ` Ivan Gotovchits
@ 2017-07-11 12:54 ` Simon Cruanes
  2017-07-11 17:37   ` Ivan Gotovchits
  1 sibling, 1 reply; 10+ messages in thread
From: Simon Cruanes @ 2017-07-11 12:54 UTC (permalink / raw)
  To: Alexey Egorov; +Cc: caml users


[-- Attachment #1.1: Type: text/plain, Size: 1026 bytes --]

Hello,

Iterators in OCaml have been the topic of many discussions. Another
option for fast iterators is https://github.com/c-cube/sequence ,
which (with flambda) should compile down to loops and tests on this kind
of benchmark. With the attached additional file on 4.04.0+flambda,
I obtain the following (where sequence is test-seq):

$ for i in test-* ; do echo $i ; time ./$i ; done
test-c_loop
5000000100000000
./$i  0.08s user 0.00s system 97% cpu 0.085 total
test-f_loop
5000000100000000
./$i  0.10s user 0.00s system 96% cpu 0.100 total
test-loop
5000000100000000
./$i  0.18s user 0.00s system 97% cpu 0.184 total
test-seq
5000000100000000
./$i  0.11s user 0.00s system 97% cpu 0.113 total
test-stream
5000000100000000
./$i  0.44s user 0.00s system 98% cpu 0.449 total


Note that sequence is imperative underneath, but can be safely used as a
functional structure.

-- 
Simon Cruanes

http://weusepgp.info/
key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA 62B6

[-- Attachment #1.2: seq.ml --]
[-- Type: text/plain, Size: 474 bytes --]


type 'a seq = ('a -> unit) -> unit

let enum high k =
  for i = 1 to high do k i done
[@@inline always]

let sum s =
  let n = ref 0 in
  s (fun x -> n := !n + x);
  !n
[@@inline always]

let filter pred seq k =
  seq (fun x -> if pred x then k x)
[@@inline always]

let map f seq k =
  seq (fun x -> k (f x))
[@@inline always]

let high = 1000 * 1000 * 100

let res = enum high
	|> filter (fun x -> x mod 2 = 0)
	|> map (( * ) 2)
	|> sum

let _ = Printf.printf "%d\n" res

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Optimizing pure-functional streams
  2017-07-11 12:54 ` Simon Cruanes
@ 2017-07-11 17:37   ` Ivan Gotovchits
  2017-07-11 17:46     ` Yotam Barnoy
  0 siblings, 1 reply; 10+ messages in thread
From: Ivan Gotovchits @ 2017-07-11 17:37 UTC (permalink / raw)
  To: Simon Cruanes; +Cc: Alexey Egorov, caml users

[-- Attachment #1: Type: text/plain, Size: 1774 bytes --]

TWIMC,

I've played a little bit with different optimization options in
flambda 4.04, and finally, all three versions of the loop: curried,
uncurried, and the for-loop, have the same performance, though they still
loose about 30% to the C version, due to tagging.

Basically, this means, that flambda was able to get rid of the allocation.
I don't actually know which of the options finally made the difference, but
this is how I compiled it.

ocamlopt.opt -c -S -inlining-report -unbox-closures -O3 -rounds 8
-inline-max-depth 256 -inline-max-unroll 1024 -o loop.cmx loop.ml
ocamlopt.opt loop.cmx -o loop.native


Regards,
Ivan




On Tue, Jul 11, 2017 at 8:54 AM, Simon Cruanes <simon.cruanes.2007@m4x.org>
wrote:

> Hello,
>
> Iterators in OCaml have been the topic of many discussions. Another
> option for fast iterators is https://github.com/c-cube/sequence ,
> which (with flambda) should compile down to loops and tests on this kind
> of benchmark. With the attached additional file on 4.04.0+flambda,
> I obtain the following (where sequence is test-seq):
>
> $ for i in test-* ; do echo $i ; time ./$i ; done
> test-c_loop
> 5000000100000000
> ./$i  0.08s user 0.00s system 97% cpu 0.085 total
> test-f_loop
> 5000000100000000
> ./$i  0.10s user 0.00s system 96% cpu 0.100 total
> test-loop
> 5000000100000000
> ./$i  0.18s user 0.00s system 97% cpu 0.184 total
> test-seq
> 5000000100000000
> ./$i  0.11s user 0.00s system 97% cpu 0.113 total
> test-stream
> 5000000100000000
> ./$i  0.44s user 0.00s system 98% cpu 0.449 total
>
>
> Note that sequence is imperative underneath, but can be safely used as a
> functional structure.
>
> --
> Simon Cruanes
>
> http://weusepgp.info/
> key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA
> 62B6
>

[-- Attachment #2: Type: text/html, Size: 2783 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Optimizing pure-functional streams
  2017-07-11 17:37   ` Ivan Gotovchits
@ 2017-07-11 17:46     ` Yotam Barnoy
  2017-07-11 18:04       ` Gabriel Scherer
  0 siblings, 1 reply; 10+ messages in thread
From: Yotam Barnoy @ 2017-07-11 17:46 UTC (permalink / raw)
  To: Ivan Gotovchits; +Cc: Simon Cruanes, Alexey Egorov, caml users

> I've played a little bit with different optimization options in flambda 4.04, and finally, all three versions of the loop: curried, uncurried, and the for-loop, have the same performance, though they still loose about 30% to the C version, due to tagging.

Would it perhaps make sense to try Int64 in order to avoid the tagging?

Also, I believe it would make sense to have Flambda try to switch to
an untagged type in tight loops to avoid this tagging cost.

On Tue, Jul 11, 2017 at 1:37 PM, Ivan Gotovchits <ivg@ieee.org> wrote:
> TWIMC,
>
> I've played a little bit with different optimization options in flambda
> 4.04, and finally, all three versions of the loop: curried, uncurried, and
> the for-loop, have the same performance, though they still loose about 30%
> to the C version, due to tagging.
>
> Basically, this means, that flambda was able to get rid of the allocation. I
> don't actually know which of the options finally made the difference, but
> this is how I compiled it.
>
> ocamlopt.opt -c -S -inlining-report -unbox-closures -O3 -rounds 8
> -inline-max-depth 256 -inline-max-unroll 1024 -o loop.cmx loop.ml
> ocamlopt.opt loop.cmx -o loop.native
>
>
> Regards,
> Ivan
>
>
>
>
> On Tue, Jul 11, 2017 at 8:54 AM, Simon Cruanes <simon.cruanes.2007@m4x.org>
> wrote:
>>
>> Hello,
>>
>> Iterators in OCaml have been the topic of many discussions. Another
>> option for fast iterators is https://github.com/c-cube/sequence ,
>> which (with flambda) should compile down to loops and tests on this kind
>> of benchmark. With the attached additional file on 4.04.0+flambda,
>> I obtain the following (where sequence is test-seq):
>>
>> $ for i in test-* ; do echo $i ; time ./$i ; done
>> test-c_loop
>> 5000000100000000
>> ./$i  0.08s user 0.00s system 97% cpu 0.085 total
>> test-f_loop
>> 5000000100000000
>> ./$i  0.10s user 0.00s system 96% cpu 0.100 total
>> test-loop
>> 5000000100000000
>> ./$i  0.18s user 0.00s system 97% cpu 0.184 total
>> test-seq
>> 5000000100000000
>> ./$i  0.11s user 0.00s system 97% cpu 0.113 total
>> test-stream
>> 5000000100000000
>> ./$i  0.44s user 0.00s system 98% cpu 0.449 total
>>
>>
>> Note that sequence is imperative underneath, but can be safely used as a
>> functional structure.
>>
>> --
>> Simon Cruanes
>>
>> http://weusepgp.info/
>> key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA
>> 62B6
>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Optimizing pure-functional streams
  2017-07-11 17:46     ` Yotam Barnoy
@ 2017-07-11 18:04       ` Gabriel Scherer
  2017-07-11 18:15         ` Yotam Barnoy
  0 siblings, 1 reply; 10+ messages in thread
From: Gabriel Scherer @ 2017-07-11 18:04 UTC (permalink / raw)
  To: Yotam Barnoy; +Cc: Ivan Gotovchits, Simon Cruanes, Alexey Egorov, caml users

Moving from tagging to boxing does not make much sense to me, because
the techniques to avoid either are similar and boxing is more
expensive. It might be the case that the compiler today is better at
unboxing than untagging (it currently tries to do both of these
locally), but that would only be because boxing is more expensive and
thus more effort was spent on the unboxing, but medium-term one could
hope for a uniform approach to both transformations.


On Tue, Jul 11, 2017 at 1:46 PM, Yotam Barnoy <yotambarnoy@gmail.com> wrote:
>> I've played a little bit with different optimization options in flambda 4.04, and finally, all three versions of the loop: curried, uncurried, and the for-loop, have the same performance, though they still loose about 30% to the C version, due to tagging.
>
> Would it perhaps make sense to try Int64 in order to avoid the tagging?
>
> Also, I believe it would make sense to have Flambda try to switch to
> an untagged type in tight loops to avoid this tagging cost.
>
> On Tue, Jul 11, 2017 at 1:37 PM, Ivan Gotovchits <ivg@ieee.org> wrote:
>> TWIMC,
>>
>> I've played a little bit with different optimization options in flambda
>> 4.04, and finally, all three versions of the loop: curried, uncurried, and
>> the for-loop, have the same performance, though they still loose about 30%
>> to the C version, due to tagging.
>>
>> Basically, this means, that flambda was able to get rid of the allocation. I
>> don't actually know which of the options finally made the difference, but
>> this is how I compiled it.
>>
>> ocamlopt.opt -c -S -inlining-report -unbox-closures -O3 -rounds 8
>> -inline-max-depth 256 -inline-max-unroll 1024 -o loop.cmx loop.ml
>> ocamlopt.opt loop.cmx -o loop.native
>>
>>
>> Regards,
>> Ivan
>>
>>
>>
>>
>> On Tue, Jul 11, 2017 at 8:54 AM, Simon Cruanes <simon.cruanes.2007@m4x.org>
>> wrote:
>>>
>>> Hello,
>>>
>>> Iterators in OCaml have been the topic of many discussions. Another
>>> option for fast iterators is https://github.com/c-cube/sequence ,
>>> which (with flambda) should compile down to loops and tests on this kind
>>> of benchmark. With the attached additional file on 4.04.0+flambda,
>>> I obtain the following (where sequence is test-seq):
>>>
>>> $ for i in test-* ; do echo $i ; time ./$i ; done
>>> test-c_loop
>>> 5000000100000000
>>> ./$i  0.08s user 0.00s system 97% cpu 0.085 total
>>> test-f_loop
>>> 5000000100000000
>>> ./$i  0.10s user 0.00s system 96% cpu 0.100 total
>>> test-loop
>>> 5000000100000000
>>> ./$i  0.18s user 0.00s system 97% cpu 0.184 total
>>> test-seq
>>> 5000000100000000
>>> ./$i  0.11s user 0.00s system 97% cpu 0.113 total
>>> test-stream
>>> 5000000100000000
>>> ./$i  0.44s user 0.00s system 98% cpu 0.449 total
>>>
>>>
>>> Note that sequence is imperative underneath, but can be safely used as a
>>> functional structure.
>>>
>>> --
>>> Simon Cruanes
>>>
>>> http://weusepgp.info/
>>> key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA
>>> 62B6
>>
>>
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Optimizing pure-functional streams
  2017-07-11 18:04       ` Gabriel Scherer
@ 2017-07-11 18:15         ` Yotam Barnoy
  2017-07-11 18:55           ` Ivan Gotovchits
  0 siblings, 1 reply; 10+ messages in thread
From: Yotam Barnoy @ 2017-07-11 18:15 UTC (permalink / raw)
  To: Gabriel Scherer; +Cc: Ivan Gotovchits, Simon Cruanes, Alexey Egorov, caml users

> Moving from tagging to boxing does not make much sense to me, because
the techniques to avoid either are similar and boxing is more
expensive. It might be the case that the compiler today is better at
unboxing than untagging (it currently tries to do both of these
locally), but that would only be because boxing is more expensive and
thus more effort was spent on the unboxing, but medium-term one could
hope for a uniform approach to both transformations.

Absolutely. I believe there was some skepticism expressed about the
need for untagging in the past by some members of the dev team, which
is why I brought it up. I suggested Int64 only because it's possible
that Flambda will *currently* do better with it than with tagged
integers.

On Tue, Jul 11, 2017 at 2:04 PM, Gabriel Scherer
<gabriel.scherer@gmail.com> wrote:
> Moving from tagging to boxing does not make much sense to me, because
> the techniques to avoid either are similar and boxing is more
> expensive. It might be the case that the compiler today is better at
> unboxing than untagging (it currently tries to do both of these
> locally), but that would only be because boxing is more expensive and
> thus more effort was spent on the unboxing, but medium-term one could
> hope for a uniform approach to both transformations.
>
>
> On Tue, Jul 11, 2017 at 1:46 PM, Yotam Barnoy <yotambarnoy@gmail.com> wrote:
>>> I've played a little bit with different optimization options in flambda 4.04, and finally, all three versions of the loop: curried, uncurried, and the for-loop, have the same performance, though they still loose about 30% to the C version, due to tagging.
>>
>> Would it perhaps make sense to try Int64 in order to avoid the tagging?
>>
>> Also, I believe it would make sense to have Flambda try to switch to
>> an untagged type in tight loops to avoid this tagging cost.
>>
>> On Tue, Jul 11, 2017 at 1:37 PM, Ivan Gotovchits <ivg@ieee.org> wrote:
>>> TWIMC,
>>>
>>> I've played a little bit with different optimization options in flambda
>>> 4.04, and finally, all three versions of the loop: curried, uncurried, and
>>> the for-loop, have the same performance, though they still loose about 30%
>>> to the C version, due to tagging.
>>>
>>> Basically, this means, that flambda was able to get rid of the allocation. I
>>> don't actually know which of the options finally made the difference, but
>>> this is how I compiled it.
>>>
>>> ocamlopt.opt -c -S -inlining-report -unbox-closures -O3 -rounds 8
>>> -inline-max-depth 256 -inline-max-unroll 1024 -o loop.cmx loop.ml
>>> ocamlopt.opt loop.cmx -o loop.native
>>>
>>>
>>> Regards,
>>> Ivan
>>>
>>>
>>>
>>>
>>> On Tue, Jul 11, 2017 at 8:54 AM, Simon Cruanes <simon.cruanes.2007@m4x.org>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Iterators in OCaml have been the topic of many discussions. Another
>>>> option for fast iterators is https://github.com/c-cube/sequence ,
>>>> which (with flambda) should compile down to loops and tests on this kind
>>>> of benchmark. With the attached additional file on 4.04.0+flambda,
>>>> I obtain the following (where sequence is test-seq):
>>>>
>>>> $ for i in test-* ; do echo $i ; time ./$i ; done
>>>> test-c_loop
>>>> 5000000100000000
>>>> ./$i  0.08s user 0.00s system 97% cpu 0.085 total
>>>> test-f_loop
>>>> 5000000100000000
>>>> ./$i  0.10s user 0.00s system 96% cpu 0.100 total
>>>> test-loop
>>>> 5000000100000000
>>>> ./$i  0.18s user 0.00s system 97% cpu 0.184 total
>>>> test-seq
>>>> 5000000100000000
>>>> ./$i  0.11s user 0.00s system 97% cpu 0.113 total
>>>> test-stream
>>>> 5000000100000000
>>>> ./$i  0.44s user 0.00s system 98% cpu 0.449 total
>>>>
>>>>
>>>> Note that sequence is imperative underneath, but can be safely used as a
>>>> functional structure.
>>>>
>>>> --
>>>> Simon Cruanes
>>>>
>>>> http://weusepgp.info/
>>>> key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA
>>>> 62B6
>>>
>>>
>>
>> --
>> Caml-list mailing list.  Subscription management and archives:
>> https://sympa.inria.fr/sympa/arc/caml-list
>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>> Bug reports: http://caml.inria.fr/bin/caml-bugs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Optimizing pure-functional streams
  2017-07-11 18:15         ` Yotam Barnoy
@ 2017-07-11 18:55           ` Ivan Gotovchits
  0 siblings, 0 replies; 10+ messages in thread
From: Ivan Gotovchits @ 2017-07-11 18:55 UTC (permalink / raw)
  To: Yotam Barnoy; +Cc: Gabriel Scherer, Simon Cruanes, Alexey Egorov, caml users

[-- Attachment #1: Type: text/plain, Size: 5843 bytes --]

Yep, moving to boxing increased the time by ten times. However, I found a
strange behavior, that might be a bug, the following program:

open Int64

let (=) = equal
let (+) = add
let (mod) = rem
let ( * ) = mul

let loop high =
  let rec loop i = function
    | t when i > high -> t
    | t when i mod 2L = 0L -> loop (i + 1L) (t + i * 2L)
    | t -> loop (i + 1L) t in
  loop 1L 0L

let high = 1000L * 1000L * 1000L
let _ = Printf.printf "%Ld\n" (loop high)


Didn't compile with the following _tags file:

true : optimize(3), unbox_closures, optimization_rounds(8),
inline_max_depth(8), inline_max_unroll(2)



And terminates with the "Fatal error: exception Stack overflow" message.
With higher values of the inline_max_unroll parameter, it doesn't terminate
on mine machine in reasonable time and space.


The most surprising is not that, but that the offending line is `let (=) =
equal`. Without this line, the program compiles without any issues.

Best,
Ivan

P.S. Actually, I'm playing with flambda and our monads library, that is
highly functorized and abstracted. I reimplemented this loop example using
the State monad and got about 10 times better performance with flambda in
comparison to a compiler without flambda enabled (from 40 seconds to 4
seconds). That makes my happy panda :) It is still 4 times slower than the
regular version, but I'm ready to pay this cost.




On Tue, Jul 11, 2017 at 2:15 PM, Yotam Barnoy <yotambarnoy@gmail.com> wrote:

> > Moving from tagging to boxing does not make much sense to me, because
> the techniques to avoid either are similar and boxing is more
> expensive. It might be the case that the compiler today is better at
> unboxing than untagging (it currently tries to do both of these
> locally), but that would only be because boxing is more expensive and
> thus more effort was spent on the unboxing, but medium-term one could
> hope for a uniform approach to both transformations.
>
> Absolutely. I believe there was some skepticism expressed about the
> need for untagging in the past by some members of the dev team, which
> is why I brought it up. I suggested Int64 only because it's possible
> that Flambda will *currently* do better with it than with tagged
> integers.
>
> On Tue, Jul 11, 2017 at 2:04 PM, Gabriel Scherer
> <gabriel.scherer@gmail.com> wrote:
> > Moving from tagging to boxing does not make much sense to me, because
> > the techniques to avoid either are similar and boxing is more
> > expensive. It might be the case that the compiler today is better at
> > unboxing than untagging (it currently tries to do both of these
> > locally), but that would only be because boxing is more expensive and
> > thus more effort was spent on the unboxing, but medium-term one could
> > hope for a uniform approach to both transformations.
> >
> >
> > On Tue, Jul 11, 2017 at 1:46 PM, Yotam Barnoy <yotambarnoy@gmail.com>
> wrote:
> >>> I've played a little bit with different optimization options in
> flambda 4.04, and finally, all three versions of the loop: curried,
> uncurried, and the for-loop, have the same performance, though they still
> loose about 30% to the C version, due to tagging.
> >>
> >> Would it perhaps make sense to try Int64 in order to avoid the tagging?
> >>
> >> Also, I believe it would make sense to have Flambda try to switch to
> >> an untagged type in tight loops to avoid this tagging cost.
> >>
> >> On Tue, Jul 11, 2017 at 1:37 PM, Ivan Gotovchits <ivg@ieee.org> wrote:
> >>> TWIMC,
> >>>
> >>> I've played a little bit with different optimization options in flambda
> >>> 4.04, and finally, all three versions of the loop: curried, uncurried,
> and
> >>> the for-loop, have the same performance, though they still loose about
> 30%
> >>> to the C version, due to tagging.
> >>>
> >>> Basically, this means, that flambda was able to get rid of the
> allocation. I
> >>> don't actually know which of the options finally made the difference,
> but
> >>> this is how I compiled it.
> >>>
> >>> ocamlopt.opt -c -S -inlining-report -unbox-closures -O3 -rounds 8
> >>> -inline-max-depth 256 -inline-max-unroll 1024 -o loop.cmx loop.ml
> >>> ocamlopt.opt loop.cmx -o loop.native
> >>>
> >>>
> >>> Regards,
> >>> Ivan
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Jul 11, 2017 at 8:54 AM, Simon Cruanes <
> simon.cruanes.2007@m4x.org>
> >>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> Iterators in OCaml have been the topic of many discussions. Another
> >>>> option for fast iterators is https://github.com/c-cube/sequence ,
> >>>> which (with flambda) should compile down to loops and tests on this
> kind
> >>>> of benchmark. With the attached additional file on 4.04.0+flambda,
> >>>> I obtain the following (where sequence is test-seq):
> >>>>
> >>>> $ for i in test-* ; do echo $i ; time ./$i ; done
> >>>> test-c_loop
> >>>> 5000000100000000
> >>>> ./$i  0.08s user 0.00s system 97% cpu 0.085 total
> >>>> test-f_loop
> >>>> 5000000100000000
> >>>> ./$i  0.10s user 0.00s system 96% cpu 0.100 total
> >>>> test-loop
> >>>> 5000000100000000
> >>>> ./$i  0.18s user 0.00s system 97% cpu 0.184 total
> >>>> test-seq
> >>>> 5000000100000000
> >>>> ./$i  0.11s user 0.00s system 97% cpu 0.113 total
> >>>> test-stream
> >>>> 5000000100000000
> >>>> ./$i  0.44s user 0.00s system 98% cpu 0.449 total
> >>>>
> >>>>
> >>>> Note that sequence is imperative underneath, but can be safely used
> as a
> >>>> functional structure.
> >>>>
> >>>> --
> >>>> Simon Cruanes
> >>>>
> >>>> http://weusepgp.info/
> >>>> key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08
> 49AA
> >>>> 62B6
> >>>
> >>>
> >>
> >> --
> >> Caml-list mailing list.  Subscription management and archives:
> >> https://sympa.inria.fr/sympa/arc/caml-list
> >> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> >> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

[-- Attachment #2: Type: text/html, Size: 9506 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-07-11 18:55 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-10 20:26 [Caml-list] Optimizing pure-functional streams Alexey Egorov
2017-07-10 21:29 ` Ivan Gotovchits
2017-07-10 22:28   ` Alexey Egorov
2017-07-11 12:40     ` Ivan Gotovchits
2017-07-11 12:54 ` Simon Cruanes
2017-07-11 17:37   ` Ivan Gotovchits
2017-07-11 17:46     ` Yotam Barnoy
2017-07-11 18:04       ` Gabriel Scherer
2017-07-11 18:15         ` Yotam Barnoy
2017-07-11 18:55           ` Ivan Gotovchits

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).