Yep, moving to boxing increased the running time tenfold. However, I found some strange behavior that might be a bug. The following program:

open Int64

let (=) = equal
let (+) = add
let (mod) = rem
let ( * ) = mul

let loop high =
  let rec loop i = function
    | t when i > high -> t
    | t when i mod 2L = 0L -> loop (i + 1L) (t + i * 2L)
    | t -> loop (i + 1L) t in
  loop 1L 0L

let high = 1000L * 1000L * 1000L
let _ = Printf.printf "%Ld\n" (loop high)

does not compile with the following _tags file:

true : optimize(3), unbox_closures, optimization_rounds(8), inline_max_depth(8), inline_max_unroll(2)


Instead, the compilation terminates with a "Fatal error: exception Stack overflow" message. With higher values of the inline_max_unroll parameter, it does not terminate on my machine in reasonable time and space.
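
For the record, I believe these tags translate roughly into the following direct invocation (the same flags as in my earlier message, only with the smaller depth and unroll values); I haven't double-checked the exact tag-to-flag mapping, so treat it as an approximation:

ocamlopt.opt -c -O3 -unbox-closures -rounds 8 -inline-max-depth 8 -inline-max-unroll 2 -o loop.cmx loop.ml
ocamlopt.opt loop.cmx -o loop.native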


The most surprising part is not the failure itself, but that the offending line is `let (=) = equal`: without this line, the program compiles without any issues.
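
For completeness, here is the variant that compiles for me: the same program with the shadowing of (=) simply removed, so the guard falls back to the polymorphic (=) on the boxed int64 values (one could also call equal explicitly there instead, which I haven't timed):

(* Same program, minus the offending `let (=) = equal`; the guard now
   uses the polymorphic (=), which is correct on boxed int64 values. *)
open Int64

let (+) = add
let (mod) = rem
let ( * ) = mul

let loop high =
  let rec loop i = function
    | t when i > high -> t
    | t when i mod 2L = 0L -> loop (i + 1L) (t + i * 2L)
    | t -> loop (i + 1L) t in
  loop 1L 0L

let high = 1000L * 1000L * 1000L
let _ = Printf.printf "%Ld\n" (loop high)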

Best,
Ivan

P.S. Actually, I'm playing with flambda and our monads library, which is highly functorized and abstracted. I reimplemented this loop example using the State monad and got about 10 times better performance with flambda compared to a compiler without flambda enabled (from 40 seconds to 4 seconds). That makes me a happy panda :) It is still 4 times slower than the regular version, but I'm ready to pay this cost.
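
To give an idea of what I mean (without pasting our real library, which is much more general), here is a rough self-contained sketch of the State-monad version of the same loop; the toy State module below is only a stand-in, not our actual monads library:

(* a minimal, toy State monad -- a stand-in for the real, functorized library *)
module State = struct
  type ('a, 's) t = 's -> 'a * 's
  let return x s = (x, s)
  let bind m f s = let x, s' = m s in f x s'
  let ( >>= ) = bind
  let get s = (s, s)                 (* read the current state *)
  let put s _ = ((), s)              (* replace the state *)
  let run m init = fst (m init)      (* run and return the result *)
end

(* the same even-sum loop, with the accumulator kept in the state *)
let loop high =
  let open State in
  let open Int64 in
  let rec go i =
    if compare i high > 0 then get
    else if rem i 2L = 0L then
      get >>= fun t ->
      put (add t (mul i 2L)) >>= fun () ->
      go (add i 1L)
    else go (add i 1L)
  in
  run (go 1L) 0L

let () = Printf.printf "%Ld\n" (loop 1000L)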




On Tue, Jul 11, 2017 at 2:15 PM, Yotam Barnoy <yotambarnoy@gmail.com> wrote:
> Moving from tagging to boxing does not make much sense to me, because
> the techniques to avoid either are similar and boxing is more
> expensive. It might be the case that the compiler today is better at
> unboxing than untagging (it currently tries to do both of these
> locally), but that would only be because boxing is more expensive and
> thus more effort was spent on the unboxing, but medium-term one could
> hope for a uniform approach to both transformations.

Absolutely. I believe there was some skepticism expressed about the
need for untagging in the past by some members of the dev team, which
is why I brought it up. I suggested Int64 only because it's possible
that Flambda will *currently* do better with it than with tagged
integers.

On Tue, Jul 11, 2017 at 2:04 PM, Gabriel Scherer
<gabriel.scherer@gmail.com> wrote:
> Moving from tagging to boxing does not make much sense to me, because
> the techniques to avoid either are similar and boxing is more
> expensive. It might be the case that the compiler today is better at
> unboxing than untagging (it currently tries to do both of these
> locally), but that would only be because boxing is more expensive and
> thus more effort was spent on the unboxing, but medium-term one could
> hope for a uniform approach to both transformations.
>
>
> On Tue, Jul 11, 2017 at 1:46 PM, Yotam Barnoy <yotambarnoy@gmail.com> wrote:
>>> I've played a little bit with different optimization options in flambda 4.04, and finally, all three versions of the loop: curried, uncurried, and the for-loop, have the same performance, though they still lose about 30% to the C version, due to tagging.
>>
>> Would it perhaps make sense to try Int64 in order to avoid the tagging?
>>
>> Also, I believe it would make sense to have Flambda try to switch to
>> an untagged type in tight loops to avoid this tagging cost.
>>
>> On Tue, Jul 11, 2017 at 1:37 PM, Ivan Gotovchits <ivg@ieee.org> wrote:
>>> TWIMC,
>>>
>>> I've played a little bit with different optimization options in flambda
>>> 4.04, and finally, all three versions of the loop: curried, uncurried, and
>>> the for-loop, have the same performance, though they still lose about 30%
>>> to the C version, due to tagging.
>>>
>>> Basically, this means that flambda was able to get rid of the allocation. I
>>> don't actually know which of the options finally made the difference, but
>>> this is how I compiled it.
>>>
>>> ocamlopt.opt -c -S -inlining-report -unbox-closures -O3 -rounds 8
>>> -inline-max-depth 256 -inline-max-unroll 1024 -o loop.cmx loop.ml
>>> ocamlopt.opt loop.cmx -o loop.native
>>>
>>>
>>> Regards,
>>> Ivan
>>>
>>>
>>>
>>>
>>> On Tue, Jul 11, 2017 at 8:54 AM, Simon Cruanes <simon.cruanes.2007@m4x.org>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Iterators in OCaml have been the topic of many discussions. Another
>>>> option for fast iterators is https://github.com/c-cube/sequence ,
>>>> which (with flambda) should compile down to loops and tests on this kind
>>>> of benchmark. With the attached additional file on 4.04.0+flambda,
>>>> I obtain the following (where sequence is test-seq):
>>>>
>>>> $ for i in test-* ; do echo $i ; time ./$i ; done
>>>> test-c_loop
>>>> 5000000100000000
>>>> ./$i  0.08s user 0.00s system 97% cpu 0.085 total
>>>> test-f_loop
>>>> 5000000100000000
>>>> ./$i  0.10s user 0.00s system 96% cpu 0.100 total
>>>> test-loop
>>>> 5000000100000000
>>>> ./$i  0.18s user 0.00s system 97% cpu 0.184 total
>>>> test-seq
>>>> 5000000100000000
>>>> ./$i  0.11s user 0.00s system 97% cpu 0.113 total
>>>> test-stream
>>>> 5000000100000000
>>>> ./$i  0.44s user 0.00s system 98% cpu 0.449 total
>>>>
>>>>
>>>> Note that sequence is imperative underneath, but can be safely used as a
>>>> functional structure.
>>>>
>>>> --
>>>> Simon Cruanes
>>>>
>>>> http://weusepgp.info/
>>>> key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA
>>>> 62B6
>>>
>>>
>>
>> --
>> Caml-list mailing list.  Subscription management and archives:
>> https://sympa.inria.fr/sympa/arc/caml-list
>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>> Bug reports: http://caml.inria.fr/bin/caml-bugs