Yep, moving to boxing made it about ten times slower. However, I found some
strange behavior that might be a bug. The following program:

open Int64

let (=) = equal
let (+) = add
let (mod) = rem
let ( * ) = mul

let loop high =
  let rec loop i = function
    | t when i > high -> t
    | t when i mod 2L = 0L -> loop (i + 1L) (t + i * 2L)
    | t -> loop (i + 1L) t in
  loop 1L 0L

let high = 1000L * 1000L * 1000L
let _ = Printf.printf "%Ld\n" (loop high)


doesn't compile with the following _tags file:

true : optimize(3), unbox_closures, optimization_rounds(8),
inline_max_depth(8), inline_max_unroll(2)


Compilation terminates with the "Fatal error: exception Stack overflow" message.
With higher values of the inline_max_unroll parameter, it doesn't terminate
on my machine within reasonable time and space.


The most surprising part is not that, but that the offending line is
`let (=) = equal`. Without this line, the program compiles without any issues.
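
For anyone who wants to experiment with this, here is a minimal sketch of the
same loop that calls Int64.equal directly instead of rebinding (=). The name
sum_even_doubled and the small bound are only illustrative, and whether this
variant avoids the stack overflow under the _tags above is an assumption that
has not been checked.

```ocaml
(* Same computation as the program above: add 2*i to the total for every
   even i in 1..high. Int64 functions are called explicitly, so no
   operator is shadowed. *)
let sum_even_doubled high =
  let rec go i t =
    if Int64.compare i high > 0 then t
    else if Int64.equal (Int64.rem i 2L) 0L then
      go (Int64.add i 1L) (Int64.add t (Int64.mul i 2L))
    else go (Int64.add i 1L) t
  in
  go 1L 0L

(* Small bound for a quick check; the original used 1000L * 1000L * 1000L. *)
let () = Printf.printf "%Ld\n" (sum_even_doubled 10L)
```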

Best,
Ivan

P.S. Actually, I'm playing with flambda and our monads library, which is
highly functorized and abstracted. I reimplemented this loop example using
the State monad and got about 10 times better performance with flambda
compared to a compiler without flambda enabled (from 40 seconds to 4
seconds). That makes me a happy panda :) It is still 4 times slower than the
regular version, but I'm ready to pay this cost.
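
To give an idea of the shape of such code, here is a rough sketch of the loop
written against a tiny hand-rolled state monad. This is not the API of the
monads library mentioned above (that one is functorized and far more
general). The State module, its functions, and sum_even_doubled_state are all
hypothetical names made up for this illustration, and plain ints are used
instead of Int64 to keep it short.

```ocaml
(* A minimal state monad: a computation is a function from a state to a
   result paired with the new state. *)
module State = struct
  type ('a, 's) t = 's -> 'a * 's
  let return x = fun s -> (x, s)
  let bind m f = fun s -> let (x, s') = m s in f x s'
  let get = fun s -> (s, s)
  let put s = fun _ -> ((), s)
  (* Run a computation and keep only the final state. *)
  let exec m s = snd (m s)
end

let ( >>= ) = State.bind

(* The running total lives in the monadic state; the counter i is an
   ordinary argument. *)
let sum_even_doubled_state high =
  let rec go i =
    if i > high then State.return ()
    else if i mod 2 = 0 then
      State.get >>= fun t ->
      State.put (t + i * 2) >>= fun () ->
      go (i + 1)
    else go (i + 1)
  in
  State.exec (go 1) 0

let () = Printf.printf "%d\n" (sum_even_doubled_state 10)
```

Every iteration goes through closures built by bind, which is the kind of
overhead that inlining has a chance to remove.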


On Tue, Jul 11, 2017 at 2:15 PM, Yotam Barnoy <yotambarnoy@gmail.com> wrote:

> > Moving from tagging to boxing does not make much sense to me, because
> > the techniques to avoid either are similar and boxing is more
> > expensive. It might be the case that the compiler today is better at
> > unboxing than untagging (it currently tries to do both of these
> > locally), but that would only be because boxing is more expensive and
> > thus more effort was spent on the unboxing, but medium-term one could
> > hope for a uniform approach to both transformations.
>
> Absolutely. I believe there was some skepticism expressed about the
> need for untagging in the past by some members of the dev team, which
> is why I brought it up. I suggested Int64 only because it's possible
> that Flambda will *currently* do better with it than with tagged
> integers.
>
> On Tue, Jul 11, 2017 at 2:04 PM, Gabriel Scherer
> <gabriel.scherer@gmail.com> wrote:
> > Moving from tagging to boxing does not make much sense to me, because
> > the techniques to avoid either are similar and boxing is more
> > expensive. It might be the case that the compiler today is better at
> > unboxing than untagging (it currently tries to do both of these
> > locally), but that would only be because boxing is more expensive and
> > thus more effort was spent on the unboxing, but medium-term one could
> > hope for a uniform approach to both transformations.
> >
> >
> > On Tue, Jul 11, 2017 at 1:46 PM, Yotam Barnoy <yotambarnoy@gmail.com> wrote:
> >>> I've played a little bit with different optimization options in
> >>> flambda 4.04, and finally, all three versions of the loop: curried,
> >>> uncurried, and the for-loop, have the same performance, though they
> >>> still lose about 30% to the C version, due to tagging.
> >>
> >> Would it perhaps make sense to try Int64 in order to avoid the tagging?
> >>
> >> Also, I believe it would make sense to have Flambda try to switch to
> >> an untagged type in tight loops to avoid this tagging cost.
> >>
> >> On Tue, Jul 11, 2017 at 1:37 PM, Ivan Gotovchits <ivg@ieee.org> wrote:
> >>> TWIMC,
> >>>
> >>> I've played a little bit with different optimization options in flambda
> >>> 4.04, and finally, all three versions of the loop: curried, uncurried, and
> >>> the for-loop, have the same performance, though they still lose about 30%
> >>> to the C version, due to tagging.
> >>>
> >>> Basically, this means that flambda was able to get rid of the allocation. I
> >>> don't actually know which of the options finally made the difference, but
> >>> this is how I compiled it.
> >>>
> >>> ocamlopt.opt -c -S -inlining-report -unbox-closures -O3 -rounds 8
> >>> -inline-max-depth 256 -inline-max-unroll 1024 -o loop.cmx loop.ml
> >>> ocamlopt.opt loop.cmx -o loop.native
> >>>
> >>>
> >>> Regards,
> >>> Ivan
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Jul 11, 2017 at 8:54 AM, Simon Cruanes <simon.cruanes.2007@m4x.org>
> >>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> Iterators in OCaml have been the topic of many discussions. Another
> >>>> option for fast iterators is https://github.com/c-cube/sequence ,
> >>>> which (with flambda) should compile down to loops and tests on this kind
> >>>> of benchmark. With the attached additional file on 4.04.0+flambda,
> >>>> I obtain the following (where sequence is test-seq):
> >>>>
> >>>> $ for i in test-* ; do echo $i ; time ./$i ; done
> >>>> test-c_loop
> >>>> 5000000100000000
> >>>> ./$i  0.08s user 0.00s system 97% cpu 0.085 total
> >>>> test-f_loop
> >>>> 5000000100000000
> >>>> ./$i  0.10s user 0.00s system 96% cpu 0.100 total
> >>>> test-loop
> >>>> 5000000100000000
> >>>> ./$i  0.18s user 0.00s system 97% cpu 0.184 total
> >>>> test-seq
> >>>> 5000000100000000
> >>>> ./$i  0.11s user 0.00s system 97% cpu 0.113 total
> >>>> test-stream
> >>>> 5000000100000000
> >>>> ./$i  0.44s user 0.00s system 98% cpu 0.449 total
> >>>>
> >>>>
> >>>> Note that sequence is imperative underneath, but can be safely used as a
> >>>> functional structure.
> >>>>
> >>>> --
> >>>> Simon Cruanes
> >>>>
> >>>> http://weusepgp.info/
> >>>> key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA 62B6
> >>>
> >>>
> >>
> >> --
> >> Caml-list mailing list.  Subscription management and archives:
> >> https://sympa.inria.fr/sympa/arc/caml-list
> >> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> >> Bug reports: http://caml.inria.fr/bin/caml-bugs
>