Yep, moving to boxing increased the time by ten times. However, I found a strange behavior, that might be a bug, the following program: open Int64 let (=) = equal let (+) = add let (mod) = rem let ( * ) = mul let loop high = let rec loop i = function | t when i > high -> t | t when i mod 2L = 0L -> loop (i + 1L) (t + i * 2L) | t -> loop (i + 1L) t in loop 1L 0L let high = 1000L * 1000L * 1000L let _ = Printf.printf "%Ld\n" (loop high) Didn't compile with the following _tags file: true : optimize(3), unbox_closures, optimization_rounds(8), inline_max_depth(8), inline_max_unroll(2) And terminates with the "Fatal error: exception Stack overflow" message. With higher values of the inline_max_unroll parameter, it doesn't terminate on mine machine in reasonable time and space. The most surprising is not that, but that the offending line is `let (=) = equal`. Without this line, the program compiles without any issues. Best, Ivan P.S. Actually, I'm playing with flambda and our monads library, that is highly functorized and abstracted. I reimplemented this loop example using the State monad and got about 10 times better performance with flambda in comparison to a compiler without flambda enabled (from 40 seconds to 4 seconds). That makes my happy panda :) It is still 4 times slower than the regular version, but I'm ready to pay this cost. On Tue, Jul 11, 2017 at 2:15 PM, Yotam Barnoy wrote: > > Moving from tagging to boxing does not make much sense to me, because > the techniques to avoid either are similar and boxing is more > expensive. It might be the case that the compiler today is better at > unboxing than untagging (it currently tries to do both of these > locally), but that would only be because boxing is more expensive and > thus more effort was spent on the unboxing, but medium-term one could > hope for a uniform approach to both transformations. > > Absolutely. I believe there was some skepticism expressed about the > need for untagging in the past by some members of the dev team, which > is why I brought it up. I suggested Int64 only because it's possible > that Flambda will *currently* do better with it than with tagged > integers. > > On Tue, Jul 11, 2017 at 2:04 PM, Gabriel Scherer > wrote: > > Moving from tagging to boxing does not make much sense to me, because > > the techniques to avoid either are similar and boxing is more > > expensive. It might be the case that the compiler today is better at > > unboxing than untagging (it currently tries to do both of these > > locally), but that would only be because boxing is more expensive and > > thus more effort was spent on the unboxing, but medium-term one could > > hope for a uniform approach to both transformations. > > > > > > On Tue, Jul 11, 2017 at 1:46 PM, Yotam Barnoy > wrote: > >>> I've played a little bit with different optimization options in > flambda 4.04, and finally, all three versions of the loop: curried, > uncurried, and the for-loop, have the same performance, though they still > loose about 30% to the C version, due to tagging. > >> > >> Would it perhaps make sense to try Int64 in order to avoid the tagging? > >> > >> Also, I believe it would make sense to have Flambda try to switch to > >> an untagged type in tight loops to avoid this tagging cost. > >> > >> On Tue, Jul 11, 2017 at 1:37 PM, Ivan Gotovchits wrote: > >>> TWIMC, > >>> > >>> I've played a little bit with different optimization options in flambda > >>> 4.04, and finally, all three versions of the loop: curried, uncurried, > and > >>> the for-loop, have the same performance, though they still loose about > 30% > >>> to the C version, due to tagging. > >>> > >>> Basically, this means, that flambda was able to get rid of the > allocation. I > >>> don't actually know which of the options finally made the difference, > but > >>> this is how I compiled it. > >>> > >>> ocamlopt.opt -c -S -inlining-report -unbox-closures -O3 -rounds 8 > >>> -inline-max-depth 256 -inline-max-unroll 1024 -o loop.cmx loop.ml > >>> ocamlopt.opt loop.cmx -o loop.native > >>> > >>> > >>> Regards, > >>> Ivan > >>> > >>> > >>> > >>> > >>> On Tue, Jul 11, 2017 at 8:54 AM, Simon Cruanes < > simon.cruanes.2007@m4x.org> > >>> wrote: > >>>> > >>>> Hello, > >>>> > >>>> Iterators in OCaml have been the topic of many discussions. Another > >>>> option for fast iterators is https://github.com/c-cube/sequence , > >>>> which (with flambda) should compile down to loops and tests on this > kind > >>>> of benchmark. With the attached additional file on 4.04.0+flambda, > >>>> I obtain the following (where sequence is test-seq): > >>>> > >>>> $ for i in test-* ; do echo $i ; time ./$i ; done > >>>> test-c_loop > >>>> 5000000100000000 > >>>> ./$i 0.08s user 0.00s system 97% cpu 0.085 total > >>>> test-f_loop > >>>> 5000000100000000 > >>>> ./$i 0.10s user 0.00s system 96% cpu 0.100 total > >>>> test-loop > >>>> 5000000100000000 > >>>> ./$i 0.18s user 0.00s system 97% cpu 0.184 total > >>>> test-seq > >>>> 5000000100000000 > >>>> ./$i 0.11s user 0.00s system 97% cpu 0.113 total > >>>> test-stream > >>>> 5000000100000000 > >>>> ./$i 0.44s user 0.00s system 98% cpu 0.449 total > >>>> > >>>> > >>>> Note that sequence is imperative underneath, but can be safely used > as a > >>>> functional structure. > >>>> > >>>> -- > >>>> Simon Cruanes > >>>> > >>>> http://weusepgp.info/ > >>>> key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3 7D8D 4AC0 1D08 > 49AA > >>>> 62B6 > >>> > >>> > >> > >> -- > >> Caml-list mailing list. Subscription management and archives: > >> https://sympa.inria.fr/sympa/arc/caml-list > >> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > >> Bug reports: http://caml.inria.fr/bin/caml-bugs >