On Dec 12, 2010, at 16:55 , Török Edwin wrote:

> On Sun, 12 Dec 2010 14:54:14 -0000
> "Jon Harrop" <jon@ffconsultancy.com> wrote:
> 
>> The Haskell guys got their best performance improvement moving to
>> LLVM from the hailstone benchmark so it is interesting to examine
>> this case as well. I just added support for 64-bit ints to HLVM to
>> implement that benchmark and my code is:
>> 
>> Here’s the equivalent OCaml code:
>> 
>>  let rec collatzLen(c, n) : int =
>>    if n = 1L then c else
>>      collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else
> 
> OK, but boxing itself has nothing to do with the performance degration
> here. It is the lack of compiler optimizations on the Int64 type. This
> could be solved by implementing compiler optimizations for it (or
> manually optimizing some integer arithmetic that is slow).
> 
> Lets run the code under a profiler, or look at the assembly (I used
> 'perf record' and 'perf report'). 2 'idiv' instructions turn up as top
> offenders in the profile.
> 
> Problem #1: Int64.rem n 2 -> another idiv instruction
> 
> A C compiler would optimize this to an 'and' instruction.
> Change that to 'Int64.logand n 1L = 0L'/
> 
> Problem #2: Int64.div n 2 -> idiv instruction. 
> 
> A C compiler would optimize this to a right shift. Changing that to 'Int64.shift_right n 1' speeds
> up the code.

This is easy to fix in ocamlopt (see attached patch ocamlopt-natint.patch), by applying the same optimizations already used for constant int's to constant natint's (Int64 is Nativeint on 64bit). Note however, that "mod 2" is not really "and 1", neither is "div 2" equivalent to "lsr 1"; that would be the case for unsigned arithmetic (doesn't matter in your example, tho).

I don't see the point of optimizing for x86-32 (neither would I spend my time optimizing anything for VAX these days), but it would be possible to add appropriate cases for Int64 on x86 as well (regalloc would be most difficult here, since that requires support for register pairs to perform 64bit arithmetic).

>> 1. Unboxing can give huge performance improvements on serial code,
> 
> s/Unboxing/arithmetic optimizations/
> Please find an example where the performance benefit is due to
> unboxing, and not due to arithmetic optimizations performed on the
> unboxed code.

The boxing involved is relevant, but boxing in general is not the issue. In this special case, the "let nlen, n = if..." code requires heap allocation, because of the way the pattern is compiled. This could be fixed by moving the condition out of the code and using two if's to select n/nlen separately (doesn't speed up that much). Fixing the pattern compiler to handle these cases might be interesting for general benefit.

I already mentioned this multiple times, but here we go again: Unboxing optimizations may indeed prove useful if applied wisely (cmmgen.ml is of special interest here, the unboxing optimizations are more or less special cases; that could be extended to include interesting cases like moving boxing out of if-then-else in return position, etc).

But (here comes the special "Harrop note") this has absolutely nothing to do with LLVM (and of course really, really, really nothing to do with HLVM). Using a different data representation for the heap requires a nearly complete rewrite of the OCaml system (you would probably need to start at the Typing level); if one wants to do this, enjoy and come back with code. But even then, data representation issues will have to be considered long before it comes to actual code generation (if you are serious, you'll have to think about the runtime first prior to talking about code generation for a non-existing runtime), so even then it has nothing to do with LLVM (or C-- or C or whatever IR you can think of).

Combining alloc's across if-then-else constructs further reduces code size in your example (and probably other programs as well), see attached file ocamlopt-comballoc-ifthenelse.patch. It's quick&dirty, but it should illustrate the idea.

>> let alone parallel code. The optimized HLVM is running 32× faster
>> than the OCaml here.
>> 
>> 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it
>> to beat GCC-compiled C code here.
>> 
> 
> One advantage of using LLVM is that it would notice arithmetic
> optimizations like this and perform it itself (even if you use the
> boxed representation).

In case of x86-32, it won't, simply because LLVM will be presented with the calls to caml_int32_* functions. You'd need to change the Cmm code instead (changing the low-level stuff is straight-forward as demonstrated). For 64bit targets, see attached patch.

This doesn't mean that LLVM wouldn't be useful (in fact, I've just started an LLVM backend for ocamlopt). But it is important to note that LLVM is not the solution to everything. As the name implies, it's "low level", it does a few "higher level" optimizations for C, but these are special cases (and somewhat ugly if you take the time to read the code). It won't make a faster OCaml magically, just like it didn't make a faster Haskell by itself.

I could go on by quoting common "Harrop jokes" like "you need types in the low-level IR", etc. trying to tell him that this is simply wrong; but after reading through the Caml/LISP mailing list archives (thanks for the pointers by several readers), I'm pretty much convinced that Jon simply decided to end his war against LISP just to start a new one against ocamlopt.

If anyone is serious about ocamlopt with LLVM, feel free to contact me (tho, my time is limited atm).

> Best regards,
> --Edwin

greets,
Benedikt