Value types (Was: [Caml-list] ocamlopt LLVM support)

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

* Value types (Was: [Caml-list] ocamlopt LLVM support)
@ 2010-12-12 14:54 Jon Harrop
  2010-12-12 15:55 ` Török Edwin
  2010-12-12 19:53 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Brian Hurt
  0 siblings, 2 replies; 21+ messages in thread
From: Jon Harrop @ 2010-12-12 14:54 UTC (permalink / raw)
  To: caml-list

The Haskell guys got their best performance improvement moving to LLVM from
the hailstone benchmark so it is interesting to examine this case as well. I
just added support for 64-bit ints to HLVM to implement that benchmark and
my code is:

  let rec collatzLen ((c, n): int * int64) : int =
    if n = 1L then c else
      collatzLen (c+1, if n % 2L = 0L then n / 2L else 3L * n + 1L);;

  let rec loop ((i, (nlen, n)): int64 * (int * int64)) : int64 =
    if i = 1L then n else
      let ilen = collatzLen(1, i) in
      let nlen, n = if ilen > nlen then ilen, i else nlen, n in
      loop (i-1L, (nlen, n));;

When run without using LLVM's optimization passes, this produces the
following output for 10k, 100k and 1M, respectively:

  - : `Int64 = 6171L
  Live: 0
  0.00704098s total; 0s suspend; 0s mark; 0s sweep
  - : `Int64 = 77031L
  Live: 0
  0.087815s total; 0s suspend; 0s mark; 0s sweep
  - : `Int64 = 837799L
  Live: 0
  1.06907s total; 0s suspend; 0s mark; 0s sweep

With LLVM's default optimization passes enabled, I get:

  - : `Int64 = 6171L
  Live: 0
  0.00508595s total; 0s suspend; 0s mark; 0s sweep
  - : `Int64 = 77031L
  Live: 0
  0.0626569s total; 0s suspend; 0s mark; 0s sweep
  - : `Int64 = 837799L
  Live: 0
  0.759738s total; 0s suspend; 0s mark; 0s sweep

Note the ~30% performance improvement in this case. In other cases, LLVM's
default optimization passes can degrade performance significantly.

Here’s the equivalent OCaml code:

  let rec collatzLen(c, n) : int =
    if n = 1L then c else
      collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else
Int64.add (Int64.mul 3L n) 1L);;

  let rec loop(i, (nlen, n)) =
    if i = 1L then n else
      let ilen = collatzLen(1, i) in
      let nlen, n = if ilen > nlen then ilen, i else nlen, n in
      loop (Int64.sub i 1L, (nlen, n));;

and Haskell:

  import Data.Int

  collatzLen :: Int -> Int64 -> Int
  collatzLen c 1 = c
  collatzLen c n = collatzLen (c+1) $ if n `mod` 2 == 0 then n `div` 2 else
3*n+1

  pmax x n = x `max` (collatzLen 1 n, n)

  main = print $ foldl pmax (1,1) [2..1000000]

and C99:

  #include <stdio.h>

  int collatzLen(int c, long long n) {
    return (n==1 ? c : collatzLen(c+1, (n%2==0 ? n/2 : 3*n+1)));
  }

  long long loop(long long m) {
    int nlen=1;
    long long n = 1;
    for (long long i=2; i<=m; ++i) {
      const int ilen = collatzLen(1, i);
      if (ilen > nlen) {
        nlen = ilen;
        n = i;
      }
    }
    return n;
  }

  int main() {
    printf("%lld", loop(1000000));
  }

And here are my timings:

OCaml:      24.3s   ocamlopt
Haskell:    24.0s   ghc6.10 --make -O2
F#.NET:      3.45s  fsc --optimize+ --platform:x86
C:           1.20s  gcc -O3 -std=c99
HLVM:        1.07s  ./repl --compile
HLVM (opt):  0.76s  opt -tailcallelim -std-compile-opts

These results really demonstrate two things:

1. Unboxing can give huge performance improvements on serial code, let alone
parallel code. The optimized HLVM is running 32× faster than the OCaml here.

2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it to beat
GCC-compiled C code here.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 14:54 Value types (Was: [Caml-list] ocamlopt LLVM support) Jon Harrop
@ 2010-12-12 15:55 ` Török Edwin
  2010-12-12 17:14   ` Jon Harrop
                     ` (2 more replies)
  2010-12-12 19:53 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Brian Hurt
  1 sibling, 3 replies; 21+ messages in thread
From: Török Edwin @ 2010-12-12 15:55 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

On Sun, 12 Dec 2010 14:54:14 -0000
"Jon Harrop" <jon@ffconsultancy.com> wrote:

> The Haskell guys got their best performance improvement moving to
> LLVM from the hailstone benchmark so it is interesting to examine
> this case as well. I just added support for 64-bit ints to HLVM to
> implement that benchmark and my code is:
> 
> Here’s the equivalent OCaml code:
> 
>   let rec collatzLen(c, n) : int =
>     if n = 1L then c else
>       collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else

OK, but boxing itself has nothing to do with the performance degration
here. It is the lack of compiler optimizations on the Int64 type. This
could be solved by implementing compiler optimizations for it (or
manually optimizing some integer arithmetic that is slow).

Lets run the code under a profiler, or look at the assembly (I used
'perf record' and 'perf report'). 2 'idiv' instructions turn up as top
offenders in the profile.

Problem #1: Int64.rem n 2 -> another idiv instruction

A C compiler would optimize this to an 'and' instruction.
Change that to 'Int64.logand n 1L = 0L'/

Problem #2: Int64.div n 2 -> idiv instruction. 

A C compiler would optimize this to a right shift. Changing that to 'Int64.shift_right n 1' speeds
up the code.

With these changes I get almost the same speed as the C code:
$ ocamlopt x.ml -o x && time ./x
837799
real    0m0.664s
user    0m0.667s
sys     0m0.000s

$ gcc -O3 x.c && time ./a.out
837799
real    0m0.635s
user    0m0.633s
sys     0m0.000s

Here's the OCaml code:
let rec collatzLen(c, n) : int =
    if n = 1L then c else
      collatzLen (c+1, if Int64.logand n 1L = 0L then Int64.shift_right
n 1 else Int64.add (Int64.mul 3L n) 1L);;

  let rec loop(i, (nlen, n)) =
    if i = 1L then n else
      let ilen = collatzLen(1, i) in
      let nlen, n = if ilen > nlen then ilen, i else nlen, n in
      loop (Int64.sub i 1L, (nlen, n));;

  let _ =
      let s = loop (1000000L, (1,1000000L)) in
      print_int (Int64.to_int s);;

> 1. Unboxing can give huge performance improvements on serial code,

s/Unboxing/arithmetic optimizations/
Please find an example where the performance benefit is due to
unboxing, and not due to arithmetic optimizations performed on the
unboxed code.

> let alone parallel code. The optimized HLVM is running 32× faster
> than the OCaml here.
> 
> 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it
> to beat GCC-compiled C code here.
> 

One advantage of using LLVM is that it would notice arithmetic
optimizations like this and perform it itself (even if you use the
boxed representation).

Best regards,
--Edwin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 15:55 ` Török Edwin
@ 2010-12-12 17:14   ` Jon Harrop
  2010-12-12 17:26     ` Török Edwin
  2010-12-12 19:09   ` Benedikt Meurer
  2010-12-14  9:54   ` Value types Goswin von Brederlow
  2 siblings, 1 reply; 21+ messages in thread
From: Jon Harrop @ 2010-12-12 17:14 UTC (permalink / raw)
  To: 'Török Edwin'; +Cc: caml-list

Török Edwin wrote:
> Problem #1: Int64.rem n 2 -> another idiv instruction
> 
> A C compiler would optimize this to an 'and' instruction.
> Change that to 'Int64.logand n 1L = 0L'/

Yes. LLVM did that for me.

> Problem #2: Int64.div n 2 -> idiv instruction.
> 
> A C compiler would optimize this to a right shift. Changing that to
> 'Int64.shift_right n 1' speeds
> up the code.

Yes. LLVM also did that for me. In fact, I have been bitten by ocamlopt not optimizing div and mod by a constant in real OCaml code before. This problem also turns up in the context of hash table implementations where you want to % by the length of the spine.

> With these changes I get almost the same speed as the C code:
> $ ocamlopt x.ml -o x && time ./x
> 837799
> real    0m0.664s
> user    0m0.667s
> sys     0m0.000s
> 
> $ gcc -O3 x.c && time ./a.out
> 837799
> real    0m0.635s
> user    0m0.633s
> sys     0m0.000s
> 
> Here's the OCaml code:
> let rec collatzLen(c, n) : int =
>     if n = 1L then c else
>       collatzLen (c+1, if Int64.logand n 1L = 0L then Int64.shift_right
> n 1 else Int64.add (Int64.mul 3L n) 1L);;
> 
>   let rec loop(i, (nlen, n)) =
>     if i = 1L then n else
>       let ilen = collatzLen(1, i) in
>       let nlen, n = if ilen > nlen then ilen, i else nlen, n in
>       loop (Int64.sub i 1L, (nlen, n));;
> 
>   let _ =
>       let s = loop (1000000L, (1,1000000L)) in
>       print_int (Int64.to_int s);;

I am unable to reproduce your results. Here, the time falls from 24s to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still 26× slower than HLVM.

> > 1. Unboxing can give huge performance improvements on serial code,
> 
> s/Unboxing/arithmetic optimizations/
> Please find an example where the performance benefit is due to
> unboxing, and not due to arithmetic optimizations performed on the
> unboxed code.

The last example I gave (array of key-value pairs) demonstrates some of the performance improvements offered by unboxing in the heap (12.3× faster than OCaml in that case). I'm still not sure that this example is invalid because I cannot reproduce your results.

> > let alone parallel code. The optimized HLVM is running 32× faster
> > than the OCaml here.
> >
> > 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it
> > to beat GCC-compiled C code here.
> 
> One advantage of using LLVM is that it would notice arithmetic
> optimizations like this and perform it itself (even if you use the
> boxed representation).

Yes. LLVM hopefully optimizes div/mod by any constant which is quite tricky in the general case.

Cheers,
Jon.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 17:14   ` Jon Harrop
@ 2010-12-12 17:26     ` Török Edwin
  2010-12-12 18:01       ` Jon Harrop
  0 siblings, 1 reply; 21+ messages in thread
From: Török Edwin @ 2010-12-12 17:26 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 2265 bytes --]

On Sun, 12 Dec 2010 17:14:45 -0000
"Jon Harrop" <jon@ffconsultancy.com> wrote:

> Török Edwin wrote:
> > Problem #1: Int64.rem n 2 -> another idiv instruction
> > 
> > A C compiler would optimize this to an 'and' instruction.
> > Change that to 'Int64.logand n 1L = 0L'/
> 
> Yes. LLVM did that for me.
> 
> > Problem #2: Int64.div n 2 -> idiv instruction.
> > 
> > A C compiler would optimize this to a right shift. Changing that to
> > 'Int64.shift_right n 1' speeds
> > up the code.
> 
> Yes. LLVM also did that for me. In fact, I have been bitten by
> ocamlopt not optimizing div and mod by a constant in real OCaml code
> before. This problem also turns up in the context of hash table
> implementations where you want to % by the length of the spine.

Do you really need to use Int64 for that though? Won't the 63-bit
version do?

> 
> > With these changes I get almost the same speed as the C code:
> > $ ocamlopt x.ml -o x && time ./x
> > 837799
> > real    0m0.664s
> > user    0m0.667s
> > sys     0m0.000s
> > 
> > $ gcc -O3 x.c && time ./a.out
> > 837799
> > real    0m0.635s
> > user    0m0.633s
> > sys     0m0.000s
> > 
> > Here's the OCaml code:
> > let rec collatzLen(c, n) : int =
> >     if n = 1L then c else
> >       collatzLen (c+1, if Int64.logand n 1L = 0L then
> > Int64.shift_right n 1 else Int64.add (Int64.mul 3L n) 1L);;
> > 
> >   let rec loop(i, (nlen, n)) =
> >     if i = 1L then n else
> >       let ilen = collatzLen(1, i) in
> >       let nlen, n = if ilen > nlen then ilen, i else nlen, n in
> >       loop (Int64.sub i 1L, (nlen, n));;
> > 
> >   let _ =
> >       let s = loop (1000000L, (1,1000000L)) in
> >       print_int (Int64.to_int s);;
> 
> I am unable to reproduce your results. Here, the time falls from 24s
> to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still 26×
> slower than HLVM.

Do you still have 'idiv' in the compiled code? See my attached
assembly, and compare it with yours please.
I was doing the test on 64-bit, with ocamlopt 3.11.2 and 3.12.0.

FWIW the original code took 2.8 seconds here, so only 4x slower (this
is an AMD Phenom II x6 1090T CPU). It probably depends how fast/slow
the 'idiv' is on your CPU.

--Edwin

[-- Attachment #2: x.s --]
[-- Type: application/octet-stream, Size: 4971 bytes --]

	.section        .rodata.cst8,"a",@progbits
	.align	16
caml_negf_mask:	.quad   0x8000000000000000, 0
	.align	16
caml_absf_mask:	.quad   0x7FFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF
	.data
	.globl	camlX__data_begin
camlX__data_begin:
	.text
	.globl	camlX__code_begin
camlX__code_begin:
	.data
	.quad	2048
	.globl	camlX
camlX:
	.space	16
	.data
	.quad	3319
camlX__2:
	.quad	caml_tuplify2
	.quad	-3
	.quad	camlX__loop_1033
	.data
	.quad	3319
camlX__3:
	.quad	caml_tuplify2
	.quad	-3
	.quad	camlX__collatzLen_1030
	.data
	.quad	2048
camlX__1:
	.quad	.L100004
	.quad	.L100005
	.quad	2048
.L100005:
	.quad	3
	.quad	.L100006
	.quad	2303
.L100006:
	.quad	caml_int64_ops
	.quad	1000000
	.quad	2303
.L100004:
	.quad	caml_int64_ops
	.quad	1000000
	.text
	.align	16
	.globl	camlX__collatzLen_1030
camlX__collatzLen_1030:
	subq	$8, %rsp
.L103:
	movq	%rax, %rdi
	movq	$1, %rsi
	movq	8(%rbx), %rax
	cmpq	%rsi, %rax
	jne	.L102
	movq	%rdi, %rax
	addq	$8, %rsp
	ret
	.align	4
.L102:
	xorq	%rdx, %rdx
	movq	$1, %rsi
	movq	8(%rbx), %rax
	andq	%rsi, %rax
	cmpq	%rdx, %rax
	jne	.L101
.L104:	subq	$24, %r15
	movq	caml_young_limit@GOTPCREL(%rip), %rax
	cmpq	(%rax), %r15
	jb	.L105
	leaq	8(%r15), %rcx
	movq	$2303, -8(%rcx)
	movq	caml_int64_ops@GOTPCREL(%rip), %rax
	movq	%rax, (%rcx)
	movq	8(%rbx), %rax
	sarq	$1, %rax
	movq	%rax, 8(%rcx)
	jmp	.L100
	.align	4
.L101:
.L107:	subq	$24, %r15
	movq	caml_young_limit@GOTPCREL(%rip), %rax
	cmpq	(%rax), %r15
	jb	.L108
	leaq	8(%r15), %rcx
	movq	$2303, -8(%rcx)
	movq	caml_int64_ops@GOTPCREL(%rip), %rax
	movq	%rax, (%rcx)
	movq	$1, %rdx
	movq	8(%rbx), %rsi
	movq	$3, %rax
	imulq	%rsi, %rax
	addq	%rdx, %rax
	movq	%rax, 8(%rcx)
.L100:
	movq	%rdi, %rax
	addq	$2, %rax
	movq	%rcx, %rbx
	jmp	.L103
.L108:	call	caml_call_gc@PLT
.L109:	jmp	.L107
.L105:	call	caml_call_gc@PLT
.L106:	jmp	.L104
	.type	camlX__collatzLen_1030,@function
	.size	camlX__collatzLen_1030,.-camlX__collatzLen_1030
	.text
	.align	16
	.globl	camlX__loop_1033
camlX__loop_1033:
	subq	$24, %rsp
.L113:
	movq	%rax, %rdx
	movq	8(%rbx), %rax
	movq	$1, %rsi
	movq	8(%rdx), %rdi
	cmpq	%rsi, %rdi
	jne	.L112
	addq	$24, %rsp
	ret
	.align	4
.L112:
	movq	%rax, 8(%rsp)
	movq	%rdx, 16(%rsp)
	movq	(%rbx), %rax
	movq	%rax, 0(%rsp)
	movq	$3, %rax
	movq	%rdx, %rbx
	call	camlX__collatzLen_1030@PLT
.L114:
	movq	%rax, %rsi
	movq	0(%rsp), %rbx
	cmpq	%rbx, %rsi
	jle	.L111
.L115:	subq	$24, %r15
	movq	caml_young_limit@GOTPCREL(%rip), %rax
	cmpq	(%rax), %r15
	jb	.L116
	leaq	8(%r15), %rdi
	movq	$2048, -8(%rdi)
	movq	%rsi, (%rdi)
	movq	16(%rsp), %rax
	movq	%rax, 8(%rdi)
	jmp	.L110
	.align	4
.L111:
.L118:	subq	$24, %r15
	movq	caml_young_limit@GOTPCREL(%rip), %rax
	cmpq	(%rax), %r15
	jb	.L119
	leaq	8(%r15), %rdi
	movq	$2048, -8(%rdi)
	movq	%rbx, (%rdi)
	movq	8(%rsp), %rax
	movq	%rax, 8(%rdi)
.L110:
.L121:	subq	$48, %r15
	movq	caml_young_limit@GOTPCREL(%rip), %rax
	cmpq	(%rax), %r15
	jb	.L122
	leaq	8(%r15), %rbx
	movq	$2048, -8(%rbx)
	movq	(%rdi), %rax
	movq	%rax, (%rbx)
	movq	8(%rdi), %rax
	movq	%rax, 8(%rbx)
	leaq	24(%rbx), %rax
	movq	$2303, -8(%rax)
	movq	caml_int64_ops@GOTPCREL(%rip), %rdi
	movq	%rdi, (%rax)
	movq	$1, %rsi
	movq	16(%rsp), %rdi
	movq	8(%rdi), %rdi
	subq	%rsi, %rdi
	movq	%rdi, 8(%rax)
	jmp	.L113
.L122:	call	caml_call_gc@PLT
.L123:	jmp	.L121
.L119:	call	caml_call_gc@PLT
.L120:	jmp	.L118
.L116:	call	caml_call_gc@PLT
.L117:	jmp	.L115
	.type	camlX__loop_1033,@function
	.size	camlX__loop_1033,.-camlX__loop_1033
	.text
	.align	16
	.globl	camlX__entry
camlX__entry:
	subq	$8, %rsp
.L124:
	movq	camlX__3@GOTPCREL(%rip), %rbx
	movq	camlX@GOTPCREL(%rip), %rax
	movq	%rbx, (%rax)
	movq	camlX__2@GOTPCREL(%rip), %rbx
	movq	camlX@GOTPCREL(%rip), %rax
	movq	%rbx, 8(%rax)
	movq	camlX@GOTPCREL(%rip), %rax
	movq	8(%rax), %rbx
	movq	camlX__1@GOTPCREL(%rip), %rax
	movq	(%rbx), %rdi
	call	*%rdi
.L125:
	movq	8(%rax), %rax
	salq	$1, %rax
	orq	$1, %rax
	call	camlPervasives__string_of_int_1130@PLT
.L126:
	movq	%rax, %rbx
	movq	camlPervasives@GOTPCREL(%rip), %rax
	movq	184(%rax), %rax
	call	camlPervasives__output_string_1191@PLT
.L127:
	movq	$1, %rax
	addq	$8, %rsp
	ret
	.type	camlX__entry,@function
	.size	camlX__entry,.-camlX__entry
	.data
	.text
	.globl	camlX__code_end
camlX__code_end:
	.data
	.globl	camlX__data_end
camlX__data_end:
	.long	0
	.globl	camlX__frametable
camlX__frametable:
	.quad	9
	.quad	.L127
	.word	17
	.word	0
	.align	8
	.long	(.L200000 - .) + 0xe0000000
	.long	0x168120
	.quad	.L126
	.word	17
	.word	0
	.align	8
	.long	(.L200000 - .) + 0xe0000000
	.long	0x168270
	.quad	.L125
	.word	16
	.word	0
	.align	8
	.quad	.L123
	.word	32
	.word	2
	.word	16
	.word	5
	.align	8
	.quad	.L120
	.word	32
	.word	3
	.word	3
	.word	8
	.word	16
	.align	8
	.quad	.L117
	.word	32
	.word	2
	.word	16
	.word	7
	.align	8
	.quad	.L114
	.word	32
	.word	3
	.word	0
	.word	8
	.word	16
	.align	8
	.quad	.L109
	.word	16
	.word	2
	.word	3
	.word	5
	.align	8
	.quad	.L106
	.word	16
	.word	2
	.word	3
	.word	5
	.align	8
.L200000:
	.asciz	"pervasives.ml"
	.align	8
	.section .note.GNU-stack,"",%progbits

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 17:26     ` Török Edwin
@ 2010-12-12 18:01       ` Jon Harrop
  2010-12-12 18:22         ` Török Edwin
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Harrop @ 2010-12-12 18:01 UTC (permalink / raw)
  To: 'Török Edwin'; +Cc: caml-list

Török Edwin wrote:
> Do you really need to use Int64 for that though? Won't the 63-bit
> version do?

I'm running 32-bit.

> > I am unable to reproduce your results. Here, the time falls from 24s
> > to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still 26×
> > slower than HLVM.

Sorry, I'm actually using an Opteron x86 (logged in from an Intel x86!).

> Do you still have 'idiv' in the compiled code? See my attached
> assembly, and compare it with yours please.
> I was doing the test on 64-bit, with ocamlopt 3.11.2 and 3.12.0.

I get what appear to be calls to C code:

camlCollatz__collatzLen_1030:
        subl    $8, %esp
.L103:
        movl    %eax, 4(%esp)
        movl    %ebx, 0(%esp)
        pushl   $camlCollatz__10
        pushl   %ebx
        movl    $caml_equal, %eax
        call    caml_c_call
.L104:
        addl    $8, %esp
        cmpl    $1, %eax
        je      .L102
        movl    4(%esp), %eax
        addl    $8, %esp
        ret
        .align  16
.L102:
        pushl   $camlCollatz__8
        movl    4(%esp), %eax
        pushl   %eax
        movl    $caml_int64_and, %eax
        call    caml_c_call
.L105:
        addl    $8, %esp
        pushl   $camlCollatz__9
        pushl   %eax
        movl    $caml_equal, %eax
        call    caml_c_call
.L106:
        addl    $8, %esp
        cmpl    $1, %eax
        je      .L101
        pushl   $3
        movl    4(%esp), %eax
        pushl   %eax
        movl    $caml_int64_shift_right, %eax
        call    caml_c_call
.L107:
        addl    $8, %esp
        movl    %eax, %ebx
        jmp     .L100
        .align  16
.L101:
        movl    0(%esp), %eax
        pushl   %eax
        pushl   $camlCollatz__6
        movl    $caml_int64_mul, %eax
        call    caml_c_call
.L108:
        addl    $8, %esp
        pushl   $camlCollatz__7
        pushl   %eax
        movl    $caml_int64_add, %eax
        call    caml_c_call
.L109:
        addl    $8, %esp
        movl    %eax, %ebx
.L100:
        movl    4(%esp), %eax
        addl    $2, %eax
        jmp     .L103

> FWIW the original code took 2.8 seconds here, so only 4x slower (this
> is an AMD Phenom II x6 1090T CPU). It probably depends how fast/slow
> the 'idiv' is on your CPU.

The performance of idiv is irrelevant here. The bottleneck may be those C calls but I don't understand why they are being generated.

Cheers,
Jon.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 18:01       ` Jon Harrop
@ 2010-12-12 18:22         ` Török Edwin
  0 siblings, 0 replies; 21+ messages in thread
From: Török Edwin @ 2010-12-12 18:22 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

On Sun, 12 Dec 2010 18:01:13 -0000
"Jon Harrop" <jon@ffconsultancy.com> wrote:

> Török Edwin wrote:
> > Do you really need to use Int64 for that though? Won't the 63-bit
> > version do?
> 
> I'm running 32-bit.

That explains it, in a 32-bit chroot my modified version is still slow.
I thought that 32-bit systems are not that relevant anymore (except for
Windows, but then people start moving to 64-bit there also).

> 
> > > I am unable to reproduce your results. Here, the time falls from
> > > 24s to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still
> > > 26× slower than HLVM.
> 
> Sorry, I'm actually using an Opteron x86 (logged in from an Intel
> x86!).
> 
> > Do you still have 'idiv' in the compiled code? See my attached
> > assembly, and compare it with yours please.
> > I was doing the test on 64-bit, with ocamlopt 3.11.2 and 3.12.0.
> 
> I get what appear to be calls to C code:
> 
> camlCollatz__collatzLen_1030:
>         subl    $8, %esp
> .L103:
>         movl    %eax, 4(%esp)
>         movl    %ebx, 0(%esp)
>         pushl   $camlCollatz__10
>         pushl   %ebx
>         movl    $caml_equal, %eax
>         call    caml_c_call

Yes, that is quite bad. I don't know how OCaml's code generator works,
but it looks like it calls the C implementation if the CPU doesn't
support the operation directly. And since this is 32-bit you need all
the extra pushes and movs to do actually call something.
If only it could inline those calls, then it could optimize away most
of the overhead (LLVM would help here again).

> 
> > FWIW the original code took 2.8 seconds here, so only 4x slower
> > (this is an AMD Phenom II x6 1090T CPU). It probably depends how
> > fast/slow the 'idiv' is on your CPU.
> 
> The performance of idiv is irrelevant here. The bottleneck may be
> those C calls but I don't understand why they are being generated.

I think for the same reason gcc has __udivdi3 in libgcc: there is no
direct way of executing a 64-bit divide on a 32-bit machine, and it
saves code space to do it in a function.
However that doesn't make much sense for mul and add, which don't need
that many instructions to implement on 32-bit.

> 
> Cheers,
> Jon.
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 15:55 ` Török Edwin
  2010-12-12 17:14   ` Jon Harrop
@ 2010-12-12 19:09   ` Benedikt Meurer
  2010-12-12 19:20     ` John Carr
                       ` (3 more replies)
  2010-12-14  9:54   ` Value types Goswin von Brederlow
  2 siblings, 4 replies; 21+ messages in thread
From: Benedikt Meurer @ 2010-12-12 19:09 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 5708 bytes --]


On Dec 12, 2010, at 16:55 , Török Edwin wrote:

> On Sun, 12 Dec 2010 14:54:14 -0000
> "Jon Harrop" <jon@ffconsultancy.com> wrote:
> 
>> The Haskell guys got their best performance improvement moving to
>> LLVM from the hailstone benchmark so it is interesting to examine
>> this case as well. I just added support for 64-bit ints to HLVM to
>> implement that benchmark and my code is:
>> 
>> Here’s the equivalent OCaml code:
>> 
>>  let rec collatzLen(c, n) : int =
>>    if n = 1L then c else
>>      collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else
> 
> OK, but boxing itself has nothing to do with the performance degration
> here. It is the lack of compiler optimizations on the Int64 type. This
> could be solved by implementing compiler optimizations for it (or
> manually optimizing some integer arithmetic that is slow).
> 
> Lets run the code under a profiler, or look at the assembly (I used
> 'perf record' and 'perf report'). 2 'idiv' instructions turn up as top
> offenders in the profile.
> 
> Problem #1: Int64.rem n 2 -> another idiv instruction
> 
> A C compiler would optimize this to an 'and' instruction.
> Change that to 'Int64.logand n 1L = 0L'/
> 
> Problem #2: Int64.div n 2 -> idiv instruction. 
> 
> A C compiler would optimize this to a right shift. Changing that to 'Int64.shift_right n 1' speeds
> up the code.

This is easy to fix in ocamlopt (see attached patch ocamlopt-natint.patch), by applying the same optimizations already used for constant int's to constant natint's (Int64 is Nativeint on 64bit). Note however, that "mod 2" is not really "and 1", neither is "div 2" equivalent to "lsr 1"; that would be the case for unsigned arithmetic (doesn't matter in your example, tho).

I don't see the point of optimizing for x86-32 (neither would I spend my time optimizing anything for VAX these days), but it would be possible to add appropriate cases for Int64 on x86 as well (regalloc would be most difficult here, since that requires support for register pairs to perform 64bit arithmetic).

>> 1. Unboxing can give huge performance improvements on serial code,
> 
> s/Unboxing/arithmetic optimizations/
> Please find an example where the performance benefit is due to
> unboxing, and not due to arithmetic optimizations performed on the
> unboxed code.

The boxing involved is relevant, but boxing in general is not the issue. In this special case, the "let nlen, n = if..." code requires heap allocation, because of the way the pattern is compiled. This could be fixed by moving the condition out of the code and using two if's to select n/nlen separately (doesn't speed up that much). Fixing the pattern compiler to handle these cases might be interesting for general benefit.

I already mentioned this multiple times, but here we go again: Unboxing optimizations may indeed prove useful if applied wisely (cmmgen.ml is of special interest here, the unboxing optimizations are more or less special cases; that could be extended to include interesting cases like moving boxing out of if-then-else in return position, etc).

But (here comes the special "Harrop note") this has absolutely nothing to do with LLVM (and of course really, really, really nothing to do with HLVM). Using a different data representation for the heap requires a nearly complete rewrite of the OCaml system (you would probably need to start at the Typing level); if one wants to do this, enjoy and come back with code. But even then, data representation issues will have to be considered long before it comes to actual code generation (if you are serious, you'll have to think about the runtime first prior to talking about code generation for a non-existing runtime), so even then it has nothing to do with LLVM (or C-- or C or whatever IR you can think of).

Combining alloc's across if-then-else constructs further reduces code size in your example (and probably other programs as well), see attached file ocamlopt-comballoc-ifthenelse.patch. It's quick&dirty, but it should illustrate the idea.

>> let alone parallel code. The optimized HLVM is running 32× faster
>> than the OCaml here.
>> 
>> 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it
>> to beat GCC-compiled C code here.
>> 
> 
> One advantage of using LLVM is that it would notice arithmetic
> optimizations like this and perform it itself (even if you use the
> boxed representation).

In case of x86-32, it won't, simply because LLVM will be presented with the calls to caml_int32_* functions. You'd need to change the Cmm code instead (changing the low-level stuff is straight-forward as demonstrated). For 64bit targets, see attached patch.

This doesn't mean that LLVM wouldn't be useful (in fact, I've just started an LLVM backend for ocamlopt). But it is important to note that LLVM is not the solution to everything. As the name implies, it's "low level", it does a few "higher level" optimizations for C, but these are special cases (and somewhat ugly if you take the time to read the code). It won't make a faster OCaml magically, just like it didn't make a faster Haskell by itself.

I could go on by quoting common "Harrop jokes" like "you need types in the low-level IR", etc. trying to tell him that this is simply wrong; but after reading through the Caml/LISP mailing list archives (thanks for the pointers by several readers), I'm pretty much convinced that Jon simply decided to end his war against LISP just to start a new one against ocamlopt.

If anyone is serious about ocamlopt with LLVM, feel free to contact me (tho, my time is limited atm).

> Best regards,
> --Edwin

greets,
Benedikt


[-- Attachment #2: ocamlopt-comballoc-ifthenelse.patch --]
[-- Type: application/octet-stream, Size: 6109 bytes --]

diff --git a/asmcomp/comballoc.ml b/asmcomp/comballoc.ml
index 5a862b1..82c01f9 100644
--- a/asmcomp/comballoc.ml
+++ b/asmcomp/comballoc.ml
@@ -27,64 +27,78 @@ let allocated_size = function
     No_alloc -> 0
   | Pending_alloc(reg, ofs) -> ofs
 
+let instr_cons_alloc sz a r n =
+  if sz != 0
+  then instr_cons (Iop(Ialloc sz)) a r n
+  else n
+
 let rec combine i allocstate =
   match i.desc with
     Iend | Ireturn | Iexit _ | Iraise ->
-      (i, allocated_size allocstate)
+      (i, allocated_size allocstate, true)
   | Iop(Ialloc sz) ->
       begin match allocstate with
         No_alloc ->
-          let (newnext, newsz) =
+          let (newnext, newsz, _) =
             combine i.next (Pending_alloc(i.res.(0), sz)) in
-          (instr_cons (Iop(Ialloc newsz)) i.arg i.res newnext, 0)
+          (instr_cons_alloc newsz i.arg i.res newnext, 0, false)
       | Pending_alloc(reg, ofs) ->
           if ofs + sz < Config.max_young_wosize then begin
-            let (newnext, newsz) =
+            let (newnext, newsz, safe) =
               combine i.next (Pending_alloc(reg, ofs + sz)) in
-            (instr_cons (Iop(Iintop_imm(Iadd, ofs))) [| reg |] i.res newnext,
-             newsz)
+            if sz != 0 && ofs != 0 then
+              (instr_cons (Iop(Iintop_imm(Iadd, ofs))) [|reg|] i.res newnext, newsz, safe)
+            else if sz != 0 then
+              (instr_cons (Iop Imove) [|reg|] i.res newnext, newsz, safe)
+            else
+              (newnext, newsz, safe)
           end else begin
-            let (newnext, newsz) =
+            let (newnext, newsz, _) =
               combine i.next (Pending_alloc(i.res.(0), sz)) in
-            (instr_cons (Iop(Ialloc newsz)) i.arg i.res newnext, ofs)
+            (instr_cons_alloc newsz i.arg i.res newnext, ofs, false)
           end
       end
   | Iop(Icall_ind | Icall_imm _ | Iextcall _ |
         Itailcall_ind | Itailcall_imm _) ->
       let newnext = combine_restart i.next in
-      (instr_cons_debug i.desc i.arg i.res i.dbg newnext,
-       allocated_size allocstate)
+      (instr_cons_debug i.desc i.arg i.res i.dbg newnext, allocated_size allocstate, false)
   | Iop op ->
-      let (newnext, sz) = combine i.next allocstate in
-      (instr_cons_debug i.desc i.arg i.res i.dbg newnext, sz)
+      let (newnext, sz, safe) = combine i.next allocstate in
+      (instr_cons_debug i.desc i.arg i.res i.dbg newnext, sz, safe)
   | Iifthenelse(test, ifso, ifnot) ->
-      let newifso = combine_restart ifso in
-      let newifnot = combine_restart ifnot in
-      let newnext = combine_restart i.next in
-      (instr_cons (Iifthenelse(test, newifso, newifnot)) i.arg i.res newnext,
-       allocated_size allocstate)
+      begin match allocstate, combine ifso allocstate, combine ifnot allocstate with
+        Pending_alloc(reg, ofs), (newifso, szifso, true), (newifnot, szifnot, true) when szifso = szifnot ->
+          let (newnext, sznext, safe) = combine i.next (Pending_alloc(reg, ofs + szifso)) in
+          (instr_cons (Iifthenelse(test, newifso, newifnot)) i.arg i.res newnext,
+           sznext,
+           safe)
+      | _, _, _ ->
+          let newifso = combine_restart ifso in
+          let newifnot = combine_restart ifnot in
+          let newnext = combine_restart i.next in
+          (instr_cons (Iifthenelse(test, newifso, newifnot)) i.arg i.res newnext,
+           allocated_size allocstate, false)
+      end
   | Iswitch(table, cases) ->
       let newcases = Array.map combine_restart cases in
       let newnext = combine_restart i.next in
-      (instr_cons (Iswitch(table, newcases)) i.arg i.res newnext,
-       allocated_size allocstate)
+      (instr_cons (Iswitch(table, newcases)) i.arg i.res newnext, allocated_size allocstate, false)
   | Iloop(body) ->
       let newbody = combine_restart body in
-      (instr_cons (Iloop(newbody)) i.arg i.res i.next,
-       allocated_size allocstate)
+      (instr_cons (Iloop(newbody)) i.arg i.res i.next, allocated_size allocstate, false)
   | Icatch(io, body, handler) ->
-      let (newbody, sz) = combine body allocstate in
+      let (newbody, sz, _) = combine body allocstate in
       let newhandler = combine_restart handler in
       let newnext = combine_restart i.next in
-      (instr_cons (Icatch(io, newbody, newhandler)) i.arg i.res newnext, sz)
+      (instr_cons (Icatch(io, newbody, newhandler)) i.arg i.res newnext, sz, false)
   | Itrywith(body, handler) ->
-      let (newbody, sz) = combine body allocstate in
+      let (newbody, sz, _) = combine body allocstate in
       let newhandler = combine_restart handler in
       let newnext = combine_restart i.next in
-      (instr_cons (Itrywith(newbody, newhandler)) i.arg i.res newnext, sz)
+      (instr_cons (Itrywith(newbody, newhandler)) i.arg i.res newnext, sz, false)
 
 and combine_restart i =
-  let (newi, _) = combine i No_alloc in newi
+  let (newi, _, _) = combine i No_alloc in newi
 
 let fundecl f =
   {f with fun_body = combine_restart f.fun_body}
diff --git a/asmcomp/selectgen.ml b/asmcomp/selectgen.ml
index 7daa239..4221f95 100644
--- a/asmcomp/selectgen.ml
+++ b/asmcomp/selectgen.ml
@@ -562,6 +562,9 @@ method emit_expr env exp =
           let (rif, sif) = self#emit_sequence env eif in
           let (relse, selse) = self#emit_sequence env eelse in
           let r = join rif sif relse selse in
+          (* Dummy Ialloc 0 for comballoc.ml *)
+          let ra = self#regs_for typ_addr in
+          self#insert (Iop(Ialloc 0)) [||] ra;
           self#insert (Iifthenelse(cond, sif#extract, selse#extract))
                       rarg [||];
           r
@@ -790,6 +793,9 @@ method emit_tail env exp =
       begin match self#emit_expr env earg with
         None -> ()
       | Some rarg ->
+          (* Dummy Ialloc 0 for comballoc.ml *)
+          let ra = self#regs_for typ_addr in
+          self#insert (Iop(Ialloc 0)) [||] ra;
           self#insert (Iifthenelse(cond, self#emit_tail_sequence env eif,
                                          self#emit_tail_sequence env eelse))
                       rarg [||]

[-- Attachment #3: ocamlopt-natint.patch --]
[-- Type: application/octet-stream, Size: 7218 bytes --]

diff --git a/asmcomp/amd64/selection.ml b/asmcomp/amd64/selection.ml
index 4921e51..bce8104 100644
--- a/asmcomp/amd64/selection.ml
+++ b/asmcomp/amd64/selection.ml
@@ -168,6 +168,9 @@ method! select_operation op args =
         [arg1; Cconst_int n] when self#is_immediate n
                                && n = 1 lsl (Misc.log2 n) ->
           (Iintop_imm(Idiv, n), [arg1])
+        | [arg1; Cconst_natint n] when self#is_immediate_natint n
+                               && n = Nativeint.shift_left 1n (Misc.log2 (Nativeint.to_int n)) ->
+          (Iintop_imm(Idiv, Nativeint.to_int n), [arg1])
       | _ -> (Iintop Idiv, args)
       end
   | Cmodi ->
@@ -175,6 +178,9 @@ method! select_operation op args =
         [arg1; Cconst_int n] when self#is_immediate n
                                && n = 1 lsl (Misc.log2 n) ->
           (Iintop_imm(Imod, n), [arg1])
+        | [arg1; Cconst_natint n] when self#is_immediate_natint n
+                               && n = Nativeint.shift_left 1n (Misc.log2 (Nativeint.to_int n)) ->
+          (Iintop_imm(Imod, Nativeint.to_int n), [arg1])
       | _ -> (Iintop Imod, args)
       end
   (* Recognize float arithmetic with memory. *)
diff --git a/asmcomp/selectgen.ml b/asmcomp/selectgen.ml
index 50f949a..7daa239 100644
--- a/asmcomp/selectgen.ml
+++ b/asmcomp/selectgen.ml
@@ -201,6 +201,8 @@ method is_simple_expr = function
 
 method virtual is_immediate : int -> bool
 
+method virtual is_immediate_natint : nativeint -> bool
+
 (* Selection of addressing modes *)
 
 method virtual select_addressing :
@@ -238,11 +240,21 @@ method select_operation op args =
       if n = 1 lsl l
       then (Iintop_imm(Ilsl, l), [arg1])
       else self#select_arith_comm Imul args
+  | (Cmuli, [arg1; Cconst_natint n]) ->
+      let l = Misc.log2 (Nativeint.to_int n) in
+      if n = Nativeint.shift_left 1n l
+      then (Iintop_imm(Ilsl, l), [arg1])
+      else self#select_arith_comm Imul args
   | (Cmuli, [Cconst_int n; arg1]) ->
       let l = Misc.log2 n in
       if n = 1 lsl l
       then (Iintop_imm(Ilsl, l), [arg1])
       else self#select_arith_comm Imul args
+  | (Cmuli, [Cconst_natint n; arg1]) ->
+      let l = Misc.log2 (Nativeint.to_int n) in
+      if n = Nativeint.shift_left 1n l
+      then (Iintop_imm(Ilsl, l), [arg1])
+      else self#select_arith_comm Imul args
   | (Cmuli, _) -> self#select_arith_comm Imul args
   | (Cdivi, _) -> self#select_arith Idiv args
   | (Cmodi, _) -> self#select_arith_comm Imod args
@@ -270,38 +282,60 @@ method select_operation op args =
 method private select_arith_comm op = function
     [arg; Cconst_int n] when self#is_immediate n ->
       (Iintop_imm(op, n), [arg])
+  | [arg; Cconst_natint n] when self#is_immediate_natint n ->
+      (Iintop_imm(op, Nativeint.to_int n), [arg])
   | [arg; Cconst_pointer n] when self#is_immediate n ->
       (Iintop_imm(op, n), [arg])
+  | [arg; Cconst_natpointer n] when self#is_immediate_natint n ->
+      (Iintop_imm(op, Nativeint.to_int n), [arg])
   | [Cconst_int n; arg] when self#is_immediate n ->
       (Iintop_imm(op, n), [arg])
+  | [Cconst_natint n; arg] when self#is_immediate_natint n ->
+      (Iintop_imm(op, Nativeint.to_int n), [arg])
   | [Cconst_pointer n; arg] when self#is_immediate n ->
       (Iintop_imm(op, n), [arg])
+  | [Cconst_natpointer n; arg] when self#is_immediate_natint n ->
+      (Iintop_imm(op, Nativeint.to_int n), [arg])
   | args ->
       (Iintop op, args)
 
 method private select_arith op = function
     [arg; Cconst_int n] when self#is_immediate n ->
       (Iintop_imm(op, n), [arg])
+  | [arg; Cconst_natint n] when self#is_immediate_natint n ->
+      (Iintop_imm(op, Nativeint.to_int n), [arg])
   | [arg; Cconst_pointer n] when self#is_immediate n ->
       (Iintop_imm(op, n), [arg])
+  | [arg; Cconst_natpointer n] when self#is_immediate_natint n ->
+      (Iintop_imm(op, Nativeint.to_int n), [arg])
   | args ->
       (Iintop op, args)
 
 method private select_shift op = function
     [arg; Cconst_int n] when n >= 0 && n < Arch.size_int * 8 ->
       (Iintop_imm(op, n), [arg])
+  | [arg; Cconst_natint n] when n >= 0n && n < Nativeint.of_int (Arch.size_int * 8) ->
+      (Iintop_imm(op, Nativeint.to_int n), [arg])
   | args ->
       (Iintop op, args)
 
 method private select_arith_comp cmp = function
     [arg; Cconst_int n] when self#is_immediate n ->
       (Iintop_imm(Icomp cmp, n), [arg])
+  | [arg; Cconst_natint n] when self#is_immediate_natint n ->
+      (Iintop_imm(Icomp cmp, Nativeint.to_int n), [arg])
   | [arg; Cconst_pointer n] when self#is_immediate n ->
       (Iintop_imm(Icomp cmp, n), [arg])
+  | [arg; Cconst_natpointer n] when self#is_immediate_natint n ->
+      (Iintop_imm(Icomp cmp, Nativeint.to_int n), [arg])
   | [Cconst_int n; arg] when self#is_immediate n ->
       (Iintop_imm(Icomp(swap_intcomp cmp), n), [arg])
+  | [Cconst_natint n; arg] when self#is_immediate_natint n ->
+      (Iintop_imm(Icomp(swap_intcomp cmp), Nativeint.to_int n), [arg])
   | [Cconst_pointer n; arg] when self#is_immediate n ->
       (Iintop_imm(Icomp(swap_intcomp cmp), n), [arg])
+  | [Cconst_natpointer n; arg] when self#is_immediate_natint n ->
+      (Iintop_imm(Icomp(swap_intcomp cmp), Nativeint.to_int n), [arg])
   | args ->
       (Iintop(Icomp cmp), args)
 
@@ -310,12 +344,20 @@ method private select_arith_comp cmp = function
 method select_condition = function
     Cop(Ccmpi cmp, [arg1; Cconst_int n]) when self#is_immediate n ->
       (Iinttest_imm(Isigned cmp, n), arg1)
+  | Cop(Ccmpi cmp, [arg1; Cconst_natint n]) when self#is_immediate_natint n ->
+      (Iinttest_imm(Isigned cmp, Nativeint.to_int n), arg1)
   | Cop(Ccmpi cmp, [Cconst_int n; arg2]) when self#is_immediate n ->
       (Iinttest_imm(Isigned(swap_comparison cmp), n), arg2)
+  | Cop(Ccmpi cmp, [Cconst_natint n; arg2]) when self#is_immediate_natint n ->
+      (Iinttest_imm(Isigned(swap_comparison cmp), Nativeint.to_int n), arg2)
   | Cop(Ccmpi cmp, [arg1; Cconst_pointer n]) when self#is_immediate n ->
       (Iinttest_imm(Isigned cmp, n), arg1)
+  | Cop(Ccmpi cmp, [arg1; Cconst_natpointer n]) when self#is_immediate_natint n ->
+      (Iinttest_imm(Isigned cmp, Nativeint.to_int n), arg1)
   | Cop(Ccmpi cmp, [Cconst_pointer n; arg2]) when self#is_immediate n ->
       (Iinttest_imm(Isigned(swap_comparison cmp), n), arg2)
+  | Cop(Ccmpi cmp, [Cconst_natpointer n; arg2]) when self#is_immediate_natint n ->
+      (Iinttest_imm(Isigned(swap_comparison cmp), Nativeint.to_int n), arg2)
   | Cop(Ccmpi cmp, args) ->
       (Iinttest(Isigned cmp), Ctuple args)
   | Cop(Ccmpa cmp, [arg1; Cconst_pointer n]) when self#is_immediate n ->
diff --git a/asmcomp/selectgen.mli b/asmcomp/selectgen.mli
index 7c30f9f..8966ad1 100644
--- a/asmcomp/selectgen.mli
+++ b/asmcomp/selectgen.mli
@@ -23,6 +23,7 @@ class virtual selector_generic : object
   (* The following methods must or can be overridden by the processor
      description *)
   method virtual is_immediate : int -> bool
+  method virtual is_immediate_natint : nativeint -> bool
     (* Must be defined to indicate whether a constant is a suitable
        immediate operand to arithmetic instructions *)
   method virtual select_addressing :

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 19:09   ` Benedikt Meurer
@ 2010-12-12 19:20     ` John Carr
  2010-12-14  9:43       ` Value types Goswin von Brederlow
  2010-12-12 19:55     ` Value types (Was: [Caml-list] ocamlopt LLVM support) Török Edwin
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: John Carr @ 2010-12-12 19:20 UTC (permalink / raw)
  To: Benedikt Meurer; +Cc: caml-list

> I don't see the point of optimizing for x86-32

I'm using 32 bit ocaml because my program uses too much memory in 64
bit mode.  If there were an ocaml that used 32 bit words in 64 bit
mode, I would use that instead.

Early 32 to 64 bit transitions offered 32 bit pointers with 64 bit
registers, called TSO on Alpha and n32 on MIPS.  AMD and Intel did not.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 14:54 Value types (Was: [Caml-list] ocamlopt LLVM support) Jon Harrop
  2010-12-12 15:55 ` Török Edwin
@ 2010-12-12 19:53 ` Brian Hurt
  2010-12-12 20:39   ` Jon Harrop
  1 sibling, 1 reply; 21+ messages in thread
From: Brian Hurt @ 2010-12-12 19:53 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

I'm going to regret this.  I know I'm going to regret this.

On Sun, 12 Dec 2010, Jon Harrop wrote:

> Here?s the equivalent OCaml code:
>
>  let rec collatzLen(c, n) : int =
>    if n = 1L then c else
>      collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else
> Int64.add (Int64.mul 3L n) 1L);;
>
>  let rec loop(i, (nlen, n)) =
>    if i = 1L then n else
>      let ilen = collatzLen(1, i) in
>      let nlen, n = if ilen > nlen then ilen, i else nlen, n in
>      loop (Int64.sub i 1L, (nlen, n));;

Congratulations, Jon, you win today's Captain Obvious award.  Using 
Int64's, which are forced to be boxed, really slows things down.  Also, 
uncurrying all your arguments also slows things down.  Running your 
original code on my 64-bit laptop, it took 6.35s to run the 1M example. 
The following alternate code only took 0.82s, for a speed up of almost 
7.75x.  Scaling your timings by a similar amount gives Ocaml a running 
speed of 3.14s in your set up, or competitive with F#.

My code:
   let rec collatzLen c n =
     if n = 1 then c else
       collatzLen (c+1)
         (if (n land 1) == 0 then (n lsr 1) else ((n * 3) + 1))
   ;;

   let rec loop i nlen n =
     if i = 1 then n else
       let ilen = collatzLen 1 i in
       if (ilen > nlen) then
         loop (i - 1) ilen i
       else
         loop (i - 1) nlen n
   ;;

   loop 1000000 0 0

Here is where you insert a lecture about how Ocaml int's being on 63 (or 
31) bits aren't "real" ints, and that therefor this isn't a valid 
comparison at all.

Brian

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 19:09   ` Benedikt Meurer
  2010-12-12 19:20     ` John Carr
@ 2010-12-12 19:55     ` Török Edwin
  2010-12-12 22:05       ` Jon Harrop
  2010-12-12 21:50     ` Jon Harrop
  2010-12-13  8:43     ` Alain Frisch
  3 siblings, 1 reply; 21+ messages in thread
From: Török Edwin @ 2010-12-12 19:55 UTC (permalink / raw)
  To: Benedikt Meurer; +Cc: caml-list

On Sun, 12 Dec 2010 20:09:00 +0100
Benedikt Meurer <benedikt.meurer@googlemail.com> wrote:

> 
> On Dec 12, 2010, at 16:55 , Török Edwin wrote:
> 
> > [...]
> > Problem #2: Int64.div n 2 -> idiv instruction. 
> > 
> > A C compiler would optimize this to a right shift. Changing that to
> > 'Int64.shift_right n 1' speeds up the code.
> 
> This is easy to fix in ocamlopt (see attached patch
> ocamlopt-natint.patch), by applying the same optimizations already
> used for constant int's to constant natint's (Int64 is Nativeint on
> 64bit).

Nice patch. I definitely agree that what can be fixed in ocamlopt's
high-level opts should be fixed there, rather than hope that LLVM will
just do everything.

> [...]
> > 
> > One advantage of using LLVM is that it would notice arithmetic
> > optimizations like this and perform it itself (even if you use the
> > boxed representation).
> 
> In case of x86-32, it won't, simply because LLVM will be presented
> with the calls to caml_int32_* functions. 

I was thinking that the runtime could be compiled to .bc, and then they
would get inlined and optimized away at link time. That is overkill for
something as simple as integer arithmetic, but could be useful for more
complicated runtime functions (or C bindings).

> You'd need to change the
> Cmm code instead (changing the low-level stuff is straight-forward as
> demonstrated).

I agree that not emitting those C calls in the first place is better.

>  For 64bit targets, see attached patch.
> 
> This doesn't mean that LLVM wouldn't be useful (in fact, I've just
> started an LLVM backend for ocamlopt).

Great! If you notice some more optimizations missed by ocamlopt while
working on it, could you write up a list of those?

I think I suggested earlier in this thread that LLVM could be used in
two ways: write a backend with it, or port some of the LLVM
optimizations to ocamlopt.
Maybe it'd be good to write some documentation on ocamlopt's internals,
and how one can add more optimizations there. Something simple like
what is the IR format (like LLVM's langref), how you perform the
optimizations (what is the equivalent of LLVM's passes), and what
helper modules can you use (what is the equivalent of LLVM's analysis)
would suffice for a start. Does something like this exist already?

> But it is important to note
> that LLVM is not the solution to everything. As the name implies,
> it's "low level", it does a few "higher level" optimizations for C,
> but these are special cases (and somewhat ugly if you take the time
> to read the code). It won't make a faster OCaml magically, just like
> it didn't make a faster Haskell by itself.
> 
> I could go on by quoting common "Harrop jokes" like "you need types
> in the low-level IR", etc. trying to tell him that this is simply
> wrong; but after reading through the Caml/LISP mailing list archives
> (thanks for the pointers by several readers), I'm pretty much
> convinced that Jon simply decided to end his war against LISP just to
> start a new one against ocamlopt.
> 
> If anyone is serious about ocamlopt with LLVM, feel free to contact
> me (tho, my time is limited atm).

I wouldn't have much time to dedicate to this, so I can't say I'm
serious about it. 
AFAICT LLVM's OCaml bindings are only good for generating LLVM IR from
OCaml, not for actually performing transformations on it (there is no
binding to retrieve the type of a value for example). I'll probably be
looking into fixing that in the near future, and this may indirectly
help your LLVM backend (if you intend to write OCaml specific
transformations on the LLVM IR).

Best regards,
--Edwin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 19:53 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Brian Hurt
@ 2010-12-12 20:39   ` Jon Harrop
  0 siblings, 0 replies; 21+ messages in thread
From: Jon Harrop @ 2010-12-12 20:39 UTC (permalink / raw)
  To: 'Brian Hurt'; +Cc: caml-list

Brian Hurt wrote:
> On Sun, 12 Dec 2010, Jon Harrop wrote:
> >  let rec collatzLen(c, n) : int =
> >    if n = 1L then c else
> >      collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else
> > Int64.add (Int64.mul 3L n) 1L);;
> >
> >  let rec loop(i, (nlen, n)) =
> >    if i = 1L then n else
> >      let ilen = collatzLen(1, i) in
> >      let nlen, n = if ilen > nlen then ilen, i else nlen, n in
> >      loop (Int64.sub i 1L, (nlen, n));;
> 
> Congratulations, Jon, you win today's Captain Obvious award.  Using
> Int64's, which are forced to be boxed, really slows things down.

Apparently boxing isn't the issue here, as I had assumed. On 32-bit, OCaml
compiles each arithmetic operation on the int64s to a C function call.

> Also, uncurrying all your arguments also slows things down.

I see <3% performance improvement from currying everything.

> Running your
> original code on my 64-bit laptop, it took 6.35s to run the 1M example.
> The following alternate code only took 0.82s, for a speed up of almost
> 7.75x.

According to Edwin, you should be able to get C-like performance by running
the OCaml in 64-bit and replacing the div and mod operations with shifts and
logical ANDs.

> Scaling your timings by a similar amount gives Ocaml a running
> speed of 3.14s in your set up, or competitive with F#.

I'd be wary of scaling timings by measurements made across different
architectures. OCaml seems to be doing completely different things on x86
and x64 here.

Cheers,
Jon.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 19:09   ` Benedikt Meurer
  2010-12-12 19:20     ` John Carr
  2010-12-12 19:55     ` Value types (Was: [Caml-list] ocamlopt LLVM support) Török Edwin
@ 2010-12-12 21:50     ` Jon Harrop
  2010-12-13  8:43     ` Alain Frisch
  3 siblings, 0 replies; 21+ messages in thread
From: Jon Harrop @ 2010-12-12 21:50 UTC (permalink / raw)
  To: 'Benedikt Meurer', caml-list

Benedict wrote:
> > A C compiler would optimize this to a right shift. Changing that to
> > 'Int64.shift_right n 1' speeds up the code.
>
> This is easy to fix in ocamlopt (see attached patch ocamlopt-
> natint.patch), by applying the same optimizations already used for
> constant int's to constant natint's (Int64 is Nativeint on 64bit). Note
> however, that "mod 2" is not really "and 1", neither is "div 2"
> equivalent to "lsr 1"; that would be the case for unsigned arithmetic
> (doesn't matter in your example, tho).

That's great. Does it optimize div and mod by any constant integer as C
compilers do?

> >> 1. Unboxing can give huge performance improvements on serial code,
> >
> > s/Unboxing/arithmetic optimizations/
> > Please find an example where the performance benefit is due to
> > unboxing, and not due to arithmetic optimizations performed on the
> > unboxed code.
> 
> The boxing involved is relevant, but boxing in general is not the
> issue. In this special case, the "let nlen, n = if..." code requires
> heap allocation, because of the way the pattern is compiled. This could
> be fixed by moving the condition out of the code and using two if's to
> select n/nlen separately (doesn't speed up that much). Fixing the
> pattern compiler to handle these cases might be interesting for general
> benefit.
> 
> I already mentioned this multiple times, but here we go again: Unboxing
> optimizations may indeed prove useful if applied wisely (cmmgen.ml is
> of special interest here, the unboxing optimizations are more or less
> special cases; that could be extended to include interesting cases like
> moving boxing out of if-then-else in return position, etc).
> 
> But (here comes the special "Harrop note") this has absolutely nothing
> to do with LLVM (and of course really, really, really nothing to do
> with HLVM). Using a different data representation for the heap requires
> a nearly complete rewrite of the OCaml system (you would probably need
> to start at the Typing level); if one wants to do this, enjoy and come
> back with code. But even then, data representation issues will have to
> be considered long before it comes to actual code generation (if you
> are serious, you'll have to think about the runtime first prior to
> talking about code generation for a non-existing runtime), so even then
> it has nothing to do with LLVM (or C-- or C or whatever IR you can
> think of).

OCaml programmers can benefit from more appropriate data representations by
using LLVM as a library to generate code from OCaml. HLVM is an example of
this that anyone can play with.

> Combining alloc's across if-then-else constructs further reduces code
> size in your example (and probably other programs as well), see
> attached file ocamlopt-comballoc-ifthenelse.patch. It's quick&dirty,
> but it should illustrate the idea.

I think that is an even more valuable improvement to ocamlopt than the int
optimization.

> This doesn't mean that LLVM wouldn't be useful (in fact, I've just
> started an LLVM backend for ocamlopt). But it is important to note that
> LLVM is not the solution to everything. As the name implies, it's "low
> level", it does a few "higher level" optimizations for C, but these are
> special cases (and somewhat ugly if you take the time to read the
> code). It won't make a faster OCaml magically, just like it didn't make
> a faster Haskell by itself.

Absolutely.

> I could go on by quoting common "Harrop jokes" like "you need types in
> the low-level IR", etc. trying to tell him that this is simply wrong;
> but after reading through the Caml/LISP mailing list archives (thanks
> for the pointers by several readers), I'm pretty much convinced that
> Jon simply decided to end his war against LISP just to start a new one
> against ocamlopt.

Suggesting that OCaml programmers use LLVM as a library because you can get
huge performance gains is hardly a "war against ocamlopt".

Cheers,
Jon.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 19:55     ` Value types (Was: [Caml-list] ocamlopt LLVM support) Török Edwin
@ 2010-12-12 22:05       ` Jon Harrop
  2010-12-12 22:27         ` Török Edwin
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Harrop @ 2010-12-12 22:05 UTC (permalink / raw)
  To: 'Török Edwin'; +Cc: caml-list

Edwin wrote:
> AFAICT LLVM's OCaml bindings are only good for generating LLVM IR from
> OCaml, not for actually performing transformations on it (there is no
> binding to retrieve the type of a value for example). I'll probably be
> looking into fixing that in the near future, and this may indirectly
> help your LLVM backend (if you intend to write OCaml specific
> transformations on the LLVM IR).

That's a lot of work. Wouldn't it be preferable to do the passes on the OCaml side and focus on generating high quality LLVM IR?

Cheers,
Jon.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 22:05       ` Jon Harrop
@ 2010-12-12 22:27         ` Török Edwin
  2010-12-12 23:41           ` Jon Harrop
  0 siblings, 1 reply; 21+ messages in thread
From: Török Edwin @ 2010-12-12 22:27 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

On Sun, 12 Dec 2010 22:05:34 -0000
Jon Harrop <jonathandeanharrop@googlemail.com> wrote:

> Edwin wrote:
> > AFAICT LLVM's OCaml bindings are only good for generating LLVM IR
> > from OCaml, not for actually performing transformations on it
> > (there is no binding to retrieve the type of a value for example).
> > I'll probably be looking into fixing that in the near future, and
> > this may indirectly help your LLVM backend (if you intend to write
> > OCaml specific transformations on the LLVM IR).
> 
> That's a lot of work. Wouldn't it be preferable to do the passes on
> the OCaml side and focus on generating high quality LLVM IR?

Yes, that is probably a better approach (generating the optimized IR in
the first place).

Best regards,
--Edwin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 22:27         ` Török Edwin
@ 2010-12-12 23:41           ` Jon Harrop
  2010-12-13  2:13             ` Eray Ozkural
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Harrop @ 2010-12-12 23:41 UTC (permalink / raw)
  To: 'Török Edwin'; +Cc: caml-list

Edwin wrote:
> On Sun, 12 Dec 2010 22:05:34 -0000
> Jon Harrop <jonathandeanharrop@googlemail.com> wrote:
> > Edwin wrote:
> > > AFAICT LLVM's OCaml bindings are only good for generating LLVM IR
> > > from OCaml, not for actually performing transformations on it
> > > (there is no binding to retrieve the type of a value for example).
> > > I'll probably be looking into fixing that in the near future, and
> > > this may indirectly help your LLVM backend (if you intend to write
> > > OCaml specific transformations on the LLVM IR).
> >
> > That's a lot of work. Wouldn't it be preferable to do the passes on
> > the OCaml side and focus on generating high quality LLVM IR?
> 
> Yes, that is probably a better approach (generating the optimized IR in
> the first place).

FWIW, just the basic existing bindings were still buggy last I looked. For
example, HLVM had to use a workaround to add a dummy argument to a function
with no arguments because the bindings didn't handle that case correctly.
Ironing the bugs out of the existing bindings would be more useful than
making them even bigger, IMHO.

Going back to Richard's idea, writing an OCaml library that uses LLVM to JIT
interface code on-the-fly as a replacement for writing separate C stubs
would be incredibly useful and you could even use it to interface to LLVM
itself more easily!

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 23:41           ` Jon Harrop
@ 2010-12-13  2:13             ` Eray Ozkural
  0 siblings, 0 replies; 21+ messages in thread
From: Eray Ozkural @ 2010-12-13  2:13 UTC (permalink / raw)
  To: Jon Harrop; +Cc: Török Edwin, caml-list

[-- Attachment #1: Type: text/plain, Size: 347 bytes --]

It's better to focus on low-level optimizations within LLVM IR I think,
high-level optimizations would better be done beforehand. It's not a bad
marriage though. :)

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct

[-- Attachment #2: Type: text/html, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-12 19:09   ` Benedikt Meurer
                       ` (2 preceding siblings ...)
  2010-12-12 21:50     ` Jon Harrop
@ 2010-12-13  8:43     ` Alain Frisch
  2010-12-15 10:29       ` Benedikt Meurer
  3 siblings, 1 reply; 21+ messages in thread
From: Alain Frisch @ 2010-12-13  8:43 UTC (permalink / raw)
  To: Benedikt Meurer; +Cc: caml-list

On 12/12/2010 08:09 PM, Benedikt Meurer wrote:
> The boxing involved is relevant, but boxing in general is not the
> issue. In this special case, the "let nlen, n = if..." code requires
> heap allocation, because of the way the pattern is compiled. This could
> be fixed by moving the condition out of the code and using two if's to
> select n/nlen separately (doesn't speed up that much). Fixing the
> pattern compiler to handle these cases might be interesting for general
> benefit.

Instead of duplicating the conditional, you could also push the 
assignments to bound variables down the expression. For instance:

let (x, y) = if b then (u, v) else (v, u) in ...

can be replaced, conceptually, by:

let x = <dummy> in
let y = <dummy> in
if b then (x <- u; y <- v) else (x <- v; y <- u);
...

and similarly when the bound expression is a pattern matching.


I've played with this a few months ago and could observe important 
speedups (27%, 20%) on two micro-benchmarks.

The diff is really small:

http://caml.inria.fr/cgi-bin/viewcvs.cgi/ocaml/branches/inplace_let/bytecomp/matching.ml?rev=10475&sortby=date&r2=10475&r1=10474


-- Alain



Micro benchmark 1:

let () =
   for k = 1 to 10000 do
     for i = 1 to 100000 do
       let (x, y) =
         if i mod 2 = 0 then (1, i * 2)
         else (2, i * 3)
       in
       r := !r * x + y
     done
   done



Micro benchmark 2:

let f x y z =
   let a, b =
     match z with
     | Some (u, v) -> u, v * 2
     | None -> 10, 20
   in
   a * x + b * y

let () =
   let r = ref 0 in
   for k = 1 to 2000 do
     for i = 1 to 100000 do
       r := !r + f k i (Some (k, i));
       r := !r + f k i None
     done
   done


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types
  2010-12-12 19:20     ` John Carr
@ 2010-12-14  9:43       ` Goswin von Brederlow
  0 siblings, 0 replies; 21+ messages in thread
From: Goswin von Brederlow @ 2010-12-14  9:43 UTC (permalink / raw)
  To: John Carr; +Cc: Benedikt Meurer, caml-list

John Carr <jfc@MIT.EDU> writes:

>> I don't see the point of optimizing for x86-32
>
> I'm using 32 bit ocaml because my program uses too much memory in 64
> bit mode.  If there were an ocaml that used 32 bit words in 64 bit
> mode, I would use that instead.
>
> Early 32 to 64 bit transitions offered 32 bit pointers with 64 bit
> registers, called TSO on Alpha and n32 on MIPS.  AMD and Intel did not.

There is a patch for gcc for this for amd64.

It could be nice to implement this for ocaml too. It would mostly
improve the Int64 module. But the extra registers in 64bit mode should
help performance overall too.

MfG
        Goswin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types
  2010-12-12 15:55 ` Török Edwin
  2010-12-12 17:14   ` Jon Harrop
  2010-12-12 19:09   ` Benedikt Meurer
@ 2010-12-14  9:54   ` Goswin von Brederlow
  2 siblings, 0 replies; 21+ messages in thread
From: Goswin von Brederlow @ 2010-12-14  9:54 UTC (permalink / raw)
  To: Edwin; +Cc: Jon Harrop, caml-list

TÃ¶rÃ¶k Edwin <edwintorok@gmail.com> writes:

> On Sun, 12 Dec 2010 14:54:14 -0000
> "Jon Harrop" <jon@ffconsultancy.com> wrote:
>
>> The Haskell guys got their best performance improvement moving to
>> LLVM from the hailstone benchmark so it is interesting to examine
>> this case as well. I just added support for 64-bit ints to HLVM to
>> implement that benchmark and my code is:
>> 
>> Hereâ€™s the equivalent OCaml code:
>> 
>>   let rec collatzLen(c, n) : int =
>>     if n = 1L then c else
>>       collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else
>
> OK, but boxing itself has nothing to do with the performance degration
> here. It is the lack of compiler optimizations on the Int64 type. This
> could be solved by implementing compiler optimizations for it (or
> manually optimizing some integer arithmetic that is slow).
>
> Lets run the code under a profiler, or look at the assembly (I used
> 'perf record' and 'perf report'). 2 'idiv' instructions turn up as top
> offenders in the profile.
>
> Problem #1: Int64.rem n 2 -> another idiv instruction
>
> A C compiler would optimize this to an 'and' instruction.
> Change that to 'Int64.logand n 1L = 0L'/

But C would have an uint64_t there (if you are smart). Otherwise that
isn't equivalent:

void foo(int64_t x) {
  a = x % 2;
}

0000000000000000 <foo>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   48 c1 e8 3f             shr    $0x3f,%rax
   7:   48 01 c7                add    %rax,%rdi
   a:   83 e7 01                and    $0x1,%edi
   d:   48 29 c7                sub    %rax,%rdi
  10:   48 89 3d 00 00 00 00    mov    %rdi,0x0(%rip)        # 17 <foo+0x17>
  17:   c3                      retq   

> Problem #2: Int64.div n 2 -> idiv instruction. 
>
> A C compiler would optimize this to a right shift. Changing that to 'Int64.shift_right n 1' speeds
> up the code.

Same thing again:

void foo(int64_t x) {
  a = x / 2;
}

0000000000000000 <foo>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   48 c1 e8 3f             shr    $0x3f,%rax
   7:   48 8d 3c 38             lea    (%rax,%rdi,1),%rdi
   b:   48 d1 ff                sar    %rdi
   e:   48 89 3d 00 00 00 00    mov    %rdi,0x0(%rip)        # 15 <foo+0x15>
  15:   c3                      retq   

In both cases gcc avoids the expensive idiv instruction but it needs to
handle the sign of the number with some tricks. Still faster than idiv
though.


An UInt64 module would be nice here.

MfG
        Goswin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-13  8:43     ` Alain Frisch
@ 2010-12-15 10:29       ` Benedikt Meurer
  2010-12-15 13:15         ` Jon Harrop
  0 siblings, 1 reply; 21+ messages in thread
From: Benedikt Meurer @ 2010-12-15 10:29 UTC (permalink / raw)
  To: caml-list


On Dec 13, 2010, at 09:43 , Alain Frisch wrote:

> On 12/12/2010 08:09 PM, Benedikt Meurer wrote:
>> The boxing involved is relevant, but boxing in general is not the
>> issue. In this special case, the "let nlen, n = if..." code requires
>> heap allocation, because of the way the pattern is compiled. This could
>> be fixed by moving the condition out of the code and using two if's to
>> select n/nlen separately (doesn't speed up that much). Fixing the
>> pattern compiler to handle these cases might be interesting for general
>> benefit.
> 
> Instead of duplicating the conditional, you could also push the assignments to bound variables down the expression. For instance:
> 
> let (x, y) = if b then (u, v) else (v, u) in ...
> 
> can be replaced, conceptually, by:
> 
> let x = <dummy> in
> let y = <dummy> in
> if b then (x <- u; y <- v) else (x <- v; y <- u);
> ...
> 
> and similarly when the bound expression is a pattern matching.
> 
> 
> I've played with this a few months ago and could observe important speedups (27%, 20%) on two micro-benchmarks.
> 
> The diff is really small:
> 
> http://caml.inria.fr/cgi-bin/viewcvs.cgi/ocaml/branches/inplace_let/bytecomp/matching.ml?rev=10475&sortby=date&r2=10475&r1=10474

Nice. But it would be even better to avoid the dummy, in your example

  let x = u in
  let y = v in
  if b then x <- v; y <- u

This does not only avoid the dummy, but would also allow lowering to "cmovcc" instructions in the backend selector (atleast for x86-32/64). 

> -- Alain

Benedikt

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Value types (Was: [Caml-list] ocamlopt LLVM support)
  2010-12-15 10:29       ` Benedikt Meurer
@ 2010-12-15 13:15         ` Jon Harrop
  0 siblings, 0 replies; 21+ messages in thread
From: Jon Harrop @ 2010-12-15 13:15 UTC (permalink / raw)
  To: caml-list

Benedikt wrote:
> Nice. But it would be even better to avoid the dummy, in your example
> 
>   let x = u in
>   let y = v in
>   if b then x <- v; y <- u
> 
> This does not only avoid the dummy, but would also allow lowering to
> "cmovcc" instructions in the backend selector (atleast for x86-32/64).

FWIW, when you're using LLVM (as a library!) to generate equivalent code
then you can just represent the tuples as structs (value types) and LLVM
will take care of all of this for you. Again, it generates really efficient
code with minimal effort.

The latest example front-end for HLVM represents tuples as structs so it
gets this for free. However, polymorphic recursion is *much* harder to
implement with that design choice. Also, there is the question of when boxed
vs unboxed tuples are more efficient.

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2010-12-15 13:15 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-12 14:54 Value types (Was: [Caml-list] ocamlopt LLVM support) Jon Harrop
2010-12-12 15:55 ` Török Edwin
2010-12-12 17:14   ` Jon Harrop
2010-12-12 17:26     ` Török Edwin
2010-12-12 18:01       ` Jon Harrop
2010-12-12 18:22         ` Török Edwin
2010-12-12 19:09   ` Benedikt Meurer
2010-12-12 19:20     ` John Carr
2010-12-14  9:43       ` Value types Goswin von Brederlow
2010-12-12 19:55     ` Value types (Was: [Caml-list] ocamlopt LLVM support) Török Edwin
2010-12-12 22:05       ` Jon Harrop
2010-12-12 22:27         ` Török Edwin
2010-12-12 23:41           ` Jon Harrop
2010-12-13  2:13             ` Eray Ozkural
2010-12-12 21:50     ` Jon Harrop
2010-12-13  8:43     ` Alain Frisch
2010-12-15 10:29       ` Benedikt Meurer
2010-12-15 13:15         ` Jon Harrop
2010-12-14  9:54   ` Value types Goswin von Brederlow
2010-12-12 19:53 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Brian Hurt
2010-12-12 20:39   ` Jon Harrop

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).