From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by yquem.inria.fr (Postfix) with ESMTP id AB378BC57 for ; Sun, 12 Dec 2010 20:09:06 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: ArcAAGOsBE3RVaEtimdsb2JhbACkAQgVAQEBCgkMBw8GJaYdjAgBBYxzAQSFSg X-IronPort-AV: E=Sophos;i="4.59,333,1288566000"; d="scan'208";a="92064972" Received: from mail-fx0-f45.google.com ([209.85.161.45]) by mail1-smtp-roc.national.inria.fr with ESMTP; 12 Dec 2010 20:09:06 +0100 Received: by fxm12 with SMTP id 12so5588611fxm.4 for ; Sun, 12 Dec 2010 11:09:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=domainkey-signature:received:received:from:mime-version :content-type:subject:date:in-reply-to:to:references:message-id :x-mailer; bh=x+JRSiUQv/k9VI4YA36nRcHCXYTB4Gl3hvNXHeYa2zw=; b=I62cxxb2dgYqrJe5sM0arxksSQEgTvDemw/rZDBK+i8IdphEcgB/1hODdHkwva29Wx yGyz+Tp23Lrhl5q3v7S5H/ejFkuR/dL+aLNDoWXwfPLU27y8van2TGvdRC4BtOBYROOh C8AICx38ZpMYlwXBYluvKZlZ9qUyUFr11KcqA= DomainKey-Signature: a=rsa-sha1; c=nofws; d=googlemail.com; s=gamma; h=from:mime-version:content-type:subject:date:in-reply-to:to :references:message-id:x-mailer; b=tL/e7z6WFYKhS1bHygo+9JqMHwS6f0987FzrY/IDRxURWDaP+H/M3CdX2JDxN0pSGR OX/glxP/9KUAY1YcZUQ0x9fWCtEs/SDYKRK1Hz5aha2tnRzpj66HchmcIw0aoQ8aHGbV J3hdW+0GzNHwwrBGhLspQCH8w9/Exr7Ru/6KU= Received: by 10.223.125.68 with SMTP id x4mr3411510far.65.1292180943706; Sun, 12 Dec 2010 11:09:03 -0800 (PST) Received: from bespin.kosmos.all (ip-95-223-170-32.unitymediagroup.de [95.223.170.32]) by mx.google.com with ESMTPS id n3sm1461696faa.5.2010.12.12.11.09.01 (version=SSLv3 cipher=RC4-MD5); Sun, 12 Dec 2010 11:09:02 -0800 (PST) From: Benedikt Meurer Mime-Version: 1.0 (Apple Message framework v1081) Content-Type: multipart/mixed; boundary=Apple-Mail-247-341128255 Subject: Re: Value types (Was: [Caml-list] ocamlopt LLVM support) Date: Sun, 12 Dec 2010 20:09:00 +0100 In-Reply-To: <20101212175524.73a8e285@deb0> To: caml-list@yquem.inria.fr References: <036001cb9a0c$725acef0$57106cd0$@com> <20101212175524.73a8e285@deb0> Message-Id: <9264BEE6-DBAE-4523-93AC-4560615D2AC5@googlemail.com> X-Mailer: Apple Mail (2.1081) X-Spam: no; 0.00; ocamlopt:01 haskell:01 ocaml:01 compiler:01 compiler:01 'int:01 'int:01 ocamlopt:01 int's:01 nativeint:01 unboxing:01 unboxing:01 unboxed:01 cmmgen:01 ocaml:01 X-Attachments: type="application/octet-stream" name="ocamlopt-comballoc-ifthenelse.patch" name="ocamlopt-comballoc-ifthenelse.patch" type="application/octet-stream" name="ocamlopt-natint.patch" name="ocamlopt-natint.patch" --Apple-Mail-247-341128255 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252 On Dec 12, 2010, at 16:55 , T=F6r=F6k Edwin wrote: > On Sun, 12 Dec 2010 14:54:14 -0000 > "Jon Harrop" wrote: >=20 >> The Haskell guys got their best performance improvement moving to >> LLVM from the hailstone benchmark so it is interesting to examine >> this case as well. I just added support for 64-bit ints to HLVM to >> implement that benchmark and my code is: >>=20 >> Here=92s the equivalent OCaml code: >>=20 >> let rec collatzLen(c, n) : int =3D >> if n =3D 1L then c else >> collatzLen (c+1, if Int64.rem n 2L =3D 0L then Int64.div n 2L = else >=20 > OK, but boxing itself has nothing to do with the performance degration > here. It is the lack of compiler optimizations on the Int64 type. This > could be solved by implementing compiler optimizations for it (or > manually optimizing some integer arithmetic that is slow). >=20 > Lets run the code under a profiler, or look at the assembly (I used > 'perf record' and 'perf report'). 2 'idiv' instructions turn up as top > offenders in the profile. >=20 > Problem #1: Int64.rem n 2 -> another idiv instruction >=20 > A C compiler would optimize this to an 'and' instruction. > Change that to 'Int64.logand n 1L =3D 0L'/ >=20 > Problem #2: Int64.div n 2 -> idiv instruction.=20 >=20 > A C compiler would optimize this to a right shift. Changing that to = 'Int64.shift_right n 1' speeds > up the code. This is easy to fix in ocamlopt (see attached patch = ocamlopt-natint.patch), by applying the same optimizations already used = for constant int's to constant natint's (Int64 is Nativeint on 64bit). = Note however, that "mod 2" is not really "and 1", neither is "div 2" = equivalent to "lsr 1"; that would be the case for unsigned arithmetic = (doesn't matter in your example, tho). I don't see the point of optimizing for x86-32 (neither would I spend my = time optimizing anything for VAX these days), but it would be possible = to add appropriate cases for Int64 on x86 as well (regalloc would be = most difficult here, since that requires support for register pairs to = perform 64bit arithmetic). >> 1. Unboxing can give huge performance improvements on serial code, >=20 > s/Unboxing/arithmetic optimizations/ > Please find an example where the performance benefit is due to > unboxing, and not due to arithmetic optimizations performed on the > unboxed code. The boxing involved is relevant, but boxing in general is not the issue. = In this special case, the "let nlen, n =3D if..." code requires heap = allocation, because of the way the pattern is compiled. This could be = fixed by moving the condition out of the code and using two if's to = select n/nlen separately (doesn't speed up that much). Fixing the = pattern compiler to handle these cases might be interesting for general = benefit. I already mentioned this multiple times, but here we go again: Unboxing = optimizations may indeed prove useful if applied wisely (cmmgen.ml is of = special interest here, the unboxing optimizations are more or less = special cases; that could be extended to include interesting cases like = moving boxing out of if-then-else in return position, etc). But (here comes the special "Harrop note") this has absolutely nothing = to do with LLVM (and of course really, really, really nothing to do with = HLVM). Using a different data representation for the heap requires a = nearly complete rewrite of the OCaml system (you would probably need to = start at the Typing level); if one wants to do this, enjoy and come back = with code. But even then, data representation issues will have to be = considered long before it comes to actual code generation (if you are = serious, you'll have to think about the runtime first prior to talking = about code generation for a non-existing runtime), so even then it has = nothing to do with LLVM (or C-- or C or whatever IR you can think of). Combining alloc's across if-then-else constructs further reduces code = size in your example (and probably other programs as well), see attached = file ocamlopt-comballoc-ifthenelse.patch. It's quick&dirty, but it = should illustrate the idea. >> let alone parallel code. The optimized HLVM is running 32=D7 faster >> than the OCaml here. >>=20 >> 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it >> to beat GCC-compiled C code here. >>=20 >=20 > One advantage of using LLVM is that it would notice arithmetic > optimizations like this and perform it itself (even if you use the > boxed representation). In case of x86-32, it won't, simply because LLVM will be presented with = the calls to caml_int32_* functions. You'd need to change the Cmm code = instead (changing the low-level stuff is straight-forward as = demonstrated). For 64bit targets, see attached patch. This doesn't mean that LLVM wouldn't be useful (in fact, I've just = started an LLVM backend for ocamlopt). But it is important to note that = LLVM is not the solution to everything. As the name implies, it's "low = level", it does a few "higher level" optimizations for C, but these are = special cases (and somewhat ugly if you take the time to read the code). = It won't make a faster OCaml magically, just like it didn't make a = faster Haskell by itself. I could go on by quoting common "Harrop jokes" like "you need types in = the low-level IR", etc. trying to tell him that this is simply wrong; = but after reading through the Caml/LISP mailing list archives (thanks = for the pointers by several readers), I'm pretty much convinced that Jon = simply decided to end his war against LISP just to start a new one = against ocamlopt. If anyone is serious about ocamlopt with LLVM, feel free to contact me = (tho, my time is limited atm). > Best regards, > --Edwin greets, Benedikt --Apple-Mail-247-341128255 Content-Disposition: attachment; filename=ocamlopt-comballoc-ifthenelse.patch Content-Type: application/octet-stream; name="ocamlopt-comballoc-ifthenelse.patch" Content-Transfer-Encoding: 7bit diff --git a/asmcomp/comballoc.ml b/asmcomp/comballoc.ml index 5a862b1..82c01f9 100644 --- a/asmcomp/comballoc.ml +++ b/asmcomp/comballoc.ml @@ -27,64 +27,78 @@ let allocated_size = function No_alloc -> 0 | Pending_alloc(reg, ofs) -> ofs +let instr_cons_alloc sz a r n = + if sz != 0 + then instr_cons (Iop(Ialloc sz)) a r n + else n + let rec combine i allocstate = match i.desc with Iend | Ireturn | Iexit _ | Iraise -> - (i, allocated_size allocstate) + (i, allocated_size allocstate, true) | Iop(Ialloc sz) -> begin match allocstate with No_alloc -> - let (newnext, newsz) = + let (newnext, newsz, _) = combine i.next (Pending_alloc(i.res.(0), sz)) in - (instr_cons (Iop(Ialloc newsz)) i.arg i.res newnext, 0) + (instr_cons_alloc newsz i.arg i.res newnext, 0, false) | Pending_alloc(reg, ofs) -> if ofs + sz < Config.max_young_wosize then begin - let (newnext, newsz) = + let (newnext, newsz, safe) = combine i.next (Pending_alloc(reg, ofs + sz)) in - (instr_cons (Iop(Iintop_imm(Iadd, ofs))) [| reg |] i.res newnext, - newsz) + if sz != 0 && ofs != 0 then + (instr_cons (Iop(Iintop_imm(Iadd, ofs))) [|reg|] i.res newnext, newsz, safe) + else if sz != 0 then + (instr_cons (Iop Imove) [|reg|] i.res newnext, newsz, safe) + else + (newnext, newsz, safe) end else begin - let (newnext, newsz) = + let (newnext, newsz, _) = combine i.next (Pending_alloc(i.res.(0), sz)) in - (instr_cons (Iop(Ialloc newsz)) i.arg i.res newnext, ofs) + (instr_cons_alloc newsz i.arg i.res newnext, ofs, false) end end | Iop(Icall_ind | Icall_imm _ | Iextcall _ | Itailcall_ind | Itailcall_imm _) -> let newnext = combine_restart i.next in - (instr_cons_debug i.desc i.arg i.res i.dbg newnext, - allocated_size allocstate) + (instr_cons_debug i.desc i.arg i.res i.dbg newnext, allocated_size allocstate, false) | Iop op -> - let (newnext, sz) = combine i.next allocstate in - (instr_cons_debug i.desc i.arg i.res i.dbg newnext, sz) + let (newnext, sz, safe) = combine i.next allocstate in + (instr_cons_debug i.desc i.arg i.res i.dbg newnext, sz, safe) | Iifthenelse(test, ifso, ifnot) -> - let newifso = combine_restart ifso in - let newifnot = combine_restart ifnot in - let newnext = combine_restart i.next in - (instr_cons (Iifthenelse(test, newifso, newifnot)) i.arg i.res newnext, - allocated_size allocstate) + begin match allocstate, combine ifso allocstate, combine ifnot allocstate with + Pending_alloc(reg, ofs), (newifso, szifso, true), (newifnot, szifnot, true) when szifso = szifnot -> + let (newnext, sznext, safe) = combine i.next (Pending_alloc(reg, ofs + szifso)) in + (instr_cons (Iifthenelse(test, newifso, newifnot)) i.arg i.res newnext, + sznext, + safe) + | _, _, _ -> + let newifso = combine_restart ifso in + let newifnot = combine_restart ifnot in + let newnext = combine_restart i.next in + (instr_cons (Iifthenelse(test, newifso, newifnot)) i.arg i.res newnext, + allocated_size allocstate, false) + end | Iswitch(table, cases) -> let newcases = Array.map combine_restart cases in let newnext = combine_restart i.next in - (instr_cons (Iswitch(table, newcases)) i.arg i.res newnext, - allocated_size allocstate) + (instr_cons (Iswitch(table, newcases)) i.arg i.res newnext, allocated_size allocstate, false) | Iloop(body) -> let newbody = combine_restart body in - (instr_cons (Iloop(newbody)) i.arg i.res i.next, - allocated_size allocstate) + (instr_cons (Iloop(newbody)) i.arg i.res i.next, allocated_size allocstate, false) | Icatch(io, body, handler) -> - let (newbody, sz) = combine body allocstate in + let (newbody, sz, _) = combine body allocstate in let newhandler = combine_restart handler in let newnext = combine_restart i.next in - (instr_cons (Icatch(io, newbody, newhandler)) i.arg i.res newnext, sz) + (instr_cons (Icatch(io, newbody, newhandler)) i.arg i.res newnext, sz, false) | Itrywith(body, handler) -> - let (newbody, sz) = combine body allocstate in + let (newbody, sz, _) = combine body allocstate in let newhandler = combine_restart handler in let newnext = combine_restart i.next in - (instr_cons (Itrywith(newbody, newhandler)) i.arg i.res newnext, sz) + (instr_cons (Itrywith(newbody, newhandler)) i.arg i.res newnext, sz, false) and combine_restart i = - let (newi, _) = combine i No_alloc in newi + let (newi, _, _) = combine i No_alloc in newi let fundecl f = {f with fun_body = combine_restart f.fun_body} diff --git a/asmcomp/selectgen.ml b/asmcomp/selectgen.ml index 7daa239..4221f95 100644 --- a/asmcomp/selectgen.ml +++ b/asmcomp/selectgen.ml @@ -562,6 +562,9 @@ method emit_expr env exp = let (rif, sif) = self#emit_sequence env eif in let (relse, selse) = self#emit_sequence env eelse in let r = join rif sif relse selse in + (* Dummy Ialloc 0 for comballoc.ml *) + let ra = self#regs_for typ_addr in + self#insert (Iop(Ialloc 0)) [||] ra; self#insert (Iifthenelse(cond, sif#extract, selse#extract)) rarg [||]; r @@ -790,6 +793,9 @@ method emit_tail env exp = begin match self#emit_expr env earg with None -> () | Some rarg -> + (* Dummy Ialloc 0 for comballoc.ml *) + let ra = self#regs_for typ_addr in + self#insert (Iop(Ialloc 0)) [||] ra; self#insert (Iifthenelse(cond, self#emit_tail_sequence env eif, self#emit_tail_sequence env eelse)) rarg [||] --Apple-Mail-247-341128255 Content-Disposition: attachment; filename=ocamlopt-natint.patch Content-Type: application/octet-stream; name="ocamlopt-natint.patch" Content-Transfer-Encoding: 7bit diff --git a/asmcomp/amd64/selection.ml b/asmcomp/amd64/selection.ml index 4921e51..bce8104 100644 --- a/asmcomp/amd64/selection.ml +++ b/asmcomp/amd64/selection.ml @@ -168,6 +168,9 @@ method! select_operation op args = [arg1; Cconst_int n] when self#is_immediate n && n = 1 lsl (Misc.log2 n) -> (Iintop_imm(Idiv, n), [arg1]) + | [arg1; Cconst_natint n] when self#is_immediate_natint n + && n = Nativeint.shift_left 1n (Misc.log2 (Nativeint.to_int n)) -> + (Iintop_imm(Idiv, Nativeint.to_int n), [arg1]) | _ -> (Iintop Idiv, args) end | Cmodi -> @@ -175,6 +178,9 @@ method! select_operation op args = [arg1; Cconst_int n] when self#is_immediate n && n = 1 lsl (Misc.log2 n) -> (Iintop_imm(Imod, n), [arg1]) + | [arg1; Cconst_natint n] when self#is_immediate_natint n + && n = Nativeint.shift_left 1n (Misc.log2 (Nativeint.to_int n)) -> + (Iintop_imm(Imod, Nativeint.to_int n), [arg1]) | _ -> (Iintop Imod, args) end (* Recognize float arithmetic with memory. *) diff --git a/asmcomp/selectgen.ml b/asmcomp/selectgen.ml index 50f949a..7daa239 100644 --- a/asmcomp/selectgen.ml +++ b/asmcomp/selectgen.ml @@ -201,6 +201,8 @@ method is_simple_expr = function method virtual is_immediate : int -> bool +method virtual is_immediate_natint : nativeint -> bool + (* Selection of addressing modes *) method virtual select_addressing : @@ -238,11 +240,21 @@ method select_operation op args = if n = 1 lsl l then (Iintop_imm(Ilsl, l), [arg1]) else self#select_arith_comm Imul args + | (Cmuli, [arg1; Cconst_natint n]) -> + let l = Misc.log2 (Nativeint.to_int n) in + if n = Nativeint.shift_left 1n l + then (Iintop_imm(Ilsl, l), [arg1]) + else self#select_arith_comm Imul args | (Cmuli, [Cconst_int n; arg1]) -> let l = Misc.log2 n in if n = 1 lsl l then (Iintop_imm(Ilsl, l), [arg1]) else self#select_arith_comm Imul args + | (Cmuli, [Cconst_natint n; arg1]) -> + let l = Misc.log2 (Nativeint.to_int n) in + if n = Nativeint.shift_left 1n l + then (Iintop_imm(Ilsl, l), [arg1]) + else self#select_arith_comm Imul args | (Cmuli, _) -> self#select_arith_comm Imul args | (Cdivi, _) -> self#select_arith Idiv args | (Cmodi, _) -> self#select_arith_comm Imod args @@ -270,38 +282,60 @@ method select_operation op args = method private select_arith_comm op = function [arg; Cconst_int n] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [arg; Cconst_natint n] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | [arg; Cconst_pointer n] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [arg; Cconst_natpointer n] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | [Cconst_int n; arg] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [Cconst_natint n; arg] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | [Cconst_pointer n; arg] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [Cconst_natpointer n; arg] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | args -> (Iintop op, args) method private select_arith op = function [arg; Cconst_int n] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [arg; Cconst_natint n] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | [arg; Cconst_pointer n] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [arg; Cconst_natpointer n] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | args -> (Iintop op, args) method private select_shift op = function [arg; Cconst_int n] when n >= 0 && n < Arch.size_int * 8 -> (Iintop_imm(op, n), [arg]) + | [arg; Cconst_natint n] when n >= 0n && n < Nativeint.of_int (Arch.size_int * 8) -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | args -> (Iintop op, args) method private select_arith_comp cmp = function [arg; Cconst_int n] when self#is_immediate n -> (Iintop_imm(Icomp cmp, n), [arg]) + | [arg; Cconst_natint n] when self#is_immediate_natint n -> + (Iintop_imm(Icomp cmp, Nativeint.to_int n), [arg]) | [arg; Cconst_pointer n] when self#is_immediate n -> (Iintop_imm(Icomp cmp, n), [arg]) + | [arg; Cconst_natpointer n] when self#is_immediate_natint n -> + (Iintop_imm(Icomp cmp, Nativeint.to_int n), [arg]) | [Cconst_int n; arg] when self#is_immediate n -> (Iintop_imm(Icomp(swap_intcomp cmp), n), [arg]) + | [Cconst_natint n; arg] when self#is_immediate_natint n -> + (Iintop_imm(Icomp(swap_intcomp cmp), Nativeint.to_int n), [arg]) | [Cconst_pointer n; arg] when self#is_immediate n -> (Iintop_imm(Icomp(swap_intcomp cmp), n), [arg]) + | [Cconst_natpointer n; arg] when self#is_immediate_natint n -> + (Iintop_imm(Icomp(swap_intcomp cmp), Nativeint.to_int n), [arg]) | args -> (Iintop(Icomp cmp), args) @@ -310,12 +344,20 @@ method private select_arith_comp cmp = function method select_condition = function Cop(Ccmpi cmp, [arg1; Cconst_int n]) when self#is_immediate n -> (Iinttest_imm(Isigned cmp, n), arg1) + | Cop(Ccmpi cmp, [arg1; Cconst_natint n]) when self#is_immediate_natint n -> + (Iinttest_imm(Isigned cmp, Nativeint.to_int n), arg1) | Cop(Ccmpi cmp, [Cconst_int n; arg2]) when self#is_immediate n -> (Iinttest_imm(Isigned(swap_comparison cmp), n), arg2) + | Cop(Ccmpi cmp, [Cconst_natint n; arg2]) when self#is_immediate_natint n -> + (Iinttest_imm(Isigned(swap_comparison cmp), Nativeint.to_int n), arg2) | Cop(Ccmpi cmp, [arg1; Cconst_pointer n]) when self#is_immediate n -> (Iinttest_imm(Isigned cmp, n), arg1) + | Cop(Ccmpi cmp, [arg1; Cconst_natpointer n]) when self#is_immediate_natint n -> + (Iinttest_imm(Isigned cmp, Nativeint.to_int n), arg1) | Cop(Ccmpi cmp, [Cconst_pointer n; arg2]) when self#is_immediate n -> (Iinttest_imm(Isigned(swap_comparison cmp), n), arg2) + | Cop(Ccmpi cmp, [Cconst_natpointer n; arg2]) when self#is_immediate_natint n -> + (Iinttest_imm(Isigned(swap_comparison cmp), Nativeint.to_int n), arg2) | Cop(Ccmpi cmp, args) -> (Iinttest(Isigned cmp), Ctuple args) | Cop(Ccmpa cmp, [arg1; Cconst_pointer n]) when self#is_immediate n -> diff --git a/asmcomp/selectgen.mli b/asmcomp/selectgen.mli index 7c30f9f..8966ad1 100644 --- a/asmcomp/selectgen.mli +++ b/asmcomp/selectgen.mli @@ -23,6 +23,7 @@ class virtual selector_generic : object (* The following methods must or can be overridden by the processor description *) method virtual is_immediate : int -> bool + method virtual is_immediate_natint : nativeint -> bool (* Must be defined to indicate whether a constant is a suitable immediate operand to arithmetic instructions *) method virtual select_addressing : --Apple-Mail-247-341128255--