From: Jon Harrop
To: 'Benedikt Meurer', caml-list@yquem.inria.fr
Subject: RE: ocamlopt LLVM support (Was: [Caml-list] OCamlJIT2 vs. OCamlJIT)
Date: Mon, 6 Dec 2010 22:38:56 -0000

Benedikt wrote:
> Also looking at the GHC work you mentioned, they also start at the Cmm
> level (slightly different C--, but comparable to ocamlopt), with mostly
> the same amount of type information available. And as you said, they
> did quite well in some cases.

Yes.
They did well in benchmarks like hailstone:

  collatzLen :: Int -> Word32 -> Int
  collatzLen c 1 = c
  collatzLen c n = collatzLen (c+1) $ if n `mod` 2 == 0 then n `div` 2 else 3*n+1

  pmax x n = x `max` (collatzLen 1 n, n)

  main = print $ foldl pmax (1,1) [2..1000000]

They did well because they unbox every int and tuple to recover C-like
performance, just as HLVM does thanks to value types, whereas OCaml cannot
and does not. So I would not presume that OCaml+LLVM would be as effective
as GHC+LLVM.

> > And faster tuples, ints, chars, complex numbers, low-dimensional
> > vectors/matrices, hash tables and so on. More types (e.g. int16 and
> > float32). Even arbitrary-precision arithmetic can benefit by keeping
> > small numbers unboxed when possible. Bigarrays disappear. The FFI gets
> > simpler, easier to use and more powerful (e.g. you can generate AoS
> > layouts for OpenGL). The benefits are enormous in the general case but
> > that is beside the point here. Moreover, this would never be accepted
> > either because it would degrade the performance of applications like
> > Coq. If it is done at all, such work must be kept separate from the
> > OCaml compilers.
>
> ocamlopt already supports float32 and int16, though they are not
> exploited in the upper layers

But there is no data representation for them in the heap? Could OCaml's GC
handle a boxed float32 or a float32 array without support for that in the
lower layers?

> (don't know why, but I also don't know why one would need them).

The float32 type is used primarily to halve memory requirements and improve
locality, but also to facilitate SSE. Vector arithmetic with LLVM can be
well over 4x faster than OCaml thanks to SSE, of course.

> Keeping Int32/Int64 values unboxed would be possible similar to what is
> done with doubles; maybe there is simply no need to do so (honestly, I've
> never needed Int32/Int64 so far, int/Natint is usually what you want and
> these are optimized).

Regular int types are useful for applications like pseudo-random number
generators, image processing and signal analysis.

> >> 3. The current garbage collector is mostly straight-forward because
> >> of the data representation. No need to carry around/load type
> >> information,
> >
> > Note that conveying run-time type information also facilitates useful
> > features like generic printing.
>
> Generic printing is currently implemented using the run-time type
> information (the three bits mentioned below).

I'm not sure what you mean by this. OCaml cannot express generic printing,
by which I mean the ability to print a value of an arbitrary type at
run-time and have it pretty printed appropriately according to its type.
For example, the F# code:

  printf "%A" [1;2;3]

prints "[1;2;3]". Similarly in HLVM:

  # print(create(3, 1));;
  [|1; 1; 1|]

OCaml's top level can pretty print only because it knows the type of the
expression. Ordinary compiled user code has no such information, because
OCaml retains only partial type information in its data representation, so
you cannot do this in OCaml.
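
To make the contrast concrete, here is a minimal sketch of the kind of
per-type printer one has to write in plain OCaml, since compiled code has
no run-time type information to dispatch on:

  (* No generic %A exists, so each type needs its own hand-written printer. *)
  let print_int_list xs =
    let rec go first = function
      | [] -> ()
      | x :: rest ->
          if not first then print_string "; ";
          print_int x;
          go false rest
    in
    print_char '[';
    go true xs;
    print_char ']'

  let () = print_int_list [1; 2; 3]  (* prints [1; 2; 3] *)

A second element type, say a float list, needs a second printer; nothing in
the compiled representation tells a single function how to render an
arbitrary value correctly.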

> >> you just need the following bits of information: Is it a block in
> >> the heap? Does it contain pointers? If so how many? All this
> >> information is immediately available with the current data
> >> representation (also improves cache locality of the GC loops).
> >
> > That information is also immediately available with HLVM's design.
> > Each value contains a pointer to its run-time type. Each run-time
> > type contains a pointer to a "visit" function for the GC that returns
> > the references in any value of that type.
>
> This is indeed possible, yes, and also implemented by most Java VMs;
> each object ships a type-info pointer. OCaml avoids the type-info
> pointer using clever header field encoding. A matter of taste, probably.

Beyond taste, it facilitates useful features like generic printing and
reflection.

> >> So, is it really worth it to spend years on a new data representation
> >> (including fixing all C bindings, etc.)? Just to get better floating
> >> point performance? Integer performance will be almost the same;
> >> avoiding the shift/tag bit just replaces an "addq r1, r2; subq $1, r2"
> >> with "addq r1, r2"; even doing this thousands of times will not cause
> >> a noticeable difference.
> >
> > On the contrary, there are many examples of huge performance gains
> > other than floating point. The int-based random number generator from
> > the SciMark2 benchmark is 6x faster with LLVM than with OCaml. Generic
> > hash tables are 17x faster in F# than Java because of value types. And
> > so on.
>
> I doubt that this is really related to the use of "value types", I also
> doubt that the integer performance improvements are related to the
> missing "subq" (must have been an unusual CPU then).

The "subq" is irrelevant, yes. OCaml's poor performance is due to
unnecessary boxing. Value types solve that problem in the general case,
i.e. in the heap as well as on the stack.

> The Java performance is probably related to the GC in most JVMs,

Java's performance is also dire because of boxing, specifically of the
key-value pairs and often of the keys and values themselves. Firstly, that
is due to type erasure instead of reified generics. Secondly, even if you
wrote a type-specialized version, the key-value pair still could not be
unboxed without value types. Thirdly, these limitations of the VM lead to
the use of suboptimal concrete data structures, specifically open hashing
with chains instead of closed hashing.
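
The same boxing shows up in OCaml itself. As a rough sketch (an
illustration, not a benchmark): storing pairs in an array allocates a
separate heap block per element and goes through the write barrier on every
store, whereas the manual workaround of two parallel int arrays keeps
everything flat, which is essentially what value types give you
automatically:

  (* Illustrative sketch: boxed pairs vs. the parallel-array workaround. *)
  let n = 1_000_000

  let boxed : (int * int) array = Array.make n (0, 0)

  let keys = Array.make n 0
  let vals = Array.make n 0

  let () =
    for i = 0 to n - 1 do
      boxed.(i) <- (i, i);  (* allocates a fresh pair on the heap *)
      keys.(i)  <- i;       (* plain unboxed int stores, no allocation *)
      vals.(i)  <- i
    done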

> which is not as well optimized for high speed allocation and compaction
> as the GCs found in most runtimes of functional programming languages.
> But this is just guessing, we'd need some facts to be sure what's the
> cause. I suggest you present some proof-of-concept code in C or LLVM,
> simply using malloc() for memory allocation, comparing a
> straight-forward implementation to a "value type" implementation (with
> everything else the same). If there's still a 17x improvement, I promise
> to rewrite the whole OCaml system during the next three months to
> support value types.

Again, this is most easily studied using HLVM. A similar program using
value types in HLVM that fills an array of unboxed pairs of 32-bit ints
looks like this and takes 0.39s to run:

  let rec fill((a, i0, i2) : (int * int) array * int * int) : unit =
    let di = i2 - i0 in
    if di=0 then () else
      if di=1 then a.[i0] <- (i0, i2) else
        ( let i1 = i0 + di/2 in
          ( fill(a, i0, i1);
            fill(a, i1, i2) ) );;

  let n = 10000000 in
  fill(create(n, (0, 0)), 0, n);;

The following version uses a data representation similar to that of OCaml
and Java by boxing the key-value pair, which is allocated on the heap using
"malloc"; consequently, this version takes 2.37s to run:

  type keyValue = KeyValue of int * int;;

  let rec fill((a, i0, i2) : keyValue array * int * int) : unit =
    let di = i2 - i0 in
    if di=0 then () else
      if di=1 then a.[i0] <- (KeyValue(i0, i2)) else
        ( let i1 = i0 + di/2 in
          ( fill(a, i0, i1);
            fill(a, i1, i2) ) );;

  let n = 10000000 in
  fill(create(n, KeyValue(0, 0)), 0, n);;

So that is already 6× slower, and the problem gets worse when you box the
keys and values themselves within the boxed pair.

You might try to blame the poor performance on malloc, but OCaml is over
12× slower with the following similar operation (which I had to split into
nested arrays because of its 16Mb limit):

  Array.init 1000 (fun i ->
    Array.init 10000 (fun j ->
      let k = i*1000+j in
      Int32.of_int k, Int32.of_int k));;

Generational garbage collection further exacerbates the problem by
unnecessarily copying every allocated value from the young to the old
generation. Remembered sets exacerbate it further by having the write
barrier record a pointer at every write. Hence the huge performance
discrepancies between OCaml/Java and HLVM/.NET in this context.

Cheers,
Jon.