From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id 31A7F7F615 for ; Wed, 21 Dec 2016 10:08:14 +0100 (CET) Authentication-Results: mail2-smtp-roc.national.inria.fr; spf=None smtp.pra=christoph.hoeger@tu-berlin.de; spf=None smtp.mailfrom=christoph.hoeger@tu-berlin.de; spf=None smtp.helo=postmaster@mail.tu-berlin.de Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of christoph.hoeger@tu-berlin.de) identity=pra; client-ip=130.149.7.33; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="christoph.hoeger@tu-berlin.de"; x-sender="christoph.hoeger@tu-berlin.de"; x-conformance=sidf_compatible Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of christoph.hoeger@tu-berlin.de) identity=mailfrom; client-ip=130.149.7.33; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="christoph.hoeger@tu-berlin.de"; x-sender="christoph.hoeger@tu-berlin.de"; x-conformance=sidf_compatible Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of postmaster@mail.tu-berlin.de) identity=helo; client-ip=130.149.7.33; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="christoph.hoeger@tu-berlin.de"; x-sender="postmaster@mail.tu-berlin.de"; x-conformance=sidf_compatible IronPort-PHdr: =?us-ascii?q?9a23=3A5tpRlBFHTVVwqpPzeOv7k51GYnF86YWxBRYc798d?= =?us-ascii?q?s5kLTJ75osuwAkXT6L1XgUPTWs2DsrQf2rGQ4/irADdYqdbZ6TZZL8wKD0dEwe?= =?us-ascii?q?wt3CUeQ+e9QXXhK/DrayFoVO9jb3RCu0+BDE5OBczlbEfTqHDhpRQbGxH4KBYn?= =?us-ascii?q?br+tQt2ap42N2uuz45zeZRlTzHr4OOsqbUb+kQKEl9cfh8NNLbo21BDJo2dTM7?= =?us-ascii?q?BX22xAJF+eklD7/MjmuNZS9DhZvroL/tRGVrSyK7U/UbVdBj08NWckzMLuvBjH?= =?us-ascii?q?CwCI4y1PfH8Rl09jAxLE9w39RpfGkrX1u/A1jCKaJ8ztUbcsWXKi6KpkRQXAlD?= =?us-ascii?q?pCPTMj9GDRzMB92vEI6Cm9rgByltaHKLqeM+BzK+aEJYsX?= X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0ATAACQRVpYhyEHlYJUCRsBAQEDAQEBC?= =?us-ascii?q?QEBARYBAQEDAQEBCQEBAYMKAQEBAQF5gQaNUJZZh3aNHIIKJoV8AoFuPxQBAQE?= =?us-ascii?q?BAQEBAQEBARIBAQEKCwkJHTCCMwQBFQEEghcBBSMyFwoDEAsOCicDAgIhJREGA?= =?us-ascii?q?QwGAgEBBRKINgMXBAqoGIIoh0INg0MBAQEBBgEBAQEBFA+IMwiBTIEIgkiBUTO?= =?us-ascii?q?CQDiCXQWIVw6QT4ENNYNjgXyHYwOFZ4UGgyKGNIgWgVWEOIQPH4INg3WBaXEBh?= =?us-ascii?q?hoqRIFPAQEB?= X-IPAS-Result: =?us-ascii?q?A0ATAACQRVpYhyEHlYJUCRsBAQEDAQEBCQEBARYBAQEDAQE?= =?us-ascii?q?BCQEBAYMKAQEBAQF5gQaNUJZZh3aNHIIKJoV8AoFuPxQBAQEBAQEBAQEBARIBA?= =?us-ascii?q?QEKCwkJHTCCMwQBFQEEghcBBSMyFwoDEAsOCicDAgIhJREGAQwGAgEBBRKINgM?= =?us-ascii?q?XBAqoGIIoh0INg0MBAQEBBgEBAQEBFA+IMwiBTIEIgkiBUTOCQDiCXQWIVw6QT?= =?us-ascii?q?4ENNYNjgXyHYwOFZ4UGgyKGNIgWgVWEOIQPH4INg3WBaXEBhhoqRIFPAQEB?= X-IronPort-AV: E=Sophos;i="5.33,382,1477954800"; d="asc'?scan'208";a="250989086" Received: from mail.tu-berlin.de ([130.149.7.33]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 21 Dec 2016 10:08:12 +0100 X-tubIT-Incoming-IP: 91.66.22.179 Received: from ip5b4216b3.dynamic.kabel-deutschland.de ([91.66.22.179] helo=[192.168.178.42]) by mail.tu-berlin.de (exim-4.84_2/mailfrontend-5) with esmtpa id 1cJcsc-0006i1-7j; Wed, 21 Dec 2016 10:08:11 +0100 To: Ivan Gotovchits , Yotam Barnoy References: <7bc766a2-d460-524b-35ca-89609a34b719@tu-berlin.de> <1482148297.4629.19.camel@gerd-stolpmann.de> Cc: Gerd Stolpmann , caml-list From: =?UTF-8?Q?Christoph_H=c3=b6ger?= Organization: =?UTF-8?Q?Technische_Universit=c3=a4t_Berlin?= Message-ID: Date: Wed, 21 Dec 2016 10:08:09 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1 MIME-Version: 1.0 In-Reply-To: Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="X2d03cXlkdC6C4OsJ1VR6fwvtdqJkwsiB" X-PMX-Version: 6.0.0.2142326, Antispam-Engine: 2.7.2.2107409, Antispam-Data: 2016.12.21.90319 X-PMX-Spam: Gauge=IIIIIII, Probability=0%, Report='' Subject: Re: [Caml-list] Closing the performance gap to C This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --X2d03cXlkdC6C4OsJ1VR6fwvtdqJkwsiB Content-Type: multipart/mixed; boundary="GNA9j0uM0Ti1u8LVhgG6jOxkakTnLJTTw"; protected-headers="v1" From: =?UTF-8?Q?Christoph_H=c3=b6ger?= To: Ivan Gotovchits , Yotam Barnoy Cc: Gerd Stolpmann , caml-list Message-ID: Subject: Re: [Caml-list] Closing the performance gap to C References: <7bc766a2-d460-524b-35ca-89609a34b719@tu-berlin.de> <1482148297.4629.19.camel@gerd-stolpmann.de> In-Reply-To: --GNA9j0uM0Ti1u8LVhgG6jOxkakTnLJTTw Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Dear all, thanks a lot for your valuable input. It seems like there still is something to gain for OCaml, but it is not quite simple to achieve. From your comments I reckon the main culprit is the use of a tail-recursive function instead of a loop, which causes boxing (which causes spilling). I will repeat my experiment with a new setup asap. @Ivan: Your point about the optimization is well taken, but please consider that the runge-kutta integrator is a sophisticated, general numerical method. Usually, the user of such a method has neither the knowledge nor the time to manually adapt it, but rather implements a standard interface. Therefore, the unused arguments and algebraic/trigonometric identities have to be resolved by the compiler. Otherwise, we could as well compute the exact solution, anyways ;). Am 19.12.2016 um 17:59 schrieb Ivan Gotovchits: >> Flambda should be able to do the kind of optimizations Ivan mentions, > at least ideally. >=20 > I've used 4.03+flambda and it didn't perform the optimizations. I've > played a lot with the inline options and=20 > `inline` directive and even noticed a strange (on a first glance) thing, > that the performance is better, when=20 > I explicitly _disable_ of the `trk4_step` function (with `[@@inline never= ]`) >=20 > As a side note, not about compilers, but about mathematics.=20 > Since the three cosine functions, that are computed in the > loop are basically cos(t), cos(t+h) and cos(t + h/2) and `h` is a > constant, you can use the `cos(t+x) =3D cos(t)cos(x) - sin(t)sin(x)` > trigonometric > identity to factor out the constant factors (that are only depend on > `h`), so finally, you will need to compute one cosine and one sine. > This will boost the performance of the program, leaving even the SSE > corei7 specific version far behind. >=20 >=20 > On Mon, Dec 19, 2016 at 11:44 AM, Yotam Barnoy > wrote: >=20 > Christoph, you don't mention which version of OCaml you're compiling > with, and whether FLambda is used. Flambda should be able to do the > kind of optimizations Ivan mentions, at least ideally. If it's not > able to do them yet, it will be when the rewritten version is > deployed. >=20 > -Yotam >=20 >=20 > On Mon, Dec 19, 2016 at 10:48 AM, Ivan Gotovchits > wrote: > > Hi Christoph, > > > > The problem is that your function definitions, like `loop` and > `rk4_step`, > > have too many parameters > > and OCaml is not able to eliminate them and is actually not > trying. It was > > always a feature of OCaml > > that poorly written code will be compiled into a poorly written > program. > > OCaml never tries to fix > > programmer's errors. It will try to minimize the abstraction > overhead (often > > to the zero). But if the abstractions, > > on the first hand, were chosen incorrectly, then it won't fix the > code. > > > > In your particular example, the C compiler was quite aggressive > and was > > capable of eliminating unnecessary computations. > > I wouldn't, however, assume that the C compiler will be always > able to do > > this for you. > > > > Let's go through the code: > > > > > > let rec loop steps h n y t =3D > > if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) el= se > > y > > > > Here variables `steps` and `h` are loop invariants, so you > shouldn't put it > > into the loop function parameter list. > > Well yes, a compiler can find out loop invariants and remove them > for you. > > But in our case, to remove them, > > it will need to change the type of the function. The compiler will > not go > > that far for us. It respects a programmer, and if a > > programmer decided to make his loop function to depend on `6` > arguments, > > then it means that the computation is actually depending > > on 6 arguments. So, it will allocate 6 registers to hold loop > variables, > > with all the consequences. > > > > > > Now, let's take a look at the `rk4_step` function: > > > > let rk4_step y t h =3D > > let k1 =3D h *. y' t y in > > let k2 =3D h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in > > let k3 =3D h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in > > let k4 =3D h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > > > > > This function, is, in fact, a body of the loop, and everything > except t is > > loop invariant here. Moreover, > > function `y'` is defined as: > > > > let y' t y =3D cos t > > > > I.e., it doesn't really use it the second argument. Probably, a > compiler > > should inline the call, and eliminate > > lots of unecessary computations, and thus free a few registers, but, > > apparently, OCaml doesn't do this > > (even in 4.03+flambda). > > > > So we should do this manually: > > > > let rk4_step y t =3D > > let k1 =3D h *. y' t in > > let k2 =3D h *. y' (t +. 0.5*.h) in > > let k3 =3D h *. y' (t +. 0.5*.h) in > > let k4 =3D h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > We can even see, that `k3` and `k2` are equal now, so we can > eliminate them: > > > > let rk4_step y t =3D > > let k1 =3D h *. y' t in > > let k2 =3D h *. y' (t +. 0.5*.h) in > > let k4 =3D h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. k2 *. 1.5 > > > > > > Finally, we don't want to pass `y` into the `rk4_step` every time, > as we > > don't want to require an extra register for it. > > After all these manual optimizations, we have the following program: > > > > let h =3D 0.1 > > > > > > let exact t =3D sin t > > > > > > let rk4_step t =3D > > > > let k1 =3D h *. cos t in > > > > let k2 =3D h *. cos (t +. 0.5*.h) in > > > > let k4 =3D h *. cos (t +. h) in > > > > (k1+.k4)/.6.0 +. k2*.1.5 > > > > > > let compute steps =3D > > > > let rec loop n y t =3D > > > > if n < steps > > > > then loop (n+1) (y +. rk4_step t) (t +. h) > > > > else y in > > > > loop 1 1.0 0.0 > > > > > > let () =3D > > > > let y =3D compute 102 in > > > > let err =3D abs_float (y -. (exact ((float_of_int 102) *. h))) in > > > > let large =3D 50000000 in > > > > let y =3D compute large in > > > > Printf.printf "%b\n" > > > > (abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err) > > > > > > > > This program has the same performance as the C one... unless I > pass really > > aggressive optimization options > > to the C compiler, that will emit a platform specific code, e.g., > > > > gcc rk.c -lm -O3 -march=3Dcorei7-avx -o rksse > > > > > > These options basically double the performance of the C version, > leaving > > OCaml lagging behind. That is because, > > OCaml, obviously, cannot follow the latest developments of intel CP= U, > > especially in the field of SSE. > > > > The fun part is that when I've tried to compile the same file with > clang, > > the resulting program was even slower > > than the original non-optimized OCaml. But this is all micro > benchmarking of > > course, so don't jump to fast conclusions > > (although I like to think that OCaml is faster than Clang :)) > > > > > > As a final remark, my experience in HPC shows that in general you > should not > > really rely on compiler optimizations and hope > > that the compiler will do the magic for you. Even the GCC > compiler. It would > > be very easy to accidentally amend the above program > > in a such way, that the optimizations will no longer fire in. Of > course, > > writing in assembly is also not a choice. If you really need > > to optimize, then you should find out the performance bottleneck > and then > > optimize it manually until you get an expected performance. > > Alternatively, you can use plain old Fortran to get the reliable > > performance. And then call it from C or OCaml. > > > > > > Best wishes, > > Ivan > > > > > > > > On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann > > > > wrote: > >> > >> Hi Christoph, > >> > >> the extra code looks very much like an allocation on the minor hea= p: > >> > >> sub $0x10,%r15 > >> lea 0x25c7b6(%rip),%rax > >> cmp (%rax),%r15 > >> jb 404a8a > >> lea 0x8(%r15),%rax > >> movq $0x4fd,-0x8(%rax) > >> > >> r15 points to the used area of the minor heap - by decrementing > it you > >> get an additional block of memory. It is compared against the > beginning > >> of the heap to check whether GC is needed. The constant 0x4fd is t= he > >> header of the new block (which must be always initialized). > >> > >> From the source code, it remains unclear for what this is used. > >> Obviously, the compiler runs out of registers, and moves some > values to > >> the minor heap (temporarily). When you call a C function like cos > it is > >> likely that this happens because the C calling conventions do not > >> preserve the FP registers (xmm*). This could be improved if the OC= aml > >> compiler tried alternate places for temporarily storing FP values: > >> > >> - int registers (which is perfectly possible on 64 bit platforms). > >> A number of int registers survive C calls. > >> - stack > >> > >> To my knowledge, the OCaml compiler never tries this (but this > could be > >> out of date). This is a fairly specific optimization that makes > mostly > >> sense for purely iterating or aggregating functions like yours > that do > >> not store FP values away. > >> > >> Gerd > >> > >> Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph H=C3=B6g= er: > >> > Ups. Forgot the actual examples. > >> > > >> > Am 17.12.2016 um 14:01 schrieb Christoph H=C3=B6ger: > >> > > > >> > > Dear all, > >> > > > >> > > find attached two simple runge-kutta iteration schemes. One is > >> > > written > >> > > in C, the other in OCaml. I compared the runtime of both and > gcc (- > >> > > O2) > >> > > produces an executable that is roughly 30% faster (to be more > >> > > precise: > >> > > 3.52s vs. 2.63s). That is in itself quite pleasing, I think. I= do > >> > > not > >> > > understand however, what causes this difference. Admittedly, t= he > >> > > generated assembly looks completely different, but both compil= ers > >> > > inline > >> > > all functions and generate one big loop. Ocaml generates a > lot more > >> > > scaffolding, but that is to be expected. > >> > > > >> > > There is however an interesting particularity: OCaml generates= 6 > >> > > calls > >> > > to cos, while gcc only needs 3 (and one direct jump). > Surprisingly, > >> > > there are also calls to cosh, acos and pretty much any other > >> > > trigonometric function (initialization of constants, maybe?) > >> > > > >> > > However, the true culprit seems to be an excess of instructions > >> > > between > >> > > the different calls to cos. This is what happens between the > first > >> > > two > >> > > calls to cos: > >> > > > >> > > gcc: > >> > > jmpq 400530 > >> > > nop > >> > > nopw %cs:0x0(%rax,%rax,1) > >> > > > >> > > sub $0x38,%rsp > >> > > movsd %xmm0,0x10(%rsp) > >> > > movapd %xmm1,%xmm0 > >> > > movsd %xmm2,0x18(%rsp) > >> > > movsd %xmm1,0x8(%rsp) > >> > > callq 400530 > >> > > > >> > > ocamlopt: > >> > > > >> > > callq 401a60 > >> > > mulsd (%r12),%xmm0 > >> > > movsd %xmm0,0x10(%rsp) > >> > > sub $0x10,%r15 > >> > > lea 0x25c7b6(%rip),%rax > >> > > cmp (%rax),%r15 > >> > > jb 404a8a > >> > > lea 0x8(%r15),%rax > >> > > movq $0x4fd,-0x8(%rax) > >> > > > >> > > movsd 0x32319(%rip),%xmm1 > >> > > > >> > > movapd %xmm1,%xmm2 > >> > > mulsd %xmm0,%xmm2 > >> > > addsd 0x0(%r13),%xmm2 > >> > > movsd %xmm2,(%rax) > >> > > movapd %xmm1,%xmm0 > >> > > mulsd (%r12),%xmm0 > >> > > addsd (%rbx),%xmm0 > >> > > callq 401a60 > >> > > > >> > > > >> > > Is this caused by some underlying difference in the > representation > >> > > of > >> > > numeric values (i.e. tagged ints) or is it reasonable to attack > >> > > this > >> > > issue as a hobby experiment? > >> > > > >> > > > >> > > thanks for any advice, > >> > > > >> > > Christoph > >> > > > >> > > >> -- > >> ------------------------------------------------------------ > >> Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de > > >> My OCaml site: http://www.camlcity.org > >> Contact details: http://www.camlcity.org/contact.html > > >> Company homepage: http://www.gerd-stolpmann.de > >> ------------------------------------------------------------ > >> > >> > > >=20 >=20 --=20 Christoph H=C3=B6ger Technische Universit=C3=A4t Berlin Fakult=C3=A4t IV - Elektrotechnik und Informatik =C3=9Cbersetzerbau und Programmiersprachen Sekr. TEL12-2, Ernst-Reuter-Platz 7, 10587 Berlin Tel.: +49 (30) 314-24890 E-Mail: christoph.hoeger@tu-berlin.de --GNA9j0uM0Ti1u8LVhgG6jOxkakTnLJTTw-- --X2d03cXlkdC6C4OsJ1VR6fwvtdqJkwsiB Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iEYEARECAAYFAlhaRnkACgkQhMBO4cVSGS9GcACfdvWuBSkT/VmYbqKOpOfFYSaX neUAnj9NzDaBy/P2xb5N1bLrNbBKt83f =hYIi -----END PGP SIGNATURE----- --X2d03cXlkdC6C4OsJ1VR6fwvtdqJkwsiB--