From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id 0D13D7F615 for ; Mon, 19 Dec 2016 17:59:29 +0100 (CET) Authentication-Results: mail2-smtp-roc.national.inria.fr; spf=None smtp.pra=ivg@ieee.org; spf=Pass smtp.mailfrom=ivg@ieee.org; spf=None smtp.helo=postmaster@mail-lf0-f48.google.com Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of ivg@ieee.org) identity=pra; client-ip=209.85.215.48; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="ivg@ieee.org"; x-sender="ivg@ieee.org"; x-conformance=sidf_compatible Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of ivg@ieee.org designates 209.85.215.48 as permitted sender) identity=mailfrom; client-ip=209.85.215.48; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="ivg@ieee.org"; x-sender="ivg@ieee.org"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-lf0-f48.google.com) identity=helo; client-ip=209.85.215.48; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="ivg@ieee.org"; x-sender="postmaster@mail-lf0-f48.google.com"; x-conformance=sidf_compatible IronPort-PHdr: =?us-ascii?q?9a23=3A99qCYRxcRqLQ5YDXCy+O+j09IxM/srCxBDY+r6Qd?= =?us-ascii?q?0eoRIJqq85mqBkHD//Il1AaPBtSAra0bwLuK++C4ACpbvsbH6ChDOLV3FDY7yu?= =?us-ascii?q?wu1zQ6B8CEDUCpZNXLVAcdWPp4aVl+4nugOlJUEsutL3fbo3m18CJAUk6nbVk9?= =?us-ascii?q?Dq3PF4XTl8W60fyps92WOl0QxWn1XbQnHRKqpACZnMAMh4xzYvIgzQfAp3FBYe?= =?us-ascii?q?JR1EtnIFuSm1D34cLmr7B59CEFmuwo8YZvVrn9Ya84TKBDRGAnLW8d5cDmuF/E?= =?us-ascii?q?VwTZtShUaXkfjhcdW1uN1xr9RJqk93Ki7uc=3D?= X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0AeAAD3EFhYhjDXVdFUCRoBAQEBAgEBA?= =?us-ascii?q?QEIAQEBARUBAQEBAgEBAQEIAQEBAYJIRAEBAQEBeYEGB41IllmHdo0YggomhXw?= =?us-ascii?q?CgXsHPxQBAQEBAQEBAQEBARIBAQEICwsJHTCCMwQBFQEEghYBAQEDARoJHQEBE?= =?us-ascii?q?yQBBAsJAgsNKgICIQESAQUBHAYTCYhIAw8IDooYkQs/ixpogiiDDAEBBYQcDYN?= =?us-ascii?q?CAQEBAQEFAQEBAQEBARkIEoYkg1GBCIJIgVEzgniCXYhZDocciTCBDTWNQgODb?= =?us-ascii?q?5BNiA6BVIQ3gkkUHoEUH4FcKxMDgwwsIIIGIDQBhkwqRIFPAQEB?= X-IPAS-Result: =?us-ascii?q?A0AeAAD3EFhYhjDXVdFUCRoBAQEBAgEBAQEIAQEBARUBAQE?= =?us-ascii?q?BAgEBAQEIAQEBAYJIRAEBAQEBeYEGB41IllmHdo0YggomhXwCgXsHPxQBAQEBA?= =?us-ascii?q?QEBAQEBARIBAQEICwsJHTCCMwQBFQEEghYBAQEDARoJHQEBEyQBBAsJAgsNKgI?= =?us-ascii?q?CIQESAQUBHAYTCYhIAw8IDooYkQs/ixpogiiDDAEBBYQcDYNCAQEBAQEFAQEBA?= =?us-ascii?q?QEBARkIEoYkg1GBCIJIgVEzgniCXYhZDocciTCBDTWNQgODb5BNiA6BVIQ3gkk?= =?us-ascii?q?UHoEUH4FcKxMDgwwsIIIGIDQBhkwqRIFPAQEB?= X-IronPort-AV: E=Sophos;i="5.33,374,1477954800"; d="scan'208,217";a="250729791" Received: from mail-lf0-f48.google.com ([209.85.215.48]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/AES128-GCM-SHA256; 19 Dec 2016 17:59:26 +0100 Received: by mail-lf0-f48.google.com with SMTP id b14so58238770lfg.2 for ; Mon, 19 Dec 2016 08:59:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ieee-org.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=z9kN7R8tqyeUQFoeHFrAW3J/H5ZGW2LYJDR3iFOexjw=; b=sAFp6lwBOwsxykMVGtBsl9Cn4h4EgDGTV6HdhD42D23++HOhae3VmYm1F1iRcS5CWX JCyHD8j6L8FHVUBQzxtvHgzb0KypZQzbvPkwX3RUoD8q5fPg5chpoXAiuVj/G+3rwZpq LrNpCABwHtqLhUIXP5H6MNZ0abNKZPuW+AWuUaIhakIa3WiIbwKx7RkUY8k6+aG4gLP1 teBjaGcMnHWi2rpTR7uvzR3B2FqE0zuVmrsL5gX7ys2NwL9Sk+Xkrw0vwNUEQ8+IddiE tfhw+0tkr8Q6FFNH35Cdq7aV8Y/jmAgnW53HJJhfzey4YmQLCBpRx1hia3P/XCwliq7L jxlw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=z9kN7R8tqyeUQFoeHFrAW3J/H5ZGW2LYJDR3iFOexjw=; b=Xl8lOoqZsLLkRzaqi7aJ2FkPytmz6Mn/I2qvxlX/9CMcz0avNhZL6bs1vwK8q8ATQu yBPqRZ6tTxO+FNBi9AQV6X7Vzi0gZ/dZe8o0drmMflq8uO+SVYK3JZ0TT5TCOotw1827 NAQiM57/n9vAwZskZckJTsZliCVn4+rJ3PQF/2SUw2KqoWWeEqrJeBwlhkpergWyUemW k6ufa5VuapWf4n9VU2gc2DP+1XfCCL+L+tVCdMrT0Nyf0k+Tq8kp1PygsID4Z7pVneGb 5CuxWKkkWKoM9WMjJov2hOTQ7s9Bugo9/ohKy/vMl11RbvC67ij/hLrycR+yWNioKCp2 SK0Q== X-Gm-Message-State: AKaTC03QfCy5v0sbrQMMVc6hLvdtenO1PI29N0LJsSnn0oLyIJmMBqU+o6D6keK1tVk2lJOVvspQFfALwPAR7nfC X-Received: by 10.25.16.105 with SMTP id f102mr4703318lfi.99.1482166765585; Mon, 19 Dec 2016 08:59:25 -0800 (PST) MIME-Version: 1.0 Received: by 10.114.93.99 with HTTP; Mon, 19 Dec 2016 08:59:24 -0800 (PST) In-Reply-To: References: <7bc766a2-d460-524b-35ca-89609a34b719@tu-berlin.de> <1482148297.4629.19.camel@gerd-stolpmann.de> From: Ivan Gotovchits Date: Mon, 19 Dec 2016 11:59:24 -0500 Message-ID: To: Yotam Barnoy Cc: Gerd Stolpmann , =?UTF-8?Q?Christoph_H=C3=B6ger?= , caml-list Content-Type: multipart/alternative; boundary=001a113f929c8fb301054405d7ef Subject: Re: [Caml-list] Closing the performance gap to C --001a113f929c8fb301054405d7ef Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable > Flambda should be able to do the kind of optimizations Ivan mentions, at least ideally. I've used 4.03+flambda and it didn't perform the optimizations. I've played a lot with the inline options and `inline` directive and even noticed a strange (on a first glance) thing, that the performance is better, when I explicitly _disable_ of the `trk4_step` function (with `[@@inline never]`) As a side note, not about compilers, but about mathematics. Since the three cosine functions, that are computed in the loop are basically cos(t), cos(t+h) and cos(t + h/2) and `h` is a constant, you can use the `cos(t+x) =3D cos(t)cos(x) - sin(t)sin(x)` trigonometric identity to factor out the constant factors (that are only depend on `h`), so finally, you will need to compute one cosine and one sine. This will boost the performance of the program, leaving even the SSE corei7 specific version far behind. On Mon, Dec 19, 2016 at 11:44 AM, Yotam Barnoy wrote: > Christoph, you don't mention which version of OCaml you're compiling > with, and whether FLambda is used. Flambda should be able to do the > kind of optimizations Ivan mentions, at least ideally. If it's not > able to do them yet, it will be when the rewritten version is > deployed. > > -Yotam > > > On Mon, Dec 19, 2016 at 10:48 AM, Ivan Gotovchits wrote: > > Hi Christoph, > > > > The problem is that your function definitions, like `loop` and > `rk4_step`, > > have too many parameters > > and OCaml is not able to eliminate them and is actually not trying. It > was > > always a feature of OCaml > > that poorly written code will be compiled into a poorly written program. > > OCaml never tries to fix > > programmer's errors. It will try to minimize the abstraction overhead > (often > > to the zero). But if the abstractions, > > on the first hand, were chosen incorrectly, then it won't fix the code. > > > > In your particular example, the C compiler was quite aggressive and was > > capable of eliminating unnecessary computations. > > I wouldn't, however, assume that the C compiler will be always able to = do > > this for you. > > > > Let's go through the code: > > > > > > let rec loop steps h n y t =3D > > if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else > > y > > > > Here variables `steps` and `h` are loop invariants, so you shouldn't put > it > > into the loop function parameter list. > > Well yes, a compiler can find out loop invariants and remove them for > you. > > But in our case, to remove them, > > it will need to change the type of the function. The compiler will not = go > > that far for us. It respects a programmer, and if a > > programmer decided to make his loop function to depend on `6` arguments, > > then it means that the computation is actually depending > > on 6 arguments. So, it will allocate 6 registers to hold loop variables, > > with all the consequences. > > > > > > Now, let's take a look at the `rk4_step` function: > > > > let rk4_step y t h =3D > > let k1 =3D h *. y' t y in > > let k2 =3D h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in > > let k3 =3D h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in > > let k4 =3D h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > > > > > This function, is, in fact, a body of the loop, and everything except t > is > > loop invariant here. Moreover, > > function `y'` is defined as: > > > > let y' t y =3D cos t > > > > I.e., it doesn't really use it the second argument. Probably, a compiler > > should inline the call, and eliminate > > lots of unecessary computations, and thus free a few registers, but, > > apparently, OCaml doesn't do this > > (even in 4.03+flambda). > > > > So we should do this manually: > > > > let rk4_step y t =3D > > let k1 =3D h *. y' t in > > let k2 =3D h *. y' (t +. 0.5*.h) in > > let k3 =3D h *. y' (t +. 0.5*.h) in > > let k4 =3D h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > We can even see, that `k3` and `k2` are equal now, so we can eliminate > them: > > > > let rk4_step y t =3D > > let k1 =3D h *. y' t in > > let k2 =3D h *. y' (t +. 0.5*.h) in > > let k4 =3D h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. k2 *. 1.5 > > > > > > Finally, we don't want to pass `y` into the `rk4_step` every time, as we > > don't want to require an extra register for it. > > After all these manual optimizations, we have the following program: > > > > let h =3D 0.1 > > > > > > let exact t =3D sin t > > > > > > let rk4_step t =3D > > > > let k1 =3D h *. cos t in > > > > let k2 =3D h *. cos (t +. 0.5*.h) in > > > > let k4 =3D h *. cos (t +. h) in > > > > (k1+.k4)/.6.0 +. k2*.1.5 > > > > > > let compute steps =3D > > > > let rec loop n y t =3D > > > > if n < steps > > > > then loop (n+1) (y +. rk4_step t) (t +. h) > > > > else y in > > > > loop 1 1.0 0.0 > > > > > > let () =3D > > > > let y =3D compute 102 in > > > > let err =3D abs_float (y -. (exact ((float_of_int 102) *. h))) in > > > > let large =3D 50000000 in > > > > let y =3D compute large in > > > > Printf.printf "%b\n" > > > > (abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err) > > > > > > > > This program has the same performance as the C one... unless I pass > really > > aggressive optimization options > > to the C compiler, that will emit a platform specific code, e.g., > > > > gcc rk.c -lm -O3 -march=3Dcorei7-avx -o rksse > > > > > > These options basically double the performance of the C version, leaving > > OCaml lagging behind. That is because, > > OCaml, obviously, cannot follow the latest developments of intel CPU, > > especially in the field of SSE. > > > > The fun part is that when I've tried to compile the same file with clan= g, > > the resulting program was even slower > > than the original non-optimized OCaml. But this is all micro > benchmarking of > > course, so don't jump to fast conclusions > > (although I like to think that OCaml is faster than Clang :)) > > > > > > As a final remark, my experience in HPC shows that in general you should > not > > really rely on compiler optimizations and hope > > that the compiler will do the magic for you. Even the GCC compiler. It > would > > be very easy to accidentally amend the above program > > in a such way, that the optimizations will no longer fire in. Of course, > > writing in assembly is also not a choice. If you really need > > to optimize, then you should find out the performance bottleneck and th= en > > optimize it manually until you get an expected performance. > > Alternatively, you can use plain old Fortran to get the reliable > > performance. And then call it from C or OCaml. > > > > > > Best wishes, > > Ivan > > > > > > > > On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann > > wrote: > >> > >> Hi Christoph, > >> > >> the extra code looks very much like an allocation on the minor heap: > >> > >> sub $0x10,%r15 > >> lea 0x25c7b6(%rip),%rax > >> cmp (%rax),%r15 > >> jb 404a8a > >> lea 0x8(%r15),%rax > >> movq $0x4fd,-0x8(%rax) > >> > >> r15 points to the used area of the minor heap - by decrementing it you > >> get an additional block of memory. It is compared against the beginning > >> of the heap to check whether GC is needed. The constant 0x4fd is the > >> header of the new block (which must be always initialized). > >> > >> From the source code, it remains unclear for what this is used. > >> Obviously, the compiler runs out of registers, and moves some values to > >> the minor heap (temporarily). When you call a C function like cos it is > >> likely that this happens because the C calling conventions do not > >> preserve the FP registers (xmm*). This could be improved if the OCaml > >> compiler tried alternate places for temporarily storing FP values: > >> > >> - int registers (which is perfectly possible on 64 bit platforms). > >> A number of int registers survive C calls. > >> - stack > >> > >> To my knowledge, the OCaml compiler never tries this (but this could be > >> out of date). This is a fairly specific optimization that makes mostly > >> sense for purely iterating or aggregating functions like yours that do > >> not store FP values away. > >> > >> Gerd > >> > >> Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph H=C3=B6ger: > >> > Ups. Forgot the actual examples. > >> > > >> > Am 17.12.2016 um 14:01 schrieb Christoph H=C3=B6ger: > >> > > > >> > > Dear all, > >> > > > >> > > find attached two simple runge-kutta iteration schemes. One is > >> > > written > >> > > in C, the other in OCaml. I compared the runtime of both and gcc (- > >> > > O2) > >> > > produces an executable that is roughly 30% faster (to be more > >> > > precise: > >> > > 3.52s vs. 2.63s). That is in itself quite pleasing, I think. I do > >> > > not > >> > > understand however, what causes this difference. Admittedly, the > >> > > generated assembly looks completely different, but both compilers > >> > > inline > >> > > all functions and generate one big loop. Ocaml generates a lot more > >> > > scaffolding, but that is to be expected. > >> > > > >> > > There is however an interesting particularity: OCaml generates 6 > >> > > calls > >> > > to cos, while gcc only needs 3 (and one direct jump). Surprisingly, > >> > > there are also calls to cosh, acos and pretty much any other > >> > > trigonometric function (initialization of constants, maybe?) > >> > > > >> > > However, the true culprit seems to be an excess of instructions > >> > > between > >> > > the different calls to cos. This is what happens between the first > >> > > two > >> > > calls to cos: > >> > > > >> > > gcc: > >> > > jmpq 400530 > >> > > nop > >> > > nopw %cs:0x0(%rax,%rax,1) > >> > > > >> > > sub $0x38,%rsp > >> > > movsd %xmm0,0x10(%rsp) > >> > > movapd %xmm1,%xmm0 > >> > > movsd %xmm2,0x18(%rsp) > >> > > movsd %xmm1,0x8(%rsp) > >> > > callq 400530 > >> > > > >> > > ocamlopt: > >> > > > >> > > callq 401a60 > >> > > mulsd (%r12),%xmm0 > >> > > movsd %xmm0,0x10(%rsp) > >> > > sub $0x10,%r15 > >> > > lea 0x25c7b6(%rip),%rax > >> > > cmp (%rax),%r15 > >> > > jb 404a8a > >> > > lea 0x8(%r15),%rax > >> > > movq $0x4fd,-0x8(%rax) > >> > > > >> > > movsd 0x32319(%rip),%xmm1 > >> > > > >> > > movapd %xmm1,%xmm2 > >> > > mulsd %xmm0,%xmm2 > >> > > addsd 0x0(%r13),%xmm2 > >> > > movsd %xmm2,(%rax) > >> > > movapd %xmm1,%xmm0 > >> > > mulsd (%r12),%xmm0 > >> > > addsd (%rbx),%xmm0 > >> > > callq 401a60 > >> > > > >> > > > >> > > Is this caused by some underlying difference in the representation > >> > > of > >> > > numeric values (i.e. tagged ints) or is it reasonable to attack > >> > > this > >> > > issue as a hobby experiment? > >> > > > >> > > > >> > > thanks for any advice, > >> > > > >> > > Christoph > >> > > > >> > > >> -- > >> ------------------------------------------------------------ > >> Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de > >> My OCaml site: http://www.camlcity.org > >> Contact details: http://www.camlcity.org/contact.html > >> Company homepage: http://www.gerd-stolpmann.de > >> ------------------------------------------------------------ > >> > >> > > > --001a113f929c8fb301054405d7ef Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
>=C2=A0=C2=A0Flambda should be able to do the=C2=A0<= /span>kind of optimizations Ivan mentions,= at least ideally.

I've used 4.03+flambda an= d it didn't perform the optimizations. I've played a lot with the i= nline options and=C2=A0
`inline` directive and even noticed a strange (= on a first glance) thing, that the performance is better, when=C2=A0
<= div>I explicitly _disable_ of the `trk4_step` function (with `[@@inline nev= er]`)

As a side note, not about compilers, but abo= ut mathematics.=C2=A0
Since the three cosine functions, that are = computed in the
loop are basically cos(t), cos(t+h) and cos(t + h= /2) and `h` is a constant, you can use the =C2=A0`cos(t+x) =3D cos(t)cos(x)= - sin(t)sin(x)` trigonometric
identity to factor out the constan= t factors (that are only depend on `h`), so finally, you will need to compu= te one cosine and one sine.
This will boost the performance of th= e program, leaving even the SSE corei7 specific version far behind.


On Mon, Dec 19, 2016 at 11:44 AM, Yotam Barnoy <yotambarnoy@gmail.co= m> wrote:
Christoph, you do= n't mention which version of OCaml you're compiling
with, and whether FLambda is used. Flambda should be able to do the
kind of optimizations Ivan mentions, at least ideally. If it's not
able to do them yet, it will be when the rewritten version is
deployed.

-Yotam


On Mon, Dec 19, 2016 at 10:48 AM, Ivan Gotovchits <ivg@ieee.org> wrote:
> Hi Christoph,
>
> The problem is that your function definitions, like `loop` and `rk4_st= ep`,
> have too many parameters
> and OCaml is not able to eliminate them and is actually not trying. It= was
> always a feature of OCaml
> that poorly written code will be compiled into a poorly written progra= m.
> OCaml never tries to fix
> programmer's errors. It will try to minimize the abstraction overh= ead (often
> to the zero). But if the abstractions,
> on the first hand, were chosen incorrectly, then it won't fix the = code.
>
> In your particular example, the C compiler was quite aggressive and wa= s
> capable of eliminating unnecessary computations.
> I wouldn't, however, assume that the C compiler will be always abl= e to do
> this for you.
>
> Let's go through the code:
>
>
> let rec loop steps h n y t =3D
>=C2=A0 =C2=A0if n < steps then loop steps h (n+1) (rk4_step y t h) (= t +. h) else
>=C2=A0 =C2=A0 =C2=A0y
>
> Here variables `steps` and `h` are loop invariants, so you shouldn'= ;t put it
> into the loop function parameter list.
> Well yes, a compiler can find out loop invariants and remove them for = you.
> But in our case, to remove them,
> it will need to change the type of the function. The compiler will not= go
> that far for us. It respects a programmer, and if a
> programmer decided to make his loop function to depend on `6` argument= s,
> then it means that the computation is actually depending
> on 6 arguments. So, it will allocate 6 registers to hold loop variable= s,
> with all the consequences.
>
>
> Now, let's take a look at the `rk4_step` function:
>
> let rk4_step y t h =3D
>=C2=A0 =C2=A0let k1 =3D h *. y' t y in
>=C2=A0 =C2=A0let k2 =3D h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in
>=C2=A0 =C2=A0let k3 =3D h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in
>=C2=A0 =C2=A0let k4 =3D h *. y' (t +. h) (y +. k3) in
>=C2=A0 =C2=A0y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0
>
>
>
> This function, is, in fact, a body of the loop, and everything except = t is
> loop invariant here. Moreover,
> function=C2=A0 `y'` is defined as:
>
> let y' t y =3D cos t
>
> I.e., it doesn't really use it the second argument. Probably, a co= mpiler
> should inline the call, and eliminate
> lots of unecessary computations, and thus free a few registers, but, > apparently, OCaml doesn't do this
> (even in 4.03+flambda).
>
> So we should do this manually:
>
> let rk4_step y t =3D
>=C2=A0 =C2=A0let k1 =3D h *. y' t in
>=C2=A0 =C2=A0let k2 =3D h *. y' (t +. 0.5*.h)=C2=A0 in
>=C2=A0 =C2=A0let k3 =3D h *. y' (t +. 0.5*.h)=C2=A0 in
>=C2=A0 =C2=A0let k4 =3D h *. y' (t +. h) (y +. k3) in
>=C2=A0 =C2=A0y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0
>
> We can even see, that `k3` and `k2` are equal now, so we can eliminate= them:
>
> let rk4_step y t =3D
>=C2=A0 =C2=A0let k1 =3D h *. y' t in
>=C2=A0 =C2=A0let k2 =3D h *. y' (t +. 0.5*.h)=C2=A0 in
>=C2=A0 =C2=A0let k4 =3D h *. y' (t +. h) (y +. k3) in
>=C2=A0 =C2=A0y +. (k1+.k4)/.6.0 +. k2 *. 1.5
>
>
> Finally, we don't want to pass `y` into the `rk4_step` every time,= as we
> don't want to require an extra register for it.
> After all these manual optimizations, we have the following program: >
> let h =3D 0.1
>
>
> let exact t =3D sin t
>
>
> let rk4_step t =3D
>
>=C2=A0 =C2=A0let k1 =3D h *. cos t in
>
>=C2=A0 =C2=A0let k2 =3D h *. cos (t +. 0.5*.h) in
>
>=C2=A0 =C2=A0let k4 =3D h *. cos (t +. h) in
>
>=C2=A0 =C2=A0(k1+.k4)/.6.0 +. k2*.1.5
>
>
> let compute steps =3D
>
>=C2=A0 =C2=A0let rec loop n y t =3D
>
>=C2=A0 =C2=A0 =C2=A0if n < steps
>
>=C2=A0 =C2=A0 =C2=A0then loop (n+1) (y +. rk4_step t) (t +. h)
>
>=C2=A0 =C2=A0 =C2=A0else y in
>
>=C2=A0 =C2=A0loop 1 1.0 0.0
>
>
> let () =3D
>
>=C2=A0 =C2=A0let y =3D compute 102 in
>
>=C2=A0 =C2=A0let err =3D abs_float (y -. (exact ((float_of_int 102) *. = h))) in
>
>=C2=A0 =C2=A0let large =3D 50000000 in
>
>=C2=A0 =C2=A0let y =3D compute large in
>
>=C2=A0 =C2=A0Printf.printf "%b\n"
>
>=C2=A0 =C2=A0 =C2=A0(abs_float (y -. (exact (float_of_int large) *. h))= < 2. *. err)
>
>
>
> This program has the same performance as the C one... unless I pass re= ally
> aggressive optimization options
> to the C compiler, that will emit a platform specific code, e.g.,
>
>=C2=A0 =C2=A0 =C2=A0gcc rk.c -lm -O3 -march=3Dcorei7-avx -o rksse
>
>
> These options basically double the performance of the C version, leavi= ng
> OCaml lagging behind. That is because,
> OCaml, obviously, cannot follow the latest developments of intel CPU,<= br> > especially in the field of SSE.
>
> The fun part is that when I've tried to compile the same file with= clang,
> the resulting program was even slower
> than the original non-optimized OCaml. But this is all micro benchmark= ing of
> course, so don't jump to fast conclusions
> (although I like to think that OCaml is faster than Clang :))
>
>
> As a final remark, my experience in HPC shows that in general you shou= ld not
> really rely on compiler optimizations and hope
> that the compiler will do the magic for you. Even the GCC compiler. It= would
> be very easy to accidentally amend the above program
> in a such way, that the optimizations will no longer fire in. Of cours= e,
> writing in assembly is also not a choice. If you really need
> to optimize, then you should find out the performance bottleneck and t= hen
> optimize it manually until you get an expected performance.
> Alternatively, you can use plain old Fortran to get the reliable
> performance. And then call it from C or OCaml.
>
>
> Best wishes,
> Ivan
>
>
>
> On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann <info@gerd-stolpmann.de>
> wrote:
>>
>> Hi Christoph,
>>
>> the extra code looks very much like an allocation on the minor hea= p:
>>
>> sub=C2=A0 =C2=A0 $0x10,%r15
>> lea=C2=A0 =C2=A0 0x25c7b6(%rip),%rax
>> cmp=C2=A0 =C2=A0 (%rax),%r15
>> jb=C2=A0 =C2=A0 =C2=A0404a8a <dlerror@plt+0x2d0a>
>> lea=C2=A0 =C2=A0 0x8(%r15),%rax
>> movq=C2=A0 =C2=A0$0x4fd,-0x8(%rax)
>>
>> r15 points to the used area of the minor heap - by decrementing it= you
>> get an additional block of memory. It is compared against the begi= nning
>> of the heap to check whether GC is needed. The constant 0x4fd is t= he
>> header of the new block (which must be always initialized).
>>
>> From the source code, it remains unclear for what this is used.
>> Obviously, the compiler runs out of registers, and moves some valu= es to
>> the minor heap (temporarily). When you call a C function like cos = it is
>> likely that this happens because the C calling conventions do not<= br> >> preserve the FP registers (xmm*). This could be improved if the OC= aml
>> compiler tried alternate places for temporarily storing FP values:=
>>
>>=C2=A0 - int registers (which is perfectly possible on 64 bit platf= orms).
>>=C2=A0 =C2=A0 A number of int registers survive C calls.
>>=C2=A0 - stack
>>
>> To my knowledge, the OCaml compiler never tries this (but this cou= ld be
>> out of date). This is a fairly specific optimization that makes mo= stly
>> sense for purely iterating or aggregating functions like yours tha= t do
>> not store FP values away.
>>
>> Gerd
>>
>> Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph H=C3=B6g= er:
>> > Ups. Forgot the actual examples.
>> >
>> > Am 17.12.2016 um 14:01 schrieb Christoph H=C3=B6ger:
>> > >
>> > > Dear all,
>> > >
>> > > find attached two simple runge-kutta iteration schemes. = One is
>> > > written
>> > > in C, the other in OCaml. I compared the runtime of both= and gcc (-
>> > > O2)
>> > > produces an executable that is roughly 30% faster (to be= more
>> > > precise:
>> > > 3.52s vs. 2.63s). That is in itself quite pleasing, I th= ink. I do
>> > > not
>> > > understand however, what causes this difference. Admitte= dly, the
>> > > generated assembly looks completely different, but both = compilers
>> > > inline
>> > > all functions and generate one big loop. Ocaml generates= a lot more
>> > > scaffolding, but that is to be expected.
>> > >
>> > > There is however an interesting particularity: OCaml gen= erates 6
>> > > calls
>> > > to cos, while gcc only needs 3 (and one direct jump). Su= rprisingly,
>> > > there are also calls to cosh, acos and pretty much any o= ther
>> > > trigonometric function (initialization of constants, may= be?)
>> > >
>> > > However, the true culprit seems to be an excess of instr= uctions
>> > > between
>> > > the different calls to cos. This is what happens between= the first
>> > > two
>> > > calls to cos:
>> > >
>> > > gcc:
>> > > jmpq=C2=A0 =C2=A0400530 <cos@plt>
>> > > nop
>> > > nopw=C2=A0 =C2=A0%cs:0x0(%rax,%rax,1)
>> > >
>> > > sub=C2=A0 =C2=A0 $0x38,%rsp
>> > > movsd=C2=A0 %xmm0,0x10(%rsp)
>> > > movapd %xmm1,%xmm0
>> > > movsd=C2=A0 %xmm2,0x18(%rsp)
>> > > movsd=C2=A0 %xmm1,0x8(%rsp)
>> > > callq=C2=A0 400530 <cos@plt>
>> > >
>> > > ocamlopt:
>> > >
>> > > callq=C2=A0 401a60 <cos@plt>
>> > > mulsd=C2=A0 (%r12),%xmm0
>> > > movsd=C2=A0 %xmm0,0x10(%rsp)
>> > > sub=C2=A0 =C2=A0 $0x10,%r15
>> > > lea=C2=A0 =C2=A0 0x25c7b6(%rip),%rax
>> > > cmp=C2=A0 =C2=A0 (%rax),%r15
>> > > jb=C2=A0 =C2=A0 =C2=A0404a8a <dlerror@plt+0x2d0a><= br> >> > > lea=C2=A0 =C2=A0 0x8(%r15),%rax
>> > > movq=C2=A0 =C2=A0$0x4fd,-0x8(%rax)
>> > >
>> > > movsd=C2=A0 0x32319(%rip),%xmm1
>> > >
>> > > movapd %xmm1,%xmm2
>> > > mulsd=C2=A0 %xmm0,%xmm2
>> > > addsd=C2=A0 0x0(%r13),%xmm2
>> > > movsd=C2=A0 %xmm2,(%rax)
>> > > movapd %xmm1,%xmm0
>> > > mulsd=C2=A0 (%r12),%xmm0
>> > > addsd=C2=A0 (%rbx),%xmm0
>> > > callq=C2=A0 401a60 <cos@plt>
>> > >
>> > >
>> > > Is this caused by some underlying difference in the repr= esentation
>> > > of
>> > > numeric values (i.e. tagged ints) or is it reasonable to= attack
>> > > this
>> > > issue as a hobby experiment?
>> > >
>> > >
>> > > thanks for any advice,
>> > >
>> > > Christoph
>> > >
>> >
>> --
>> ------------------------------------------------------------<= br> >> Gerd Stolpmann, Darmstadt, Germany=C2=A0 =C2=A0 gerd@gerd-stolpmann.de
>> My OCaml site:=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 http://www.camlcity= .org
>> Contact details:=C2=A0 =C2=A0 =C2=A0 =C2=A0 http://www.= camlcity.org/contact.html
>> Company homepage:=C2=A0 =C2=A0 =C2=A0 =C2=A0http://www.gerd-sto= lpmann.de
>> ------------------------------------------------------------<= br> >>
>>
>

--001a113f929c8fb301054405d7ef--