From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ivg@ieee.org>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by sympa.inria.fr (Postfix) with ESMTPS id CDCD27F615
	for <caml-list@sympa.inria.fr>; Mon, 19 Dec 2016 16:48:51 +0100 (CET)
Authentication-Results: mail3-smtp-sop.national.inria.fr; spf=None smtp.pra=ivg@ieee.org; spf=Pass smtp.mailfrom=ivg@ieee.org; spf=None smtp.helo=postmaster@mail-lf0-f54.google.com
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  ivg@ieee.org) identity=pra; client-ip=209.85.215.54;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="ivg@ieee.org"; x-sender="ivg@ieee.org";
  x-conformance=sidf_compatible
Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of
  ivg@ieee.org designates 209.85.215.54 as permitted sender)
  identity=mailfrom; client-ip=209.85.215.54;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="ivg@ieee.org"; x-sender="ivg@ieee.org";
  x-conformance=sidf_compatible; x-record-type="v=spf1"
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  postmaster@mail-lf0-f54.google.com) identity=helo;
  client-ip=209.85.215.54;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="ivg@ieee.org";
  x-sender="postmaster@mail-lf0-f54.google.com";
  x-conformance=sidf_compatible
Received: from mail-lf0-f54.google.com ([209.85.215.54])
  by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/AES128-GCM-SHA256; 19 Dec 2016 16:48:50 +0100
Received: by mail-lf0-f54.google.com with SMTP id y21so56782991lfa.1
        for <caml-list@inria.fr>; Mon, 19 Dec 2016 07:48:50 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=ieee-org.20150623.gappssmtp.com; s=20150623;
        h=mime-version:in-reply-to:references:from:date:message-id:subject:to
         :cc;
        bh=ghCr55JiHX6W8fNadIqXn7/yG7O1w2M0+CGW7PDwyY0=;
        b=m+33Bd65zSF/mPyVfjLCOljDP5rrb7DvLBHD68buMhwG+UgacJVBtPLyngec6fLMIz
         KvAQTgqzn30CKbq859ySba+X+PxkSlK0i9O+SKC2CbouUk3FYaRKqtxMUT8M6TLhc/Sq
         SotiYcKfV3soY9mJrWP7muU2m9t/EyQ1ZMMYuQdNaAgfTeg0ZG8epxNEfavp9i+PZ3KT
         g5qpCKy+32qA6rEgvoX8OcYVrvhXGKHXzJluUCfOGaGlvnla2aeHRiSd8z2rOokhLABT
         gmGgv2g/ixaPVqwd+hVUGKGpPSHKMsrJ9JEWj7jTE5rwGeNFKdJVCIUE43YDfpph1ilE
         vH5A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:in-reply-to:references:from:date
         :message-id:subject:to:cc;
        bh=ghCr55JiHX6W8fNadIqXn7/yG7O1w2M0+CGW7PDwyY0=;
        b=DaFD8olKxlLD8XXZPfVJRGK/wZ9Nmu1A7BGQoKxR1xByHQSDtWYYvv/JnkE6JAbEm3
         owB4SWNY1+X/rfUNvYx05akraq+PdjqRGQ+o77MKgflwO5tmDDXvBV6Rw9V6Vh/OePqc
         uk1l9VuQbtxcItGp3gAvkG/qCWLE+xAs/uphJDftRCfoLoUg8E1P68QPRyPuTkUN+Gav
         gFbkdS3cUV4gUvGw8RJ+9DzVO92Ub8zezVVMW71D4rMd8s+SNIrTXY5rtfAB/aCRnOzn
         S4pAy93npnKWc6dAX6smV7z3Gr/mRQepkWqw748BxodEmzj6WFupXBhVVZzreFCOdNrE
         AypA==
X-Gm-Message-State: AIkVDXImOFwINorvhxlEYY7T3djjqlupuD6EN/FnoLwZzLcbuQJHGtVCI9MPeuhYlcrjo7fFKPTWHydVxNQmnzFJ
X-Received: by 10.46.76.1 with SMTP id z1mr7587108lja.41.1482162528888; Mon,
 19 Dec 2016 07:48:48 -0800 (PST)
MIME-Version: 1.0
Received: by 10.114.93.99 with HTTP; Mon, 19 Dec 2016 07:48:48 -0800 (PST)
In-Reply-To: <1482148297.4629.19.camel@gerd-stolpmann.de>
References: <7bc766a2-d460-524b-35ca-89609a34b719@tu-berlin.de>
 <adf19464-c995-0e02-48e9-100f0efd26b6@tu-berlin.de> <1482148297.4629.19.camel@gerd-stolpmann.de>
From: Ivan Gotovchits <ivg@ieee.org>
Date: Mon, 19 Dec 2016 10:48:48 -0500
Message-ID: <CALdWJ+w0FFcLvL=Jv8bptZJJzYpdLP2XzZxyAo3fh7nzPV-2ig@mail.gmail.com>
To: Gerd Stolpmann <info@gerd-stolpmann.de>
Cc: =?UTF-8?Q?Christoph_H=C3=B6ger?= <christoph.hoeger@tu-berlin.de>, 
	caml-list <caml-list@inria.fr>
Content-Type: multipart/alternative; boundary=f403045ea6da08c55a054404db07
Subject: Re: [Caml-list] Closing the performance gap to C

--f403045ea6da08c55a054404db07
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi Christoph,

The problem is that your function definitions, such as `loop` and
`rk4_step`, take too many parameters. OCaml is not able to eliminate
them and does not actually try. It has always been a property of OCaml
that poorly written code is compiled into a poorly performing program:
OCaml never tries to fix the programmer's mistakes. It will try to
minimize the abstraction overhead (often down to zero), but if the
abstractions were chosen badly in the first place, it won't fix the
code.

In your particular example, the C compiler was quite aggressive and
managed to eliminate the unnecessary computations. I wouldn't, however,
assume that a C compiler will always be able to do this for you.

Let's go through the code:


let rec loop steps h n y t =
  if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else
    y

Here the variables `steps` and `h` are loop invariants, so you
shouldn't put them into the parameter list of the loop function.
Yes, a compiler can detect loop invariants and hoist them for you, but
in our case, to remove them it would need to change the type of the
function. The compiler will not go that far for us. It respects the
programmer: if the programmer decided to make the loop function depend
on 5 arguments, then the computation is taken to really depend on 5
arguments. So it will allocate 5 registers to hold the loop variables,
with all the consequences.
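
To make the point concrete, here is a minimal sketch of the same
iteration written both ways. It is not code from the original
attachment: `step`, `loop_wide` and `loop_narrow` are names made up for
illustration, and `step` is only a stand-in for the real step function.
Both versions compute the same value; the second threads only the three
values that actually change through the recursion.

```ocaml
(* Hypothetical stand-in for the real step function. *)
let step y t h = y +. h *. cos t

(* The invariants `steps` and `h` are passed on every recursive call. *)
let rec loop_wide steps h n y t =
  if n < steps then loop_wide steps h (n + 1) (step y t h) (t +. h)
  else y

(* The invariants are bound once by the enclosing function; the inner
   loop only carries the values that change between iterations. *)
let loop_narrow steps h y0 t0 =
  let rec go n y t =
    if n < steps then go (n + 1) (step y t h) (t +. h)
    else y
  in
  go 0 y0 t0

let () =
  let a = loop_wide 1000 0.1 0 0.0 0.0 in
  let b = loop_narrow 1000 0.1 0.0 0.0 in
  Printf.printf "%.17g %.17g\n" a b
```

How the compiler treats the captured variables depends on its version
and flags, so it is worth measuring both forms rather than assuming
that one of them wins.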


Now, let's take a look at the `rk4_step` function:

let rk4_step y t h =
  let k1 = h *. y' t y in
  let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in
  let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in
  let k4 = h *. y' (t +. h) (y +. k3) in
  y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0


This function is, in fact, the body of the loop, and `h` is loop
invariant here. Moreover, the function `y'` is defined as:

let y' t y = cos t

I.e., it doesn't really use its second argument. Arguably, a compiler
should inline the call and eliminate lots of unnecessary computations,
thus freeing a few registers, but apparently OCaml doesn't do this
(even in 4.03+flambda).

So we should do this manually:

(* y' is inlined as cos; h is now a top-level constant *)
let rk4_step y t =
  let k1 = h *. cos t in
  let k2 = h *. cos (t +. 0.5*.h) in
  let k3 = h *. cos (t +. 0.5*.h) in
  let k4 = h *. cos (t +. h) in
  y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0

We can also see that `k2` and `k3` are now equal, so we can drop `k3`;
the term (k2+.k3)/.3.0 then becomes 2*k2/3, i.e. k2/.1.5:

let rk4_step y t =
  let k1 = h *. cos t in
  let k2 = h *. cos (t +. 0.5*.h) in
  let k4 = h *. cos (t +. h) in
  y +. (k1+.k4)/.6.0 +. k2 /. 1.5
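
A quick way to check a hand simplification like this one is to compare
it numerically with the original. Since k3 = k2, the term (k2 + k3)/3
is 2*k2/3, which is written as `k2 /. 1.5` below. This is only a
sketch: the names `rk4_step_orig`, `rk4_step_simple` and `max_diff`,
the value of `h`, and the sample points are chosen for illustration.

```ocaml
let h = 0.1

(* Original shape: y' ignores its second argument. *)
let y' t _y = cos t

let rk4_step_orig y t =
  let k1 = h *. y' t y in
  let k2 = h *. y' (t +. 0.5 *. h) (y +. 0.5 *. k1) in
  let k3 = h *. y' (t +. 0.5 *. h) (y +. 0.5 *. k2) in
  let k4 = h *. y' (t +. h) (y +. k3) in
  y +. (k1 +. k4) /. 6.0 +. (k2 +. k3) /. 3.0

(* Simplified: k3 = k2, so (k2 + k3) / 3 = 2 * k2 / 3 = k2 / 1.5. *)
let rk4_step_simple y t =
  let k1 = h *. cos t in
  let k2 = h *. cos (t +. 0.5 *. h) in
  let k4 = h *. cos (t +. h) in
  y +. (k1 +. k4) /. 6.0 +. k2 /. 1.5

(* Largest absolute difference over a few sample points. *)
let max_diff =
  List.fold_left
    (fun acc t ->
      let d = abs_float (rk4_step_orig 1.0 t -. rk4_step_simple 1.0 t) in
      if d > acc then d else acc)
    0.0 [ 0.0; 0.3; 1.0; 2.5; 10.0 ]

let () = Printf.printf "max difference: %g\n" max_diff
```

If the two versions disagree by more than rounding noise, the
simplification changed the method rather than just its cost.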


Finally, we don't want to pass `y` into `rk4_step` every time, as we
don't want to tie up an extra register for it.
After all these manual optimizations, we have the following program:

let h = 0.1

let exact t = sin t

(* One RK4 increment for y' = cos t; it depends only on t. *)
let rk4_step t =
  let k1 = h *. cos t in
  let k2 = h *. cos (t +. 0.5*.h) in
  let k4 = h *. cos (t +. h) in
  (* k3 = k2, so (k2+.k3)/.3.0 = k2/.1.5 *)
  (k1+.k4)/.6.0 +. k2/.1.5

let compute steps =
  let rec loop n y t =
    if n < steps
    then loop (n+1) (y +. rk4_step t) (t +. h)
    else y in
  loop 1 1.0 0.0

let () =
  let y = compute 102 in
  let err = abs_float (y -. (exact ((float_of_int 102) *. h))) in
  let large = 50000000 in
  let y = compute large in
  Printf.printf "%b\n"
    (abs_float (y -. (exact ((float_of_int large) *. h))) < 2. *. err)


This program has the same performance as the C one... unless I pass
really aggressive optimization options to the C compiler, which make
it emit platform-specific code, e.g.,

    gcc rk.c -lm -O3 -march=corei7-avx -o rksse


These options basically double the performance of the C version,
leaving OCaml lagging behind. That is because OCaml, obviously, cannot
keep up with the latest developments in Intel CPUs, especially in the
area of SSE.

The fun part is that when I tried to compile the same file with clang,
the resulting program was even slower than the original non-optimized
OCaml. But this is all micro-benchmarking, of course, so don't jump to
conclusions (although I like to think that OCaml is faster than
Clang :))


As a final remark, my experience in HPC shows that in general you
should not rely on compiler optimizations and hope that the compiler
will do the magic for you. Even the GCC compiler. It would be very easy
to accidentally change the above program in such a way that the
optimizations no longer fire. Of course, writing in assembly is not an
option either. If you really need to optimize, then you should find the
performance bottleneck and optimize it manually until you get the
performance you expect. Alternatively, you can use plain old Fortran
to get reliable performance, and then call it from C or OCaml.
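
Finding the bottleneck starts with measuring. As a rough sketch,
`Sys.time` from the standard library is enough for coarse comparisons
between variants. The `time` helper and the `sum_cos` workload below
are made up for illustration and are not part of the original programs.

```ocaml
(* Run `f x` once and return its result together with the elapsed
   processor time in seconds, as reported by Sys.time. *)
let time f x =
  let t0 = Sys.time () in
  let r = f x in
  let t1 = Sys.time () in
  (r, t1 -. t0)

(* Hypothetical workload standing in for the real integration loop. *)
let sum_cos n =
  let rec go i acc =
    if i < n then go (i + 1) (acc +. cos (float_of_int i *. 0.1))
    else acc
  in
  go 0 0.0

let () =
  let r, dt = time sum_cos 1_000_000 in
  Printf.printf "result = %g, cpu time = %.3fs\n" r dt
```

A single run like this is noisy, so repeat it a few times and compare
the best or median time before drawing conclusions.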


Best wishes,
Ivan


On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann <info@gerd-stolpmann.de>
wrote:

> Hi Christoph,
>
> the extra code looks very much like an allocation on the minor heap:
>
> sub    $0x10,%r15
> lea    0x25c7b6(%rip),%rax
> cmp    (%rax),%r15
> jb     404a8a <dlerror@plt+0x2d0a>
> lea    0x8(%r15),%rax
> movq   $0x4fd,-0x8(%rax)
>
> r15 points to the used area of the minor heap - by decrementing it you
> get an additional block of memory. It is compared against the beginning
> of the heap to check whether GC is needed. The constant 0x4fd is the
> header of the new block (which must be always initialized).
>
> From the source code, it remains unclear for what this is used.
> Obviously, the compiler runs out of registers, and moves some values to
> the minor heap (temporarily). When you call a C function like cos it is
> likely that this happens because the C calling conventions do not
> preserve the FP registers (xmm*). This could be improved if the OCaml
> compiler tried alternate places for temporarily storing FP values:
>
>  - int registers (which is perfectly possible on 64 bit platforms).
>    A number of int registers survive C calls.
>  - stack
>
> To my knowledge, the OCaml compiler never tries this (but this could be
> out of date). This is a fairly specific optimization that makes mostly
> sense for purely iterating or aggregating functions like yours that do
> not store FP values away.
>
> Gerd
>
> On Saturday, 17.12.2016, at 14:02 +0100, Christoph Höger wrote:
> > Oops. Forgot the actual examples.
> >
> > On 17.12.2016 at 14:01, Christoph Höger wrote:
> > >
> > > Dear all,
> > >
> > > find attached two simple runge-kutta iteration schemes. One is
> > > written in C, the other in OCaml. I compared the runtime of both
> > > and gcc (-O2) produces an executable that is roughly 30% faster
> > > (to be more precise: 3.52s vs. 2.63s). That is in itself quite
> > > pleasing, I think. I do not understand however, what causes this
> > > difference. Admittedly, the generated assembly looks completely
> > > different, but both compilers inline all functions and generate
> > > one big loop. Ocaml generates a lot more scaffolding, but that is
> > > to be expected.
> > >
> > > There is however an interesting particularity: OCaml generates 6
> > > calls to cos, while gcc only needs 3 (and one direct jump).
> > > Surprisingly, there are also calls to cosh, acos and pretty much
> > > any other trigonometric function (initialization of constants,
> > > maybe?)
> > >
> > > However, the true culprit seems to be an excess of instructions
> > > between the different calls to cos. This is what happens between
> > > the first two calls to cos:
> > >
> > > gcc:
> > > jmpq   400530 <cos@plt>
> > > nop
> > > nopw   %cs:0x0(%rax,%rax,1)
> > >
> > > sub    $0x38,%rsp
> > > movsd  %xmm0,0x10(%rsp)
> > > movapd %xmm1,%xmm0
> > > movsd  %xmm2,0x18(%rsp)
> > > movsd  %xmm1,0x8(%rsp)
> > > callq  400530 <cos@plt>
> > >
> > > ocamlopt:
> > >
> > > callq  401a60 <cos@plt>
> > > mulsd  (%r12),%xmm0
> > > movsd  %xmm0,0x10(%rsp)
> > > sub    $0x10,%r15
> > > lea    0x25c7b6(%rip),%rax
> > > cmp    (%rax),%r15
> > > jb     404a8a <dlerror@plt+0x2d0a>
> > > lea    0x8(%r15),%rax
> > > movq   $0x4fd,-0x8(%rax)
> > >
> > > movsd  0x32319(%rip),%xmm1
> > >
> > > movapd %xmm1,%xmm2
> > > mulsd  %xmm0,%xmm2
> > > addsd  0x0(%r13),%xmm2
> > > movsd  %xmm2,(%rax)
> > > movapd %xmm1,%xmm0
> > > mulsd  (%r12),%xmm0
> > > addsd  (%rbx),%xmm0
> > > callq  401a60 <cos@plt>
> > >
> > >
> > > Is this caused by some underlying difference in the representation
> > > of numeric values (i.e. tagged ints) or is it reasonable to attack
> > > this issue as a hobby experiment?
> > >
> > >
> > > thanks for any advice,
> > >
> > > Christoph
> > >
> >
> --
> ------------------------------------------------------------
> Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
> My OCaml site:          http://www.camlcity.org
> Contact details:        http://www.camlcity.org/contact.html
> Company homepage:       http://www.gerd-stolpmann.de
> ------------------------------------------------------------
>
>
>

--f403045ea6da08c55a054404db07--