From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <info@gerd-stolpmann.de>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by sympa.inria.fr (Postfix) with ESMTPS id 8FE307ED26
	for <caml-list@sympa.inria.fr>; Thu,  7 Jun 2012 22:09:59 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AhwDAGEJ0U/U4367k2dsb2JhbABFo1KQXiIBAQEBCQkLCRQDJIIYAQEFJ0cLMw0IVQkSBhMJCYdrAw8Hr0EDiVgUiwmCW4MlA40miQmRSg
X-IronPort-AV: E=Sophos;i="4.75,732,1330902000"; 
   d="scan'208";a="161827110"
Received: from moutng.kundenserver.de ([212.227.126.187])
  by mail1-smtp-roc.national.inria.fr with ESMTP; 07 Jun 2012 22:09:59 +0200
Received: from office1.lan.sumadev.de (dslb-094-219-220-051.pools.arcor-ip.net [94.219.220.51])
	by mrelayeu.kundenserver.de (node=mreu2) with ESMTP (Nemesis)
	id 0MF8TX-1Sjdr40pQZ-00G0hT; Thu, 07 Jun 2012 22:09:57 +0200
Received: from samsung (dslb-178-004-232-062.pools.arcor-ip.net [178.4.232.62])
	by office1.lan.sumadev.de (Postfix) with ESMTPSA id 10BF1C00D1;
	Thu,  7 Jun 2012 22:09:55 +0200 (CEST)
Date: Thu, 07 Jun 2012 22:09:54 +0200
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Gabriel Scherer <gabriel.scherer@gmail.com>
Cc: Daniel =?iso-8859-1?q?B=FCnzli?= <daniel.buenzli@erratique.ch>,
	caml-list <caml-list@inria.fr>
In-Reply-To: <CAPFanBHNR7zhMmmZ4qeiyhuF9nyQisDqBdqLQX258LGx-SogpQ@mail.gmail.com>
	(from gabriel.scherer@gmail.com on Thu Jun  7 21:07:17 2012)
X-Mailer: Balsa 2.4.11
Message-Id: <1339099794.4950.8@samsung>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; DelSp=Yes; Format=Flowed
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
X-Provags-ID: V02:K0:HctNdP/A/OtDiyVwbsZeVkX8PWdaOFUKHdT6oXIJVfx
 7/3bLUW4AMEYoAdtDiAROhFpn/onF4AGdSpwTDylWkrYR0V14v
 071vCSnahR93eo792e5OFFXApBmbbsrz+2hRVyZxWPzWP+ZxZT
 EdH880Z5Poryh+f2InhKMHfcZ1z8LYmhLCyeZ9QyHXZBfMUx+E
 g8YbfwppJdxhhY/RcTRLjHuEpN7i2xTgHWfXfd/VlvQt6wB/lM
 V+KLUYSZRoWTililgcx/41L0qbCQSWfy2HzvEoEkY6U/TeaLFa
 rEuf07HNs6n2emLzrPwL33OmsPPd+ukBg1VRpmW1J1/k1EOfhT
 N7TQISngzQ3Nt/2IxAnw=
Subject: AW: [Caml-list] Local functions with arguments or closures ?

I think it depends very much would you do inside your loop. In=20=20
Gabriel's example, there is only an addition, so you will probably see=20=
=20
a difference in runtime. But that's not what you do in practice here.=20=20
Imagine the loop iterates over a data structure, and let's assume it is=20=
=20
big enough that it is not completely cached. Then, the few extra=20=20
register loads will make no difference, they are just hidden by the=20=20
speculative execution done by modern CPUs (i.e. the loads are done=20=20
while the CPU waits for the data from RAM).

So yes, passing the arguments explicitly is probably only worth for=20=20
extremely simple cases that do not access data massively, but otherwise=20=
=20
use the style you like.

Gerd

Am 07.06.2012 21:07:17 schrieb(en) Gabriel Scherer:
> I do not know these parts of the compiler well, but here is how
> I investigated your question.
>=20
> TL;DR: for toy tail-recursive code that all fits in register, the
> first one might be slightly faster. It's crazy to take this into
> account to write code, except in a hot loop, premature optimization
> blah blah blah.
>=20
> 1. I had to come up with a compilable piece of code that would be
> reasonably representative of what you're trying to do. It actually was
> the hardest part. Here is my attempt:
>=20
>   let f1 x =3D
>     let v =3D !x in
>     let rec loop1 x v =3D function
>       | 0 -> ()
>       | n ->
>         x :=3D !x + v;
>         loop1 x v (n - 1);
>     in
>     loop1 x v 10
>=20
>   let f2 x =3D
>     let v =3D !x in
>     let rec loop2 =3D function
>       | 0 -> ()
>       | n ->
>         x :=3D !x + v;
>         loop2 (n - 1);
>     in
>     loop2 10
>=20
> 2. Using the "print the intermediate representation" features of the
> native compiler, I looked at my favorite intermediate
> representations: -dlambda and -dcmm. In -dlambda, there is no
> difference (but the passing of more arguments), variables are accessed
> by names like in the source code (-dlambda output is
> very readable). -dcmm shows a difference:
>=20
>   (function camlOptim__loop1_1033 (x/1034: addr v/1035: addr n/1036:=20=
=20
> addr)
>    (if (!=3D n/1036 1)
>      (seq (store x/1034 (+ (+ (load x/1034) v/1035) -1))
>        (app "camlOptim__loop1_1033" x/1034 v/1035 (+ n/1036 -2) addr))
>      1a))
>=20
>   (function camlOptim__loop2_1040 (n/1041: addr env/1049: addr)
>    (if (!=3D n/1041 1)
>      (seq
>        (store (load (+a env/1049 8))
>          (+ (+ (load (load (+a env/1049 8))) (load (+a env/1049 12)))=20=
=20
> -1))
>        (app "camlOptim__loop2_1040" (+ n/1041 -2) env/1049 addr))
>      1a))
>=20
> The first function uses the argument directly, while the second one
> loads them from an "env" argument corresponding, I suppose, to closure
> conversion.
>=20
> 3. I finally checked that lower-level code preserved this difference
> (it could have optimized it away). Instead of reading the assembly
> output of -S, I read -dlinear which has remnants of variable names and
> is therefore more readable.
>=20
>   *** Linearized code
>   camlOptim__loop1_1033:
>     if n/10[%ecx] =3D=3Ds 1 goto L100
>     A/12[%edx] :=3D [x/8[%eax]]
>     I/13[%edx] :=3D A/12[%edx] + v/9[%ebx] + -1
>     [x/8[%eax]] :=3D I/13[%edx]
>     I/14[%ecx] :=3D I/14[%ecx] + -2
>     tailcall "camlOptim__loop1_1033" R/0[%eax] R/1[%ebx] R/2[%ecx]
>     L100:
>     A/11[%eax] :=3D 1
>     return R/0[%eax]
>=20
>   *** Linearized code
>   camlOptim__loop2_1040:
>     if n/8[%eax] =3D=3Ds 1 goto L102
>     A/11[%esi] :=3D [env/9[%ebx] + 8]
>     A/12[%edx] :=3D [env/9[%ebx] + 12]
>     A/13[%ecx] :=3D [env/9[%ebx] + 8]
>     A/14[%ecx] :=3D [A/13[%ecx]]
>     I/15[%ecx] :=3D A/14[%ecx] + A/12[%edx] + -1
>     [A/11[%esi]] :=3D I/15[%ecx]
>     I/16[%eax] :=3D I/16[%eax] + -2
>     tailcall "camlOptim__loop2_1040" R/0[%eax] R/1[%ebx]
>     L102:
>     A/10[%eax] :=3D 1
>     return R/0[%eax]
>=20
> The first code "looks" more efficient, because it does less loads.
>=20
> 4. Reflexion phase. Is the difference noticeable? Your question seems
> to indicate that it is note. Here are a few reasons why that could be
> the case
> :
> - magic magic, talking about performance without benchmarking is silly
> - the processor is good at caching/fetching/whatever and reduces the
>   cost of those loads drastically
> - for non-toy code, the performance different becomes neglectible
>   (after all, it's just a few loads)
> - for a non-tail-recursive function, there is an argument passing cost
>   that is comparable to the cost of the loads
>=20
> The last hypothesis is easy to check (if you call reading low-level
> code without running it "checking"): just swap the order of the
> recursive call and the asignment:
>=20
>   let f1 x =3D
>     let v =3D !x in
>     let rec loop1 x v =3D function
>       | 0 -> ()
>       | n ->
>         loop1 x v (n - 1);
>         x :=3D !x + v;
>     in
>     loop1 x v 10
>=20
>   let f2 x =3D
>     let v =3D !x in
>     let rec loop2 =3D function
>       | 0 -> ()
>       | n ->
>         loop2 (n - 1);
>         x :=3D !x + v;
>     in
>     loop2 10
>=20
> As was to be expected, the non-tail-call messes with the fine register
> allocation we had, and the first version gets two spills (one for "x"
> and one for "v"), while the second one only spills the environment
> (basically a record of "x" and "v"):
>=20
>   camlOptim__loop1_1033:
>     if n/10[%ecx] =3D=3Ds 1 goto L100
>     spilled-v/16[s1] :=3D v/9[%ebx] (spill)
>     spilled-x/17[s0] :=3D x/8[%eax] (spill)
>     I/12[%ecx] :=3D I/12[%ecx] + -2
>     {spilled-v/16[s1]* spilled-x/17[s0]*}
>     call "camlOptim__loop1_1033" R/0[%eax] R/1[%ebx] R/2[%ecx]
>     x/18[%ebx] :=3D spilled-x/17[s0] (reload)
>     A/13[%ecx] :=3D [x/18[%ebx]]
>     v/19[%eax] :=3D spilled-v/16[s1] (reload)
>     I/14[%eax] :=3D A/13[%ecx] + v/19[%eax] + -1
>     [x/18[%ebx]] :=3D I/14[%eax]
>     A/15[%eax] :=3D 1
>     reload retaddr
>     return R/0[%eax]
>     L100:
>     A/11[%eax] :=3D 1
>     reload retaddr
>     return R/0[%eax]
>=20
>   camlOptim__loop2_1040:
>     if n/8[%eax] =3D=3Ds 1 goto L103
>     spilled-env/18[s0] :=3D env/9[%ebx] (spill)
>     I/11[%eax] :=3D I/11[%eax] + -2
>     {spilled-env/18[s0]*}
>     call "camlOptim__loop2_1040" R/0[%eax] R/1[%ebx]
>     env/19[%eax] :=3D spilled-env/18[s0] (reload)
>     A/12[%ecx] :=3D [env/19[%eax] + 8]
>     A/13[%ebx] :=3D [env/19[%eax] + 12]
>     A/14[%eax] :=3D [env/19[%eax] + 8]
>     A/15[%eax] :=3D [A/14[%eax]]
>     I/16[%eax] :=3D A/15[%eax] + A/13[%ebx] + -1
>     [A/12[%ecx]] :=3D I/16[%eax]
>     A/17[%eax] :=3D 1
>     reload retaddr
>     return R/0[%eax]
>     L103:
>     A/10[%eax] :=3D 1
>     reload retaddr
>     return R/0[%eax]
>=20
> You get two spills and two reloads in the first version, vs. one
> spill, one releoad, and two loads from the env in the second
> version. It is not the case anymore that the first one is "clearly"
> better.
>=20
> On Thu, Jun 7, 2012 at 7:34 PM, Daniel B=FCnzli
> <daniel.buenzli@erratique.ch> wrote:
> >
> > Hello,
> >
> > In the past I remember having indirectly benchmarked two different=20=
=20
> styles for writing local functions :
> >
> > =A0 =A0let f x =3D
> > =A0 =A0 =A0let v =3D ... in
> > =A0 =A0 =A0let rec loop x v =3D ... in
> > =A0 =A0 =A0loop x v
> >
> > or
> >
> > =A0 =A0let f x =3D
> > =A0 =A0 =A0 =A0let v =3D ... in
> > =A0 =A0 =A0 =A0let rec loop () =3D ... (* uses v and x *) in
> > =A0 =A0 =A0 =A0loop ()
> >
> > without being able to reach a conclusion. Is there any particular=20=20
> style that is definitively faster ? Does it maybe depend on the=20=20
> number of arguments and/or on whether loop is recursive or not ?
> >
> > This question keeps coming back in my mind when I write local=20=20
> functions... I'm sure someone with some knowledge of the compiler's=20=20
> internals can provide a more reasonable answer than benchmarks.
> >
> > Best,
> >
> > Daniel
> >
> >
> >
> >
> >
> > --
> > Caml-list mailing list. =A0Subscription management and archives:
> > https://sympa-roc.inria.fr/wws/info/caml-list
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
> >
>=20
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa-roc.inria.fr/wws/info/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>=20
>=20
>=20


--=20
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------=