From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by sympa.inria.fr (Postfix) with ESMTPS id 8FE307ED26 for ; Thu, 7 Jun 2012 22:09:59 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AhwDAGEJ0U/U4367k2dsb2JhbABFo1KQXiIBAQEBCQkLCRQDJIIYAQEFJ0cLMw0IVQkSBhMJCYdrAw8Hr0EDiVgUiwmCW4MlA40miQmRSg X-IronPort-AV: E=Sophos;i="4.75,732,1330902000"; d="scan'208";a="161827110" Received: from moutng.kundenserver.de ([212.227.126.187]) by mail1-smtp-roc.national.inria.fr with ESMTP; 07 Jun 2012 22:09:59 +0200 Received: from office1.lan.sumadev.de (dslb-094-219-220-051.pools.arcor-ip.net [94.219.220.51]) by mrelayeu.kundenserver.de (node=mreu2) with ESMTP (Nemesis) id 0MF8TX-1Sjdr40pQZ-00G0hT; Thu, 07 Jun 2012 22:09:57 +0200 Received: from samsung (dslb-178-004-232-062.pools.arcor-ip.net [178.4.232.62]) by office1.lan.sumadev.de (Postfix) with ESMTPSA id 10BF1C00D1; Thu, 7 Jun 2012 22:09:55 +0200 (CEST) Date: Thu, 07 Jun 2012 22:09:54 +0200 From: Gerd Stolpmann To: Gabriel Scherer Cc: Daniel =?iso-8859-1?q?B=FCnzli?= , caml-list In-Reply-To: (from gabriel.scherer@gmail.com on Thu Jun 7 21:07:17 2012) X-Mailer: Balsa 2.4.11 Message-Id: <1339099794.4950.8@samsung> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15; DelSp=Yes; Format=Flowed Content-Disposition: inline Content-Transfer-Encoding: quoted-printable X-Provags-ID: V02:K0:HctNdP/A/OtDiyVwbsZeVkX8PWdaOFUKHdT6oXIJVfx 7/3bLUW4AMEYoAdtDiAROhFpn/onF4AGdSpwTDylWkrYR0V14v 071vCSnahR93eo792e5OFFXApBmbbsrz+2hRVyZxWPzWP+ZxZT EdH880Z5Poryh+f2InhKMHfcZ1z8LYmhLCyeZ9QyHXZBfMUx+E g8YbfwppJdxhhY/RcTRLjHuEpN7i2xTgHWfXfd/VlvQt6wB/lM V+KLUYSZRoWTililgcx/41L0qbCQSWfy2HzvEoEkY6U/TeaLFa rEuf07HNs6n2emLzrPwL33OmsPPd+ukBg1VRpmW1J1/k1EOfhT N7TQISngzQ3Nt/2IxAnw= Subject: AW: [Caml-list] Local functions with arguments or closures ? I think it depends very much would you do inside your loop. In=20=20 Gabriel's example, there is only an addition, so you will probably see=20= =20 a difference in runtime. But that's not what you do in practice here.=20=20 Imagine the loop iterates over a data structure, and let's assume it is=20= =20 big enough that it is not completely cached. Then, the few extra=20=20 register loads will make no difference, they are just hidden by the=20=20 speculative execution done by modern CPUs (i.e. the loads are done=20=20 while the CPU waits for the data from RAM). So yes, passing the arguments explicitly is probably only worth for=20=20 extremely simple cases that do not access data massively, but otherwise=20= =20 use the style you like. Gerd Am 07.06.2012 21:07:17 schrieb(en) Gabriel Scherer: > I do not know these parts of the compiler well, but here is how > I investigated your question. >=20 > TL;DR: for toy tail-recursive code that all fits in register, the > first one might be slightly faster. It's crazy to take this into > account to write code, except in a hot loop, premature optimization > blah blah blah. >=20 > 1. I had to come up with a compilable piece of code that would be > reasonably representative of what you're trying to do. It actually was > the hardest part. Here is my attempt: >=20 > let f1 x =3D > let v =3D !x in > let rec loop1 x v =3D function > | 0 -> () > | n -> > x :=3D !x + v; > loop1 x v (n - 1); > in > loop1 x v 10 >=20 > let f2 x =3D > let v =3D !x in > let rec loop2 =3D function > | 0 -> () > | n -> > x :=3D !x + v; > loop2 (n - 1); > in > loop2 10 >=20 > 2. Using the "print the intermediate representation" features of the > native compiler, I looked at my favorite intermediate > representations: -dlambda and -dcmm. In -dlambda, there is no > difference (but the passing of more arguments), variables are accessed > by names like in the source code (-dlambda output is > very readable). -dcmm shows a difference: >=20 > (function camlOptim__loop1_1033 (x/1034: addr v/1035: addr n/1036:=20= =20 > addr) > (if (!=3D n/1036 1) > (seq (store x/1034 (+ (+ (load x/1034) v/1035) -1)) > (app "camlOptim__loop1_1033" x/1034 v/1035 (+ n/1036 -2) addr)) > 1a)) >=20 > (function camlOptim__loop2_1040 (n/1041: addr env/1049: addr) > (if (!=3D n/1041 1) > (seq > (store (load (+a env/1049 8)) > (+ (+ (load (load (+a env/1049 8))) (load (+a env/1049 12)))=20= =20 > -1)) > (app "camlOptim__loop2_1040" (+ n/1041 -2) env/1049 addr)) > 1a)) >=20 > The first function uses the argument directly, while the second one > loads them from an "env" argument corresponding, I suppose, to closure > conversion. >=20 > 3. I finally checked that lower-level code preserved this difference > (it could have optimized it away). Instead of reading the assembly > output of -S, I read -dlinear which has remnants of variable names and > is therefore more readable. >=20 > *** Linearized code > camlOptim__loop1_1033: > if n/10[%ecx] =3D=3Ds 1 goto L100 > A/12[%edx] :=3D [x/8[%eax]] > I/13[%edx] :=3D A/12[%edx] + v/9[%ebx] + -1 > [x/8[%eax]] :=3D I/13[%edx] > I/14[%ecx] :=3D I/14[%ecx] + -2 > tailcall "camlOptim__loop1_1033" R/0[%eax] R/1[%ebx] R/2[%ecx] > L100: > A/11[%eax] :=3D 1 > return R/0[%eax] >=20 > *** Linearized code > camlOptim__loop2_1040: > if n/8[%eax] =3D=3Ds 1 goto L102 > A/11[%esi] :=3D [env/9[%ebx] + 8] > A/12[%edx] :=3D [env/9[%ebx] + 12] > A/13[%ecx] :=3D [env/9[%ebx] + 8] > A/14[%ecx] :=3D [A/13[%ecx]] > I/15[%ecx] :=3D A/14[%ecx] + A/12[%edx] + -1 > [A/11[%esi]] :=3D I/15[%ecx] > I/16[%eax] :=3D I/16[%eax] + -2 > tailcall "camlOptim__loop2_1040" R/0[%eax] R/1[%ebx] > L102: > A/10[%eax] :=3D 1 > return R/0[%eax] >=20 > The first code "looks" more efficient, because it does less loads. >=20 > 4. Reflexion phase. Is the difference noticeable? Your question seems > to indicate that it is note. Here are a few reasons why that could be > the case > : > - magic magic, talking about performance without benchmarking is silly > - the processor is good at caching/fetching/whatever and reduces the > cost of those loads drastically > - for non-toy code, the performance different becomes neglectible > (after all, it's just a few loads) > - for a non-tail-recursive function, there is an argument passing cost > that is comparable to the cost of the loads >=20 > The last hypothesis is easy to check (if you call reading low-level > code without running it "checking"): just swap the order of the > recursive call and the asignment: >=20 > let f1 x =3D > let v =3D !x in > let rec loop1 x v =3D function > | 0 -> () > | n -> > loop1 x v (n - 1); > x :=3D !x + v; > in > loop1 x v 10 >=20 > let f2 x =3D > let v =3D !x in > let rec loop2 =3D function > | 0 -> () > | n -> > loop2 (n - 1); > x :=3D !x + v; > in > loop2 10 >=20 > As was to be expected, the non-tail-call messes with the fine register > allocation we had, and the first version gets two spills (one for "x" > and one for "v"), while the second one only spills the environment > (basically a record of "x" and "v"): >=20 > camlOptim__loop1_1033: > if n/10[%ecx] =3D=3Ds 1 goto L100 > spilled-v/16[s1] :=3D v/9[%ebx] (spill) > spilled-x/17[s0] :=3D x/8[%eax] (spill) > I/12[%ecx] :=3D I/12[%ecx] + -2 > {spilled-v/16[s1]* spilled-x/17[s0]*} > call "camlOptim__loop1_1033" R/0[%eax] R/1[%ebx] R/2[%ecx] > x/18[%ebx] :=3D spilled-x/17[s0] (reload) > A/13[%ecx] :=3D [x/18[%ebx]] > v/19[%eax] :=3D spilled-v/16[s1] (reload) > I/14[%eax] :=3D A/13[%ecx] + v/19[%eax] + -1 > [x/18[%ebx]] :=3D I/14[%eax] > A/15[%eax] :=3D 1 > reload retaddr > return R/0[%eax] > L100: > A/11[%eax] :=3D 1 > reload retaddr > return R/0[%eax] >=20 > camlOptim__loop2_1040: > if n/8[%eax] =3D=3Ds 1 goto L103 > spilled-env/18[s0] :=3D env/9[%ebx] (spill) > I/11[%eax] :=3D I/11[%eax] + -2 > {spilled-env/18[s0]*} > call "camlOptim__loop2_1040" R/0[%eax] R/1[%ebx] > env/19[%eax] :=3D spilled-env/18[s0] (reload) > A/12[%ecx] :=3D [env/19[%eax] + 8] > A/13[%ebx] :=3D [env/19[%eax] + 12] > A/14[%eax] :=3D [env/19[%eax] + 8] > A/15[%eax] :=3D [A/14[%eax]] > I/16[%eax] :=3D A/15[%eax] + A/13[%ebx] + -1 > [A/12[%ecx]] :=3D I/16[%eax] > A/17[%eax] :=3D 1 > reload retaddr > return R/0[%eax] > L103: > A/10[%eax] :=3D 1 > reload retaddr > return R/0[%eax] >=20 > You get two spills and two reloads in the first version, vs. one > spill, one releoad, and two loads from the env in the second > version. It is not the case anymore that the first one is "clearly" > better. >=20 > On Thu, Jun 7, 2012 at 7:34 PM, Daniel B=FCnzli > wrote: > > > > Hello, > > > > In the past I remember having indirectly benchmarked two different=20= =20 > styles for writing local functions : > > > > =A0 =A0let f x =3D > > =A0 =A0 =A0let v =3D ... in > > =A0 =A0 =A0let rec loop x v =3D ... in > > =A0 =A0 =A0loop x v > > > > or > > > > =A0 =A0let f x =3D > > =A0 =A0 =A0 =A0let v =3D ... in > > =A0 =A0 =A0 =A0let rec loop () =3D ... (* uses v and x *) in > > =A0 =A0 =A0 =A0loop () > > > > without being able to reach a conclusion. Is there any particular=20=20 > style that is definitively faster ? Does it maybe depend on the=20=20 > number of arguments and/or on whether loop is recursive or not ? > > > > This question keeps coming back in my mind when I write local=20=20 > functions... I'm sure someone with some knowledge of the compiler's=20=20 > internals can provide a more reasonable answer than benchmarks. > > > > Best, > > > > Daniel > > > > > > > > > > > > -- > > Caml-list mailing list. =A0Subscription management and archives: > > https://sympa-roc.inria.fr/wws/info/caml-list > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/caml-bugs > > >=20 > -- > Caml-list mailing list. Subscription management and archives: > https://sympa-roc.inria.fr/wws/info/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs >=20 >=20 >=20 --=20 ------------------------------------------------------------ Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de Creator of GODI and camlcity.org. Contact details: http://www.camlcity.org/contact.html Company homepage: http://www.gerd-stolpmann.de ------------------------------------------------------------=