From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gabriel.scherer@gmail.com>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105])
	by sympa.inria.fr (Postfix) with ESMTPS id 5DC137ED26
	for <caml-list@sympa.inria.fr>; Thu,  7 Jun 2012 21:07:59 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AkcBAJL70E/RVaC2kGdsb2JhbABFo02QVggiAQEBAQkJDQcUBCOCGAEBAQQSAhMZARsSCwEDDAYFCw0NCBkiAREBBQEKEgYTEhCHWgEDCwuaZwkDjCKCcIR+ChknAwpXiHEBBQyLEYJngyUDlR2BEo0LPoQB
X-IronPort-AV: E=Sophos;i="4.75,732,1330902000"; 
   d="scan'208";a="146953251"
Received: from mail-gh0-f182.google.com ([209.85.160.182])
  by mail4-smtp-sop.national.inria.fr with ESMTP/TLS/RC4-MD5; 07 Jun 2012 21:07:58 +0200
Received: by ghbz22 with SMTP id z22so1069879ghb.27
        for <caml-list@inria.fr>; Thu, 07 Jun 2012 12:07:57 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20120113;
        h=mime-version:in-reply-to:references:from:date:message-id:subject:to
         :cc:content-type:content-transfer-encoding;
        bh=EwzYEsVTVTdVrq//veEiVQmuhRb9j9svG0/pC320a80=;
        b=BI51SaIZrOkRwmhC8weCF3QWQgaruKruHeBXhjI7GSxubaWYqnGEPZimxgKCR1n4vi
         iWL2IeEq1rJ1WaXSPxv09VNldgtUQ5q7rum9UNOhikSsvCDU0i32Uc0CWNOMfaoN+97H
         VJHGC/mUHWyWU/u+uEiRzB8YL0LorRS0EA5IbK4uiP1NeooCo+pMkpj1JYOjas8VTA07
         KtYw6lE2WyeL2oHp4WdtHpstB5gBYeyVGU0ZFDncGd08AkG205j2MaCmE5iAopmdjgYa
         9isLD5ryjn2OOe5oZAmZpHc/dxxYGY9cpcjsdmPUQYMKz+EFVQW/gy2+JjCV0ajbOceL
         Ezhg==
Received: by 10.50.217.137 with SMTP id oy9mr1948420igc.56.1339096077247; Thu,
 07 Jun 2012 12:07:57 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.50.94.39 with HTTP; Thu, 7 Jun 2012 12:07:17 -0700 (PDT)
In-Reply-To: <CE0EBC4C252E4B178BD00227099BA4D4@erratique.ch>
References: <CE0EBC4C252E4B178BD00227099BA4D4@erratique.ch>
From: Gabriel Scherer <gabriel.scherer@gmail.com>
Date: Thu, 7 Jun 2012 21:07:17 +0200
Message-ID: <CAPFanBHNR7zhMmmZ4qeiyhuF9nyQisDqBdqLQX258LGx-SogpQ@mail.gmail.com>
To: =?ISO-8859-1?Q?Daniel_B=FCnzli?= <daniel.buenzli@erratique.ch>
Cc: caml-list <caml-list@inria.fr>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Caml-list] Local functions with arguments or closures ?

I do not know these parts of the compiler well, but here is how
I investigated your question.

TL;DR: for toy tail-recursive code that all fits in register, the
first one might be slightly faster. It's crazy to take this into
account to write code, except in a hot loop, premature optimization
blah blah blah.

1. I had to come up with a compilable piece of code that would be
reasonably representative of what you're trying to do. It actually was
the hardest part. Here is my attempt:

  let f1 x =3D
    let v =3D !x in
    let rec loop1 x v =3D function
      | 0 -> ()
      | n ->
        x :=3D !x + v;
        loop1 x v (n - 1);
    in
    loop1 x v 10

  let f2 x =3D
    let v =3D !x in
    let rec loop2 =3D function
      | 0 -> ()
      | n ->
        x :=3D !x + v;
        loop2 (n - 1);
    in
    loop2 10

2. Using the "print the intermediate representation" features of the
native compiler, I looked at my favorite intermediate
representations: -dlambda and -dcmm. In -dlambda, there is no
difference (but the passing of more arguments), variables are accessed
by names like in the source code (-dlambda output is
very readable). -dcmm shows a difference:

  (function camlOptim__loop1_1033 (x/1034: addr v/1035: addr n/1036: addr)
   (if (!=3D n/1036 1)
     (seq (store x/1034 (+ (+ (load x/1034) v/1035) -1))
       (app "camlOptim__loop1_1033" x/1034 v/1035 (+ n/1036 -2) addr))
     1a))

  (function camlOptim__loop2_1040 (n/1041: addr env/1049: addr)
   (if (!=3D n/1041 1)
     (seq
       (store (load (+a env/1049 8))
         (+ (+ (load (load (+a env/1049 8))) (load (+a env/1049 12))) -1))
       (app "camlOptim__loop2_1040" (+ n/1041 -2) env/1049 addr))
     1a))

The first function uses the argument directly, while the second one
loads them from an "env" argument corresponding, I suppose, to closure
conversion.

3. I finally checked that lower-level code preserved this difference
(it could have optimized it away). Instead of reading the assembly
output of -S, I read -dlinear which has remnants of variable names and
is therefore more readable.

  *** Linearized code
  camlOptim__loop1_1033:
    if n/10[%ecx] =3D=3Ds 1 goto L100
    A/12[%edx] :=3D [x/8[%eax]]
    I/13[%edx] :=3D A/12[%edx] + v/9[%ebx] + -1
    [x/8[%eax]] :=3D I/13[%edx]
    I/14[%ecx] :=3D I/14[%ecx] + -2
    tailcall "camlOptim__loop1_1033" R/0[%eax] R/1[%ebx] R/2[%ecx]
    L100:
    A/11[%eax] :=3D 1
    return R/0[%eax]

  *** Linearized code
  camlOptim__loop2_1040:
    if n/8[%eax] =3D=3Ds 1 goto L102
    A/11[%esi] :=3D [env/9[%ebx] + 8]
    A/12[%edx] :=3D [env/9[%ebx] + 12]
    A/13[%ecx] :=3D [env/9[%ebx] + 8]
    A/14[%ecx] :=3D [A/13[%ecx]]
    I/15[%ecx] :=3D A/14[%ecx] + A/12[%edx] + -1
    [A/11[%esi]] :=3D I/15[%ecx]
    I/16[%eax] :=3D I/16[%eax] + -2
    tailcall "camlOptim__loop2_1040" R/0[%eax] R/1[%ebx]
    L102:
    A/10[%eax] :=3D 1
    return R/0[%eax]

The first code "looks" more efficient, because it does less loads.

4. Reflexion phase. Is the difference noticeable? Your question seems
to indicate that it is note. Here are a few reasons why that could be
the case
:
- magic magic, talking about performance without benchmarking is silly
- the processor is good at caching/fetching/whatever and reduces the
  cost of those loads drastically
- for non-toy code, the performance different becomes neglectible
  (after all, it's just a few loads)
- for a non-tail-recursive function, there is an argument passing cost
  that is comparable to the cost of the loads

The last hypothesis is easy to check (if you call reading low-level
code without running it "checking"): just swap the order of the
recursive call and the asignment:

  let f1 x =3D
    let v =3D !x in
    let rec loop1 x v =3D function
      | 0 -> ()
      | n ->
        loop1 x v (n - 1);
        x :=3D !x + v;
    in
    loop1 x v 10

  let f2 x =3D
    let v =3D !x in
    let rec loop2 =3D function
      | 0 -> ()
      | n ->
        loop2 (n - 1);
        x :=3D !x + v;
    in
    loop2 10

As was to be expected, the non-tail-call messes with the fine register
allocation we had, and the first version gets two spills (one for "x"
and one for "v"), while the second one only spills the environment
(basically a record of "x" and "v"):

  camlOptim__loop1_1033:
    if n/10[%ecx] =3D=3Ds 1 goto L100
    spilled-v/16[s1] :=3D v/9[%ebx] (spill)
    spilled-x/17[s0] :=3D x/8[%eax] (spill)
    I/12[%ecx] :=3D I/12[%ecx] + -2
    {spilled-v/16[s1]* spilled-x/17[s0]*}
    call "camlOptim__loop1_1033" R/0[%eax] R/1[%ebx] R/2[%ecx]
    x/18[%ebx] :=3D spilled-x/17[s0] (reload)
    A/13[%ecx] :=3D [x/18[%ebx]]
    v/19[%eax] :=3D spilled-v/16[s1] (reload)
    I/14[%eax] :=3D A/13[%ecx] + v/19[%eax] + -1
    [x/18[%ebx]] :=3D I/14[%eax]
    A/15[%eax] :=3D 1
    reload retaddr
    return R/0[%eax]
    L100:
    A/11[%eax] :=3D 1
    reload retaddr
    return R/0[%eax]

  camlOptim__loop2_1040:
    if n/8[%eax] =3D=3Ds 1 goto L103
    spilled-env/18[s0] :=3D env/9[%ebx] (spill)
    I/11[%eax] :=3D I/11[%eax] + -2
    {spilled-env/18[s0]*}
    call "camlOptim__loop2_1040" R/0[%eax] R/1[%ebx]
    env/19[%eax] :=3D spilled-env/18[s0] (reload)
    A/12[%ecx] :=3D [env/19[%eax] + 8]
    A/13[%ebx] :=3D [env/19[%eax] + 12]
    A/14[%eax] :=3D [env/19[%eax] + 8]
    A/15[%eax] :=3D [A/14[%eax]]
    I/16[%eax] :=3D A/15[%eax] + A/13[%ebx] + -1
    [A/12[%ecx]] :=3D I/16[%eax]
    A/17[%eax] :=3D 1
    reload retaddr
    return R/0[%eax]
    L103:
    A/10[%eax] :=3D 1
    reload retaddr
    return R/0[%eax]

You get two spills and two reloads in the first version, vs. one
spill, one releoad, and two loads from the env in the second
version. It is not the case anymore that the first one is "clearly"
better.

On Thu, Jun 7, 2012 at 7:34 PM, Daniel B=FCnzli
<daniel.buenzli@erratique.ch> wrote:
>
> Hello,
>
> In the past I remember having indirectly benchmarked two different styles=
 for writing local functions :
>
> =A0 =A0let f x =3D
> =A0 =A0 =A0let v =3D ... in
> =A0 =A0 =A0let rec loop x v =3D ... in
> =A0 =A0 =A0loop x v
>
> or
>
> =A0 =A0let f x =3D
> =A0 =A0 =A0 =A0let v =3D ... in
> =A0 =A0 =A0 =A0let rec loop () =3D ... (* uses v and x *) in
> =A0 =A0 =A0 =A0loop ()
>
> without being able to reach a conclusion. Is there any particular style t=
hat is definitively faster ? Does it maybe depend on the number of argument=
s and/or on whether loop is recursive or not ?
>
> This question keeps coming back in my mind when I write local functions..=
. I'm sure someone with some knowledge of the compiler's internals can prov=
ide a more reasonable answer than benchmarks.
>
> Best,
>
> Daniel
>
>
>
>
>
> --
> Caml-list mailing list. =A0Subscription management and archives:
> https://sympa-roc.inria.fr/wws/info/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>