From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <yminsky@janestreet.com>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by sympa.inria.fr (Postfix) with ESMTPS id 912B67FA5F
	for <caml-list@sympa.inria.fr>; Thu, 19 Jan 2017 23:30:46 +0100 (CET)
Authentication-Results: mail3-smtp-sop.national.inria.fr; spf=None smtp.pra=yminsky@janestreet.com; spf=Pass smtp.mailfrom=yminsky@janestreet.com; spf=None smtp.helo=postmaster@mxout3.mail.janestreet.com
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  yminsky@janestreet.com) identity=pra;
  client-ip=38.105.200.229;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="yminsky@janestreet.com";
  x-sender="yminsky@janestreet.com";
  x-conformance=sidf_compatible
Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of
  yminsky@janestreet.com designates 38.105.200.229 as permitted
  sender) identity=mailfrom; client-ip=38.105.200.229;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="yminsky@janestreet.com";
  x-sender="yminsky@janestreet.com";
  x-conformance=sidf_compatible; x-record-type="v=spf1"
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  postmaster@mxout3.mail.janestreet.com) identity=helo;
  client-ip=38.105.200.229;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="yminsky@janestreet.com";
  x-sender="postmaster@mxout3.mail.janestreet.com";
  x-conformance=sidf_compatible
IronPort-PHdr: =?us-ascii?q?9a23=3Ako5v1RGiizdx3SNmoIBm2J1GYnF86YWxBRYc798d?=
 =?us-ascii?q?s5kLTJ7yp8iwAkXT6L1XgUPTWs2DsrQf2raQ7/qrATdIyK3CmUhKSIZLWR4BhJ?=
 =?us-ascii?q?detC0bK+nBN3fGKuX3ZTcxBsVIWQwt1Xi6NU9IBJS2PAWK8TW94jEIBxrwKxd+?=
 =?us-ascii?q?KPjrFY7OlcS30P2594HObwlSijewZbx/IA+5oAnPucUanYVvIbstxxXUpXdFZ/?=
 =?us-ascii?q?5Yzn5yK1KJmBb86Maw/Jp9/ClVpvks6c1OX7jkcqohVbBXAygoPG4z5M3wqBnM?=
 =?us-ascii?q?VhCP6WcGUmUXiRVHHQ7I5wznU5jrsyv6su192DSGPcDzULs5Vyiu47ttRRT1ky?=
 =?us-ascii?q?oMKSI3/3/LhcxxlKJboQyupxpjw47PfYqZMONycr7Bcd8GQGZMWMFeWTFcAoOn?=
 =?us-ascii?q?d4sAEfYOPfpWoYn6olsBtxq+BQ+xD+/rxTJFgnr60Ksn2OojDA7GxhQtENAAsH?=
 =?us-ascii?q?rUotv7N7ocX/6pw6fH1jjDc+pW1C3h5ITUbhwso/eBVq9wf8rLzkkvEhvIgVeK?=
 =?us-ascii?q?poz/ODOV0PkGvW+a7+pmTuKviG4moBx2rzmvw8csi4/JhpkWylHE7ih5wpw6Jd?=
 =?us-ascii?q?umR05gfd6kCoVfuD+GN4dsXswiRGRotT88x7Ybt5C7ey0Kx44mxx7Zc/GHco6I?=
 =?us-ascii?q?4gjiVOmLOzt4imhldKq/hhmo8Uigzer8WtOo31ZNqypIlMTHuHMV1xHL5MWKSe?=
 =?us-ascii?q?Fx8lq91TuPzQzf9P1ILVwumabFNZIsxqY8moQPvUnHBCP7m0X7gLWIekk59OWk?=
 =?us-ascii?q?8ebqbqvgq5SBLYF7kBv+Pb4rmsGnAeQ3LAwOX2+D9OS527zj+lD5QKlEg/Esl6?=
 =?us-ascii?q?nWqpHaJcABqq67GQBV1Jgs6w2jDze8ztsXg2UHIEhZdxKAiojlI1DOIPbmAvej?=
 =?us-ascii?q?m1mhnjRmy+rbMrH9ApjBNGbPnKv9cbpn9UJQ1g4+wcha551OC7EBJPzzWlX2tN?=
 =?us-ascii?q?zdFhI4Mwm0w+fhCNVm1YMfWXmCAq2DP6PUr1CI/f4vI/OSa4ALpDbxMeQq5/nr?=
 =?us-ascii?q?jXMhg18SYbGp3YcLaHC/BvlpP1+WYX/ogtsYFWcKvxE+TPDxhV2ZUT9TYm6yUL?=
 =?us-ascii?q?gm6jE6DoKmF4bDSZq3jLyPxifoVqFRM0VPEFPEMX75e4iCE6MJYTiRLc9ogzAJ?=
 =?us-ascii?q?U5CwQo8m0lelswqsmJR9Ke+ByyQCspSr8dlz/O7C3UUj8D1yFMeM+2OESWxvgn?=
 =?us-ascii?q?kFSiNw16d69x8ugmyf2LR11qQLXedY4OlEB0JnbJM=3D?=
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0DcAQCOPYFYfeXIaSZEGhkBAQEBAQEBA?=
 =?us-ascii?q?QEBAQcBAQEBARQBAQEBAQEBAQEBAQcBAQEBAYMUAQEBAQF/gQkHg0qKepETgja?=
 =?us-ascii?q?FTo0oggwqgkKDNgKBegdBEgEBAQEBAQEBAQEBEgEBCRYKTYIzG4IbAQEBAwEjB?=
 =?us-ascii?q?BkBASwGBQEECwsLBwYCAgkdAgIiEgEFAQoEDgYTEg2ISQMQCAMLLaMkP4saaIF?=
 =?us-ascii?q?rOoMIAQEFhC8Dgw0BAQEBAQEBAwEBAQEBAQEBARcIEnmFQINlgQmCUTuBMA2DB?=
 =?us-ascii?q?oJehy0Mh3R/imU4hmGDGINtg3+Bd1KEPYloihuHDxQegRQmB4FBEggVFToUBYQ?=
 =?us-ascii?q?TDAMcgX5VAYZ4gjsBAQE?=
X-IPAS-Result: =?us-ascii?q?A0DcAQCOPYFYfeXIaSZEGhkBAQEBAQEBAQEBAQcBAQEBARQ?=
 =?us-ascii?q?BAQEBAQEBAQEBAQcBAQEBAYMUAQEBAQF/gQkHg0qKepETgjaFTo0oggwqgkKDN?=
 =?us-ascii?q?gKBegdBEgEBAQEBAQEBAQEBEgEBCRYKTYIzG4IbAQEBAwEjBBkBASwGBQEECws?=
 =?us-ascii?q?LBwYCAgkdAgIiEgEFAQoEDgYTEg2ISQMQCAMLLaMkP4saaIFrOoMIAQEFhC8Dg?=
 =?us-ascii?q?w0BAQEBAQEBAwEBAQEBAQEBARcIEnmFQINlgQmCUTuBMA2DBoJehy0Mh3R/imU?=
 =?us-ascii?q?4hmGDGINtg3+Bd1KEPYloihuHDxQegRQmB4FBEggVFToUBYQTDAMcgX5VAYZ4g?=
 =?us-ascii?q?jsBAQE?=
X-IronPort-AV: E=Sophos;i="5.33,256,1477954800"; 
   d="scan'208";a="210031154"
Received: from mxout3.mail.janestreet.com ([38.105.200.229])
  by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Jan 2017 23:30:43 +0100
Received: from tot-qpr-mailcore1.delacy.com ([172.27.56.68] helo=tot-qpr-mailcore1)
	by mxout3.mail.janestreet.com with esmtps (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256)
	(Exim 4.84_2)
	(envelope-from <yminsky@janestreet.com>)
	id 1cULE9-0000hR-AH
	for caml-list@inria.fr; Thu, 19 Jan 2017 17:30:41 -0500
X-JS-Flow: external
X-JS-Scanner-attachment: (ok) No attachments
Received: by tot-qpr-mailcore1 with JS-mailcore (0.1)
	(envelope-from <yminsky@janestreet.com>)
	id BYgT4R-AAAAxU-IX; 2017-01-19 17:30:41.269414-05:00
Received: from mail-vk0-f72.google.com ([209.85.213.72])
	by mxgoog1.mail.janestreet.com with esmtps (TLSv1.2:AES128-GCM-SHA256:128)
	(Exim 4.84_2)
	(envelope-from <yminsky@janestreet.com>)
	id 1cULE9-0004ne-6P
	for caml-list@inria.fr; Thu, 19 Jan 2017 17:30:41 -0500
Received: by mail-vk0-f72.google.com with SMTP id x75so33302497vke.5
        for <caml-list@inria.fr>; Thu, 19 Jan 2017 14:30:41 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=janestreet.com; s=google;
        h=mime-version:in-reply-to:references:from:date:message-id:subject:to
         :cc:content-transfer-encoding;
        bh=LIol/FRMhpztatxGenyuiKaSGPCFARnTVa22k3WS8K0=;
        b=qLYzmJZGw3iknAFnqvazEhhKhjoabvibvRCi7Rhi7owC5zVPZtZhttPEO4KgFZ7Quu
         5LGaogLofjA2xlOHGF6E4Ju7zFQjQsZXIzzZfn6OkLaQNudtm1KhNskVqmmPwPz+rNvg
         aV5b3RyT9LsLcEDCc5WnrAW+NHHVa7a/Hs3K8=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:in-reply-to:references:from:date
         :message-id:subject:to:cc:content-transfer-encoding;
        bh=LIol/FRMhpztatxGenyuiKaSGPCFARnTVa22k3WS8K0=;
        b=NAqxB6FQ/MhcX7bGaGFKnBpRMxV5BTN0/WXCsDDljTQ/6IOaDi/AULi7cUH5ySFfQP
         F7gaz0AcnDTHlN2bLn+YceiM94nGapFMq4geZR4gYEVXFn5e2YdW5gDN9h5OIdoHE/g1
         RfKM8tLr82rerEw1eQfX0O+L1vMoLXwUIf/r58KBPa2THpyY7v+bIDmAjLVFyb9MYb3J
         KgRAPcAcKV4FldSfqAXE/QOzGuFauVucd0LkFGwbBePYeqZhczLJGq2K/VA88+GNgV2+
         fIUquXFCQtttFRrvcaF22ME/xKXq2hU8wxToZRATWfxeGk045AXwS5iibSrMOuuAtS82
         /Ipg==
X-Gm-Message-State: AIkVDXI1dga4UeVzywZrbB6w6/zkFlUHSPnsqhxvVPF506uWHg7hN7BQ2Y6ay4+bj7Wk0jmXu1VXybXcplY/Y1KD8r3HSi/Y/7vHliczXq7IY494ImA08HkBvdBaUcnydhU4m3aTEzlbUC0wdPRd
X-Received: by 10.176.8.19 with SMTP id a19mr5331825uaf.30.1484865040389;
        Thu, 19 Jan 2017 14:30:40 -0800 (PST)
X-Received: by 10.176.8.19 with SMTP id a19mr5331815uaf.30.1484865039859; Thu,
 19 Jan 2017 14:30:39 -0800 (PST)
MIME-Version: 1.0
Received: by 10.31.229.6 with HTTP; Thu, 19 Jan 2017 14:30:19 -0800 (PST)
In-Reply-To: <2EA73F0B-9C8C-443F-9F05-F0F856ACF2C5@yahoo.com>
References: <2B595CCC-1121-4C8C-8F5F-A235D3AB19BB@yahoo.com>
 <CACLX4jQjjGTOrooQcsu1kcGW9HOMwnNFw80B7i7-gcXi1FTCBQ@mail.gmail.com> <2EA73F0B-9C8C-443F-9F05-F0F856ACF2C5@yahoo.com>
From:Yaron Minsky <yminsky@janestreet.com>
Date: Thu, 19 Jan 2017 17:30:19 -0500
Message-ID: <CACLX4jTwywp4RuyUPnViS9LQrQ=jjdjwmsHZqgWUoGkOVKj2DA@mail.gmail.com>
To:Mark Hayden <markghayden@yahoo.com>
Cc:"caml-list@inria.fr" <caml-list@inria.fr>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-JS-Processed-by: mailcore
Subject: Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds

On Thu, Jan 19, 2017 at 12:59 PM, Mark Hayden <markghayden@yahoo.com> wrote:
> I agree that removal of the current treatment of float array would
> eliminate the biggest issues and that would avoid some of the bigger
> issues.  However, our issue is not just with how float arrays are
> handled.  Comparisons for integer/char/bool types should always be
> specialized after inlining.

I think eliminating polymorphic compare and adding modular implicits
would take care of this issue. In that case, you could write:

   a > 3

and modular implicits could be used to pick the specialized comparison
function for this type. This would fix both the performance problem
and the semantic problems of polymorphic compare.

Modular implicits is a big change, but unlike changes to where type
information is available in the compiler, it's one that is actually in
progress and seems like it's really going to land.

But I agree: for right now, OCaml developers who care about
performance need to be careful about exactly the issues you highlight.

For what it's worth, Base, which is our not-quite-ready-for-prime-time
alternative to the OCaml stdlib, hides polymorphic comparison
functions by default.

> As things stand (and apparently this is unlikely to change), Ocaml
> programmers who want best performance need to learn to developer
> their software to carefully tailor use of polymorphism, abstraction,
> min/max, etc, at least for parts where performance is important.
> This is too bad... the type checker infers the types but (according
> to X Leroy) the type information is no longer available by the time
> inlining occurs.
>
> Below is an example taking the maximum (at least 0) of an array of
> an array of integers.  The natural way to implement this is:
>
>  let array_sum arr =3D
>     Array.fold_left max 0 arr
>  ;;
>
> The inner loop compiles to the code below (again, with
> Array.fold_left redefined to be annotated with =E2=80=9C[@inline]=E2=80=
=9D).
> Because of how the Ocaml compiler is architected, the array
> operations and comparison operations will not be specialized, even
> though the code is inlined and the array type has been inferred to
> be [int array].  Removing the special =E2=80=9Cfloat array=E2=80=9D treat=
ment from
> Ocaml would at least eliminate the "dead code" for checking and
> boxing for floats, but the call to [_caml_greaterequal] would still
> be used instead of a single comparison instruction.
>
> L258:
>         movq    (%rsp), %rbx
>         .loc    1       38      14
>         movzbq  -8(%rbx), %rax
>         cmpq    $254, %rax
>         je      L262
>         .loc    1       38      14
>         movq    -4(%rbx,%rdx,4), %rsi
>         movq    %rsi, 24(%rsp)
>         jmp     L261
>         .align  2
> L262:
>         .loc    1       38      14
> L264:
>         subq    $16, %r15
>         movq    _caml_young_limit@GOTPCREL(%rip), %rax
>         cmpq    (%rax), %r15
>         jb      L265
>         leaq    8(%r15), %rsi
>         movq    $1277, -8(%rsi)
>         .loc    1       38      14
>         movsd   -4(%rbx,%rdx,4), %xmm0
>         movsd   %xmm0, (%rsi)
>         movq    %rsi, 24(%rsp)
> L261:
>         movq    %rdi, 32(%rsp)
>         .loc    5       65      17
>         movq    _caml_greaterequal@GOTPCREL(%rip), %rax
>         call    _caml_c_call
> L256:
>         movq    _caml_young_ptr@GOTPCREL(%rip), %r11
>         movq    (%r11), %r15
>         cmpq    $1, %rax
>         je      L260
>         movq    32(%rsp), %rdi
>         jmp     L259
>         .align  2
> L260:
>         movq    24(%rsp), %rdi
> L259:
>         movq    8(%rsp), %rdx
>         movq    %rdx, %rax
>         addq    $2, %rdx
>         movq    %rdx, 8(%rsp)
>         movq    16(%rsp), %rbx
>         cmpq    %rbx, %rax
>         jne     L258
>
> If you write it like this:
>
>  let [@inline] array_fold_left_i f (x:int) (a:int array) =3D
>    let r =3D ref x in
>    for i =3D 0 to Array.length a - 1 do
>      r :=3D f !r (Array.unsafe_get a i)
>    done;
>    !r
>  ;;
>
>  let [@inline] int_max (v0:int) v1 =3D
>    if v0 > v1 then v0 else v1
>  ;;
>
>  let array_int_max a =3D
>    array_fold_left_i int_max 0 a
>  ;;
>
> Then the inner loop will compile as follows, which is about 5-15x faster =
than the code above and I think most developers would be pleased with.
>
> L284:
>         .loc    1       54      14
>         movq    -4(%rbx,%rsi,4), %rdx
>         cmpq    %rdx, %rdi
>         jle     L286
>         jmp     L285
>         .align  2
> L286:
>         movq    %rdx, %rdi
> L285:
>         movq    %rsi, %rdx
>         addq    $2, %rsi
>         cmpq    %rax, %rdx
>         jne     L284
>
>
>
>
>
>> On Jan 19, 2017, at 5:41 AM, Yaron Minsky <yminsky@janestreet.com> wrote:
>>
>> It seems like your primary issues are around lack of specialization
>> around two features:
>>
>> - unboxing in float arrays
>> - optimization of ad-hoc operations (e.g., polymorphic compare)
>>
>> My view on this is that it's best not to rely on float array
>> specialization at all, and I think the best improvement we can make to
>> OCaml is to remove the ad-hoc specialization of float arrays, and
>> instead add a separate, specialized (and unboxed) type for arrays of
>> floats, similar to the Bytes.t type which is effectively a specialized
>> byte array.
>>
>> As you've observed, specialization is brittle, and it's best not to
>> rely on it too much. Beyond that, the existence of float arrays
>> complicate the runtime quite a bit, and make other bugs more likely.
>>
>> There isn't yet consensus that specialization of float array should be
>> removed, but I'm still hopeful that we'll get there. It's probably
>> Jane Street's highest priority ask for the compiler.
>>
>> We also avoid use of polymorphic compare and other ad-hoc operations,
>> preferring to use type-specialized comparators everywhere. This is
>> better for semantic as well as performance reasons, since we've seen a
>> lot of subtle bugs from polymorphic compare doing the wrong thing on
>> specific types.
>>
>> It's hard to deny that using type-specialized comparators is more
>> verbose than polymorphic compare, but hopefully modular implicits will
>> make this problem go away, and we can get the best of both worlds. And
>> we hope that Flambda will be up to the job of inlining away the
>> overhead of the more indirect calling conventions imposed by modular
>> implicits.
>>
>> I think that with the above changes, we can probably get pretty far
>> towards the goal of being able to write OCaml code that is both highly
>> performant and pretty.  Being able to delay specialization until later
>> in the compilation pipeline would help more, but I believe we can do
>> pretty well without it.
>>
>> y
>>
>>
>> On Thu, Jan 19, 2017 at 1:51 AM, Mark Hayden <markghayden@yahoo.com> wro=
te:
>>> We recently upgraded our Ocaml toolchain from 4.02.3 to Ocaml 4.04.0.
>>> We were looking forward to a performance boost from the optimization
>>> improvements, especially from flambda.  While we generally were able
>>> to achieve significant performance improvement, we were somewhat
>>> surprised by the effort required to avoid certain pitfalls in Ocaml.
>>>
>>> This note describes some issues we ran into.  We filed several
>>> reports on Ocaml Mantis regarding our findings.  However it appears
>>> the underlying issues we ran into are unlikely to change:
>>>
>>> Your three reports (0007440, 0007441, 0007442) are manifestations
>>> of the same fact: the OCaml compiler performs type-based
>>> optimizations first, then erases types, then performs all the other
>>> optimizations. This is very unlikely to change in the near future,
>>> as it would require a total rewrite of the compiler.
>>>
>>> [X Leroy, https://caml.inria.fr/mantis/view.php?id=3D7440]
>>>
>>> I encourage readers to review the problem reports we submitted, which
>>> include more concrete examples.  I'm posting this note in case there
>>> are others running into similar performance issues with their Ocaml
>>> software and who might find it helpful in working around those
>>> issues.  I'm not aware of them being documented elsewhere and there
>>> appears to be little prospect of the issues being addressed in the
>>> compiler in the forseeable future.  Please chime in if any of this is
>>> inaccurate or there is something I missed.
>>>
>>> As an initial example, consider the following Ocaml code to find the
>>> maximum floating point value in an array (that is at least 0.0):
>>>
>>> [Array.fold_left max 0.0 arr]
>>>
>>> Now compile this with the latest compiler and maximum optimization.
>>> Because of how the Ocaml optimization works, this will run about
>>> 10-15x slower (and allocate 2-3 words per array element) than a more
>>> carefully written version that uses specialized operations and avoids
>>> allocation.  See below for one way to achieve this (while still using
>>> a functional-programming style).
>>>
>>> (* Same as Array.fold_left, but with type casting.
>>>  *)
>>> let [@inline] array_fold_leftf f (x:float) (a:float array) =3D
>>>   let r =3D ref x in
>>>   for i =3D 0 to Array.length a - 1 do
>>>     r :=3D f !r (Array.unsafe_get a i)
>>>   done;
>>>   !r
>>> ;;
>>>
>>> let [@inline] float_max (v0:float) v1 =3D
>>>   if v0 > v1 then v0 else v1
>>> ;;
>>>
>>> let array_float_max a =3D
>>>   array_fold_leftf float_max 0.0 a
>>> ;;
>>>
>>> The assembly for the "inner loop" for the two examples are below.
>>> They were compiled with Ocaml 4.05.dev+flambda, "-O3
>>> -unbox-closures", MacOS 12.2, AMD64.
>>>
>>> Unoptimized example.  Note test/branch for array tag.  Allocation for
>>> boxing (we did not include the calls to trigger a minor gc).  There
>>> is a call to Ocaml runtime for polymorphic greater-equal.  This is
>>> probably not what one would expect from an optimizing/inline compiler
>>> for a simple case such as this.  Note that to create this we used our
>>> own definition of Array.fold_left which had an "[@inline]"
>>> annotation.
>>>
>>> L215:
>>>         movq    (%rsp), %rbx
>>>         .loc    1       38      14
>>>         movzbq  -8(%rbx), %rax
>>>         cmpq    $254, %rax
>>>         je      L219
>>>         .loc    1       38      14
>>>         movq    -4(%rbx,%rdx,4), %rsi
>>>         movq    %rsi, 24(%rsp)
>>>         jmp     L218
>>>         .align  2
>>> L219:
>>>         .loc    1       38      14
>>> L221:
>>>         subq    $16, %r15
>>>         movq    _caml_young_limit@GOTPCREL(%rip), %rax
>>>         cmpq    (%rax), %r15
>>>         jb      L222
>>>         leaq    8(%r15), %rsi
>>>         movq    $1277, -8(%rsi)
>>>         .loc    1       38      14
>>>         movsd   -4(%rbx,%rdx,4), %xmm0
>>>         movsd   %xmm0, (%rsi)
>>>         movq    %rsi, 24(%rsp)
>>> L218:
>>>         movq    %rdi, 32(%rsp)
>>>         .file   5       "pervasives.ml"
>>>         .loc    5       65      17
>>>         movq    _caml_greaterequal@GOTPCREL(%rip), %rax
>>>         call    _caml_c_call
>>> L213:
>>>         movq    _caml_young_ptr@GOTPCREL(%rip), %r11
>>>         movq    (%r11), %r15
>>>         cmpq    $1, %rax
>>>         je      L217
>>>         movq    32(%rsp), %rdi
>>>         jmp     L216
>>>         .align  2
>>> L217:
>>>         movq    24(%rsp), %rdi
>>> L216:
>>>         movq    8(%rsp), %rdx
>>>         movq    %rdx, %rax
>>>         addq    $2, %rdx
>>>         movq    %rdx, 8(%rsp)
>>>         movq    16(%rsp), %rbx
>>>         cmpq    %rbx, %rax
>>>         jne     L215
>>>
>>>
>>> The assembly for the more carefully writting case is below.  No
>>> allocation.  No call to external C code.  No test/branch for array
>>> tag.  This matches what I think most people would like to see.  It is
>>> compact enough that (maybe) it would benefit from unrolling:
>>>
>>> l225:
>>>         .loc    1       46      14
>>>         movsd   -4(%rax,%rdi,4), %xmm1
>>>         comisd  %xmm1, %xmm0
>>>         jbe     l227
>>>         jmp     l226
>>>         .align  2
>>> l227:
>>>         movapd  %xmm1, %xmm0
>>> l226:
>>>         movq    %rdi, %rsi
>>>         addq    $2, %rdi
>>>         cmpq    %rbx, %rsi
>>>         jne     l225
>>>
>>>
>>> The two main learnings we found were:
>>>
>>> * Polymorphic primitives ([Array.get], [compare], [>=3D], [min]) are
>>> only specialized if they appear in a context where the types can be
>>> determined at their exact call site, otherwise a polymorphic
>>> version is used.  If the use of the primitive is later inlined in a
>>> context where the type is no longer polymorphic, the function will
>>> _not_ be specialized by the compiler.
>>>
>>> * Use of abstract data types prevents specialization.  In particular,
>>> defining an abstract data type in a module ("type t ;;") will
>>> prevent specialization (even after inlining) for any polymorphic
>>> primitives (eg, "caml_equal") used with that type.  For instance,
>>> if the underlying type for [t] is actually [int], other modules
>>> will still use polymorphic equality instead of a single machine
>>> instruction.  You can prevent this behavior with the "private"
>>> keyword in order to export the type information, "type t =3D private
>>> int".  Alternatively, the module can include its own specialized
>>> operations and other modules can be careful to use them.
>>>
>>> It bears emphasizing that the issues described in this note apply
>>> even when all of the code is "fully inlined" and uses highest level
>>> of optimization.  Specialization in the Ocaml compiler occurs in a
>>> stage prior to inlining.  If it hasn=E2=80=99t happened before inlining=
, it
>>> won=E2=80=99t happen afterwards.
>>>
>>> What kind of effect does lack of specialization have on performance?
>>> Calling the "caml_compare" Ocaml C runtime function to compare
>>> integers can be 10-20x times slower than using a single integer
>>> comparison machine instruction.  Ditto for floating point values.
>>> The unspecialized [Array.get], [Array.set], and (on 32-bit)
>>> [Array.length] have to check the tag on the array to determine if the
>>> array uses the unboxed floating-point represntation (I wish Ocaml
>>> didn't use this!).  For instance, the polymorphic [Array.get] checks
>>> the tag on the array and (for floating point arrays) reads the value
>>> and boxes the floating point value (ie, allocate 2-3 words on the
>>> heap).  Note that when iterating over an array, the check on the tag
>>> will be included in _each_ loop iteration.
>>>
>>> Other impacts of using non-specialized functions:
>>>
>>> * Use of polymorphic primitives means floating point values have to
>>> be boxed, requiring heap allocation.  Through use of specialized
>>> specialized primitives, in many cases floats can remain unboxed.
>>>
>>> * All the extra native code from using polymorphic primitives
>>> (checking array tags, conditionally boxing floats, calling out to
>>> Ocaml C runtime) can have follow-on effects for further inlining.
>>> In other words, when native code can be kept compact, then more
>>> code can be inlined and/or loops unrolled and this can in turn
>>> allow further optimization.
>>>
>>> Some suggestions others may find helpful:
>>>
>>> * Consider using the "private" keyword for any abstract types in your
>>> modules.  We added over 50 of these to our code base.  It is an
>>> ugly but effective work-around.
>>>
>>> * The min/max functions in standard library Pervasives are
>>> particularly problematic.  They are polymorphic so their comparison
>>> will never be specialized.  It can be helpful to define specialized
>>> functions such as:
>>>
>>>   let [@inline] float_max (v0:float) (v1:float) =3D
>>>     if v0 > v1 then v0 else v1
>>>   ;;
>>>
>>>   let [@inline] int_max (v0:int) (v1:int) =3D
>>>     if v0 > v1 then v0 else v1
>>>   ;;
>>>
>>> These will be compiled to use native machine code and unboxed
>>> values.
>>>
>>> * Any use of polymorphism can negatively affect performance.  Be
>>> careful about inadvertently introducing polymorphism into your
>>> program, such as this helper function:
>>>
>>>   let [@inline] getter v ofs =3D Array.get v ofs ;;
>>>
>>> This will result in unspecialized version of [Array.get] being
>>> inlined at all call-sites.  Note that if your .mli file defines the
>>> function as non-polymorphic that will still _not_ affect how
>>> [getter] is compiled.:
>>>
>>>   type getter : t -> int -> int ;;  (* does not affect optimization *)
>>>
>>> You must cast the type in the implementation in order for [Array.get]
>>> to be specialized:
>>>
>>>   let [@inline] getter (v:int array) ofs =3D Array.get v ofs ;;
>>>
>>> * All the iterators in the Array module (eg, [Array.iter]) in the
>>> standard library are polymorphic, so will use unspecialized
>>> accessors and be affected by the issues described here.  Using the
>>> following to sum and array of floats may seem elegant:
>>>
>>>   Array.fold_left (+) 0.0 arr
>>>
>>> However, the resulting code is much slower (and allocates 2 floats
>>> per array entry, ie 4-6 words) than a "specialized" version.  Note
>>> that even if [Array.fold_left] and surrounding code were "fully
>>> inlined," it is still a polymorphic function so the above
>>> performance penalty for checking the array tag and boxing the float
>>> is present.  See also the earlier example.
>>>
>>> * It can be helpful to review the compiled assembly code (using "-S"
>>> option for ocamlopt) and look for tell-tale signs of lack of
>>> specialization, such as calls to [_caml_greaterequal] or allocation
>>> [caml_call_gc] in cases where they are not expected.  You can refer
>>> to the assembly code for the original implementation, know that
>>> that code will not be specialized when inlined.
>>>
>>> As I said, by examining our hot spots and following the suggestions
>>> above, we found the resulting native code could be comparable to what
>>> we would expect from C.  It is unfortunate these issues were
>>> (apparently) designed into the Ocaml compiler architecture, because
>>> otherwise it would have seemed this would be a natural area of
>>> improvement for the compiler.  I would have thought a staticly typed
>>> language such as Ocaml would (through its type checker) be
>>> well-suited for the simple types of function specialization described
>>> in this note.
>>>
>>>
>>> --
>>> Caml-list mailing list.  Subscription management and archives:
>>> https://sympa.inria.fr/sympa/arc/caml-list
>>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>>> Bug reports: http://caml.inria.fr/bin/caml-bugs
>