From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pierre.chambart@ocamlpro.com>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by sympa.inria.fr (Postfix) with ESMTPS id C70417FA5F
	for <caml-list@sympa.inria.fr>; Sat, 21 Jan 2017 15:39:26 +0100 (CET)
Received: from redisdead.crans.org ([138.231.136.39])
  by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 21 Jan 2017 15:39:14 +0100
Received: from [192.168.1.164] (178-214-190-109.dsl.ovh.fr [109.190.214.178])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by redisdead.crans.org (Postfix) with ESMTPSA id 078744AA;
	Sat, 21 Jan 2017 15:39:09 +0100 (CET)
To: Mark Hayden <markghayden@yahoo.com>, caml-list@inria.fr
References: <2B595CCC-1121-4C8C-8F5F-A235D3AB19BB@yahoo.com>
From: Pierre Chambart <pierre.chambart@ocamlpro.com>
Organization: OcamlPro
Message-ID: <7289adbb-10e7-ac50-d4fa-d0bf86e31dac@ocamlpro.com>
Date: Sat, 21 Jan 2017 15:39:03 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.6.0
MIME-Version: 1.0
In-Reply-To: <2B595CCC-1121-4C8C-8F5F-A235D3AB19BB@yahoo.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Validation-by: pierre.chambart@ocamlpro.com
Subject: Re: [Caml-list] <DKIM> Ocaml optimizer pitfalls & work-arounds

I'm coming a bit late to the discussion, but I wanted to tell you that
this is not completely hopeless. Xavier Leroy was right to tell you
that the middle end and back end of the compiler are untyped and that
this hinders this kind of optimization. But it is not completely
impossible. I've recently been working on some more aggressive unboxing,
which requires propagating some more type information. I took the
opportunity to specialize a few more primitives in flambda. It appears
that not many more type annotations need to be propagated through the
intermediate representations to improve something like:

  [Array.fold_left max 0.0 arr]

When translating from the typedtree to Lambda, the application arguments
can be annotated with the kind of values they carry. Here the important
information is that [arr] is a float array. When [Array.fold_left] is
inlined, that information can be propagated to its uses. I have some
tests showing good results on this.

Notice that this is still quite limited. Type propagation on later
representations can only follow values and flow forward. This is mainly
due to GADTs. For instance:

   type 'a t = Float : float t | Int : int t
   type ('a, 'b) eq = Eq : ('a, 'a) eq

   let f (type x) (t : x t) (v : x) (a : x array)
       (cast : x t -> (x, float) eq) =
     a.(0) <- v;
     let Eq = cast t in
     v +. 3.14

       (function t/1019 v/1020 a/1021 cast/1022
         (seq (array.set a/1021 0 v/1020)
           (let (match/1028 = (apply cast/1022 t/1019))
             (+. v/1020 3.14))))

Here, for instance, the type of the argument v cannot be deduced from
the (+.) operation, and specializing the array assignment using it would
lead to segfaults.
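
To make the ordering problem concrete, here is a minimal sketch of the
same function written with an explicit [match], plus a cast that only
succeeds for [Float]. The name [cast_float] is hypothetical and only
there for illustration. The point is that the array store runs before
anything has established that x = float:

```ocaml
type 'a t = Float : float t | Int : int t
type ('a, 'b) eq = Eq : ('a, 'a) eq

(* Same shape as [f] above, written with an explicit [match]. *)
let f (type x) (t : x t) (v : x) (a : x array)
    (cast : x t -> (x, float) eq) : float =
  a.(0) <- v;               (* runs before anything proves x = float *)
  match cast t with
  | Eq -> v +. 3.14         (* only in this branch is [v] known to be a float *)

(* Hypothetical cast: succeeds for [Float], raises for [Int]. *)
let cast_float : type x. x t -> (x, float) eq = function
  | Float -> Eq
  | Int -> failwith "not a float"

let () =
  let ints = [| 0 |] in
  (match f Int 7 ints cast_float with
   | _ -> ()
   | exception Failure _ -> ());
  (* The store into the int array happened even though the cast failed. *)
  assert (ints.(0) = 7)
```

If the store had been specialized as a float array store on the basis
of the later (+.), the [Int] call would have written an immediate
integer as if it were a pointer to a boxed float.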
-- 
Pierre Chambart

On 19/01/2017 at 07:51, Mark Hayden wrote:
> We recently upgraded our Ocaml toolchain from 4.02.3 to Ocaml 4.04.0.
> We were looking forward to a performance boost from the optimization
> improvements, especially from flambda.  While we generally were able
> to achieve significant performance improvement, we were somewhat
> surprised by the effort required to avoid certain pitfalls in Ocaml.
>
> This note describes some issues we ran into.  We filed several
> reports on Ocaml Mantis regarding our findings.  However it appears
> the underlying issues we ran into are unlikely to change:
>
>   Your three reports (0007440, 0007441, 0007442) are manifestations
>   of the same fact: the OCaml compiler performs type-based
>   optimizations first, then erases types, then performs all the other
>   optimizations. This is very unlikely to change in the near future,
>   as it would require a total rewrite of the compiler.
>
>   [X Leroy, https://caml.inria.fr/mantis/view.php?id=7440]
>
> I encourage readers to review the problem reports we submitted, which
> include more concrete examples.  I'm posting this note in case there
> are others running into similar performance issues with their Ocaml
> software and who might find it helpful in working around those
> issues.  I'm not aware of them being documented elsewhere and there
> appears to be little prospect of the issues being addressed in the
> compiler in the foreseeable future.  Please chime in if any of this is
> inaccurate or there is something I missed.
>
> As an initial example, consider the following Ocaml code to find the
> maximum floating point value in an array (that is at least 0.0):
>
>   [Array.fold_left max 0.0 arr]
>
> Now compile this with the latest compiler and maximum optimization.
> Because of how the Ocaml optimization works, this will run about
> 10-15x slower (and allocate 2-3 words per array element) than a more
> carefully written version that uses specialized operations and avoids
> allocation.  See below for one way to achieve this (while still using
> a functional-programming style).
>
>   (* Same as Array.fold_left, but with type casting.
>    *)
>   let [@inline] array_fold_leftf f (x:float) (a:float array) =
>     let r = ref x in
>     for i = 0 to Array.length a - 1 do
>       r := f !r (Array.unsafe_get a i)
>     done;
>     !r
>   ;;
>
>   let [@inline] float_max (v0:float) v1 =
>     if v0 > v1 then v0 else v1
>   ;;
>
>   let array_float_max a =3D
>     array_fold_leftf float_max 0.0 a
>   ;;
>
> The assembly for the "inner loop" of each of the two examples is below.
> They were compiled with Ocaml 4.05.dev+flambda, "-O3
> -unbox-closures", MacOS 12.2, AMD64.
>
> Unoptimized example.  Note test/branch for array tag.  Allocation for
> boxing (we did not include the calls to trigger a minor gc).  There
> is a call to Ocaml runtime for polymorphic greater-equal.  This is
> probably not what one would expect from an optimizing/inline compiler
> for a simple case such as this.  Note that to create this we used our
> own definition of Array.fold_left which had an "[@inline]"
> annotation.
>
>   L215:
>           movq    (%rsp), %rbx
>           .loc    1       38      14
>           movzbq  -8(%rbx), %rax
>           cmpq    $254, %rax
>           je      L219
>           .loc    1       38      14
>           movq    -4(%rbx,%rdx,4), %rsi
>           movq    %rsi, 24(%rsp)
>           jmp     L218
>           .align  2
>   L219:
>           .loc    1       38      14
>   L221:
>           subq    $16, %r15
>           movq    _caml_young_limit@GOTPCREL(%rip), %rax
>           cmpq    (%rax), %r15
>           jb      L222
>           leaq    8(%r15), %rsi
>           movq    $1277, -8(%rsi)
>           .loc    1       38      14
>           movsd   -4(%rbx,%rdx,4), %xmm0
>           movsd   %xmm0, (%rsi)
>           movq    %rsi, 24(%rsp)
>   L218:
>           movq    %rdi, 32(%rsp)
>           .file   5       "pervasives.ml"
>           .loc    5       65      17
>           movq    _caml_greaterequal@GOTPCREL(%rip), %rax
>           call    _caml_c_call
>   L213:
>           movq    _caml_young_ptr@GOTPCREL(%rip), %r11
>           movq    (%r11), %r15
>           cmpq    $1, %rax
>           je      L217
>           movq    32(%rsp), %rdi
>           jmp     L216
>           .align  2
>   L217:
>           movq    24(%rsp), %rdi
>   L216:
>           movq    8(%rsp), %rdx
>           movq    %rdx, %rax
>           addq    $2, %rdx
>           movq    %rdx, 8(%rsp)
>           movq    16(%rsp), %rbx
>           cmpq    %rbx, %rax
>           jne     L215
>
>
> The assembly for the more carefully written case is below.  No
> allocation.  No call to external C code.  No test/branch for array
> tag.  This matches what I think most people would like to see.  It is
> compact enough that (maybe) it would benefit from unrolling:
>
>   l225:
>           .loc    1       46      14
>           movsd   -4(%rax,%rdi,4), %xmm1
>           comisd  %xmm1, %xmm0
>           jbe     l227
>           jmp     l226
>           .align  2
>   l227:
>           movapd  %xmm1, %xmm0
>   l226:
>           movq    %rdi, %rsi
>           addq    $2, %rdi
>           cmpq    %rbx, %rsi
>           jne     l225
>
>
> The two main lessons we learned were:
>
> * Polymorphic primitives ([Array.get], [compare], [>=], [min]) are
>   only specialized if they appear in a context where the types can be
>   determined at their exact call site, otherwise a polymorphic
>   version is used.  If the use of the primitive is later inlined in a
>   context where the type is no longer polymorphic, the function will
>   _not_ be specialized by the compiler.
>
> * Use of abstract data types prevents specialization.  In particular,
>   defining an abstract data type in a module ("type t ;;") will
>   prevent specialization (even after inlining) for any polymorphic
>   primitives (eg, "caml_equal") used with that type.  For instance,
>   if the underlying type for [t] is actually [int], other modules
>   will still use polymorphic equality instead of a single machine
>   instruction.  You can prevent this behavior with the "private"
>   keyword in order to export the type information, "type t = private
>   int".  Alternatively, the module can include its own specialized
>   operations and other modules can be careful to use them.
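
For readers who have not used it, a small sketch of what the "private"
work-around looks like. The module name [Id] and its functions are made
up for illustration. Whether the comparison in [same] ends up as a
single instruction is the claim of the note above, not something this
snippet proves:

```ocaml
module Id : sig
  type t = private int     (* representation visible, construction restricted *)
  val of_int : int -> t
end = struct
  type t = int
  let of_int x = x
end

(* Outside the module: the annotations tie [=] to [Id.t], whose
   underlying [int] representation is exported by the signature. *)
let same (a : Id.t) (b : Id.t) = a = b

(* A private type can be coerced to its representation, but an [int]
   cannot be turned into an [Id.t] without going through [of_int]. *)
let to_int (x : Id.t) = (x :> int)
```

With a fully abstract "type t" in the signature, [same] would still
type-check, but other modules would have no way to know that [Id.t] is
an integer.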
>
> It bears emphasizing that the issues described in this note apply
> even when all of the code is "fully inlined" and uses highest level
> of optimization.  Specialization in the Ocaml compiler occurs in a
> stage prior to inlining.  If it hasn’t happened before inlining, it
> won’t happen afterwards.
>
> What kind of effect does lack of specialization have on performance?
> Calling the "caml_compare" Ocaml C runtime function to compare
> integers can be 10-20x slower than using a single integer
> comparison machine instruction.  Ditto for floating point values.
> The unspecialized [Array.get], [Array.set], and (on 32-bit)
> [Array.length] have to check the tag on the array to determine if the
> array uses the unboxed floating-point representation (I wish Ocaml
> didn't use this!).  For instance, the polymorphic [Array.get] checks
> the tag on the array and (for floating point arrays) reads the value
> and boxes the floating point value (ie, allocate 2-3 words on the
> heap).  Note that when iterating over an array, the check on the tag
> will be included in _each_ loop iteration.
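
As a sketch of where the annotations need to sit, assuming the
behaviour described in the paragraph above: the type has to be known at
the spot where the primitive itself appears. All function names here
are hypothetical:

```ocaml
(* Polymorphic: per the note, this goes through the runtime's
   generic comparison. *)
let poly_less a b = a < b

(* Annotated where [<] appears, so the argument types are known there. *)
let int_less (a : int) (b : int) = a < b
let float_less (a : float) (b : float) = a < b

(* Annotating the array makes the element access monomorphic as well,
   so no per-iteration tag check is needed. *)
let first_positive (a : float array) =
  let rec go i =
    if i >= Array.length a then None
    else if a.(i) > 0.0 then Some i
    else go (i + 1)
  in
  go 0
```

The polymorphic and annotated versions return the same results; only
the generated code is expected to differ.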
>
> Other impacts of using non-specialized functions:
>
> * Use of polymorphic primitives means floating point values have to
>   be boxed, requiring heap allocation.  Through use of specialized
>   primitives, in many cases floats can remain unboxed.
>
> * All the extra native code from using polymorphic primitives
>   (checking array tags, conditionally boxing floats, calling out to
>   Ocaml C runtime) can have follow-on effects for further inlining.
>   In other words, when native code can be kept compact, then more
>   code can be inlined and/or loops unrolled and this can in turn
>   allow further optimization.
>
> Some suggestions others may find helpful:
>
> * Consider using the "private" keyword for any abstract types in your
>   modules.  We added over 50 of these to our code base.  It is an
>   ugly but effective work-around.
>
> * The min/max functions in standard library Pervasives are
>   particularly problematic.  They are polymorphic so their comparison
>   will never be specialized.  It can be helpful to define specialized
>   functions such as:
>
>     let [@inline] float_max (v0:float) (v1:float) =
>       if v0 > v1 then v0 else v1
>     ;;
>
>     let [@inline] int_max (v0:int) (v1:int) =
>       if v0 > v1 then v0 else v1
>     ;;
>
>   These will be compiled to use native machine code and unboxed
>   values.
>
> * Any use of polymorphism can negatively affect performance.  Be
>   careful about inadvertently introducing polymorphism into your
>   program, such as this helper function:
>
>     let [@inline] getter v ofs = Array.get v ofs ;;
>
>   This will result in an unspecialized version of [Array.get] being
>   inlined at all call-sites.  Note that if your .mli file declares the
>   function as non-polymorphic that will still _not_ affect how
>   [getter] is compiled:
>
>     val getter : t -> int -> int ;;  (* does not affect optimization *)
>
>   You must cast the type in the implementation in order for [Array.get]
>   to be specialized:
>
>     let [@inline] getter (v:int array) ofs = Array.get v ofs ;;
>
> * All the iterators in the Array module (eg, [Array.iter]) in the
>   standard library are polymorphic, so will use unspecialized
>   accessors and be affected by the issues described here.  Using the
>   following to sum an array of floats may seem elegant:
>
>     Array.fold_left (+.) 0.0 arr
>
>   However, the resulting code is much slower (and allocates 2 floats
>   per array entry, ie 4-6 words) than a "specialized" version.  Note
>   that even if [Array.fold_left] and surrounding code were "fully
>   inlined," it is still a polymorphic function so the above
>   performance penalty for checking the array tag and boxing the float
>   is present.  See also the earlier example.
>
> * It can be helpful to review the compiled assembly code (using the "-S"
>   option for ocamlopt) and look for tell-tale signs of lack of
>   specialization, such as calls to [_caml_greaterequal] or allocation
>   ([caml_call_gc]) in cases where they are not expected.  You can refer
>   to the assembly code for the original implementation, knowing that
>   that code will not be specialized when inlined.
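
To put the [getter] and float-sum suggestions side by side, a minimal
sketch follows. The module names [Constrained] and [Annotated] and the
function [sum_floats] are made up for illustration. The two modules
have the same signature and the same behaviour. According to the note
above, only the second gives the compiler the type at the point where
[Array.get] appears:

```ocaml
(* Signature constraint only: per the note, [getter] is still compiled
   as a polymorphic function inside the structure. *)
module Constrained : sig
  val getter : int array -> int -> int
end = struct
  let getter v ofs = Array.get v ofs
end

(* Annotation in the implementation itself. *)
module Annotated : sig
  val getter : int array -> int -> int
end = struct
  let getter (v : int array) ofs = Array.get v ofs
end

(* Float sum with a typed array and accumulator, in the same style as
   [array_fold_leftf] earlier in the thread. *)
let sum_floats (a : float array) =
  let r = ref 0.0 in
  for i = 0 to Array.length a - 1 do
    r := !r +. Array.unsafe_get a i
  done;
  !r
```

Compiling both modules with "-S" and comparing the code for the two
[getter] functions is one way to check this on a given compiler
version.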
>
> As I said, by examining our hot spots and following the suggestions
> above, we found the resulting native code could be comparable to what
> we would expect from C.  It is unfortunate these issues were
> (apparently) designed into the Ocaml compiler architecture, because
> otherwise it would have seemed this would be a natural area of
> improvement for the compiler.  I would have thought a statically typed
> language such as Ocaml would (through its type checker) be
> well-suited for the simple types of function specialization described
> in this note.
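
For hunting such hot spots without reading assembly, one rough option
is to compare minor-heap allocation around a call using
[Gc.minor_words] from the standard library. This is only a sketch:
[minor_words_of] is a hypothetical helper, the measurement itself
allocates a few words, and the numbers depend on the compiler version
and flags (bytecode boxes floats everywhere).

```ocaml
let[@inline] array_fold_leftf f (x : float) (a : float array) =
  let r = ref x in
  for i = 0 to Array.length a - 1 do
    r := f !r (Array.unsafe_get a i)
  done;
  !r

let[@inline] float_max (v0 : float) (v1 : float) =
  if v0 > v1 then v0 else v1

(* Hypothetical helper: run [f] and report the number of words
   allocated in the minor heap while it ran. *)
let minor_words_of f =
  let before = Gc.minor_words () in
  let result = f () in
  let after = Gc.minor_words () in
  (result, after -. before)

let () =
  let arr = Array.init 100_000 float_of_int in
  let generic, w_generic =
    minor_words_of (fun () -> Array.fold_left max 0.0 arr) in
  let special, w_special =
    minor_words_of (fun () -> array_fold_leftf float_max 0.0 arr) in
  (* Both versions must agree on the result. *)
  assert (generic = special);
  Printf.printf "generic: %.0f words, specialized: %.0f words\n"
    w_generic w_special
```

A large gap between the two figures points at boxing in the generic
version. A small gap suggests the compiler already specialized it.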
>
>