caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] Ocaml optimizer pitfalls & work-arounds
@ 2017-01-19  6:51 Mark Hayden
  2017-01-19 11:20 ` Nils Becker
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Mark Hayden @ 2017-01-19  6:51 UTC (permalink / raw)
  To: caml-list

We recently upgraded our Ocaml toolchain from 4.02.3 to Ocaml 4.04.0.
We were looking forward to a performance boost from the optimization
improvements, especially from flambda.  While we generally were able
to achieve significant performance improvement, we were somewhat
surprised by the effort required to avoid certain pitfalls in Ocaml.

This note describes some issues we ran into.  We filed several
reports on Ocaml Mantis regarding our findings.  However it appears
the underlying issues we ran into are unlikely to change:

  Your three reports (0007440, 0007441, 0007442) are manifestations
  of the same fact: the OCaml compiler performs type-based
  optimizations first, then erases types, then performs all the other
  optimizations. This is very unlikely to change in the near future,
  as it would require a total rewrite of the compiler.

  [X Leroy, https://caml.inria.fr/mantis/view.php?id=7440]

I encourage readers to review the problem reports we submitted, which
include more concrete examples.  I'm posting this note in case there
are others running into similar performance issues with their Ocaml
software and who might find it helpful in working around those
issues.  I'm not aware of them being documented elsewhere and there
appears to be little prospect of the issues being addressed in the
compiler in the forseeable future.  Please chime in if any of this is
inaccurate or there is something I missed.

As an initial example, consider the following Ocaml code to find the
maximum floating point value in an array (that is at least 0.0):

  [Array.fold_left max 0.0 arr]

Now compile this with the latest compiler and maximum optimization.
Because of how the Ocaml optimization works, this will run about
10-15x slower (and allocate 2-3 words per array element) than a more
carefully written version that uses specialized operations and avoids
allocation.  See below for one way to achieve this (while still using
a functional-programming style).

  (* Same as Array.fold_left, but with type casting.
   *)
  let [@inline] array_fold_leftf f (x:float) (a:float array) =
    let r = ref x in
    for i = 0 to Array.length a - 1 do
      r := f !r (Array.unsafe_get a i)
    done;
    !r
  ;;

  let [@inline] float_max (v0:float) v1 =
    if v0 > v1 then v0 else v1
  ;;

  let array_float_max a =
    array_fold_leftf float_max 0.0 a
  ;;

The assembly for the "inner loop" for the two examples are below.
They were compiled with Ocaml 4.05.dev+flambda, "-O3
-unbox-closures", MacOS 12.2, AMD64.

Unoptimized example.  Note test/branch for array tag.  Allocation for
boxing (we did not include the calls to trigger a minor gc).  There
is a call to Ocaml runtime for polymorphic greater-equal.  This is
probably not what one would expect from an optimizing/inline compiler
for a simple case such as this.  Note that to create this we used our
own definition of Array.fold_left which had an "[@inline]"
annotation.

  L215:
          movq    (%rsp), %rbx
          .loc    1       38      14
          movzbq  -8(%rbx), %rax
          cmpq    $254, %rax
          je      L219
          .loc    1       38      14
          movq    -4(%rbx,%rdx,4), %rsi
          movq    %rsi, 24(%rsp)
          jmp     L218
          .align  2
  L219:
          .loc    1       38      14
  L221:
          subq    $16, %r15
          movq    _caml_young_limit@GOTPCREL(%rip), %rax
          cmpq    (%rax), %r15
          jb      L222
          leaq    8(%r15), %rsi
          movq    $1277, -8(%rsi)
          .loc    1       38      14
          movsd   -4(%rbx,%rdx,4), %xmm0
          movsd   %xmm0, (%rsi)
          movq    %rsi, 24(%rsp)
  L218:
          movq    %rdi, 32(%rsp)
          .file   5       "pervasives.ml"
          .loc    5       65      17
          movq    _caml_greaterequal@GOTPCREL(%rip), %rax
          call    _caml_c_call
  L213:
          movq    _caml_young_ptr@GOTPCREL(%rip), %r11
          movq    (%r11), %r15
          cmpq    $1, %rax
          je      L217
          movq    32(%rsp), %rdi
          jmp     L216
          .align  2
  L217:
          movq    24(%rsp), %rdi
  L216:
          movq    8(%rsp), %rdx
          movq    %rdx, %rax
          addq    $2, %rdx
          movq    %rdx, 8(%rsp)
          movq    16(%rsp), %rbx
          cmpq    %rbx, %rax
          jne     L215


The assembly for the more carefully writting case is below.  No
allocation.  No call to external C code.  No test/branch for array
tag.  This matches what I think most people would like to see.  It is
compact enough that (maybe) it would benefit from unrolling:

  l225:
          .loc    1       46      14
          movsd   -4(%rax,%rdi,4), %xmm1
          comisd  %xmm1, %xmm0
          jbe     l227
          jmp     l226
          .align  2
  l227:
          movapd  %xmm1, %xmm0
  l226:
          movq    %rdi, %rsi
          addq    $2, %rdi
          cmpq    %rbx, %rsi
          jne     l225


The two main learnings we found were:

* Polymorphic primitives ([Array.get], [compare], [>=], [min]) are
  only specialized if they appear in a context where the types can be
  determined at their exact call site, otherwise a polymorphic
  version is used.  If the use of the primitive is later inlined in a
  context where the type is no longer polymorphic, the function will
  _not_ be specialized by the compiler.

* Use of abstract data types prevents specialization.  In particular,
  defining an abstract data type in a module ("type t ;;") will
  prevent specialization (even after inlining) for any polymorphic
  primitives (eg, "caml_equal") used with that type.  For instance,
  if the underlying type for [t] is actually [int], other modules
  will still use polymorphic equality instead of a single machine
  instruction.  You can prevent this behavior with the "private"
  keyword in order to export the type information, "type t = private
  int".  Alternatively, the module can include its own specialized
  operations and other modules can be careful to use them.

It bears emphasizing that the issues described in this note apply
even when all of the code is "fully inlined" and uses highest level
of optimization.  Specialization in the Ocaml compiler occurs in a
stage prior to inlining.  If it hasn’t happened before inlining, it
won’t happen afterwards.

What kind of effect does lack of specialization have on performance?
Calling the "caml_compare" Ocaml C runtime function to compare
integers can be 10-20x times slower than using a single integer
comparison machine instruction.  Ditto for floating point values.
The unspecialized [Array.get], [Array.set], and (on 32-bit)
[Array.length] have to check the tag on the array to determine if the
array uses the unboxed floating-point represntation (I wish Ocaml
didn't use this!).  For instance, the polymorphic [Array.get] checks
the tag on the array and (for floating point arrays) reads the value
and boxes the floating point value (ie, allocate 2-3 words on the
heap).  Note that when iterating over an array, the check on the tag
will be included in _each_ loop iteration.

Other impacts of using non-specialized functions:

* Use of polymorphic primitives means floating point values have to
  be boxed, requiring heap allocation.  Through use of specialized
  specialized primitives, in many cases floats can remain unboxed.

* All the extra native code from using polymorphic primitives
  (checking array tags, conditionally boxing floats, calling out to
  Ocaml C runtime) can have follow-on effects for further inlining.
  In other words, when native code can be kept compact, then more
  code can be inlined and/or loops unrolled and this can in turn
  allow further optimization.

Some suggestions others may find helpful:

* Consider using the "private" keyword for any abstract types in your
  modules.  We added over 50 of these to our code base.  It is an
  ugly but effective work-around.

* The min/max functions in standard library Pervasives are
  particularly problematic.  They are polymorphic so their comparison
  will never be specialized.  It can be helpful to define specialized
  functions such as:

    let [@inline] float_max (v0:float) (v1:float) =
      if v0 > v1 then v0 else v1
    ;;

    let [@inline] int_max (v0:int) (v1:int) =
      if v0 > v1 then v0 else v1
    ;;

  These will be compiled to use native machine code and unboxed
  values.

* Any use of polymorphism can negatively affect performance.  Be
  careful about inadvertently introducing polymorphism into your
  program, such as this helper function:

    let [@inline] getter v ofs = Array.get v ofs ;;

  This will result in unspecialized version of [Array.get] being
  inlined at all call-sites.  Note that if your .mli file defines the
  function as non-polymorphic that will still _not_ affect how
  [getter] is compiled.:

    type getter : t -> int -> int ;;  (* does not affect optimization *)

  You must cast the type in the implementation in order for [Array.get]
  to be specialized:

    let [@inline] getter (v:int array) ofs = Array.get v ofs ;;

* All the iterators in the Array module (eg, [Array.iter]) in the
  standard library are polymorphic, so will use unspecialized
  accessors and be affected by the issues described here.  Using the
  following to sum and array of floats may seem elegant:

    Array.fold_left (+) 0.0 arr

  However, the resulting code is much slower (and allocates 2 floats
  per array entry, ie 4-6 words) than a "specialized" version.  Note
  that even if [Array.fold_left] and surrounding code were "fully
  inlined," it is still a polymorphic function so the above
  performance penalty for checking the array tag and boxing the float
  is present.  See also the earlier example.

* It can be helpful to review the compiled assembly code (using "-S"
  option for ocamlopt) and look for tell-tale signs of lack of
  specialization, such as calls to [_caml_greaterequal] or allocation
  [caml_call_gc] in cases where they are not expected.  You can refer
  to the assembly code for the original implementation, know that
  that code will not be specialized when inlined.

As I said, by examining our hot spots and following the suggestions
above, we found the resulting native code could be comparable to what
we would expect from C.  It is unfortunate these issues were
(apparently) designed into the Ocaml compiler architecture, because
otherwise it would have seemed this would be a natural area of
improvement for the compiler.  I would have thought a staticly typed
language such as Ocaml would (through its type checker) be
well-suited for the simple types of function specialization described
in this note.


^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
@ 2017-01-19 14:32 Hongbo Zhang (BLOOMBERG/ 731 LEX)
  0 siblings, 0 replies; 15+ messages in thread
From: Hongbo Zhang (BLOOMBERG/ 731 LEX) @ 2017-01-19 14:32 UTC (permalink / raw)
  To: markghayden; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 11619 bytes --]

Indeed, I really wish we have a warning when polymorphic comparison can not be type specialized so that people can turn that warning into error, or we have an attribute in types to allow polymorphic compare for a small amount of selective types.
There are some other awkward situations: recently when we polyfill Int64 in BuckleScript, we have to 
think about the data layout to match the semantics of polymorphic comparison. so that a polymorphic compare of `1L < 2L` still make sense.
`min`, `max` are worse, since they are never type specialized in current compiler.
From: markghayden@yahoo.com At: 01/19/17 01:51:43
To: caml-list@inria.fr
Subject: Re:[Caml-list] Ocaml optimizer pitfalls & work-arounds

We recently upgraded our Ocaml toolchain from 4.02.3 to Ocaml 4.04.0.
We were looking forward to a performance boost from the optimization
improvements, especially from flambda.  While we generally were able
to achieve significant performance improvement, we were somewhat
surprised by the effort required to avoid certain pitfalls in Ocaml.

This note describes some issues we ran into.  We filed several
reports on Ocaml Mantis regarding our findings.  However it appears
the underlying issues we ran into are unlikely to change:

  Your three reports (0007440, 0007441, 0007442) are manifestations
  of the same fact: the OCaml compiler performs type-based
  optimizations first, then erases types, then performs all the other
  optimizations. This is very unlikely to change in the near future,
  as it would require a total rewrite of the compiler.

  [X Leroy, https://caml.inria.fr/mantis/view.php?id=7440]

I encourage readers to review the problem reports we submitted, which
include more concrete examples.  I'm posting this note in case there
are others running into similar performance issues with their Ocaml
software and who might find it helpful in working around those
issues.  I'm not aware of them being documented elsewhere and there
appears to be little prospect of the issues being addressed in the
compiler in the forseeable future.  Please chime in if any of this is
inaccurate or there is something I missed.

As an initial example, consider the following Ocaml code to find the
maximum floating point value in an array (that is at least 0.0):

  [Array.fold_left max 0.0 arr]

Now compile this with the latest compiler and maximum optimization.
Because of how the Ocaml optimization works, this will run about
10-15x slower (and allocate 2-3 words per array element) than a more
carefully written version that uses specialized operations and avoids
allocation.  See below for one way to achieve this (while still using
a functional-programming style).

  (* Same as Array.fold_left, but with type casting.
   *)
  let [@inline] array_fold_leftf f (x:float) (a:float array) =
    let r = ref x in
    for i = 0 to Array.length a - 1 do
      r := f !r (Array.unsafe_get a i)
    done;
    !r
  ;;

  let [@inline] float_max (v0:float) v1 =
    if v0 > v1 then v0 else v1
  ;;

  let array_float_max a =
    array_fold_leftf float_max 0.0 a
  ;;

The assembly for the "inner loop" for the two examples are below.
They were compiled with Ocaml 4.05.dev+flambda, "-O3
-unbox-closures", MacOS 12.2, AMD64.

Unoptimized example.  Note test/branch for array tag.  Allocation for
boxing (we did not include the calls to trigger a minor gc).  There
is a call to Ocaml runtime for polymorphic greater-equal.  This is
probably not what one would expect from an optimizing/inline compiler
for a simple case such as this.  Note that to create this we used our
own definition of Array.fold_left which had an "[@inline]"
annotation.

  L215:
          movq    (%rsp), %rbx
          .loc    1       38      14
          movzbq  -8(%rbx), %rax
          cmpq    $254, %rax
          je      L219
          .loc    1       38      14
          movq    -4(%rbx,%rdx,4), %rsi
          movq    %rsi, 24(%rsp)
          jmp     L218
          .align  2
  L219:
          .loc    1       38      14
  L221:
          subq    $16, %r15
          movq    _caml_young_limit@GOTPCREL(%rip), %rax
          cmpq    (%rax), %r15
          jb      L222
          leaq    8(%r15), %rsi
          movq    $1277, -8(%rsi)
          .loc    1       38      14
          movsd   -4(%rbx,%rdx,4), %xmm0
          movsd   %xmm0, (%rsi)
          movq    %rsi, 24(%rsp)
  L218:
          movq    %rdi, 32(%rsp)
          .file   5       "pervasives.ml"
          .loc    5       65      17
          movq    _caml_greaterequal@GOTPCREL(%rip), %rax
          call    _caml_c_call
  L213:
          movq    _caml_young_ptr@GOTPCREL(%rip), %r11
          movq    (%r11), %r15
          cmpq    $1, %rax
          je      L217
          movq    32(%rsp), %rdi
          jmp     L216
          .align  2
  L217:
          movq    24(%rsp), %rdi
  L216:
          movq    8(%rsp), %rdx
          movq    %rdx, %rax
          addq    $2, %rdx
          movq    %rdx, 8(%rsp)
          movq    16(%rsp), %rbx
          cmpq    %rbx, %rax
          jne     L215


The assembly for the more carefully writting case is below.  No
allocation.  No call to external C code.  No test/branch for array
tag.  This matches what I think most people would like to see.  It is
compact enough that (maybe) it would benefit from unrolling:

  l225:
          .loc    1       46      14
          movsd   -4(%rax,%rdi,4), %xmm1
          comisd  %xmm1, %xmm0
          jbe     l227
          jmp     l226
          .align  2
  l227:
          movapd  %xmm1, %xmm0
  l226:
          movq    %rdi, %rsi
          addq    $2, %rdi
          cmpq    %rbx, %rsi
          jne     l225


The two main learnings we found were:

* Polymorphic primitives ([Array.get], [compare], [>=], [min]) are
  only specialized if they appear in a context where the types can be
  determined at their exact call site, otherwise a polymorphic
  version is used.  If the use of the primitive is later inlined in a
  context where the type is no longer polymorphic, the function will
  _not_ be specialized by the compiler.

* Use of abstract data types prevents specialization.  In particular,
  defining an abstract data type in a module ("type t ;;") will
  prevent specialization (even after inlining) for any polymorphic
  primitives (eg, "caml_equal") used with that type.  For instance,
  if the underlying type for [t] is actually [int], other modules
  will still use polymorphic equality instead of a single machine
  instruction.  You can prevent this behavior with the "private"
  keyword in order to export the type information, "type t = private
  int".  Alternatively, the module can include its own specialized
  operations and other modules can be careful to use them.

It bears emphasizing that the issues described in this note apply
even when all of the code is "fully inlined" and uses highest level
of optimization.  Specialization in the Ocaml compiler occurs in a
stage prior to inlining.  If it hasn’t happened before inlining, it
won’t happen afterwards.

What kind of effect does lack of specialization have on performance?
Calling the "caml_compare" Ocaml C runtime function to compare
integers can be 10-20x times slower than using a single integer
comparison machine instruction.  Ditto for floating point values.
The unspecialized [Array.get], [Array.set], and (on 32-bit)
[Array.length] have to check the tag on the array to determine if the
array uses the unboxed floating-point represntation (I wish Ocaml
didn't use this!).  For instance, the polymorphic [Array.get] checks
the tag on the array and (for floating point arrays) reads the value
and boxes the floating point value (ie, allocate 2-3 words on the
heap).  Note that when iterating over an array, the check on the tag
will be included in _each_ loop iteration.

Other impacts of using non-specialized functions:

* Use of polymorphic primitives means floating point values have to
  be boxed, requiring heap allocation.  Through use of specialized
  specialized primitives, in many cases floats can remain unboxed.

* All the extra native code from using polymorphic primitives
  (checking array tags, conditionally boxing floats, calling out to
  Ocaml C runtime) can have follow-on effects for further inlining.
  In other words, when native code can be kept compact, then more
  code can be inlined and/or loops unrolled and this can in turn
  allow further optimization.

Some suggestions others may find helpful:

* Consider using the "private" keyword for any abstract types in your
  modules.  We added over 50 of these to our code base.  It is an
  ugly but effective work-around.

* The min/max functions in standard library Pervasives are
  particularly problematic.  They are polymorphic so their comparison
  will never be specialized.  It can be helpful to define specialized
  functions such as:

    let [@inline] float_max (v0:float) (v1:float) =
      if v0 > v1 then v0 else v1
    ;;

    let [@inline] int_max (v0:int) (v1:int) =
      if v0 > v1 then v0 else v1
    ;;

  These will be compiled to use native machine code and unboxed
  values.

* Any use of polymorphism can negatively affect performance.  Be
  careful about inadvertently introducing polymorphism into your
  program, such as this helper function:

    let [@inline] getter v ofs = Array.get v ofs ;;

  This will result in unspecialized version of [Array.get] being
  inlined at all call-sites.  Note that if your .mli file defines the
  function as non-polymorphic that will still _not_ affect how
  [getter] is compiled.:

    type getter : t -> int -> int ;;  (* does not affect optimization *)

  You must cast the type in the implementation in order for [Array.get]
  to be specialized:

    let [@inline] getter (v:int array) ofs = Array.get v ofs ;;

* All the iterators in the Array module (eg, [Array.iter]) in the
  standard library are polymorphic, so will use unspecialized
  accessors and be affected by the issues described here.  Using the
  following to sum and array of floats may seem elegant:

    Array.fold_left (+) 0.0 arr

  However, the resulting code is much slower (and allocates 2 floats
  per array entry, ie 4-6 words) than a "specialized" version.  Note
  that even if [Array.fold_left] and surrounding code were "fully
  inlined," it is still a polymorphic function so the above
  performance penalty for checking the array tag and boxing the float
  is present.  See also the earlier example.

* It can be helpful to review the compiled assembly code (using "-S"
  option for ocamlopt) and look for tell-tale signs of lack of
  specialization, such as calls to [_caml_greaterequal] or allocation
  [caml_call_gc] in cases where they are not expected.  You can refer
  to the assembly code for the original implementation, know that
  that code will not be specialized when inlined.

As I said, by examining our hot spots and following the suggestions
above, we found the resulting native code could be comparable to what
we would expect from C.  It is unfortunate these issues were
(apparently) designed into the Ocaml compiler architecture, because
otherwise it would have seemed this would be a natural area of
improvement for the compiler.  I would have thought a staticly typed
language such as Ocaml would (through its type checker) be
well-suited for the simple types of function specialization described
in this note.


-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa.inria.fr/sympa/arc/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


[-- Attachment #2: Type: text/html, Size: 13895 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-01-23 16:34 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-19  6:51 [Caml-list] Ocaml optimizer pitfalls & work-arounds Mark Hayden
2017-01-19 11:20 ` Nils Becker
2017-01-19 11:39   ` Gabriel Scherer
2017-01-19 13:26     ` Frédéric Bour
2017-01-19 14:35   ` Alain Frisch
2017-01-19 15:35     ` Ivan Gotovchits
2017-01-19 17:02       ` Hezekiah M. Carty
2017-01-19 15:41     ` Gerd Stolpmann
2017-01-19 13:41 ` Yaron Minsky
2017-01-19 17:59   ` Mark Hayden
2017-01-19 22:30     ` Yaron Minsky
2017-01-22 20:06       ` Berke Durak
2017-01-23 16:33         ` David McClain
2017-01-21 14:39 ` [Caml-list] <DKIM> " Pierre Chambart
2017-01-19 14:32 [Caml-list] " Hongbo Zhang (BLOOMBERG/ 731 LEX)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).