From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id A94057FA5F for ; Thu, 19 Jan 2017 07:51:28 +0100 (CET) Authentication-Results: mail3-smtp-sop.national.inria.fr; spf=None smtp.pra=markghayden@yahoo.com; spf=Pass smtp.mailfrom=markghayden@yahoo.com; spf=None smtp.helo=postmaster@nm3-vm1.bullet.mail.ne1.yahoo.com Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of markghayden@yahoo.com) identity=pra; client-ip=98.138.91.53; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="markghayden@yahoo.com"; x-sender="markghayden@yahoo.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of markghayden@yahoo.com designates 98.138.91.53 as permitted sender) identity=mailfrom; client-ip=98.138.91.53; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="markghayden@yahoo.com"; x-sender="markghayden@yahoo.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of postmaster@nm3-vm1.bullet.mail.ne1.yahoo.com) identity=helo; client-ip=98.138.91.53; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="markghayden@yahoo.com"; x-sender="postmaster@nm3-vm1.bullet.mail.ne1.yahoo.com"; x-conformance=sidf_compatible IronPort-PHdr: =?us-ascii?q?9a23=3AwRw74RT4/1iPptNKChPM24mcRNpsv+yvbD5Q0YIu?= =?us-ascii?q?jvd0So/mwa69bByN2/xhgRfzUJnB7Loc0qyN4vymBTRLuM7Z+Fk5M7V0Hycfjs?= =?us-ascii?q?sXmwFySOWkMmbcaMDQUiohAc5ZX0Vk9XzoeWJcGcL5ekGA6ibqtW1aFRrwLxd6?= =?us-ascii?q?KfroEYDOkcu3y/qy+5rOaAlUmTaxe71/IRG5oAnLucQanYRuJrstxhfVv3BFZ/?= =?us-ascii?q?lYyWR0KFyJgh3y/N2w/Jlt8yRRv/Iu6ctNWrjkcqo7ULJVEi0oP3g668P3uxbD?= =?us-ascii?q?SxCP5mYHXWUNjhVIGQnF4wrkUZr3ryD3q/By2CiePc3xULA0RTGv5LplRRP0lC?= =?us-ascii?q?sKMSMy/WfKgcJyka1bugqsqRxjzIDbb46bKflwcK3Dc90dXmdBQt9RVyldDoO8?= =?us-ascii?q?c4cCDewMNvtYoYnnoFsOqAOzCw62C+P1yT9Dm3340rc60us8Dw7G2hErEtULsH?= =?us-ascii?q?vOttX1N6gSUeCvw6jI0DrMcfVW1Cz96YfSchAhpvaMUahsfsrWzEkiDgXIhUie?= =?us-ascii?q?p4ziOjOazOUNs26D4uphU+KvkW8npBtrrjih3MchjJTCiIENyl3c8Sh0w5w5Kc?= =?us-ascii?q?C2RUN4e9KpFIZcuzuaOoZ4Ws8vR2JltDwnxrAIupO3ZisHxZA9yxLCafGKfI6F?= =?us-ascii?q?6Q/5WumLOzd3nndldaq/hxms9UigzfXxVs+x0FtEtyZFjNzMum0X2xPI98iHTv?= =?us-ascii?q?998Vm92TqV0gDc8OBEIUQumardNZEt36Q8l5oJvkTDGS/2n1/6g7ORdkUh4uSo?= =?us-ascii?q?6uLnbav6ppKEM4J5iRvyPrkgl8G8G+g1NhUCU3Kb9OmyzLHj+Ff2QLROjv04iK?= =?us-ascii?q?nZt5XaKNwepqGjGQ9V0Ykj6xalADamzdsXg38HIUlFeR2dj4jpPFbOLOrkAve4?= =?us-ascii?q?hlSgiC1ryOzePr39HpXNKWDOn6v7crZ4705Q0Q4zzdFE55JIEbwBO/LyWkrptN?= =?us-ascii?q?PCFBM5Mgq0w/zmCNpnzI8eV3iPUeelN/bZuFqMo+YuOPWkZYkPuT+7JeJ2yeTp?= =?us-ascii?q?iCockEUeNYmgzJcabjjsG/18IEqZaGvgj9EpAG4KuQ14R+vv3g7RGQVPbmq/Cv?= =?us-ascii?q?pvrgowD5irWN/O?= X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0AWBQD/YIBYhjVbimJEGhwBAQQBAQoBA?= =?us-ascii?q?RcBAQQBAQoBAYICgRIBAQEBAX+BCRKDP5trglWGPow4ggwqgkKFQ0AXAQEBAQE?= =?us-ascii?q?BAQEBAQESAQEBCAsLCh0wgjMZgkcEGQEBMgYsCAImAkcERohIAQMYDi2uR2iBa?= =?us-ascii?q?xgFAQEbgwgBAQWDXgEjJwODBgEBAQcBAQEBAQEBAQEBARUIFQJ0h0UIhW2BMIJ?= =?us-ascii?q?ZDC4tgjGQLIsdgUuFFoMYh2wfgVhRhD2DKiCGHo5JhCchAoFLEh1fAYIaggEPE?= =?us-ascii?q?QuCAVIBhyiCOwEBAQ?= X-IPAS-Result: =?us-ascii?q?A0AWBQD/YIBYhjVbimJEGhwBAQQBAQoBARcBAQQBAQoBAYI?= =?us-ascii?q?CgRIBAQEBAX+BCRKDP5trglWGPow4ggwqgkKFQ0AXAQEBAQEBAQEBAQESAQEBC?= =?us-ascii?q?AsLCh0wgjMZgkcEGQEBMgYsCAImAkcERohIAQMYDi2uR2iBaxgFAQEbgwgBAQW?= =?us-ascii?q?DXgEjJwODBgEBAQcBAQEBAQEBAQEBARUIFQJ0h0UIhW2BMIJZDC4tgjGQLIsdg?= =?us-ascii?q?UuFFoMYh2wfgVhRhD2DKiCGHo5JhCchAoFLEh1fAYIaggEPEQuCAVIBhyiCOwE?= =?us-ascii?q?BAQ?= X-IronPort-AV: E=Sophos;i="5.33,252,1477954800"; d="scan'208";a="209874258" Received: from nm3-vm1.bullet.mail.ne1.yahoo.com ([98.138.91.53]) by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/AES128-GCM-SHA256; 19 Jan 2017 07:51:27 +0100 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1484808685; bh=kzlk/HM+O86Lnby8XO1k6tMUzf5Toih5dW1mBQCW4xQ=; h=From:Subject:Date:To:From:Subject; b=Gf+dd77xK0s2ljLdRf3918PGtIa7ttGzy33xgOtwFOGRYbFZSrxjrTWNlBohglDyE1gcaAqN7a8tRKqURLlOIGIO2VAeTRQR1M473EG++whFOwEq19D+ZlBPRutELFjaEdojsJdcn3rZb/Fwv0egMYFd7xvh+qkjMjkymLh3AnvWWcLxSBGlq6MxICSM87NgGdDSGNK78OdfNO0jQKxIoGQneoRnO1s+Fhf7V4v6vIZl2aGPO5r6GvSreZPPs5NHt9VGwChpF17x1+KKHnVlDOTr8ZtD4sXpTzJ/0y5yu5pOlJHOBjvKPHH5joAh4/2E9dZqGleLJ3+R8LgAltUHyA== Received: from [98.138.101.131] by nm3.bullet.mail.ne1.yahoo.com with NNFMP; 19 Jan 2017 06:51:25 -0000 Received: from [98.138.89.174] by tm19.bullet.mail.ne1.yahoo.com with NNFMP; 19 Jan 2017 06:51:25 -0000 Received: from [127.0.0.1] by omp1030.mail.ne1.yahoo.com with NNFMP; 19 Jan 2017 06:51:25 -0000 X-Yahoo-Newman-Id: 516219.47060.bm@omp1030.mail.ne1.yahoo.com Received: (qmail 26335 invoked from network); 19 Jan 2017 06:51:25 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s1024; t=1484808685; bh=kzlk/HM+O86Lnby8XO1k6tMUzf5Toih5dW1mBQCW4xQ=; h=From:Content-Type:Content-Transfer-Encoding:Mime-Version:Subject:Message-Id:Date:To; b=o0wDMvTHY8oswDgb+i7vn0kq5mB55qSgCkxn1WOzgxq+K0wuaDi2uxj38SKN+ABada9wb4m+YKI4COdP7XYZE2VSL2I+bBdI7a7JRHPaqaZpL0KopgePYZdTL+6Vr/p+Wa8rFlZDflPRixlp+zvDEsfMNInjwNX8rwV11MM0wK0= X-Yahoo-Newman-Property: ymail-3 X-YMail-OSG: s6xOwZcVM1nNgRAx4AYdP0fekWyUkpjztfIGHXE_fKMaXYy 5aR.ptX45S0ux7s_CxYnzX00psiAT9wTRb3JLRKF2N4Gbx6CrhmwlQzN7azF a62VcUyzxhsfyaweB9bf4rKGbsob9Exir1vDdhrVU6l896mn0a6ntSPLCqPK hgA3IzGp_VGWFdbZkWbyXjnK.dsk5uvVdpbYLrfhOh44LKXR0X04eUlApZcH ypc.cY_y3rmx26Ee5E03RqJlKYYy0kmUKcp_43Ga_3YDA9TdfqzH9LxEWUrh ItdazkWo1CZ0psFQz8hs04a7mrcAMJaiitEHPF0.eyD.eMTReYMs6AFUqxU6 6cQblNNpO54L8jKfaChsfauapidMvGT98M_UYND9S3rY83gfY9kTgpWWI1xo No.QOQ4gEAF6E1O0lp_iDLrsg3NVkeqi8mU7UeogXebd5B6qnae.HwMJhojc YRGw6aRKcX.9Ze0cMT97oTeldIvSX_2rtyx4ubdjYEwEi1Xfj887zv3Bx9SM SFt4ZE3_gjwiXyJZ1JWTLMyLjGEWriBno4bWNwItlQ5riqvUYNNABS_EMXOA 8LTs0LkobABnQ1w-- X-Yahoo-SMTP: 1rhWOp.swBCIsRPqec.N67IhngFT7HDF From: Mark Hayden Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (Mac OS X Mail 10.2 \(3259\)) Message-Id: <2B595CCC-1121-4C8C-8F5F-A235D3AB19BB@yahoo.com> Date: Wed, 18 Jan 2017 22:51:24 -0800 To: caml-list@inria.fr X-Mailer: Apple Mail (2.3259) Subject: [Caml-list] Ocaml optimizer pitfalls & work-arounds We recently upgraded our Ocaml toolchain from 4.02.3 to Ocaml 4.04.0. We were looking forward to a performance boost from the optimization improvements, especially from flambda. While we generally were able to achieve significant performance improvement, we were somewhat surprised by the effort required to avoid certain pitfalls in Ocaml. This note describes some issues we ran into. We filed several reports on Ocaml Mantis regarding our findings. However it appears the underlying issues we ran into are unlikely to change: Your three reports (0007440, 0007441, 0007442) are manifestations of the same fact: the OCaml compiler performs type-based optimizations first, then erases types, then performs all the other optimizations. This is very unlikely to change in the near future, as it would require a total rewrite of the compiler. [X Leroy, https://caml.inria.fr/mantis/view.php?id=3D7440] I encourage readers to review the problem reports we submitted, which include more concrete examples. I'm posting this note in case there are others running into similar performance issues with their Ocaml software and who might find it helpful in working around those issues. I'm not aware of them being documented elsewhere and there appears to be little prospect of the issues being addressed in the compiler in the forseeable future. Please chime in if any of this is inaccurate or there is something I missed. As an initial example, consider the following Ocaml code to find the maximum floating point value in an array (that is at least 0.0): [Array.fold_left max 0.0 arr] Now compile this with the latest compiler and maximum optimization. Because of how the Ocaml optimization works, this will run about 10-15x slower (and allocate 2-3 words per array element) than a more carefully written version that uses specialized operations and avoids allocation. See below for one way to achieve this (while still using a functional-programming style). (* Same as Array.fold_left, but with type casting. *) let [@inline] array_fold_leftf f (x:float) (a:float array) =3D let r =3D ref x in for i =3D 0 to Array.length a - 1 do r :=3D f !r (Array.unsafe_get a i) done; !r ;; let [@inline] float_max (v0:float) v1 =3D if v0 > v1 then v0 else v1 ;; let array_float_max a =3D array_fold_leftf float_max 0.0 a ;; The assembly for the "inner loop" for the two examples are below. They were compiled with Ocaml 4.05.dev+flambda, "-O3 -unbox-closures", MacOS 12.2, AMD64. Unoptimized example. Note test/branch for array tag. Allocation for boxing (we did not include the calls to trigger a minor gc). There is a call to Ocaml runtime for polymorphic greater-equal. This is probably not what one would expect from an optimizing/inline compiler for a simple case such as this. Note that to create this we used our own definition of Array.fold_left which had an "[@inline]" annotation. L215: movq (%rsp), %rbx .loc 1 38 14 movzbq -8(%rbx), %rax cmpq $254, %rax je L219 .loc 1 38 14 movq -4(%rbx,%rdx,4), %rsi movq %rsi, 24(%rsp) jmp L218 .align 2 L219: .loc 1 38 14 L221: subq $16, %r15 movq _caml_young_limit@GOTPCREL(%rip), %rax cmpq (%rax), %r15 jb L222 leaq 8(%r15), %rsi movq $1277, -8(%rsi) .loc 1 38 14 movsd -4(%rbx,%rdx,4), %xmm0 movsd %xmm0, (%rsi) movq %rsi, 24(%rsp) L218: movq %rdi, 32(%rsp) .file 5 "pervasives.ml" .loc 5 65 17 movq _caml_greaterequal@GOTPCREL(%rip), %rax call _caml_c_call L213: movq _caml_young_ptr@GOTPCREL(%rip), %r11 movq (%r11), %r15 cmpq $1, %rax je L217 movq 32(%rsp), %rdi jmp L216 .align 2 L217: movq 24(%rsp), %rdi L216: movq 8(%rsp), %rdx movq %rdx, %rax addq $2, %rdx movq %rdx, 8(%rsp) movq 16(%rsp), %rbx cmpq %rbx, %rax jne L215 The assembly for the more carefully writting case is below. No allocation. No call to external C code. No test/branch for array tag. This matches what I think most people would like to see. It is compact enough that (maybe) it would benefit from unrolling: l225: .loc 1 46 14 movsd -4(%rax,%rdi,4), %xmm1 comisd %xmm1, %xmm0 jbe l227 jmp l226 .align 2 l227: movapd %xmm1, %xmm0 l226: movq %rdi, %rsi addq $2, %rdi cmpq %rbx, %rsi jne l225 The two main learnings we found were: * Polymorphic primitives ([Array.get], [compare], [>=3D], [min]) are only specialized if they appear in a context where the types can be determined at their exact call site, otherwise a polymorphic version is used. If the use of the primitive is later inlined in a context where the type is no longer polymorphic, the function will _not_ be specialized by the compiler. * Use of abstract data types prevents specialization. In particular, defining an abstract data type in a module ("type t ;;") will prevent specialization (even after inlining) for any polymorphic primitives (eg, "caml_equal") used with that type. For instance, if the underlying type for [t] is actually [int], other modules will still use polymorphic equality instead of a single machine instruction. You can prevent this behavior with the "private" keyword in order to export the type information, "type t =3D private int". Alternatively, the module can include its own specialized operations and other modules can be careful to use them. It bears emphasizing that the issues described in this note apply even when all of the code is "fully inlined" and uses highest level of optimization. Specialization in the Ocaml compiler occurs in a stage prior to inlining. If it hasn=E2=80=99t happened before inlining, it won=E2=80=99t happen afterwards. What kind of effect does lack of specialization have on performance? Calling the "caml_compare" Ocaml C runtime function to compare integers can be 10-20x times slower than using a single integer comparison machine instruction. Ditto for floating point values. The unspecialized [Array.get], [Array.set], and (on 32-bit) [Array.length] have to check the tag on the array to determine if the array uses the unboxed floating-point represntation (I wish Ocaml didn't use this!). For instance, the polymorphic [Array.get] checks the tag on the array and (for floating point arrays) reads the value and boxes the floating point value (ie, allocate 2-3 words on the heap). Note that when iterating over an array, the check on the tag will be included in _each_ loop iteration. Other impacts of using non-specialized functions: * Use of polymorphic primitives means floating point values have to be boxed, requiring heap allocation. Through use of specialized specialized primitives, in many cases floats can remain unboxed. * All the extra native code from using polymorphic primitives (checking array tags, conditionally boxing floats, calling out to Ocaml C runtime) can have follow-on effects for further inlining. In other words, when native code can be kept compact, then more code can be inlined and/or loops unrolled and this can in turn allow further optimization. Some suggestions others may find helpful: * Consider using the "private" keyword for any abstract types in your modules. We added over 50 of these to our code base. It is an ugly but effective work-around. * The min/max functions in standard library Pervasives are particularly problematic. They are polymorphic so their comparison will never be specialized. It can be helpful to define specialized functions such as: let [@inline] float_max (v0:float) (v1:float) =3D if v0 > v1 then v0 else v1 ;; let [@inline] int_max (v0:int) (v1:int) =3D if v0 > v1 then v0 else v1 ;; These will be compiled to use native machine code and unboxed values. * Any use of polymorphism can negatively affect performance. Be careful about inadvertently introducing polymorphism into your program, such as this helper function: let [@inline] getter v ofs =3D Array.get v ofs ;; This will result in unspecialized version of [Array.get] being inlined at all call-sites. Note that if your .mli file defines the function as non-polymorphic that will still _not_ affect how [getter] is compiled.: type getter : t -> int -> int ;; (* does not affect optimization *) You must cast the type in the implementation in order for [Array.get] to be specialized: let [@inline] getter (v:int array) ofs =3D Array.get v ofs ;; * All the iterators in the Array module (eg, [Array.iter]) in the standard library are polymorphic, so will use unspecialized accessors and be affected by the issues described here. Using the following to sum and array of floats may seem elegant: Array.fold_left (+) 0.0 arr However, the resulting code is much slower (and allocates 2 floats per array entry, ie 4-6 words) than a "specialized" version. Note that even if [Array.fold_left] and surrounding code were "fully inlined," it is still a polymorphic function so the above performance penalty for checking the array tag and boxing the float is present. See also the earlier example. * It can be helpful to review the compiled assembly code (using "-S" option for ocamlopt) and look for tell-tale signs of lack of specialization, such as calls to [_caml_greaterequal] or allocation [caml_call_gc] in cases where they are not expected. You can refer to the assembly code for the original implementation, know that that code will not be specialized when inlined. As I said, by examining our hot spots and following the suggestions above, we found the resulting native code could be comparable to what we would expect from C. It is unfortunate these issues were (apparently) designed into the Ocaml compiler architecture, because otherwise it would have seemed this would be a natural area of improvement for the compiler. I would have thought a staticly typed language such as Ocaml would (through its type checker) be well-suited for the simple types of function specialization described in this note.