From: Mark Hayden
Date: Thu, 19 Jan 2017 09:59:55 -0800
To: Yaron Minsky
Cc: "caml-list@inria.fr"
Message-Id: <2EA73F0B-9C8C-443F-9F05-F0F856ACF2C5@yahoo.com>
References: <2B595CCC-1121-4C8C-8F5F-A235D3AB19BB@yahoo.com>
Subject: Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds

I agree that removing the current treatment of float arrays would
eliminate the biggest issues. However, our issue is not just with how
float arrays are handled. Comparisons for integer/char/bool types
should always be specialized after inlining.

As things stand (and apparently this is unlikely to change), Ocaml
programmers who want the best performance need to learn to develop
their software with carefully tailored use of polymorphism,
abstraction, min/max, etc., at least for the parts where performance
is important. This is too bad...
the type checker infers the types but (according to X Leroy) the type
information is no longer available by the time inlining occurs.

Below is an example taking the maximum (at least 0) of an array of
integers. The natural way to implement this is:

  let array_max arr =
    Array.fold_left max 0 arr
  ;;

The inner loop compiles to the code below (again, with Array.fold_left
redefined to be annotated with “[@inline]”). Because of how the Ocaml
compiler is architected, the array operations and comparison operations
will not be specialized, even though the code is inlined and the array
type has been inferred to be [int array]. Removing the special “float
array” treatment from Ocaml would at least eliminate the "dead code"
for checking and boxing floats, but the call to [_caml_greaterequal]
would still be used instead of a single comparison instruction.

  L258:
          movq (%rsp), %rbx
          .loc 1 38 14
          movzbq -8(%rbx), %rax
          cmpq $254, %rax
          je L262
          .loc 1 38 14
          movq -4(%rbx,%rdx,4), %rsi
          movq %rsi, 24(%rsp)
          jmp L261
          .align 2
  L262:
          .loc 1 38 14
  L264:
          subq $16, %r15
          movq _caml_young_limit@GOTPCREL(%rip), %rax
          cmpq (%rax), %r15
          jb L265
          leaq 8(%r15), %rsi
          movq $1277, -8(%rsi)
          .loc 1 38 14
          movsd -4(%rbx,%rdx,4), %xmm0
          movsd %xmm0, (%rsi)
          movq %rsi, 24(%rsp)
  L261:
          movq %rdi, 32(%rsp)
          .loc 5 65 17
          movq _caml_greaterequal@GOTPCREL(%rip), %rax
          call _caml_c_call
  L256:
          movq _caml_young_ptr@GOTPCREL(%rip), %r11
          movq (%r11), %r15
          cmpq $1, %rax
          je L260
          movq 32(%rsp), %rdi
          jmp L259
          .align 2
  L260:
          movq 24(%rsp), %rdi
  L259:
          movq 8(%rsp), %rdx
          movq %rdx, %rax
          addq $2, %rdx
          movq %rdx, 8(%rsp)
          movq 16(%rsp), %rbx
          cmpq %rbx, %rax
          jne L258

If you write it like this:

  let [@inline] array_fold_left_i f (x:int) (a:int array) =
    let r = ref x in
    for i = 0 to Array.length a - 1 do
      r := f !r (Array.unsafe_get a i)
    done;
    !r
  ;;

  let [@inline] int_max (v0:int) v1 =
    if v0 > v1 then v0 else v1
  ;;

  let array_int_max a =
    array_fold_left_i int_max 0 a
  ;;

Then the inner loop will compile as follows, which is about 5-15x
faster than the code above and which I think most developers would be
pleased with.

  L284:
          .loc 1 54 14
          movq -4(%rbx,%rsi,4), %rdx
          cmpq %rdx, %rdi
          jle L286
          jmp L285
          .align 2
  L286:
          movq %rdx, %rdi
  L285:
          movq %rsi, %rdx
          addq $2, %rsi
          cmpq %rax, %rdx
          jne L284


> On Jan 19, 2017, at 5:41 AM, Yaron Minsky wrote:
>
> It seems like your primary issues are around lack of specialization
> around two features:
>
> - unboxing in float arrays
> - optimization of ad-hoc operations (e.g., polymorphic compare)
>
> My view on this is that it's best not to rely on float array
> specialization at all, and I think the best improvement we can make to
> OCaml is to remove the ad-hoc specialization of float arrays, and
> instead add a separate, specialized (and unboxed) type for arrays of
> floats, similar to the Bytes.t type which is effectively a specialized
> byte array.
>
> As you've observed, specialization is brittle, and it's best not to
> rely on it too much. Beyond that, the existence of float arrays
> complicates the runtime quite a bit, and makes other bugs more likely.
>
> There isn't yet consensus that specialization of float array should be
> removed, but I'm still hopeful that we'll get there. It's probably
> Jane Street's highest priority ask for the compiler.
>
> We also avoid use of polymorphic compare and other ad-hoc operations,
> preferring to use type-specialized comparators everywhere. This is
> better for semantic as well as performance reasons, since we've seen a
> lot of subtle bugs from polymorphic compare doing the wrong thing on
> specific types.
>
> It's hard to deny that using type-specialized comparators is more
> verbose than polymorphic compare, but hopefully modular implicits will
> make this problem go away, and we can get the best of both worlds.
> And we hope that Flambda will be up to the job of inlining away the
> overhead of the more indirect calling conventions imposed by modular
> implicits.
>
> I think that with the above changes, we can probably get pretty far
> towards the goal of being able to write OCaml code that is both highly
> performant and pretty. Being able to delay specialization until later
> in the compilation pipeline would help more, but I believe we can do
> pretty well without it.
>
> y
>
>
> On Thu, Jan 19, 2017 at 1:51 AM, Mark Hayden wrote:
>> We recently upgraded our Ocaml toolchain from 4.02.3 to Ocaml 4.04.0.
>> We were looking forward to a performance boost from the optimization
>> improvements, especially from flambda. While we generally were able
>> to achieve significant performance improvement, we were somewhat
>> surprised by the effort required to avoid certain pitfalls in Ocaml.
>>
>> This note describes some issues we ran into. We filed several
>> reports on Ocaml Mantis regarding our findings. However, it appears
>> the underlying issues we ran into are unlikely to change:
>>
>>     Your three reports (0007440, 0007441, 0007442) are manifestations
>>     of the same fact: the OCaml compiler performs type-based
>>     optimizations first, then erases types, then performs all the other
>>     optimizations. This is very unlikely to change in the near future,
>>     as it would require a total rewrite of the compiler.
>>
>>     [X Leroy, https://caml.inria.fr/mantis/view.php?id=7440]
>>
>> I encourage readers to review the problem reports we submitted, which
>> include more concrete examples. I'm posting this note in case there
>> are others running into similar performance issues with their Ocaml
>> software and who might find it helpful in working around those
>> issues. I'm not aware of them being documented elsewhere and there
>> appears to be little prospect of the issues being addressed in the
>> compiler in the foreseeable future.
>> Please chime in if any of this is inaccurate or there is something I
>> missed.
>>
>> As an initial example, consider the following Ocaml code to find the
>> maximum floating point value in an array (that is at least 0.0):
>>
>>   [Array.fold_left max 0.0 arr]
>>
>> Now compile this with the latest compiler and maximum optimization.
>> Because of how the Ocaml optimization works, this will run about
>> 10-15x slower (and allocate 2-3 words per array element) than a more
>> carefully written version that uses specialized operations and avoids
>> allocation. See below for one way to achieve this (while still using
>> a functional-programming style).
>>
>>   (* Same as Array.fold_left, but with type annotations. *)
>>   let [@inline] array_fold_leftf f (x:float) (a:float array) =
>>     let r = ref x in
>>     for i = 0 to Array.length a - 1 do
>>       r := f !r (Array.unsafe_get a i)
>>     done;
>>     !r
>>   ;;
>>
>>   let [@inline] float_max (v0:float) v1 =
>>     if v0 > v1 then v0 else v1
>>   ;;
>>
>>   let array_float_max a =
>>     array_fold_leftf float_max 0.0 a
>>   ;;
>>
>> The assembly for the "inner loop" of each of the two examples is
>> below. They were compiled with Ocaml 4.05.dev+flambda, "-O3
>> -unbox-closures", MacOS 12.2, AMD64.
>>
>> Unoptimized example. Note the test/branch for the array tag and the
>> allocation for boxing (we did not include the calls to trigger a
>> minor gc). There is a call to the Ocaml runtime for polymorphic
>> greater-equal. This is probably not what one would expect from an
>> optimizing/inlining compiler for a simple case such as this. Note
>> that to create this we used our own definition of Array.fold_left
>> which had an "[@inline]" annotation.
>>
>>   L215:
>>           movq (%rsp), %rbx
>>           .loc 1 38 14
>>           movzbq -8(%rbx), %rax
>>           cmpq $254, %rax
>>           je L219
>>           .loc 1 38 14
>>           movq -4(%rbx,%rdx,4), %rsi
>>           movq %rsi, 24(%rsp)
>>           jmp L218
>>           .align 2
>>   L219:
>>           .loc 1 38 14
>>   L221:
>>           subq $16, %r15
>>           movq _caml_young_limit@GOTPCREL(%rip), %rax
>>           cmpq (%rax), %r15
>>           jb L222
>>           leaq 8(%r15), %rsi
>>           movq $1277, -8(%rsi)
>>           .loc 1 38 14
>>           movsd -4(%rbx,%rdx,4), %xmm0
>>           movsd %xmm0, (%rsi)
>>           movq %rsi, 24(%rsp)
>>   L218:
>>           movq %rdi, 32(%rsp)
>>           .file 5 "pervasives.ml"
>>           .loc 5 65 17
>>           movq _caml_greaterequal@GOTPCREL(%rip), %rax
>>           call _caml_c_call
>>   L213:
>>           movq _caml_young_ptr@GOTPCREL(%rip), %r11
>>           movq (%r11), %r15
>>           cmpq $1, %rax
>>           je L217
>>           movq 32(%rsp), %rdi
>>           jmp L216
>>           .align 2
>>   L217:
>>           movq 24(%rsp), %rdi
>>   L216:
>>           movq 8(%rsp), %rdx
>>           movq %rdx, %rax
>>           addq $2, %rdx
>>           movq %rdx, 8(%rsp)
>>           movq 16(%rsp), %rbx
>>           cmpq %rbx, %rax
>>           jne L215
>>
>>
>> The assembly for the more carefully written case is below. No
>> allocation. No call to external C code. No test/branch for array
>> tag. This matches what I think most people would like to see. It is
>> compact enough that (maybe) it would benefit from unrolling:
>>
>>   L225:
>>           .loc 1 46 14
>>           movsd -4(%rax,%rdi,4), %xmm1
>>           comisd %xmm1, %xmm0
>>           jbe L227
>>           jmp L226
>>           .align 2
>>   L227:
>>           movapd %xmm1, %xmm0
>>   L226:
>>           movq %rdi, %rsi
>>           addq $2, %rdi
>>           cmpq %rbx, %rsi
>>           jne L225
>>
>>
>> The two main lessons we learned were:
>>
>> * Polymorphic primitives ([Array.get], [compare], [>=], [min]) are
>>   only specialized if they appear in a context where the types can be
>>   determined at their exact call site, otherwise a polymorphic
>>   version is used. If the use of the primitive is later inlined in a
>>   context where the type is no longer polymorphic, the function will
>>   _not_ be specialized by the compiler.
>>
>> * Use of abstract data types prevents specialization.
>>   In particular, defining an abstract data type in a module
>>   ("type t ;;") will prevent specialization (even after inlining)
>>   for any polymorphic primitives (eg, "caml_equal") used with that
>>   type. For instance, if the underlying type for [t] is actually
>>   [int], other modules will still use polymorphic equality instead
>>   of a single machine instruction. You can prevent this behavior
>>   with the "private" keyword in order to export the type
>>   information, "type t = private int". Alternatively, the module
>>   can include its own specialized operations and other modules can
>>   be careful to use them.
>>
>> It bears emphasizing that the issues described in this note apply
>> even when all of the code is "fully inlined" and uses the highest
>> level of optimization. Specialization in the Ocaml compiler occurs in
>> a stage prior to inlining. If it hasn’t happened before inlining, it
>> won’t happen afterwards.
>>
>> What kind of effect does lack of specialization have on performance?
>> Calling the "caml_compare" Ocaml C runtime function to compare
>> integers can be 10-20x slower than using a single integer
>> comparison machine instruction. Ditto for floating point values.
>> The unspecialized [Array.get], [Array.set], and (on 32-bit)
>> [Array.length] have to check the tag on the array to determine if the
>> array uses the unboxed floating-point representation (I wish Ocaml
>> didn't use this!). For instance, the polymorphic [Array.get] checks
>> the tag on the array and (for floating point arrays) reads the value
>> and boxes it (ie, allocates 2-3 words on the heap). Note that when
>> iterating over an array, the check on the tag will be included in
>> _each_ loop iteration.
>>
>> Other impacts of using non-specialized functions:
>>
>> * Use of polymorphic primitives means floating point values have to
>>   be boxed, requiring heap allocation.
>>   Through use of specialized primitives, in many cases floats can
>>   remain unboxed.
>>
>> * All the extra native code from using polymorphic primitives
>>   (checking array tags, conditionally boxing floats, calling out to
>>   the Ocaml C runtime) can have follow-on effects for further
>>   inlining. In other words, when native code can be kept compact,
>>   then more code can be inlined and/or loops unrolled and this can in
>>   turn allow further optimization.
>>
>> Some suggestions others may find helpful:
>>
>> * Consider using the "private" keyword for any abstract types in your
>>   modules. We added over 50 of these to our code base. It is an
>>   ugly but effective work-around.
>>
>> * The min/max functions in standard library Pervasives are
>>   particularly problematic. They are polymorphic so their comparison
>>   will never be specialized. It can be helpful to define specialized
>>   functions such as:
>>
>>     let [@inline] float_max (v0:float) (v1:float) =
>>       if v0 > v1 then v0 else v1
>>     ;;
>>
>>     let [@inline] int_max (v0:int) (v1:int) =
>>       if v0 > v1 then v0 else v1
>>     ;;
>>
>>   These will be compiled to use native machine code and unboxed
>>   values.
>>
>> * Any use of polymorphism can negatively affect performance. Be
>>   careful about inadvertently introducing polymorphism into your
>>   program, such as this helper function:
>>
>>     let [@inline] getter v ofs = Array.get v ofs ;;
>>
>>   This will result in the unspecialized version of [Array.get] being
>>   inlined at all call-sites.
>>   Note that if your .mli file declares the function as
>>   non-polymorphic, that will still _not_ affect how [getter] is
>>   compiled:
>>
>>     val getter : t -> int -> int ;; (* does not affect optimization *)
>>
>>   You must annotate the type in the implementation in order for
>>   [Array.get] to be specialized:
>>
>>     let [@inline] getter (v:int array) ofs = Array.get v ofs ;;
>>
>> * All the iterators in the Array module (eg, [Array.iter]) in the
>>   standard library are polymorphic, so will use unspecialized
>>   accessors and be affected by the issues described here. Using the
>>   following to sum an array of floats may seem elegant:
>>
>>     Array.fold_left (+.) 0.0 arr
>>
>>   However, the resulting code is much slower (and allocates 2 floats
>>   per array entry, ie 4-6 words) than a "specialized" version. Note
>>   that even if [Array.fold_left] and surrounding code were "fully
>>   inlined," it is still a polymorphic function so the above
>>   performance penalty for checking the array tag and boxing the float
>>   is present. See also the earlier example.
>>
>> * It can be helpful to review the compiled assembly code (using the
>>   "-S" option for ocamlopt) and look for tell-tale signs of lack of
>>   specialization, such as calls to [_caml_greaterequal] or allocation
>>   ([caml_call_gc]) in cases where they are not expected. You can refer
>>   to the assembly code for the original implementation, knowing that
>>   that code will not be specialized when inlined.
>>
>> As I said, by examining our hot spots and following the suggestions
>> above, we found the resulting native code could be comparable to what
>> we would expect from C. It is unfortunate these issues were
>> (apparently) designed into the Ocaml compiler architecture, because
>> otherwise it would have seemed this would be a natural area of
>> improvement for the compiler.
>> I would have thought a statically typed
>> language such as Ocaml would (through its type checker) be
>> well-suited for the simple types of function specialization described
>> in this note.
>>
>>
>> --
>> Caml-list mailing list.  Subscription management and archives:
>> https://sympa.inria.fr/sympa/arc/caml-list
>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>> Bug reports: http://caml.inria.fr/bin/caml-bugs
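
A minimal sketch of two of the work-arounds from the quoted note:
exporting a type as "private" so that other modules can see its
representation, and a hand-specialized loop in place of a polymorphic
fold over a float array. The module name [Id], its functions, and
[float_sum] are hypothetical names chosen for illustration. Whether a
given compiler version actually emits the specialized code is best
confirmed by checking the "-S" output, as the note suggests.

```ocaml
(* Hypothetical module, for illustration only. *)
module Id : sig
  (* "private" exports the representation (int) to other modules while
     still requiring them to construct values through [of_int]. *)
  type t = private int
  val of_int : int -> t
  (* A specialized comparison exported by the module itself. *)
  val equal : t -> t -> bool
end = struct
  type t = int
  let of_int n = n
  (* The annotations fix the type to int at the site of [=]. *)
  let equal (a : int) (b : int) = a = b
end

(* Hypothetical hand-specialized sum: the annotation on [a] fixes the
   element type to float where [Array.unsafe_get] is applied. *)
let float_sum (a : float array) =
  let r = ref 0.0 in
  for i = 0 to Array.length a - 1 do
    r := !r +. Array.unsafe_get a i
  done;
  !r

let () =
  let x = Id.of_int 3 and y = Id.of_int 3 in
  assert (Id.equal x y);
  (* A private type can be read back as its representation. *)
  assert ((x :> int) = 3);
  assert (float_sum [| 1.0; 2.0; 3.5 |] = 6.5)
```

Outside the module, a value of type [Id.t] can be read as an int with
(x :> int) but can only be built through [Id.of_int], so the
abstraction is kept for construction while the representation stays
visible to the compiler.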