From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <markghayden@yahoo.com>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by sympa.inria.fr (Postfix) with ESMTPS id E30427FA5F
	for <caml-list@sympa.inria.fr>; Thu, 19 Jan 2017 22:39:54 +0100 (CET)
Authentication-Results: mail3-smtp-sop.national.inria.fr; spf=None smtp.pra=markghayden@yahoo.com; spf=Pass smtp.mailfrom=markghayden@yahoo.com; spf=None smtp.helo=postmaster@nm20-vm0.bullet.mail.ne1.yahoo.com
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  markghayden@yahoo.com) identity=pra; client-ip=98.138.91.45;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="markghayden@yahoo.com";
  x-sender="markghayden@yahoo.com";
  x-conformance=sidf_compatible
Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of
  markghayden@yahoo.com designates 98.138.91.45 as permitted
  sender) identity=mailfrom; client-ip=98.138.91.45;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="markghayden@yahoo.com";
  x-sender="markghayden@yahoo.com";
  x-conformance=sidf_compatible; x-record-type="v=spf1"
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  postmaster@nm20-vm0.bullet.mail.ne1.yahoo.com) identity=helo;
  client-ip=98.138.91.45;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="markghayden@yahoo.com";
  x-sender="postmaster@nm20-vm0.bullet.mail.ne1.yahoo.com";
  x-conformance=sidf_compatible
IronPort-PHdr: =?us-ascii?q?9a23=3Agz66GhIgJi/WiHjg69mcpTZWNBhigK39O0sv0rFi?=
 =?us-ascii?q?tYgfK/nxwZ3uMQTl6Ol3ixeRBMOAuq4C17Od6vy8EUU7or+5+EgYd5JNUxJXwe?=
 =?us-ascii?q?43pCcHRPC/NEvgMfTxZDY7FskRHHVs/nW8LFQHUJ2mPw6arXK99yMdFQviPgRp?=
 =?us-ascii?q?OOv1BpTSj8Oq3Oyu5pHfeQtFiT6ybL9oIxi6sArdutQZjIZtN6081gbHrnxUdu?=
 =?us-ascii?q?pM2GhmP0iTnxHy5sex+J5s7SFdsO8/+sBDTKv3Yb02QaRXAzo6PW814tbrtQTY?=
 =?us-ascii?q?QguU+nQcSGQWnQFWDAXD8Rr3Q43+sir+tup6xSmaIcj7Rq06VDi+86tmTgLjhT?=
 =?us-ascii?q?wZPDAl7m7Yls1wjLpaoB2/oRx/35XUa5yROPZnY6/RYc8WSW9HU8lWSiJBH5i8?=
 =?us-ascii?q?b5MRAOUdIeZWoY79p14Uohu/AwmnGefjxzBMi3Pz26AxzuYvHhzc3AE4A90Bv2?=
 =?us-ascii?q?naotX3O6kcXu67z6fIwyvEYf5NxTf98Y3Ifgwhof2QX799d9fax0k1FwPCi1Wd?=
 =?us-ascii?q?sYvrMCmP1uQOrmOV7fBvVOKyhGE5rQF6vz+ixsI2hYnThYIVxVDE+j95wYkoO9?=
 =?us-ascii?q?K4TlV2YN6+H5tQsCGaMJF6Td8lQ2FtoSs3zKANt5C8fCgP0psnxhjfZuSGc4iO?=
 =?us-ascii?q?+BLjVfyeLS12hHJ/YL6+hwy98Uinyu37TMW7zFFKri9Dn9LRtX4NzwTe58yER/?=
 =?us-ascii?q?dn40us1zWC2xrX5+1ZO0w5mqrWJ4Ylz7MzjJYfrErOEyzslEnokqObd18o9vWp?=
 =?us-ascii?q?5unjZLjtu4WSOJVuig7kN6Qjgsy/Dvo8MggJR2Wb/+G82KP/8UHgXrVKi+E6nr?=
 =?us-ascii?q?PCv5DHIcQborC2AxNP3oYm8Rm/DjOm3M4enXYZMV5JYhKGgJLpO1HJJ/D0F+uw?=
 =?us-ascii?q?g1OpkDtzxvDGOKPuAonVI3XHk7rtZ6tx5kBfxQYpyd1T+ohYB74BLf7rX0/+rt?=
 =?us-ascii?q?3YDhs3MwyuxObnDc1w1oYEVmKVAa+ZP6PSvkWI5+0yPeaMYpQYuTbnJPgl4P7u?=
 =?us-ascii?q?imU1lkMafamsxZcXcmy3Hux6I0WFZnrhmssOHn0Pvgo6VeDqjFyCUSVPZ3upRK?=
 =?us-ascii?q?I95jQ7CJq8AovZR4CthqaB3CahEZFMaGBGEAPELXC9RoyeXPFETSOUOcxw2mgV?=
 =?us-ascii?q?ULmnUIww/ROnsgLh16BqI/aS8Sod48HNzt9wssjajxJ61TVlA8mbmzWIRnt3kW?=
 =?us-ascii?q?MBVjM72ohuqEx6zRGI1q0u0K8QLsBa+/4cClRyDpXb1eEvTo2rVw=3D=3D?=
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0CiAQBYMYFYhi1bimJEGhkBAQEBAQEBA?=
 =?us-ascii?q?QEBAQcBAQEBARQBAQEBAQEBAQEBAQcBAQEBAYICgRIBAQEBAX+BCRKDP5wMgja?=
 =?us-ascii?q?SdoIMKoJCgzYCggFAEwEBAQEBAQEBAQEBEgEBAQgLCwodMIIzG4IbAQEBAwEjB?=
 =?us-ascii?q?BkBASwGBQEECwsOBAYCAgkdAgJFBA4GExINiEgBAxAIDi2uZmiBaxgFAQEbgwg?=
 =?us-ascii?q?BAQWDYgEjJwODDQEBAQEBAQEBAQEBAQEBAQEBAQEBARUIFQJ0h0WBYIEJglE7g?=
 =?us-ascii?q?TANgkw6LYIxhy0MiHOKZTiGYYMYg22Dfx+BWFKEPYMqIIYeihuELoQnIQGBTBI?=
 =?us-ascii?q?dTxABghqCAQwDEQuCAVIBhniCOwEBAQ?=
X-IPAS-Result: =?us-ascii?q?A0CiAQBYMYFYhi1bimJEGhkBAQEBAQEBAQEBAQcBAQEBARQ?=
 =?us-ascii?q?BAQEBAQEBAQEBAQcBAQEBAYICgRIBAQEBAX+BCRKDP5wMgjaSdoIMKoJCgzYCg?=
 =?us-ascii?q?gFAEwEBAQEBAQEBAQEBEgEBAQgLCwodMIIzG4IbAQEBAwEjBBkBASwGBQEECws?=
 =?us-ascii?q?OBAYCAgkdAgJFBA4GExINiEgBAxAIDi2uZmiBaxgFAQEbgwgBAQWDYgEjJwODD?=
 =?us-ascii?q?QEBAQEBAQEBAQEBAQEBAQEBAQEBARUIFQJ0h0WBYIEJglE7gTANgkw6LYIxhy0?=
 =?us-ascii?q?MiHOKZTiGYYMYg22Dfx+BWFKEPYMqIIYeihuELoQnIQGBTBIdTxABghqCAQwDE?=
 =?us-ascii?q?QuCAVIBhniCOwEBAQ?=
X-IronPort-AV: E=Sophos;i="5.33,255,1477954800"; 
   d="scan'208";a="210026684"
Received: from nm20-vm0.bullet.mail.ne1.yahoo.com ([98.138.91.45])
  by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/AES128-GCM-SHA256; 19 Jan 2017 22:39:52 +0100
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1484861991; bh=7L3rFcHPhOeMq5t1lGlDnMUSSJhVtPxB7v++Z36qp6k=; h=Subject:From:In-Reply-To:Date:Cc:References:To:From:Subject; b=O9yNsTvRnPz1XFoVtCFoLcknAOEXgpZc1nmgf/61pllMAioZwMN3iLL/ZdfuWNAP1bJiwBtbNx2+cuNHuC2RVbXIxmJxr8+yslKfYknySbfMgRfJjmvzHsJ3GVZwZPPu/CIeWVY8j+yGzLE8fAiud/YoDwfmqm/9TwFdZzQpMDcnJdIbw0cs5Uf1iO7umhCme/ZAhPdqTNEXHxxYVJm4ItcsyYaoABex7ZQMQgDwKV41m6skIYc1PYXggUBCsYFnxBXWTiqg1BIG1zPBJLbANrhtksHq9vesWcsWVTh2UM7TFb5EaL04VQfgPxjncZtY+LSgfW1JWwLqq2vAfV1QTQ==
Received: from [98.138.226.177] by nm20.bullet.mail.ne1.yahoo.com with NNFMP; 19 Jan 2017 21:39:51 -0000
Received: from [98.138.226.169] by tm12.bullet.mail.ne1.yahoo.com with NNFMP; 19 Jan 2017 21:39:51 -0000
Received: from [127.0.0.1] by omp1070.mail.ne1.yahoo.com with NNFMP; 19 Jan 2017 21:39:51 -0000
X-Yahoo-Newman-Id: 134390.31549.bm@omp1070.mail.ne1.yahoo.com
Received: (qmail 78266 invoked from network); 19 Jan 2017 18:00:11 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s1024; t=1484848810; bh=7L3rFcHPhOeMq5t1lGlDnMUSSJhVtPxB7v++Z36qp6k=; h=Content-Type:Mime-Version:Subject:From:In-Reply-To:Date:Cc:Content-Transfer-Encoding:Message-Id:References:To; b=A5fAjqEwkf0aJ4XsTo55Hg+Yeo+5r4I8elmDk+rWOOzJ4L6F5cJo9PVGbKE5kz2PCQ1i8+aqZMCgeoJSVF54XT9xJ9zf8yjPP9mYj/3RmnZHVXIhvZufcuumv1QXARVz0mZ0MlOI71nB1GJg5z9v4UBLUcm+C8j2bt9PSjOq1jI=
X-Yahoo-Newman-Property: ymail-3
X-YMail-OSG: YGc5sAsVM1lFyAeAtVTTKgwGsDkkM07U5bjOveeKbdKgloS
 VsejW7Zp7kS0A81l1MbLFF66n9EcUNSHtT.GXz.yvP3fRp6ru_7qfHehEuEt
 yzR0_pJBfWhitoS8oNlg2WFGEJYxOOzgAtF66Az1U4P_5L0Z7RwmR3NDlH_K
 lHnNpX4B9XjuJIrNs_V4uzejGelISYFowYW3aZ9UILsFZZBCuw6jEfr6wuwW
 PSZAFhWDpo__dZ5A9zlkSEUe8oUqR8iYw1R5L74n1uuP24PucGw4E3_WNY0p
 DfBp8aSgJE9Fv54H0Qb5RBQU8tlIDKIiX79g1JNDKP4PrzcBxewnOV0oaFFt
 XdC_ntsTokuIgpBkdqIvsazf4ags2m7jyMqdHLioCZfJ.4PeaUIsDoKfDOo3
 QhJsg2jGOoKg7Sq5GI.tHrdrnPjV7cWad1iGlUmUOOSVa2HJgnt_B1EdEdWi
 kIH9xtXDFSihWm.P6bYMajE6nQ6BY2OEe0hEwKLdlUgXJu7UH3YP1DJVi2YU
 ZmEWTIyPLWxx0deMdOarpETP6O.j3ZYiyZVa4LMzcMDeRIIUGRiBfq3bazEI
 0O7b3
X-Yahoo-SMTP: 1rhWOp.swBCIsRPqec.N67IhngFT7HDF
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 10.2 \(3259\))
From: Mark Hayden <markghayden@yahoo.com>
In-Reply-To: <CACLX4jQjjGTOrooQcsu1kcGW9HOMwnNFw80B7i7-gcXi1FTCBQ@mail.gmail.com>
Date: Thu, 19 Jan 2017 09:59:55 -0800
Cc: "caml-list@inria.fr" <caml-list@inria.fr>
Content-Transfer-Encoding: quoted-printable
Message-Id: <2EA73F0B-9C8C-443F-9F05-F0F856ACF2C5@yahoo.com>
References: <2B595CCC-1121-4C8C-8F5F-A235D3AB19BB@yahoo.com>
 <CACLX4jQjjGTOrooQcsu1kcGW9HOMwnNFw80B7i7-gcXi1FTCBQ@mail.gmail.com>
To: Yaron Minsky <yminsky@janestreet.com>
X-Mailer: Apple Mail (2.3259)
Subject: Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds

I agree that removal of the current treatment of float array would eliminat=
e the biggest issues and that would avoid some of the bigger issues.  Howev=
er, our issue is not just with how float arrays are handled.  Comparisons f=
or integer/char/bool types should always be specialized after inlining.

As things stand (and apparently this is unlikely to change), Ocaml programm=
ers who want best performance need to learn to developer their software to =
carefully tailor use of polymorphism, abstraction, min/max, etc, at least f=
or parts where performance is important.  This is too bad... the type check=
er infers the types but (according to X Leroy) the type information is no l=
onger available by the time inlining occurs.

Below is an example taking the maximum (at least 0) of an array of an array=
 of integers.  The natural way to implement this is:

 let array_sum arr =3D
    Array.fold_left max 0 arr
 ;;

The inner loop compiles to the code below (again, with Array.fold_left rede=
fined to be annotated with =E2=80=9C[@inline]=E2=80=9D).  Because of how th=
e Ocaml compiler is architected, the array operations and comparison operat=
ions will not be specialized, even though the code is inlined and the array=
 type has been inferred to be [int array].  Removing the special =E2=80=9Cf=
loat array=E2=80=9D treatment from Ocaml would at least eliminate the "dead=
 code" for checking and boxing for floats, but the call to [_caml_greatereq=
ual] would still be used instead of a single comparison instruction.

L258:
	movq	(%rsp), %rbx
	.loc	1	38	14
	movzbq	-8(%rbx), %rax
	cmpq	$254, %rax
	je	L262
	.loc	1	38	14
	movq	-4(%rbx,%rdx,4), %rsi
	movq	%rsi, 24(%rsp)
	jmp	L261
	.align	2
L262:
	.loc	1	38	14
L264:
	subq	$16, %r15
	movq	_caml_young_limit@GOTPCREL(%rip), %rax
	cmpq	(%rax), %r15
	jb	L265
	leaq	8(%r15), %rsi
	movq	$1277, -8(%rsi)
	.loc	1	38	14
	movsd	-4(%rbx,%rdx,4), %xmm0
	movsd	%xmm0, (%rsi)
	movq	%rsi, 24(%rsp)
L261:
	movq	%rdi, 32(%rsp)
	.loc	5	65	17
	movq	_caml_greaterequal@GOTPCREL(%rip), %rax
	call	_caml_c_call
L256:
	movq	_caml_young_ptr@GOTPCREL(%rip), %r11
	movq	(%r11), %r15
	cmpq	$1, %rax
	je	L260
	movq	32(%rsp), %rdi
	jmp	L259
	.align	2
L260:
	movq	24(%rsp), %rdi
L259:
	movq	8(%rsp), %rdx
	movq	%rdx, %rax
	addq	$2, %rdx
	movq	%rdx, 8(%rsp)
	movq	16(%rsp), %rbx
	cmpq	%rbx, %rax
	jne	L258

If you write it like this:

 let [@inline] array_fold_left_i f (x:int) (a:int array) =3D
   let r =3D ref x in
   for i =3D 0 to Array.length a - 1 do
     r :=3D f !r (Array.unsafe_get a i)
   done;
   !r
 ;;

 let [@inline] int_max (v0:int) v1 =3D
   if v0 > v1 then v0 else v1
 ;;

 let array_int_max a =3D
   array_fold_left_i int_max 0 a
 ;;

Then the inner loop will compile as follows, which is about 5-15x faster th=
an the code above and I think most developers would be pleased with.

L284:
	.loc	1	54	14
	movq	-4(%rbx,%rsi,4), %rdx
	cmpq	%rdx, %rdi
	jle	L286
	jmp	L285
	.align	2
L286:
	movq	%rdx, %rdi
L285:
	movq	%rsi, %rdx
	addq	$2, %rsi
	cmpq	%rax, %rdx
	jne	L284


> On Jan 19, 2017, at 5:41 AM, Yaron Minsky <yminsky@janestreet.com> wrote:
>=20
> It seems like your primary issues are around lack of specialization
> around two features:
>=20
> - unboxing in float arrays
> - optimization of ad-hoc operations (e.g., polymorphic compare)
>=20
> My view on this is that it's best not to rely on float array
> specialization at all, and I think the best improvement we can make to
> OCaml is to remove the ad-hoc specialization of float arrays, and
> instead add a separate, specialized (and unboxed) type for arrays of
> floats, similar to the Bytes.t type which is effectively a specialized
> byte array.
>=20
> As you've observed, specialization is brittle, and it's best not to
> rely on it too much. Beyond that, the existence of float arrays
> complicate the runtime quite a bit, and make other bugs more likely.
>=20
> There isn't yet consensus that specialization of float array should be
> removed, but I'm still hopeful that we'll get there. It's probably
> Jane Street's highest priority ask for the compiler.
>=20
> We also avoid use of polymorphic compare and other ad-hoc operations,
> preferring to use type-specialized comparators everywhere. This is
> better for semantic as well as performance reasons, since we've seen a
> lot of subtle bugs from polymorphic compare doing the wrong thing on
> specific types.
>=20
> It's hard to deny that using type-specialized comparators is more
> verbose than polymorphic compare, but hopefully modular implicits will
> make this problem go away, and we can get the best of both worlds. And
> we hope that Flambda will be up to the job of inlining away the
> overhead of the more indirect calling conventions imposed by modular
> implicits.
>=20
> I think that with the above changes, we can probably get pretty far
> towards the goal of being able to write OCaml code that is both highly
> performant and pretty.  Being able to delay specialization until later
> in the compilation pipeline would help more, but I believe we can do
> pretty well without it.
>=20
> y
>=20
>=20
> On Thu, Jan 19, 2017 at 1:51 AM, Mark Hayden <markghayden@yahoo.com> wrot=
e:
>> We recently upgraded our Ocaml toolchain from 4.02.3 to Ocaml 4.04.0.
>> We were looking forward to a performance boost from the optimization
>> improvements, especially from flambda.  While we generally were able
>> to achieve significant performance improvement, we were somewhat
>> surprised by the effort required to avoid certain pitfalls in Ocaml.
>>=20
>> This note describes some issues we ran into.  We filed several
>> reports on Ocaml Mantis regarding our findings.  However it appears
>> the underlying issues we ran into are unlikely to change:
>>=20
>> Your three reports (0007440, 0007441, 0007442) are manifestations
>> of the same fact: the OCaml compiler performs type-based
>> optimizations first, then erases types, then performs all the other
>> optimizations. This is very unlikely to change in the near future,
>> as it would require a total rewrite of the compiler.
>>=20
>> [X Leroy, https://caml.inria.fr/mantis/view.php?id=3D7440]
>>=20
>> I encourage readers to review the problem reports we submitted, which
>> include more concrete examples.  I'm posting this note in case there
>> are others running into similar performance issues with their Ocaml
>> software and who might find it helpful in working around those
>> issues.  I'm not aware of them being documented elsewhere and there
>> appears to be little prospect of the issues being addressed in the
>> compiler in the forseeable future.  Please chime in if any of this is
>> inaccurate or there is something I missed.
>>=20
>> As an initial example, consider the following Ocaml code to find the
>> maximum floating point value in an array (that is at least 0.0):
>>=20
>> [Array.fold_left max 0.0 arr]
>>=20
>> Now compile this with the latest compiler and maximum optimization.
>> Because of how the Ocaml optimization works, this will run about
>> 10-15x slower (and allocate 2-3 words per array element) than a more
>> carefully written version that uses specialized operations and avoids
>> allocation.  See below for one way to achieve this (while still using
>> a functional-programming style).
>>=20
>> (* Same as Array.fold_left, but with type casting.
>>  *)
>> let [@inline] array_fold_leftf f (x:float) (a:float array) =3D
>>   let r =3D ref x in
>>   for i =3D 0 to Array.length a - 1 do
>>     r :=3D f !r (Array.unsafe_get a i)
>>   done;
>>   !r
>> ;;
>>=20
>> let [@inline] float_max (v0:float) v1 =3D
>>   if v0 > v1 then v0 else v1
>> ;;
>>=20
>> let array_float_max a =3D
>>   array_fold_leftf float_max 0.0 a
>> ;;
>>=20
>> The assembly for the "inner loop" for the two examples are below.
>> They were compiled with Ocaml 4.05.dev+flambda, "-O3
>> -unbox-closures", MacOS 12.2, AMD64.
>>=20
>> Unoptimized example.  Note test/branch for array tag.  Allocation for
>> boxing (we did not include the calls to trigger a minor gc).  There
>> is a call to Ocaml runtime for polymorphic greater-equal.  This is
>> probably not what one would expect from an optimizing/inline compiler
>> for a simple case such as this.  Note that to create this we used our
>> own definition of Array.fold_left which had an "[@inline]"
>> annotation.
>>=20
>> L215:
>>         movq    (%rsp), %rbx
>>         .loc    1       38      14
>>         movzbq  -8(%rbx), %rax
>>         cmpq    $254, %rax
>>         je      L219
>>         .loc    1       38      14
>>         movq    -4(%rbx,%rdx,4), %rsi
>>         movq    %rsi, 24(%rsp)
>>         jmp     L218
>>         .align  2
>> L219:
>>         .loc    1       38      14
>> L221:
>>         subq    $16, %r15
>>         movq    _caml_young_limit@GOTPCREL(%rip), %rax
>>         cmpq    (%rax), %r15
>>         jb      L222
>>         leaq    8(%r15), %rsi
>>         movq    $1277, -8(%rsi)
>>         .loc    1       38      14
>>         movsd   -4(%rbx,%rdx,4), %xmm0
>>         movsd   %xmm0, (%rsi)
>>         movq    %rsi, 24(%rsp)
>> L218:
>>         movq    %rdi, 32(%rsp)
>>         .file   5       "pervasives.ml"
>>         .loc    5       65      17
>>         movq    _caml_greaterequal@GOTPCREL(%rip), %rax
>>         call    _caml_c_call
>> L213:
>>         movq    _caml_young_ptr@GOTPCREL(%rip), %r11
>>         movq    (%r11), %r15
>>         cmpq    $1, %rax
>>         je      L217
>>         movq    32(%rsp), %rdi
>>         jmp     L216
>>         .align  2
>> L217:
>>         movq    24(%rsp), %rdi
>> L216:
>>         movq    8(%rsp), %rdx
>>         movq    %rdx, %rax
>>         addq    $2, %rdx
>>         movq    %rdx, 8(%rsp)
>>         movq    16(%rsp), %rbx
>>         cmpq    %rbx, %rax
>>         jne     L215
>>=20
>>=20
>> The assembly for the more carefully writting case is below.  No
>> allocation.  No call to external C code.  No test/branch for array
>> tag.  This matches what I think most people would like to see.  It is
>> compact enough that (maybe) it would benefit from unrolling:
>>=20
>> l225:
>>         .loc    1       46      14
>>         movsd   -4(%rax,%rdi,4), %xmm1
>>         comisd  %xmm1, %xmm0
>>         jbe     l227
>>         jmp     l226
>>         .align  2
>> l227:
>>         movapd  %xmm1, %xmm0
>> l226:
>>         movq    %rdi, %rsi
>>         addq    $2, %rdi
>>         cmpq    %rbx, %rsi
>>         jne     l225
>>=20
>>=20
>> The two main learnings we found were:
>>=20
>> * Polymorphic primitives ([Array.get], [compare], [>=3D], [min]) are
>> only specialized if they appear in a context where the types can be
>> determined at their exact call site, otherwise a polymorphic
>> version is used.  If the use of the primitive is later inlined in a
>> context where the type is no longer polymorphic, the function will
>> _not_ be specialized by the compiler.
>>=20
>> * Use of abstract data types prevents specialization.  In particular,
>> defining an abstract data type in a module ("type t ;;") will
>> prevent specialization (even after inlining) for any polymorphic
>> primitives (eg, "caml_equal") used with that type.  For instance,
>> if the underlying type for [t] is actually [int], other modules
>> will still use polymorphic equality instead of a single machine
>> instruction.  You can prevent this behavior with the "private"
>> keyword in order to export the type information, "type t =3D private
>> int".  Alternatively, the module can include its own specialized
>> operations and other modules can be careful to use them.
>>=20
>> It bears emphasizing that the issues described in this note apply
>> even when all of the code is "fully inlined" and uses highest level
>> of optimization.  Specialization in the Ocaml compiler occurs in a
>> stage prior to inlining.  If it hasn=E2=80=99t happened before inlining,=
 it
>> won=E2=80=99t happen afterwards.
>>=20
>> What kind of effect does lack of specialization have on performance?
>> Calling the "caml_compare" Ocaml C runtime function to compare
>> integers can be 10-20x times slower than using a single integer
>> comparison machine instruction.  Ditto for floating point values.
>> The unspecialized [Array.get], [Array.set], and (on 32-bit)
>> [Array.length] have to check the tag on the array to determine if the
>> array uses the unboxed floating-point represntation (I wish Ocaml
>> didn't use this!).  For instance, the polymorphic [Array.get] checks
>> the tag on the array and (for floating point arrays) reads the value
>> and boxes the floating point value (ie, allocate 2-3 words on the
>> heap).  Note that when iterating over an array, the check on the tag
>> will be included in _each_ loop iteration.
>>=20
>> Other impacts of using non-specialized functions:
>>=20
>> * Use of polymorphic primitives means floating point values have to
>> be boxed, requiring heap allocation.  Through use of specialized
>> specialized primitives, in many cases floats can remain unboxed.
>>=20
>> * All the extra native code from using polymorphic primitives
>> (checking array tags, conditionally boxing floats, calling out to
>> Ocaml C runtime) can have follow-on effects for further inlining.
>> In other words, when native code can be kept compact, then more
>> code can be inlined and/or loops unrolled and this can in turn
>> allow further optimization.
>>=20
>> Some suggestions others may find helpful:
>>=20
>> * Consider using the "private" keyword for any abstract types in your
>> modules.  We added over 50 of these to our code base.  It is an
>> ugly but effective work-around.
>>=20
>> * The min/max functions in standard library Pervasives are
>> particularly problematic.  They are polymorphic so their comparison
>> will never be specialized.  It can be helpful to define specialized
>> functions such as:
>>=20
>>   let [@inline] float_max (v0:float) (v1:float) =3D
>>     if v0 > v1 then v0 else v1
>>   ;;
>>=20
>>   let [@inline] int_max (v0:int) (v1:int) =3D
>>     if v0 > v1 then v0 else v1
>>   ;;
>>=20
>> These will be compiled to use native machine code and unboxed
>> values.
>>=20
>> * Any use of polymorphism can negatively affect performance.  Be
>> careful about inadvertently introducing polymorphism into your
>> program, such as this helper function:
>>=20
>>   let [@inline] getter v ofs =3D Array.get v ofs ;;
>>=20
>> This will result in unspecialized version of [Array.get] being
>> inlined at all call-sites.  Note that if your .mli file defines the
>> function as non-polymorphic that will still _not_ affect how
>> [getter] is compiled.:
>>=20
>>   type getter : t -> int -> int ;;  (* does not affect optimization *)
>>=20
>> You must cast the type in the implementation in order for [Array.get]
>> to be specialized:
>>=20
>>   let [@inline] getter (v:int array) ofs =3D Array.get v ofs ;;
>>=20
>> * All the iterators in the Array module (eg, [Array.iter]) in the
>> standard library are polymorphic, so will use unspecialized
>> accessors and be affected by the issues described here.  Using the
>> following to sum and array of floats may seem elegant:
>>=20
>>   Array.fold_left (+) 0.0 arr
>>=20
>> However, the resulting code is much slower (and allocates 2 floats
>> per array entry, ie 4-6 words) than a "specialized" version.  Note
>> that even if [Array.fold_left] and surrounding code were "fully
>> inlined," it is still a polymorphic function so the above
>> performance penalty for checking the array tag and boxing the float
>> is present.  See also the earlier example.
>>=20
>> * It can be helpful to review the compiled assembly code (using "-S"
>> option for ocamlopt) and look for tell-tale signs of lack of
>> specialization, such as calls to [_caml_greaterequal] or allocation
>> [caml_call_gc] in cases where they are not expected.  You can refer
>> to the assembly code for the original implementation, know that
>> that code will not be specialized when inlined.
>>=20
>> As I said, by examining our hot spots and following the suggestions
>> above, we found the resulting native code could be comparable to what
>> we would expect from C.  It is unfortunate these issues were
>> (apparently) designed into the Ocaml compiler architecture, because
>> otherwise it would have seemed this would be a natural area of
>> improvement for the compiler.  I would have thought a staticly typed
>> language such as Ocaml would (through its type checker) be
>> well-suited for the simple types of function specialization described
>> in this note.
>>=20
>>=20
>> --
>> Caml-list mailing list.  Subscription management and archives:
>> https://sympa.inria.fr/sympa/arc/caml-list
>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>> Bug reports: http://caml.inria.fr/bin/caml-bugs