From: Berke Durak <berke.durak@gmail.com>
Date: Sun, 22 Jan 2017 12:06:26 -0800
To: caml-list
Subject: Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds

[Re-sending to list because of size limit]

Of course, the same kinds of issues apply to Bigarray values.  A polymorphic getter:

  let get a i = a.{i}

will have pretty bad performance unless "a" is annotated with the Bigarray kind and layout.
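For example, annotating the kind and layout (here assuming float64 data in C layout) makes the getter monomorphic:

  open Bigarray

  (* The kind and layout are pinned down, so a.{i} compiles to a direct
     float64 load instead of a call through the generic accessor. *)
  let get (a : (float, float64_elt, c_layout) Array1.t) i = a.{i}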
Yesterday I rewrote a numerical double-integration inner loop in Fortran.  One of the loops looks like this:

    do ui = 1,n
       u = real(ui - 1) / real(n - 1)
       do vi = 1,ui
          v = real(vi - 1) / real(n - 1)
          p = p1 + u*p12 + v*p13
          r = norm(s - p)
          if (r > epsilon) then
             q = qs(1)*(1 - u - v) + qs(2)*u + qs(3)*v
             z = z + q/r
          end if
       end do
    end do

I was half disappointed, half pleasantly surprised when I noticed that it was only about 3 times faster than the OCaml version (with gfortran 6.3.0 and -O3 -fcheck-bounds -ffast-math).

Granted, the OCaml code is similar, written in a mostly imperative style with Bigarrays and references, but it uses a function passed to a domain iterator instead of two nested loops.  This is with 4.05+pr985+flambda.

I examined the Fortran assembly, and while it looks numerically dense, it appears that gfortran ended up calling its internal pack/unpack primitives:

        mulsd   %xmm1, %xmm2    # v, D.4022
        movsd   %xmm0, 168(%rsp)        # D.4022, A.12
        movsd   16(%r15), %xmm0 # *s_110(D), *s_110(D)
        subsd   88(%rsp), %xmm0 # %sfp, D.4022
        subsd   32(%rsp), %xmm0 # %sfp, D.4022
        subsd   %xmm2, %xmm0    # D.4022, D.4022
        movsd   %xmm0, 176(%rsp)        # D.4022, A.12
        call    _gfortran_internal_pack #
        movsd   (%rax), %xmm2   # MEM[(real(kind=8)[3] *)_117], D.4022
        cmpq    %r12, %rax      # tmp266, D.4025
        movsd   8(%rax), %xmm3  # MEM[(real(kind=8)[3] *)_117], D.4022
        movq    %rax, %rbx      #, D.4025
        mulsd   %xmm2, %xmm2    # D.4022, D.4022
        mulsd   %xmm3, %xmm3    # D.4022, D.4022
        movsd   16(%rax), %xmm0 # MEM[(real(kind=8)[3] *)_117], D.4022
        movsd   (%rsp), %xmm1   # %sfp, v
        mulsd   %xmm0, %xmm0    # D.4022, D.4022
        addsd   %xmm3, %xmm2    # D.4022, D.4022
        addsd   %xmm2, %xmm0    # D.4022, D.4022
        sqrtsd  %xmm0, %xmm0    # D.4022, D.4022
        je      .L4     #,
        leaq    192(%rsp), %rdi #, tmp328
        movq    %rax, %rsi      # D.4025,
        movsd   %xmm0, 8(%rsp)  # D.4022, %sfp
        call    _gfortran_internal_unpack       #

I'm new to Fortran, so maybe there are a few simple things that would make the code faster.  I suspect these calls are due to the vector operations, such as the call to norm(s - p) and the vector subtraction, despite norm() being defined in the same Fortran module and being inlined.  (Note that pack/unpack aren't that expensive.)

My point in describing all this is that if some of the pitfalls described here are avoided, OCaml is not bad for numerical code compared to "unoptimized" Fortran code.  Getting the "magical optimizations" from Fortran compilers (auto-vectorization, or auto-parallelization as provided by PGI Fortran) is neither automatic nor easy.

Now, a lot of scientists are stuck with Matlab, and the newer generation tends to use Python with Numpy.  Assuming the required matrix/vector operations are available in OCaml libraries (Lacaml, Fftw, etc.), we OCaml programmers will find Python + Numpy inferior to OCaml, because Python is an inferior language.  The benefit of Python is that it's easier to "just get started", since "types won't get in the way", but then it's not much better than Matlab (besides the price tag).  And yes, Python is a better language than Matlab.  (Octave and INRIA's Scilab seem to be slightly better language-wise.)  But for writing a non-throwaway numerical routine, OCaml is superior, since its high-level constructs let you produce a type-checked, fast standalone executable efficiently.

People dissatisfied with Python's performance often rave about Cython and how wonderful it is.  This thing generates C code from type-annotated Python code.  Examining the generated C and then the assembly code from the heat.pyx example, the result (with bounds checks disabled) doesn't look glorious.  The Cython code looks like this:

  @cython.boundscheck(False)
  def solve_heat_buf_nocheck(initial_conditions, double dx, double dt, int iter):
      cdef numpy.ndarray[double, ndim=2] cur = initial_conditions.copy()
      cdef numpy.ndarray[double, ndim=2] next = numpy.zeros_like(initial_conditions)
      cdef int count, i, j, M, N
      M, N = cur.shape[0], cur.shape[1]
      cdef double step
      for count in range(iter):
          for i in range(1, M-1):
              for j in range(1, N-1):
                  step = cur[i-1,j] + cur[i+1,j] + cur[i,j-1] + cur[i,j+1] - 4*cur[i,j]
                  next[i,j] = cur[i,j] + dt*step/dx**2
          cur, next = next, cur
      return cur
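For comparison, the same stencil in OCaml over annotated Bigarrays might look something like this (a rough, untested sketch of mine, mirroring the Cython version above):

  open Bigarray
  type mat = (float, float64_elt, c_layout) Array2.t

  let solve_heat (initial : mat) ~dx ~dt ~iter : mat =
    let m = Array2.dim1 initial and n = Array2.dim2 initial in
    (* cur starts as a copy of the initial conditions and next as zeros,
       mirroring copy() and zeros_like() in the Cython code. *)
    let a = Array2.create float64 c_layout m n in
    Array2.blit initial a;
    let b = Array2.create float64 c_layout m n in
    Array2.fill b 0.0;
    let cur = ref a and next = ref b in
    for _ = 1 to iter do
      let c = !cur and nx = !next in
      for i = 1 to m - 2 do
        for j = 1 to n - 2 do
          let step = c.{i-1,j} +. c.{i+1,j} +. c.{i,j-1}
                     +. c.{i,j+1} -. 4.0 *. c.{i,j} in
          nx.{i,j} <- c.{i,j} +. dt *. step /. (dx *. dx)
        done
      done;
      cur := nx; next := c
    done;
    !cur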
The Cython function above, together with two other Python functions of similar size, generates about 8000 lines of C code.  The actual loop is compiled into something that looks like this...

        if (__pyx_t_23 < 0) __pyx_t_23 += __pyx_pybuffernd_cur.diminfo[0].shape;
        if (__pyx_t_24 < 0) __pyx_t_24 += __pyx_pybuffernd_cur.diminfo[1].shape;
        __pyx_v_step = (((((*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_15, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_16, __pyx_pybuffernd_cur.diminfo[1].strides)) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_17, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_18, __pyx_pybuffernd_cur.diminfo[1].strides))) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_19, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_20, __pyx_pybuffernd_cur.diminfo[1].strides))) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_21, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_22, __pyx_pybuffernd_cur.diminfo[1].strides))) - (4.0 * (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_23, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_24, __pyx_pybuffernd_cur.diminfo[1].strides))));

...repeated a hundred times.  I pasted the excerpt at

  https://gist.github.com/anonymous/e54964121b212e5f783fb10c696ed9e2

This comes with an incredible amount of wrapper code for checking types and recursion limits.

Regarding the float array issue, here are my thoughts:

1) Why not just drop the float *array* optimization completely?  If you're writing numerical code, use Bigarrays; those are optimized.  I rarely use float arrays.

2) However, I often use structures and tuples of floats (e.g. type vec3 = float * float * float or type vec3 = { x : float; y : float; z : float }).  Having the floats in structures and tuples be boxed would be very annoying.  I'm not sure how (1) and (2) interact.

3) What about 63-bit floats?  This would be tricky and non-compliant, but the suggestion has already been made here:

  http://blog.frama-c.com/index.php?post/2013/05/09/A-63-bit-floating-point-type-for-64-bit-OCaml

These could be provided as a separate type.

4) The same goes for single-precision floats (i.e. actual C floats).  On 64-bit machines these would fit with no problems in a 64-bit word.  Wasteful, but fast.

5) The really important issue may be float boxing/unboxing when floats are passed to functions.  Consider:

  let accu_over_range i1 i2 f q0 =
    let rec loop i q = if i > i2 then q else loop (i + 1) (f q i) in
    loop i1 q0

  let _ =
    Printf.printf "%g\n"
      (accu_over_range 1 100_000_000 (fun q i -> q +. float i *. float i) 0.0)

The inner loop translates into this (with the same 4.05+pr985+flambda):

camlFloataccu__loop_133:
        subq    $8, %rsp
.L113:
        movq    %rax, %rdi
        cmpq    $200000001, %rdi
        jle     .L112
        movq    %rbx, %rax
        addq    $8, %rsp
        ret
.L112:
.L114:
        subq    $16, %r15
        movq    caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L115
        leaq    8(%r15), %rsi
        movq    $1277, -8(%rsi)
        movq    %rdi, %rax
        sarq    $1, %rax
        cvtsi2sdq       %rax, %xmm0
        movapd  %xmm0, %xmm1
        mulsd   %xmm0, %xmm1
        addsd   (%rbx), %xmm1
        movsd   %xmm1, (%rsi)
        movq    %rdi, %rax
        addq    $2, %rax
        movq    %rsi, %rbx
        jmp     .L113
.L115:
        call    caml_call_gc@PLT
.L116:
        jmp     .L114

And we see that floats are boxed.  Type annotations don't help.  I quickly tried a few Flambda options, such as

  ocamlopt -O3 floataccu-anno.ml -unbox-closures -rounds=10 -inline-call-cost=1000 -inlining-report

but that didn't change anything.  Maybe there is a way?
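For reference, hand-specializing the loop to the accumulating function at least removes the indirect call through f (a sketch of mine; I haven't checked whether the accumulator then stays unboxed on every compiler version):

  (* Manually specialized loop: the body of f is written into the loop,
     so no closure call stands between the float accumulator and the
     arithmetic. *)
  let sum_squares i1 i2 =
    let rec loop i q =
      if i > i2 then q else loop (i + 1) (q +. float i *. float i)
    in
    loop i1 0.0

  let _ = Printf.printf "%g\n" (sum_squares 1 100_000_000)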
Note that Fortran 2008 has "inner functions", which can be defined locally and passed to subroutines.  They do capture variables, but the catch is that they are limited to one nesting level.  Also, some compilers (e.g. PGI Fortran) don't support them yet.  See:

  http://www.fortran90.org/src/faq.html#does-fortran-support-closures

My point is that efficient calls with float arguments are much more important than this issue.  Having to add a few type annotations to avoid polymorphic code is inconvenient, but not being able to use functions efficiently (the fundamental construct!) is a roadblock.

6) For records containing a few floats, there is the MLton approach of laying out all unboxed fields before boxed ones.  However, as long as all-float structures are unboxed, this can be worked around by having the user manually place all the float fields in a sub-record.
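For example (a sketch; the type names are mine): the all-float sub-record is stored with its fields flat, so the outer record pays one indirection for all three floats instead of one box per float field.

  (* All-float record: OCaml stores x, y and z unboxed, in a flat block. *)
  type vec3 = { x : float; y : float; z : float }

  (* Mixed record: the floats live unboxed behind the single pos pointer. *)
  type particle = { pos : vec3; id : int; name : string }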
--
Berke Durak, VA7OBD (CN88)