From: Berke Durak <berke.durak@gmail.com>
Date: Sun, 22 Jan 2017 12:06:26 -0800
To: caml-list
Subject: Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds

[Re-sending to list because of size limit]

Of course, the same kinds of issues apply to Bigarray values.  A polymorphic getter:

  let get a i = a.{i}

will have pretty bad performance unless "a" is annotated with the Bigarray kind and layout.
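For example, annotating the kind and layout (here assuming float64 data in C layout) makes the getter monomorphic:

  open Bigarray

  (* The kind and layout are pinned down, so a.{i} compiles to a direct
     float64 load instead of a call through the generic accessor. *)
  let get (a : (float, float64_elt, c_layout) Array1.t) i = a.{i}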
Yesterday I rewrote a numerical double-integration inner loop in Fortran.  One of the loops looks like this:

    do ui = 1,n
       u = real(ui - 1) / real(n - 1)
       do vi = 1,ui
          v = real(vi - 1) / real(n - 1)
          p = p1 + u*p12 + v*p13
          r = norm(s - p)
          if (r > epsilon) then
             q = qs(1)*(1 - u - v) + qs(2)*u + qs(3)*v
             z = z + q/r
          end if
       end do
    end do

I was half disappointed, half pleasantly surprised when I noticed that it was only about 3 times faster than the OCaml version (with gfortran 6.3.0 and -O3 -fcheck-bounds -ffast-math).

Granted, the OCaml code is similar, written in a mostly imperative style with Bigarrays and references, but it uses a function passed to a domain iterator instead of two nested loops.  This is with 4.05+pr985+flambda.

I examined the Fortran assembly, and while it looks numerically dense, it appears that gfortran ended up calling its internal pack/unpack primitives:

        mulsd   %xmm1, %xmm2    # v, D.4022
        movsd   %xmm0, 168(%rsp)        # D.4022, A.12
        movsd   16(%r15), %xmm0 # *s_110(D), *s_110(D)
        subsd   88(%rsp), %xmm0 # %sfp, D.4022
        subsd   32(%rsp), %xmm0 # %sfp, D.4022
        subsd   %xmm2, %xmm0    # D.4022, D.4022
        movsd   %xmm0, 176(%rsp)        # D.4022, A.12
        call    _gfortran_internal_pack #
        movsd   (%rax), %xmm2   # MEM[(real(kind=8)[3] *)_117], D.4022
        cmpq    %r12, %rax      # tmp266, D.4025
        movsd   8(%rax), %xmm3  # MEM[(real(kind=8)[3] *)_117], D.4022
        movq    %rax, %rbx      #, D.4025
        mulsd   %xmm2, %xmm2    # D.4022, D.4022
        mulsd   %xmm3, %xmm3    # D.4022, D.4022
        movsd   16(%rax), %xmm0 # MEM[(real(kind=8)[3] *)_117], D.4022
        movsd   (%rsp), %xmm1   # %sfp, v
        mulsd   %xmm0, %xmm0    # D.4022, D.4022
        addsd   %xmm3, %xmm2    # D.4022, D.4022
        addsd   %xmm2, %xmm0    # D.4022, D.4022
        sqrtsd  %xmm0, %xmm0    # D.4022, D.4022
        je      .L4     #,
        leaq    192(%rsp), %rdi #, tmp328
        movq    %rax, %rsi      # D.4025,
        movsd   %xmm0, 8(%rsp)  # D.4022, %sfp
        call    _gfortran_internal_unpack       #

I'm new to Fortran, so maybe there are a few simple things that would make the code faster.  I suspect these calls are due to the vector operations, such as the call to norm(s - p) and the vector subtraction, despite norm() being defined in the same Fortran module and being inlined.  (Note that pack/unpack aren't that expensive.)

My point in describing all this is that if some of the pitfalls described here are avoided, OCaml is not bad for numerical code compared to "unoptimized" Fortran code.  Getting the "magical optimizations" from Fortran compilers (auto-vectorization, or auto-parallelization as provided by PGI Fortran) is neither automatic nor easy.

Now, a lot of scientists are stuck with Matlab, and the newer generation tends to use Python with Numpy.  Assuming the required matrix/vector operations are available in OCaml libraries (Lacaml, Fftw, etc.), we OCaml programmers will find Python + Numpy inferior to OCaml, because Python is an inferior language.  The benefit of Python is that it's easier to "just get started", since "types won't get in the way", but then it's not much better than Matlab (besides the price tag).  And yes, Python is a better language than Matlab.  (Octave and INRIA's Scilab seem to be slightly better language-wise.)  But for writing a non-throwaway numerical routine, OCaml is superior, since its high-level constructs let you produce a type-checked, fast standalone executable efficiently.

People dissatisfied with Python's performance often rave about Cython and how wonderful it is.  This thing generates C code from type-annotated Python code.  Examining the generated C and then the assembly code from the heat.pyx example, the result (with bounds checks disabled) doesn't look glorious.  The Cython code looks like this:

  @cython.boundscheck(False)
  def solve_heat_buf_nocheck(initial_conditions, double dx, double dt, int iter):
      cdef numpy.ndarray[double, ndim=2] cur = initial_conditions.copy()
      cdef numpy.ndarray[double, ndim=2] next = numpy.zeros_like(initial_conditions)
      cdef int count, i, j, M, N
      M, N = cur.shape[0], cur.shape[1]
      cdef double step
      for count in range(iter):
          for i in range(1, M-1):
              for j in range(1, N-1):
                  step = cur[i-1,j] + cur[i+1,j] + cur[i,j-1] + cur[i,j+1] - 4*cur[i,j]
                  next[i,j] = cur[i,j] + dt*step/dx**2
          cur, next = next, cur
      return cur
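For comparison, the same stencil in OCaml over annotated Bigarrays might look something like this (a rough, untested sketch of mine, mirroring the Cython version above):

  open Bigarray
  type mat = (float, float64_elt, c_layout) Array2.t

  let solve_heat (initial : mat) ~dx ~dt ~iter : mat =
    let m = Array2.dim1 initial and n = Array2.dim2 initial in
    (* cur starts as a copy of the initial conditions and next as zeros,
       mirroring copy() and zeros_like() in the Cython code. *)
    let a = Array2.create float64 c_layout m n in
    Array2.blit initial a;
    let b = Array2.create float64 c_layout m n in
    Array2.fill b 0.0;
    let cur = ref a and next = ref b in
    for _ = 1 to iter do
      let c = !cur and nx = !next in
      for i = 1 to m - 2 do
        for j = 1 to n - 2 do
          let step = c.{i-1,j} +. c.{i+1,j} +. c.{i,j-1}
                     +. c.{i,j+1} -. 4.0 *. c.{i,j} in
          nx.{i,j} <- c.{i,j} +. dt *. step /. (dx *. dx)
        done
      done;
      cur := nx; next := c
    done;
    !cur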
The Cython function above, together with two other Python functions of similar size, generates about 8000 lines of C code.  The actual loop is compiled into something that looks like this...

        if (__pyx_t_23 < 0) __pyx_t_23 += __pyx_pybuffernd_cur.diminfo[0].shape;
        if (__pyx_t_24 < 0) __pyx_t_24 += __pyx_pybuffernd_cur.diminfo[1].shape;
        __pyx_v_step = (((((*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_15, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_16, __pyx_pybuffernd_cur.diminfo[1].strides)) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_17, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_18, __pyx_pybuffernd_cur.diminfo[1].strides))) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_19, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_20, __pyx_pybuffernd_cur.diminfo[1].strides))) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_21, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_22, __pyx_pybuffernd_cur.diminfo[1].strides))) - (4.0 * (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_23, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_24, __pyx_pybuffernd_cur.diminfo[1].strides))));

...repeated a hundred times.  I pasted the excerpt at

  https://gist.github.com/anonymous/e54964121b212e5f783fb10c696ed9e2

This comes with an incredible amount of wrapper code for checking types and recursion limits.

Regarding the float array issue, here are my thoughts:

1) Why not just drop the float *array* optimization completely?  If you're writing numerical code, use Bigarrays; those are optimized.  I rarely use float arrays.

2) However, I often use structures and tuples of floats (e.g. type vec3 = float * float * float or type vec3 = { x : float; y : float; z : float }).  Having the floats in structures and tuples be boxed would be very annoying.  I'm not sure how (1) and (2) interact.

3) What about 63-bit floats?  This would be tricky and non-compliant, but the suggestion has already been made here:

  http://blog.frama-c.com/index.php?post/2013/05/09/A-63-bit-floating-point-type-for-64-bit-OCaml

These could be provided as a separate type.

4) The same goes for single-precision floats (i.e. actual C floats).  On 64-bit machines these would fit with no problems in a 64-bit word.  Wasteful, but fast.

5) The really important issue may be float boxing/unboxing when floats are passed to functions.  Consider:

  let accu_over_range i1 i2 f q0 =
    let rec loop i q = if i > i2 then q else loop (i + 1) (f q i) in
    loop i1 q0

  let _ =
    Printf.printf "%g\n"
      (accu_over_range 1 100_000_000 (fun q i -> q +. float i *. float i) 0.0)

The inner loop translates into this (with the same 4.05+pr985+flambda):

camlFloataccu__loop_133:
        subq    $8, %rsp
.L113:
        movq    %rax, %rdi
        cmpq    $200000001, %rdi
        jle     .L112
        movq    %rbx, %rax
        addq    $8, %rsp
        ret
.L112:
.L114:
        subq    $16, %r15
        movq    caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L115
        leaq    8(%r15), %rsi
        movq    $1277, -8(%rsi)
        movq    %rdi, %rax
        sarq    $1, %rax
        cvtsi2sdq       %rax, %xmm0
        movapd  %xmm0, %xmm1
        mulsd   %xmm0, %xmm1
        addsd   (%rbx), %xmm1
        movsd   %xmm1, (%rsi)
        movq    %rdi, %rax
        addq    $2, %rax
        movq    %rsi, %rbx
        jmp     .L113
.L115:
        call    caml_call_gc@PLT
.L116:
        jmp     .L114

And we see that floats are boxed.  Type annotations don't help.  I quickly tried a few Flambda options, such as

  ocamlopt -O3 floataccu-anno.ml -unbox-closures -rounds=10 -inline-call-cost=1000 -inlining-report

but that didn't change anything.  Maybe there is a way?
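For reference, hand-specializing the loop to the accumulating function at least removes the indirect call through f (a sketch of mine; I haven't checked whether the accumulator then stays unboxed on every compiler version):

  (* Manually specialized loop: the body of f is written into the loop,
     so no closure call stands between the float accumulator and the
     arithmetic. *)
  let sum_squares i1 i2 =
    let rec loop i q =
      if i > i2 then q else loop (i + 1) (q +. float i *. float i)
    in
    loop i1 0.0

  let _ = Printf.printf "%g\n" (sum_squares 1 100_000_000)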
Note that Fortran 2008 has "inner functions", which can be defined locally and passed to subroutines.  They do capture variables, but the catch is that they are limited to one nesting level.  Also, some compilers (e.g. PGI Fortran) don't support them yet.  See:

  http://www.fortran90.org/src/faq.html#does-fortran-support-closures

My point is that efficient calls with float arguments are much more important than this issue.  Having to add a few type annotations to avoid polymorphic code is inconvenient, but not being able to use functions efficiently (the fundamental construct!) is a roadblock.

6) For records containing a few floats, there is the MLton approach of laying out all unboxed fields before boxed ones.  However, as long as all-float structures are unboxed, this can be worked around by having the user manually place all the float fields in a sub-record.
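For example (a sketch; the type names are mine): the all-float sub-record is stored with its fields flat, so the outer record pays one indirection for all three floats instead of one box per float field.

  (* All-float record: OCaml stores x, y and z unboxed, in a flat block. *)
  type vec3 = { x : float; y : float; z : float }

  (* Mixed record: the floats live unboxed behind the single pos pointer. *)
  type particle = { pos : vec3; id : int; name : string }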
--
Berke Durak, VA7OBD (CN88)