From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ivg@ieee.org>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by sympa.inria.fr (Postfix) with ESMTPS id B766E80175
	for <caml-list@sympa.inria.fr>; Thu, 15 Jun 2017 12:38:23 +0200 (CEST)
Authentication-Results: mail3-smtp-sop.national.inria.fr; spf=None smtp.pra=ivg@ieee.org; spf=Pass smtp.mailfrom=ivg@ieee.org; spf=None smtp.helo=postmaster@mail-yw0-f170.google.com
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  ivg@ieee.org) identity=pra; client-ip=209.85.161.170;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="ivg@ieee.org"; x-sender="ivg@ieee.org";
  x-conformance=sidf_compatible
Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of
  ivg@ieee.org designates 209.85.161.170 as permitted sender)
  identity=mailfrom; client-ip=209.85.161.170;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="ivg@ieee.org"; x-sender="ivg@ieee.org";
  x-conformance=sidf_compatible; x-record-type="v=spf1"
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  postmaster@mail-yw0-f170.google.com) identity=helo;
  client-ip=209.85.161.170;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="ivg@ieee.org";
  x-sender="postmaster@mail-yw0-f170.google.com";
  x-conformance=sidf_compatible
IronPort-PHdr: =?us-ascii?q?9a23=3AGmEPyhR+73vuTSKa36WuMR7redpsv+yvbD5Q0YIu?=
 =?us-ascii?q?jvd0So/mwa6yZR2N2/xhgRfzUJnB7Loc0qyN4v+mATRIyK3CmUhKSIZLWR4BhJ?=
 =?us-ascii?q?detC0bK+nBN3fGKuX3ZTcxBsVIWQwt1Xi6NU9IBJS2PAWK8TW94jEIBxrwKxd+?=
 =?us-ascii?q?KPjrFY7OlcS30P2594HObwlSijewZbF/IA+qoQnNq8IbnZZsJqEtxxXTv3BGYf?=
 =?us-ascii?q?5WxWRmJVKSmxbz+MK994N9/ipTpvws6ddOXb31cKokQ7NYCi8mM30u683wqRbD?=
 =?us-ascii?q?VwqP6WACXWgQjxFFHhLK7BD+Xpf2ryv6qu9w0zSUMMHqUbw5Xymp4rx1QxH0li?=
 =?us-ascii?q?gIKz858HnWisNuiqJbvAmhrAF7z4LNfY2ZKOZycqbbcNwdWGRBQ91RVzRfDYyg?=
 =?us-ascii?q?c4sBAe0BPeNCoIn8oVsFsB+yCAaoCe/qzDJDm3340rAg0+k5Ew7G0gwuEdwNvn?=
 =?us-ascii?q?rJstv6KLwfUeWpwKTS1zjPc+9a1DX75YPVch4hu/aMXbdofMTS10kgDQXFhUiR?=
 =?us-ascii?q?p4ziIzOV0foNvHSb7+phSeKvkHMspgZwojixycchkYjJiZwLxV/a7yl5x5w1Jd?=
 =?us-ascii?q?KhRUN9fNWqHpxQtySAOIt3RMMvW2BouCAgyr0Ho5G3ZiYKyI4/yx/fcfOHc4+I?=
 =?us-ascii?q?4hX5WOmNJjd4gXRoc6+8iRaq6UWs1PHwW82u3FtJridJiMTAu3EQ2xDJ98SKSO?=
 =?us-ascii?q?dx80G80jiVzQ/T8PtLIUUsmKrbNZEhxrkwm4IWsUvZHy/2nFz6ja+Yd0k44+So?=
 =?us-ascii?q?5fnrb7f6qpOGOI90jQb+MqsqmsOhG+g3Lg8OX22D9eS90r3s41H5Ta1UgvEqlq?=
 =?us-ascii?q?TVqpPXKMQBqqKkAgJZz5wv5wu9Aju6yNgYmGMILFNBeBKJlYjpPFTOLej5Dfeh?=
 =?us-ascii?q?jFShizZryO7YMbL/GJnNKWLDkLj5cbZn90Fc0BYzzcxY559MFr4OOvfzWkvouN?=
 =?us-ascii?q?zcDx85KBC0zv38CNR904MeQXiADrWYMKPUq1+I5/ggL/OCZI8P637BLK0f4Pjn?=
 =?us-ascii?q?izcdlBc9bOH9x5wRYXb+GvlmMm2WZHPthpEKFmJc7SQkS+m/qUOLV3Z8YGq1Qa?=
 =?us-ascii?q?k85y0gQNanE4jrR42gjfqGxijtTc4eXXxPFl3ZSSSgTI6DQfpZLX/LLw=3D=3D?=
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0CBAAC/YkJZhqqhVdFcHAEBBAEBCgEBG?=
 =?us-ascii?q?AYMgkWBS4ENB4NvgTaaWoJAgn6CbY8OA1wHJYV4AoJVB0IVAQEBAQEBAQEBAQE?=
 =?us-ascii?q?SAQEBCAsLCCgvgjMkgkIBAgIBIwQZAQEpAwsBBAsJAgsaHQICAh8BEgEFAQoSB?=
 =?us-ascii?q?hMICok1TQMIBQgQjFmRGj+LHWuBbDqDCQEBBYQpAwqECAEBAQEBAQEDAQEBAQE?=
 =?us-ascii?q?BAQEBEAcIEoQbgjWBYIMigliBfisWgmWCYQGHcwyJBotzgRo7hBKCHn2HQIRng?=
 =?us-ascii?q?gdVhHGKPotRh2YUH4EVDyaBLIEBCEkZBoQ1Kh8lgU8+NgGHE4I+AQEB?=
X-IPAS-Result: =?us-ascii?q?A0CBAAC/YkJZhqqhVdFcHAEBBAEBCgEBGAYMgkWBS4ENB4N?=
 =?us-ascii?q?vgTaaWoJAgn6CbY8OA1wHJYV4AoJVB0IVAQEBAQEBAQEBAQESAQEBCAsLCCgvg?=
 =?us-ascii?q?jMkgkIBAgIBIwQZAQEpAwsBBAsJAgsaHQICAh8BEgEFAQoSBhMICok1TQMIBQg?=
 =?us-ascii?q?QjFmRGj+LHWuBbDqDCQEBBYQpAwqECAEBAQEBAQEDAQEBAQEBAQEBEAcIEoQbg?=
 =?us-ascii?q?jWBYIMigliBfisWgmWCYQGHcwyJBotzgRo7hBKCHn2HQIRnggdVhHGKPotRh2Y?=
 =?us-ascii?q?UH4EVDyaBLIEBCEkZBoQ1Kh8lgU8+NgGHE4I+AQEB?=
X-IronPort-AV: E=Sophos;i="5.39,343,1493676000"; 
   d="ml'?scan'208,217";a="228405315"
Received: from mail-yw0-f170.google.com ([209.85.161.170])
  by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/AES128-GCM-SHA256; 15 Jun 2017 12:38:21 +0200
Received: by mail-yw0-f170.google.com with SMTP id 63so3664902ywr.0
        for <caml-list@inria.fr>; Thu, 15 Jun 2017 03:38:21 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=ieee-org.20150623.gappssmtp.com; s=20150623;
        h=mime-version:in-reply-to:references:from:date:message-id:subject:to
         :cc;
        bh=AyLMbOPWdGQKJTSvaVFgdhmhfg62ftc1MyA5ULp10oM=;
        b=ZrKJqt5pW6KtzeAXl1lHNbYJOYq1JWPJWUAz+vmywZ/1alDbGsB6utMjGm1hih3IKJ
         DCCbRp90G+XbVkv0x6jY1HU5vuZQoDP9C5I3dcw8YhWvX0CmZZVVSAQatL5zxHfY/C5W
         yNIaT3OPp50xHLP4ExJyNF3vNcgV3nGSb+5rEk/PMJzWbQuAioq8cGCyniRw+nG/16EN
         udQ/PLEYphIbmx0nOKVwArNg0d+fsPWDuruHrotuGsLhyT8Ul931u943cVw5fQLKl06g
         nIM4lfoOoiGIzPekH0PuB+RFGWO0Tj19uSRWAyN4gHlENn+IES1EusGGmg6Q77kLvhnl
         YpVA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:in-reply-to:references:from:date
         :message-id:subject:to:cc;
        bh=AyLMbOPWdGQKJTSvaVFgdhmhfg62ftc1MyA5ULp10oM=;
        b=QAVIL1a1+UQWWkCDsxszPwEuaPDVgwM00Pknnqsh3OkKq3w6i1lq7EL6YtEdv7Nw3r
         B8puj8Wdj/eXAw2I7Fma99cLpUH0EnIvbMAMQ1VwRGEP5Eod8pdmfuiuyxsLfU+BK+z9
         vLMJJ515tcj6tkKtgx7+o2IaW0Z67MoCDg4uiQo0JgoXfI1JkUjI7q3X3+sJg41VR9YJ
         dK302wCM505uAjBCMEMBqtn7dGjKv3uGIp/caa8WK4DOuvrFCRM1JyLHCR99ifOmdHY9
         pSpLD1O5DVfZtKI8zJkpVheGB0/fe9PHyEALmnmqWYXo/XoWYhxDkpfCQonFx8xM3ww7
         fSHA==
X-Gm-Message-State: AKS2vOylcxnLy3sYU44ORoAyBtRBrNLDiOx6XG7bmsoB+KNQQFY0LTFM
	b3yTcJeHjKoTmkAO+IRJDcSelSVpeWPk
X-Received: by 10.13.232.10 with SMTP id r10mr4026740ywe.161.1497523099708;
 Thu, 15 Jun 2017 03:38:19 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.129.26.1 with HTTP; Thu, 15 Jun 2017 03:38:18 -0700 (PDT)
In-Reply-To: <CAEc-HQm8ZxOeiMFtAaqo_0ZgPBqrmcfpyX2UxwUX+u0mLnFf4A@mail.gmail.com>
References: <412f58b7-8a8b-2356-2626-e1bd010be683@bioreg.kyushu-u.ac.jp> <CAEc-HQm8ZxOeiMFtAaqo_0ZgPBqrmcfpyX2UxwUX+u0mLnFf4A@mail.gmail.com>
From: Ivan Gotovchits <ivg@ieee.org>
Date: Thu, 15 Jun 2017 12:38:18 +0200
Message-ID: <CALdWJ+zsKqDm+C74TGO8K6U4JvQO_SPspv41EDk125RQepWRYw@mail.gmail.com>
To: Ronan Le Hy <ronan.lehy@gmail.com>
Cc: Francois BERENGER <berenger@bioreg.kyushu-u.ac.jp>, OCaml Mailing List <caml-list@inria.fr>
Content-Type: multipart/mixed; boundary="94eb2c0886586708250551fd44ef"
Subject: Re: [Caml-list] Can this code be accelerated by porting it to SPOC,
 SAREK or MetaOCaml ?

--94eb2c0886586708250551fd44ef
Content-Type: multipart/alternative; boundary="94eb2c0886586708210551fd44ed"

--94eb2c0886586708210551fd44ed
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Fran=C3=A7ois,

Your code is actually bound by memory, not by computation. The problem is
that you create lots of data, and all the time is spent in the allocation
functions and GC. The actual computation consumes less than 5% of overall
program execution, that can be easily seen by profiling (always profile
before starting optimization). Thus moving to GPU will make this 5% of code
run faster, at the expense of even higher overhead of moving data between
CPU and GPU.

So, let's see how we can optimize your code. First of all, your code uses
too much mutable state, that can have a negative performance impact in
OCaml due to write barriers. Let's make it a little bit cleaner and faster:


let nonmutab points =3D
  let rec loop xs acc =3D match xs with
    | [] -> acc
    | (x,w) :: xs ->
      List.fold_left (fun acc (x',w') ->
          (Vector3.dist x x', w *. w') :: acc) acc xs |>
      loop xs in
  loop points []

This code runs a little bit faster (on 10000 points):

original: 14084.1 ms
nonmutab: 11309.2 ms

It computes absolutely the same result, using the same computational core,
and allocating the same amount of trash. The only difference is that we are
not using a mutable state anymore, and we are rewarded for that.

The next step is to consider, do you really need to produce these
intermediate structures, if the result of your program is the computed
data, then you can just store it in a file. Or, if you need later this data
to be reduced to some value, then you shouldn't create an intermediate
result, and apply the reduction in place (so called de-foresting). So,
let's generalize a little bit the function so that instead of producing a
new list, the function just applies a user-provided function to all
produced elements, along with an accumulator.

let nonalloc f acc points =3D
  let rec loop xs acc =3D match xs with
    | [] -> acc
    | (x,w) :: xs ->
      List.fold_left (fun acc (x',w') ->
          f acc (Vector3.dist x x') (w *. w')) acc xs |>
      loop xs in
  loop points acc


This function will not allocate any unnecessary data (it will though
allocate a pair of floats per each call). So we can use this function to
implement the `nonmutab` version, by passing a consing operator and an
empty list. We can also try to use it to store data. The data storage
process depends on how fast we can reformat the data, and how fast is the
sink device. If we will output to /dev/null (that is known to be fast),
then we can use the Marshal module and get about 300 MB/s throughtput. Not
bad, but still about 10 seconds of running time. If, for example, we just
need some scalar metrics, then we don't need to pay an overhead of data,
and it will be as fast as possible. For example, with the following
implementations of the kernel function

let flags =3D [Marshal.No_sharing]

let print () x1 x2 =3D
  Marshal.to_channel stdout x1 flags;
  Marshal.to_channel stdout x2 flags

let product total x1 x2 =3D total +. x1 *. x2


We will have the following timings:

printall: 9649.54 ms
nonalloc: 541.186 ms


So, the actual computation took only half a second, the rest is data
processing.

Please find attached the whole example. You can run it with the following
command:

ocamlbuild -pkgs vector3,unix points.native -- > /dev/null


Regards,
Ivan Gotovchits


On Thu, Jun 15, 2017 at 9:09 AM, Ronan Le Hy <ronan.lehy@gmail.com> wrote:

> Hi Fran=C3=A7ois,
>
> 2017-06-15 8:28 GMT+02:00 Francois BERENGER <berenger@bioreg.kyushu-u.ac.
> jp>:
> > I am wondering if some high performance OCaml experts out there
> > can know in advance if some code can go faster by executing it
> > on a GPU.
> > I have some clear bottleneck in my program.
> > Here is how the code looks like:
> > ---
> > let f (points: (Vector3.t * float) list) =3D
> >   let acc =3D ref [] in
> >   let ac p1 x (p2, y) =3D
> >     acc :=3D (Vector3.dist p1 p2, x *. y) :: !acc
> >   in
> >   let rec loop =3D function
> >     | [] -> ()
> >     | (p1, x) :: xs ->
> >       L.iter (ac p1 x) xs;
> >       loop xs
> >   in
> >   loop points;
> >   !acc
>
> As a baseline before attempting anything on the GPU, I'd vectorize
> this. Put all your vectors in a big matrix. Put all your numbers in a
> vector. Compute the distances and the products all at once using
> Lacaml (make sure you use OpenBLAS as a backend). I'd expect it to be
> much faster than the above loop already.
>
> --
> Ronan
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

--94eb2c0886586708210551fd44ed
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><span style=3D"font-size:12.8px">Hi Fran=C3=A7ois,</span><=
br><div><span style=3D"font-size:12.8px"><br></span></div><div><span style=
=3D"font-size:12.8px">Your code is actually bound by memory, not by computa=
tion. The problem is that you create lots of data, and all the time is spen=
t in the allocation functions and GC. The actual computation consumes less =
than 5% of overall program execution, that can be easily seen by profiling =
(always profile before starting optimization). Thus moving to GPU will make=
 this 5% of code run faster, at the expense of even higher overhead of movi=
ng data between CPU and GPU.=C2=A0</span></div><div><span style=3D"font-siz=
e:12.8px"><br></span></div><div><span style=3D"font-size:12.8px">So, let=
9;s see how we can optimize your code. First of all, your code uses too muc=
h mutable state, that can have a negative performance impact in OCaml due t=
o write barriers. Let&#39;s make it a little bit cleaner and faster:</span>=
</div><div><span style=3D"font-size:12.8px">=C2=A0 =C2=A0</span><br></div><=
blockquote style=3D"margin:0px 0px 0px 40px;border:none;padding:0px"><div><=
div><span style=3D"font-size:12.8px"><font face=3D"monospace, monospace">le=
t nonmutab points =3D=C2=A0</font></span></div></div><div><div><span style=
=3D"font-size:12.8px"><font face=3D"monospace, monospace">=C2=A0 let rec lo=
op xs acc =3D match xs with</font></span></div></div><div><div><span style=
=3D"font-size:12.8px"><font face=3D"monospace, monospace">=C2=A0 =C2=A0 | [=
] -&gt; acc</font></span></div></div><div><div><span style=3D"font-size:12.=
8px"><font face=3D"monospace, monospace">=C2=A0 =C2=A0 | (x,w) :: xs -&gt;=
=C2=A0</font></span></div></div><div><div><span style=3D"font-size:12.8px">=
<font face=3D"monospace, monospace">=C2=A0 =C2=A0 =C2=A0 List.fold_left (fu=
n acc (x&#39;,w&#39;) -&gt;=C2=A0</font></span></div></div><div><div><span =
style=3D"font-size:12.8px"><font face=3D"monospace, monospace">=C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 (Vector3.dist x x&#39;, w *. w&#39;) :: acc) acc x=
s |&gt;=C2=A0</font></span></div></div><div><div><span style=3D"font-size:1=
2.8px"><font face=3D"monospace, monospace">=C2=A0 =C2=A0 =C2=A0 loop xs in<=
/font></span></div></div><div><div><span style=3D"font-size:12.8px"><font f=
ace=3D"monospace, monospace">=C2=A0 loop points []</font></span></div></div=
><div><span style=3D"font-size:12.8px"><br></span></div></blockquote><span =
style=3D"font-size:12.8px">This code runs a little bit faster (on 10000 poi=
nts):</span><div><span style=3D"font-size:12.8px"><br></span></div><blockqu=
ote style=3D"margin:0px 0px 0px 40px;border:none;padding:0px"><div><div><sp=
an style=3D"font-size:12.8px"><font face=3D"monospace, monospace">original:=
 14084.1 ms</font></span></div></div><div><div><span style=3D"font-size:12.=
8px"><font face=3D"monospace, monospace">nonmutab: 11309.2 ms</font></span>=
</div></div><div><div style=3D"font-size:12.8px"><br></div></div></blockquo=
te><div style=3D"font-size:12.8px">It computes absolutely the same result, =
using the same computational core, and allocating the same amount of trash.=
 The only difference is that we are not using a mutable state anymore, and =
we are rewarded for that.</div><div style=3D"font-size:12.8px"><br></div><d=
iv style=3D"font-size:12.8px">The next step is to consider, do you really n=
eed to produce these intermediate structures, if the result of your program=
 is the computed data, then you can just store it in a file. Or, if you nee=
d later this data to be reduced to some value, then you shouldn&#39;t creat=
e an intermediate result, and apply the reduction in place (so called de-fo=
resting). So, let&#39;s generalize a little bit the function so that instea=
d of producing a new list, the function just applies a user-provided functi=
on to all produced elements, along with an accumulator.=C2=A0</div><div><sp=
an style=3D"font-size:12.8px"><br></span></div><blockquote style=3D"margin:=
0px 0px 0px 40px;border:none;padding:0px"><div><div><span style=3D"font-siz=
e:12.8px"><font face=3D"monospace, monospace">let nonalloc f acc points =3D=
=C2=A0</font></span></div></div><div><div><span style=3D"font-size:12.8px">=
<font face=3D"monospace, monospace">=C2=A0 let rec loop xs acc =3D match xs=
 with</font></span></div></div><div><div><span style=3D"font-size:12.8px"><=
font face=3D"monospace, monospace">=C2=A0 =C2=A0 | [] -&gt; acc</font></spa=
n></div></div><div><div><span style=3D"font-size:12.8px"><font face=3D"mono=
space, monospace">=C2=A0 =C2=A0 | (x,w) :: xs -&gt;=C2=A0</font></span></di=
v></div><div><div><span style=3D"font-size:12.8px"><font face=3D"monospace,=
 monospace">=C2=A0 =C2=A0 =C2=A0 List.fold_left (fun acc (x&#39;,w&#39;) -&=
gt;=C2=A0</font></span></div></div><div><div><span style=3D"font-size:12.8p=
x"><font face=3D"monospace, monospace">=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 f=
 acc (Vector3.dist x x&#39;) (w *. w&#39;)) acc xs |&gt;=C2=A0</font></span=
></div></div><div><div><span style=3D"font-size:12.8px"><font face=3D"monos=
pace, monospace">=C2=A0 =C2=A0 =C2=A0 loop xs in</font></span></div></div><=
div><div><span style=3D"font-size:12.8px"><font face=3D"monospace, monospac=
e">=C2=A0 loop points acc</font></span></div></div></blockquote><div><div s=
tyle=3D"font-size:12.8px"><br></div><div style=3D"font-size:12.8px"><br></d=
iv><div style=3D"font-size:12.8px">This function will not allocate any unne=
cessary=C2=A0data (it will though allocate a pair of floats per each call).=
 So we can use this function to implement the `nonmutab` version, by passin=
g a consing operator and an empty list. We can also try to use it to store =
data. The data storage process depends on how fast we can reformat the data=
, and how fast is the sink device. If we will output to /dev/null (that is =
known to be fast), then we can use the Marshal module and get about 300 MB/=
s throughtput.=C2=A0Not bad, but still about 10 seconds of running time. If=
, for example, we just need some scalar metrics, then we don&#39;t need to =
pay an overhead of data, and it will be as fast as possible. For example, w=
ith the following implementations of the kernel function</div><div style=3D=
"font-size:12.8px"><br></div></div><blockquote style=3D"margin:0px 0px 0px =
40px;border:none;padding:0px"><div><div><div><span style=3D"font-size:12.8p=
x"><font face=3D"monospace, monospace">let flags =3D [Marshal.No_sharing]=
=C2=A0</font></span></div></div></div><div><div><div><span style=3D"font-si=
ze:12.8px"><font face=3D"monospace, monospace"><br></font></span></div></di=
v></div><div><div><div><span style=3D"font-size:12.8px"><font face=3D"monos=
pace, monospace">let print () x1 x2 =3D</font></span></div></div></div><div=
><div><div><span style=3D"font-size:12.8px"><font face=3D"monospace, monosp=
ace">=C2=A0 Marshal.to_channel stdout x1 flags;</font></span></div></div></=
div><div><div><div><span style=3D"font-size:12.8px"><font face=3D"monospace=
, monospace">=C2=A0 Marshal.to_channel stdout x2 flags</font></span></div><=
/div></div><div><div><div><span style=3D"font-size:12.8px"><font face=3D"mo=
nospace, monospace"><br></font></span></div></div></div><div><div><div><spa=
n style=3D"font-size:12.8px"><font face=3D"monospace, monospace">let produc=
t total x1 x2 =3D total +. x1 *. x2</font></span></div></div></div></blockq=
uote><div><div><div style=3D"font-size:12.8px"><br></div></div><div style=
=3D"font-size:12.8px">We will have the following timings:</div><div style=
=3D"font-size:12.8px"><br></div></div><blockquote style=3D"margin:0px 0px 0=
px 40px;border:none;padding:0px"><div><div><span style=3D"font-size:12.8px"=
><font face=3D"monospace, monospace">printall: 9649.54 ms</font></span></di=
v></div><div><font face=3D"monospace, monospace"><span style=3D"font-size:1=
2.8px">nonalloc: 541.186 ms</span><br></font></div></blockquote><div><div><=
div style=3D"font-size:12.8px"><br></div></div></div><div style=3D"font-siz=
e:12.8px">So, the actual computation took only half a second, the rest is d=
ata processing.</div><div style=3D"font-size:12.8px"><br></div><div style=
=3D"font-size:12.8px">Please find attached the whole example. You can run i=
t with the following command:</div><div style=3D"font-size:12.8px"><br></di=
v><blockquote style=3D"margin:0 0 0 40px;border:none;padding:0px"><div><spa=
n style=3D"font-size:12.8px"><font face=3D"monospace, monospace">ocamlbuild=
 -pkgs vector3,unix points.native -- &gt; /dev/null</font></span></div></bl=
ockquote><div style=3D"font-size:12.8px"><br></div><div style=3D"font-size:=
12.8px">Regards,</div><div style=3D"font-size:12.8px">Ivan Gotovchits</div>=
<div style=3D"font-size:12.8px"><br></div></div><div class=3D"gmail_extra">=
<br><div class=3D"gmail_quote">On Thu, Jun 15, 2017 at 9:09 AM, Ronan Le Hy=
 <span dir=3D"ltr">&lt;<a href=3D"mailto:ronan.lehy@gmail.com" target=3D"_b=
lank">ronan.lehy@gmail.com</a>&gt;</span> wrote:<br><blockquote class=3D"gm=
ail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-le=
ft:1ex">Hi Fran=C3=A7ois,<br>
<span class=3D""><br>
2017-06-15 8:28 GMT+02:00 Francois BERENGER &lt;<a href=3D"mailto:berenger@=
bioreg.kyushu-u.ac.jp">berenger@bioreg.kyushu-u.ac.<wbr>jp</a>&gt;:<br>
&gt; I am wondering if some high performance OCaml experts out there<br>
&gt; can know in advance if some code can go faster by executing it<br>
&gt; on a GPU.<br>
</span><span class=3D"">&gt; I have some clear bottleneck in my program.<br>
&gt; Here is how the code looks like:<br>
&gt; ---<br>
&gt; let f (points: (Vector3.t * float) list) =3D<br>
&gt;=C2=A0 =C2=A0let acc =3D ref [] in<br>
&gt;=C2=A0 =C2=A0let ac p1 x (p2, y) =3D<br>
&gt;=C2=A0 =C2=A0 =C2=A0acc :=3D (Vector3.dist p1 p2, x *. y) :: !acc<br>
&gt;=C2=A0 =C2=A0in<br>
&gt;=C2=A0 =C2=A0let rec loop =3D function<br>
&gt;=C2=A0 =C2=A0 =C2=A0| [] -&gt; ()<br>
&gt;=C2=A0 =C2=A0 =C2=A0| (p1, x) :: xs -&gt;<br>
&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0L.iter (ac p1 x) xs;<br>
&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0loop xs<br>
&gt;=C2=A0 =C2=A0in<br>
&gt;=C2=A0 =C2=A0loop points;<br>
&gt;=C2=A0 =C2=A0!acc<br>
<br>
</span>As a baseline before attempting anything on the GPU, I&#39;d vectori=
ze<br>
this. Put all your vectors in a big matrix. Put all your numbers in a<br>
vector. Compute the distances and the products all at once using<br>
Lacaml (make sure you use OpenBLAS as a backend). I&#39;d expect it to be<b=
r>
much faster than the above loop already.<br>
<span class=3D"HOEnZb"><font color=3D"#888888"><br>
--<br>
Ronan<br>
</font></span><div class=3D"HOEnZb"><div class=3D"h5"><br>
--<br>
Caml-list mailing list.=C2=A0 Subscription management and archives:<br>
<a href=3D"https://sympa.inria.fr/sympa/arc/caml-list" rel=3D"noreferrer" t=
arget=3D"_blank">https://sympa.inria.fr/sympa/<wbr>arc/caml-list</a><br>
Beginner&#39;s list: <a href=3D"http://groups.yahoo.com/group/ocaml_beginne=
rs" rel=3D"noreferrer" target=3D"_blank">http://groups.yahoo.com/group/<wbr=
>ocaml_beginners</a><br>
Bug reports: <a href=3D"http://caml.inria.fr/bin/caml-bugs" rel=3D"noreferr=
er" target=3D"_blank">http://caml.inria.fr/bin/caml-<wbr>bugs</a></div></di=
v></blockquote></div><br></div>

--94eb2c0886586708210551fd44ed--

--94eb2c0886586708250551fd44ef
Content-Type: application/octet-stream; name="points.ml"
Content-Disposition: attachment; filename="points.ml"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_j3yajbyr0

b3BlbiBQcmludGYKCgpsZXQgb3JpZ2luYWwgKHBvaW50czogKFZlY3RvcjMu
dCAqIGZsb2F0KSBsaXN0KSA9CiAgbGV0IGFjYyA9IHJlZiBbXSBpbgogIGxl
dCBhYyBwMSB4IChwMiwgeSkgPQogICAgYWNjIDo9IChWZWN0b3IzLmRpc3Qg
cDEgcDIsIHggKi4geSkgOjogIWFjYwogIGluCiAgbGV0IHJlYyBsb29wID0g
ZnVuY3Rpb24KICAgIHwgW10gLT4gKCkKICAgIHwgKHAxLCB4KSA6OiB4cyAt
PgogICAgICBMaXN0Lml0ZXIgKGFjIHAxIHgpIHhzOwogICAgICBsb29wIHhz
CiAgaW4KICBsb29wIHBvaW50czsKICAhYWNjCgpsZXQgbm9ubXV0YWIgcG9p
bnRzID0gCiAgbGV0IHJlYyBsb29wIHhzIGFjYyA9IG1hdGNoIHhzIHdpdGgK
ICAgIHwgW10gLT4gYWNjCiAgICB8ICh4LHcpIDo6IHhzIC0+IAogICAgICBM
aXN0LmZvbGRfbGVmdCAoZnVuIGFjYyAoeCcsdycpIC0+IAogICAgICAgICAg
KFZlY3RvcjMuZGlzdCB4IHgnLCB3ICouIHcnKSA6OiBhY2MpIGFjYyB4cyB8
PiAKICAgICAgbG9vcCB4cyBpbgogIGxvb3AgcG9pbnRzIFtdCgpsZXQgbm9u
YWxsb2MgZiBhY2MgcG9pbnRzID0gCiAgbGV0IHJlYyBsb29wIHhzIGFjYyA9
IG1hdGNoIHhzIHdpdGgKICAgIHwgW10gLT4gYWNjCiAgICB8ICh4LHcpIDo6
IHhzIC0+IAogICAgICBMaXN0LmZvbGRfbGVmdCAoZnVuIGFjYyAoeCcsdycp
IC0+IAogICAgICAgICAgZiBhY2MgKFZlY3RvcjMuZGlzdCB4IHgnKSAodyAq
LiB3JykpIGFjYyB4cyB8PiAKICAgICAgbG9vcCB4cyBpbgogIGxvb3AgcG9p
bnRzIGFjYwoKCmxldCB0aW1laXQgbmFtZSBmIHggPSAKICBHYy5tYWpvciAo
KTsKICBsZXQgdDAgPSBVbml4LmdldHRpbWVvZmRheSAoKSBpbgogIGxldCBy
ID0gZiB4IGluCiAgbGV0IHQxID0gVW5peC5nZXR0aW1lb2ZkYXkgKCkgaW4K
ICBlcHJpbnRmICIlczogJWcgbXNcbiUhIiBuYW1lICgodDEgLS4gdDApICou
IDEwMDAuKTsKICByCgpsZXQgcmV2X2luaXQgbiBmID0gCiAgbGV0IHJlYyBs
b29wIGkgeHMgPSAKICAgIGlmIGkgPCBuIHRoZW4gbG9vcCAoaSsxKSAoZiBp
IDo6IHhzKSBlbHNlIHhzIGluCiAgbG9vcCAwIFtdCgpsZXQgcmFuZG9tX3Bv
aW50cyBuID0gcmV2X2luaXQgbiAoZnVuIGkgLT4gCiAgICBWZWN0b3IzLm1h
a2UKICAgICAgKFJhbmRvbS5mbG9hdCAxLikKICAgICAgKFJhbmRvbS5mbG9h
dCAxLikKICAgICAgKFJhbmRvbS5mbG9hdCAxLiksCiAgICBSYW5kb20uZmxv
YXQgMS4pCgpsZXQgZmxhZ3MgPSBbTWFyc2hhbC5Ob19zaGFyaW5nXSAKCmxl
dCBwcmludCAoKSB4MSB4MiA9CiAgTWFyc2hhbC50b19jaGFubmVsIHN0ZG91
dCB4MSBmbGFnczsKICBNYXJzaGFsLnRvX2NoYW5uZWwgc3Rkb3V0IHgyIGZs
YWdzCgpsZXQgcHJvZHVjdCB0b3RhbCB4MSB4MiA9IHRvdGFsICsuIHgxICou
IHgyCgoKCmxldCBtYWluICgpID0gCiAgUmFuZG9tLmluaXQgNDI7CiAgbGV0
IHBvaW50cyA9IHJhbmRvbV9wb2ludHMgMTAwMDAgaW4KICBsZXQgX3IxID0g
dGltZWl0ICJvcmlnaW5hbCIgb3JpZ2luYWwgcG9pbnRzIGluCiAgbGV0IF9y
MiA9IHRpbWVpdCAibm9ubXV0YWIiIG5vbm11dGFiIHBvaW50cyBpbgogIGxl
dCAoKSAgPSB0aW1laXQgInByaW50YWxsIiAobm9uYWxsb2MgcHJpbnQgKCkp
IHBvaW50cyBpbgogIGxldCBwcmQgPSB0aW1laXQgIm5vbmFsbG9jIiAobm9u
YWxsb2MgcHJvZHVjdCAwLikgcG9pbnRzIGluCiAgZXByaW50ZiAicHJvZHVj
dCA9ICVnXG4iIHByZAoKCmxldCAoKSA9IG1haW4gKCkK

--94eb2c0886586708250551fd44ef--