From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <berenger@bioreg.kyushu-u.ac.jp>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by sympa.inria.fr (Postfix) with ESMTPS id 4019B80175
	for <caml-list@sympa.inria.fr>; Fri, 16 Jun 2017 11:49:19 +0200 (CEST)
Authentication-Results: mail3-smtp-sop.national.inria.fr; spf=None smtp.pra=berenger@bioreg.kyushu-u.ac.jp; spf=None smtp.mailfrom=berenger@bioreg.kyushu-u.ac.jp; spf=None smtp.helo=postmaster@h4.hosting4.cc.kyushu-u.ac.jp
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  berenger@bioreg.kyushu-u.ac.jp) identity=pra;
  client-ip=133.5.13.5;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="berenger@bioreg.kyushu-u.ac.jp";
  x-sender="berenger@bioreg.kyushu-u.ac.jp";
  x-conformance=sidf_compatible
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  berenger@bioreg.kyushu-u.ac.jp) identity=mailfrom;
  client-ip=133.5.13.5;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="berenger@bioreg.kyushu-u.ac.jp";
  x-sender="berenger@bioreg.kyushu-u.ac.jp";
  x-conformance=sidf_compatible
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  postmaster@h4.hosting4.cc.kyushu-u.ac.jp) identity=helo;
  client-ip=133.5.13.5;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="berenger@bioreg.kyushu-u.ac.jp";
  x-sender="postmaster@h4.hosting4.cc.kyushu-u.ac.jp";
  x-conformance=sidf_compatible
IronPort-PHdr: =?us-ascii?q?9a23=3Ai3pYbhNhLTed3dtpUyAl6mtUPXoX/o7sNwtQ0KIM?=
 =?us-ascii?q?zox0IvT9rarrMEGX3/hxlliBBdydsKMbzbKO+4nbGkU4qa6bt34DdJEeHzQksu?=
 =?us-ascii?q?4x2zIaPcieFEfgJ+TrZSFpVO5LVVti4m3peRMNQJW2aFLduGC94iAPERvjKwV1?=
 =?us-ascii?q?Ov71GonPhMiryuy+4ZPebgFKiTanfb9+MAi9oBnMuMURnYZsMLs6xAHTontPde?=
 =?us-ascii?q?RWxGdoKkyWkh3h+Mq+/4Nt/jpJtf45+MFOTav1f6IjTbxFFzsmKHw65NfqtRbY?=
 =?us-ascii?q?UwSC4GYXX3gMnRpJBwjF6wz6Xov0vyDnuOdxxDWWMMvrRr8zRDqi8rxrSAf2hy?=
 =?us-ascii?q?gbKz43/mbXislqg6JaphKquhhzzoHQbY2QMvd1Y6HTcs4ARWdZQ8hfSSJBDIO/?=
 =?us-ascii?q?YYUBAeUOMuRXoJXyqVYVsRuzBhOhCP/zxjJGhHL727Ax3eQ7EQHB2QwtB9IAsG?=
 =?us-ascii?q?7Oo9XzKKgSVuG1zLLVxjjeYP1YxTjz5pDJfB4uvf+HQLV9ftHPxkk1CQzFiFqQ?=
 =?us-ascii?q?ppL/Pz6OzesNsm+b7/B+WuKgkWInqAFwoiW0xscsl4nFn58Vxkre+ipl2oo1J8?=
 =?us-ascii?q?W4RVd9bNW5HpVQsCSaOJF3QsMkW2xouSA6yqcHuZGhZiQKxo4nywbfavOdc4iI?=
 =?us-ascii?q?5RXjWPyNLjd/gXJofq+0iRWq8UW41+HxWMe53ExKoyZfj9XBuGoB2hzL5sSaRP?=
 =?us-ascii?q?Zw8F2t1DaV2wzN9+1JIlo4mbfUJpMixLM7i4Advl7ZHiDsnUX7lK+WeVsg+uiv?=
 =?us-ascii?q?8+nnYrrrqoWcN49zkQH+LqUumsqwAek3KAQBQ3SU9f6/1Lzj4E35W7VKjuAvnq?=
 =?us-ascii?q?nEqpzVP9gUqrS7Aw9Nyooj6hC/ACm60NkAgHUKIlxIdAiHgoTzJl3DLur0APen?=
 =?us-ascii?q?j1SpijhrxvTGPrP7ApXKK3jOiKzucqhn60FCzgozws5Q54hPB74aIfLzXVXxu8?=
 =?us-ascii?q?LXDhMjMAy1w/vnCM591oMDQG6PH7WVP7nOvlOS5OIvO/GAZJUJtzblN/gl+/nu?=
 =?us-ascii?q?gGclllAHeKmp2YIbaHS5HvR9P0WUemHsg9cEEWcSpAUyVu3qiFuYUT5SfXm+Ra?=
 =?us-ascii?q?w85itoQL6hWIzKQ4TohL2awA+6GIdXbyZIEAOiC3DtIqueUvEHbyOJauVMqBEi?=
 =?us-ascii?q?epWbA9sr2AujsgD30ZJtL+3O9yJetoP+z9hoovCVnBp09yQiXJfV6H2EU2whxj?=
 =?us-ascii?q?BAfDQxxq0q/R1w?=
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0COAQCtqENZlwUNBYVeHAEBBAEBCgEBF?=
 =?us-ascii?q?wEBBAEBCgEBhA8DgQqDa4sLkQiCQIVrjVyCEQErhXgCgygXAQEBAQEBAQEBAQE?=
 =?us-ascii?q?SAQEBAQEGGAZXgjMMglkBAQEBAgEjBAsBBTMDEAsLGAICCRoDAgIfAiQBERMGA?=
 =?us-ascii?q?gEBDooCAw0HEakRgWw6hzQDCisagzsMASWBC4c4K4J4gliBfhSDEoJhBYdFByY?=
 =?us-ascii?q?MiUaLPIEcOx6HEYdEhGeCCFaEcYNLhnSLVYkvIQKBPoEJSYUdJYFcZ4cRgj4BA?=
 =?us-ascii?q?QE?=
X-IPAS-Result: =?us-ascii?q?A0COAQCtqENZlwUNBYVeHAEBBAEBCgEBFwEBBAEBCgEBhA8?=
 =?us-ascii?q?DgQqDa4sLkQiCQIVrjVyCEQErhXgCgygXAQEBAQEBAQEBAQESAQEBAQEGGAZXg?=
 =?us-ascii?q?jMMglkBAQEBAgEjBAsBBTMDEAsLGAICCRoDAgIfAiQBERMGAgEBDooCAw0HEak?=
 =?us-ascii?q?RgWw6hzQDCisagzsMASWBC4c4K4J4gliBfhSDEoJhBYdFByYMiUaLPIEcOx6HE?=
 =?us-ascii?q?YdEhGeCCFaEcYNLhnSLVYkvIQKBPoEJSYUdJYFcZ4cRgj4BAQE?=
X-IronPort-AV: E=Sophos;i="5.39,346,1493676000"; 
   d="scan'208";a="228553664"
Received: from hosting4.cc.kyushu-u.ac.jp (HELO h4.hosting4.cc.kyushu-u.ac.jp) ([133.5.13.5])
  by mail3-smtp-sop.national.inria.fr with ESMTP; 16 Jun 2017 11:49:16 +0200
Received: from [192.168.2.36] (unknown [133.5.218.148])
	(using TLSv1.2 with cipher DHE-RSA-AES128-SHA (128/128 bits))
	(No client certificate requested)
	(Authenticated sender: berenger@bioreg.kyushu-u.ac.jp)
	by h4.hosting4.cc.kyushu-u.ac.jp (hde-lc-postfix) with ESMTPSA id 99CB62AC0C8;
	Fri, 16 Jun 2017 18:49:12 +0900 (JST)
	(envelope-from berenger@bioreg.kyushu-u.ac.jp)
To: caml-list@inria.fr
References: <412f58b7-8a8b-2356-2626-e1bd010be683@bioreg.kyushu-u.ac.jp>
 <CAEc-HQm8ZxOeiMFtAaqo_0ZgPBqrmcfpyX2UxwUX+u0mLnFf4A@mail.gmail.com>
 <CALdWJ+zsKqDm+C74TGO8K6U4JvQO_SPspv41EDk125RQepWRYw@mail.gmail.com>
From: Francois BERENGER <berenger@bioreg.kyushu-u.ac.jp>
Message-ID: <d930e2f3-8b74-39bc-b2d1-f69ad6a9dd33@bioreg.kyushu-u.ac.jp>
Date: Fri, 16 Jun 2017 18:49:12 +0900
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.1.1
MIME-Version: 1.0
In-Reply-To: <CALdWJ+zsKqDm+C74TGO8K6U4JvQO_SPspv41EDk125RQepWRYw@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Subject: Re: [Caml-list] Can this code be accelerated by porting it to SPOC,
 SAREK or MetaOCaml ?

On 06/15/2017 07:38 PM, Ivan Gotovchits wrote:
> Hi François,
> 
> Your code is actually bound by memory, not by computation. The problem 
> is that you create lots of data, and all the time is spent in the 
> allocation functions and GC. The actual computation consumes less than 
> 5% of overall program execution, that can be easily seen by profiling 
> (always profile before starting optimization). Thus moving to GPU will 
> make this 5% of code run faster, at the expense of even higher overhead 
> of moving data between CPU and GPU.

Thanks Ivan for the detailed explanations, the code suggestions and even 
benchmarks !

Thanks to profiling, I knew the code snippet I sent is the bottleneck.

I also saw the whole program was spending 40% of its time doing GC.

I went with your deforesting/noalloc approach.
However, the reduction function cannot avoid doing array updates.

But, the results were amazing already: now, my program runs 4 times
faster than before and the numerical results are unchanged. I am impressed.

Thanks a lot!
F.

> So, let's see how we can optimize your code. First of all, your code 
> uses too much mutable state, that can have a negative performance impact 
> in OCaml due to write barriers. Let's make it a little bit cleaner and 
> faster:
> 
>     let nonmutab points =
>        let rec loop xs acc = match xs with
>          | [] -> acc
>          | (x,w) :: xs ->
>            List.fold_left (fun acc (x',w') ->
>                (Vector3.dist x x', w *. w') :: acc) acc xs |>
>            loop xs in
>        loop points []
> 
> This code runs a little bit faster (on 10000 points):
> 
>     original: 14084.1 ms
>     nonmutab: 11309.2 ms
> 
> It computes absolutely the same result, using the same computational 
> core, and allocating the same amount of trash. The only difference is 
> that we are not using a mutable state anymore, and we are rewarded for that.
> 
> The next step is to consider, do you really need to produce these 
> intermediate structures, if the result of your program is the computed 
> data, then you can just store it in a file. Or, if you need later this 
> data to be reduced to some value, then you shouldn't create an 
> intermediate result, and apply the reduction in place (so called 
> de-foresting). So, let's generalize a little bit the function so that 
> instead of producing a new list, the function just applies a 
> user-provided function to all produced elements, along with an accumulator.
> 
>     let nonalloc f acc points =
>        let rec loop xs acc = match xs with
>          | [] -> acc
>          | (x,w) :: xs ->
>            List.fold_left (fun acc (x',w') ->
>                f acc (Vector3.dist x x') (w *. w')) acc xs |>
>            loop xs in
>        loop points acc
> 
> 
> 
> This function will not allocate any unnecessary data (it will though 
> allocate a pair of floats per each call). So we can use this function to 
> implement the `nonmutab` version, by passing a consing operator and an 
> empty list. We can also try to use it to store data. The data storage 
> process depends on how fast we can reformat the data, and how fast is 
> the sink device. If we will output to /dev/null (that is known to be 
> fast), then we can use the Marshal module and get about 300 MB/s 
> throughtput. Not bad, but still about 10 seconds of running time. If, 
> for example, we just need some scalar metrics, then we don't need to pay 
> an overhead of data, and it will be as fast as possible. For example, 
> with the following implementations of the kernel function
> 
>     let flags = [Marshal.No_sharing]
> 
>     let print () x1 x2 =
>        Marshal.to_channel stdout x1 flags;
>        Marshal.to_channel stdout x2 flags
> 
>     let product total x1 x2 = total +. x1 *. x2
> 
> 
> We will have the following timings:
> 
>     printall: 9649.54 ms
>     nonalloc: 541.186 ms
> 
> 
> So, the actual computation took only half a second, the rest is data 
> processing.
> 
> Please find attached the whole example. You can run it with the 
> following command:
> 
>     ocamlbuild -pkgs vector3,unix points.native -- > /dev/null
> 
> 
> Regards,
> Ivan Gotovchits
> 
> 
> On Thu, Jun 15, 2017 at 9:09 AM, Ronan Le Hy <ronan.lehy@gmail.com 
> <mailto:ronan.lehy@gmail.com>> wrote:
> 
>     Hi François,
> 
>     2017-06-15 8:28 GMT+02:00 Francois BERENGER
>     <berenger@bioreg.kyushu-u.ac.jp
>     <mailto:berenger@bioreg.kyushu-u.ac.jp>>:
>     > I am wondering if some high performance OCaml experts out there
>     > can know in advance if some code can go faster by executing it
>     > on a GPU.
>     > I have some clear bottleneck in my program.
>     > Here is how the code looks like:
>     > ---
>     > let f (points: (Vector3.t * float) list) =
>     >   let acc = ref [] in
>     >   let ac p1 x (p2, y) =
>     >     acc := (Vector3.dist p1 p2, x *. y) :: !acc
>     >   in
>     >   let rec loop = function
>     >     | [] -> ()
>     >     | (p1, x) :: xs ->
>     >       L.iter (ac p1 x) xs;
>     >       loop xs
>     >   in
>     >   loop points;
>     >   !acc
> 
>     As a baseline before attempting anything on the GPU, I'd vectorize
>     this. Put all your vectors in a big matrix. Put all your numbers in a
>     vector. Compute the distances and the products all at once using
>     Lacaml (make sure you use OpenBLAS as a backend). I'd expect it to be
>     much faster than the above loop already.
> 
>     --
>     Ronan
> 
>     --
>     Caml-list mailing list.  Subscription management and archives:
>     https://sympa.inria.fr/sympa/arc/caml-list
>     <https://sympa.inria.fr/sympa/arc/caml-list>
>     Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>     <http://groups.yahoo.com/group/ocaml_beginners>
>     Bug reports: http://caml.inria.fr/bin/caml-bugs
>     <http://caml.inria.fr/bin/caml-bugs>
> 
>