From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id 4019B80175 for ; Fri, 16 Jun 2017 11:49:19 +0200 (CEST) Authentication-Results: mail3-smtp-sop.national.inria.fr; spf=None smtp.pra=berenger@bioreg.kyushu-u.ac.jp; spf=None smtp.mailfrom=berenger@bioreg.kyushu-u.ac.jp; spf=None smtp.helo=postmaster@h4.hosting4.cc.kyushu-u.ac.jp Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of berenger@bioreg.kyushu-u.ac.jp) identity=pra; client-ip=133.5.13.5; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="berenger@bioreg.kyushu-u.ac.jp"; x-sender="berenger@bioreg.kyushu-u.ac.jp"; x-conformance=sidf_compatible Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of berenger@bioreg.kyushu-u.ac.jp) identity=mailfrom; client-ip=133.5.13.5; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="berenger@bioreg.kyushu-u.ac.jp"; x-sender="berenger@bioreg.kyushu-u.ac.jp"; x-conformance=sidf_compatible Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of postmaster@h4.hosting4.cc.kyushu-u.ac.jp) identity=helo; client-ip=133.5.13.5; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="berenger@bioreg.kyushu-u.ac.jp"; x-sender="postmaster@h4.hosting4.cc.kyushu-u.ac.jp"; x-conformance=sidf_compatible IronPort-PHdr: =?us-ascii?q?9a23=3Ai3pYbhNhLTed3dtpUyAl6mtUPXoX/o7sNwtQ0KIM?= =?us-ascii?q?zox0IvT9rarrMEGX3/hxlliBBdydsKMbzbKO+4nbGkU4qa6bt34DdJEeHzQksu?= =?us-ascii?q?4x2zIaPcieFEfgJ+TrZSFpVO5LVVti4m3peRMNQJW2aFLduGC94iAPERvjKwV1?= =?us-ascii?q?Ov71GonPhMiryuy+4ZPebgFKiTanfb9+MAi9oBnMuMURnYZsMLs6xAHTontPde?= =?us-ascii?q?RWxGdoKkyWkh3h+Mq+/4Nt/jpJtf45+MFOTav1f6IjTbxFFzsmKHw65NfqtRbY?= =?us-ascii?q?UwSC4GYXX3gMnRpJBwjF6wz6Xov0vyDnuOdxxDWWMMvrRr8zRDqi8rxrSAf2hy?= =?us-ascii?q?gbKz43/mbXislqg6JaphKquhhzzoHQbY2QMvd1Y6HTcs4ARWdZQ8hfSSJBDIO/?= =?us-ascii?q?YYUBAeUOMuRXoJXyqVYVsRuzBhOhCP/zxjJGhHL727Ax3eQ7EQHB2QwtB9IAsG?= =?us-ascii?q?7Oo9XzKKgSVuG1zLLVxjjeYP1YxTjz5pDJfB4uvf+HQLV9ftHPxkk1CQzFiFqQ?= =?us-ascii?q?ppL/Pz6OzesNsm+b7/B+WuKgkWInqAFwoiW0xscsl4nFn58Vxkre+ipl2oo1J8?= =?us-ascii?q?W4RVd9bNW5HpVQsCSaOJF3QsMkW2xouSA6yqcHuZGhZiQKxo4nywbfavOdc4iI?= =?us-ascii?q?5RXjWPyNLjd/gXJofq+0iRWq8UW41+HxWMe53ExKoyZfj9XBuGoB2hzL5sSaRP?= =?us-ascii?q?Zw8F2t1DaV2wzN9+1JIlo4mbfUJpMixLM7i4Advl7ZHiDsnUX7lK+WeVsg+uiv?= =?us-ascii?q?8+nnYrrrqoWcN49zkQH+LqUumsqwAek3KAQBQ3SU9f6/1Lzj4E35W7VKjuAvnq?= =?us-ascii?q?nEqpzVP9gUqrS7Aw9Nyooj6hC/ACm60NkAgHUKIlxIdAiHgoTzJl3DLur0APen?= =?us-ascii?q?j1SpijhrxvTGPrP7ApXKK3jOiKzucqhn60FCzgozws5Q54hPB74aIfLzXVXxu8?= =?us-ascii?q?LXDhMjMAy1w/vnCM591oMDQG6PH7WVP7nOvlOS5OIvO/GAZJUJtzblN/gl+/nu?= =?us-ascii?q?gGclllAHeKmp2YIbaHS5HvR9P0WUemHsg9cEEWcSpAUyVu3qiFuYUT5SfXm+Ra?= =?us-ascii?q?w85itoQL6hWIzKQ4TohL2awA+6GIdXbyZIEAOiC3DtIqueUvEHbyOJauVMqBEi?= =?us-ascii?q?epWbA9sr2AujsgD30ZJtL+3O9yJetoP+z9hoovCVnBp09yQiXJfV6H2EU2whxj?= =?us-ascii?q?BAfDQxxq0q/R1w?= X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0COAQCtqENZlwUNBYVeHAEBBAEBCgEBF?= =?us-ascii?q?wEBBAEBCgEBhA8DgQqDa4sLkQiCQIVrjVyCEQErhXgCgygXAQEBAQEBAQEBAQE?= =?us-ascii?q?SAQEBAQEGGAZXgjMMglkBAQEBAgEjBAsBBTMDEAsLGAICCRoDAgIfAiQBERMGA?= =?us-ascii?q?gEBDooCAw0HEakRgWw6hzQDCisagzsMASWBC4c4K4J4gliBfhSDEoJhBYdFByY?= =?us-ascii?q?MiUaLPIEcOx6HEYdEhGeCCFaEcYNLhnSLVYkvIQKBPoEJSYUdJYFcZ4cRgj4BA?= =?us-ascii?q?QE?= X-IPAS-Result: =?us-ascii?q?A0COAQCtqENZlwUNBYVeHAEBBAEBCgEBFwEBBAEBCgEBhA8?= =?us-ascii?q?DgQqDa4sLkQiCQIVrjVyCEQErhXgCgygXAQEBAQEBAQEBAQESAQEBAQEGGAZXg?= =?us-ascii?q?jMMglkBAQEBAgEjBAsBBTMDEAsLGAICCRoDAgIfAiQBERMGAgEBDooCAw0HEak?= =?us-ascii?q?RgWw6hzQDCisagzsMASWBC4c4K4J4gliBfhSDEoJhBYdFByYMiUaLPIEcOx6HE?= =?us-ascii?q?YdEhGeCCFaEcYNLhnSLVYkvIQKBPoEJSYUdJYFcZ4cRgj4BAQE?= X-IronPort-AV: E=Sophos;i="5.39,346,1493676000"; d="scan'208";a="228553664" Received: from hosting4.cc.kyushu-u.ac.jp (HELO h4.hosting4.cc.kyushu-u.ac.jp) ([133.5.13.5]) by mail3-smtp-sop.national.inria.fr with ESMTP; 16 Jun 2017 11:49:16 +0200 Received: from [192.168.2.36] (unknown [133.5.218.148]) (using TLSv1.2 with cipher DHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: berenger@bioreg.kyushu-u.ac.jp) by h4.hosting4.cc.kyushu-u.ac.jp (hde-lc-postfix) with ESMTPSA id 99CB62AC0C8; Fri, 16 Jun 2017 18:49:12 +0900 (JST) (envelope-from berenger@bioreg.kyushu-u.ac.jp) To: caml-list@inria.fr References: <412f58b7-8a8b-2356-2626-e1bd010be683@bioreg.kyushu-u.ac.jp> From: Francois BERENGER Message-ID: Date: Fri, 16 Jun 2017 18:49:12 +0900 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit Subject: Re: [Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ? On 06/15/2017 07:38 PM, Ivan Gotovchits wrote: > Hi François, > > Your code is actually bound by memory, not by computation. The problem > is that you create lots of data, and all the time is spent in the > allocation functions and GC. The actual computation consumes less than > 5% of overall program execution, that can be easily seen by profiling > (always profile before starting optimization). Thus moving to GPU will > make this 5% of code run faster, at the expense of even higher overhead > of moving data between CPU and GPU. Thanks Ivan for the detailed explanations, the code suggestions and even benchmarks ! Thanks to profiling, I knew the code snippet I sent is the bottleneck. I also saw the whole program was spending 40% of its time doing GC. I went with your deforesting/noalloc approach. However, the reduction function cannot avoid doing array updates. But, the results were amazing already: now, my program runs 4 times faster than before and the numerical results are unchanged. I am impressed. Thanks a lot! F. > So, let's see how we can optimize your code. First of all, your code > uses too much mutable state, that can have a negative performance impact > in OCaml due to write barriers. Let's make it a little bit cleaner and > faster: > > let nonmutab points = > let rec loop xs acc = match xs with > | [] -> acc > | (x,w) :: xs -> > List.fold_left (fun acc (x',w') -> > (Vector3.dist x x', w *. w') :: acc) acc xs |> > loop xs in > loop points [] > > This code runs a little bit faster (on 10000 points): > > original: 14084.1 ms > nonmutab: 11309.2 ms > > It computes absolutely the same result, using the same computational > core, and allocating the same amount of trash. The only difference is > that we are not using a mutable state anymore, and we are rewarded for that. > > The next step is to consider, do you really need to produce these > intermediate structures, if the result of your program is the computed > data, then you can just store it in a file. Or, if you need later this > data to be reduced to some value, then you shouldn't create an > intermediate result, and apply the reduction in place (so called > de-foresting). So, let's generalize a little bit the function so that > instead of producing a new list, the function just applies a > user-provided function to all produced elements, along with an accumulator. > > let nonalloc f acc points = > let rec loop xs acc = match xs with > | [] -> acc > | (x,w) :: xs -> > List.fold_left (fun acc (x',w') -> > f acc (Vector3.dist x x') (w *. w')) acc xs |> > loop xs in > loop points acc > > > > This function will not allocate any unnecessary data (it will though > allocate a pair of floats per each call). So we can use this function to > implement the `nonmutab` version, by passing a consing operator and an > empty list. We can also try to use it to store data. The data storage > process depends on how fast we can reformat the data, and how fast is > the sink device. If we will output to /dev/null (that is known to be > fast), then we can use the Marshal module and get about 300 MB/s > throughtput. Not bad, but still about 10 seconds of running time. If, > for example, we just need some scalar metrics, then we don't need to pay > an overhead of data, and it will be as fast as possible. For example, > with the following implementations of the kernel function > > let flags = [Marshal.No_sharing] > > let print () x1 x2 = > Marshal.to_channel stdout x1 flags; > Marshal.to_channel stdout x2 flags > > let product total x1 x2 = total +. x1 *. x2 > > > We will have the following timings: > > printall: 9649.54 ms > nonalloc: 541.186 ms > > > So, the actual computation took only half a second, the rest is data > processing. > > Please find attached the whole example. You can run it with the > following command: > > ocamlbuild -pkgs vector3,unix points.native -- > /dev/null > > > Regards, > Ivan Gotovchits > > > On Thu, Jun 15, 2017 at 9:09 AM, Ronan Le Hy > wrote: > > Hi François, > > 2017-06-15 8:28 GMT+02:00 Francois BERENGER > >: > > I am wondering if some high performance OCaml experts out there > > can know in advance if some code can go faster by executing it > > on a GPU. > > I have some clear bottleneck in my program. > > Here is how the code looks like: > > --- > > let f (points: (Vector3.t * float) list) = > > let acc = ref [] in > > let ac p1 x (p2, y) = > > acc := (Vector3.dist p1 p2, x *. y) :: !acc > > in > > let rec loop = function > > | [] -> () > > | (p1, x) :: xs -> > > L.iter (ac p1 x) xs; > > loop xs > > in > > loop points; > > !acc > > As a baseline before attempting anything on the GPU, I'd vectorize > this. Put all your vectors in a big matrix. Put all your numbers in a > vector. Compute the distances and the products all at once using > Lacaml (make sure you use OpenBLAS as a backend). I'd expect it to be > much faster than the above loop already. > > -- > Ronan > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa.inria.fr/sympa/arc/caml-list > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/caml-bugs > > >