caml-list - the Caml user's mailing list
From: Francois BERENGER <berenger@bioreg.kyushu-u.ac.jp>
To: caml-list@inria.fr
Subject: Re: [Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ?
Date: Fri, 16 Jun 2017 18:49:12 +0900	[thread overview]
Message-ID: <d930e2f3-8b74-39bc-b2d1-f69ad6a9dd33@bioreg.kyushu-u.ac.jp> (raw)
In-Reply-To: <CALdWJ+zsKqDm+C74TGO8K6U4JvQO_SPspv41EDk125RQepWRYw@mail.gmail.com>

On 06/15/2017 07:38 PM, Ivan Gotovchits wrote:
> Hi François,
> 
> Your code is actually bound by memory, not by computation. The problem 
> is that you create lots of data, and all the time is spent in the 
> allocation functions and the GC. The actual computation consumes less 
> than 5% of the overall program execution, as can easily be seen by 
> profiling (always profile before starting to optimize). Thus, moving to 
> a GPU would make that 5% of the code run faster, at the expense of the 
> even higher overhead of moving data between the CPU and the GPU.

Thanks, Ivan, for the detailed explanations, the code suggestions, and 
even the benchmarks!

Thanks to profiling, I already knew the code snippet I sent was the bottleneck.

I also saw the whole program was spending 40% of its time doing GC.
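Out of curiosity, here is roughly how one can watch allocation pressure
from inside the program by sampling `Gc.quick_stat` around a hot call
(just a sketch, not the profiler I actually used; the workload below is
a placeholder):

```ocaml
(* Sketch: report words allocated and major collections triggered by [f x].
   [Gc.quick_stat] is cheap, so it is safe to sample around hot calls. *)
let with_alloc_report name f x =
  let s0 = Gc.quick_stat () in
  let r = f x in
  let s1 = Gc.quick_stat () in
  Printf.printf "%s: %.0f minor words, %d major collections\n"
    name
    (s1.Gc.minor_words -. s0.Gc.minor_words)
    (s1.Gc.major_collections - s0.Gc.major_collections);
  r

let () =
  (* placeholder workload: build and discard a list *)
  ignore (with_alloc_report "demo" (fun n -> List.init n (fun i -> i)) 100_000)
```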

I went with your deforestation/noalloc approach.
However, my reduction function cannot avoid doing array updates.
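For the record, a kernel of that shape, e.g. one that bins weighted
distances into a fixed array, looks something like this (purely
illustrative; `nbins` and `dx` are hypothetical parameters, not my
actual settings):

```ocaml
(* Illustrative reduction kernel: accumulate weights into a histogram of
   distances.  The array is updated in place, so the kernel itself
   allocates nothing; [nbins] (bin count) and [dx] (bin width) are
   hypothetical parameters. *)
let histo_kernel ~nbins ~dx =
  fun (acc : float array) dist w ->
    let i = max 0 (min (nbins - 1) (int_of_float (dist /. dx))) in
    acc.(i) <- acc.(i) +. w;
    acc

(* Used with Ivan's [nonalloc] loop, the accumulator is the histogram
   array itself:
     nonalloc (histo_kernel ~nbins:100 ~dx:0.5) (Array.make 100 0.) points *)
```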

But, the results were amazing already: now, my program runs 4 times
faster than before and the numerical results are unchanged. I am impressed.

Thanks a lot!
F.

> So, let's see how we can optimize your code. First of all, it uses 
> too much mutable state, which can have a negative performance impact 
> in OCaml due to write barriers. Let's make it a little bit cleaner and 
> faster:
> 
>     let nonmutab points =
>        let rec loop xs acc = match xs with
>          | [] -> acc
>          | (x,w) :: xs ->
>            List.fold_left (fun acc (x',w') ->
>                (Vector3.dist x x', w *. w') :: acc) acc xs |>
>            loop xs in
>        loop points []
> 
> This code runs a little bit faster (on 10000 points):
> 
>     original: 14084.1 ms
>     nonmutab: 11309.2 ms
> 
> It computes exactly the same result, using the same computational 
> core and allocating the same amount of garbage. The only difference is 
> that we are no longer using mutable state, and we are rewarded for that.
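> Timings like these come from a simple wall-clock harness; a generic
> sketch (assuming the `unix` library, and not necessarily the exact
> harness used for the numbers above):

```ocaml
(* Generic timing harness: run [f x] and print the elapsed wall-clock
   time in milliseconds, then return the result. *)
let time label f x =
  let t0 = Unix.gettimeofday () in
  let r = f x in
  Printf.printf "%s: %g ms\n" label ((Unix.gettimeofday () -. t0) *. 1000.);
  r
```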
> 
> The next step is to consider whether you really need to produce these 
> intermediate structures at all. If the result of your program is the 
> computed data itself, then you can just store it in a file. Or, if you 
> later need this data reduced to some value, then you shouldn't create 
> an intermediate result, but should apply the reduction in place 
> (so-called deforestation). So, let's generalize the function a little 
> bit: instead of producing a new list, it will just apply a 
> user-provided function to every produced element, along with an accumulator.
> 
>     let nonalloc f acc points =
>        let rec loop xs acc = match xs with
>          | [] -> acc
>          | (x,w) :: xs ->
>            List.fold_left (fun acc (x',w') ->
>                f acc (Vector3.dist x x') (w *. w')) acc xs |>
>            loop xs in
>        loop points acc
> 
> 
> 
> This function will not allocate any unnecessary data (though it will 
> still allocate a pair of floats per call). So we can use this function 
> to implement the `nonmutab` version, by passing a consing operator and 
> an empty list. We can also try to use it to store data. The cost of 
> storing the data depends on how fast we can reformat it, and how fast 
> the sink device is. If we output to /dev/null (which is known to be 
> fast), then we can use the Marshal module and get about 300 MB/s of 
> throughput. Not bad, but that is still about 10 seconds of running 
> time. If, for example, we just need some scalar metric, then we don't 
> need to pay the overhead of data serialization, and it will be as fast 
> as possible. For example, with the following implementations of the 
> kernel function
> 
>     let flags = [Marshal.No_sharing]
> 
>     let print () x1 x2 =
>        Marshal.to_channel stdout x1 flags;
>        Marshal.to_channel stdout x2 flags
> 
>     let product total x1 x2 = total +. x1 *. x2
> 
> 
> We will have the following timings:
> 
>     printall: 9649.54 ms
>     nonalloc: 541.186 ms
> 
> 
> So, the actual computation took only half a second; the rest is data 
> processing.
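> As mentioned above, `nonmutab` itself falls out of `nonalloc` by
> passing a consing kernel and an empty list. Schematically (with
> `nonalloc` repeated, but the distance function abstracted out so the
> sketch is self-contained; the original uses Vector3.dist):

```ocaml
(* Repeated from above, with the distance function taken as a parameter
   so this sketch does not depend on the vector3 library. *)
let nonalloc dist f acc points =
  let rec loop xs acc = match xs with
    | [] -> acc
    | (x, w) :: xs ->
      List.fold_left (fun acc (x', w') ->
          f acc (dist x x') (w *. w')) acc xs |>
      loop xs in
  loop points acc

(* [nonmutab] recovered from [nonalloc]: the kernel conses each
   (distance, weight) pair onto the accumulator list. *)
let nonmutab dist points =
  nonalloc dist (fun acc d w -> (d, w) :: acc) [] points
```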
> 
> Please find attached the whole example. You can run it with the 
> following command:
> 
>     ocamlbuild -pkgs vector3,unix points.native -- > /dev/null
> 
> 
> Regards,
> Ivan Gotovchits
> 
> 
> On Thu, Jun 15, 2017 at 9:09 AM, Ronan Le Hy <ronan.lehy@gmail.com 
> <mailto:ronan.lehy@gmail.com>> wrote:
> 
>     Hi François,
> 
>     2017-06-15 8:28 GMT+02:00 Francois BERENGER
>     <berenger@bioreg.kyushu-u.ac.jp
>     <mailto:berenger@bioreg.kyushu-u.ac.jp>>:
>     > I am wondering if some high-performance OCaml experts out there
>     > can tell in advance whether some code would go faster by executing it
>     > on a GPU.
>     > I have a clear bottleneck in my program.
>     > Here is what the code looks like:
>     > ---
>     > let f (points: (Vector3.t * float) list) =
>     >   let acc = ref [] in
>     >   let ac p1 x (p2, y) =
>     >     acc := (Vector3.dist p1 p2, x *. y) :: !acc
>     >   in
>     >   let rec loop = function
>     >     | [] -> ()
>     >     | (p1, x) :: xs ->
>     >       List.iter (ac p1 x) xs;
>     >       loop xs
>     >   in
>     >   loop points;
>     >   !acc
> 
>     As a baseline before attempting anything on the GPU, I'd vectorize
>     this. Put all your vectors in a big matrix. Put all your numbers in a
>     vector. Compute the distances and the products all at once using
>     Lacaml (make sure you use OpenBLAS as a backend). I'd expect it to be
>     much faster than the above loop already.
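>     The identity behind that vectorization is
>     |pi - pj|^2 = |pi|^2 + |pj|^2 - 2 (pi . pj), so one matrix product
>     yields all pairwise squared distances at once. A library-free
>     sketch of the idea on plain float arrays (a hypothetical helper;
>     Lacaml/BLAS would replace the loops below with a single gemm on
>     the points matrix):

```ocaml
(* All pairwise squared distances via the dot-product identity:
   d2.(i).(j) = |pi|^2 + |pj|^2 - 2 * (pi . pj).
   Points are the rows of [p]; a BLAS backend would compute the Gram
   matrix p * p^T with one gemm call instead of the inner loops. *)
let pairwise_sq_dists (p : float array array) =
  let n = Array.length p in
  let dot a b =
    let s = ref 0. in
    Array.iteri (fun k ak -> s := !s +. ak *. b.(k)) a;
    !s
  in
  let norm2 = Array.map (fun row -> dot row row) p in
  Array.init n (fun i ->
      Array.init n (fun j ->
          norm2.(i) +. norm2.(j) -. 2. *. dot p.(i) p.(j)))
```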
> 
>     --
>     Ronan
> 
>     --
>     Caml-list mailing list.  Subscription management and archives:
>     https://sympa.inria.fr/sympa/arc/caml-list
>     <https://sympa.inria.fr/sympa/arc/caml-list>
>     Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>     <http://groups.yahoo.com/group/ocaml_beginners>
>     Bug reports: http://caml.inria.fr/bin/caml-bugs
>     <http://caml.inria.fr/bin/caml-bugs>
> 
> 

Thread overview: 4+ messages
2017-06-15  6:28 Francois BERENGER
2017-06-15  7:09 ` Ronan Le Hy
2017-06-15 10:38   ` Ivan Gotovchits
2017-06-16  9:49     ` Francois BERENGER [this message]
