caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Ivan Gotovchits <ivg@ieee.org>
To: Ronan Le Hy <ronan.lehy@gmail.com>
Cc: Francois BERENGER <berenger@bioreg.kyushu-u.ac.jp>,
	OCaml Mailing List <caml-list@inria.fr>
Subject: Re: [Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ?
Date: Thu, 15 Jun 2017 12:38:18 +0200	[thread overview]
Message-ID: <CALdWJ+zsKqDm+C74TGO8K6U4JvQO_SPspv41EDk125RQepWRYw@mail.gmail.com> (raw)
In-Reply-To: <CAEc-HQm8ZxOeiMFtAaqo_0ZgPBqrmcfpyX2UxwUX+u0mLnFf4A@mail.gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 4748 bytes --]

Hi François,

Your code is actually bound by memory, not by computation. The problem is
that you create lots of data, and all the time is spent in the allocation
functions and GC. The actual computation consumes less than 5% of overall
program execution, that can be easily seen by profiling (always profile
before starting optimization). Thus moving to GPU will make this 5% of code
run faster, at the expense of even higher overhead of moving data between
CPU and GPU.

So, let's see how we can optimize your code. First of all, your code uses
too much mutable state, that can have a negative performance impact in
OCaml due to write barriers. Let's make it a little bit cleaner and faster:


let nonmutab points =
  let rec loop xs acc = match xs with
    | [] -> acc
    | (x,w) :: xs ->
      List.fold_left (fun acc (x',w') ->
          (Vector3.dist x x', w *. w') :: acc) acc xs |>
      loop xs in
  loop points []

This code runs a little bit faster (on 10000 points):

original: 14084.1 ms
nonmutab: 11309.2 ms

It computes absolutely the same result, using the same computational core,
and allocating the same amount of trash. The only difference is that we are
not using a mutable state anymore, and we are rewarded for that.

The next step is to consider, do you really need to produce these
intermediate structures, if the result of your program is the computed
data, then you can just store it in a file. Or, if you need later this data
to be reduced to some value, then you shouldn't create an intermediate
result, and apply the reduction in place (so called de-foresting). So,
let's generalize a little bit the function so that instead of producing a
new list, the function just applies a user-provided function to all
produced elements, along with an accumulator.

let nonalloc f acc points =
  let rec loop xs acc = match xs with
    | [] -> acc
    | (x,w) :: xs ->
      List.fold_left (fun acc (x',w') ->
          f acc (Vector3.dist x x') (w *. w')) acc xs |>
      loop xs in
  loop points acc



This function will not allocate any unnecessary data (it will though
allocate a pair of floats per each call). So we can use this function to
implement the `nonmutab` version, by passing a consing operator and an
empty list. We can also try to use it to store data. The data storage
process depends on how fast we can reformat the data, and how fast is the
sink device. If we will output to /dev/null (that is known to be fast),
then we can use the Marshal module and get about 300 MB/s throughtput. Not
bad, but still about 10 seconds of running time. If, for example, we just
need some scalar metrics, then we don't need to pay an overhead of data,
and it will be as fast as possible. For example, with the following
implementations of the kernel function

let flags = [Marshal.No_sharing]

let print () x1 x2 =
  Marshal.to_channel stdout x1 flags;
  Marshal.to_channel stdout x2 flags

let product total x1 x2 = total +. x1 *. x2


We will have the following timings:

printall: 9649.54 ms
nonalloc: 541.186 ms


So, the actual computation took only half a second, the rest is data
processing.

Please find attached the whole example. You can run it with the following
command:

ocamlbuild -pkgs vector3,unix points.native -- > /dev/null


Regards,
Ivan Gotovchits


On Thu, Jun 15, 2017 at 9:09 AM, Ronan Le Hy <ronan.lehy@gmail.com> wrote:

> Hi François,
>
> 2017-06-15 8:28 GMT+02:00 Francois BERENGER <berenger@bioreg.kyushu-u.ac.
> jp>:
> > I am wondering if some high performance OCaml experts out there
> > can know in advance if some code can go faster by executing it
> > on a GPU.
> > I have some clear bottleneck in my program.
> > Here is how the code looks like:
> > ---
> > let f (points: (Vector3.t * float) list) =
> >   let acc = ref [] in
> >   let ac p1 x (p2, y) =
> >     acc := (Vector3.dist p1 p2, x *. y) :: !acc
> >   in
> >   let rec loop = function
> >     | [] -> ()
> >     | (p1, x) :: xs ->
> >       L.iter (ac p1 x) xs;
> >       loop xs
> >   in
> >   loop points;
> >   !acc
>
> As a baseline before attempting anything on the GPU, I'd vectorize
> this. Put all your vectors in a big matrix. Put all your numbers in a
> vector. Compute the distances and the products all at once using
> Lacaml (make sure you use OpenBLAS as a backend). I'd expect it to be
> much faster than the above loop already.
>
> --
> Ronan
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

[-- Attachment #1.2: Type: text/html, Size: 10399 bytes --]

[-- Attachment #2: points.ml --]
[-- Type: application/octet-stream, Size: 1743 bytes --]

open Printf


let original (points: (Vector3.t * float) list) =
  let acc = ref [] in
  let ac p1 x (p2, y) =
    acc := (Vector3.dist p1 p2, x *. y) :: !acc
  in
  let rec loop = function
    | [] -> ()
    | (p1, x) :: xs ->
      List.iter (ac p1 x) xs;
      loop xs
  in
  loop points;
  !acc

let nonmutab points = 
  let rec loop xs acc = match xs with
    | [] -> acc
    | (x,w) :: xs -> 
      List.fold_left (fun acc (x',w') -> 
          (Vector3.dist x x', w *. w') :: acc) acc xs |> 
      loop xs in
  loop points []

let nonalloc f acc points = 
  let rec loop xs acc = match xs with
    | [] -> acc
    | (x,w) :: xs -> 
      List.fold_left (fun acc (x',w') -> 
          f acc (Vector3.dist x x') (w *. w')) acc xs |> 
      loop xs in
  loop points acc


let timeit name f x = 
  Gc.major ();
  let t0 = Unix.gettimeofday () in
  let r = f x in
  let t1 = Unix.gettimeofday () in
  eprintf "%s: %g ms\n%!" name ((t1 -. t0) *. 1000.);
  r

let rev_init n f = 
  let rec loop i xs = 
    if i < n then loop (i+1) (f i :: xs) else xs in
  loop 0 []

let random_points n = rev_init n (fun i -> 
    Vector3.make
      (Random.float 1.)
      (Random.float 1.)
      (Random.float 1.),
    Random.float 1.)

let flags = [Marshal.No_sharing] 

let print () x1 x2 =
  Marshal.to_channel stdout x1 flags;
  Marshal.to_channel stdout x2 flags

let product total x1 x2 = total +. x1 *. x2



let main () = 
  Random.init 42;
  let points = random_points 10000 in
  let _r1 = timeit "original" original points in
  let _r2 = timeit "nonmutab" nonmutab points in
  let ()  = timeit "printall" (nonalloc print ()) points in
  let prd = timeit "nonalloc" (nonalloc product 0.) points in
  eprintf "product = %g\n" prd


let () = main ()

  reply	other threads:[~2017-06-15 10:38 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-15  6:28 Francois BERENGER
2017-06-15  7:09 ` Ronan Le Hy
2017-06-15 10:38   ` Ivan Gotovchits [this message]
2017-06-16  9:49     ` Francois BERENGER

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CALdWJ+zsKqDm+C74TGO8K6U4JvQO_SPspv41EDk125RQepWRYw@mail.gmail.com \
    --to=ivg@ieee.org \
    --cc=berenger@bioreg.kyushu-u.ac.jp \
    --cc=caml-list@inria.fr \
    --cc=ronan.lehy@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).