[Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ?

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

* [Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ?
@ 2017-06-15  6:28 Francois BERENGER
  2017-06-15  7:09 ` Ronan Le Hy
  0 siblings, 1 reply; 4+ messages in thread
From: Francois BERENGER @ 2017-06-15  6:28 UTC (permalink / raw)
  To: caml-list

Dear OCaml hackers,

I am wondering if some high performance OCaml experts out there
can know in advance if some code can go faster by executing it
on a GPU.

I'd like to get some feeling before I invest into some new technology.

I'd like to stay in OCaml, not drop down to C.
So I mention SPOC, SAREK and MetaOCaml which look like the right 
technologies.

I have some clear bottleneck in my program.
Here is how the code looks like:

---
let f (points: (Vector3.t * float) list) =
   let acc = ref [] in
   let ac p1 x (p2, y) =
     acc := (Vector3.dist p1 p2, x *. y) :: !acc
   in
   let rec loop = function
     | [] -> ()
     | (p1, x) :: xs ->
       L.iter (ac p1 x) xs;
       loop xs
   in
   loop points;
   !acc
---

So, in terms of size: each point is 3 floats (a 3D coordinate) plus 1 
float (some value).
I usually have several thousands of points, but less than 10_000.

The f function will be called thousands of times for one run of the 
program (let's say 300k calls is a big but reasonable use case).

I don't care about the order of points in the input list.

I also don't care about the order of the results in acc.

I very probably don't care about using single point precision (32 bit 
floats) for everything, instead of double precision.

Thanks a lot for any feedback.

Best regards,
Francois.

PS: vector3 is available in opam, if that helps

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ?
  2017-06-15  6:28 [Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ? Francois BERENGER
@ 2017-06-15  7:09 ` Ronan Le Hy
  2017-06-15 10:38   ` Ivan Gotovchits
  0 siblings, 1 reply; 4+ messages in thread
From: Ronan Le Hy @ 2017-06-15  7:09 UTC (permalink / raw)
  To: Francois BERENGER; +Cc: OCaml Mailing List

Hi François,

2017-06-15 8:28 GMT+02:00 Francois BERENGER <berenger@bioreg.kyushu-u.ac.jp>:
> I am wondering if some high performance OCaml experts out there
> can know in advance if some code can go faster by executing it
> on a GPU.
> I have some clear bottleneck in my program.
> Here is how the code looks like:
> ---
> let f (points: (Vector3.t * float) list) =
>   let acc = ref [] in
>   let ac p1 x (p2, y) =
>     acc := (Vector3.dist p1 p2, x *. y) :: !acc
>   in
>   let rec loop = function
>     | [] -> ()
>     | (p1, x) :: xs ->
>       L.iter (ac p1 x) xs;
>       loop xs
>   in
>   loop points;
>   !acc

As a baseline before attempting anything on the GPU, I'd vectorize
this. Put all your vectors in a big matrix. Put all your numbers in a
vector. Compute the distances and the products all at once using
Lacaml (make sure you use OpenBLAS as a backend). I'd expect it to be
much faster than the above loop already.

-- 
Ronan

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ?
  2017-06-15  7:09 ` Ronan Le Hy
@ 2017-06-15 10:38   ` Ivan Gotovchits
  2017-06-16  9:49     ` Francois BERENGER
  0 siblings, 1 reply; 4+ messages in thread
From: Ivan Gotovchits @ 2017-06-15 10:38 UTC (permalink / raw)
  To: Ronan Le Hy; +Cc: Francois BERENGER, OCaml Mailing List

[-- Attachment #1.1: Type: text/plain, Size: 4748 bytes --]

Hi François,

Your code is actually bound by memory, not by computation. The problem is
that you create lots of data, and all the time is spent in the allocation
functions and GC. The actual computation consumes less than 5% of overall
program execution, that can be easily seen by profiling (always profile
before starting optimization). Thus moving to GPU will make this 5% of code
run faster, at the expense of even higher overhead of moving data between
CPU and GPU.

So, let's see how we can optimize your code. First of all, your code uses
too much mutable state, that can have a negative performance impact in
OCaml due to write barriers. Let's make it a little bit cleaner and faster:

let nonmutab points =
  let rec loop xs acc = match xs with
    | [] -> acc
    | (x,w) :: xs ->
      List.fold_left (fun acc (x',w') ->
          (Vector3.dist x x', w *. w') :: acc) acc xs |>
      loop xs in
  loop points []

This code runs a little bit faster (on 10000 points):

original: 14084.1 ms
nonmutab: 11309.2 ms

It computes absolutely the same result, using the same computational core,
and allocating the same amount of trash. The only difference is that we are
not using a mutable state anymore, and we are rewarded for that.

The next step is to consider, do you really need to produce these
intermediate structures, if the result of your program is the computed
data, then you can just store it in a file. Or, if you need later this data
to be reduced to some value, then you shouldn't create an intermediate
result, and apply the reduction in place (so called de-foresting). So,
let's generalize a little bit the function so that instead of producing a
new list, the function just applies a user-provided function to all
produced elements, along with an accumulator.

let nonalloc f acc points =
  let rec loop xs acc = match xs with
    | [] -> acc
    | (x,w) :: xs ->
      List.fold_left (fun acc (x',w') ->
          f acc (Vector3.dist x x') (w *. w')) acc xs |>
      loop xs in
  loop points acc

This function will not allocate any unnecessary data (it will though
allocate a pair of floats per each call). So we can use this function to
implement the `nonmutab` version, by passing a consing operator and an
empty list. We can also try to use it to store data. The data storage
process depends on how fast we can reformat the data, and how fast is the
sink device. If we will output to /dev/null (that is known to be fast),
then we can use the Marshal module and get about 300 MB/s throughtput. Not
bad, but still about 10 seconds of running time. If, for example, we just
need some scalar metrics, then we don't need to pay an overhead of data,
and it will be as fast as possible. For example, with the following
implementations of the kernel function

let flags = [Marshal.No_sharing]

let print () x1 x2 =
  Marshal.to_channel stdout x1 flags;
  Marshal.to_channel stdout x2 flags

let product total x1 x2 = total +. x1 *. x2

We will have the following timings:

printall: 9649.54 ms
nonalloc: 541.186 ms

So, the actual computation took only half a second, the rest is data
processing.

Please find attached the whole example. You can run it with the following
command:

ocamlbuild -pkgs vector3,unix points.native -- > /dev/null

Regards,
Ivan Gotovchits

On Thu, Jun 15, 2017 at 9:09 AM, Ronan Le Hy <ronan.lehy@gmail.com> wrote:

> Hi François,
>
> 2017-06-15 8:28 GMT+02:00 Francois BERENGER <berenger@bioreg.kyushu-u.ac.
> jp>:
> > I am wondering if some high performance OCaml experts out there
> > can know in advance if some code can go faster by executing it
> > on a GPU.
> > I have some clear bottleneck in my program.
> > Here is how the code looks like:
> > ---
> > let f (points: (Vector3.t * float) list) =
> >   let acc = ref [] in
> >   let ac p1 x (p2, y) =
> >     acc := (Vector3.dist p1 p2, x *. y) :: !acc
> >   in
> >   let rec loop = function
> >     | [] -> ()
> >     | (p1, x) :: xs ->
> >       L.iter (ac p1 x) xs;
> >       loop xs
> >   in
> >   loop points;
> >   !acc
>
> As a baseline before attempting anything on the GPU, I'd vectorize
> this. Put all your vectors in a big matrix. Put all your numbers in a
> vector. Compute the distances and the products all at once using
> Lacaml (make sure you use OpenBLAS as a backend). I'd expect it to be
> much faster than the above loop already.
>
> --
> Ronan
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

[-- Attachment #1.2: Type: text/html, Size: 10399 bytes --]

[-- Attachment #2: points.ml --]
[-- Type: application/octet-stream, Size: 1743 bytes --]

open Printf

let original (points: (Vector3.t * float) list) =
  let acc = ref [] in
  let ac p1 x (p2, y) =
    acc := (Vector3.dist p1 p2, x *. y) :: !acc
  in
  let rec loop = function
    | [] -> ()
    | (p1, x) :: xs ->
      List.iter (ac p1 x) xs;
      loop xs
  in
  loop points;
  !acc

let nonmutab points = 
  let rec loop xs acc = match xs with
    | [] -> acc
    | (x,w) :: xs -> 
      List.fold_left (fun acc (x',w') -> 
          (Vector3.dist x x', w *. w') :: acc) acc xs |> 
      loop xs in
  loop points []

let nonalloc f acc points = 
  let rec loop xs acc = match xs with
    | [] -> acc
    | (x,w) :: xs -> 
      List.fold_left (fun acc (x',w') -> 
          f acc (Vector3.dist x x') (w *. w')) acc xs |> 
      loop xs in
  loop points acc

let timeit name f x = 
  Gc.major ();
  let t0 = Unix.gettimeofday () in
  let r = f x in
  let t1 = Unix.gettimeofday () in
  eprintf "%s: %g ms\n%!" name ((t1 -. t0) *. 1000.);
  r

let rev_init n f = 
  let rec loop i xs = 
    if i < n then loop (i+1) (f i :: xs) else xs in
  loop 0 []

let random_points n = rev_init n (fun i -> 
    Vector3.make
      (Random.float 1.)
      (Random.float 1.)
      (Random.float 1.),
    Random.float 1.)

let flags = [Marshal.No_sharing] 

let print () x1 x2 =
  Marshal.to_channel stdout x1 flags;
  Marshal.to_channel stdout x2 flags

let product total x1 x2 = total +. x1 *. x2

let main () = 
  Random.init 42;
  let points = random_points 10000 in
  let _r1 = timeit "original" original points in
  let _r2 = timeit "nonmutab" nonmutab points in
  let ()  = timeit "printall" (nonalloc print ()) points in
  let prd = timeit "nonalloc" (nonalloc product 0.) points in
  eprintf "product = %g\n" prd

let () = main ()

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ?
  2017-06-15 10:38   ` Ivan Gotovchits
@ 2017-06-16  9:49     ` Francois BERENGER
  0 siblings, 0 replies; 4+ messages in thread
From: Francois BERENGER @ 2017-06-16  9:49 UTC (permalink / raw)
  To: caml-list

On 06/15/2017 07:38 PM, Ivan Gotovchits wrote:
> Hi François,
> 
> Your code is actually bound by memory, not by computation. The problem 
> is that you create lots of data, and all the time is spent in the 
> allocation functions and GC. The actual computation consumes less than 
> 5% of overall program execution, that can be easily seen by profiling 
> (always profile before starting optimization). Thus moving to GPU will 
> make this 5% of code run faster, at the expense of even higher overhead 
> of moving data between CPU and GPU.

Thanks Ivan for the detailed explanations, the code suggestions and even 
benchmarks !

Thanks to profiling, I knew the code snippet I sent is the bottleneck.

I also saw the whole program was spending 40% of its time doing GC.

I went with your deforesting/noalloc approach.
However, the reduction function cannot avoid doing array updates.

But, the results were amazing already: now, my program runs 4 times
faster than before and the numerical results are unchanged. I am impressed.

Thanks a lot!
F.

> So, let's see how we can optimize your code. First of all, your code 
> uses too much mutable state, that can have a negative performance impact 
> in OCaml due to write barriers. Let's make it a little bit cleaner and 
> faster:
> 
>     let nonmutab points =
>        let rec loop xs acc = match xs with
>          | [] -> acc
>          | (x,w) :: xs ->
>            List.fold_left (fun acc (x',w') ->
>                (Vector3.dist x x', w *. w') :: acc) acc xs |>
>            loop xs in
>        loop points []
> 
> This code runs a little bit faster (on 10000 points):
> 
>     original: 14084.1 ms
>     nonmutab: 11309.2 ms
> 
> It computes absolutely the same result, using the same computational 
> core, and allocating the same amount of trash. The only difference is 
> that we are not using a mutable state anymore, and we are rewarded for that.
> 
> The next step is to consider, do you really need to produce these 
> intermediate structures, if the result of your program is the computed 
> data, then you can just store it in a file. Or, if you need later this 
> data to be reduced to some value, then you shouldn't create an 
> intermediate result, and apply the reduction in place (so called 
> de-foresting). So, let's generalize a little bit the function so that 
> instead of producing a new list, the function just applies a 
> user-provided function to all produced elements, along with an accumulator.
> 
>     let nonalloc f acc points =
>        let rec loop xs acc = match xs with
>          | [] -> acc
>          | (x,w) :: xs ->
>            List.fold_left (fun acc (x',w') ->
>                f acc (Vector3.dist x x') (w *. w')) acc xs |>
>            loop xs in
>        loop points acc
> 
> 
> 
> This function will not allocate any unnecessary data (it will though 
> allocate a pair of floats per each call). So we can use this function to 
> implement the `nonmutab` version, by passing a consing operator and an 
> empty list. We can also try to use it to store data. The data storage 
> process depends on how fast we can reformat the data, and how fast is 
> the sink device. If we will output to /dev/null (that is known to be 
> fast), then we can use the Marshal module and get about 300 MB/s 
> throughtput. Not bad, but still about 10 seconds of running time. If, 
> for example, we just need some scalar metrics, then we don't need to pay 
> an overhead of data, and it will be as fast as possible. For example, 
> with the following implementations of the kernel function
> 
>     let flags = [Marshal.No_sharing]
> 
>     let print () x1 x2 =
>        Marshal.to_channel stdout x1 flags;
>        Marshal.to_channel stdout x2 flags
> 
>     let product total x1 x2 = total +. x1 *. x2
> 
> 
> We will have the following timings:
> 
>     printall: 9649.54 ms
>     nonalloc: 541.186 ms
> 
> 
> So, the actual computation took only half a second, the rest is data 
> processing.
> 
> Please find attached the whole example. You can run it with the 
> following command:
> 
>     ocamlbuild -pkgs vector3,unix points.native -- > /dev/null
> 
> 
> Regards,
> Ivan Gotovchits
> 
> 
> On Thu, Jun 15, 2017 at 9:09 AM, Ronan Le Hy <ronan.lehy@gmail.com 
> <mailto:ronan.lehy@gmail.com>> wrote:
> 
>     Hi François,
> 
>     2017-06-15 8:28 GMT+02:00 Francois BERENGER
>     <berenger@bioreg.kyushu-u.ac.jp
>     <mailto:berenger@bioreg.kyushu-u.ac.jp>>:
>     > I am wondering if some high performance OCaml experts out there
>     > can know in advance if some code can go faster by executing it
>     > on a GPU.
>     > I have some clear bottleneck in my program.
>     > Here is how the code looks like:
>     > ---
>     > let f (points: (Vector3.t * float) list) =
>     >   let acc = ref [] in
>     >   let ac p1 x (p2, y) =
>     >     acc := (Vector3.dist p1 p2, x *. y) :: !acc
>     >   in
>     >   let rec loop = function
>     >     | [] -> ()
>     >     | (p1, x) :: xs ->
>     >       L.iter (ac p1 x) xs;
>     >       loop xs
>     >   in
>     >   loop points;
>     >   !acc
> 
>     As a baseline before attempting anything on the GPU, I'd vectorize
>     this. Put all your vectors in a big matrix. Put all your numbers in a
>     vector. Compute the distances and the products all at once using
>     Lacaml (make sure you use OpenBLAS as a backend). I'd expect it to be
>     much faster than the above loop already.
> 
>     --
>     Ronan
> 
>     --
>     Caml-list mailing list.  Subscription management and archives:
>     https://sympa.inria.fr/sympa/arc/caml-list
>     <https://sympa.inria.fr/sympa/arc/caml-list>
>     Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>     <http://groups.yahoo.com/group/ocaml_beginners>
>     Bug reports: http://caml.inria.fr/bin/caml-bugs
>     <http://caml.inria.fr/bin/caml-bugs>
> 
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-06-16  9:49 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-15  6:28 [Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ? Francois BERENGER
2017-06-15  7:09 ` Ronan Le Hy
2017-06-15 10:38   ` Ivan Gotovchits
2017-06-16  9:49     ` Francois BERENGER

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).