caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] [ANN]: Parmap 0.9.8
@ 2011-11-30 14:42 Roberto Di Cosmo
  2011-11-30 14:57 ` Markus Weißmann
  0 siblings, 1 reply; 2+ messages in thread
From: Roberto Di Cosmo @ 2011-11-30 14:42 UTC (permalink / raw)
  To: caml-list; +Cc: roberto, marcod

Dear all,
   a few lines to announce the (much improved) version 0.9.8 of Parmap, the
minimalistic library introduced last August, and that can be useful to exploit
your multicore processor with minimal modifications to your OCaml programs.

You will find a full description in the README file, as well as
several examples.

The main changes are:

 - all functions (parmap, parfold, parmapfold) accept an optional parameter
   chunksize that allows to control granularity of the parallelism

 - highly specialised functions are now present to work on arrays and
   especially on (unboxed) float arrays

 - configure and ocamlbuild harness

 - documentation (just make doc)

 - on Linux kernels, set_affinity is used to attach a worker to a given core

Special thanks to: Jérôme Vouillon for highly efficient code for the float
arrays; Paul Vernaza for clever suggestions on pre-allocation of
memory buffers for float arrays; Pietro Abate for autoconf/ocamlbuild help;
Francois Berenger for tests.

Project home: https://gitorious.org/parmap

Version 0.9.8 is cc8915bceb7d05c2c645587f5eaaf0f6cb6080c6

To compile and install:

git clone git://gitorious.org/parmap/parmap.git
git checkout pipes
aclocal -I m4
autoconf
./configure
make
make install

Enjoy

-- Marco Danelutto and Roberto Di Cosmo


=====================Copy of the README file======================
Parmap in a nutshell
--------------------

Parmap is a minimalistic library allowing  to exploit multicore architecture for
OCaml programs with minimal modifications: if you want to use your
many cores to
accelerate an   operation  which  happens   to be   a  map,  fold
or map/fold
(map-reduce),  just use  Parmap's  parmap, parfold  and parmapfold
primitives in
place  of the standard   List.map   and friends,  and  specify  the
number   of
subprocesses to use by the optional parameter ~ncores.

See the example directory for a couple of running programs.

DO'S and DONT'S
---------------

Parmap is *not*  meant to be a replacement  for a full fledged
implementation of
parallelism skeletons  (map, reduce, pipe, and the  many others
described in the
scientific literature   since the end   of the  1980's,   much earlier  than the
specific   implementation by Google   engineers  that popularised
them).  It  is
meant,  instead, to allow you to  quickly leverage the  idle
processing power of
your extra cores, when handling some heavy computational load.

The principle of parmap is very simple: when you call one of the three available
primitives, map, fold, and  mapfold , your OCaml  sequential program forks  in n
subprocesses (you choose the n), and each subprocess performs the computation on
the 1/n of the data, in chunks  of a size you  can choose, returning the results
through a shared memory area to the  parent process, that resumes execution once
all the children have terminated, and the data has been recollected.

You need  to run your  program on a single multicore  machine;  repeat after me:
Parmap  is   not meant to   run  on a cluster,  see   one of the  many available
(re)implementations of the map-reduce schema for that.

By forking the parent process  on a sigle  machine, the children get access, for
free, to all the data structures already built, even the imperative ones, and as
far as your computation  inside the map/fold  does not produce side effects that
need  to be  preserved, the  final result will   be the same  as  performing the
sequential operation, the only difference is that you might get it faster.

The OCaml code is reasonably simple and only marginally relies on
external C libraries:
most of the magic is done by your operating system's fork and memory
mapping mechanisms.
One could gain some speed by implementing a marshal/unmarshal operation directly
on bigarrays, but we did not do this yet.

Of course, if you happen  to have open  channels, or files, or other connections
that should only be  used by the parent  process, your program  may behave in  a
very wierd way: as an example, *do  not* open a  graphic window before calling a
Parmap primitive, and   *do   not*  use  this  library   if  your  program    is
multi-threaded!

Pinning processes to physical CPUs
----------------------------------

To obtain maximum  speed,  Parmap tries to  pin  the worker processes to  a CPU,
using  the scheduler  affinity  interface  that is  available   in recent  Linux
kernels.   Similar functionality may be  obtained  on different platforms  using
slightly different API. Contributions  are welcome to  support those other APIs,
just make sure that you use autoconf properly.

Using Parmap with Ocamlnat
--------------------------

You can use Parmap in a native toplevel  (it may be quite useful  if you use the
native toplevel to perform fast interactive computations), but remember that you
need to load the .cmxs modules in it; an example is given in example/topnat.ml

Preservation of output order in Parmap
--------------------------------------

If the number of chunks is equal to the number of cores, it is easy to preserve
the order of the elements of the sequence passed to the map/fold operations, so
the result will be a list with the same order as if the sequential
function would
be applied to the input. This is what the parmap, parmafold and
parfold functions
do when the chunksize argument is not used.

If the user specifies a chunksize that is different from the number of cores,
there is no general way to preserve the ordering, so the result of calling
Parmap.parmap f l are not necessarily in the same order as List.map f l.

In general, using little chunksize helps in balancing the load among
the workers,
and provides better speed, at the price of losing the ordering: there is a
tradeoff, and it is up to the user to choose the solution that better
suits him/her.

Fast map on arrays and on float arrays
--------------------------------------

Visiting an array is much faster than visiting a list, and conversion
of an array
to and from a list is expensive, on large data structures, so we
provide a specialised
version of map on arrays, that beaves exactly like parmap.

We also provide a highly optimised specialised parmap version that is targeted
to float arrays, array_float_parmap, that allows you to perform parallel
computation on very large float arrays efficiently, without the boxing/unboxing
overhead introduced by the other primitives, including array_parmap.

To understand the efficiency issues involved in the case of large
arrays of float,
here is a short summary of the steps that any implementation of a parallel map
function must perform.

  1) create a float array to hold the result of the computation.
     This operation is expensive: on an Intel i7, creating a 10M float array
     takes 50 milliseconds

     ocamlnat
          Objective Caml version 3.12.0 - native toplevel

     # #load "unix.cmxs";;
     # let d = Unix.gettimeofday() in ignore(Array.create 10000000
0.); Unix.gettimeofday() -. d;;
     - : float = 0.0501301288604736328

  2) create a shared memory area

  3) possibly copy the result array to the shared memory area

  4) perform the computation in the children writing the result in the
shared memory area

  5) possibly copy the result back to the OCaml array

All implementations need to do 1, 2 and 4; steps 3 and/or 5 may be
omitted depending on
what the user wants to do with the result.

The array_float_parmap performs steps 1,2,4 and 5. It is possible to share steps
1 and 2 among subsequent calls to the parallel function by
preallocating the result
array and the shared memory buffer, and passing them as optional
parameters to the
array_float_parmap function: this may save a significant amount of time if the
array is very large.


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [Caml-list] [ANN]: Parmap 0.9.8
  2011-11-30 14:42 [Caml-list] [ANN]: Parmap 0.9.8 Roberto Di Cosmo
@ 2011-11-30 14:57 ` Markus Weißmann
  0 siblings, 0 replies; 2+ messages in thread
From: Markus Weißmann @ 2011-11-30 14:57 UTC (permalink / raw)
  To: Roberto Di Cosmo; +Cc: caml-list, roberto, marcod

Hi Roberto,

imho it would be beneficial to have a global properties that specifies how
many workers there should be; like

val set_number_of_workers : int -> unit

If you're on a octa-core (..) machine you probably always want to use 8
(..) workers so it would be nice to have this as a global property
instead.


cheers

-Markus

> Dear all,
>    a few lines to announce the (much improved) version 0.9.8 of Parmap,
> the
> minimalistic library introduced last August, and that can be useful to
> exploit
> your multicore processor with minimal modifications to your OCaml
> programs.
>
> You will find a full description in the README file, as well as
> several examples.
>
> The main changes are:
>
>  - all functions (parmap, parfold, parmapfold) accept an optional
> parameter
>    chunksize that allows to control granularity of the parallelism
>
>  - highly specialised functions are now present to work on arrays and
>    especially on (unboxed) float arrays
>
>  - configure and ocamlbuild harness
>
>  - documentation (just make doc)
>
>  - on Linux kernels, set_affinity is used to attach a worker to a given
> core
>
> Special thanks to: Jérôme Vouillon for highly efficient code for the float
> arrays; Paul Vernaza for clever suggestions on pre-allocation of
> memory buffers for float arrays; Pietro Abate for autoconf/ocamlbuild
> help;
> Francois Berenger for tests.
>
> Project home: https://gitorious.org/parmap
>
> Version 0.9.8 is cc8915bceb7d05c2c645587f5eaaf0f6cb6080c6
>
> To compile and install:
>
> git clone git://gitorious.org/parmap/parmap.git
> git checkout pipes
> aclocal -I m4
> autoconf
> ./configure
> make
> make install
>
> Enjoy
>
> -- Marco Danelutto and Roberto Di Cosmo
>
>
> =====================Copy of the README file======================
> Parmap in a nutshell
> --------------------
>
> Parmap is a minimalistic library allowing  to exploit multicore
> architecture for
> OCaml programs with minimal modifications: if you want to use your
> many cores to
> accelerate an   operation  which  happens   to be   a  map,  fold
> or map/fold
> (map-reduce),  just use  Parmap's  parmap, parfold  and parmapfold
> primitives in
> place  of the standard   List.map   and friends,  and  specify  the
> number   of
> subprocesses to use by the optional parameter ~ncores.
>
> See the example directory for a couple of running programs.
>
> DO'S and DONT'S
> ---------------
>
> Parmap is *not*  meant to be a replacement  for a full fledged
> implementation of
> parallelism skeletons  (map, reduce, pipe, and the  many others
> described in the
> scientific literature   since the end   of the  1980's,   much earlier
> than the
> specific   implementation by Google   engineers  that popularised
> them).  It  is
> meant,  instead, to allow you to  quickly leverage the  idle
> processing power of
> your extra cores, when handling some heavy computational load.
>
> The principle of parmap is very simple: when you call one of the three
> available
> primitives, map, fold, and  mapfold , your OCaml  sequential program forks
>  in n
> subprocesses (you choose the n), and each subprocess performs the
> computation on
> the 1/n of the data, in chunks  of a size you  can choose, returning the
> results
> through a shared memory area to the  parent process, that resumes
> execution once
> all the children have terminated, and the data has been recollected.
>
> You need  to run your  program on a single multicore  machine;  repeat
> after me:
> Parmap  is   not meant to   run  on a cluster,  see   one of the  many
> available
> (re)implementations of the map-reduce schema for that.
>
> By forking the parent process  on a sigle  machine, the children get
> access, for
> free, to all the data structures already built, even the imperative ones,
> and as
> far as your computation  inside the map/fold  does not produce side
> effects that
> need  to be  preserved, the  final result will   be the same  as
> performing the
> sequential operation, the only difference is that you might get it faster.
>
> The OCaml code is reasonably simple and only marginally relies on
> external C libraries:
> most of the magic is done by your operating system's fork and memory
> mapping mechanisms.
> One could gain some speed by implementing a marshal/unmarshal operation
> directly
> on bigarrays, but we did not do this yet.
>
> Of course, if you happen  to have open  channels, or files, or other
> connections
> that should only be  used by the parent  process, your program  may behave
> in  a
> very wierd way: as an example, *do  not* open a  graphic window before
> calling a
> Parmap primitive, and   *do   not*  use  this  library   if  your  program
>    is
> multi-threaded!
>
> Pinning processes to physical CPUs
> ----------------------------------
>
> To obtain maximum  speed,  Parmap tries to  pin  the worker processes to
> a CPU,
> using  the scheduler  affinity  interface  that is  available   in recent
> Linux
> kernels.   Similar functionality may be  obtained  on different platforms
> using
> slightly different API. Contributions  are welcome to  support those other
> APIs,
> just make sure that you use autoconf properly.
>
> Using Parmap with Ocamlnat
> --------------------------
>
> You can use Parmap in a native toplevel  (it may be quite useful  if you
> use the
> native toplevel to perform fast interactive computations), but remember
> that you
> need to load the .cmxs modules in it; an example is given in
> example/topnat.ml
>
> Preservation of output order in Parmap
> --------------------------------------
>
> If the number of chunks is equal to the number of cores, it is easy to
> preserve
> the order of the elements of the sequence passed to the map/fold
> operations, so
> the result will be a list with the same order as if the sequential
> function would
> be applied to the input. This is what the parmap, parmafold and
> parfold functions
> do when the chunksize argument is not used.
>
> If the user specifies a chunksize that is different from the number of
> cores,
> there is no general way to preserve the ordering, so the result of calling
> Parmap.parmap f l are not necessarily in the same order as List.map f l.
>
> In general, using little chunksize helps in balancing the load among
> the workers,
> and provides better speed, at the price of losing the ordering: there is a
> tradeoff, and it is up to the user to choose the solution that better
> suits him/her.
>
> Fast map on arrays and on float arrays
> --------------------------------------
>
> Visiting an array is much faster than visiting a list, and conversion
> of an array
> to and from a list is expensive, on large data structures, so we
> provide a specialised
> version of map on arrays, that beaves exactly like parmap.
>
> We also provide a highly optimised specialised parmap version that is
> targeted
> to float arrays, array_float_parmap, that allows you to perform parallel
> computation on very large float arrays efficiently, without the
> boxing/unboxing
> overhead introduced by the other primitives, including array_parmap.
>
> To understand the efficiency issues involved in the case of large
> arrays of float,
> here is a short summary of the steps that any implementation of a parallel
> map
> function must perform.
>
>   1) create a float array to hold the result of the computation.
>      This operation is expensive: on an Intel i7, creating a 10M float
> array
>      takes 50 milliseconds
>
>      ocamlnat
>           Objective Caml version 3.12.0 - native toplevel
>
>      # #load "unix.cmxs";;
>      # let d = Unix.gettimeofday() in ignore(Array.create 10000000
> 0.); Unix.gettimeofday() -. d;;
>      - : float = 0.0501301288604736328
>
>   2) create a shared memory area
>
>   3) possibly copy the result array to the shared memory area
>
>   4) perform the computation in the children writing the result in the
> shared memory area
>
>   5) possibly copy the result back to the OCaml array
>
> All implementations need to do 1, 2 and 4; steps 3 and/or 5 may be
> omitted depending on
> what the user wants to do with the result.
>
> The array_float_parmap performs steps 1,2,4 and 5. It is possible to share
> steps
> 1 and 2 among subsequent calls to the parallel function by
> preallocating the result
> array and the shared memory buffer, and passing them as optional
> parameters to the
> array_float_parmap function: this may save a significant amount of time if
> the
> array is very large.
>
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa-roc.inria.fr/wws/info/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>
>


-- 
Markus Weißmann, M.Sc.
Institut für Informatik
Technische Universität München
Raum 03.07.054, Boltzmannstr. 3, 85748 Garching

Tel. +49 (89) 2 89-1 81 05
Fax +49 (89) 2 89-1 81 07

mailto:markus.weissmann@in.tum.de
http://www6.in.tum.de/


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2011-11-30 14:58 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-30 14:42 [Caml-list] [ANN]: Parmap 0.9.8 Roberto Di Cosmo
2011-11-30 14:57 ` Markus Weißmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).