From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id pAUEwEDw029412 for ; Wed, 30 Nov 2011 15:58:15 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqwBACND1k6Bu/5qkWdsb2JhbABEmn6QESIBAQEBCQsLBxQEIYFyAQEEAQwbCwE5AggDBQsLEhMPCwdFAQMOBhMSCYdsCLgag22HMwSmaA X-IronPort-AV: E=Sophos;i="4.69,597,1315173600"; d="scan'208";a="121524433" Received: from mailrelay1.lrz-muenchen.de ([129.187.254.106]) by mail4-smtp-sop.national.inria.fr with ESMTP; 30 Nov 2011 15:58:09 +0100 Received: from webmail.mwn.de ([129.187.254.85] [129.187.254.85]) by mailout.lrz-muenchen.de with ESMTP; Wed, 30 Nov 2011 15:57:49 +0100 Received: from webprx2.audi.de ([143.164.102.14]) (SquirrelMail authenticated user weissmam) by webmail.mwn.de with HTTP; Wed, 30 Nov 2011 15:57:51 +0100 (CET) Message-Id: <63143.143.164.102.14.1322665071.squirrel@webmail.mwn.de> In-Reply-To: References: Date: Wed, 30 Nov 2011 15:57:51 +0100 (CET) From: Markus =?iso-8859-1?Q?Wei=DFmann?= To: "Roberto Di Cosmo" Cc: caml-list@inria.fr, roberto@dicosmo.org, marcod@di.unipi.it User-Agent: SquirrelMail/1.4.13 MIME-Version: 1.0 Content-Type: text/plain;charset=iso-8859-1 Content-Transfer-Encoding: 8bit Subject: Re: [Caml-list] [ANN]: Parmap 0.9.8 Hi Roberto, imho it would be beneficial to have a global properties that specifies how many workers there should be; like val set_number_of_workers : int -> unit If you're on a octa-core (..) machine you probably always want to use 8 (..) workers so it would be nice to have this as a global property instead. cheers -Markus > Dear all, > a few lines to announce the (much improved) version 0.9.8 of Parmap, > the > minimalistic library introduced last August, and that can be useful to > exploit > your multicore processor with minimal modifications to your OCaml > programs. > > You will find a full description in the README file, as well as > several examples. > > The main changes are: > > - all functions (parmap, parfold, parmapfold) accept an optional > parameter > chunksize that allows to control granularity of the parallelism > > - highly specialised functions are now present to work on arrays and > especially on (unboxed) float arrays > > - configure and ocamlbuild harness > > - documentation (just make doc) > > - on Linux kernels, set_affinity is used to attach a worker to a given > core > > Special thanks to: Jérôme Vouillon for highly efficient code for the float > arrays; Paul Vernaza for clever suggestions on pre-allocation of > memory buffers for float arrays; Pietro Abate for autoconf/ocamlbuild > help; > Francois Berenger for tests. > > Project home: https://gitorious.org/parmap > > Version 0.9.8 is cc8915bceb7d05c2c645587f5eaaf0f6cb6080c6 > > To compile and install: > > git clone git://gitorious.org/parmap/parmap.git > git checkout pipes > aclocal -I m4 > autoconf > ./configure > make > make install > > Enjoy > > -- Marco Danelutto and Roberto Di Cosmo > > > =====================Copy of the README file====================== > Parmap in a nutshell > -------------------- > > Parmap is a minimalistic library allowing to exploit multicore > architecture for > OCaml programs with minimal modifications: if you want to use your > many cores to > accelerate an operation which happens to be a map, fold > or map/fold > (map-reduce), just use Parmap's parmap, parfold and parmapfold > primitives in > place of the standard List.map and friends, and specify the > number of > subprocesses to use by the optional parameter ~ncores. > > See the example directory for a couple of running programs. > > DO'S and DONT'S > --------------- > > Parmap is *not* meant to be a replacement for a full fledged > implementation of > parallelism skeletons (map, reduce, pipe, and the many others > described in the > scientific literature since the end of the 1980's, much earlier > than the > specific implementation by Google engineers that popularised > them). It is > meant, instead, to allow you to quickly leverage the idle > processing power of > your extra cores, when handling some heavy computational load. > > The principle of parmap is very simple: when you call one of the three > available > primitives, map, fold, and mapfold , your OCaml sequential program forks > in n > subprocesses (you choose the n), and each subprocess performs the > computation on > the 1/n of the data, in chunks of a size you can choose, returning the > results > through a shared memory area to the parent process, that resumes > execution once > all the children have terminated, and the data has been recollected. > > You need to run your program on a single multicore machine; repeat > after me: > Parmap is not meant to run on a cluster, see one of the many > available > (re)implementations of the map-reduce schema for that. > > By forking the parent process on a sigle machine, the children get > access, for > free, to all the data structures already built, even the imperative ones, > and as > far as your computation inside the map/fold does not produce side > effects that > need to be preserved, the final result will be the same as > performing the > sequential operation, the only difference is that you might get it faster. > > The OCaml code is reasonably simple and only marginally relies on > external C libraries: > most of the magic is done by your operating system's fork and memory > mapping mechanisms. > One could gain some speed by implementing a marshal/unmarshal operation > directly > on bigarrays, but we did not do this yet. > > Of course, if you happen to have open channels, or files, or other > connections > that should only be used by the parent process, your program may behave > in a > very wierd way: as an example, *do not* open a graphic window before > calling a > Parmap primitive, and *do not* use this library if your program > is > multi-threaded! > > Pinning processes to physical CPUs > ---------------------------------- > > To obtain maximum speed, Parmap tries to pin the worker processes to > a CPU, > using the scheduler affinity interface that is available in recent > Linux > kernels. Similar functionality may be obtained on different platforms > using > slightly different API. Contributions are welcome to support those other > APIs, > just make sure that you use autoconf properly. > > Using Parmap with Ocamlnat > -------------------------- > > You can use Parmap in a native toplevel (it may be quite useful if you > use the > native toplevel to perform fast interactive computations), but remember > that you > need to load the .cmxs modules in it; an example is given in > example/topnat.ml > > Preservation of output order in Parmap > -------------------------------------- > > If the number of chunks is equal to the number of cores, it is easy to > preserve > the order of the elements of the sequence passed to the map/fold > operations, so > the result will be a list with the same order as if the sequential > function would > be applied to the input. This is what the parmap, parmafold and > parfold functions > do when the chunksize argument is not used. > > If the user specifies a chunksize that is different from the number of > cores, > there is no general way to preserve the ordering, so the result of calling > Parmap.parmap f l are not necessarily in the same order as List.map f l. > > In general, using little chunksize helps in balancing the load among > the workers, > and provides better speed, at the price of losing the ordering: there is a > tradeoff, and it is up to the user to choose the solution that better > suits him/her. > > Fast map on arrays and on float arrays > -------------------------------------- > > Visiting an array is much faster than visiting a list, and conversion > of an array > to and from a list is expensive, on large data structures, so we > provide a specialised > version of map on arrays, that beaves exactly like parmap. > > We also provide a highly optimised specialised parmap version that is > targeted > to float arrays, array_float_parmap, that allows you to perform parallel > computation on very large float arrays efficiently, without the > boxing/unboxing > overhead introduced by the other primitives, including array_parmap. > > To understand the efficiency issues involved in the case of large > arrays of float, > here is a short summary of the steps that any implementation of a parallel > map > function must perform. > > 1) create a float array to hold the result of the computation. > This operation is expensive: on an Intel i7, creating a 10M float > array > takes 50 milliseconds > > ocamlnat > Objective Caml version 3.12.0 - native toplevel > > # #load "unix.cmxs";; > # let d = Unix.gettimeofday() in ignore(Array.create 10000000 > 0.); Unix.gettimeofday() -. d;; > - : float = 0.0501301288604736328 > > 2) create a shared memory area > > 3) possibly copy the result array to the shared memory area > > 4) perform the computation in the children writing the result in the > shared memory area > > 5) possibly copy the result back to the OCaml array > > All implementations need to do 1, 2 and 4; steps 3 and/or 5 may be > omitted depending on > what the user wants to do with the result. > > The array_float_parmap performs steps 1,2,4 and 5. It is possible to share > steps > 1 and 2 among subsequent calls to the parallel function by > preallocating the result > array and the shared memory buffer, and passing them as optional > parameters to the > array_float_parmap function: this may save a significant amount of time if > the > array is very large. > > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa-roc.inria.fr/wws/info/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > > -- Markus Weißmann, M.Sc. Institut für Informatik Technische Universität München Raum 03.07.054, Boltzmannstr. 3, 85748 Garching Tel. +49 (89) 2 89-1 81 05 Fax +49 (89) 2 89-1 81 07 mailto:markus.weissmann@in.tum.de http://www6.in.tum.de/