From: Yotam Barnoy <yotambarnoy@gmail.com>
Date: Wed, 8 Jan 2014 17:13:05 -0500
To: Roberto Di Cosmo <roberto@dicosmo.org>
Cc: Ocaml Mailing List <caml-list@inria.fr>
Subject: Re: [Caml-list] Concurrent/parallel programming

Thank you for the very detailed answer, Roberto, and thanks for the reference to the excellent list discussion.
The machine learning example I gave was not a typical one -- I was using it for illustration, and honestly I wasn't sure what kind of feedback I'd get. I'd love to go back and figure out what was wrong with my implementation, but I currently don't have the time. Nevertheless, I'm glad to see that many people on this list use Parmap successfully. My assumption at the time was that perhaps the code was creating and destroying sub-processes continuously, since the slowdown I experienced was by several orders of magnitude, but I'm just not sure.

Regarding a place to share ideas, it seems like it would be very useful to have an official OCaml wiki. Haskell has one and it's a huge help; in fact, I would say Haskell development would be greatly hampered without it. There's so much information that's relevant to more than one library, i.e. that doesn't fit in any particular library's documentation. It wouldn't be too hard to set up a MediaWiki instance on ocaml.org, would it? Alternatively, it should be pretty easy to set up something on Wikia. This wiki would also be a great place to describe the conceptual implementation of the compiler, which again is something Haskell has.

In terms of my personal use cases, I'm much more likely to want to share a lot of state between cores than anything else. Out of everything that's been mentioned so far (including the discussion you linked), the best fit for me seems to be either Netmulticore or perhaps OCaml's MPI bindings, given the testament to MPI's speed I saw in the aforementioned discussion. Netmulticore seems like a marvelous piece of work, but I'm bothered by how unsafe it is -- one of OCaml's great assets is its safety. I wish there were a way for the type system or the runtime in general to make a shared heap work better, but I guess we're stuck with what we have for now. Anyway, I'm really excited to see what Leo White can come up with for making a truly multicore runtime a reality.
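For reference, here is a minimal sketch of what the master/worker pattern looks like with the OCaml MPI bindings. The Mpi module names and signatures follow the library's documented interface, so treat this as an outline rather than tested code; note that values travel through OCaml's marshaller, so it is only as safe as Marshal itself:

  (* Master/worker sketch with the ocamlmpi bindings.
     Build against the mpi package and run under e.g. "mpiexec -n 4 ./prog". *)
  let () =
    let rank = Mpi.comm_rank Mpi.comm_world in
    let size = Mpi.comm_size Mpi.comm_world in
    if rank = 0 then begin
      (* master: hand one chunk of work to each worker... *)
      for dest = 1 to size - 1 do
        Mpi.send (Array.init 1000 float_of_int) dest 0 Mpi.comm_world
      done;
      (* ...then collect the (rank, result) pairs in whatever order they arrive *)
      for _ = 1 to size - 1 do
        let (w, sum) = (Mpi.receive Mpi.any_source 0 Mpi.comm_world : int * float) in
        Printf.printf "worker %d computed %f\n" w sum
      done
    end else begin
      (* worker: receive a chunk, do some float work, send the result back *)
      let chunk : float array = Mpi.receive 0 0 Mpi.comm_world in
      Mpi.send (rank, Array.fold_left (+.) 0.0 chunk) 0 0 Mpi.comm_world
    end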
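And for sharing plain numeric state between cores without writing any C, one baseline is to memory-map a Bigarray and fork: the shared mapping is inherited by the children, so every process sees the same unboxed float array. A sketch -- there is no synchronization here, so it only works when workers write disjoint slices:

  (* Share a float array across forked workers with no C code, via a
     memory-mapped Bigarray (Bigarray.Array1.map_file in the stdlib's
     bigarray library; later OCaml releases renamed it Unix.map_file). *)
  let () =
    let n = 1_000_000 in
    let fd = Unix.openfile "/tmp/shared.dat" [Unix.O_RDWR; Unix.O_CREAT] 0o600 in
    (* shared = true: writes go to the common mapping, not a private copy *)
    let a = Bigarray.Array1.map_file fd Bigarray.float64 Bigarray.c_layout true n in
    let nworkers = 4 in
    let chunk = n / nworkers in
    for w = 0 to nworkers - 1 do
      match Unix.fork () with
      | 0 ->
          (* child: fill its own slice of the shared array, then exit *)
          for i = w * chunk to ((w + 1) * chunk) - 1 do
            a.{i} <- sqrt (float_of_int i)
          done;
          exit 0
      | _ -> ()
    done;
    (* parent: wait for all workers, then read their results directly *)
    for _ = 1 to nworkers do ignore (Unix.wait ()) done;
    Printf.printf "a.{42} = %f\n" a.{42}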
-Yotam

On Wed, Jan 8, 2014 at 3:29 PM, Roberto Di Cosmo <roberto@dicosmo.org> wrote:

> Dear Yotam,
>     there are regularly discussions on how to perform computations
> involving parallelism on this list; a relatively recent and detailed one is
>
>     https://sympa.inria.fr/sympa/arc/caml-list/2012-06/msg00043.html
>
> And yes, one might be puzzled because there are so many different
> approaches, but this is unavoidable: there are so many different needs,
> and they are not easily covered by a single approach. If one wants to
> distribute a computation that has very little data exchange compared to
> the local computation on each node, the best approach is quite different
> from the one needed when large data sets must be shared or exchanged
> among the workers. And if you can do it all on a single machine, you can
> obtain a significant speedup on a multicore machine by exploiting
> OS-level mechanisms that are not available if you need a cluster. Not to
> mention the fact that in many cases one is looking for concurrency, not
> parallelism: even if at first sight they may look similar, deep down they
> really are not.
>
> Since you mention machine learning, it's quite probable that you want to
> perform several iterations of relatively inexpensive computations on very
> large arrays of integers or floats. If this is the case, Parmap (and not
> ParMap, which invites confusion with a different project in the Haskell
> world) was specially designed to provide a highly efficient solution and
> to avoid the boxing/unboxing overhead of float arrays, *if* you use it
> properly (in particular, see the notes at the end of the README file, or
> look at http://www.dicosmo.org/code/parmap/ and the research article
> linked from there, which discusses precisely the steps involved in
> performing parallel computation on float arrays). Indeed, feedback from
> happy users and synthetic benchmarks indicate that Parmap should provide
> some of the best performance possible for this use case on a multicore
> machine, short of resorting to libraries like Ancient, which requires
> care to avoid core dumps, or to external libraries that have already
> taken care of parallelism for you (like LAPACK, etc.).
>
> But if one performs a map over an array of floats *without* using the
> special functions for floats, then all sorts of boxing/unboxing and
> copying will take place, and the "parallel" version might very well be
> even slower than the sequential one, *if* the computation on each float
> is fast.
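To make that contrast concrete, here is a sketch of the two calls side by side. The names parmap and array_float_parmap follow Parmap's documented interface, though the exact optional arguments may differ between versions:

  let () =
    let xs = Array.init 1_000_000 float_of_int in
    (* Generic map: results are marshalled back to the parent, so floats are
       boxed and copied; with a cheap per-element function this overhead can
       dominate and even make the parallel version slower. *)
    let slow = Parmap.parmap ~ncores:4 (fun x -> x *. x) (Parmap.A xs) in
    (* Float-specialised map: results are written unboxed into a shared
       buffer, which is what lets the parallel version actually pay off. *)
    let fast = Parmap.array_float_parmap ~ncores:4 (fun x -> x *. x) xs in
    Printf.printf "%d %d\n" (List.length slow) (Array.length fast)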
> Finally, if the float computations are the bottleneck, then a very
> interesting project to keep an eye on is SPOC, which may significantly
> outperform everything else on earth: taking advantage of the GPUs in your
> machine, it can perform float computations on large arrays in a fraction
> of the time that your CPU, even a multicore one, requires... but of
> course you need to learn to program a GPU kernel for that. Learn more
> about it here: http://www.algo-prog.info/spoc
>
> The bottom line is: parallelism is easier than concurrency, but when one
> looks for speed, every detail counts, and getting a real speedup does not
> come for free.
>
> We would really need a single place to share ideas, tips and serious
> analysis of the various available approaches: multicore machines and GPUs
> are a reality, and this issue is bound to come up again and again.
>
> --
> Roberto
>
> On Tue, Jan 07, 2014 at 02:54:33PM -0500, Yotam Barnoy wrote:
> > Hi List
> >
> > So far, I've been programming in OCaml using only sequential programs.
> > In my last project, which was an implementation of a large machine
> > learning algorithm, I tried to speed up computation using a little bit
> > of parallelism with ParMap, and it was a complete failure. It's
> > possible that more time would have yielded better results, but I just
> > didn't have the time to invest in it given how bad the initial results
> > were.
> >
> > My question is, what are the options right now as far as parallelism is
> > concerned? I'm not talking about cooperative multitasking, but about
> > really taking advantage of multiple cores. I'm well aware of the
> > runtime lock and I'm OK with message passing between processes or a
> > shared area in memory, but I'd rather have something more high-level
> > than starting up several processes, creating a named pipe or a socket,
> > and trying to pass messages through that. Also, I assume that using a
> > shared area in memory involves some C code? Am I wrong about that?
> >
> > I was expecting Core's Async to fill this role, but Real World OCaml is
> > fuzzy on this topic, apparently preferring to dwell on cooperative
> > multitasking (which is fine but not what I'm looking for), and I
> > couldn't find any other documentation that was clearer.
> >
> > Thanks
> > Yotam
>
> --
> Roberto Di Cosmo
>
> ------------------------------------------------------------------
> Professeur               En delegation a l'INRIA
> PPS                      E-mail: roberto@dicosmo.org
> Universite Paris Diderot WWW   : http://www.dicosmo.org
> Case 7014                Tel   : ++33-(0)1-57 27 92 20
> 5, Rue Thomas Mann
> F-75205 Paris Cedex 13   Identica: http://identi.ca/rdicosmo
> FRANCE.                  Twitter: http://twitter.com/rdicosmo
> ------------------------------------------------------------------
> Attachments:
> MIME accepted, Word deprecated
>       http://www.gnu.org/philosophy/no-word-attachments.html
> ------------------------------------------------------------------
> Office location:
>
> Bureau 3020 (3rd floor)
> Batiment Sophie Germain
> Avenue de France
> Metro Bibliotheque Francois Mitterrand, ligne 14/RER C
> -----------------------------------------------------------------
> GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3