Thank you for the very detailed answer, Roberto, and thanks for the reference to the excellent list discussion. The machine learning example I gave was not a typical one -- I was using it for illustration, and honestly I wasn't sure what kind of feedback I'd get. I'd love to go back and figure out what was wrong with my implementation, but I currently don't have the time. Nevertheless, I'm glad to see that many people on this list use Parmap successfully. My guess at the time was that the code was creating and destroying sub-processes continuously, since the slowdown I experienced was several orders of magnitude, but I'm just not sure.

Regarding a place to share ideas, it seems like it would be very useful to have an official OCaml wiki. Haskell has one and it's a huge help; in fact, I'd say Haskell development would be greatly hampered without it. There's so much information that's relevant to more than one library, i.e. that doesn't fit in any particular library's documentation. It wouldn't be too hard to set up a MediaWiki instance on ocaml.org, would it? Alternatively, it should be pretty easy to set up something on Wikia. Such a wiki would also be a great place to describe the conceptual implementation of the compiler, which again is something Haskell has.

In terms of my personal use cases, I'm much more likely to want to share a lot of state between cores than anything else. Out of everything that's been mentioned so far (including the discussion you linked), the best fit for me seems to be either Netmulticore or perhaps OCaml's MPI bindings, given the testimony to MPI's speed in the aforementioned discussion. Netmulticore seems like a marvelous piece of work, but I'm bothered by how unsafe it is -- one of OCaml's great assets is its safety. I wish there were a way for the type system or the runtime in general to make a shared heap work better, but I guess we're stuck with what we have for now.
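For concreteness, here is roughly what the MPI-bindings route looks like, as far as I understand the ocamlmpi interface (function names per its documentation; this is an untested sketch, run under mpiexec, not a definitive example):

```ocaml
(* Sketch of a scatter/compute/gather round trip with the ocamlmpi
   bindings. Each process runs this same program; the root (rank 0)
   scatters one chunk per worker and gathers the partial sums back.
   Run e.g. as: mpiexec -n 4 ./sum *)
let () =
  let rank = Mpi.comm_rank Mpi.comm_world in
  let size = Mpi.comm_size Mpi.comm_world in
  (* Only the root needs to build the full input. *)
  let chunks =
    if rank = 0 then Array.init size (fun i -> Array.make 1_000 (float_of_int i))
    else [||]
  in
  (* Each process receives its own chunk... *)
  let my_chunk = Mpi.scatter chunks 0 Mpi.comm_world in
  let local_sum = Array.fold_left ( +. ) 0.0 my_chunk in
  (* ...and the root collects the partial results. *)
  let sums = Mpi.gather local_sum 0 Mpi.comm_world in
  if rank = 0 then
    Printf.printf "total = %f\n" (Array.fold_left ( +. ) 0.0 sums)
```

The appeal over hand-rolled pipes is that the distribution and collection of data is a one-liner each, and the same program scales from one multicore machine to a cluster.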
Anyway, I'm really excited to see what Leo White can come up with for making a truly multicore runtime a reality.

-Yotam

On Wed, Jan 8, 2014 at 3:29 PM, Roberto Di Cosmo wrote:

> Dear Yotam,
> there are regularly discussions on how to perform computations involving
> parallelism on this list; a relatively recent and detailed one is
>
> https://sympa.inria.fr/sympa/arc/caml-list/2012-06/msg00043.html
>
> And yes, one might be puzzled because there are so many different
> approaches, but this is unavoidable, because there are so many different
> needs which are not easily covered by a single approach. If one wants to
> distribute a computation that has very little data exchange compared to the
> local computation on each node, the best approach is quite different to the
> one needed when large data sets need to be shared or exchanged among the
> workers. And if you can do it all on a single machine, you can obtain
> significant speedup on a multicore machine by exploiting OS-level
> mechanisms that are not available if you need a cluster. Not to mention the
> fact that in many cases one is looking for concurrency, and not
> parallelism: even if at first sight they may look similar, deep down they
> really are not.
>
> Since you mention machine learning, it's quite probable that you want to
> perform several iterations of relatively inexpensive computations on very
> large arrays of integers or floats: if this is the case, Parmap (and not
> ParMap, which leads to confusion with a different project in the Haskell
> world) was specially designed to provide a highly efficient solution, and
> avoid the boxing/unboxing overhead of float arrays, *if* you use it
> properly (in particular, see the notes at the end of the README file, or
> look at http://www.dicosmo.org/code/parmap/ and the research article
> pointed to from there, where a precise discussion of the steps involved in
> performing parallel computation on float arrays is given).
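[A sketch of the two code paths being contrasted here, using the entry points named in the Parmap README -- the exact optional arguments may differ from what is shown, so treat this as illustrative rather than authoritative:]

```ocaml
(* Contrast between the generic and float-specialized Parmap maps,
   following the advice in the Parmap README. *)
let xs = Array.init 10_000_000 float_of_int

(* Generic map over an array of floats: every result is boxed and
   marshalled back to the parent, which can make the "parallel"
   version slower than the sequential one when f is cheap. *)
let _slow = Parmap.parmap ~ncores:4 (fun x -> x *. x) (Parmap.A xs)

(* Float-specialized map: results are written into a shared, unboxed
   float array, avoiding the boxing and copying overhead. *)
let _fast = Parmap.array_float_parmap ~ncores:4 (fun x -> x *. x) xs
```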
> Actually, feedback from happy users and synthetic benchmarks indicate that
> Parmap should provide some of the best possible performance for this use
> case on a multicore machine, short of resorting to libraries like ancient,
> which requires care to avoid core dumps, or external libraries that have
> already taken care of parallelism for you (like LAPACK, etc.).
>
> But if one performs a map over an array of floats *without* using the
> special functions for floats, then all sorts of boxing/unboxing and copying
> will take place, and the "parallel" version might very well be even slower
> than the sequential one, *if* the computation on each float is fast.
>
> Finally, if the float computations are the bottleneck, then a very
> interesting project to keep an eye on is SPOC, which may significantly
> outperform everything else on earth: taking advantage of the GPUs in your
> machine, it can perform float computations on large arrays in a fraction of
> the time that your CPU, even multicore, requires... but of course you need
> to learn to program a GPU kernel for that. Learn more about it here:
> http://www.algo-prog.info/spoc
>
> The bottom line is, parallelism is easier than concurrency, but when one
> looks for speed, every detail counts, and getting a real speedup does not
> come for free.
>
> We would really need a single place to share ideas, tips and serious
> analysis of the various available approaches: multicore machines and GPUs
> are a reality, and this issue is bound to come up again and again.
>
> --
> Roberto
>
> On Tue, Jan 07, 2014 at 02:54:33PM -0500, Yotam Barnoy wrote:
> > Hi List
> >
> > So far, I've been programming in OCaml using only sequential programs.
> > In my last project, which was an implementation of a large machine
> > learning algorithm, I tried to speed up computation using a little bit
> > of parallelism with ParMap, and it was a complete failure.
> > It's possible that more time would have yielded better results, but I
> > just didn't have the time to invest in it given how bad the initial
> > results were.
> >
> > My question is, what are the options right now as far as parallelism is
> > concerned? I'm not talking about cooperative multitasking, but about
> > really taking advantage of multiple cores. I'm well aware of the runtime
> > lock and I'm ok with message passing between processes or a shared area
> > in memory, but I'd rather have something more high level than starting
> > up several processes, creating a named pipe or a socket, and trying to
> > pass messages through that. Also, I assume that using a shared area in
> > memory involves some C code? Am I wrong about that?
> >
> > I was expecting Core's Async to fill this role, but realworldocaml is
> > fuzzy on this topic, apparently preferring to dwell on cooperative
> > multitasking (which is fine but not what I'm looking for), and I
> > couldn't find any other documentation that was clearer.
> >
> > Thanks
> > Yotam
>
> --
> Roberto Di Cosmo
>
> ------------------------------------------------------------------
> Professeur                 En delegation a l'INRIA
> PPS                        E-mail: roberto@dicosmo.org
> Universite Paris Diderot   WWW   : http://www.dicosmo.org
> Case 7014                  Tel   : ++33-(0)1-57 27 92 20
> 5, Rue Thomas Mann
> F-75205 Paris Cedex 13     Identica: http://identi.ca/rdicosmo
> FRANCE.                    Twitter: http://twitter.com/rdicosmo
> ------------------------------------------------------------------
> Attachments:
> MIME accepted, Word deprecated
> http://www.gnu.org/philosophy/no-word-attachments.html
> ------------------------------------------------------------------
> Office location:
>
> Bureau 3020 (3rd floor)
> Batiment Sophie Germain
> Avenue de France
> Metro Bibliotheque Francois Mitterrand, ligne 14/RER C
> -----------------------------------------------------------------
> GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3
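[P.S. For the record, the "starting up several processes and passing messages through a pipe" baseline that the original question wanted to avoid really is only a couple dozen lines with just the stdlib's Unix and Marshal modules -- a minimal sketch, not a recommendation:]

```ocaml
(* Minimal parent/worker message passing: fork one worker, have it
   marshal its result down a pipe, and read it back in the parent.
   Uses only the Unix and Marshal modules from the stdlib. *)
let () =
  let r, w = Unix.pipe () in
  match Unix.fork () with
  | 0 ->
      (* Child: close the read end, compute, send the result, exit. *)
      Unix.close r;
      let oc = Unix.out_channel_of_descr w in
      let result = Array.init 1_000 (fun i -> i * i) in
      Marshal.to_channel oc result [];
      close_out oc;
      exit 0
  | _pid ->
      (* Parent: close the write end, read the marshalled result. *)
      Unix.close w;
      let ic = Unix.in_channel_of_descr r in
      let (result : int array) = Marshal.from_channel ic in
      ignore (Unix.wait ());
      Printf.printf "received %d results\n" (Array.length result)
```

It works, but everything higher-level (work distribution, chunking, avoiding boxing) is exactly what libraries like Parmap layer on top of this, which is why rolling it by hand rarely pays off.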