Thank you for the very detailed answer, Roberto, and thanks for the reference to the excellent list discussion. The machine learning example I gave was not a typical one -- I was using it for illustration, and honestly I wasn't sure what kind of feedback I'd get. I'd love to go back and figure out what was wrong with my implementation, but I currently don't have the time. Nevertheless, I'm glad to see that many people on this list use Parmap successfully. My guess at the time was that the code was creating and destroying sub-processes continuously, since the slowdown I experienced was several orders of magnitude, but I'm just not sure.

Regarding a place to share ideas, it seems like it would be very useful to have an official OCaml wiki. Haskell has one and it's a huge help; in fact, I'd say Haskell development would be greatly hampered without it. There's so much information that's relevant to more than one library, i.e. that doesn't fit in any particular library's documentation. It wouldn't be too hard to set up a MediaWiki instance on ocaml.org, would it? Alternatively, it should be pretty easy to set up something on Wikia. Such a wiki would also be a great place to describe the conceptual implementation of the compiler, which again is something Haskell has.

In terms of my personal use cases, I'm much more likely to want to share a lot of state between cores than anything else. Out of everything that's been mentioned so far (including the discussion you linked), the best fit for me seems to be either Netmulticore or perhaps OCaml's MPI bindings, given the testimony to MPI's speed in the aforementioned discussion. Netmulticore seems like a marvelous piece of work, but I'm bothered by how unsafe it is -- one of OCaml's great assets is its safety. I wish there were a way for the type system or the runtime in general to make a shared heap work better, but I guess we're stuck with what we have for now.
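For concreteness, here is roughly what the MPI-bindings route looks like, as far as I understand the ocamlmpi interface (function names per its documentation; this is an untested sketch, run under mpiexec, not a definitive example):

```ocaml
(* Sketch of a scatter/compute/gather round trip with the ocamlmpi
   bindings. Each process runs this same program; the root (rank 0)
   scatters one chunk per worker and gathers the partial sums back.
   Run e.g. as: mpiexec -n 4 ./sum *)
let () =
  let rank = Mpi.comm_rank Mpi.comm_world in
  let size = Mpi.comm_size Mpi.comm_world in
  (* Only the root needs to build the full input. *)
  let chunks =
    if rank = 0 then Array.init size (fun i -> Array.make 1_000 (float_of_int i))
    else [||]
  in
  (* Each process receives its own chunk... *)
  let my_chunk = Mpi.scatter chunks 0 Mpi.comm_world in
  let local_sum = Array.fold_left ( +. ) 0.0 my_chunk in
  (* ...and the root collects the partial results. *)
  let sums = Mpi.gather local_sum 0 Mpi.comm_world in
  if rank = 0 then
    Printf.printf "total = %f\n" (Array.fold_left ( +. ) 0.0 sums)
```

The appeal over hand-rolled pipes is that the distribution and collection of data is a one-liner each, and the same program scales from one multicore machine to a cluster.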
Anyway, I'm really excited to see what Leo White can come up with for making a truly multicore runtime a reality.

-Yotam

On Wed, Jan 8, 2014 at 3:29 PM, Roberto Di Cosmo wrote:

> Dear Yotam,
> there are regularly discussions on how to perform computations involving
> parallelism on this list; a relatively recent and detailed one is
>
> https://sympa.inria.fr/sympa/arc/caml-list/2012-06/msg00043.html
>
> And yes, one might be puzzled because there are so many different
> approaches, but this is unavoidable, because there are so many different
> needs which are not easily covered by a single approach. If one wants to
> distribute a computation that has very little data exchange compared to the
> local computation on each node, the best approach is quite different to the
> one needed when large data sets need to be shared or exchanged among the
> workers. And if you can do it all on a single machine, you can obtain
> significant speedup on a multicore machine by exploiting OS-level
> mechanisms that are not available if you need a cluster. Not to mention the
> fact that in many cases one is looking for concurrency, and not
> parallelism: even if at first sight they may look similar, deep down they
> really are not.
>
> Since you mention machine learning, it's quite probable that you want to
> perform several iterations of relatively inexpensive computations on very
> large arrays of integers or floats: if this is the case, Parmap (and not
> ParMap, which leads to confusion with a different project in the Haskell
> world) was specially designed to provide a highly efficient solution, and
> avoid the boxing/unboxing overhead of float arrays, *if* you use it
> properly (in particular, see the notes at the end of the README file, or
> look at http://www.dicosmo.org/code/parmap/ and the research article
> pointed to from there, where a precise discussion of the steps involved in
> performing parallel computation on float arrays is given).
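[A sketch of the two code paths being contrasted here, using the entry points named in the Parmap README -- the exact optional arguments may differ from what is shown, so treat this as illustrative rather than authoritative:]

```ocaml
(* Contrast between the generic and float-specialized Parmap maps,
   following the advice in the Parmap README. *)
let xs = Array.init 10_000_000 float_of_int

(* Generic map over an array of floats: every result is boxed and
   marshalled back to the parent, which can make the "parallel"
   version slower than the sequential one when f is cheap. *)
let _slow = Parmap.parmap ~ncores:4 (fun x -> x *. x) (Parmap.A xs)

(* Float-specialized map: results are written into a shared, unboxed
   float array, avoiding the boxing and copying overhead. *)
let _fast = Parmap.array_float_parmap ~ncores:4 (fun x -> x *. x) xs
```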
> Actually, feedback from happy users and synthetic benchmarks indicate that
> Parmap should provide some of the best possible performance for this use
> case on a multicore machine, short of resorting to libraries like ancient,
> which requires care to avoid core dumps, or external libraries that have
> already taken care of parallelism for you (like LAPACK, etc.).
>
> But if one performs a map over an array of floats *without* using the
> special functions for floats, then all sorts of boxing/unboxing and copying
> will take place, and the "parallel" version might very well be even slower
> than the sequential one, *if* the computation on each float is fast.
>
> Finally, if the float computations are the bottleneck, then a very
> interesting project to keep an eye on is SPOC, which may significantly
> outperform everything else on earth: taking advantage of the GPUs in your
> machine, it can perform float computations on large arrays in a fraction of
> the time that your CPU, even multicore, requires... but of course you need
> to learn to program a GPU kernel for that. Learn more about it here:
> http://www.algo-prog.info/spoc
>
> The bottom line is, parallelism is easier than concurrency, but when one
> looks for speed, every detail counts, and getting a real speedup does not
> come for free.
>
> We would really need a single place to share ideas, tips and serious
> analysis of the various available approaches: multicore machines and GPUs
> are a reality, and this issue is bound to come up again and again.
>
> --
> Roberto
>
> On Tue, Jan 07, 2014 at 02:54:33PM -0500, Yotam Barnoy wrote:
> > Hi List
> >
> > So far, I've been programming in OCaml using only sequential programs.
> > In my last project, which was an implementation of a large machine
> > learning algorithm, I tried to speed up computation using a little bit
> > of parallelism with ParMap, and it was a complete failure.
> > It's possible that more time would have yielded better results, but I
> > just didn't have the time to invest in it given how bad the initial
> > results were.
> >
> > My question is, what are the options right now as far as parallelism is
> > concerned? I'm not talking about cooperative multitasking, but about
> > really taking advantage of multiple cores. I'm well aware of the runtime
> > lock and I'm ok with message passing between processes or a shared area
> > in memory, but I'd rather have something more high level than starting
> > up several processes, creating a named pipe or a socket, and trying to
> > pass messages through that. Also, I assume that using a shared area in
> > memory involves some C code? Am I wrong about that?
> >
> > I was expecting Core's Async to fill this role, but realworldocaml is
> > fuzzy on this topic, apparently preferring to dwell on cooperative
> > multitasking (which is fine but not what I'm looking for), and I
> > couldn't find any other documentation that was clearer.
> >
> > Thanks
> > Yotam
>
> --
> Roberto Di Cosmo
>
> ------------------------------------------------------------------
> Professeur                 En delegation a l'INRIA
> PPS                        E-mail: roberto@dicosmo.org
> Universite Paris Diderot   WWW   : http://www.dicosmo.org
> Case 7014                  Tel   : ++33-(0)1-57 27 92 20
> 5, Rue Thomas Mann
> F-75205 Paris Cedex 13     Identica: http://identi.ca/rdicosmo
> FRANCE.                    Twitter: http://twitter.com/rdicosmo
> ------------------------------------------------------------------
> Attachments:
> MIME accepted, Word deprecated
> http://www.gnu.org/philosophy/no-word-attachments.html
> ------------------------------------------------------------------
> Office location:
>
> Bureau 3020 (3rd floor)
> Batiment Sophie Germain
> Avenue de France
> Metro Bibliotheque Francois Mitterrand, ligne 14/RER C
> -----------------------------------------------------------------
> GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3
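[P.S. For the record, the "starting up several processes and passing messages through a pipe" baseline that the original question wanted to avoid really is only a couple dozen lines with just the stdlib's Unix and Marshal modules -- a minimal sketch, not a recommendation:]

```ocaml
(* Minimal parent/worker message passing: fork one worker, have it
   marshal its result down a pipe, and read it back in the parent.
   Uses only the Unix and Marshal modules from the stdlib. *)
let () =
  let r, w = Unix.pipe () in
  match Unix.fork () with
  | 0 ->
      (* Child: close the read end, compute, send the result, exit. *)
      Unix.close r;
      let oc = Unix.out_channel_of_descr w in
      let result = Array.init 1_000 (fun i -> i * i) in
      Marshal.to_channel oc result [];
      close_out oc;
      exit 0
  | _pid ->
      (* Parent: close the write end, read the marshalled result. *)
      Unix.close w;
      let ic = Unix.in_channel_of_descr r in
      let (result : int array) = Marshal.from_channel ic in
      ignore (Unix.wait ());
      Printf.printf "received %d results\n" (Array.length result)
```

It works, but everything higher-level (work distribution, chunking, avoiding boxing) is exactly what libraries like Parmap layer on top of this, which is why rolling it by hand rarely pays off.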