From: Yotam Barnoy <yotambarnoy@gmail.com>
Date: Wed, 8 Jan 2014 17:13:05 -0500
To: Roberto Di Cosmo <roberto@dicosmo.org>
Cc: Ocaml Mailing List <caml-list@inria.fr>
Subject: Re: [Caml-list] Concurrent/parallel programming

Thank you for the very detailed answer, Roberto, and thanks for the reference to the excellent list discussion.
The machine learning example I gave was not a typical one -- I was using it for illustration, and honestly I wasn't sure what kind of feedback I'd get. I'd love to go back and figure out what was wrong with my implementation, but I currently don't have the time. Nevertheless, I'm glad to see that many people on this list use Parmap successfully. My assumption at the time was that perhaps the code was creating and destroying sub-processes continuously, since the slowdown I experienced was by several orders of magnitude, but I'm just not sure.

Regarding a place to share ideas, it seems like it would be very useful to have an official OCaml wiki. Haskell has one and it's a huge help; in fact, I would say Haskell development would be greatly hampered without it. There's so much information that's relevant to more than one library, i.e. that doesn't fit in any particular library's documentation. It wouldn't be too hard to set up a MediaWiki instance on ocaml.org, would it? Alternatively, it should be pretty easy to set up something on Wikia. This wiki would also be a great place to describe the conceptual implementation of the compiler, which again is something Haskell has.

In terms of my personal use cases, I'm much more likely to want to share a lot of state between cores than anything else. Out of everything that's been mentioned so far (including the discussion you linked), the best fit for me seems to be either Netmulticore or perhaps OCaml's MPI bindings, given the testament to MPI's speed I saw in the aforementioned discussion. Netmulticore seems like a marvelous piece of work, but I'm bothered by how unsafe it is -- one of OCaml's great assets is its safety. I wish there were a way for the type system or the runtime in general to make a shared heap work better, but I guess we're stuck with what we have for now. Anyway, I'm really excited to see what Leo White can come up with for making a truly multicore runtime a reality.
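For reference, here is a minimal sketch of what the master/worker pattern looks like with the OCaml MPI bindings. The Mpi module names and signatures follow the library's documented interface, so treat this as an outline rather than tested code; note that values travel through OCaml's marshaller, so it is only as safe as Marshal itself:

  (* Master/worker sketch with the ocamlmpi bindings.
     Build against the mpi package and run under e.g. "mpiexec -n 4 ./prog". *)
  let () =
    let rank = Mpi.comm_rank Mpi.comm_world in
    let size = Mpi.comm_size Mpi.comm_world in
    if rank = 0 then begin
      (* master: hand one chunk of work to each worker... *)
      for dest = 1 to size - 1 do
        Mpi.send (Array.init 1000 float_of_int) dest 0 Mpi.comm_world
      done;
      (* ...then collect the (rank, result) pairs in whatever order they arrive *)
      for _ = 1 to size - 1 do
        let (w, sum) = (Mpi.receive Mpi.any_source 0 Mpi.comm_world : int * float) in
        Printf.printf "worker %d computed %f\n" w sum
      done
    end else begin
      (* worker: receive a chunk, do some float work, send the result back *)
      let chunk : float array = Mpi.receive 0 0 Mpi.comm_world in
      Mpi.send (rank, Array.fold_left (+.) 0.0 chunk) 0 0 Mpi.comm_world
    end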
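And for sharing plain numeric state between cores without writing any C, one baseline is to memory-map a Bigarray and fork: the shared mapping is inherited by the children, so every process sees the same unboxed float array. A sketch -- there is no synchronization here, so it only works when workers write disjoint slices:

  (* Share a float array across forked workers with no C code, via a
     memory-mapped Bigarray (Bigarray.Array1.map_file in the stdlib's
     bigarray library; later OCaml releases renamed it Unix.map_file). *)
  let () =
    let n = 1_000_000 in
    let fd = Unix.openfile "/tmp/shared.dat" [Unix.O_RDWR; Unix.O_CREAT] 0o600 in
    (* shared = true: writes go to the common mapping, not a private copy *)
    let a = Bigarray.Array1.map_file fd Bigarray.float64 Bigarray.c_layout true n in
    let nworkers = 4 in
    let chunk = n / nworkers in
    for w = 0 to nworkers - 1 do
      match Unix.fork () with
      | 0 ->
          (* child: fill its own slice of the shared array, then exit *)
          for i = w * chunk to ((w + 1) * chunk) - 1 do
            a.{i} <- sqrt (float_of_int i)
          done;
          exit 0
      | _ -> ()
    done;
    (* parent: wait for all workers, then read their results directly *)
    for _ = 1 to nworkers do ignore (Unix.wait ()) done;
    Printf.printf "a.{42} = %f\n" a.{42}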
-Yotam

On Wed, Jan 8, 2014 at 3:29 PM, Roberto Di Cosmo <roberto@dicosmo.org> wrote:

> Dear Yotam,
>     there are regularly discussions on how to perform computations
> involving parallelism on this list; a relatively recent and detailed one is
>
>     https://sympa.inria.fr/sympa/arc/caml-list/2012-06/msg00043.html
>
> And yes, one might be puzzled because there are so many different
> approaches, but this is unavoidable: there are so many different needs,
> and they are not easily covered by a single approach. If one wants to
> distribute a computation that has very little data exchange compared to
> the local computation on each node, the best approach is quite different
> from the one needed when large data sets must be shared or exchanged
> among the workers. And if you can do it all on a single machine, you can
> obtain a significant speedup on a multicore machine by exploiting
> OS-level mechanisms that are not available if you need a cluster. Not to
> mention the fact that in many cases one is looking for concurrency, not
> parallelism: even if at first sight they may look similar, deep down they
> really are not.
>
> Since you mention machine learning, it's quite probable that you want to
> perform several iterations of relatively inexpensive computations on very
> large arrays of integers or floats. If this is the case, Parmap (and not
> ParMap, which invites confusion with a different project in the Haskell
> world) was specially designed to provide a highly efficient solution and
> to avoid the boxing/unboxing overhead of float arrays, *if* you use it
> properly (in particular, see the notes at the end of the README file, or
> look at http://www.dicosmo.org/code/parmap/ and the research article
> linked from there, which discusses precisely the steps involved in
> performing parallel computation on float arrays). Indeed, feedback from
> happy users and synthetic benchmarks indicate that Parmap should provide
> some of the best performance possible for this use case on a multicore
> machine, short of resorting to libraries like Ancient, which requires
> care to avoid core dumps, or to external libraries that have already
> taken care of parallelism for you (like LAPACK, etc.).
>
> But if one performs a map over an array of floats *without* using the
> special functions for floats, then all sorts of boxing/unboxing and
> copying will take place, and the "parallel" version might very well be
> even slower than the sequential one, *if* the computation on each float
> is fast.
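To make that contrast concrete, here is a sketch of the two calls side by side. The names parmap and array_float_parmap follow Parmap's documented interface, though the exact optional arguments may differ between versions:

  let () =
    let xs = Array.init 1_000_000 float_of_int in
    (* Generic map: results are marshalled back to the parent, so floats are
       boxed and copied; with a cheap per-element function this overhead can
       dominate and even make the parallel version slower. *)
    let slow = Parmap.parmap ~ncores:4 (fun x -> x *. x) (Parmap.A xs) in
    (* Float-specialised map: results are written unboxed into a shared
       buffer, which is what lets the parallel version actually pay off. *)
    let fast = Parmap.array_float_parmap ~ncores:4 (fun x -> x *. x) xs in
    Printf.printf "%d %d\n" (List.length slow) (Array.length fast)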
> Finally, if the float computations are the bottleneck, then a very
> interesting project to keep an eye on is SPOC, which may significantly
> outperform everything else on earth: taking advantage of the GPUs in your
> machine, it can perform float computations on large arrays in a fraction
> of the time that your CPU, even a multicore one, requires... but of
> course you need to learn to program a GPU kernel for that. Learn more
> about it here: http://www.algo-prog.info/spoc
>
> The bottom line is: parallelism is easier than concurrency, but when one
> looks for speed, every detail counts, and getting a real speedup does not
> come for free.
>
> We would really need a single place to share ideas, tips and serious
> analysis of the various available approaches: multicore machines and GPUs
> are a reality, and this issue is bound to come up again and again.
>
> --
> Roberto
>
> On Tue, Jan 07, 2014 at 02:54:33PM -0500, Yotam Barnoy wrote:
> > Hi List
> >
> > So far, I've been programming in OCaml using only sequential programs.
> > In my last project, which was an implementation of a large machine
> > learning algorithm, I tried to speed up computation using a little bit
> > of parallelism with ParMap, and it was a complete failure. It's
> > possible that more time would have yielded better results, but I just
> > didn't have the time to invest in it given how bad the initial results
> > were.
> >
> > My question is, what are the options right now as far as parallelism is
> > concerned? I'm not talking about cooperative multitasking, but about
> > really taking advantage of multiple cores. I'm well aware of the
> > runtime lock and I'm OK with message passing between processes or a
> > shared area in memory, but I'd rather have something more high-level
> > than starting up several processes, creating a named pipe or a socket,
> > and trying to pass messages through that. Also, I assume that using a
> > shared area in memory involves some C code? Am I wrong about that?
> >
> > I was expecting Core's Async to fill this role, but Real World OCaml is
> > fuzzy on this topic, apparently preferring to dwell on cooperative
> > multitasking (which is fine but not what I'm looking for), and I
> > couldn't find any other documentation that was clearer.
> >
> > Thanks
> > Yotam
>
> --
> Roberto Di Cosmo
>
> ------------------------------------------------------------------
> Professeur               En delegation a l'INRIA
> PPS                      E-mail: roberto@dicosmo.org
> Universite Paris Diderot WWW   : http://www.dicosmo.org
> Case 7014                Tel   : ++33-(0)1-57 27 92 20
> 5, Rue Thomas Mann
> F-75205 Paris Cedex 13   Identica: http://identi.ca/rdicosmo
> FRANCE.                  Twitter: http://twitter.com/rdicosmo
> ------------------------------------------------------------------
> Attachments:
> MIME accepted, Word deprecated
>       http://www.gnu.org/philosophy/no-word-attachments.html
> ------------------------------------------------------------------
> Office location:
>
> Bureau 3020 (3rd floor)
> Batiment Sophie Germain
> Avenue de France
> Metro Bibliotheque Francois Mitterrand, ligne 14/RER C
> -----------------------------------------------------------------
> GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3