Re: [Caml-list] Efficient OCaml multicore -- roadmap?

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
       [not found] <2054357367.219171.1300974318806.JavaMail.root@zmbs4.inria.fr>
@ 2011-03-24 23:13 ` Fabrice Le Fessant
  2011-03-25  0:23   ` [Caml-list] " Sylvain Le Gall
                     ` (5 more replies)
  0 siblings, 6 replies; 74+ messages in thread
From: Fabrice Le Fessant @ 2011-03-24 23:13 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 2277 bytes --]

Hi,

  Actually, I had a discussion two weeks ago with Xavier and Damien
about this issue. There is some kind of agreement that the ocaml way of
supporting multicore would be to have several runtimes running in the
same process, in different threads. That way, the GC would still be
mono-threaded, so almost no speed loss for mono-threaded programs (i.e.
currently all OCaml programs ;-) ). There would be some kind of "fork"
function, that would create a new thread running a function in a new
heap, probably generated by a copy-on-need algorithm. The different
threads would not share heap memory, but would be allowed to share
structures outside of their heaps, probably for simple types like
strings and int/float arrays (or using the Ancient library).

  Now, there are still two problems:
(1) We don't know yet how to implement that in a portable way. TLS
(Thread-local storage) is only available on a few architectures. And not
using TLS implies non-backward compatible changes to the FFI
(Foreign-Functions Interface), i.e. all stub libraries would have to be
rewritten.
(2) As Gerd pointed it, there are not so many programs that would
benefit from that. So it is not currently on the top of our priority
list, although I am planning to give it a try in the next months, at
least for the TLS version.

Cheers,
-- 
Fabrice

On 03/24/2011 02:45 PM, Alexy Khrabrov wrote:
> Where does the OCaml team stand on the multicore issues?  A year or so ago, when there was a prototype parallel GC implementation, IIRC, Xavier said it has to be done right.  So what are the official plans and the status of integrating what volunteers had done?
> 
> WIth Scala having a robust actors model and AKKA kernel, and Clojure built around efficient shared memory concurrency with agents and references and STM, and Haskell also really parallel, OCaml is lacking behind.  Furthermore, F# builds on strongly parallel .NET, overcoming granddaddy.  With multicores common even in laptops and iPads, we need an efficient  multicore OCaml!  Due to the model different from Haskell or Scala and Clojure, now all on github, OCaml is both more stable and also is slower to advance -- what do folks think about this situation?  How do you do shared memory parallelism now?
> 
> Cheers,
> Alexy
> 

[-- Attachment #2: fabrice_le_fessant.vcf --]
[-- Type: text/x-vcard, Size: 384 bytes --]

begin:vcard
fn:Fabrice LE FESSANT
n:LE FESSANT;Fabrice
org:INRIA Saclay -- Ile-de-France;Projet OCamlPro
adr;quoted-printable:;;Parc Orsay Universit=C3=A9 ;Orsay CEDEX;;91893;France
email;internet:fabrice.le_fessant@inria.fr
title;quoted-printable:Charg=C3=A9 de Recherche
tel;work:+33 1 74 85 42 14
tel;fax:+33 1 74 85 42 49 
url:http://fabrice.lefessant.net/
version:2.1
end:vcard

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Caml-list] Re: Efficient OCaml multicore -- roadmap?
  2011-03-24 23:13 ` [Caml-list] Efficient OCaml multicore -- roadmap? Fabrice Le Fessant
@ 2011-03-25  0:23   ` Sylvain Le Gall
  2011-03-25  9:55   ` [Caml-list] " Alain Frisch
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 74+ messages in thread
From: Sylvain Le Gall @ 2011-03-25  0:23 UTC (permalink / raw)
  To: caml-list

On 24-03-2011, Fabrice Le Fessant <Fabrice.Le_fessant@inria.fr> wrote:
> This is a multi-part message in MIME format.
> --------------010005060004040101060205
> Content-Type: text/plain; charset=ISO-8859-1
> Content-Transfer-Encoding: 7bit
>
> Hi,
>
>   Actually, I had a discussion two weeks ago with Xavier and Damien
> about this issue. There is some kind of agreement that the ocaml way of
> supporting multicore would be to have several runtimes running in the
> same process, in different threads. That way, the GC would still be
> mono-threaded, so almost no speed loss for mono-threaded programs (i.e.
> currently all OCaml programs ;-) ). There would be some kind of "fork"
> function, that would create a new thread running a function in a new
> heap, probably generated by a copy-on-need algorithm. 

That is a very good idea. I think this should be the less expensive
solution. I hope to see this implemented.

> The different threads would not share heap memory, but would be
> allowed to share structures outside of their heaps, probably for
> simple types like strings and int/float arrays (or using the Ancient
> library).
>

I think that you should basically left this "sharing" topic open.
Create communication channel and allow to send the address of a C
allocated structure (not managed by by the GC, of course). 

>   Now, there are still two problems:
> (1) We don't know yet how to implement that in a portable way. TLS
> (Thread-local storage) is only available on a few architectures. And not
> using TLS implies non-backward compatible changes to the FFI
> (Foreign-Functions Interface), i.e. all stub libraries would have to be
> rewritten.

Well, I have thought about this issue a lot during the last months.
Currently my best idea is to use a keyword "reentrant" for foreign
function (just as "noalloc").

If an external function is marked "reentrant", you pass an additional
parameters to the function, which is the context for doing allocation
and other things.

If an external function is not markd "reentrant", you acquire a general
lock and do as usual (or use TLS or whatever).

This "reentrant" keyword could be applied to a limited set of functions
of the OCaml runtime, to begin the benchmarks. Once it seems ok, you can
generalize the solution.

> (2) As Gerd pointed it, there are not so many programs that would
> benefit from that. 

Though I agree, I think a roadmap is a good point to show that INRIA
OCaml team is working on this issue. Today, even a half-working solution
for // will be better than nothing inside the INRIA OCaml core
distribution.

> So it is not currently on the top of our priority list, although I am
> planning to give it a try in the next months, at least for the TLS
> version.
>

I wish you good luck.

Cheers,
Sylvain Le Gall
-- 
My company: http://www.ocamlcore.com
Linkedin:   http://fr.linkedin.com/in/sylvainlegall
Start an OCaml project here: http://forge.ocamlcore.org
OCaml blogs:                 http://planet.ocamlcore.org

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-24 23:13 ` [Caml-list] Efficient OCaml multicore -- roadmap? Fabrice Le Fessant
  2011-03-25  0:23   ` [Caml-list] " Sylvain Le Gall
@ 2011-03-25  9:55   ` Alain Frisch
  2011-03-25 11:44     ` Gerd Stolpmann
       [not found]   ` <1396338209.232813.1301046980856.JavaMail.root@zmbs4.inria.fr>
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 74+ messages in thread
From: Alain Frisch @ 2011-03-25  9:55 UTC (permalink / raw)
  To: Fabrice Le Fessant; +Cc: caml-list

Hi Fabrice,

On 03/25/2011 12:13 AM, Fabrice Le Fessant wrote:
>    Actually, I had a discussion two weeks ago with Xavier and Damien
> about this issue. There is some kind of agreement that the ocaml way of
> supporting multicore would be to have several runtimes running in the
> same process, in different threads. That way, the GC would still be
> mono-threaded, so almost no speed loss for mono-threaded programs (i.e.
> currently all OCaml programs ;-) ). There would be some kind of "fork"
> function, that would create a new thread running a function in a new
> heap, probably generated by a copy-on-need algorithm. The different
> threads would not share heap memory, but would be allowed to share
> structures outside of their heaps, probably for simple types like
> strings and int/float arrays (or using the Ancient library).

This sound very reasonable, and it's good to know that the subject is 
being discussed.

What are the advantages of this approach over multi-processing (+ 
explicit shared memory)?  Multi-processing brings extra isolation and 
safety; and it requires no change to the runtime system or the FFI.

One advantage of the approach you describe is that it would allow to use 
native libraries that support multi-threading. Do you see other advantages?


Cheers,

Alain

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
       [not found]   ` <1396338209.232813.1301046980856.JavaMail.root@zmbs4.inria.fr>
@ 2011-03-25 10:23     ` Fabrice Le Fessant
  2011-03-25 12:07       ` Gerd Stolpmann
  0 siblings, 1 reply; 74+ messages in thread
From: Fabrice Le Fessant @ 2011-03-25 10:23 UTC (permalink / raw)
  To: Alain Frisch; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 1922 bytes --]

Hi Alain,

> What are the advantages of this approach over multi-processing (+
> explicit shared memory)?  Multi-processing brings extra isolation and
> safety; and it requires no change to the runtime system or the FFI.

Multi-processing has several drawbacks: you have to either create
several binaries for different tasks (master and workers usually), or
use options for that; you have to define a way to setup the system, for
example the master has to launch the workers, and for that, you have to
know where binaries are installed. On the contrary, with the "ocaml
fork" approach, you just have one program executing, so you can assume a
smaller knowledge of the environment.

For the FFI, I think it is actually possible to keep some
backward-compatibility, by allowing both a new kind of external C
functions (taking a thread specific context as argument, and declared
with the "reentrant" flag introduced by Sylvain) and the former kind
(without the context). Each function of the runtime would come in both
kinds, for example "caml_alloc_r" taking the context as extra argument,
and "caml_alloc"  without the context, the latter being a wrapper for
the former, with a way to recover the context in the middle (either
through TLS or thread specific). Most stub libraries would probably work
without change, maybe with a little extra-cost (actually, I think TLS is
costless, and thread specific is very efficient on Mac OS X where TLS
does not exist).

> One advantage of the approach you describe is that it would allow to use
> native libraries that support multi-threading. Do you see other advantages?

As I said before, the main one for me is the ability to setup everything
in one process, you can for example create locks, shared segments and
other synchronisation primitives in the main thread, before creating the
other threads, which is a little more difficult with multi-processing.

Cheers,
-- 
Fabrice

[-- Attachment #2: fabrice_le_fessant.vcf --]
[-- Type: text/x-vcard, Size: 384 bytes --]

begin:vcard
fn:Fabrice LE FESSANT
n:LE FESSANT;Fabrice
org:INRIA Saclay -- Ile-de-France;Projet OCamlPro
adr;quoted-printable:;;Parc Orsay Universit=C3=A9 ;Orsay CEDEX;;91893;France
email;internet:fabrice.le_fessant@inria.fr
title;quoted-printable:Charg=C3=A9 de Recherche
tel;work:+33 1 74 85 42 14
tel;fax:+33 1 74 85 42 49 
url:http://fabrice.lefessant.net/
version:2.1
end:vcard

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-24 23:13 ` [Caml-list] Efficient OCaml multicore -- roadmap? Fabrice Le Fessant
                     ` (2 preceding siblings ...)
       [not found]   ` <1396338209.232813.1301046980856.JavaMail.root@zmbs4.inria.fr>
@ 2011-03-25 10:51   ` Hugo Ferreira
  2011-03-25 12:25     ` Gerd Stolpmann
  2011-03-25 18:45   ` Andrei Formiga
  2011-04-13  3:36   ` Lucas Dixon
  5 siblings, 1 reply; 74+ messages in thread
From: Hugo Ferreira @ 2011-03-25 10:51 UTC (permalink / raw)
  To: Fabrice Le Fessant; +Cc: caml-list

Hello,

First and foremost just let me say that I am not an expert on this
issue but I have a vested interest in Ocaml's support for parallel
execution, hence my comment.

On 03/24/2011 11:13 PM, Fabrice Le Fessant wrote:
> Hi,
>
>    Actually, I had a discussion two weeks ago with Xavier and Damien
> about this issue. There is some kind of agreement that the ocaml way of
> supporting multicore would be to have several runtimes running in the
> same process, in different threads. That way, the GC would still be
> mono-threaded, so almost no speed loss for mono-threaded programs (i.e.
> currently all OCaml programs ;-) ). There would be some kind of "fork"
> function, that would create a new thread running a function in a new
> heap, probably generated by a copy-on-need algorithm. The different
> threads would not share heap memory, but would be allowed to share
> structures outside of their heaps, probably for simple types like
> strings and int/float arrays (or using the Ancient library).
>

I see the restriction that shared data be simple data types as a 
significant disadvantage. In my experience being able to access complex 
data structure is crucial. Whenever I see this constraint I keep asking: 
why not bite the bullet? Why not take the bull by the horns and do it as 
well as possible? If not, I believe its a waste of precious time.

That being said, I have a question: why will the proposal below not work?

Assuming all shared data structures are immutable is it possible to:

1) Use a GC that is thread aware.
2) Let each thread have its own thread-aware GC.
4) Mark the shared data by tagging it with the owner of the data and 
counting the number of threads using this data.
5) Each thread has it own local GC. This GC only manages the thread's
local data that is not shared with any other.
6) If data is shared, the thread that shares immediately marks it as 
shared.
7) If a thread terminates before the shared data can be collected (that 
is when the shared counter is not 0), the ownership is reverted to any 
one of the sharing threads.
8) The last thread to hold a shared data structure is responsible for GC.

Am I totally off the mark here?

Regards,
Hugo F.

P.S: Note that the above can be "enhanced" by for example identifying 
read-only data in order to facilitate GC.

>    Now, there are still two problems:
> (1) We don't know yet how to implement that in a portable way. TLS
> (Thread-local storage) is only available on a few architectures. And not
> using TLS implies non-backward compatible changes to the FFI
> (Foreign-Functions Interface), i.e. all stub libraries would have to be
> rewritten.
> (2) As Gerd pointed it, there are not so many programs that would
> benefit from that. So it is not currently on the top of our priority
> list, although I am planning to give it a try in the next months, at
> least for the TLS version.
>
> Cheers,

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25  9:55   ` [Caml-list] " Alain Frisch
@ 2011-03-25 11:44     ` Gerd Stolpmann
  0 siblings, 0 replies; 74+ messages in thread
From: Gerd Stolpmann @ 2011-03-25 11:44 UTC (permalink / raw)
  To: Alain Frisch; +Cc: Fabrice Le Fessant, caml-list

Am Freitag, den 25.03.2011, 10:55 +0100 schrieb Alain Frisch:
> Hi Fabrice,
> 
> On 03/25/2011 12:13 AM, Fabrice Le Fessant wrote:
> >    Actually, I had a discussion two weeks ago with Xavier and Damien
> > about this issue. There is some kind of agreement that the ocaml way of
> > supporting multicore would be to have several runtimes running in the
> > same process, in different threads. That way, the GC would still be
> > mono-threaded, so almost no speed loss for mono-threaded programs (i.e.
> > currently all OCaml programs ;-) ). There would be some kind of "fork"
> > function, that would create a new thread running a function in a new
> > heap, probably generated by a copy-on-need algorithm. The different
> > threads would not share heap memory, but would be allowed to share
> > structures outside of their heaps, probably for simple types like
> > strings and int/float arrays (or using the Ancient library).
> 
> This sound very reasonable, and it's good to know that the subject is 
> being discussed.
> 
> What are the advantages of this approach over multi-processing (+ 
> explicit shared memory)?  Multi-processing brings extra isolation and 
> safety; and it requires no change to the runtime system or the FFI.
> 
> One advantage of the approach you describe is that it would allow to use 
> native libraries that support multi-threading. Do you see other advantages?

It is an interesting approach. There is at least one big advantage: When
new memory is allocated that is to be shared by all workers, it is
mapped at the same address. In the multi-process design this is
difficult to organize - there is no system call for this. The only safe
way to get this effect is to allocate memory before new workers are
forked (the unsafe way is to manage addresses manually, but this is
quickly becoming very OS-dependent).

Being able to map shared memory at the same address is important for
some uses, especially when ocaml values are moved there directly in the
style of Ancient. For my Camlbox data structure I had to work around
this issue (and it was feasible), but not all interesting data
structures would allow that.

Gerd

> Cheers,
> 
> Alain
> 


-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 10:23     ` Fabrice Le Fessant
@ 2011-03-25 12:07       ` Gerd Stolpmann
  2011-04-16 12:12         ` Jon Harrop
  0 siblings, 1 reply; 74+ messages in thread
From: Gerd Stolpmann @ 2011-03-25 12:07 UTC (permalink / raw)
  To: Fabrice Le Fessant; +Cc: Alain Frisch, caml-list

> For the FFI, I think it is actually possible to keep some
> backward-compatibility, by allowing both a new kind of external C
> functions (taking a thread specific context as argument, and declared
> with the "reentrant" flag introduced by Sylvain) and the former kind
> (without the context). Each function of the runtime would come in both
> kinds, for example "caml_alloc_r" taking the context as extra argument,
> and "caml_alloc"  without the context, the latter being a wrapper for
> the former, with a way to recover the context in the middle (either
> through TLS or thread specific). Most stub libraries would probably work
> without change, maybe with a little extra-cost (actually, I think TLS is
> costless, and thread specific is very efficient on Mac OS X where TLS
> does not exist).

AFAIK TLS is not costless but relatively cheap. There is an extra
indirection for each access (as if the variables were arrays indexed by
a thread ID). The thread ID is stored in otherwise unused registers that
are left-overs from the 16 bit times.

I guess there could be some performance impact on frequently accessed GC
variables (e.g. caml_young_limit).

> > One advantage of the approach you describe is that it would allow to use
> > native libraries that support multi-threading. Do you see other advantages?
> 
> As I said before, the main one for me is the ability to setup everything
> in one process, you can for example create locks, shared segments and
> other synchronisation primitives in the main thread, before creating the
> other threads, which is a little more difficult with multi-processing.

I think the more interesting point is that you create/setup the
primitives after creating the threads, and this is very easy because of
the shared address space. There are POSIX calls for doing this in the
multi-process approach but the remaining problem is that one cannot
control the address space, especially that shared memory is mapped to
the same address in all processes.

Gerd

> 
> Cheers,
> -- 
> Fabrice
> 


-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 10:51   ` Hugo Ferreira
@ 2011-03-25 12:25     ` Gerd Stolpmann
  2011-03-25 12:58       ` Hugo Ferreira
       [not found]       ` <341494683.237537.1301057887481.JavaMail.root@zmbs4.inria.fr>
  0 siblings, 2 replies; 74+ messages in thread
From: Gerd Stolpmann @ 2011-03-25 12:25 UTC (permalink / raw)
  To: Hugo Ferreira; +Cc: Fabrice Le Fessant, caml-list

Am Freitag, den 25.03.2011, 10:51 +0000 schrieb Hugo Ferreira:
> Hello,
> 
> First and foremost just let me say that I am not an expert on this
> issue but I have a vested interest in Ocaml's support for parallel
> execution, hence my comment.
> 
> On 03/24/2011 11:13 PM, Fabrice Le Fessant wrote:
> > Hi,
> >
> >    Actually, I had a discussion two weeks ago with Xavier and Damien
> > about this issue. There is some kind of agreement that the ocaml way of
> > supporting multicore would be to have several runtimes running in the
> > same process, in different threads. That way, the GC would still be
> > mono-threaded, so almost no speed loss for mono-threaded programs (i.e.
> > currently all OCaml programs ;-) ). There would be some kind of "fork"
> > function, that would create a new thread running a function in a new
> > heap, probably generated by a copy-on-need algorithm. The different
> > threads would not share heap memory, but would be allowed to share
> > structures outside of their heaps, probably for simple types like
> > strings and int/float arrays (or using the Ancient library).
> >
> 
> I see the restriction that shared data be simple data types as a 
> significant disadvantage. 

Well, Fabrice was mentioning Ancient which is a code word for putting
complex values to non-GC-managed memory. It's only that the application
programmer is finally responsible for managing that memory (especially
when to deallocate).

> In my experience being able to access complex 
> data structure is crucial. Whenever I see this constraint I keep asking: 
> why not bite the bullet? Why not take the bull by the horns and do it as 
> well as possible? If not, I believe its a waste of precious time.
> 
> That being said, I have a question: why will the proposal below not work?
> 
> Assuming all shared data structures are immutable is it possible to:
> 
> 1) Use a GC that is thread aware.
> 2) Let each thread have its own thread-aware GC.
> 4) Mark the shared data by tagging it with the owner of the data and 
> counting the number of threads using this data.
> 5) Each thread has it own local GC. This GC only manages the thread's
> local data that is not shared with any other.
> 6) If data is shared, the thread that shares immediately marks it as 
> shared.
> 7) If a thread terminates before the shared data can be collected (that 
> is when the shared counter is not 0), the ownership is reverted to any 
> one of the sharing threads.
> 8) The last thread to hold a shared data structure is responsible for GC.
> 
> Am I totally off the mark here?

It would be a bit more complicated. You have two possible pointers:

(1) From thread-specific memory to shared memory
(2) From shared memory to shared memory

Pointers from shared memory to thread-specific memory would be
forbidden, of course. The problem is now that you can manage (1) with
reference counting (where the counter would need some protection for
concurrent accesses), but for (2) you would need, at least in the
general view, a garbage collector that is able to deal with circular
pointers.

So, probably there is a solution to this, but it is very complex. You
would need to extend the header of Ocaml values with counters when they
are allocated in the shared space (or the counters are put elsewhere,
but this would make it more costly to access them). There needs to be a
special version of the garbage collector for the shared space.

This is probably all possible, but the effort is not low to get this
level of automatism. If we restrict ourselves to manually managed shared
space (i.e. the app has to decide when to delete something), we can
maybe get something soon.

Gerd

> 
> Regards,
> Hugo F.
> 
> P.S: Note that the above can be "enhanced" by for example identifying 
> read-only data in order to facilitate GC.
> 
> 
> >    Now, there are still two problems:
> > (1) We don't know yet how to implement that in a portable way. TLS
> > (Thread-local storage) is only available on a few architectures. And not
> > using TLS implies non-backward compatible changes to the FFI
> > (Foreign-Functions Interface), i.e. all stub libraries would have to be
> > rewritten.
> > (2) As Gerd pointed it, there are not so many programs that would
> > benefit from that. So it is not currently on the top of our priority
> > list, although I am planning to give it a try in the next months, at
> > least for the TLS version.
> >
> > Cheers,
> 
> 


-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 12:25     ` Gerd Stolpmann
@ 2011-03-25 12:58       ` Hugo Ferreira
       [not found]       ` <341494683.237537.1301057887481.JavaMail.root@zmbs4.inria.fr>
  1 sibling, 0 replies; 74+ messages in thread
From: Hugo Ferreira @ 2011-03-25 12:58 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Fabrice Le Fessant, caml-list

On 03/25/2011 12:25 PM, Gerd Stolpmann wrote:
> Am Freitag, den 25.03.2011, 10:51 +0000 schrieb Hugo Ferreira:
>> That being said, I have a question: why will the proposal below not work?
>>
>> Assuming all shared data structures are immutable is it possible to:
>>
>> 1) Use a GC that is thread aware.
>> 2) Let each thread have its own thread-aware GC.
>> 4) Mark the shared data by tagging it with the owner of the data and
>> counting the number of threads using this data.
>> 5) Each thread has it own local GC. This GC only manages the thread's
>> local data that is not shared with any other.
>> 6) If data is shared, the thread that shares immediately marks it as
>> shared.
>> 7) If a thread terminates before the shared data can be collected (that
>> is when the shared counter is not 0), the ownership is reverted to any
>> one of the sharing threads.
>> 8) The last thread to hold a shared data structure is responsible for GC.
>>
>> Am I totally off the mark here?
>
> It would be a bit more complicated. You have two possible pointers:
>
> (1) From thread-specific memory to shared memory
> (2) From shared memory to shared memory
>
> Pointers from shared memory to thread-specific memory would be
> forbidden, of course.

I was assuming that this would not be possible given the construction
of non-mutable data.

> The problem is now that you can manage (1) with
> reference counting (where the counter would need some protection for
> concurrent accesses), but for (2) you would need, at least in the
> general view, a garbage collector that is able to deal with circular
> pointers.

Ok, circularity would certainly complicate assigning ownership to a
thread GC.

>
> So, probably there is a solution to this, but it is very complex. You
> would need to extend the header of Ocaml values with counters when they
> are allocated in the shared space (or the counters are put elsewhere,
> but this would make it more costly to access them). There needs to be a
> special version of the garbage collector for the shared space.
>
> This is probably all possible, but the effort is not low to get this
> level of automatism. If we restrict ourselves to manually managed shared
> space (i.e. the app has to decide when to delete something), we can
> maybe get something soon.
>

Which seems to bring us back to the initial issue: it would be
difficult if not impossible to manipulate and change shared complex data
structure manually.

I am now wondering if disallowing sharing that causes cycles would make 
things easier. A sort of FILO stack of sharing. A thread may only share
to its descendants. Problem: how to ensure only one of these remains
owner. Going to give this some more thought.

Thanks for the feedback.

R,
Hugo F.

> Gerd
>
>>
>> Regards,
>> Hugo F.
>>
>> P.S: Note that the above can be "enhanced" by for example identifying
>> read-only data in order to facilitate GC.
>>
>>
>>>     Now, there are still two problems:
>>> (1) We don't know yet how to implement that in a portable way. TLS
>>> (Thread-local storage) is only available on a few architectures. And not
>>> using TLS implies non-backward compatible changes to the FFI
>>> (Foreign-Functions Interface), i.e. all stub libraries would have to be
>>> rewritten.
>>> (2) As Gerd pointed it, there are not so many programs that would
>>> benefit from that. So it is not currently on the top of our priority
>>> list, although I am planning to give it a try in the next months, at
>>> least for the TLS version.
>>>
>>> Cheers,
>>
>>
>
>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
       [not found]       ` <341494683.237537.1301057887481.JavaMail.root@zmbs4.inria.fr>
@ 2011-03-25 13:10         ` Fabrice Le Fessant
  2011-03-25 13:41           ` Dario Teixeira
                             ` (3 more replies)
  0 siblings, 4 replies; 74+ messages in thread
From: Fabrice Le Fessant @ 2011-03-25 13:10 UTC (permalink / raw)
  To: Hugo Ferreira; +Cc: Gerd Stolpmann, caml-list

[-- Attachment #1: Type: text/plain, Size: 873 bytes --]

On 03/25/2011 01:58 PM, Hugo Ferreira wrote:
>>> Assuming all shared data structures are immutable is it possible to:

Well, Java has fully multi-threaded garbage collection, so there is no
point that it is possible to do it.

The problem is that it has a cost, and the cost is a huge slowdown on
mono-threaded programs. Since most OCaml programs are mono-threaded, and
only few programs would really benefit from multi-threading, we are
trying to come up with a solution that would satisfy both worlds.
Mono-threaded programs would still run as fast (possibly using a
non-reentrant version of the runtime, to avoid the cost of TLS if it is
visible), and multi-threaded programs can be written easily.

 Of course, sharing structured mutable data between threads will not be
possible, but actually, it is a good thing if you want to write correct
programs ;-)

-- 
Fabrice

[-- Attachment #2: fabrice_le_fessant.vcf --]
[-- Type: text/x-vcard, Size: 397 bytes --]

begin:vcard
fn:Fabrice LE FESSANT
n:LE FESSANT;Fabrice
org:INRIA Saclay -- Ile-de-France;Projet OCamlPro
adr;quoted-printable:;;Parc Orsay Universit=C3=A9 ;Orsay CEDEX;;91893;France
email;internet:fabrice.le_fessant@inria.fr
title;quoted-printable:Charg=C3=A9 de Recherche
tel;work:+33 1 74 85 42 14
tel;fax:+33 1 74 85 42 49 
url:http://fabrice.lefessant.net/
version:2.1
end:vcard

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 13:10         ` Fabrice Le Fessant
@ 2011-03-25 13:41           ` Dario Teixeira
  2011-03-30 18:12             ` Jon Harrop
  2011-03-25 15:44           ` Hugo Ferreira
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 74+ messages in thread
From: Dario Teixeira @ 2011-03-25 13:41 UTC (permalink / raw)
  To: Fabrice Le Fessant; +Cc: caml-list

Hi,

> Of course, sharing structured mutable data between threads will not be
> possible, but actually, it is a good thing if you want to write correct
> programs ;-)

This is a good point that is often neglected in multicore discussions.
What kind of language constructs would also be introduced in order to
manage the concurrency between the multiple threads?  Some sort of
message passing API?  Will Jocaml become the new Ocaml?

Cheers,
Dario Teixeira



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 13:10         ` Fabrice Le Fessant
  2011-03-25 13:41           ` Dario Teixeira
@ 2011-03-25 15:44           ` Hugo Ferreira
  2011-03-25 18:24             ` Martin Jambon
  2011-03-25 20:27             ` Philippe Strauss
  2011-04-19 22:47           ` Jon Harrop
       [not found]           ` <869445701.579183.1303253283515.JavaMail.root@zmbs4.inria.fr>
  3 siblings, 2 replies; 74+ messages in thread
From: Hugo Ferreira @ 2011-03-25 15:44 UTC (permalink / raw)
  To: Fabrice Le Fessant; +Cc: Gerd Stolpmann, caml-list

Hi,

Don't mean to drag this on but...

On 03/25/2011 01:10 PM, Fabrice Le Fessant wrote:
> On 03/25/2011 01:58 PM, Hugo Ferreira wrote:
>>>> Assuming all shared data structures are immutable is it possible to:
>
> Well, Java has fully multi-threaded garbage collection, so there is no
> point that it is possible to do it.
>
> The problem is that it has a cost, and the cost is a huge slowdown on
> mono-threaded programs. Since most OCaml programs are mono-threaded, and
> only few programs would really benefit from multi-threading, we are
> trying to come up with a solution that would satisfy both worlds.
> Mono-threaded programs would still run as fast (possibly using a
> non-reentrant version of the runtime, to avoid the cost of TLS if it is
> visible), and multi-threaded programs can be written easily.
>

Personally I think this is an "impossible" feat.
Why not simply allow for the existence of two GC's and let
the programmer select the one that is best for the application?

>   Of course, sharing structured mutable data between threads will not be
> possible, but actually, it is a good thing if you want to write correct
> programs ;-)
>

I'll stick to my guns here. It simply makes solving certain problem
unfeasible. Point in case: I work on machine learning algorithms. I
use large data-structures that must be processed (altered)
in order to learn. Because these data-structures are large it become
impractical to copy this to a process every time I start off a new
"thread".

R,
Hugo F.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 15:44           ` Hugo Ferreira
@ 2011-03-25 18:24             ` Martin Jambon
  2011-03-25 19:19               ` Hugo Ferreira
                                 ` (2 more replies)
  2011-03-25 20:27             ` Philippe Strauss
  1 sibling, 3 replies; 74+ messages in thread
From: Martin Jambon @ 2011-03-25 18:24 UTC (permalink / raw)
  To: caml-list

> On 03/25/2011 01:10 PM, Fabrice Le Fessant wrote:
>>   Of course, sharing structured mutable data between threads will not be
>> possible, but actually, it is a good thing if you want to write correct
>> programs ;-)

On 03/25/11 08:44, Hugo Ferreira replied:
> I'll stick to my guns here. It simply makes solving certain problem
> unfeasible. Point in case: I work on machine learning algorithms. I
> use large data-structures that must be processed (altered)
> in order to learn. Because these data-structures are large it become
> impractical to copy this to a process every time I start off a new
> "thread".

The solution would be to use get/set via a message-passing interface.

From my purely speculative perspective, it seems unavoidable that
message-passing happens at some level in order to keep a shared data
structure in sync between a large number of processors. In other words,
any access to a shared data structure requires some physical copy no
matter what the programming language makes it look like.


Martin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-24 23:13 ` [Caml-list] Efficient OCaml multicore -- roadmap? Fabrice Le Fessant
                     ` (3 preceding siblings ...)
  2011-03-25 10:51   ` Hugo Ferreira
@ 2011-03-25 18:45   ` Andrei Formiga
  2011-03-30 17:00     ` Jon Harrop
  2011-04-13  3:36   ` Lucas Dixon
  5 siblings, 1 reply; 74+ messages in thread
From: Andrei Formiga @ 2011-03-25 18:45 UTC (permalink / raw)
  To: Fabrice Le Fessant; +Cc: caml-list

On Thu, Mar 24, 2011 at 8:13 PM, Fabrice Le Fessant
<Fabrice.Le_fessant@inria.fr> wrote:
> Hi,
>
>  Actually, I had a discussion two weeks ago with Xavier and Damien
> about this issue. There is some kind of agreement that the ocaml way of
> supporting multicore would be to have several runtimes running in the
> same process, in different threads. That way, the GC would still be
> mono-threaded, so almost no speed loss for mono-threaded programs (i.e.
> currently all OCaml programs ;-) ). There would be some kind of "fork"
> function, that would create a new thread running a function in a new
> heap, probably generated by a copy-on-need algorithm. The different
> threads would not share heap memory, but would be allowed to share
> structures outside of their heaps, probably for simple types like
> strings and int/float arrays (or using the Ancient library).
>

This looks very interesting. I know there is a lot of hype around
multicore, and many programs wouldn't benefit from this, but in the
situations where you'd want to use the multiple cores you have, the
solutions available now in OCaml aren't easy to use or very practical.

Besides, although I don't think this is extremely important (it does
have some importance for the language's ecosystem) there is still the
"marketing" side. Saying that the OCaml solution to have programs
taking advantage of multicores is "create multiple processes and
manage them and their communication by yourself" isn't a good answer,
in my mind.

-- 
[]s, Andrei Formiga


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 18:24             ` Martin Jambon
@ 2011-03-25 19:19               ` Hugo Ferreira
  2011-03-25 20:26                 ` Gerd Stolpmann
                                   ` (2 more replies)
  2011-03-30 17:02               ` Jon Harrop
  2011-04-20 19:23               ` Jon Harrop
  2 siblings, 3 replies; 74+ messages in thread
From: Hugo Ferreira @ 2011-03-25 19:19 UTC (permalink / raw)
  To: Martin Jambon; +Cc: caml-list

On 03/25/2011 06:24 PM, Martin Jambon wrote:
>> On 03/25/2011 01:10 PM, Fabrice Le Fessant wrote:
>>>    Of course, sharing structured mutable data between threads will not be
>>> possible, but actually, it is a good thing if you want to write correct
>>> programs ;-)
>
> On 03/25/11 08:44, Hugo Ferreira replied:
>> I'll stick to my guns here. It simply makes solving certain problem
>> unfeasible. Point in case: I work on machine learning algorithms. I
>> use large data-structures that must be processed (altered)
>> in order to learn. Because these data-structures are large it become
>> impractical to copy this to a process every time I start off a new
>> "thread".
>
> The solution would be to use get/set via a message-passing interface.
>

Cannot see how this works. Say I want to share a balanced binary tree.
Several processes/threads each take this tree and alter it by adding and
deleting elements. Each (new) tree is then further processed by other
processes/threads.

How can get/set be used in this scenario?

>> From my purely speculative perspective, it seems unavoidable that
> message-passing happens at some level in order to keep a shared data
> structure in sync between a large number of processors. In other words,
> any access to a shared data structure requires some physical copy no
> matter what the programming language makes it look like.
>

I assume you are referring to multi-processing were memory is not shared 
amongst CPU's, correct?

Hugo F.


>
> Martin
>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 19:19               ` Hugo Ferreira
@ 2011-03-25 20:26                 ` Gerd Stolpmann
  2011-03-26  9:11                   ` Hugo Ferreira
                                     ` (2 more replies)
  2011-04-19  9:57                 ` Eray Ozkural
  2011-04-19 22:49                 ` Jon Harrop
  2 siblings, 3 replies; 74+ messages in thread
From: Gerd Stolpmann @ 2011-03-25 20:26 UTC (permalink / raw)
  To: Hugo Ferreira; +Cc: Martin Jambon, caml-list

Am Freitag, den 25.03.2011, 19:19 +0000 schrieb Hugo Ferreira:
> On 03/25/2011 06:24 PM, Martin Jambon wrote:
> >> On 03/25/2011 01:10 PM, Fabrice Le Fessant wrote:
> >>>    Of course, sharing structured mutable data between threads will not be
> >>> possible, but actually, it is a good thing if you want to write correct
> >>> programs ;-)
> >
> > On 03/25/11 08:44, Hugo Ferreira replied:
> >> I'll stick to my guns here. It simply makes solving certain problem
> >> unfeasible. Point in case: I work on machine learning algorithms. I
> >> use large data-structures that must be processed (altered)
> >> in order to learn. Because these data-structures are large it become
> >> impractical to copy this to a process every time I start off a new
> >> "thread".
> >
> > The solution would be to use get/set via a message-passing interface.
> >
> 
> Cannot see how this works. Say I want to share a balanced binary tree.
> Several processes/threads each take this tree and alter it by adding and
> deleting elements. Each (new) tree is then further processed by other
> processes/threads.
> 
> How can get/set be used in this scenario?
> 
> >> From my purely speculative perspective, it seems unavoidable that
> > message-passing happens at some level in order to keep a shared data
> > structure in sync between a large number of processors. In other words,
> > any access to a shared data structure requires some physical copy no
> > matter what the programming language makes it look like.
> >
> 
> I assume you are referring to multi-processing were memory is not shared 
> amongst CPU's, correct?

This is quite normal nowadays when you have several CPUs in a (server)
system. Each CPU gets its own cache, or even its own bank of RAM (e.g.
all Opterons have this). And there is indeed message passing on the
hardware level (HyperTransport, QuickPath, or Infiniband). For the
software, it still looks as if memory was uniform (cache coherency), but
under the hood messages are exchanged to get this effect (or even RAM
accesses are routed over the CPU interconnect).

The messages are only noticeable from software as speed degradation when
you access RAM in the wrong way. E.g. you can see this when you
read/modify/write the same memory cell in a loop from two threads
running on two cores. This is a lot slower than if only a single core
did this because the two cores constantly exchange messages and have to
wait until the delivery of the message is done (this is also known as
cache line bouncing).

When you implement a message passing API for a high level language, the
way to go is to provide message buffers in memory. When a thread
delivers a message it just writes to the buffer. The real message,
however, is sent by the hardware under the hood, and must be delivered
until the reading thread synchronizes. The point is here, IMHO, that you
exploit the features of the hardware the most when you do it in a way
the hardware can deal best with. Thus a program written against the
message passing API will finally be faster than one that uncritically
assumes a uniform memory, and modifies memory in-place, and will finally
suffer from cache line bouncing.

For current standard server hardware there are usually not enough CPUs
to get a big effect. But in a few years it will be very important (as it
is for current supercomputers), maybe even for standard 128 core laptops
in 2015 (just guessing).

Gerd

> 
> Hugo F.
> 
> 
> >
> > Martin
> >
> 
> 

-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 15:44           ` Hugo Ferreira
  2011-03-25 18:24             ` Martin Jambon
@ 2011-03-25 20:27             ` Philippe Strauss
  1 sibling, 0 replies; 74+ messages in thread
From: Philippe Strauss @ 2011-03-25 20:27 UTC (permalink / raw)
  To: Hugo Ferreira; +Cc: Fabrice Le Fessant, Gerd Stolpmann, caml-list


Le 25 mars 2011 à 16:44, Hugo Ferreira a écrit :

> Hi,
> 
> Don't mean to drag this on but...
> 
> On 03/25/2011 01:10 PM, Fabrice Le Fessant wrote:
>> On 03/25/2011 01:58 PM, Hugo Ferreira wrote:
>>>>> Assuming all shared data structures are immutable is it possible to:
>> 
>> Well, Java has fully multi-threaded garbage collection, so there is no
>> point that it is possible to do it.
>> 
>> The problem is that it has a cost, and the cost is a huge slowdown on
>> mono-threaded programs. Since most OCaml programs are mono-threaded, and
>> only few programs would really benefit from multi-threading, we are
>> trying to come up with a solution that would satisfy both worlds.
>> Mono-threaded programs would still run as fast (possibly using a
>> non-reentrant version of the runtime, to avoid the cost of TLS if it is
>> visible), and multi-threaded programs can be written easily.
>> 
> 
> Personally I think this is an "impossible" feat.
> Why not simply allow for the existence of two GC's and let
> the programmer select the one that is best for the application?
> 

I second this one.

Seems GC and caml are heavily intricated, but modularizing this part seems a future proof investment in effort.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 20:26                 ` Gerd Stolpmann
@ 2011-03-26  9:11                   ` Hugo Ferreira
  2011-03-26 10:31                   ` Richard W.M. Jones
  2011-04-20 21:44                   ` Jon Harrop
  2 siblings, 0 replies; 74+ messages in thread
From: Hugo Ferreira @ 2011-03-26  9:11 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Martin Jambon, caml-list

On 03/25/2011 08:26 PM, Gerd Stolpmann wrote:
> Am Freitag, den 25.03.2011, 19:19 +0000 schrieb Hugo Ferreira:
>> On 03/25/2011 06:24 PM, Martin Jambon wrote:
>>>> On 03/25/2011 01:10 PM, Fabrice Le Fessant wrote:
>>>>>     Of course, sharing structured mutable data between threads will not be
>>>>> possible, but actually, it is a good thing if you want to write correct
>>>>> programs ;-)
>>>
>>> On 03/25/11 08:44, Hugo Ferreira replied:
>>>> I'll stick to my guns here. It simply makes solving certain problem
>>>> unfeasible. Point in case: I work on machine learning algorithms. I
>>>> use large data-structures that must be processed (altered)
>>>> in order to learn. Because these data-structures are large it become
>>>> impractical to copy this to a process every time I start off a new
>>>> "thread".
>>>
>>> The solution would be to use get/set via a message-passing interface.
>>>
>>
>> Cannot see how this works. Say I want to share a balanced binary tree.
>> Several processes/threads each take this tree and alter it by adding and
>> deleting elements. Each (new) tree is then further processed by other
>> processes/threads.
>>
>> How can get/set be used in this scenario?
>>
>>>>  From my purely speculative perspective, it seems unavoidable that
>>> message-passing happens at some level in order to keep a shared data
>>> structure in sync between a large number of processors. In other words,
>>> any access to a shared data structure requires some physical copy no
>>> matter what the programming language makes it look like.
>>>
>>
>> I assume you are referring to multi-processing were memory is not shared
>> amongst CPU's, correct?
>
> This is quite normal nowadays when you have several CPUs in a (server)
> system. Each CPU gets its own cache, or even its own bank of RAM (e.g.
> all Opterons have this). And there is indeed message passing on the
> hardware level (HyperTransport, QuickPath, or Infiniband). For the
> software, it still looks as if memory was uniform (cache coherency), but
> under the hood messages are exchanged to get this effect (or even RAM
> accesses are routed over the CPU interconnect).
>
> The messages are only noticeable from software as speed degradation when
> you access RAM in the wrong way. E.g. you can see this when you
> read/modify/write the same memory cell in a loop from two threads
> running on two cores. This is a lot slower than if only a single core
> did this because the two cores constantly exchange messages and have to
> wait until the delivery of the message is done (this is also known as
> cache line bouncing).
>
> When you implement a message passing API for a high level language, the
> way to go is to provide message buffers in memory. When a thread
> delivers a message it just writes to the buffer. The real message,
> however, is sent by the hardware under the hood, and must be delivered
> until the reading thread synchronizes. The point is here, IMHO, that you
> exploit the features of the hardware the most when you do it in a way
> the hardware can deal best with. Thus a program written against the
> message passing API will finally be faster than one that uncritically
> assumes a uniform memory, and modifies memory in-place, and will finally
> suffer from cache line bouncing.
>
> For current standard server hardware there are usually not enough CPUs
> to get a big effect. But in a few years it will be very important (as it
> is for current supercomputers), maybe even for standard 128 core laptops
> in 2015 (just guessing).
>

Thanks for the info.


Hugo

> Gerd
>
>
>>
>> Hugo F.
>>
>>
>>>
>>> Martin
>>>
>>
>>
>
>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 20:26                 ` Gerd Stolpmann
  2011-03-26  9:11                   ` Hugo Ferreira
@ 2011-03-26 10:31                   ` Richard W.M. Jones
  2011-03-30 16:56                     ` Jon Harrop
  2011-04-20 21:44                   ` Jon Harrop
  2 siblings, 1 reply; 74+ messages in thread
From: Richard W.M. Jones @ 2011-03-26 10:31 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Hugo Ferreira, Martin Jambon, caml-list

Obligatory pointer to Uli Drepper's series about what every programmer
should know about memory.  Part 1 is here, and parts 2-9 are linked to
at the end of the article just before the comments:

http://lwn.net/Articles/250967/

To expand on what I said before:

The sort of high end hardware we're seeing now has 128 cores and
hundreds of gigabytes or even a terabyte of non-uniform RAM.

The cores are grouped into NUMA nodes with each node having its own
attached memory, southbridge and hardware (network ports etc) -- in
effect separate computers interconnected with some very fast and very
specialized "network" channels that exist but are invisible to the
programmer.

If you 'cross' a node boundary, eg. by having a program or its data
located in the memory in one node and running on a core in another
node, then you suffer some penalty (10-40% directly, plus a lot of
hard-to-measure indirect costs from consuming channel resources
between nodes).

Obviously we try hard to schedule things and to pin processes, virtual
machines and so on so that this never happens.

Having said the above, even straight SMP isn't very uniform.  You've
still got to take into account caches and cache consistency (which at
the hardware level is message passing or bus snooping).

Rich.

-- 
Richard Jones
Red Hat

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-26 10:31                   ` Richard W.M. Jones
@ 2011-03-30 16:56                     ` Jon Harrop
  2011-03-30 19:24                       ` Richard W.M. Jones
  0 siblings, 1 reply; 74+ messages in thread
From: Jon Harrop @ 2011-03-30 16:56 UTC (permalink / raw)
  To: 'Richard W.M. Jones'; +Cc: caml-list

Richard Jones wrote:
> The sort of high end hardware we're seeing now has 128 cores and hundreds
of
> gigabytes or even a terabyte of non-uniform RAM.

FWIW, Azul have been shipping 864-core machines for years now. Note also
that they use a single multi-threaded GC for the JVM to write efficient
parallel code (albeit with some hardware support) and not separate GCs with
message passing (which is basically unheard of). 

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 18:45   ` Andrei Formiga
@ 2011-03-30 17:00     ` Jon Harrop
  0 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-03-30 17:00 UTC (permalink / raw)
  To: 'Andrei Formiga'; +Cc: caml-list

Andrei Formiga wrote:
> This looks very interesting. I know there is a lot of hype around
multicore, and
> many programs wouldn't benefit from this...

IME, many programs benefit enormously from multicore. Particularly time
consuming ones. Multicore doesn't have to be hard...

Cheers,
Jon.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 18:24             ` Martin Jambon
  2011-03-25 19:19               ` Hugo Ferreira
@ 2011-03-30 17:02               ` Jon Harrop
  2011-04-20 19:23               ` Jon Harrop
  2 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-03-30 17:02 UTC (permalink / raw)
  To: 'Martin Jambon', caml-list

Martin Jambon wrote:
> The solution would be to use get/set via a message-passing interface.

That is rarely a solution in the context of parallel programming because the sole point is to improve performance and that is hugely inefficient (you are introducing massive contention for shared resources unnecessarily).

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 13:41           ` Dario Teixeira
@ 2011-03-30 18:12             ` Jon Harrop
  0 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-03-30 18:12 UTC (permalink / raw)
  To: 'Dario Teixeira', 'Fabrice Le Fessant'; +Cc: caml-list

Dario wrote:
> Fabrice wrote:
> > Of course, sharing structured mutable data between threads will not be
> > possible, but actually, it is a good thing if you want to write
> > correct programs ;-)
> 
> This is a good point that is often neglected in multicore discussions.

There be dragons.

There is a critical separation of concerns here. On one side you have parallel programming for multicores, the sole objective of which is to increase throughput on those machines. On the other side you have concurrent programming, which is primarily concerned with handling the complexity of non-determinism and often latency as well as throughput. These are two different concepts with different goals and different solutions.

Due to the memory hierarchy on a multicore, mutating shared data is basically essential if you want to get decent performance from a parallel program. If you don't do this, you end up with cache misses from all cores contending for shared memory and, therefore, no performance gain from more cores.

You can easily see this effect by considering the following embarrassingly-parallel serial program:

  Array.init 10000 (fun _ -> Array.init 10000 float)

This is trivial to parallelize in F#:

  Array.Parallel.init 10000 (fun _ -> Array.init 10000 float)

But I see just 2.4x speedup going from 1 to 8 cores by doing this because of the massive contention for shared memory. Specifically, the memory bandwidth on this 2x quadcore Xeon box is enough to feed 2.4 of these cores but not all 8 of them.

The Haskell guys got this wrong:

  http://flyingfrogblog.blogspot.com/2010/06/regular-shape-polymorphic-parallel.html

Their parallel Laplace code is a particularly good example because it uses the anti-pattern:

  for t=0 to m-1 do
    parallel for i=0 to n-1 do
      for j=0 to n-1 do
        ... xs.(i).(j) ...

This is bad because there is no correlation (aka "temporal locality") between the tasks spawned by the inner parallel "for" loop and the cores they run on. Consequently, each iteration of the outer serial "for" loop requires different data on different cores and the machine is swamped with cache misses so they see (surprise!) only 2.4x speedup on their 8-core Xeon when other solution get faster serial performance and 5-7x speedups on 8-core Xeons.

One solution that attains excellent performance on multicores optimizes the algorithm by reducing the asymptotic cache miss rate (aka "cache complexity") by decomposing the space-time (t, i, j) into trapezoids and relying upon an implementation that uses work-stealing queues where child tasks executed on the same core as their parent task with high probability. See the excellent paper "Cache oblivious stencil computations" by Matteo Frigo (of FFTW fame) for more details:

  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.6578&rep=rep1&type=pdf

This style of parallel programming is now the defacto standard for multicores, with popular implementations provided by both the Intel TBB and Microsoft TPL (in .NET 4).

Mutating shared data is a bad idea in the context of concurrent programming where throughput is not important and handling non-determinism is usually the main headache. Here, you can sacrifice throughput by introducing abstractions that make it much easier to write correct programs. Abstractions like agents.

> What kind of language constructs would also be introduced in order to manage the concurrency between the multiple threads?

Just use message passing between agents. For a concrete implementation, look at the MailboxProcessor in the F# standard library.

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-30 16:56                     ` Jon Harrop
@ 2011-03-30 19:24                       ` Richard W.M. Jones
  0 siblings, 0 replies; 74+ messages in thread
From: Richard W.M. Jones @ 2011-03-30 19:24 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

On Wed, Mar 30, 2011 at 05:56:25PM +0100, Jon Harrop wrote:
> Richard Jones wrote:
> > The sort of high end hardware we're seeing now has 128 cores and hundreds
> of
> > gigabytes or even a terabyte of non-uniform RAM.
> 
> FWIW, Azul have been shipping 864-core machines for years now.
> Note also that they use a single multi-threaded GC for the JVM to
> write efficient parallel code (albeit with some hardware support)
> and not separate GCs with message passing (which is basically
> unheard of).

That's great, but Azul are selling specialized Java hardware and I was
talking about x86, Intel, AMD, HyperTransport, QuickPath and so on.

My point (about x86 hardware) wasn't that message passing was
necessary or the right or only way to do it in software.  It was that
message passing happens at the hardware level and that the programmer
needs to understand that to write efficient software.  Of course with
a lot of hard work you can write multithreaded garbage collectors,
apps etc which don't do bad stuff on NUMA.

Rich.

-- 
Richard Jones
Red Hat

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-24 23:13 ` [Caml-list] Efficient OCaml multicore -- roadmap? Fabrice Le Fessant
                     ` (4 preceding siblings ...)
  2011-03-25 18:45   ` Andrei Formiga
@ 2011-04-13  3:36   ` Lucas Dixon
  2011-04-13 13:01     ` Gerd Stolpmann
  2011-04-16 13:54     ` Jon Harrop
  5 siblings, 2 replies; 74+ messages in thread
From: Lucas Dixon @ 2011-04-13  3:36 UTC (permalink / raw)
  To: caml-list

On 24/03/2011 19:13, Fabrice Le Fessant wrote:
> On 03/24/2011 02:45 PM, Alexy Khrabrov wrote:
>>> Where does the OCaml team stand on the multicore issues?  A year
>>> or so ago, when there was a prototype parallel GC implementation,
>>> IIRC, Xavier said it has to be done right.  So what are the
>>> official plans and the status of integrating what volunteers had
>>> done?
>>>
>>> WIth Scala having a robust actors model and AKKA kernel, and
>>> Clojure built around efficient shared memory concurrency with
>>> agents and references and STM, and Haskell also really parallel,
>>> OCaml is lacking behind.  Furthermore, F# builds on strongly
>>> parallel .NET, overcoming granddaddy.  With multicores common
>>> even in laptops and iPads, we need an efficient  multicore OCaml!
>>> Due to the model different from Haskell or Scala and Clojure, now
>>> all on github, OCaml is both more stable and also is slower to
>>> advance -- what do folks think about this situation?  How do you
>>> do shared memory parallelism now?
>>>
>>> Cheers, Alexy
>>>

The PolyML implementation of SML has had a multi-core runtime, without 
slowing down single-threaded programs for a few years now. It also has a 
concurrent garbage collector. The main problem currently seems to be 
that a lot of garbage gets produced, and memory access speeds end up 
being the main bottle-neck. But we do see 4-5x speed-up on 8 core 
machines for large real-world applications (in SML, our real world means 
a big formal proof development, or graph rewriting at the moment... but 
there's more fun stuff possible: http://kidkarolis.github.com/PolyChrome/)

Anyway, I suspect you'll run into some of the same issues as PolyML 
encountered; there's a paper or two by David Matthews on this stuff; 
(here's the first one I found: 
http://portal.acm.org/citation.cfm?id=1708058&dl=ACM) and of course 
there is the source code too:
http://polyml.svn.sourceforge.net/viewvc/polyml/

Good luck with the OCaml version; it's been wanted for a long time! :)

best,
lucas



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-13  3:36   ` Lucas Dixon
@ 2011-04-13 13:01     ` Gerd Stolpmann
  2011-04-13 13:09       ` Lucas Dixon
  2011-04-13 23:04       ` Goswin von Brederlow
  2011-04-16 13:54     ` Jon Harrop
  1 sibling, 2 replies; 74+ messages in thread
From: Gerd Stolpmann @ 2011-04-13 13:01 UTC (permalink / raw)
  To: Lucas Dixon; +Cc: caml-list

Am Dienstag, den 12.04.2011, 23:36 -0400 schrieb Lucas Dixon:
> On 24/03/2011 19:13, Fabrice Le Fessant wrote:
> > On 03/24/2011 02:45 PM, Alexy Khrabrov wrote:
> >>> Where does the OCaml team stand on the multicore issues?  A year
> >>> or so ago, when there was a prototype parallel GC implementation,
> >>> IIRC, Xavier said it has to be done right.  So what are the
> >>> official plans and the status of integrating what volunteers had
> >>> done?
> >>>
> >>> WIth Scala having a robust actors model and AKKA kernel, and
> >>> Clojure built around efficient shared memory concurrency with
> >>> agents and references and STM, and Haskell also really parallel,
> >>> OCaml is lacking behind.  Furthermore, F# builds on strongly
> >>> parallel .NET, overcoming granddaddy.  With multicores common
> >>> even in laptops and iPads, we need an efficient  multicore OCaml!
> >>> Due to the model different from Haskell or Scala and Clojure, now
> >>> all on github, OCaml is both more stable and also is slower to
> >>> advance -- what do folks think about this situation?  How do you
> >>> do shared memory parallelism now?
> >>>
> >>> Cheers, Alexy
> >>>
> 
> The PolyML implementation of SML has had a multi-core runtime, without 
> slowing down single-threaded programs for a few years now.

How do you know that there are no slow-downs? Has it ever been tried to
get rid of all additional measures that are necessary to support
multicore? There might be e.g. architectural simplifications that would
lead to speed-ups for the single-threaded case.

>  It also has a 
> concurrent garbage collector. The main problem currently seems to be 
> that a lot of garbage gets produced, and memory access speeds end up 
> being the main bottle-neck. But we do see 4-5x speed-up on 8 core 
> machines for large real-world applications (in SML, our real world means 
> a big formal proof development, or graph rewriting at the moment... but 
> there's more fun stuff possible: http://kidkarolis.github.com/PolyChrome/)
> 
> Anyway, I suspect you'll run into some of the same issues as PolyML 
> encountered; there's a paper or two by David Matthews on this stuff; 
> (here's the first one I found: 
> http://portal.acm.org/citation.cfm?id=1708058&dl=ACM) and of course 
> there is the source code too:
> http://polyml.svn.sourceforge.net/viewvc/polyml/
> 
> Good luck with the OCaml version; it's been wanted for a long time! :)

You may be interested then in this pre-announcement I posted yesterday:

http://blog.camlcity.org/blog/multicore1.html

I was able to make good progress on my Netmulticore library, which -
without changing anything in the compiler or runtime - allows it to
write multicore-enabled shared memory programs following the
"multi-runtime" design (which is emulated by forking subprocesses).

More on this later. I'm working on a release.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-13 13:01     ` Gerd Stolpmann
@ 2011-04-13 13:09       ` Lucas Dixon
  2011-04-13 23:04       ` Goswin von Brederlow
  1 sibling, 0 replies; 74+ messages in thread
From: Lucas Dixon @ 2011-04-13 13:09 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: caml-list

On 13/04/2011 09:01, Gerd Stolpmann wrote:
> Am Dienstag, den 12.04.2011, 23:36 -0400 schrieb Lucas Dixon:
>> On 24/03/2011 19:13, Fabrice Le Fessant wrote:
>>> On 03/24/2011 02:45 PM, Alexy Khrabrov wrote:
>>>>> Where does the OCaml team stand on the multicore issues?  A year
>>>>> or so ago, when there was a prototype parallel GC implementation,
>>>>> IIRC, Xavier said it has to be done right.  So what are the
>>>>> official plans and the status of integrating what volunteers had
>>>>> done?
>>>>>
>>>>> WIth Scala having a robust actors model and AKKA kernel, and
>>>>> Clojure built around efficient shared memory concurrency with
>>>>> agents and references and STM, and Haskell also really parallel,
>>>>> OCaml is lacking behind.  Furthermore, F# builds on strongly
>>>>> parallel .NET, overcoming granddaddy.  With multicores common
>>>>> even in laptops and iPads, we need an efficient  multicore OCaml!
>>>>> Due to the model different from Haskell or Scala and Clojure, now
>>>>> all on github, OCaml is both more stable and also is slower to
>>>>> advance -- what do folks think about this situation?  How do you
>>>>> do shared memory parallelism now?
>>>>>
>>>>> Cheers, Alexy
>>>>>
>>
>> The PolyML implementation of SML has had a multi-core runtime, without
>> slowing down single-threaded programs for a few years now.
>
> How do you know that there are no slow-downs? Has it ever been tried to
> get rid of all additional measures that are necessary to support
> multicore? There might be e.g. architectural simplifications that would
> lead to speed-ups for the single-threaded case.

Before it was introduced, PolyML was single-threaded. We compared 
running on the single-thread version of PolyML against using a single 
thread in the multi-threaded version. It may be that some optimisations 
were also introduced that cancelled out a slowdown from the 
multi-threaded version, but I can't remember anyone mentioning this.

best,
lucas

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-13 13:01     ` Gerd Stolpmann
  2011-04-13 13:09       ` Lucas Dixon
@ 2011-04-13 23:04       ` Goswin von Brederlow
  1 sibling, 0 replies; 74+ messages in thread
From: Goswin von Brederlow @ 2011-04-13 23:04 UTC (permalink / raw)
  To: caml-list

Gerd Stolpmann <info@gerd-stolpmann.de> writes:

> You may be interested then in this pre-announcement I posted yesterday:
>
> http://blog.camlcity.org/blog/multicore1.html
>
> I was able to make good progress on my Netmulticore library, which -
> without changing anything in the compiler or runtime - allows it to
> write multicore-enabled shared memory programs following the
> "multi-runtime" design (which is emulated by forking subprocesses).
>
> More on this later. I'm working on a release.
>
> Gerd

I believe this is a good way to go. In many cases you have only a small
set of shared data (might be huge in bytes but small in structure) and
then you have a ton of purely local (temporary) data each thread creates
while processing. Idealy I would prefer the compiler detecting which
data is local and which shared but I don't think that is easy if even
feasable to get right. So let the user do some work.

A slightly different design could be to still have the local and shared
heap but automaticaly migrate values to the shared heap if and when they
are shared the first time. This would mean dropping the automatic locks
on shared values and have the user take care of locking properly as
needed.

MfG
        Goswin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 12:07       ` Gerd Stolpmann
@ 2011-04-16 12:12         ` Jon Harrop
  0 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-16 12:12 UTC (permalink / raw)
  To: 'Gerd Stolpmann'; +Cc: caml-list

Gerd wrote:
> AFAIK TLS is not costless but relatively cheap.

Actually TLS is very expensive in this context because thread-local data is accessed so frequently. When I hoisted the use of TLS by carrying pointers to thread-local data around (altering the calling convention) in HLVM it resulted in 1.5-2x speedups on all benchmarks and the result was then only 17% slower than serial code.

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-13  3:36   ` Lucas Dixon
  2011-04-13 13:01     ` Gerd Stolpmann
@ 2011-04-16 13:54     ` Jon Harrop
  1 sibling, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-16 13:54 UTC (permalink / raw)
  To: caml-list

Lucas wrote:
> The PolyML implementation of SML has had a multi-core runtime, without
> slowing down single-threaded programs for a few years now... we do see
> 4-5x speed-up on 8 core machines for large real-world applications (in
SML,
> our real world means a big formal proof development...

Really great to see such good work being done on parallelism in ML!

The following recent papers on PolyML's fast multicore garbage collector
were very interesting:

http://www4.in.tum.de/~wenzelm/papers/parallel-ml.pdf
http://www4.in.tum.de/~wenzelm/papers/itp-smp.pdf

Cheers,
Jon. 



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 19:19               ` Hugo Ferreira
  2011-03-25 20:26                 ` Gerd Stolpmann
@ 2011-04-19  9:57                 ` Eray Ozkural
  2011-04-19 10:05                   ` Hugo Ferreira
  2011-04-19 20:26                   ` Gerd Stolpmann
  2011-04-19 22:49                 ` Jon Harrop
  2 siblings, 2 replies; 74+ messages in thread
From: Eray Ozkural @ 2011-04-19  9:57 UTC (permalink / raw)
  To: Hugo Ferreira; +Cc: Martin Jambon, caml-list

[-- Attachment #1: Type: text/plain, Size: 1837 bytes --]

On Fri, Mar 25, 2011 at 9:19 PM, Hugo Ferreira <hmf@inescporto.pt> wrote:

> On 03/25/2011 06:24 PM, Martin Jambon wrote:
>
>> On 03/25/2011 01:10 PM, Fabrice Le Fessant wrote:
>>>
>>>>   Of course, sharing structured mutable data between threads will not be
>>>> possible, but actually, it is a good thing if you want to write correct
>>>> programs ;-)
>>>>
>>>
>> On 03/25/11 08:44, Hugo Ferreira replied:
>>
>>> I'll stick to my guns here. It simply makes solving certain problem
>>> unfeasible. Point in case: I work on machine learning algorithms. I
>>> use large data-structures that must be processed (altered)
>>> in order to learn. Because these data-structures are large it become
>>> impractical to copy this to a process every time I start off a new
>>> "thread".
>>>
>>
>> The solution would be to use get/set via a message-passing interface.
>>
>>
> Cannot see how this works. Say I want to share a balanced binary tree.
> Several processes/threads each take this tree and alter it by adding and
> deleting elements. Each (new) tree is then further processed by other
> processes/threads.
>
> How can get/set be used in this scenario?
>
>
I think it won't have good performance and it won't scale, and it will fail
for truly delicate shared memory architectures of the future with thousands
of cores....

And neither will it support on-chip message passing facilities of those
future processors.

The shared memory message passing never worked too well, anyway, too many
redundant copies. Not fitting for high performance computing.

No need at all except for embarrassingly parallel applications. I suppose
that's the target, right?

Best,

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct

[-- Attachment #2: Type: text/html, Size: 2923 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-19  9:57                 ` Eray Ozkural
@ 2011-04-19 10:05                   ` Hugo Ferreira
  2011-04-19 20:26                   ` Gerd Stolpmann
  1 sibling, 0 replies; 74+ messages in thread
From: Hugo Ferreira @ 2011-04-19 10:05 UTC (permalink / raw)
  To: caml-list

On 04/19/2011 10:57 AM, Eray Ozkural wrote:
>
>
> On Fri, Mar 25, 2011 at 9:19 PM, Hugo Ferreira <hmf@inescporto.pt
> <mailto:hmf@inescporto.pt>> wrote:
>
>     On 03/25/2011 06:24 PM, Martin Jambon wrote:
>
>             On 03/25/2011 01:10 PM, Fabrice Le Fessant wrote:
>
>                    Of course, sharing structured mutable data between
>                 threads will not be
>                 possible, but actually, it is a good thing if you want
>                 to write correct
>                 programs ;-)
>
>
>         On 03/25/11 08:44, Hugo Ferreira replied:
>
>             I'll stick to my guns here. It simply makes solving certain
>             problem
>             unfeasible. Point in case: I work on machine learning
>             algorithms. I
>             use large data-structures that must be processed (altered)
>             in order to learn. Because these data-structures are large
>             it become
>             impractical to copy this to a process every time I start off
>             a new
>             "thread".
>
>
>         The solution would be to use get/set via a message-passing
>         interface.
>
>
>     Cannot see how this works. Say I want to share a balanced binary tree.
>     Several processes/threads each take this tree and alter it by adding and
>     deleting elements. Each (new) tree is then further processed by other
>     processes/threads.
>
>     How can get/set be used in this scenario?
>
>
> I think it won't have good performance and it won't scale, and it will
> fail for truly delicate shared memory architectures of the future with
> thousands of cores....
>
> And neither will it support on-chip message passing facilities of those
> future processors.
>
> The shared memory message passing never worked too well, anyway, too
> many redundant copies. Not fitting for high performance computing.
>
> No need at all except for embarrassingly parallel applications. I
> suppose that's the target, right?
>

Yes. The problem is decomposable and can use sampling.
The idea is to have threads that take a data-set decompose it and make
the smaller parts available for further (parallel) processing.

Regards,
HF

> Best,
>
> --
> Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
> http://groups.yahoo.com/group/ai-philosophy
> http://myspace.com/arizanesil http://myspace.com/malfunct
>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-19  9:57                 ` Eray Ozkural
  2011-04-19 10:05                   ` Hugo Ferreira
@ 2011-04-19 20:26                   ` Gerd Stolpmann
  2011-04-20  7:59                     ` Hugo Ferreira
  1 sibling, 1 reply; 74+ messages in thread
From: Gerd Stolpmann @ 2011-04-19 20:26 UTC (permalink / raw)
  To: Eray Ozkural; +Cc: Hugo Ferreira, Martin Jambon, caml-list

Am Dienstag, den 19.04.2011, 12:57 +0300 schrieb Eray Ozkural:
> 
> 
> On Fri, Mar 25, 2011 at 9:19 PM, Hugo Ferreira <hmf@inescporto.pt>
> wrote:
>         On 03/25/2011 06:24 PM, Martin Jambon wrote:
>                         On 03/25/2011 01:10 PM, Fabrice Le Fessant
>                         wrote:
>                                   Of course, sharing structured
>                                 mutable data between threads will not
>                                 be
>                                 possible, but actually, it is a good
>                                 thing if you want to write correct
>                                 programs ;-)
>                 
>                 On 03/25/11 08:44, Hugo Ferreira replied:
>                         I'll stick to my guns here. It simply makes
>                         solving certain problem
>                         unfeasible. Point in case: I work on machine
>                         learning algorithms. I
>                         use large data-structures that must be
>                         processed (altered)
>                         in order to learn. Because these
>                         data-structures are large it become
>                         impractical to copy this to a process every
>                         time I start off a new
>                         "thread".
>                 
>                 The solution would be to use get/set via a
>                 message-passing interface.
>                 
>         
>         
>         Cannot see how this works. Say I want to share a balanced
>         binary tree.
>         Several processes/threads each take this tree and alter it by
>         adding and
>         deleting elements. Each (new) tree is then further processed
>         by other
>         processes/threads.
>         
>         How can get/set be used in this scenario?
>         
>         
> 
> 
> I think it won't have good performance and it won't scale, and it will
> fail for truly delicate shared memory architectures of the future with
> thousands of cores....
> 
> 
> And neither will it support on-chip message passing facilities of
> those future processors.
> 
> 
> The shared memory message passing never worked too well, anyway, too
> many redundant copies. Not fitting for high performance computing. 
> 
> 
> No need at all except for embarrassingly parallel applications. I
> suppose that's the target, right?

I did experiment a bit in the meantime with Netmulticore [1], my
implementation of multi-processing with shared memory. Netmulticore
provides both alterable data structures and a quite efficient message
passing interface. The experience so far: Message passing wins if you do
it the right way. Avoid ping-pong games like get/set - shared memory is
far better for this kind of operation. The better topology for message
passing are pipelines where data flows only into one direction.

I don't know how this translates to Hugo's machine learning problem. I
could imagine a shared data structure is good here for providing the
starting point for learning. If you run several learning steps in
parallel, you want to avoid that these steps lock out each other, i.e.
try to ensure that they affect distinct parts of the matrix. These
updates would be sent (using message passing) to an update manager,
which would apply the learning results to the matrix, and compute the
next version, for which a number of new learning steps would be started.
I'm just guessing here how it could be done. In my imagination a clever
combination of both (alterable) shared memory and message passing is the
way to go.

Hugo, I'd like to do some more experiments into this direction. Is there
a simple version of machine learning algorithm I could try to
parallelize?

Gerd

[1] Netmulticore: now available in ocamlnet-3.3.0test1:
http://blog.camlcity.org/blog/multicore2.html

> 
> 
> Best,
> 
> -- 
> Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University,
> Ankara
> http://groups.yahoo.com/group/ai-philosophy
> http://myspace.com/arizanesil http://myspace.com/malfunct
> 


-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 13:10         ` Fabrice Le Fessant
  2011-03-25 13:41           ` Dario Teixeira
  2011-03-25 15:44           ` Hugo Ferreira
@ 2011-04-19 22:47           ` Jon Harrop
       [not found]           ` <869445701.579183.1303253283515.JavaMail.root@zmbs4.inria.fr>
  3 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-19 22:47 UTC (permalink / raw)
  To: 'Fabrice Le Fessant'; +Cc: caml-list

Fabrice wrote:
> Well, Java has fully multi-threaded garbage collection, so there is no point that it
> is possible to do it.
> 
> The problem is that it has a cost, and the cost is a huge slowdown on mono-
> threaded programs.

Lucas says "zero". You say "huge". Can we do a better job of quantifying this? Where do you believe the slowdown would come from? Given that GC intensive OCaml programs can spend 30% of their time in the GC, do you not think it might be possible to improve performance by exploiting more cores? Can you give example code that you would expect to be adversely affected the most by a multi-threaded runtime?

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 19:19               ` Hugo Ferreira
  2011-03-25 20:26                 ` Gerd Stolpmann
  2011-04-19  9:57                 ` Eray Ozkural
@ 2011-04-19 22:49                 ` Jon Harrop
  2 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-19 22:49 UTC (permalink / raw)
  To: 'Hugo Ferreira'; +Cc: caml-list

Huge wrote:
> Cannot see how this works. Say I want to share a balanced binary tree.
> Several processes/threads each take this tree and alter it by adding and deleting
> elements. Each (new) tree is then further processed by other processes/threads.

Note that you don't want to manipulate a tree on one thread and then pass it to another for processing because the threads are likely to be running on different cores and you will be needlessly moving the tree from one core's cache to another core's cache. Better that any given core runs the different algorithms on the same data to reduce communication.

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-19 20:26                   ` Gerd Stolpmann
@ 2011-04-20  7:59                     ` Hugo Ferreira
  2011-04-20 12:30                       ` Markus Mottl
  2011-04-20 14:00                       ` Edgar Friendly
  0 siblings, 2 replies; 74+ messages in thread
From: Hugo Ferreira @ 2011-04-20  7:59 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Eray Ozkural, Martin Jambon, caml-list

On 04/19/2011 09:26 PM, Gerd Stolpmann wrote:
> Am Dienstag, den 19.04.2011, 12:57 +0300 schrieb Eray Ozkural:
>>
>>
>> On Fri, Mar 25, 2011 at 9:19 PM, Hugo Ferreira<hmf@inescporto.pt>
>> wrote:
>>          On 03/25/2011 06:24 PM, Martin Jambon wrote:
>>                          On 03/25/2011 01:10 PM, Fabrice Le Fessant
>>                          wrote:
>>                                    Of course, sharing structured
>>                                  mutable data between threads will not
>>                                  be
>>                                  possible, but actually, it is a good
>>                                  thing if you want to write correct
>>                                  programs ;-)
>>
>>                  On 03/25/11 08:44, Hugo Ferreira replied:
>>                          I'll stick to my guns here. It simply makes
>>                          solving certain problem
>>                          unfeasible. Point in case: I work on machine
>>                          learning algorithms. I
>>                          use large data-structures that must be
>>                          processed (altered)
>>                          in order to learn. Because these
>>                          data-structures are large it become
>>                          impractical to copy this to a process every
>>                          time I start off a new
>>                          "thread".
>>
>>                  The solution would be to use get/set via a
>>                  message-passing interface.
>>
>>
>>
>>          Cannot see how this works. Say I want to share a balanced
>>          binary tree.
>>          Several processes/threads each take this tree and alter it by
>>          adding and
>>          deleting elements. Each (new) tree is then further processed
>>          by other
>>          processes/threads.
>>
>>          How can get/set be used in this scenario?
>>
>>
>>
>>
>> I think it won't have good performance and it won't scale, and it will
>> fail for truly delicate shared memory architectures of the future with
>> thousands of cores....
>>
>>
>> And neither will it support on-chip message passing facilities of
>> those future processors.
>>
>>
>> The shared memory message passing never worked too well, anyway, too
>> many redundant copies. Not fitting for high performance computing.
>>
>>
>> No need at all except for embarrassingly parallel applications. I
>> suppose that's the target, right?
>
> I did experiment a bit in the meantime with Netmulticore [1], my
> implementation of multi-processing with shared memory. Netmulticore
> provides both alterable data structures and a quite efficient message
> passing interface. The experience so far: Message passing wins if you do
> it the right way. Avoid ping-pong games like get/set - shared memory is
> far better for this kind of operation. The better topology for message
> passing are pipelines where data flows only into one direction.
>
> I don't know how this translates to Hugo's machine learning problem. I
> could imagine a shared data structure is good here for providing the
> starting point for learning. If you run several learning steps in
> parallel, you want to avoid that these steps lock out each other, i.e.
> try to ensure that they affect distinct parts of the matrix. These
> updates would be sent (using message passing) to an update manager,
> which would apply the learning results to the matrix, and compute the
> next version, for which a number of new learning steps would be started.
> I'm just guessing here how it could be done. In my imagination a clever
> combination of both (alterable) shared memory and message passing is the
> way to go.
>

Agreed. Currently I am using messaging via sexplib to send data-sets to 
slaves for processing and returns results to a master process the same
way. But my objective is to share data structures and return partial
results to a master process. Returning results requires messaging.

> Hugo, I'd like to do some more experiments into this direction. Is there
> a simple version of machine learning algorithm I could try to
> parallelize?
>

Unfortunately not. The "system" currently distributes the experiments
to 8 machines with 8 CPU's each connected in a cluster. The algorithms
themselves, used in each experiment, do _not_ use parallel processing. I
am now working on this (deal with issue of noisy data). My last
attempt tried to use the Ancient module but failed because I needed to
alter complex data-structures.

Once I get the basic algorithm down, I will try to use parallel 
processing again. Lot of work until then 8-(.

Note: a quick look at Netmulticore seems to indicate that I will still 
have to jump some hoops.

Hugo

> Gerd
>
> [1] Netmulticore: now available in ocamlnet-3.3.0test1:
> http://blog.camlcity.org/blog/multicore2.html
>
>>
>>
>> Best,
>>
>> --
>> Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University,
>> Ankara
>> http://groups.yahoo.com/group/ai-philosophy
>> http://myspace.com/arizanesil http://myspace.com/malfunct
>>
>
>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
       [not found]           ` <869445701.579183.1303253283515.JavaMail.root@zmbs4.inria.fr>
@ 2011-04-20  9:25             ` Fabrice Le Fessant
  0 siblings, 0 replies; 74+ messages in thread
From: Fabrice Le Fessant @ 2011-04-20  9:25 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 713 bytes --]

On 04/20/2011 12:48 AM, Jon Harrop wrote:
> Fabrice wrote:
>> Well, Java has fully multi-threaded garbage collection, so there is no point that it
>> is possible to do it.
>>
>> The problem is that it has a cost, and the cost is a huge slowdown on mono-
>> threaded programs.
> 
> Lucas says "zero". You say "huge". Can we do a better job of quantifying this?

No, I cannot, but you can ask the OC4MC team for numbers, to compare the
performance of their implementation with the official one for sequential
programs (coq, ocamlopt, etc.) I could only find 3 programs on their
slides, not really representative of classical OCaml programs, but one
of them already showed a 17% slowdown compared to OCaml.

Fabrice

[-- Attachment #2: fabrice_le_fessant.vcf --]
[-- Type: text/x-vcard, Size: 393 bytes --]

begin:vcard
fn:Fabrice LE FESSANT
n:LE FESSANT;Fabrice
org:INRIA Saclay -- Ile-de-France;P2P & OCaml
adr;quoted-printable:;;Parc Orsay Universit=C3=A9 ;Orsay CEDEX;;91893;France
email;internet:fabrice.le_fessant@inria.fr
title;quoted-printable:Charg=C3=A9 de Recherche
tel;work:+33 1 74 85 42 14
tel;fax:+33 1 74 85 42 49 
url:http://fabrice.lefessant.net/
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-20  7:59                     ` Hugo Ferreira
@ 2011-04-20 12:30                       ` Markus Mottl
  2011-04-20 12:53                         ` Hugo Ferreira
  2011-04-20 14:00                       ` Edgar Friendly
  1 sibling, 1 reply; 74+ messages in thread
From: Markus Mottl @ 2011-04-20 12:30 UTC (permalink / raw)
  To: Hugo Ferreira; +Cc: Gerd Stolpmann, Eray Ozkural, Martin Jambon, caml-list

On Wed, Apr 20, 2011 at 03:59, Hugo Ferreira <hmf@inescporto.pt> wrote:
> Agreed. Currently I am using messaging via sexplib to send data-sets to
> slaves for processing and returns results to a master process the same
> way. But my objective is to share data structures and return partial
> results to a master process. Returning results requires messaging.

Btw., why not use bin-prot instead of sexplib for messaging?  It is
not human-readable, but otherwise comparable in convenience.  It is
very much more efficient (especially for numeric data), both in terms
of processing speed and storage requirements.

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-20 12:30                       ` Markus Mottl
@ 2011-04-20 12:53                         ` Hugo Ferreira
  2011-04-20 13:22                           ` Markus Mottl
  0 siblings, 1 reply; 74+ messages in thread
From: Hugo Ferreira @ 2011-04-20 12:53 UTC (permalink / raw)
  To: caml-list

On 04/20/2011 01:30 PM, Markus Mottl wrote:
> On Wed, Apr 20, 2011 at 03:59, Hugo Ferreira<hmf@inescporto.pt>  wrote:
>> Agreed. Currently I am using messaging via sexplib to send data-sets to
>> slaves for processing and returns results to a master process the same
>> way. But my objective is to share data structures and return partial
>> results to a master process. Returning results requires messaging.
>
> Btw., why not use bin-prot instead of sexplib for messaging?  It is
> not human-readable, but otherwise comparable in convenience.  It is
> very much more efficient (especially for numeric data), both in terms
> of processing speed and storage requirements.
>

Most of the data-set and the results are symbolic (sets of clauses in
predicate logic) which are passed on as text to be parsed. Don't know
if bin-prot helps any in this context.

Thanks for the suggestion.
Hugo F.

> Regards,
> Markus
>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-20 12:53                         ` Hugo Ferreira
@ 2011-04-20 13:22                           ` Markus Mottl
  0 siblings, 0 replies; 74+ messages in thread
From: Markus Mottl @ 2011-04-20 13:22 UTC (permalink / raw)
  To: Hugo Ferreira; +Cc: caml-list

On Wed, Apr 20, 2011 at 08:53, Hugo Ferreira <hmf@inescporto.pt> wrote:
> Most of the data-set and the results are symbolic (sets of clauses in
> predicate logic) which are passed on as text to be parsed. Don't know
> if bin-prot helps any in this context.

It should be easy enough to try out.  Speedups of 10x are not
uncommon.  Sum type constructors as probably used for logic operators
would then surely be much more efficient than textual representation.
Note, too, that Sexplib has to allocate S-expressions as intermediate
representation, whereas bin-prot performs conversions directly from
the wire format to OCaml values.

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-20  7:59                     ` Hugo Ferreira
  2011-04-20 12:30                       ` Markus Mottl
@ 2011-04-20 14:00                       ` Edgar Friendly
  1 sibling, 0 replies; 74+ messages in thread
From: Edgar Friendly @ 2011-04-20 14:00 UTC (permalink / raw)
  To: caml-list

>> Hugo, I'd like to do some more experiments into this direction. Is there
>> a simple version of machine learning algorithm I could try to
>> parallelize?
>>
>
If you're interested in parallelizing Machine Learning algorithms, I've 
had to write a number of them for a class project, and I did it from 
scratch in ocaml (just using BLAS primitives for some fast vector math.  
They're here: https://github.com/thelema/machine-learning  Small parts 
of what I'm doing depend on libosvm, my fork of libsvm that's not 
published yet (just build fixes, needs more work on the types), but 
that's not the part you're interested in, so if you want to compile it, 
just remove the svm_* calls, and it should compile just fine.

The algorithms are all in predict.ml, grouped into binary classifiers 
(which are generally online = sequential, except for klsq (kernel least 
squares) which is just a matrix inversion at its core) and extension 
routines that take a binary classifier and run it many times on the data 
in different ways to get a multi-way classifier.  This second group is 
*ridiculously* parallelizable.

There's also routines for running the produced classifiers on test data, 
which parallelize across different elements of the test data independently.

I wish my deadline weren't today, as I'd love to have a parallel version 
of this code to run my huge dataset through.
E.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 18:24             ` Martin Jambon
  2011-03-25 19:19               ` Hugo Ferreira
  2011-03-30 17:02               ` Jon Harrop
@ 2011-04-20 19:23               ` Jon Harrop
  2011-04-20 20:05                 ` Alexy Khrabrov
  2011-04-20 20:30                 ` Gerd Stolpmann
  2 siblings, 2 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-20 19:23 UTC (permalink / raw)
  To: 'Martin Jambon'; +Cc: Caml List

Martin wrote:
> On 03/25/11 08:44, Hugo Ferreira replied:
> > I'll stick to my guns here. It simply makes solving certain problem
> > unfeasible. Point in case: I work on machine learning algorithms. I
> > use large data-structures that must be processed (altered) in order to
> > learn. Because these data-structures are large it become impractical
> > to copy this to a process every time I start off a new "thread".
> 
> The solution would be to use get/set via a message-passing interface.

The main objective when parallelizing a program is to remove focal points because they are sources of contention. Funnelling all IO through a single getter/setter would add a focal point and, most likely, destroy scalability.

> From my purely speculative perspective, it seems unavoidable that message-
> passing happens at some level in order to keep a shared data structure in sync
> between a large number of processors. In other words, any access to a shared
> data structure requires some physical copy no matter what the programming
> language makes it look like.

Yes. The hardware maintains the illusion of shared memory so the programmer only need optimize on the basis of an accurate cost model. There is no need to burden the programmer with message passing and manual memory management.

Cheers,
Jon.




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-20 19:23               ` Jon Harrop
@ 2011-04-20 20:05                 ` Alexy Khrabrov
  2011-04-20 23:00                   ` Jon Harrop
       [not found]                   ` <76544177.594058.1303341821437.JavaMail.root@zmbs4.inria.fr>
  2011-04-20 20:30                 ` Gerd Stolpmann
  1 sibling, 2 replies; 74+ messages in thread
From: Alexy Khrabrov @ 2011-04-20 20:05 UTC (permalink / raw)
  To: Jon Harrop; +Cc: Caml List

[-- Attachment #1: Type: text/plain, Size: 464 bytes --]


On Apr 20, 2011, at 3:23 PM, Jon Harrop wrote:
> 
> Yes. The hardware maintains the illusion of shared memory so the programmer only need optimize on the basis of an accurate cost model. There is no need to burden the programmer with message passing and manual memory management.

Jon -- based on your extensive expertise with F#, what do you think is the main obstacle for OCaml to get the same kind of parallelism which F# enjoys through .NET? 

-- Alexy

[-- Attachment #2: Type: text/html, Size: 764 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-20 19:23               ` Jon Harrop
  2011-04-20 20:05                 ` Alexy Khrabrov
@ 2011-04-20 20:30                 ` Gerd Stolpmann
  2011-04-20 23:33                   ` Jon Harrop
  1 sibling, 1 reply; 74+ messages in thread
From: Gerd Stolpmann @ 2011-04-20 20:30 UTC (permalink / raw)
  To: Jon Harrop; +Cc: 'Martin Jambon', Caml List

Am Mittwoch, den 20.04.2011, 20:23 +0100 schrieb Jon Harrop:
> Martin wrote:
> > On 03/25/11 08:44, Hugo Ferreira replied:
> > > I'll stick to my guns here. It simply makes solving certain problem
> > > unfeasible. Point in case: I work on machine learning algorithms. I
> > > use large data-structures that must be processed (altered) in order to
> > > learn. Because these data-structures are large it become impractical
> > > to copy this to a process every time I start off a new "thread".
> > 
> > The solution would be to use get/set via a message-passing interface.
> 
> The main objective when parallelizing a program is to remove focal points because they are sources of contention. Funnelling all IO through a single getter/setter would add a focal point and, most likely, destroy scalability.
> 
> > From my purely speculative perspective, it seems unavoidable that message-
> > passing happens at some level in order to keep a shared data structure in sync
> > between a large number of processors. In other words, any access to a shared
> > data structure requires some physical copy no matter what the programming
> > language makes it look like.
> 
> Yes. The hardware maintains the illusion of shared memory so the programmer only need optimize on the basis of an accurate cost model. There is no need to burden the programmer with message passing and manual memory management.

I don't understand this last sentence. Message passing is just an
abstract concept (i.e. an interface) which can e.g. be implemented with
a shared buffer (or the illusion thereof) or other data structure that
decouples sender and receiver processes. Are you against using the right
interfaces when they are appropriate? We don't want to burden the
programmer with too low-level details, do we?

Gerd


> 
> Cheers,
> Jon.
> 
> 
> 
> 


-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-25 20:26                 ` Gerd Stolpmann
  2011-03-26  9:11                   ` Hugo Ferreira
  2011-03-26 10:31                   ` Richard W.M. Jones
@ 2011-04-20 21:44                   ` Jon Harrop
  2 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-20 21:44 UTC (permalink / raw)
  To: 'Gerd Stolpmann'; +Cc: caml-list

Gerd wrote:
> When you implement a message passing API for a high level language, the way
> to go is to provide message buffers in memory. When a thread delivers a
> message it just writes to the buffer. The real message, however, is sent by the
> hardware under the hood, and must be delivered until the reading thread
> synchronizes. The point is here, IMHO, that you exploit the features of the
> hardware the most when you do it in a way the hardware can deal best with.
> Thus a program written against the message passing API will finally be faster
> than one that uncritically assumes a uniform memory, and modifies memory in-
> place, and will finally suffer from cache line bouncing.

I think we should work through a concrete example like the one Hugo gave using message passing and the other approaches to multicore parallelism. The following F# function "f" takes "tree" and creates "n" derivative data structures by applying "g" to it and then applies "h" to it in order to perform the "subsequent processing":

# let f tree n g h = Array.Parallel.init n (fun i -> h(g i tree))
val f : 'a -> int -> (int -> 'a -> 'b) -> ('b -> 'c) -> 'c array

For example, take the set {2,4,6,8} and add each of 0..9 and then count the number of elements:

# f (set[2;4;6;8]) 10 Set.add Set.count;;
- : int array = [|5; 5; 4; 5; 4; 5; 4; 5; 4; 5|]

Try to solve the same problem using explicit message passing. 

> For current standard server hardware there are usually not enough CPUs to get a
> big effect. But in a few years it will be very important (as it is for current
> supercomputers), maybe even for standard 128 core laptops in 2015 (just
> guessing).

This needs quantifying. Dual core desktops became available with AMD's Athlon 64 X2 in 2005. Six years later, the widest multicore desktop Dell sells has only 6 cores and the widest server has only 12 cores (per CPU). If the trend continues, we'll barely have 18-core desktops by 2015 much less 128-core laptops.

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-20 20:05                 ` Alexy Khrabrov
@ 2011-04-20 23:00                   ` Jon Harrop
       [not found]                   ` <76544177.594058.1303341821437.JavaMail.root@zmbs4.inria.fr>
  1 sibling, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-20 23:00 UTC (permalink / raw)
  To: 'Alexy Khrabrov'; +Cc: 'Caml List'

Alexy wrote:
> Jon -- based on your extensive expertise with F#, what do you think
> is the main obstacle for OCaml to get the same kind of parallelism
> which F# enjoys through .NET? 

I think motive is the only obstacle. Nobody is sufficiently motivated to do
this.

The OCaml team at INRIA are not motivated to do this because it does not
constitute research, would probably make Coq slower and would burden them
with maintaining irrelevant features.

OCaml users just migrate to other languages that are closer to what they
want rather than spending years learning how to build a parallel OCaml, then
doing it and then building a community around it. The only notable exception
might be Jane St. Capital because they have the resources and a vested
interest in performance. There are other large companies invested in OCaml,
like Citrix, but they aren't so interested in parallelism.

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-20 20:30                 ` Gerd Stolpmann
@ 2011-04-20 23:33                   ` Jon Harrop
  0 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-20 23:33 UTC (permalink / raw)
  To: 'Gerd Stolpmann'; +Cc: 'Caml List'

Gerd wrote:
> I wrote:
> > Yes. The hardware maintains the illusion of shared memory so the
> programmer only need optimize on the basis of an accurate cost model. There is
> no need to burden the programmer with message passing and manual memory
> management.
> 
> I don't understand this last sentence. Message passing is just an abstract
> concept (i.e. an interface) which can e.g. be implemented with a shared buffer
> (or the illusion thereof) or other data structure that decouples sender and
> receiver processes.

Similarly, caches are an abstract concept that can be implemented in different ways. We could also emulate the CPU cache in software and provide an abstract interface for our emulated cache. Does this also seem like a good idea given that the context is multicore parallelism, i.e. performance?

> Are you against using the right interfaces when they are
> appropriate?

As long as the hardware provides shared memory then using shared memory directly is the right interface in the context of multicore parallelism.

> We don't want to burden the programmer with too low-level
> details, do we?

Exactly. In this context, message passing is a low-level implementation detail that can and should be confined to the hardware layer. The programmer should be oblivious to the number of cores and threads and oblivious to the low-level communication going on between them and to the sender and receiver processes you mentioned. They need only be aware of high-level concepts like cache complexity. This makes parallel programming easy and that is why it has become the defacto-standard approach for multicore programming. After all, we could easily write parallel programs using manual message passing between explicit senders and receivers but it is harder and slower and, therefore, completely pointless.

I should note that your abstraction is of value in the context of concurrent programming when throughput is irrelevant because it can be used to expose a much simpler "memory model" but that is not relevant to this discussion about multicore parallelism.

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
       [not found]                   ` <76544177.594058.1303341821437.JavaMail.root@zmbs4.inria.fr>
@ 2011-04-21  7:48                     ` Fabrice Le Fessant
  2011-04-21  8:35                       ` Hugo Ferreira
                                         ` (3 more replies)
  0 siblings, 4 replies; 74+ messages in thread
From: Fabrice Le Fessant @ 2011-04-21  7:48 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 1133 bytes --]

On 04/21/2011 01:23 AM, Jon Harrop wrote:
> The OCaml team at INRIA are not motivated to do this because it does not
> constitute research, would probably make Coq slower and would burden them
> with maintaining irrelevant features.

You think the programmers in the world that are only interested in
floating-point intensive computations, with fine-grain concurrency, are
a huge majority. I think they are not so many. Can we do a better job of
quantifying this ?

> OCaml users just migrate to other languages that are closer to what they
> want rather than spending years learning how to build a parallel OCaml, then
> doing it and then building a community around it. The only notable exception
> might be Jane St. Capital because they have the resources and a vested
> interest in performance. There are other large companies invested in OCaml,
> like Citrix, but they aren't so interested in parallelism.

That's probably the reason why so many scientists use Python instead of
OCaml, because it is faster with better multicore support ?  I always
thought Python was slower than OCaml, and had no multicore support...

Fabrice

[-- Attachment #2: fabrice_le_fessant.vcf --]
[-- Type: text/x-vcard, Size: 380 bytes --]

begin:vcard
fn:Fabrice LE FESSANT
n:LE FESSANT;Fabrice
org:INRIA Saclay -- Ile-de-France;P2P & OCaml
adr;quoted-printable:;;Parc Orsay Universit=C3=A9 ;Orsay CEDEX;;91893;France
email;internet:fabrice.le_fessant@inria.fr
title;quoted-printable:Charg=C3=A9 de Recherche
tel;work:+33 1 74 85 42 14
tel;fax:+33 1 74 85 42 49 
url:http://fabrice.lefessant.net/
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-21  7:48                     ` Fabrice Le Fessant
@ 2011-04-21  8:35                       ` Hugo Ferreira
  2011-04-23 17:32                         ` Jon Harrop
  2011-04-21  9:09                       ` Alain Frisch
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 74+ messages in thread
From: Hugo Ferreira @ 2011-04-21  8:35 UTC (permalink / raw)
  To: Fabrice Le Fessant; +Cc: caml-list

Fabrice,

Don't mean to stoke this fire but....

On 04/21/2011 08:48 AM, Fabrice Le Fessant wrote:
> On 04/21/2011 01:23 AM, Jon Harrop wrote:
>> The OCaml team at INRIA are not motivated to do this because it does not
>> constitute research, would probably make Coq slower and would burden them
>> with maintaining irrelevant features.
>
> You think the programmers in the world that are only interested in
> floating-point intensive computations, with fine-grain concurrency, are
> a huge majority. I think they are not so many. Can we do a better job of
> quantifying this ?
>

Some of us are interested in parallel computation for symbolic
manipulation. In my case I am manipulating sets of literals in
predicate logic. I am sure this is also useful in other areas
such ans ILP, graph mining, frequent pattern mining, etc.

>> OCaml users just migrate to other languages that are closer to what they
>> want rather than spending years learning how to build a parallel OCaml, then
>> doing it and then building a community around it. The only notable exception
>> might be Jane St. Capital because they have the resources and a vested
>> interest in performance. There are other large companies invested in OCaml,
>> like Citrix, but they aren't so interested in parallelism.
>
> That's probably the reason why so many scientists use Python instead of
> OCaml, because it is faster with better multicore support ?  I always
> thought Python was slower than OCaml, and had no multicore support...
>

Maybe the issue here is not why people use Python but why more people
don't use Haskell, ML, Ocaml, F#, etc.

Regards,
Hugo F.


> Fabrice
>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-21  7:48                     ` Fabrice Le Fessant
  2011-04-21  8:35                       ` Hugo Ferreira
@ 2011-04-21  9:09                       ` Alain Frisch
       [not found]                         ` <20110421.210304.1267840107736400776.Christophe.Troestler+ocaml@umons.ac.be>
       [not found]                         ` <799994864.610698.1303412613509.JavaMail.root@zmbs4.inria.fr>
  2011-04-21 10:09                       ` Philippe Strauss
  2011-04-23 17:05                       ` Jon Harrop
  3 siblings, 2 replies; 74+ messages in thread
From: Alain Frisch @ 2011-04-21  9:09 UTC (permalink / raw)
  To: caml-list

On 04/21/2011 09:48 AM, Fabrice Le Fessant wrote:
> You think the programmers in the world that are only interested in
> floating-point intensive computations, with fine-grain concurrency, are
> a huge majority. I think they are not so many. Can we do a better job of
> quantifying this ?

I don't know how to quantify this. I can only explain how the kind of 
things we do at LexiFi can benefit from multicore (in the current state 
of OCaml).

We do some heavy floating-point computations related to the pricing and 
risk analysis of financial products. We use mainly two families of 
pricing algorithms: Partial Differential Equation (PDE) solvers and 
Monte Carlo (MC) simulations. A MC simulation consists in drawing many 
(quasi)random trajectories generated by a model and computing the payoff 
for each of them. Running such a simulation of several cores or machines 
is quite easy. We only need to be able to distribute the (quasi)random 
number generator and there are good ways to do this. Making a single PDE 
job benefit from multicore is more challenging. If some matrix 
primitives that could exploit multicore were available, we might benefit 
from them (we don't need them to be implemented in OCaml).

But anyway, the real scaling issue is not about making a single pricing 
job run in parallel. Indeed, we need to run many independent jobs in 
order to price many different products, or even for a single product, in 
order to run some risk analysis that require many pricing jobs. We can 
easily benefit from multi cores or distribute computations to several 
machines (single pricing jobs are expensive enough to justify tiny 
communication costs between processes or machines; the amount of data to 
be exchanged is rather small).

I guess we are in the nice niche of "embarassingly parallel" tasks.
Again, not so much because of the structure of the numerical problem 
itself, but because we need to run many independent instances of the 
problem.

Alain

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-21  7:48                     ` Fabrice Le Fessant
  2011-04-21  8:35                       ` Hugo Ferreira
  2011-04-21  9:09                       ` Alain Frisch
@ 2011-04-21 10:09                       ` Philippe Strauss
  2011-04-23 17:44                         ` Jon Harrop
  2011-04-23 17:05                       ` Jon Harrop
  3 siblings, 1 reply; 74+ messages in thread
From: Philippe Strauss @ 2011-04-21 10:09 UTC (permalink / raw)
  To: Fabrice Le Fessant; +Cc: caml-list


Le 21 avr. 2011 à 09:48, Fabrice Le Fessant a écrit :

> On 04/21/2011 01:23 AM, Jon Harrop wrote:
>> The OCaml team at INRIA are not motivated to do this because it does not
>> constitute research, would probably make Coq slower and would burden them
>> with maintaining irrelevant features.
> 
> You think the programmers in the world that are only interested in
> floating-point intensive computations, with fine-grain concurrency, are
> a huge majority. I think they are not so many. Can we do a better job of
> quantifying this ?

me!

for audio DSP.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
       [not found]                         ` <20110421.210304.1267840107736400776.Christophe.Troestler+ocaml@umons.ac.be>
@ 2011-04-21 19:53                           ` Hezekiah M. Carty
  2011-04-22  8:34                           ` Alain Frisch
  1 sibling, 0 replies; 74+ messages in thread
From: Hezekiah M. Carty @ 2011-04-21 19:53 UTC (permalink / raw)
  To: Christophe TROESTLER; +Cc: alain.frisch, caml-list

2011/4/21 Christophe TROESTLER <Christophe.Troestler@umons.ac.be>:
> On Thu, 21 Apr 2011 11:09:22 +0200, Alain Frisch wrote:
>>
>> On 04/21/2011 09:48 AM, Fabrice Le Fessant wrote:
>> > You think the programmers in the world that are only interested in
>> > floating-point intensive computations, with fine-grain
>> > concurrency, are a huge majority. I think they are not so
>> > many. Can we do a better job of quantifying this ?
>>
>> [...] I guess we are in the nice niche of "embarassingly parallel"
>> tasks.  Again, not so much because of the structure of the numerical
>> problem itself, but because we need to run many independent
>> instances of the problem.
>
> May you say why Functory is not good enough for this ?
> http://www.lri.fr/~filliatr/functory/About.html
>

MPI [1] and ZeroMQ [2] are options as well.  This may be getting a bit
off topic - what does Functory offer over/in addition to these tools?

[1] - http://forge.ocamlcore.org/projects/ocamlmpi/
[2] - https://github.com/pdhborges/ocaml-zmq

Hez


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
       [not found]                         ` <799994864.610698.1303412613509.JavaMail.root@zmbs4.inria.fr>
@ 2011-04-22  8:06                           ` Fabrice Le Fessant
  2011-04-22  9:11                             ` Gerd Stolpmann
  2011-04-22  9:44                             ` Vincent Aravantinos
  0 siblings, 2 replies; 74+ messages in thread
From: Fabrice Le Fessant @ 2011-04-22  8:06 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 1088 bytes --]

On 04/21/2011 09:03 PM, Christophe TROESTLER wrote:
> For the initial question: If OCaml gets proper parallelism, I believe
> it is good not to neglect the needs of scientific users (to which I
> belong) ― maybe they cannot be 100% met due to other considerations
> but rejecting them from the start does not feel right to me.

I think nobody is rejecting the needs of scientific users. OCamlPro
actually started a long term project called "numcaml", to provide
something similar (including syntactically) to numpy, which, I think,
turned a lot of scientific users into Python fans. We are also thinking
about how to address fine-grain parallel numerical computations using
our current multicore solution.

My point was just that J.H. has been writting everywhere, in every
single forum in the world, that OCaml is bad because it could not
address fine-grain parallelism, for his personal needs. I thought it was
time for the silent majority of satisfied OCaml users to say, here and
in those forums, that OCaml is actually addressing their most important
concerns.

Best,
Fabrice

[-- Attachment #2: fabrice_le_fessant.vcf --]
[-- Type: text/x-vcard, Size: 393 bytes --]

begin:vcard
fn:Fabrice LE FESSANT
n:LE FESSANT;Fabrice
org:INRIA Saclay -- Ile-de-France;P2P & OCaml
adr;quoted-printable:;;Parc Orsay Universit=C3=A9 ;Orsay CEDEX;;91893;France
email;internet:fabrice.le_fessant@inria.fr
title;quoted-printable:Charg=C3=A9 de Recherche
tel;work:+33 1 74 85 42 14
tel;fax:+33 1 74 85 42 49 
url:http://fabrice.lefessant.net/
version:2.1
end:vcard

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
       [not found]                         ` <20110421.210304.1267840107736400776.Christophe.Troestler+ocaml@umons.ac.be>
  2011-04-21 19:53                           ` Hezekiah M. Carty
@ 2011-04-22  8:34                           ` Alain Frisch
  1 sibling, 0 replies; 74+ messages in thread
From: Alain Frisch @ 2011-04-22  8:34 UTC (permalink / raw)
  To: Christophe TROESTLER; +Cc: caml-list

On 04/21/2011 09:03 PM, Christophe TROESTLER wrote:
> On Thu, 21 Apr 2011 11:09:22 +0200, Alain Frisch wrote:
>>
>> On 04/21/2011 09:48 AM, Fabrice Le Fessant wrote:
>>> You think the programmers in the world that are only interested in
>>> floating-point intensive computations, with fine-grain
>>> concurrency, are a huge majority. I think they are not so
>>> many. Can we do a better job of quantifying this ?
>>
>> [...] I guess we are in the nice niche of "embarassingly parallel"
>> tasks.  Again, not so much because of the structure of the numerical
>> problem itself, but because we need to run many independent
>> instances of the problem.
> 
> May you say why Functory is not good enough for this ?
> http://www.lri.fr/~filliatr/functory/About.html

I did not say Functory is not good enough for this. Actually, I did not
know about this project. Although my message was not explicit about it,
we have implemented our own solution, based on a multi-process approach,
and we are quite happy with it. That said, we follow with a lot of
interest all the projects and discussions about making it easier to
write distributed computations in OCaml.

After quickly reading the API and the technical report, I think the
library would indeed be a good match for the kind of computations we are
dealing with. Actually, it has a lot in common with the solution we have
implemented. Some difference between Functory and our solution:

- Functory's Cores module relies on system calls which are not available
under Windows. Our solution is based on a different technique: external
worker processed are spawned by the main program
by executing argv[0] with a special command-line argument (to
distinguish from a regular invocation). Our "Computations" library
intercepts this argument and runs a loop that waits for new tasks
(similar to Functory's Network solutions).

- Functory's external worker processes execute a single task and die.
In our solution, we maintain a pool of available worker processes; when
a worker process has finished computing a task, it notifies the
scheduler explicitly. This avoids some non-negligible start-up costs
(especially under Windows) and allows workers to keep some in-memory
caches and database connections. Again, this looks similar to Functory's
Network approach.

- Our API allows workers to notify the caller with intermediate results
and progress feedback (typically, to display some progress bar and
status to the user).  For instance, a single Monte Carlo pricing task
reports progress percentage in the GUI, together with the current price
estimate and observed variance.  Similarly, a "portfolio" pricing task
reports the percentage (number of contracts already priced). The API
also allows the caller to kill a running task (typically when the user
clicks on "Stop" in the GUI). We do this by asking the OS to kill the
worker process. The caller can still use intermediate results to do
something useful.

- We also have some solution for distributing computations across the
network (not in use in production yet), but we go through our
application's database instead of direct TCP networking. Any connected
client makes its cores available to take some tasks from the task pool;
and it can itself submit new tasks. (This also give a cache to share the
result of computations between different clients.)


Cheers,

Alain

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-22  8:06                           ` Fabrice Le Fessant
@ 2011-04-22  9:11                             ` Gerd Stolpmann
  2011-04-23 10:17                               ` Eray Ozkural
  2011-04-22  9:44                             ` Vincent Aravantinos
  1 sibling, 1 reply; 74+ messages in thread
From: Gerd Stolpmann @ 2011-04-22  9:11 UTC (permalink / raw)
  To: Fabrice Le Fessant; +Cc: caml-list

Am Freitag, den 22.04.2011, 10:06 +0200 schrieb Fabrice Le Fessant:
> On 04/21/2011 09:03 PM, Christophe TROESTLER wrote:
> > For the initial question: If OCaml gets proper parallelism, I believe
> > it is good not to neglect the needs of scientific users (to which I
> > belong) ― maybe they cannot be 100% met due to other considerations
> > but rejecting them from the start does not feel right to me.
> 
> I think nobody is rejecting the needs of scientific users. OCamlPro
> actually started a long term project called "numcaml", to provide
> something similar (including syntactically) to numpy, which, I think,
> turned a lot of scientific users into Python fans. We are also thinking
> about how to address fine-grain parallel numerical computations using
> our current multicore solution.
> 
> My point was just that J.H. has been writting everywhere, in every
> single forum in the world, that OCaml is bad because it could not
> address fine-grain parallelism, for his personal needs. I thought it was
> time for the silent majority of satisfied OCaml users to say, here and
> in those forums, that OCaml is actually addressing their most important
> concerns.

That would be good.

I've just released an example of using Netmulticore for fine-grained
parallelism:

http://blog.camlcity.org/blog/multicore3.html

J.H. is certainly right that he demands shared memory. Certain
algorithms cannot be done without. But shm is possible, even with
current ocaml - and Netmulticore demonstrates it. AFAIK the other
multi-process environments lack it.

Gerd

> Best,
> Fabrice
> 


-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-22  8:06                           ` Fabrice Le Fessant
  2011-04-22  9:11                             ` Gerd Stolpmann
@ 2011-04-22  9:44                             ` Vincent Aravantinos
  1 sibling, 0 replies; 74+ messages in thread
From: Vincent Aravantinos @ 2011-04-22  9:44 UTC (permalink / raw)
  To: Fabrice Le Fessant; +Cc: caml-list

Le 22 avr. 11 à 10:06, Fabrice Le Fessant a écrit :

> My point was just that J.H. has been writting everywhere, in every
> single forum in the world, that OCaml is bad because it could not
> address fine-grain parallelism, for his personal needs. I thought it  
> was
> time for the silent majority of satisfied OCaml users to say, here and
> in those forums, that OCaml is actually addressing their most  
> important
> concerns.

I personally don't care that he states Ocaml is bad because of this  
lack,
what bothers me more is that he claims that 1. there are less and less
users of Ocaml and 2. this is due to the lack of parallelism. He  
pretends
to support claim 1 with the frequentation figures of the mailing lists,
but when I observe those figures by myself I surprinsingly do not find
the same figures as J.H. I probably made a mistake when collecting those
figures. Claim 2 is of course impossible to support in any objective
manner, unless a survey among all former Ocaml users was made. Which, to
my knowledge, has not been done.

Funnily enough, Lisp users have been complaining for years that J.H. was
advertising Ocaml against Lisp in a very unobjective manner on their
mailing list. It's our turn now... Why does he still bother
reading/writing on this mailing list since the same flamewars happen
again and again? My belief is that he is just trying to attract  
potential
customers & users to his new favourite language. See e.g. the discussion
"Is OCaml fast?" a few months ago where the original author of the
post told afterwards that J.H. had contacted him privately in order to
advertise F#... From a sales/marketing point of view, this is all the
most fair. But I am not sure that it is the purpose of this list.

V.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-22  9:11                             ` Gerd Stolpmann
@ 2011-04-23 10:17                               ` Eray Ozkural
  2011-04-23 13:47                                 ` Alexy Khrabrov
  2011-04-23 19:02                                 ` Jon Harrop
  0 siblings, 2 replies; 74+ messages in thread
From: Eray Ozkural @ 2011-04-23 10:17 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Fabrice Le Fessant, caml-list

[-- Attachment #1: Type: text/plain, Size: 678 bytes --]

I don't really care what others say, but to prove that this has any
performance value you should do the following:

Compare your most "parallel" algorithm with the performance of a
corresponding well-written MPI application using openmpi's shared memory
transport. If there is a difference, then your system has some value.

Of course openmpi's shared memory transport is terribly buggy, but it should
give a baseline acceptable performance.

If there is no comparison, we have no idea.

Best,

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct

[-- Attachment #2: Type: text/html, Size: 1001 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-23 10:17                               ` Eray Ozkural
@ 2011-04-23 13:47                                 ` Alexy Khrabrov
  2011-04-23 17:39                                   ` Eray Ozkural
  2011-04-23 19:02                                 ` Jon Harrop
  1 sibling, 1 reply; 74+ messages in thread
From: Alexy Khrabrov @ 2011-04-23 13:47 UTC (permalink / raw)
  To: Eray Ozkural; +Cc: Caml List

On Apr 23, 2011, at 6:17 AM, Eray Ozkural wrote:

> I don't really care what others say, but to prove that this has any performance value you should do the following:
> 
> 
> Compare your most "parallel" algorithm with the performance of a corresponding well-written MPI application using openmpi's shared memory transport. If there is a difference, then your system has some value.
> 
> Of course openmpi's shared memory transport is terribly buggy, but it should give a baseline acceptable performance.
> 
> If there is no comparison, we have no idea.

The problem with "implement in MPI and compare" is that you have to rearchitect a sequential program for a totally different model.  By contrast, using shared memory parallelism, it's often a question of using pmap.

I really recommend everybody interested in parallelism to learn and try Clojure on a small problem.  You can replace a single map by pmap in a suitable setting and observe a not-quite-linear, but proportional speedup.

I'd be really happy if OCaml gets the mechanisms from Clojure.

-- Alexy

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-21  7:48                     ` Fabrice Le Fessant
                                         ` (2 preceding siblings ...)
  2011-04-21 10:09                       ` Philippe Strauss
@ 2011-04-23 17:05                       ` Jon Harrop
  3 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-23 17:05 UTC (permalink / raw)
  To: 'Fabrice Le Fessant', caml-list

Fabrice wrote:
> On 04/21/2011 01:23 AM, Jon Harrop wrote:
> > The OCaml team at INRIA are not motivated to do this because it does
> > not constitute research, would probably make Coq slower and would
> > burden them with maintaining irrelevant features.
> 
> You think the programmers in the world that are only interested in
floating-point
> intensive computations, with fine-grain concurrency, are a huge majority.

That is neither true nor relevant.

> > OCaml users just migrate to other languages that are closer to what
> > they want rather than spending years learning how to build a parallel
> > OCaml, then doing it and then building a community around it. The only
> > notable exception might be Jane St. Capital because they have the
> > resources and a vested interest in performance. There are other large
> > companies invested in OCaml, like Citrix, but they aren't so interested
in
> parallelism.
> 
> That's probably the reason why so many scientists use Python instead of
OCaml,
> because it is faster with better multicore support ?  I always thought
Python was
> slower than OCaml, and had no multicore support...

Scientists in academia use Python because it is a cheap and easy way to
invoke existing solutions generally written in low-level languages like C,
C++ and Fortran. While there is certainly merit in such utility it is hardly
relevant to this discussion.

Cheers,
Jon.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-21  8:35                       ` Hugo Ferreira
@ 2011-04-23 17:32                         ` Jon Harrop
  0 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-23 17:32 UTC (permalink / raw)
  To: 'Hugo Ferreira'; +Cc: caml-list

Hugo wrote:
> Fabrice wrote:
> > You think the programmers in the world that are only interested in
> > floating-point intensive computations, with fine-grain concurrency,
> > are a huge majority. I think they are not so many. Can we do a better
> > job of quantifying this ?
>
> Some of us are interested in parallel computation for symbolic
manipulation. In
> my case I am manipulating sets of literals in predicate logic. I am sure
this is also
> useful in other areas such ans ILP, graph mining, frequent pattern mining,
etc.

Yes. Symbolic programming is increasingly common in scientific computing as
scientists have come to realise that higher-level languages make it feasible
to attack problems that were considered intractable before. Thomas
Fischbacher did some good work on this using OCaml a few years ago that
attracted a lot of attention.

Cheers,
Jon.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-23 13:47                                 ` Alexy Khrabrov
@ 2011-04-23 17:39                                   ` Eray Ozkural
  2011-04-23 20:18                                     ` Alexy Khrabrov
  0 siblings, 1 reply; 74+ messages in thread
From: Eray Ozkural @ 2011-04-23 17:39 UTC (permalink / raw)
  To: Alexy Khrabrov; +Cc: Caml List

[-- Attachment #1: Type: text/plain, Size: 2067 bytes --]

On Sat, Apr 23, 2011 at 4:47 PM, Alexy Khrabrov <deliverable@gmail.com>wrote:

>
> On Apr 23, 2011, at 6:17 AM, Eray Ozkural wrote:
>
> > I don't really care what others say, but to prove that this has any
> performance value you should do the following:
> >
> >
> > Compare your most "parallel" algorithm with the performance of a
> corresponding well-written MPI application using openmpi's shared memory
> transport. If there is a difference, then your system has some value.
> >
> > Of course openmpi's shared memory transport is terribly buggy, but it
> should give a baseline acceptable performance.
> >
> > If there is no comparison, we have no idea.
>
> The problem with "implement in MPI and compare" is that you have to
> rearchitect a sequential program for a totally different model.  By
> contrast, using shared memory parallelism, it's often a question of using
> pmap.
>

Incorrect. We always compare to sequential code in parallel computing. It's
called "speedup".

And doubly incorrect because we are not comparing to sequential code but a
claimed shared memory parallelism. It's only logical to compare two
approaches on the same hardware.


>
> I really recommend everybody interested in parallelism to learn and try
> Clojure on a small problem.  You can replace a single map by pmap in a
> suitable setting and observe a not-quite-linear, but proportional speedup.
>

Of course functional programming fits such parallelism very well. It's a
shame that ocaml does not have parallel functional primitives.


>
> I'd be really happy if OCaml gets the mechanisms from Clojure.



It'd be even better if such explicit parallelism had good compiler support,
too :) I don't know much about Clojure, but I wouldn't use anything that
runs on JVM for a parallel program. That might be like first turning your
computer to Commodore 64 and then getting some speedup.


Best,


-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct

[-- Attachment #2: Type: text/html, Size: 3034 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-21 10:09                       ` Philippe Strauss
@ 2011-04-23 17:44                         ` Jon Harrop
  0 siblings, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-23 17:44 UTC (permalink / raw)
  To: 'Philippe Strauss'; +Cc: caml-list

Phillipe wrote:
> Fabrice wrote:
> > You think the programmers in the world that are only interested in
> > floating-point intensive computations, with fine-grain concurrency,
> > are a huge majority. I think they are not so many. Can we do a better
> > job of quantifying this ?
> 
> me!
> 
> for audio DSP.

This is exactly the kind of application I had in mind when I was developing
HLVM. Assuming OCaml doesn't get good support for parallelism, perhaps the
next best solution is to quote code in your OCaml source that will then be
compiled using LLVM down to efficient machine code for execution?

This looks like the most viable approach to me, not just because it would
satisfy your requirements but also because it would separate concerns
between a core OCaml language optimized for Coq and the run-time code gen
library that can be optimized completely independently for scientific
computing. Therefore, the OCaml team at INRIA can continue to develop OCaml
independently and the scientists that want this can fund development in
their own preferred direction. Everyone wins...

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-23 10:17                               ` Eray Ozkural
  2011-04-23 13:47                                 ` Alexy Khrabrov
@ 2011-04-23 19:02                                 ` Jon Harrop
  1 sibling, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-23 19:02 UTC (permalink / raw)
  To: 'Eray Ozkural'; +Cc: caml-list

Eray wrote:
> I don't really care what others say, but to prove that this has any
performance value you should do the following:

There is certainly merit in your proposal but it is worth noting up front
that you cannot "prove" anything this way. We can only hope to test
falsifiable hypotheses. For example, if we observe better performance from
Cilk-style parallelism than from MPI then it might be because we suck at
writing MPI code rather than because the Cilk style is inherently more
efficient. A valuable result nonetheless but not technically a proof.

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-23 17:39                                   ` Eray Ozkural
@ 2011-04-23 20:18                                     ` Alexy Khrabrov
  2011-04-23 21:18                                       ` Jon Harrop
  2011-04-24  0:33                                       ` Eray Ozkural
  0 siblings, 2 replies; 74+ messages in thread
From: Alexy Khrabrov @ 2011-04-23 20:18 UTC (permalink / raw)
  To: Eray Ozkural; +Cc: Caml List

[-- Attachment #1: Type: text/plain, Size: 3834 bytes --]

On Apr 23, 2011, at 1:39 PM, Eray Ozkural wrote:

> On Sat, Apr 23, 2011 at 4:47 PM, Alexy Khrabrov <deliverable@gmail.com> wrote:
> 
> On Apr 23, 2011, at 6:17 AM, Eray Ozkural wrote:
> 
> > I don't really care what others say, but to prove that this has any performance value you should do the following:
> >
> >
> > Compare your most "parallel" algorithm with the performance of a corresponding well-written MPI application using openmpi's shared memory transport. If there is a difference, then your system has some value.
> >
> > Of course openmpi's shared memory transport is terribly buggy, but it should give a baseline acceptable performance.
> >
> > If there is no comparison, we have no idea.
> 
> The problem with "implement in MPI and compare" is that you have to rearchitect a sequential program for a totally different model.  By contrast, using shared memory parallelism, it's often a question of using pmap.
> 
> Incorrect. We always compare to sequential code in parallel computing. It's called "speedup".
> 
> And doubly incorrect because we are not comparing to sequential code but a claimed shared memory parallelism. It's only logical to compare two approaches on the same hardware.

I'm not claiming that something is correct or not -- I'm just saying that replacing map by pmap is easy, while rewriting in MPI style is complex.  Making a shared memory program out of sequential one might be this trivial, while MPI never will be; you have to program in message-passing style from the get-go, and preferably in Erlang or Scala actors with or without the  AKKA kernel and such or 0MQ, etc.

> 
> I really recommend everybody interested in parallelism to learn and try Clojure on a small problem.  You can replace a single map by pmap in a suitable setting and observe a not-quite-linear, but proportional speedup.
> 
> Of course functional programming fits such parallelism very well. It's a shame that ocaml does not have parallel functional primitives.
>  
> 
> I'd be really happy if OCaml gets the mechanisms from Clojure. 
> 
> 
> It'd be even better if such explicit parallelism had good compiler support, too :) I don't know much about Clojure, but I wouldn't use anything that runs on JVM for a parallel program. That might be like first turning your computer to Commodore 64 and then getting some speedup.

This is an obsolete urban legend.  JVM has the most mature GC out there and computational performance often on par with C when loaded and running.  In my social networking benchmark, the largest data-churning test ever for functional programming languages (http://functional.tv/), Clojure was only 2-3 times slower than OCaml and Haskell, and it's mostly due to slow Java serialization and deserialization.  My experience with Scala and Clojure tells me these are the best ways now to do shared memory parallelism for performance gains in a real-world manner (using many libraries).  BTW, Haskell then beat OCaml by a small margin, although using purely functional maps to OCaml's hash tables.  The Haskell folks keep improving their performance, although the GC then originally crashed under such an unexpected volume as a Twitter graph of 5 million users -- and was quickly fixed.  Still we had to strictify Haskell's core data structures, an exercise which made me go back to OCaml.  I finished my Twitter data mining Ph.D. in OCaml as the most practical way to handle the graph, filling up a 64 GB RAM server, yet it was only one core out of eight running, which is a pity.

Clojure's performance improves by leaps and bounds, e.g. using primitives as efficiently as in Java, and I think OCaml would benefit from a similar set of primitives -- then it would be the most practical ML-style FP language, the prize now in fact held by Scala.

-- Alexy

[-- Attachment #2: Type: text/html, Size: 5254 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-23 20:18                                     ` Alexy Khrabrov
@ 2011-04-23 21:18                                       ` Jon Harrop
  2011-04-24  0:33                                       ` Eray Ozkural
  1 sibling, 0 replies; 74+ messages in thread
From: Jon Harrop @ 2011-04-23 21:18 UTC (permalink / raw)
  To: 'Alexy Khrabrov', 'Eray Ozkural'; +Cc: 'Caml List'

Alexy wrote:
> I'm not claiming that something is correct or not -- I'm just
> saying that replacing map by pmap is easy, while rewriting
> in MPI style is complex.

Complex not only because you must pass information around by hand but also
(in the absence of a shared heap, e.g. in OCaml) because you must deep copy
your data structures in order to send them which turns O(1)
pass-by-reference into O(n) pass-by-value and that, in turn, makes the
performance of MPI code much harder to predict.

> This is an obsolete urban legend.  JVM has the most mature
> GC out there and computational performance often on par
> with C when loaded and running.

The JVM does have a very evolved GC and it can often yield competitive
performance but it also has its fair share of deficiencies that can make it
much harder than necessary to attain competitive performance. In this
context, type erasure for generics and the lack of value types are major
issues. These culminate in generic code doing far more allocation than
necessary. Although this is partially offset by the JVM's GC there are still
important situations (e.g. generic hash table) where the JVM is many times
slower than alternatives like .NET (which uses reified generics and has
value types):

  http://fsharpnews.blogspot.com/2010/05/java-vs-f.html

> In my social networking benchmark...

That's a very interesting benchmark!

> Clojure's performance improves by leaps and bounds, e.g.
> using primitives as efficiently as in Java, and I think OCaml
> would benefit from a similar set of primitives --

Absolutely. There's no theoretical reason why this functionality could not
be available from OCaml too but it is a major project and, as I said, nobody
has enough motive to undertake it.

> then it would be the most practical ML-style FP language,
> the prize now in fact held by Scala.

I'd like to see F# too. :-)

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-23 20:18                                     ` Alexy Khrabrov
  2011-04-23 21:18                                       ` Jon Harrop
@ 2011-04-24  0:33                                       ` Eray Ozkural
  2011-04-28 14:42                                         ` orbitz
  1 sibling, 1 reply; 74+ messages in thread
From: Eray Ozkural @ 2011-04-24  0:33 UTC (permalink / raw)
  To: Alexy Khrabrov; +Cc: Caml List

[-- Attachment #1: Type: text/plain, Size: 322 bytes --]

Don't get me wrong, I think Clojure is a decent language. I have no qualms
with it :) I just don't think I would use it for HPC. Even O'Caml is too
slow for writing distributed memory parallel code. I can't imagine those
other things....

In the end, you guys are going to make me write yet another functional
language :D

[-- Attachment #2: Type: text/html, Size: 375 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-04-24  0:33                                       ` Eray Ozkural
@ 2011-04-28 14:42                                         ` orbitz
  0 siblings, 0 replies; 74+ messages in thread
From: orbitz @ 2011-04-28 14:42 UTC (permalink / raw)
  To: Eray Ozkural; +Cc: Alexy Khrabrov, Caml List

Hey everyone, I wrote up a short review of the paper "Efficient  
Parallel Programming in Poly/ML and Isabelle/ML" which discusses how  
multicore support was added to Poly/ML.  I am not an expert so if I  
got details wrong please correct me but the hope is Poly/ML's story  
might be helpful for Ocaml.


http://functional-orbitz.blogspot.com/2011/04/jc-efficient-parallel-programming-in.html


On Apr 23, 2011, at 8:33 PM, Eray Ozkural wrote:

> Don't get me wrong, I think Clojure is a decent language. I have no  
> qualms with it :) I just don't think I would use it for HPC. Even  
> O'Caml is too slow for writing distributed memory parallel code. I  
> can't imagine those other things....
>
> In the end, you guys are going to make me write yet another  
> functional language :D
>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-24 14:57 ` Gerd Stolpmann
  2011-03-24 15:03   ` Joel Reymont
@ 2011-03-25 19:49   ` Richard W.M. Jones
  1 sibling, 0 replies; 74+ messages in thread
From: Richard W.M. Jones @ 2011-03-25 19:49 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Alexy Khrabrov, caml-list

On Thu, Mar 24, 2011 at 03:57:55PM +0100, Gerd Stolpmann wrote:
> Generally, I'd say there is right now no perfect solution to the
> problem. As language developer, you have to foresee future hardware to
> some degree. When more and more cores are added, the characteristics of
> the hardware will gradually shift from SMP to NUMA, and the more it
> does, message passing techniques will become more attractive.

Up at the high end systems we test for RHEL, it's *all* very NUMA, has
been for quite some time.

Rich.

-- 
Richard Jones
Red Hat

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-24 15:28     ` Guillaume Yziquel
@ 2011-03-24 15:48       ` Gerd Stolpmann
  0 siblings, 0 replies; 74+ messages in thread
From: Gerd Stolpmann @ 2011-03-24 15:48 UTC (permalink / raw)
  To: Guillaume Yziquel; +Cc: Joel Reymont, Alexy Khrabrov, caml-list

Am Donnerstag, den 24.03.2011, 16:28 +0100 schrieb Guillaume Yziquel:
> Le Thursday 24 Mar 2011 à 15:03:18 (+0000), Joel Reymont a écrit :
> > I'm using a combination of OCaml and ZeroMQ which allows for many combinations of workers, routers, dispatchers, etc. 
> > 
> > Works for me and I recommend it.
> 
> How does ZeroMQ pertain to shared memory?
> 
> Gerd: Do you plan to include transactional shared memory in the explicit
> shared memory datastructure you implemented?

Well, the first goal is to allow some form of garbage collection. Right
now you can only move ocaml values to shared memory, but there is no way
to know whether it is still referenced.

There will probably some way of putting a transaction manager on top of
this, but again everything is explicit. E.g. you could have a hashtable
mapping transaction IDs to values, and first when a worker has produced
a complete value, it is added to this hashtable. A commit would have to
check whether the input transactions are still the same.

So, yes, some room for experimentation here at least. But don't expect
something working or something automatic.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-24 15:03   ` Joel Reymont
  2011-03-24 15:28     ` Guillaume Yziquel
@ 2011-03-24 15:38     ` Gerd Stolpmann
  1 sibling, 0 replies; 74+ messages in thread
From: Gerd Stolpmann @ 2011-03-24 15:38 UTC (permalink / raw)
  To: Joel Reymont; +Cc: Alexy Khrabrov, caml-list

Am Donnerstag, den 24.03.2011, 15:03 +0000 schrieb Joel Reymont:
> I'm using a combination of OCaml and ZeroMQ which allows for many combinations of workers, routers, dispatchers, etc. 
> 
> Works for me and I recommend it.

Right, for server architectures you can use middleware like message
queues. This is data-driven parallelism - works nicely when your data
allows it, and you don't need special features of the language runtime.

It becomes more complicated when your data doesn't allow this structure
(e.g. think of a matrix multiplication), or when you need very short
reaction times (in the low microseconds range), or the computations are
comparatively cheap, and the communication overhead would dominate.

A few months ago I started an experiment: I wanted to speed up a
function that sorts a large array of strings, and parallelized it. It
turned out that all sort of communication and synchronization are
already that expensive so that I finally only got a speedup on a quad
core for very large data sets (in the multi-gigabyte range). As sorting
is O(n log n) every data copy (which is O(n)) is already measurable, and
the distribution of the input data to the workers and the gathering of
the result was that costly that I first got an improvement for such
large data sets.

Gerd

> 
> On Mar 24, 2011, at 2:57 PM, Gerd Stolpmann wrote:
> 
> > So, here is my answer to your latter question: I've started a library
> > for using independent processes as workers that can communicate via
> > explicit shared memory (see
> > https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/src/netmulticore/, and https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/examples/multicore/chameneos.ml for a coding sample). This seems at least to be a solution in the Unix world for coarse-grained parallelism that is not dominated by the communication overhead between the processes.
> 
> --------------------------------------------------------------------------
> - for hire: mac osx device driver ninja, kernel extensions and usb drivers
> ---------------------+------------+---------------------------------------
> http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
> ---------------------+------------+---------------------------------------
> 
> 
> 
> 

-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-24 15:03   ` Joel Reymont
@ 2011-03-24 15:28     ` Guillaume Yziquel
  2011-03-24 15:48       ` Gerd Stolpmann
  2011-03-24 15:38     ` Gerd Stolpmann
  1 sibling, 1 reply; 74+ messages in thread
From: Guillaume Yziquel @ 2011-03-24 15:28 UTC (permalink / raw)
  To: Joel Reymont; +Cc: Gerd Stolpmann, Alexy Khrabrov, caml-list

Le Thursday 24 Mar 2011 à 15:03:18 (+0000), Joel Reymont a écrit :
> I'm using a combination of OCaml and ZeroMQ which allows for many combinations of workers, routers, dispatchers, etc. 
> 
> Works for me and I recommend it.

How does ZeroMQ pertain to shared memory?

Gerd: Do you plan to include transactional shared memory in the explicit
shared memory datastructure you implemented?

-- 
     Guillaume Yziquel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-24 14:57 ` Gerd Stolpmann
@ 2011-03-24 15:03   ` Joel Reymont
  2011-03-24 15:28     ` Guillaume Yziquel
  2011-03-24 15:38     ` Gerd Stolpmann
  2011-03-25 19:49   ` Richard W.M. Jones
  1 sibling, 2 replies; 74+ messages in thread
From: Joel Reymont @ 2011-03-24 15:03 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Alexy Khrabrov, caml-list

I'm using a combination of OCaml and ZeroMQ which allows for many combinations of workers, routers, dispatchers, etc. 

Works for me and I recommend it.

On Mar 24, 2011, at 2:57 PM, Gerd Stolpmann wrote:

> So, here is my answer to your latter question: I've started a library
> for using independent processes as workers that can communicate via
> explicit shared memory (see
> https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/src/netmulticore/, and https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/examples/multicore/chameneos.ml for a coding sample). This seems at least to be a solution in the Unix world for coarse-grained parallelism that is not dominated by the communication overhead between the processes.

--------------------------------------------------------------------------
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
---------------------+------------+---------------------------------------
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
---------------------+------------+---------------------------------------





^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Caml-list] Efficient OCaml multicore -- roadmap?
  2011-03-24 13:44 Alexy Khrabrov
@ 2011-03-24 14:57 ` Gerd Stolpmann
  2011-03-24 15:03   ` Joel Reymont
  2011-03-25 19:49   ` Richard W.M. Jones
  0 siblings, 2 replies; 74+ messages in thread
From: Gerd Stolpmann @ 2011-03-24 14:57 UTC (permalink / raw)
  To: Alexy Khrabrov; +Cc: caml-list

Am Donnerstag, den 24.03.2011, 09:44 -0400 schrieb Alexy Khrabrov:
> Where does the OCaml team stand on the multicore issues?  A year or so ago, when there was a prototype parallel GC implementation, IIRC, Xavier said it has to be done right.  So what are the official plans and the status of integrating what volunteers had done?
> 
> WIth Scala having a robust actors model and AKKA kernel, and Clojure built around efficient shared memory concurrency with agents and references and STM, and Haskell also really parallel, OCaml is lacking behind.  Furthermore, F# builds on strongly parallel .NET, overcoming granddaddy.  With multicores common even in laptops and iPads, we need an efficient  multicore OCaml!  Due to the model different from Haskell or Scala and Clojure, now all on github, OCaml is both more stable and also is slower to advance -- what do folks think about this situation?  How do you do shared memory parallelism now?

I'm a bit reluctant to answering to this at all, because there is a big
misunderstanding in the world about what parallelism is good for. Some
people think it is a technique for doing micro optimizations in their
code (i.e. parallelize some inner loops). Questions about "shared memory
concurrency" often go into this direction because existing code can
profit from it. Although there is nothing wrong in principal about this,
one should be very aware that the usefulness is very limited - often
only code speedups of a few percent of the total runtime can be
achieved, and it is often restricted to two cores. If you look at
programs that are designed for parallelism, the whole story is a
completely different one, though.

So, here is my answer to your latter question: I've started a library
for using independent processes as workers that can communicate via
explicit shared memory (see
https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/src/netmulticore/, and https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/examples/multicore/chameneos.ml for a coding sample). This seems at least to be a solution in the Unix world for coarse-grained parallelism that is not dominated by the communication overhead between the processes. Available features:

- manage worker processes
- the processes can use files, shared memory, and semaphores
- processes can communicate to each other by sending messages to
  "camlboxes". By tricky implementation, there is only overhead for 
  serialization but not deserialization which makes message passing
  really fast

Of course, this is all very experimental. At least it demonstrates that
exploiting shared memory between independent processes is possible. For
getting a real programming environment, one needs to put more effort
into this, though, especially we would need data structures that can
live in shared memory (e.g. hashtables).

The resulting programming model is of course not the usual
multi-threading model. In particular, values are not automatically put
into shared memory. This has to be explicitly managed. IMHO, this is not
a weakness, but rather a strong point (and not everybody will understand
this - it's a somehow comparable to banning "goto" from higher-level
languages). It also means that you cannot use parallelism for doing
micro optimizations - the whole program has to be re-designed.

Generally, I'd say there is right now no perfect solution to the
problem. As language developer, you have to foresee future hardware to
some degree. When more and more cores are added, the characteristics of
the hardware will gradually shift from SMP to NUMA, and the more it
does, message passing techniques will become more attractive.

I'd like to see Ocaml doing here well. Let the others focus on SMP
architectures (which will, at a certain point of hardware development,
only be found in laptops but not in real compute hardware).

Gerd

> 
> Cheers,
> Alexy
> 

-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Caml-list] Efficient OCaml multicore -- roadmap?
@ 2011-03-24 13:44 Alexy Khrabrov
  2011-03-24 14:57 ` Gerd Stolpmann
  0 siblings, 1 reply; 74+ messages in thread
From: Alexy Khrabrov @ 2011-03-24 13:44 UTC (permalink / raw)
  To: caml-list

Where does the OCaml team stand on the multicore issues?  A year or so ago, when there was a prototype parallel GC implementation, IIRC, Xavier said it has to be done right.  So what are the official plans and the status of integrating what volunteers had done?

WIth Scala having a robust actors model and AKKA kernel, and Clojure built around efficient shared memory concurrency with agents and references and STM, and Haskell also really parallel, OCaml is lacking behind.  Furthermore, F# builds on strongly parallel .NET, overcoming granddaddy.  With multicores common even in laptops and iPads, we need an efficient  multicore OCaml!  Due to the model different from Haskell or Scala and Clojure, now all on github, OCaml is both more stable and also is slower to advance -- what do folks think about this situation?  How do you do shared memory parallelism now?

Cheers,
Alexy

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2011-04-28 14:42 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <2054357367.219171.1300974318806.JavaMail.root@zmbs4.inria.fr>
2011-03-24 23:13 ` [Caml-list] Efficient OCaml multicore -- roadmap? Fabrice Le Fessant
2011-03-25  0:23   ` [Caml-list] " Sylvain Le Gall
2011-03-25  9:55   ` [Caml-list] " Alain Frisch
2011-03-25 11:44     ` Gerd Stolpmann
     [not found]   ` <1396338209.232813.1301046980856.JavaMail.root@zmbs4.inria.fr>
2011-03-25 10:23     ` Fabrice Le Fessant
2011-03-25 12:07       ` Gerd Stolpmann
2011-04-16 12:12         ` Jon Harrop
2011-03-25 10:51   ` Hugo Ferreira
2011-03-25 12:25     ` Gerd Stolpmann
2011-03-25 12:58       ` Hugo Ferreira
     [not found]       ` <341494683.237537.1301057887481.JavaMail.root@zmbs4.inria.fr>
2011-03-25 13:10         ` Fabrice Le Fessant
2011-03-25 13:41           ` Dario Teixeira
2011-03-30 18:12             ` Jon Harrop
2011-03-25 15:44           ` Hugo Ferreira
2011-03-25 18:24             ` Martin Jambon
2011-03-25 19:19               ` Hugo Ferreira
2011-03-25 20:26                 ` Gerd Stolpmann
2011-03-26  9:11                   ` Hugo Ferreira
2011-03-26 10:31                   ` Richard W.M. Jones
2011-03-30 16:56                     ` Jon Harrop
2011-03-30 19:24                       ` Richard W.M. Jones
2011-04-20 21:44                   ` Jon Harrop
2011-04-19  9:57                 ` Eray Ozkural
2011-04-19 10:05                   ` Hugo Ferreira
2011-04-19 20:26                   ` Gerd Stolpmann
2011-04-20  7:59                     ` Hugo Ferreira
2011-04-20 12:30                       ` Markus Mottl
2011-04-20 12:53                         ` Hugo Ferreira
2011-04-20 13:22                           ` Markus Mottl
2011-04-20 14:00                       ` Edgar Friendly
2011-04-19 22:49                 ` Jon Harrop
2011-03-30 17:02               ` Jon Harrop
2011-04-20 19:23               ` Jon Harrop
2011-04-20 20:05                 ` Alexy Khrabrov
2011-04-20 23:00                   ` Jon Harrop
     [not found]                   ` <76544177.594058.1303341821437.JavaMail.root@zmbs4.inria.fr>
2011-04-21  7:48                     ` Fabrice Le Fessant
2011-04-21  8:35                       ` Hugo Ferreira
2011-04-23 17:32                         ` Jon Harrop
2011-04-21  9:09                       ` Alain Frisch
     [not found]                         ` <20110421.210304.1267840107736400776.Christophe.Troestler+ocaml@umons.ac.be>
2011-04-21 19:53                           ` Hezekiah M. Carty
2011-04-22  8:34                           ` Alain Frisch
     [not found]                         ` <799994864.610698.1303412613509.JavaMail.root@zmbs4.inria.fr>
2011-04-22  8:06                           ` Fabrice Le Fessant
2011-04-22  9:11                             ` Gerd Stolpmann
2011-04-23 10:17                               ` Eray Ozkural
2011-04-23 13:47                                 ` Alexy Khrabrov
2011-04-23 17:39                                   ` Eray Ozkural
2011-04-23 20:18                                     ` Alexy Khrabrov
2011-04-23 21:18                                       ` Jon Harrop
2011-04-24  0:33                                       ` Eray Ozkural
2011-04-28 14:42                                         ` orbitz
2011-04-23 19:02                                 ` Jon Harrop
2011-04-22  9:44                             ` Vincent Aravantinos
2011-04-21 10:09                       ` Philippe Strauss
2011-04-23 17:44                         ` Jon Harrop
2011-04-23 17:05                       ` Jon Harrop
2011-04-20 20:30                 ` Gerd Stolpmann
2011-04-20 23:33                   ` Jon Harrop
2011-03-25 20:27             ` Philippe Strauss
2011-04-19 22:47           ` Jon Harrop
     [not found]           ` <869445701.579183.1303253283515.JavaMail.root@zmbs4.inria.fr>
2011-04-20  9:25             ` Fabrice Le Fessant
2011-03-25 18:45   ` Andrei Formiga
2011-03-30 17:00     ` Jon Harrop
2011-04-13  3:36   ` Lucas Dixon
2011-04-13 13:01     ` Gerd Stolpmann
2011-04-13 13:09       ` Lucas Dixon
2011-04-13 23:04       ` Goswin von Brederlow
2011-04-16 13:54     ` Jon Harrop
2011-03-24 13:44 Alexy Khrabrov
2011-03-24 14:57 ` Gerd Stolpmann
2011-03-24 15:03   ` Joel Reymont
2011-03-24 15:28     ` Guillaume Yziquel
2011-03-24 15:48       ` Gerd Stolpmann
2011-03-24 15:38     ` Gerd Stolpmann
2011-03-25 19:49   ` Richard W.M. Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).