Re: multicore wish - Goswin von Brederlow

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: Goswin von Brederlow <goswin-v-b@web.de>
To: Jon Harrop <jon@ffconsultancy.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: multicore wish
Date: Tue, 22 Dec 2009 14:09:27 +0100	[thread overview]
Message-ID: <87oclrrw8o.fsf@frosties.localdomain> (raw)
In-Reply-To: <200912211953.23880.jon@ffconsultancy.com> (Jon Harrop's message of "Mon, 21 Dec 2009 19:53:23 +0000")

Jon Harrop <jon@ffconsultancy.com> writes:

> Cilk pioneered wait-free work-stealing task deques and Microsoft's Task 
> Parallel Library (which will be part of .NET 4 in March 2010) copied the 
> idea. You have a separate deque of tasks for each core. A core tries to pop a 
> task off its deque. If there are no tasks on its deque then it tries to pop 
> off a randomly-chosen other deque. If there is a task then the core begins 
> executing that task and any child tasks are pushed back onto the local deque 
> (and might get stolen before they are executed).
>
> An important characteristic of this design is that child tasks are likely to 
> be executed on the same core as their parent and, therefore, locality between 
> related tasks is valuable. Consequently, many algorithms that recursively 
> subdivide a large dataset (like quicksort) attain superlinear speedups 
> because they make effective use of the sum of all L2 caches rather than just 
> a single L2 cache. So this works extremely well with the multi-level 
> hierarchical caches with interconnects that are the hallmark of multicore 
> architecture.
>
> You can then implement quicksort (and many other parallel divide and conquer 
> algorithms) something like this:
>
>   let rec serial_qsort cmp a i0 i3  =
>     if i3 - i0 > 1 then
>       let i1, i2 = partition cmp a i0 i3 in
>       serial_qsort i0 i1;
>       serial_qsort i2 i3
>
>   let rec parallel_qsort cmp a i0 i3  =
>     if i3 - i0 > 1 then
>       if i3 - i0 < threshold then
>         serial_qsort cmp a i0 i3
>       else
>         let i1, i2 = partition cmp a i0 i3 in
>         let future = invoke (parallel_qsort cmp i0) i1 in
>         parallel_qsort i2 i3
>         future()
>
> where the "invoke" combinator pushes a task onto the local task deque and 
> returns a closure that, when applied, waits for that task to complete and 
> returns its (unit) result. The "threshold" value protects against the 
> function spending more time spawning tasks that doing actual work.
>
> You can implement this in OCaml and F# but there are some important 
> differences:
>
> 1. The array "a" is just an ordinary array of any type of values on the shared 
> heap in F# but, for generality in OCaml, this must be both the underlying 
> ordinary data and a manually-managed shared big array of indices into the 
> ordinary data where the indices get sorted while the original data remain in 
> place until they are permuted at the end.

Unless you have a primitive type that isn't a pointer.

The advantage with ocaml though is that you never have pointers into a
structure. Makes thinks a lot simpler for the GC and avoids large
overheads in memory.

> 2. The sorted subarrays are contiguous in memory and, at some subdivision, 
> will fit into L2 cache. So F# offers optimal locality. In contrast, there is 
> no locality whatsoever in the OCaml code and most accesses into the unsorted 
> original array will incur cache misses right up to main memory. So the OCaml 
> approach does not scale as well and will never see superlinear speedup 
> because it cannot be cache friendly.

On the other hand swapping two elements in the array has a constant
cost no matter what size they have. At some size there will be a break
even point where copying the data costs more than the cache misses and
with increasing size the cache won't help F# so much either.

> 3. Child tasks are likely to be executed on the same core as their parent and 
> use a subset of their parent's data in F#, so they offer the best possible 
> locality. In contrast, child processes are likely to be executed on another 
> core in OCaml and offer the worst possible locality.

But, if I understood you right, you first fork one process per
core. Those should then also each pin themself to one core. Each
process then has a work queue which is works through. So they will
always use the local data. Only when a queue runs dry they steal from
another process and ruin locality.

So I don't see where your argument fits. You are not creating childs
on the fly. Only at the start and they run till all the work is done.
At least in this example.

> 4. Task deques can handle an arbitrary number of tasks limited only by memory 
> whereas processes are a scarce resource and forking is likely to fail, 
> whereupon the "invoke" combinator will simply execute sequentially. So it is 
> much easier to write reliable and performant code in F# than OCaml.

Why would you fork in invoke? That only adds a closure to the task
queue for the core.

> 5. OCaml's fork-based "invoke" combinator is many orders of magnitude slower 
> than pushing a closure onto a concurrent task deque in F#.
>
> 6. The "threshold" value is a magic number derived from measurements on a 
> given machine in my OCaml code but is dynamically adjusted in a 
> machine-independent way by the "invoke" combinator in my F# code using atomic 
> operations and real time profiling of the proportion of time spent spawning 
> tasks vs doing actual work.

5+6 seem to be an implementation detail of some specific
implementation you are talking about. I don't see anything in the
theory that would require that.

> The same basic principles apply to many algorithms. Although it can sometimes 
> be tricky to figure out how best to use this technology to parallelize a 
> given algorithm (e.g. successive over relaxation), I have found that a great 
> many algorithms can be parallelized effectively using this approach when you 
> have a suitable foundation in place (like the TPL). Moreover, the ability to 
> use ordinary constructs in F# instead of hacks like type-specific shared 
> memory big arrays in OCaml makes it a lot easier to parallelize programs. My 
> parallel Burrows-Wheeler Transform (BWT), for example, took 30 minutes to 
> develop in F# and 2 days in OCaml.

It might be true that ocaml lacks some primitives for multi-core
operations. But I would say that is mostly because so far it hasn't
supported multi-core. If F# has great support for this then that might
be a good place to steal some ideas from. But so far have heart
nothing that would make F# fundamentally more capable than ocaml could
become.

> Hopefully, HLVM will bring these kinds of benefits to OCaml, albeit in the 
> form of a DSL for high performance parallel programming.

MfG
        Goswin

next prev parent reply	other threads:[~2009-12-22 13:09 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-12-19 19:38 OCaml is broken Jeff Shaw
2009-12-20  4:43 ` [Caml-list] " Jon Harrop
2009-12-20 12:21   ` [***SPAM*** Score/Req: 10.1/8.0] " Erik Rigtorp
2009-12-20 13:22     ` Martin Jambon
2009-12-20 13:47     ` Yaron Minsky
2009-12-20 16:01       ` Gerd Stolpmann
2009-12-21 22:50       ` [***SPAM*** Score/Req: 10.1/8.0] Re: [***SPAM*** Score/Req: 10.1/8.0] " Erik Rigtorp
2009-12-22 12:04         ` Erik Rigtorp
2009-12-22 12:27           ` Mihamina Rakotomandimby
2009-12-22 13:27           ` Gerd Stolpmann
2009-12-23 11:25             ` Erik Rigtorp
2009-12-29 12:07         ` [***SPAM*** Score/Req: 10.1/8.0] Re: [***SPAM*** Score/Req: 10.1/8.0] " Richard Jones
2009-12-20 14:27     ` Dario Teixeira
2009-12-20 21:14       ` Jon Harrop
2009-12-21  1:08         ` Gerd Stolpmann
2009-12-21  4:30           ` Jon Harrop
2009-12-21  3:58             ` Yaron Minsky
2009-12-21  5:32             ` Markus Mottl
2009-12-21 13:29               ` Jon Harrop
2009-12-26 17:08           ` orbitz
2009-12-20 19:38     ` [***SPAM*** Score/Req: 10.1/8.0] " Jon Harrop
2009-12-21 12:26       ` Mihamina Rakotomandimby
2009-12-21 14:19         ` general question, was " Keyan
2009-12-21 14:40           ` [Caml-list] " rixed
2009-12-21 14:42           ` Gerd Stolpmann
2009-12-21 15:25             ` Eray Ozkural
2009-12-21 14:50           ` Philip
2009-12-21 15:01             ` Keyan
2009-12-21 15:13               ` Stefano Zacchiroli
2009-12-21 15:27               ` Dario Teixeira
2009-12-21 15:46                 ` Jacques Carette
2009-12-21 18:50             ` Jon Harrop
2009-12-21 18:48           ` Jon Harrop
2010-01-03 10:49           ` Sylvain Le Gall
2010-01-03 20:06             ` [Caml-list] " Jon Harrop
2009-12-21 13:07     ` [***SPAM*** Score/Req: 10.1/8.0] Re: [Caml-list] " Damien Doligez
2009-12-21 13:31   ` multicore wish [Was: Re: [Caml-list] Re: OCaml is broken] Goswin von Brederlow
2009-12-21 14:19     ` multicore wish Mihamina Rakotomandimby
2009-12-21 16:15       ` [Caml-list] " Fischbacher T.
2009-12-21 17:42       ` Dario Teixeira
2009-12-21 18:43       ` Jon Harrop
2009-12-21 19:53     ` multicore wish [Was: Re: [Caml-list] Re: OCaml is broken] Jon Harrop
2009-12-22 13:09       ` Goswin von Brederlow [this message]
2009-12-22 19:12         ` [Caml-list] Re: multicore wish Jon Harrop
2009-12-22 18:02           ` Edgar Friendly
2009-12-22 19:20             ` Jon Harrop
2009-12-24 12:58               ` Goswin von Brederlow
2009-12-24 16:51                 ` Jon Harrop
2009-12-24 13:19           ` Goswin von Brederlow
2009-12-24 17:06             ` Jon Harrop
2009-12-27 12:45               ` Goswin von Brederlow
2009-12-27 16:37                 ` Jon Harrop
2009-12-28 12:28                 ` Gerd Stolpmann
2009-12-28 15:07                   ` Anil Madhavapeddy
2009-12-28 18:05                   ` Xavier Leroy
2009-12-29 16:44                     ` Gerd Stolpmann
2009-12-20 11:56 ` [***SPAM*** Score/Req: 10.1/8.0] [Caml-list] Re: OCaml is broken Erik Rigtorp

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87oclrrw8o.fsf@frosties.localdomain \
    --to=goswin-v-b@web.de \
    --cc=caml-list@yquem.inria.fr \
    --cc=jon@ffconsultancy.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).