From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <goswin-v-b@web.de>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105])
	by yquem.inria.fr (Postfix) with ESMTP id E6DD4BBAF
	for <caml-list@yquem.inria.fr>; Tue, 22 Dec 2009 14:09:48 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AtwCADNUMEvZSMDji2dsb2JhbACBSploAQEBCgsKBxEFuWOEMwQ
X-IronPort-AV: E=Sophos;i="4.47,436,1257116400"; 
   d="scan'208";a="52649061"
Received: from fmmailgate02.web.de ([217.72.192.227])
  by mail4-smtp-sop.national.inria.fr with ESMTP; 22 Dec 2009 14:09:29 +0100
Received: from smtp06.web.de (fmsmtp06.dlan.cinetic.de [172.20.5.172])
	by fmmailgate02.web.de (Postfix) with ESMTP id 093B7149E9203;
	Tue, 22 Dec 2009 14:09:29 +0100 (CET)
Received: from [95.208.117.111] (helo=frosties.localdomain)
	by smtp06.web.de with asmtp (TLSv1:AES256-SHA:256)
	(WEB.DE 4.110 #314)
	id 1NN4UR-0003tx-00; Tue, 22 Dec 2009 14:09:27 +0100
Received: from mrvn by frosties.localdomain with local (Exim 4.69)
	(envelope-from <goswin-v-b@web.de>)
	id 1NN4UR-0008Ci-4W; Tue, 22 Dec 2009 14:09:27 +0100
From: Goswin von Brederlow <goswin-v-b@web.de>
To: Jon Harrop <jon@ffconsultancy.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: multicore wish
References: <4B2D2BC1.6020204@msu.edu>
	<200912200443.57698.jon@ffconsultancy.com>
	<874onko3mp.fsf_-_@frosties.localdomain>
	<200912211953.23880.jon@ffconsultancy.com>
Date: Tue, 22 Dec 2009 14:09:27 +0100
In-Reply-To: <200912211953.23880.jon@ffconsultancy.com> (Jon Harrop's message
	of "Mon, 21 Dec 2009 19:53:23 +0000")
Message-ID: <87oclrrw8o.fsf@frosties.localdomain>
User-Agent: Gnus/5.110006 (No Gnus v0.6) XEmacs/21.4.22 (linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: goswin-v-b@web.de
X-Sender: goswin-v-b@web.de
X-Provags-ID: V01U2FsdGVkX1/95dZJdVPiJ/FatXbO+yFk8tIJMUUhR6wMbpwf
	Fs20NSETVacQFEwhF+bpDeOu1/Q5UB4BcbMdtihO5O3tsrLTeb
	2XyVsyec8=
X-Spam: no; 0.00; deques:01 deque:01 deque:01 locality:01 speedups:01 combinator:01 ocaml:01 ocaml:01 pointer:01 pointers:01 locality:01 speedup:01 subset:01 deques:01 combinator:01 

Jon Harrop <jon@ffconsultancy.com> writes:

> Cilk pioneered wait-free work-stealing task deques and Microsoft's Task 
> Parallel Library (which will be part of .NET 4 in March 2010) copied the 
> idea. You have a separate deque of tasks for each core. A core tries to pop a 
> task off its deque. If there are no tasks on its deque then it tries to pop 
> off a randomly-chosen other deque. If there is a task then the core begins 
> executing that task and any child tasks are pushed back onto the local deque 
> (and might get stolen before they are executed).
>
> An important characteristic of this design is that child tasks are likely to 
> be executed on the same core as their parent and, therefore, locality between 
> related tasks is valuable. Consequently, many algorithms that recursively 
> subdivide a large dataset (like quicksort) attain superlinear speedups 
> because they make effective use of the sum of all L2 caches rather than just 
> a single L2 cache. So this works extremely well with the multi-level 
> hierarchical caches with interconnects that are the hallmark of multicore 
> architecture.
>
> You can then implement quicksort (and many other parallel divide and conquer 
> algorithms) something like this:
>
>   let rec serial_qsort cmp a i0 i3  =
>     if i3 - i0 > 1 then
>       let i1, i2 = partition cmp a i0 i3 in
>       serial_qsort i0 i1;
>       serial_qsort i2 i3
>
>   let rec parallel_qsort cmp a i0 i3  =
>     if i3 - i0 > 1 then
>       if i3 - i0 < threshold then
>         serial_qsort cmp a i0 i3
>       else
>         let i1, i2 = partition cmp a i0 i3 in
>         let future = invoke (parallel_qsort cmp i0) i1 in
>         parallel_qsort i2 i3
>         future()
>
> where the "invoke" combinator pushes a task onto the local task deque and 
> returns a closure that, when applied, waits for that task to complete and 
> returns its (unit) result. The "threshold" value protects against the 
> function spending more time spawning tasks that doing actual work.
>
> You can implement this in OCaml and F# but there are some important 
> differences:
>
> 1. The array "a" is just an ordinary array of any type of values on the shared 
> heap in F# but, for generality in OCaml, this must be both the underlying 
> ordinary data and a manually-managed shared big array of indices into the 
> ordinary data where the indices get sorted while the original data remain in 
> place until they are permuted at the end.

Unless you have a primitive type that isn't a pointer.

The advantage with ocaml though is that you never have pointers into a
structure. Makes thinks a lot simpler for the GC and avoids large
overheads in memory.

> 2. The sorted subarrays are contiguous in memory and, at some subdivision, 
> will fit into L2 cache. So F# offers optimal locality. In contrast, there is 
> no locality whatsoever in the OCaml code and most accesses into the unsorted 
> original array will incur cache misses right up to main memory. So the OCaml 
> approach does not scale as well and will never see superlinear speedup 
> because it cannot be cache friendly.

On the other hand swapping two elements in the array has a constant
cost no matter what size they have. At some size there will be a break
even point where copying the data costs more than the cache misses and
with increasing size the cache won't help F# so much either.

> 3. Child tasks are likely to be executed on the same core as their parent and 
> use a subset of their parent's data in F#, so they offer the best possible 
> locality. In contrast, child processes are likely to be executed on another 
> core in OCaml and offer the worst possible locality.

But, if I understood you right, you first fork one process per
core. Those should then also each pin themself to one core. Each
process then has a work queue which is works through. So they will
always use the local data. Only when a queue runs dry they steal from
another process and ruin locality.

So I don't see where your argument fits. You are not creating childs
on the fly. Only at the start and they run till all the work is done.
At least in this example.

> 4. Task deques can handle an arbitrary number of tasks limited only by memory 
> whereas processes are a scarce resource and forking is likely to fail, 
> whereupon the "invoke" combinator will simply execute sequentially. So it is 
> much easier to write reliable and performant code in F# than OCaml.

Why would you fork in invoke? That only adds a closure to the task
queue for the core.

> 5. OCaml's fork-based "invoke" combinator is many orders of magnitude slower 
> than pushing a closure onto a concurrent task deque in F#.
>
> 6. The "threshold" value is a magic number derived from measurements on a 
> given machine in my OCaml code but is dynamically adjusted in a 
> machine-independent way by the "invoke" combinator in my F# code using atomic 
> operations and real time profiling of the proportion of time spent spawning 
> tasks vs doing actual work.

5+6 seem to be an implementation detail of some specific
implementation you are talking about. I don't see anything in the
theory that would require that.

> The same basic principles apply to many algorithms. Although it can sometimes 
> be tricky to figure out how best to use this technology to parallelize a 
> given algorithm (e.g. successive over relaxation), I have found that a great 
> many algorithms can be parallelized effectively using this approach when you 
> have a suitable foundation in place (like the TPL). Moreover, the ability to 
> use ordinary constructs in F# instead of hacks like type-specific shared 
> memory big arrays in OCaml makes it a lot easier to parallelize programs. My 
> parallel Burrows-Wheeler Transform (BWT), for example, took 30 minutes to 
> develop in F# and 2 days in OCaml.

It might be true that ocaml lacks some primitives for multi-core
operations. But I would say that is mostly because so far it hasn't
supported multi-core. If F# has great support for this then that might
be a good place to steal some ideas from. But so far have heart
nothing that would make F# fundamentally more capable than ocaml could
become.

> Hopefully, HLVM will bring these kinds of benefits to OCaml, albeit in the 
> form of a DSL for high performance parallel programming.

MfG
        Goswin