From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jon@ffconsultancy.com>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id 3FE77BBAF
	for <caml-list@yquem.inria.fr>; Tue, 22 Dec 2009 18:59:04 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Al0BACqYMEvUnwdji2dsb2JhbACCGZkvAQEBCgsKBxEGumiEMwQ
X-IronPort-AV: E=Sophos;i="4.47,437,1257116400"; 
   d="scan'208";a="40433668"
Received: from relay.pcl-ipout01.plus.net ([212.159.7.99])
  by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/RC4-SHA; 22 Dec 2009 18:58:45 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: ApgFALKXMEvUnw4S/2dsb2JhbACCGdRXhDME
Received: from pih-relay05.plus.net ([212.159.14.18])
  by relay.pcl-ipout01.plus.net with ESMTP; 22 Dec 2009 17:58:44 +0000
Received: from [87.114.35.173] (helo=leper.local)
	 by pih-relay05.plus.net with esmtp (Exim) id 1NN90N-0007vj-OR
	for caml-list@yquem.inria.fr; Tue, 22 Dec 2009 17:58:44 +0000
From: Jon Harrop <jon@ffconsultancy.com>
Organization: Flying Frog Consultancy Ltd.
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: multicore wish
Date: Tue, 22 Dec 2009 19:12:50 +0000
User-Agent: KMail/1.9.9
References: <4B2D2BC1.6020204@msu.edu> <200912211953.23880.jon@ffconsultancy.com> <87oclrrw8o.fsf@frosties.localdomain>
In-Reply-To: <87oclrrw8o.fsf@frosties.localdomain>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Message-Id: <200912221912.51017.jon@ffconsultancy.com>
X-Plusnet-Relay: 8cecb776d52fdcbb7adfdcf887bf32d3
X-Spam: no; 0.00; ocaml:01 pointer:01 ocaml:01 pointers:01 pointers:01 locality:01 locality:01 speedup:01 unboxed:01 subset:01 recursive:01 invocation:01 deques:01 recursive:01 deques:01 

On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote:
> Jon Harrop <jon@ffconsultancy.com> writes:
> > 1. The array "a" is just an ordinary array of any type of values on the
> > shared heap in F# but, for generality in OCaml, this must be both the
> > underlying ordinary data and a manually-managed shared big array of
> > indices into the ordinary data where the indices get sorted while the
> > original data remain in place until they are permuted at the end.
>
> Unless you have a primitive type that isn't a pointer.

In OCaml, you would need to write a custom quicksort optimized for that=20
particular type. In F#, the generic version just works and works efficientl=
y.

> The advantage with ocaml though is that you never have pointers into a
> structure. Makes thinks a lot simpler for the GC and avoids large
> overheads in memory.

I don't understand what you mean by OCaml "never has pointers into a=20
structure". Half the problem with OCaml is that OCaml almost always uses=20
pointers and the programmer has no choice, e.g. for complex numbers.

> > 2. The sorted subarrays are contiguous in memory and, at some
> > subdivision, will fit into L2 cache. So F# offers optimal locality. In
> > contrast, there is no locality whatsoever in the OCaml code and most
> > accesses into the unsorted original array will incur cache misses right
> > up to main memory. So the OCaml approach does not scale as well and will
> > never see superlinear speedup because it cannot be cache friendly.
>
> On the other hand swapping two elements in the array has a constant
> cost no matter what size they have. At some size there will be a break
> even point where copying the data costs more than the cache misses and
> with increasing size the cache won't help F# so much either.

In theory, yes. In practice, that threshold is far larger than any value ty=
pe=20
that a real program would use so it is of no practical concern. Moreover, F=
#=20
gives the programmer control over whether data are unboxed (value types) or=
=20
boxed (reference types) anyway. In contrast, OCaml is tied to a few value=20
types that are hard-coded into the GC.

> > 3. Child tasks are likely to be executed on the same core as their pare=
nt
> > and use a subset of their parent's data in F#, so they offer the best
> > possible locality. In contrast, child processes are likely to be execut=
ed
> > on another core in OCaml and offer the worst possible locality.
>
> But, if I understood you right, you first fork one process per
> core.

No, in OCaml I fork every child. That is the only transparent way to give t=
he=20
child a coherent view of the heap but it is extremely slow (~1ms):

  "F# can do 60MFLOPS of computation in the time it takes OCaml=20
to fork a single process" -
http://caml.inria.fr/pub/ml-archives/caml-list/2009/06/542b8bed77022b4a4824=
de2da5b7f714.en.html

> Those should then also each pin themself to one core.

You have no idea which core a forked child process will run on.

> Each process then has a work queue which is works through. So they will=20
> always use the local data. Only when a queue runs dry they steal from
> another process and ruin locality.

You are correctly describing the efficient solution based upon work-stealin=
g=20
queues that F# uses but OCaml cannot express it.

> So I don't see where your argument fits. You are not creating childs
> on the fly. Only at the start and they run till all the work is done.
> At least in this example.

No, every recursive invocation of the parallel quicksort spawns another chi=
ld=20
on-the-fly. That's precisely why it parallelizes so efficiently when you ha=
ve=20
wait-free work-stealing task deques and a shared heap. In general, you=20
rewrite algorithms into this recursive divide and conquer form and=20
parallelize when possible. You can parallelize a *lot* of problems=20
efficiently that way.

> > 4. Task deques can handle an arbitrary number of tasks limited only by
> > memory whereas processes are a scarce resource and forking is likely to
> > fail, whereupon the "invoke" combinator will simply execute sequentiall=
y.
> > So it is much easier to write reliable and performant code in F# than
> > OCaml.
>
> Why would you fork in invoke?

=46ork is currently the only transparent way to implement "invoke" but it i=
s=20
extremely slow and unreliable.

> > 5. OCaml's fork-based "invoke" combinator is many orders of magnitude
> > slower than pushing a closure onto a concurrent task deque in F#.
> >
> > 6. The "threshold" value is a magic number derived from measurements on=
 a
> > given machine in my OCaml code but is dynamically adjusted in a
> > machine-independent way by the "invoke" combinator in my F# code using
> > atomic operations and real time profiling of the proportion of time spe=
nt
> > spawning tasks vs doing actual work.
>
> 5+6 seem to be an implementation detail of some specific
> implementation you are talking about.

Yes. I'm talking about today's OCaml and F# implementations.

> I don't see anything in the theory that would require that.

If by "in theory" you mean that a new performant concurrent GC for OCaml wo=
uld=20
solve these problems then yes. But I doubt OCaml is ever going to get one.

> > The same basic principles apply to many algorithms. Although it can
> > sometimes be tricky to figure out how best to use this technology to
> > parallelize a given algorithm (e.g. successive over relaxation), I have
> > found that a great many algorithms can be parallelized effectively using
> > this approach when you have a suitable foundation in place (like the
> > TPL). Moreover, the ability to use ordinary constructs in F# instead of
> > hacks like type-specific shared memory big arrays in OCaml makes it a l=
ot
> > easier to parallelize programs. My parallel Burrows-Wheeler Transform
> > (BWT), for example, took 30 minutes to develop in F# and 2 days in OCam=
l.
>
> It might be true that ocaml lacks some primitives for multi-core
> operations. But I would say that is mostly because so far it hasn't
> supported multi-core.

Well, yes. :-)

> If F# has great support for this then that might be a good place to steal
> some ideas from.=20

The infrastructure for this kind of shared memory parallel programming is=20
really very simple. You just need a GC that can handle a shared heap (which=
=20
HLVM already has) and work-stealing task deques. Then you can easily write=
=20
parallel programs that leverage multicores and run a *lot* faster than=20
anything that can be written in OCaml. You can even make this accessible to=
=20
OCaml programmers as a DSL with automatic interop to make high performance=
=20
parallel programming as easy as possible from OCaml.

> But so far have heart nothing that would make F# fundamentally more capab=
le
> than ocaml could become.

In theory, OCaml could catch up with F#. In practice, the core of OCaml's=20
implementation is so heavily optimized in another direction (and that will=
=20
not change because it is OCaml's raison d'=EAtre) that it is worth starting=
=20
from scratch. Modern libraries like LLVM make this comparatively painless a=
nd=20
the improvements are vast. And besides, it is great fun! :-)

=2D-=20
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e