From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <goswin-v-b@web.de>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by yquem.inria.fr (Postfix) with ESMTP id D72E7BBAF
	for <caml-list@yquem.inria.fr>; Thu, 24 Dec 2009 14:21:02 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AiADANb4MkvZSMDdi2dsb2JhbACBSpoOAQoLCgcRBbhphDME
X-IronPort-AV: E=Sophos;i="4.47,449,1257116400"; 
   d="scan'208";a="42692273"
Received: from fmmailgate01.web.de ([217.72.192.221])
  by mail1-smtp-roc.national.inria.fr with ESMTP; 24 Dec 2009 14:21:02 +0100
Received: from smtp08.web.de (fmsmtp08.dlan.cinetic.de [172.20.5.216])
	by fmmailgate01.web.de (Postfix) with ESMTP id 0BE141432AA08;
	Thu, 24 Dec 2009 14:19:55 +0100 (CET)
Received: from [95.208.117.111] (helo=frosties.localdomain)
	by smtp08.web.de with asmtp (TLSv1:AES256-SHA:256)
	(WEB.DE 4.110 #314)
	id 1NNnbc-0003Rr-00; Thu, 24 Dec 2009 14:19:52 +0100
Received: from mrvn by frosties.localdomain with local (Exim 4.69)
	(envelope-from <goswin-v-b@web.de>)
	id 1NNnbc-000546-7F; Thu, 24 Dec 2009 14:19:52 +0100
From: Goswin von Brederlow <goswin-v-b@web.de>
To: Jon Harrop <jon@ffconsultancy.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: multicore wish
References: <4B2D2BC1.6020204@msu.edu>
	<200912211953.23880.jon@ffconsultancy.com>
	<87oclrrw8o.fsf@frosties.localdomain>
	<200912221912.51017.jon@ffconsultancy.com>
Date: Thu, 24 Dec 2009 14:19:52 +0100
In-Reply-To: <200912221912.51017.jon@ffconsultancy.com> (Jon Harrop's message
	of "Tue, 22 Dec 2009 19:12:50 +0000")
Message-ID: <87r5qk5x1j.fsf@frosties.localdomain>
User-Agent: Gnus/5.110006 (No Gnus v0.6) XEmacs/21.4.22 (linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: goswin-v-b@web.de
X-Sender: goswin-v-b@web.de
X-Provags-ID: V01U2FsdGVkX18LCoRMx1WlkyB7OeCvJVC6c07LKnVkeBO83XP8
	9HwvdXAbaaHldVGb65MK7RawQBl7exDPa6HwAMudddhZY4577B
	v5g4AJZPc=
X-Spam: no; 0.00; ocaml:01 pointer:01 ocaml:01 pointers:01 pointers:01 locality:01 locality:01 speedup:01 unboxed:01 subset:01 scheduler:01 recursive:01 invocation:01 deques:01 recursive:01 

Jon Harrop <jon@ffconsultancy.com> writes:

> On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote:
>> Jon Harrop <jon@ffconsultancy.com> writes:
>> > 1. The array "a" is just an ordinary array of any type of values on the
>> > shared heap in F# but, for generality in OCaml, this must be both the
>> > underlying ordinary data and a manually-managed shared big array of
>> > indices into the ordinary data where the indices get sorted while the
>> > original data remain in place until they are permuted at the end.
>>
>> Unless you have a primitive type that isn't a pointer.
>
> In OCaml, you would need to write a custom quicksort optimized for that 
> particular type. In F#, the generic version just works and works efficiently.
>
>> The advantage with ocaml though is that you never have pointers into a
>> structure. Makes thinks a lot simpler for the GC and avoids large
>> overheads in memory.
>
> I don't understand what you mean by OCaml "never has pointers into a 
> structure". Half the problem with OCaml is that OCaml almost always uses 
> pointers and the programmer has no choice, e.g. for complex numbers.
>
>> > 2. The sorted subarrays are contiguous in memory and, at some
>> > subdivision, will fit into L2 cache. So F# offers optimal locality. In
>> > contrast, there is no locality whatsoever in the OCaml code and most
>> > accesses into the unsorted original array will incur cache misses right
>> > up to main memory. So the OCaml approach does not scale as well and will
>> > never see superlinear speedup because it cannot be cache friendly.
>>
>> On the other hand swapping two elements in the array has a constant
>> cost no matter what size they have. At some size there will be a break
>> even point where copying the data costs more than the cache misses and
>> with increasing size the cache won't help F# so much either.
>
> In theory, yes. In practice, that threshold is far larger than any value type 
> that a real program would use so it is of no practical concern. Moreover, F# 
> gives the programmer control over whether data are unboxed (value types) or 
> boxed (reference types) anyway. In contrast, OCaml is tied to a few value 
> types that are hard-coded into the GC.
>
>> > 3. Child tasks are likely to be executed on the same core as their parent
>> > and use a subset of their parent's data in F#, so they offer the best
>> > possible locality. In contrast, child processes are likely to be executed
>> > on another core in OCaml and offer the worst possible locality.
>>
>> But, if I understood you right, you first fork one process per
>> core.
>
> No, in OCaml I fork every child. That is the only transparent way to give the 
> child a coherent view of the heap but it is extremely slow (~1ms):

So if you add a (sleep 60) to the ocaml code then ocaml gets even
more worse than F#? You are comparing an implementation that is
specifically optimized to use things F# is good at and ocaml is
bad. To give a fair comparison you need to optimize the implementation
to the language.

>   "F# can do 60MFLOPS of computation in the time it takes OCaml 
> to fork a single process" -
> http://caml.inria.fr/pub/ml-archives/caml-list/2009/06/542b8bed77022b4a4824de2da5b7f714.en.html
>
>> Those should then also each pin themself to one core.
>
> You have no idea which core a forked child process will run on.

I can pin each child to one core and the scheduler will move it there.
I would expect a multi-core ocaml to actualy do this internaly already
so there is on GC threads per core ocaml runs on. But that depends on
the GC implementation that will be used.

>> Each process then has a work queue which is works through. So they will 
>> always use the local data. Only when a queue runs dry they steal from
>> another process and ruin locality.
>
> You are correctly describing the efficient solution based upon work-stealing 
> queues that F# uses but OCaml cannot express it.

You mean that you didn't implement that way. I can easily express that
with closures and pre-forked worker threads.

>> So I don't see where your argument fits. You are not creating childs
>> on the fly. Only at the start and they run till all the work is done.
>> At least in this example.
>
> No, every recursive invocation of the parallel quicksort spawns another child 
> on-the-fly. That's precisely why it parallelizes so efficiently when you have 
> wait-free work-stealing task deques and a shared heap. In general, you 
> rewrite algorithms into this recursive divide and conquer form and 
> parallelize when possible. You can parallelize a *lot* of problems 
> efficiently that way.

That seems awfully ineficient. Then you have a million children
running on maybe 4 cores and each job queue holds no job.

I think we mean different things by children. By children I mean what
the kernel sees as children. Newly forked threads. I think you mean
jobs that get put into the queues. There is no reason to fork a new
system thread every time the parallel quicksort splits an array in
two.

>> > 4. Task deques can handle an arbitrary number of tasks limited only by
>> > memory whereas processes are a scarce resource and forking is likely to
>> > fail, whereupon the "invoke" combinator will simply execute sequentially.
>> > So it is much easier to write reliable and performant code in F# than
>> > OCaml.
>>
>> Why would you fork in invoke?
>
> Fork is currently the only transparent way to implement "invoke" but it is 
> extremely slow and unreliable.

No, it is the insane way. Use closures.

>> > 5. OCaml's fork-based "invoke" combinator is many orders of magnitude
>> > slower than pushing a closure onto a concurrent task deque in F#.
>> >
>> > 6. The "threshold" value is a magic number derived from measurements on a
>> > given machine in my OCaml code but is dynamically adjusted in a
>> > machine-independent way by the "invoke" combinator in my F# code using
>> > atomic operations and real time profiling of the proportion of time spent
>> > spawning tasks vs doing actual work.
>>
>> 5+6 seem to be an implementation detail of some specific
>> implementation you are talking about.
>
> Yes. I'm talking about today's OCaml and F# implementations.
>
>> I don't see anything in the theory that would require that.
>
> If by "in theory" you mean that a new performant concurrent GC for OCaml would 
> solve these problems then yes. But I doubt OCaml is ever going to get one.
>
>> > The same basic principles apply to many algorithms. Although it can
>> > sometimes be tricky to figure out how best to use this technology to
>> > parallelize a given algorithm (e.g. successive over relaxation), I have
>> > found that a great many algorithms can be parallelized effectively using
>> > this approach when you have a suitable foundation in place (like the
>> > TPL). Moreover, the ability to use ordinary constructs in F# instead of
>> > hacks like type-specific shared memory big arrays in OCaml makes it a lot
>> > easier to parallelize programs. My parallel Burrows-Wheeler Transform
>> > (BWT), for example, took 30 minutes to develop in F# and 2 days in OCaml.
>>
>> It might be true that ocaml lacks some primitives for multi-core
>> operations. But I would say that is mostly because so far it hasn't
>> supported multi-core.
>
> Well, yes. :-)
>
>> If F# has great support for this then that might be a good place to steal
>> some ideas from. 
>
> The infrastructure for this kind of shared memory parallel programming is 
> really very simple. You just need a GC that can handle a shared heap (which 

Which there is one example so far for ocaml. Maybe not a verry good
one. Last simple benchmarking I remember it was like 3 times slower
than mono-core.

> HLVM already has) and work-stealing task deques. Then you can easily write 

Task deques you can implement given a multi-core GC, worker threads
and using closures.

> parallel programs that leverage multicores and run a *lot* faster than 
> anything that can be written in OCaml. You can even make this accessible to 
> OCaml programmers as a DSL with automatic interop to make high performance 
> parallel programming as easy as possible from OCaml.

Or just implement it properly in ocaml. The problem part is the
GC. The rest is easy.

>> But so far have heart nothing that would make F# fundamentally more capable
>> than ocaml could become.
>
> In theory, OCaml could catch up with F#. In practice, the core of OCaml's 
> implementation is so heavily optimized in another direction (and that will 
> not change because it is OCaml's raison d'être) that it is worth starting 
> from scratch. Modern libraries like LLVM make this comparatively painless and 
> the improvements are vast. And besides, it is great fun! :-)

I think there is an ocaml frontend for LLVM. So someone already did.

MfG
        Goswin