Re: [Caml-list] Re: multicore wish - Goswin von Brederlow

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: Goswin von Brederlow <goswin-v-b@web.de>
To: Jon Harrop <jon@ffconsultancy.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: multicore wish
Date: Thu, 24 Dec 2009 14:19:52 +0100	[thread overview]
Message-ID: <87r5qk5x1j.fsf@frosties.localdomain> (raw)
In-Reply-To: <200912221912.51017.jon@ffconsultancy.com> (Jon Harrop's message of "Tue, 22 Dec 2009 19:12:50 +0000")

Jon Harrop <jon@ffconsultancy.com> writes:

> On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote:
>> Jon Harrop <jon@ffconsultancy.com> writes:
>> > 1. The array "a" is just an ordinary array of any type of values on the
>> > shared heap in F# but, for generality in OCaml, this must be both the
>> > underlying ordinary data and a manually-managed shared big array of
>> > indices into the ordinary data where the indices get sorted while the
>> > original data remain in place until they are permuted at the end.
>>
>> Unless you have a primitive type that isn't a pointer.
>
> In OCaml, you would need to write a custom quicksort optimized for that 
> particular type. In F#, the generic version just works and works efficiently.
>
>> The advantage with ocaml though is that you never have pointers into a
>> structure. Makes thinks a lot simpler for the GC and avoids large
>> overheads in memory.
>
> I don't understand what you mean by OCaml "never has pointers into a 
> structure". Half the problem with OCaml is that OCaml almost always uses 
> pointers and the programmer has no choice, e.g. for complex numbers.
>
>> > 2. The sorted subarrays are contiguous in memory and, at some
>> > subdivision, will fit into L2 cache. So F# offers optimal locality. In
>> > contrast, there is no locality whatsoever in the OCaml code and most
>> > accesses into the unsorted original array will incur cache misses right
>> > up to main memory. So the OCaml approach does not scale as well and will
>> > never see superlinear speedup because it cannot be cache friendly.
>>
>> On the other hand swapping two elements in the array has a constant
>> cost no matter what size they have. At some size there will be a break
>> even point where copying the data costs more than the cache misses and
>> with increasing size the cache won't help F# so much either.
>
> In theory, yes. In practice, that threshold is far larger than any value type 
> that a real program would use so it is of no practical concern. Moreover, F# 
> gives the programmer control over whether data are unboxed (value types) or 
> boxed (reference types) anyway. In contrast, OCaml is tied to a few value 
> types that are hard-coded into the GC.
>
>> > 3. Child tasks are likely to be executed on the same core as their parent
>> > and use a subset of their parent's data in F#, so they offer the best
>> > possible locality. In contrast, child processes are likely to be executed
>> > on another core in OCaml and offer the worst possible locality.
>>
>> But, if I understood you right, you first fork one process per
>> core.
>
> No, in OCaml I fork every child. That is the only transparent way to give the 
> child a coherent view of the heap but it is extremely slow (~1ms):

So if you add a (sleep 60) to the ocaml code then ocaml gets even
more worse than F#? You are comparing an implementation that is
specifically optimized to use things F# is good at and ocaml is
bad. To give a fair comparison you need to optimize the implementation
to the language.

>   "F# can do 60MFLOPS of computation in the time it takes OCaml 
> to fork a single process" -
> http://caml.inria.fr/pub/ml-archives/caml-list/2009/06/542b8bed77022b4a4824de2da5b7f714.en.html
>
>> Those should then also each pin themself to one core.
>
> You have no idea which core a forked child process will run on.

I can pin each child to one core and the scheduler will move it there.
I would expect a multi-core ocaml to actualy do this internaly already
so there is on GC threads per core ocaml runs on. But that depends on
the GC implementation that will be used.

>> Each process then has a work queue which is works through. So they will 
>> always use the local data. Only when a queue runs dry they steal from
>> another process and ruin locality.
>
> You are correctly describing the efficient solution based upon work-stealing 
> queues that F# uses but OCaml cannot express it.

You mean that you didn't implement that way. I can easily express that
with closures and pre-forked worker threads.

>> So I don't see where your argument fits. You are not creating childs
>> on the fly. Only at the start and they run till all the work is done.
>> At least in this example.
>
> No, every recursive invocation of the parallel quicksort spawns another child 
> on-the-fly. That's precisely why it parallelizes so efficiently when you have 
> wait-free work-stealing task deques and a shared heap. In general, you 
> rewrite algorithms into this recursive divide and conquer form and 
> parallelize when possible. You can parallelize a *lot* of problems 
> efficiently that way.

That seems awfully ineficient. Then you have a million children
running on maybe 4 cores and each job queue holds no job.

I think we mean different things by children. By children I mean what
the kernel sees as children. Newly forked threads. I think you mean
jobs that get put into the queues. There is no reason to fork a new
system thread every time the parallel quicksort splits an array in
two.

>> > 4. Task deques can handle an arbitrary number of tasks limited only by
>> > memory whereas processes are a scarce resource and forking is likely to
>> > fail, whereupon the "invoke" combinator will simply execute sequentially.
>> > So it is much easier to write reliable and performant code in F# than
>> > OCaml.
>>
>> Why would you fork in invoke?
>
> Fork is currently the only transparent way to implement "invoke" but it is 
> extremely slow and unreliable.

No, it is the insane way. Use closures.

>> > 5. OCaml's fork-based "invoke" combinator is many orders of magnitude
>> > slower than pushing a closure onto a concurrent task deque in F#.
>> >
>> > 6. The "threshold" value is a magic number derived from measurements on a
>> > given machine in my OCaml code but is dynamically adjusted in a
>> > machine-independent way by the "invoke" combinator in my F# code using
>> > atomic operations and real time profiling of the proportion of time spent
>> > spawning tasks vs doing actual work.
>>
>> 5+6 seem to be an implementation detail of some specific
>> implementation you are talking about.
>
> Yes. I'm talking about today's OCaml and F# implementations.
>
>> I don't see anything in the theory that would require that.
>
> If by "in theory" you mean that a new performant concurrent GC for OCaml would 
> solve these problems then yes. But I doubt OCaml is ever going to get one.
>
>> > The same basic principles apply to many algorithms. Although it can
>> > sometimes be tricky to figure out how best to use this technology to
>> > parallelize a given algorithm (e.g. successive over relaxation), I have
>> > found that a great many algorithms can be parallelized effectively using
>> > this approach when you have a suitable foundation in place (like the
>> > TPL). Moreover, the ability to use ordinary constructs in F# instead of
>> > hacks like type-specific shared memory big arrays in OCaml makes it a lot
>> > easier to parallelize programs. My parallel Burrows-Wheeler Transform
>> > (BWT), for example, took 30 minutes to develop in F# and 2 days in OCaml.
>>
>> It might be true that ocaml lacks some primitives for multi-core
>> operations. But I would say that is mostly because so far it hasn't
>> supported multi-core.
>
> Well, yes. :-)
>
>> If F# has great support for this then that might be a good place to steal
>> some ideas from. 
>
> The infrastructure for this kind of shared memory parallel programming is 
> really very simple. You just need a GC that can handle a shared heap (which 

Which there is one example so far for ocaml. Maybe not a verry good
one. Last simple benchmarking I remember it was like 3 times slower
than mono-core.

> HLVM already has) and work-stealing task deques. Then you can easily write 

Task deques you can implement given a multi-core GC, worker threads
and using closures.

> parallel programs that leverage multicores and run a *lot* faster than 
> anything that can be written in OCaml. You can even make this accessible to 
> OCaml programmers as a DSL with automatic interop to make high performance 
> parallel programming as easy as possible from OCaml.

Or just implement it properly in ocaml. The problem part is the
GC. The rest is easy.

>> But so far have heart nothing that would make F# fundamentally more capable
>> than ocaml could become.
>
> In theory, OCaml could catch up with F#. In practice, the core of OCaml's 
> implementation is so heavily optimized in another direction (and that will 
> not change because it is OCaml's raison d'être) that it is worth starting 
> from scratch. Modern libraries like LLVM make this comparatively painless and 
> the improvements are vast. And besides, it is great fun! :-)

I think there is an ocaml frontend for LLVM. So someone already did.

MfG
        Goswin

next prev parent reply	other threads:[~2009-12-24 13:21 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-12-19 19:38 OCaml is broken Jeff Shaw
2009-12-20  4:43 ` [Caml-list] " Jon Harrop
2009-12-20 12:21   ` [***SPAM*** Score/Req: 10.1/8.0] " Erik Rigtorp
2009-12-20 13:22     ` Martin Jambon
2009-12-20 13:47     ` Yaron Minsky
2009-12-20 16:01       ` Gerd Stolpmann
2009-12-21 22:50       ` [***SPAM*** Score/Req: 10.1/8.0] Re: [***SPAM*** Score/Req: 10.1/8.0] " Erik Rigtorp
2009-12-22 12:04         ` Erik Rigtorp
2009-12-22 12:27           ` Mihamina Rakotomandimby
2009-12-22 13:27           ` Gerd Stolpmann
2009-12-23 11:25             ` Erik Rigtorp
2009-12-29 12:07         ` [***SPAM*** Score/Req: 10.1/8.0] Re: [***SPAM*** Score/Req: 10.1/8.0] " Richard Jones
2009-12-20 14:27     ` Dario Teixeira
2009-12-20 21:14       ` Jon Harrop
2009-12-21  1:08         ` Gerd Stolpmann
2009-12-21  4:30           ` Jon Harrop
2009-12-21  3:58             ` Yaron Minsky
2009-12-21  5:32             ` Markus Mottl
2009-12-21 13:29               ` Jon Harrop
2009-12-26 17:08           ` orbitz
2009-12-20 19:38     ` [***SPAM*** Score/Req: 10.1/8.0] " Jon Harrop
2009-12-21 12:26       ` Mihamina Rakotomandimby
2009-12-21 14:19         ` general question, was " Keyan
2009-12-21 14:40           ` [Caml-list] " rixed
2009-12-21 14:42           ` Gerd Stolpmann
2009-12-21 15:25             ` Eray Ozkural
2009-12-21 14:50           ` Philip
2009-12-21 15:01             ` Keyan
2009-12-21 15:13               ` Stefano Zacchiroli
2009-12-21 15:27               ` Dario Teixeira
2009-12-21 15:46                 ` Jacques Carette
2009-12-21 18:50             ` Jon Harrop
2009-12-21 18:48           ` Jon Harrop
2010-01-03 10:49           ` Sylvain Le Gall
2010-01-03 20:06             ` [Caml-list] " Jon Harrop
2009-12-21 13:07     ` [***SPAM*** Score/Req: 10.1/8.0] Re: [Caml-list] " Damien Doligez
2009-12-21 13:31   ` multicore wish [Was: Re: [Caml-list] Re: OCaml is broken] Goswin von Brederlow
2009-12-21 14:19     ` multicore wish Mihamina Rakotomandimby
2009-12-21 16:15       ` [Caml-list] " Fischbacher T.
2009-12-21 17:42       ` Dario Teixeira
2009-12-21 18:43       ` Jon Harrop
2009-12-21 19:53     ` multicore wish [Was: Re: [Caml-list] Re: OCaml is broken] Jon Harrop
2009-12-22 13:09       ` multicore wish Goswin von Brederlow
2009-12-22 19:12         ` [Caml-list] " Jon Harrop
2009-12-22 18:02           ` Edgar Friendly
2009-12-22 19:20             ` Jon Harrop
2009-12-24 12:58               ` Goswin von Brederlow
2009-12-24 16:51                 ` Jon Harrop
2009-12-24 13:19           ` Goswin von Brederlow [this message]
2009-12-24 17:06             ` Jon Harrop
2009-12-27 12:45               ` Goswin von Brederlow
2009-12-27 16:37                 ` Jon Harrop
2009-12-28 12:28                 ` Gerd Stolpmann
2009-12-28 15:07                   ` Anil Madhavapeddy
2009-12-28 18:05                   ` Xavier Leroy
2009-12-29 16:44                     ` Gerd Stolpmann
2009-12-20 11:56 ` [***SPAM*** Score/Req: 10.1/8.0] [Caml-list] Re: OCaml is broken Erik Rigtorp

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87r5qk5x1j.fsf@frosties.localdomain \
    --to=goswin-v-b@web.de \
    --cc=caml-list@yquem.inria.fr \
    --cc=jon@ffconsultancy.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).