From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by yquem.inria.fr (Postfix) with ESMTP id D72E7BBAF for ; Thu, 24 Dec 2009 14:21:02 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AiADANb4MkvZSMDdi2dsb2JhbACBSpoOAQoLCgcRBbhphDME X-IronPort-AV: E=Sophos;i="4.47,449,1257116400"; d="scan'208";a="42692273" Received: from fmmailgate01.web.de ([217.72.192.221]) by mail1-smtp-roc.national.inria.fr with ESMTP; 24 Dec 2009 14:21:02 +0100 Received: from smtp08.web.de (fmsmtp08.dlan.cinetic.de [172.20.5.216]) by fmmailgate01.web.de (Postfix) with ESMTP id 0BE141432AA08; Thu, 24 Dec 2009 14:19:55 +0100 (CET) Received: from [95.208.117.111] (helo=frosties.localdomain) by smtp08.web.de with asmtp (TLSv1:AES256-SHA:256) (WEB.DE 4.110 #314) id 1NNnbc-0003Rr-00; Thu, 24 Dec 2009 14:19:52 +0100 Received: from mrvn by frosties.localdomain with local (Exim 4.69) (envelope-from ) id 1NNnbc-000546-7F; Thu, 24 Dec 2009 14:19:52 +0100 From: Goswin von Brederlow To: Jon Harrop Cc: caml-list@yquem.inria.fr Subject: Re: [Caml-list] Re: multicore wish References: <4B2D2BC1.6020204@msu.edu> <200912211953.23880.jon@ffconsultancy.com> <87oclrrw8o.fsf@frosties.localdomain> <200912221912.51017.jon@ffconsultancy.com> Date: Thu, 24 Dec 2009 14:19:52 +0100 In-Reply-To: <200912221912.51017.jon@ffconsultancy.com> (Jon Harrop's message of "Tue, 22 Dec 2009 19:12:50 +0000") Message-ID: <87r5qk5x1j.fsf@frosties.localdomain> User-Agent: Gnus/5.110006 (No Gnus v0.6) XEmacs/21.4.22 (linux) MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Sender: goswin-v-b@web.de X-Sender: goswin-v-b@web.de X-Provags-ID: V01U2FsdGVkX18LCoRMx1WlkyB7OeCvJVC6c07LKnVkeBO83XP8 9HwvdXAbaaHldVGb65MK7RawQBl7exDPa6HwAMudddhZY4577B v5g4AJZPc= X-Spam: no; 0.00; ocaml:01 pointer:01 ocaml:01 pointers:01 pointers:01 locality:01 locality:01 speedup:01 unboxed:01 subset:01 scheduler:01 recursive:01 invocation:01 deques:01 recursive:01 Jon Harrop writes: > On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote: >> Jon Harrop writes: >> > 1. The array "a" is just an ordinary array of any type of values on the >> > shared heap in F# but, for generality in OCaml, this must be both the >> > underlying ordinary data and a manually-managed shared big array of >> > indices into the ordinary data where the indices get sorted while the >> > original data remain in place until they are permuted at the end. >> >> Unless you have a primitive type that isn't a pointer. > > In OCaml, you would need to write a custom quicksort optimized for that > particular type. In F#, the generic version just works and works efficiently. > >> The advantage with ocaml though is that you never have pointers into a >> structure. Makes thinks a lot simpler for the GC and avoids large >> overheads in memory. > > I don't understand what you mean by OCaml "never has pointers into a > structure". Half the problem with OCaml is that OCaml almost always uses > pointers and the programmer has no choice, e.g. for complex numbers. > >> > 2. The sorted subarrays are contiguous in memory and, at some >> > subdivision, will fit into L2 cache. So F# offers optimal locality. In >> > contrast, there is no locality whatsoever in the OCaml code and most >> > accesses into the unsorted original array will incur cache misses right >> > up to main memory. So the OCaml approach does not scale as well and will >> > never see superlinear speedup because it cannot be cache friendly. >> >> On the other hand swapping two elements in the array has a constant >> cost no matter what size they have. At some size there will be a break >> even point where copying the data costs more than the cache misses and >> with increasing size the cache won't help F# so much either. > > In theory, yes. In practice, that threshold is far larger than any value type > that a real program would use so it is of no practical concern. Moreover, F# > gives the programmer control over whether data are unboxed (value types) or > boxed (reference types) anyway. In contrast, OCaml is tied to a few value > types that are hard-coded into the GC. > >> > 3. Child tasks are likely to be executed on the same core as their parent >> > and use a subset of their parent's data in F#, so they offer the best >> > possible locality. In contrast, child processes are likely to be executed >> > on another core in OCaml and offer the worst possible locality. >> >> But, if I understood you right, you first fork one process per >> core. > > No, in OCaml I fork every child. That is the only transparent way to give the > child a coherent view of the heap but it is extremely slow (~1ms): So if you add a (sleep 60) to the ocaml code then ocaml gets even more worse than F#? You are comparing an implementation that is specifically optimized to use things F# is good at and ocaml is bad. To give a fair comparison you need to optimize the implementation to the language. > "F# can do 60MFLOPS of computation in the time it takes OCaml > to fork a single process" - > http://caml.inria.fr/pub/ml-archives/caml-list/2009/06/542b8bed77022b4a4824de2da5b7f714.en.html > >> Those should then also each pin themself to one core. > > You have no idea which core a forked child process will run on. I can pin each child to one core and the scheduler will move it there. I would expect a multi-core ocaml to actualy do this internaly already so there is on GC threads per core ocaml runs on. But that depends on the GC implementation that will be used. >> Each process then has a work queue which is works through. So they will >> always use the local data. Only when a queue runs dry they steal from >> another process and ruin locality. > > You are correctly describing the efficient solution based upon work-stealing > queues that F# uses but OCaml cannot express it. You mean that you didn't implement that way. I can easily express that with closures and pre-forked worker threads. >> So I don't see where your argument fits. You are not creating childs >> on the fly. Only at the start and they run till all the work is done. >> At least in this example. > > No, every recursive invocation of the parallel quicksort spawns another child > on-the-fly. That's precisely why it parallelizes so efficiently when you have > wait-free work-stealing task deques and a shared heap. In general, you > rewrite algorithms into this recursive divide and conquer form and > parallelize when possible. You can parallelize a *lot* of problems > efficiently that way. That seems awfully ineficient. Then you have a million children running on maybe 4 cores and each job queue holds no job. I think we mean different things by children. By children I mean what the kernel sees as children. Newly forked threads. I think you mean jobs that get put into the queues. There is no reason to fork a new system thread every time the parallel quicksort splits an array in two. >> > 4. Task deques can handle an arbitrary number of tasks limited only by >> > memory whereas processes are a scarce resource and forking is likely to >> > fail, whereupon the "invoke" combinator will simply execute sequentially. >> > So it is much easier to write reliable and performant code in F# than >> > OCaml. >> >> Why would you fork in invoke? > > Fork is currently the only transparent way to implement "invoke" but it is > extremely slow and unreliable. No, it is the insane way. Use closures. >> > 5. OCaml's fork-based "invoke" combinator is many orders of magnitude >> > slower than pushing a closure onto a concurrent task deque in F#. >> > >> > 6. The "threshold" value is a magic number derived from measurements on a >> > given machine in my OCaml code but is dynamically adjusted in a >> > machine-independent way by the "invoke" combinator in my F# code using >> > atomic operations and real time profiling of the proportion of time spent >> > spawning tasks vs doing actual work. >> >> 5+6 seem to be an implementation detail of some specific >> implementation you are talking about. > > Yes. I'm talking about today's OCaml and F# implementations. > >> I don't see anything in the theory that would require that. > > If by "in theory" you mean that a new performant concurrent GC for OCaml would > solve these problems then yes. But I doubt OCaml is ever going to get one. > >> > The same basic principles apply to many algorithms. Although it can >> > sometimes be tricky to figure out how best to use this technology to >> > parallelize a given algorithm (e.g. successive over relaxation), I have >> > found that a great many algorithms can be parallelized effectively using >> > this approach when you have a suitable foundation in place (like the >> > TPL). Moreover, the ability to use ordinary constructs in F# instead of >> > hacks like type-specific shared memory big arrays in OCaml makes it a lot >> > easier to parallelize programs. My parallel Burrows-Wheeler Transform >> > (BWT), for example, took 30 minutes to develop in F# and 2 days in OCaml. >> >> It might be true that ocaml lacks some primitives for multi-core >> operations. But I would say that is mostly because so far it hasn't >> supported multi-core. > > Well, yes. :-) > >> If F# has great support for this then that might be a good place to steal >> some ideas from. > > The infrastructure for this kind of shared memory parallel programming is > really very simple. You just need a GC that can handle a shared heap (which Which there is one example so far for ocaml. Maybe not a verry good one. Last simple benchmarking I remember it was like 3 times slower than mono-core. > HLVM already has) and work-stealing task deques. Then you can easily write Task deques you can implement given a multi-core GC, worker threads and using closures. > parallel programs that leverage multicores and run a *lot* faster than > anything that can be written in OCaml. You can even make this accessible to > OCaml programmers as a DSL with automatic interop to make high performance > parallel programming as easy as possible from OCaml. Or just implement it properly in ocaml. The problem part is the GC. The rest is easy. >> But so far have heart nothing that would make F# fundamentally more capable >> than ocaml could become. > > In theory, OCaml could catch up with F#. In practice, the core of OCaml's > implementation is so heavily optimized in another direction (and that will > not change because it is OCaml's raison d'être) that it is worth starting > from scratch. Modern libraries like LLVM make this comparatively painless and > the improvements are vast. And besides, it is great fun! :-) I think there is an ocaml frontend for LLVM. So someone already did. MfG Goswin