From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by yquem.inria.fr (Postfix) with ESMTP id 3FE77BBAF for ; Tue, 22 Dec 2009 18:59:04 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Al0BACqYMEvUnwdji2dsb2JhbACCGZkvAQEBCgsKBxEGumiEMwQ X-IronPort-AV: E=Sophos;i="4.47,437,1257116400"; d="scan'208";a="40433668" Received: from relay.pcl-ipout01.plus.net ([212.159.7.99]) by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/RC4-SHA; 22 Dec 2009 18:58:45 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: ApgFALKXMEvUnw4S/2dsb2JhbACCGdRXhDME Received: from pih-relay05.plus.net ([212.159.14.18]) by relay.pcl-ipout01.plus.net with ESMTP; 22 Dec 2009 17:58:44 +0000 Received: from [87.114.35.173] (helo=leper.local) by pih-relay05.plus.net with esmtp (Exim) id 1NN90N-0007vj-OR for caml-list@yquem.inria.fr; Tue, 22 Dec 2009 17:58:44 +0000 From: Jon Harrop Organization: Flying Frog Consultancy Ltd. To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] Re: multicore wish Date: Tue, 22 Dec 2009 19:12:50 +0000 User-Agent: KMail/1.9.9 References: <4B2D2BC1.6020204@msu.edu> <200912211953.23880.jon@ffconsultancy.com> <87oclrrw8o.fsf@frosties.localdomain> In-Reply-To: <87oclrrw8o.fsf@frosties.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Message-Id: <200912221912.51017.jon@ffconsultancy.com> X-Plusnet-Relay: 8cecb776d52fdcbb7adfdcf887bf32d3 X-Spam: no; 0.00; ocaml:01 pointer:01 ocaml:01 pointers:01 pointers:01 locality:01 locality:01 speedup:01 unboxed:01 subset:01 recursive:01 invocation:01 deques:01 recursive:01 deques:01 On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote: > Jon Harrop writes: > > 1. The array "a" is just an ordinary array of any type of values on the > > shared heap in F# but, for generality in OCaml, this must be both the > > underlying ordinary data and a manually-managed shared big array of > > indices into the ordinary data where the indices get sorted while the > > original data remain in place until they are permuted at the end. > > Unless you have a primitive type that isn't a pointer. In OCaml, you would need to write a custom quicksort optimized for that=20 particular type. In F#, the generic version just works and works efficientl= y. > The advantage with ocaml though is that you never have pointers into a > structure. Makes thinks a lot simpler for the GC and avoids large > overheads in memory. I don't understand what you mean by OCaml "never has pointers into a=20 structure". Half the problem with OCaml is that OCaml almost always uses=20 pointers and the programmer has no choice, e.g. for complex numbers. > > 2. The sorted subarrays are contiguous in memory and, at some > > subdivision, will fit into L2 cache. So F# offers optimal locality. In > > contrast, there is no locality whatsoever in the OCaml code and most > > accesses into the unsorted original array will incur cache misses right > > up to main memory. So the OCaml approach does not scale as well and will > > never see superlinear speedup because it cannot be cache friendly. > > On the other hand swapping two elements in the array has a constant > cost no matter what size they have. At some size there will be a break > even point where copying the data costs more than the cache misses and > with increasing size the cache won't help F# so much either. In theory, yes. In practice, that threshold is far larger than any value ty= pe=20 that a real program would use so it is of no practical concern. Moreover, F= #=20 gives the programmer control over whether data are unboxed (value types) or= =20 boxed (reference types) anyway. In contrast, OCaml is tied to a few value=20 types that are hard-coded into the GC. > > 3. Child tasks are likely to be executed on the same core as their pare= nt > > and use a subset of their parent's data in F#, so they offer the best > > possible locality. In contrast, child processes are likely to be execut= ed > > on another core in OCaml and offer the worst possible locality. > > But, if I understood you right, you first fork one process per > core. No, in OCaml I fork every child. That is the only transparent way to give t= he=20 child a coherent view of the heap but it is extremely slow (~1ms): "F# can do 60MFLOPS of computation in the time it takes OCaml=20 to fork a single process" - http://caml.inria.fr/pub/ml-archives/caml-list/2009/06/542b8bed77022b4a4824= de2da5b7f714.en.html > Those should then also each pin themself to one core. You have no idea which core a forked child process will run on. > Each process then has a work queue which is works through. So they will=20 > always use the local data. Only when a queue runs dry they steal from > another process and ruin locality. You are correctly describing the efficient solution based upon work-stealin= g=20 queues that F# uses but OCaml cannot express it. > So I don't see where your argument fits. You are not creating childs > on the fly. Only at the start and they run till all the work is done. > At least in this example. No, every recursive invocation of the parallel quicksort spawns another chi= ld=20 on-the-fly. That's precisely why it parallelizes so efficiently when you ha= ve=20 wait-free work-stealing task deques and a shared heap. In general, you=20 rewrite algorithms into this recursive divide and conquer form and=20 parallelize when possible. You can parallelize a *lot* of problems=20 efficiently that way. > > 4. Task deques can handle an arbitrary number of tasks limited only by > > memory whereas processes are a scarce resource and forking is likely to > > fail, whereupon the "invoke" combinator will simply execute sequentiall= y. > > So it is much easier to write reliable and performant code in F# than > > OCaml. > > Why would you fork in invoke? =46ork is currently the only transparent way to implement "invoke" but it i= s=20 extremely slow and unreliable. > > 5. OCaml's fork-based "invoke" combinator is many orders of magnitude > > slower than pushing a closure onto a concurrent task deque in F#. > > > > 6. The "threshold" value is a magic number derived from measurements on= a > > given machine in my OCaml code but is dynamically adjusted in a > > machine-independent way by the "invoke" combinator in my F# code using > > atomic operations and real time profiling of the proportion of time spe= nt > > spawning tasks vs doing actual work. > > 5+6 seem to be an implementation detail of some specific > implementation you are talking about. Yes. I'm talking about today's OCaml and F# implementations. > I don't see anything in the theory that would require that. If by "in theory" you mean that a new performant concurrent GC for OCaml wo= uld=20 solve these problems then yes. But I doubt OCaml is ever going to get one. > > The same basic principles apply to many algorithms. Although it can > > sometimes be tricky to figure out how best to use this technology to > > parallelize a given algorithm (e.g. successive over relaxation), I have > > found that a great many algorithms can be parallelized effectively using > > this approach when you have a suitable foundation in place (like the > > TPL). Moreover, the ability to use ordinary constructs in F# instead of > > hacks like type-specific shared memory big arrays in OCaml makes it a l= ot > > easier to parallelize programs. My parallel Burrows-Wheeler Transform > > (BWT), for example, took 30 minutes to develop in F# and 2 days in OCam= l. > > It might be true that ocaml lacks some primitives for multi-core > operations. But I would say that is mostly because so far it hasn't > supported multi-core. Well, yes. :-) > If F# has great support for this then that might be a good place to steal > some ideas from.=20 The infrastructure for this kind of shared memory parallel programming is=20 really very simple. You just need a GC that can handle a shared heap (which= =20 HLVM already has) and work-stealing task deques. Then you can easily write= =20 parallel programs that leverage multicores and run a *lot* faster than=20 anything that can be written in OCaml. You can even make this accessible to= =20 OCaml programmers as a DSL with automatic interop to make high performance= =20 parallel programming as easy as possible from OCaml. > But so far have heart nothing that would make F# fundamentally more capab= le > than ocaml could become. In theory, OCaml could catch up with F#. In practice, the core of OCaml's=20 implementation is so heavily optimized in another direction (and that will= =20 not change because it is OCaml's raison d'=EAtre) that it is worth starting= =20 from scratch. Modern libraries like LLVM make this comparatively painless a= nd=20 the improvements are vast. And besides, it is great fun! :-) =2D-=20 Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e