From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.2 required=5.0 tests=AWL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by yquem.inria.fr (Postfix) with ESMTP id DA248BC37 for ; Wed, 30 Sep 2009 17:22:25 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AlgCAPkUw0rUnwcjlGdsb2JhbACCH5hgAQEBAQkLCAkTBL4ThCcE X-IronPort-AV: E=Sophos;i="4.44,480,1249250400"; d="scan'208";a="33878876" Received: from relay.ptn-ipout01.plus.net ([212.159.7.35]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 30 Sep 2009 17:22:25 +0200 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Av0GAPkUw0rUnw4U/2dsb2JhbACCH9czhCcE Received: from pih-relay08.plus.net ([212.159.14.20]) by relay.ptn-ipout01.plus.net with ESMTP; 30 Sep 2009 16:22:25 +0100 Received: from [87.114.87.187] (helo=leper.local) by pih-relay08.plus.net with esmtp (Exim) id 1Mt10a-0003jH-Qy for caml-list@yquem.inria.fr; Wed, 30 Sep 2009 16:22:25 +0100 From: Jon Harrop Organization: Flying Frog Consultancy Ltd. To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] JIT & HLVM, LLVM Date: Wed, 30 Sep 2009 16:22:57 +0100 User-Agent: KMail/1.9.9 References: <4E6F3027-5745-462D-AF10-30C868285D28@refined-audiometrics.com> <200909272210.19130.jon@ffconsultancy.com> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Message-Id: <200909301622.57394.jon@ffconsultancy.com> X-Plusnet-Relay: 8d7c041ce969fc54216dc1df86b4589d X-Spam: no; 0.00; mikkel:01 rgensen:01 tpl:01 tpl:01 deque:01 spawned:01 algebra:01 recursive:01 trivial:01 higher-order:01 recursive:01 run-time:01 2009:98 2009:98 carleton:98 On Wednesday 30 September 2009 02:08:06 Mikkel Fahn=C3=B8e J=C3=B8rgensen w= rote: > 2009/9/27 Jon Harrop : > > On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote: > >> If Grand Central Dispatch makes its way > >> into *nix... > > > > Incidentally, what exactly (technically) is Grand Central and how does = it > > relate to existing alternatives and the state-of-the-art? I Googled it > > but failed to find anything useful... > > Grand Central Dispatch... Thanks for the explanation! This seems to be attacking the same problem as= =20 Cilk and the TPL but my immediate impression (based entirely upon your=20 description) is that Cilk and the TPL are probably better foundations. Using one work stealing deque per core is much more efficient than the work= =20 sharing queue you described for two reasons: 1. Less global syncronization. 2. Subtasks are likely to be executed on the core that spawned them, which= =20 improves cache efficiency. The parallel "for" loop you described is similar to the TPL's and leaves a = lot=20 to be desired. Specifically, you want to execute clusters of consecutive=20 tasks on the same core for two reasons: 1. Using the same core improves cache efficiency because the end of one tas= k=20 is often spatially related to the start of the next, e.g. parallel quicksor= t,=20 linear algebra or image processing. 2. Executing clusters of tasks amortizes the cost of queueing tasks. The former is easily solved with the Cilk/TPL design by using recursive=20 subdivision instead of the index sharing scheme you described because=20 subtasks are likely to be executed on the same core. However, that will not= =20 work with a work sharing queue because subtasks are executed on random core= s.=20 Moreover, a trivial extension to this higher-order function allows you to=20 pass in another function that describes the complexity of each task. This=20 allows the recursive function to more intelligently subdivide the given ran= ge=20 into clusters of variable-complexity tasks with similar total running times= =2E=20 This is the technique I use in our commercial F# libraries but I have not=20 seen it described elsewhere. Cilk and the TPL also just rely upon the programmer to make tasks sufficien= tly=20 complex that the time spent queueing and dequeueing them is insignificant. = I=20 solved this problem by injecting run-time profiling code (with global=20 synchronizations per cluster of tasks) to measure the proportion of the tim= e=20 being spent spawning rather than executing tasks and then use exponential=20 backoff to increase the cluster size until a sufficiently small proportion = of=20 the time is spent spawning. Again, I have not seen this technique described= =20 elsewhere. =2D-=20 Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e