From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jon@ffconsultancy.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.2 required=5.0 tests=AWL autolearn=disabled 
	version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83])
	by yquem.inria.fr (Postfix) with ESMTP id DA248BC37
	for <caml-list@yquem.inria.fr>; Wed, 30 Sep 2009 17:22:25 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AlgCAPkUw0rUnwcjlGdsb2JhbACCH5hgAQEBAQkLCAkTBL4ThCcE
X-IronPort-AV: E=Sophos;i="4.44,480,1249250400"; 
   d="scan'208";a="33878876"
Received: from relay.ptn-ipout01.plus.net ([212.159.7.35])
  by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 30 Sep 2009 17:22:25 +0200
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Av0GAPkUw0rUnw4U/2dsb2JhbACCH9czhCcE
Received: from pih-relay08.plus.net ([212.159.14.20])
  by relay.ptn-ipout01.plus.net with ESMTP; 30 Sep 2009 16:22:25 +0100
Received: from [87.114.87.187] (helo=leper.local)
	 by pih-relay08.plus.net with esmtp (Exim) id 1Mt10a-0003jH-Qy
	for caml-list@yquem.inria.fr; Wed, 30 Sep 2009 16:22:25 +0100
From: Jon Harrop <jon@ffconsultancy.com>
Organization: Flying Frog Consultancy Ltd.
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] JIT & HLVM, LLVM
Date: Wed, 30 Sep 2009 16:22:57 +0100
User-Agent: KMail/1.9.9
References: <4E6F3027-5745-462D-AF10-30C868285D28@refined-audiometrics.com> <200909272210.19130.jon@ffconsultancy.com> <caee5ad80909291808p5cb034aq62c3e71a7b189170@mail.gmail.com>
In-Reply-To: <caee5ad80909291808p5cb034aq62c3e71a7b189170@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Message-Id: <200909301622.57394.jon@ffconsultancy.com>
X-Plusnet-Relay: 8d7c041ce969fc54216dc1df86b4589d
X-Spam: no; 0.00; mikkel:01 rgensen:01 tpl:01 tpl:01 deque:01 spawned:01 algebra:01 recursive:01 trivial:01 higher-order:01 recursive:01 run-time:01 2009:98 2009:98 carleton:98 

On Wednesday 30 September 2009 02:08:06 Mikkel Fahn=C3=B8e J=C3=B8rgensen w=
rote:
> 2009/9/27 Jon Harrop <jon@ffconsultancy.com>:
> > On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote:
> >> If Grand Central Dispatch makes its way
> >> into *nix...
> >
> > Incidentally, what exactly (technically) is Grand Central and how does =
it
> > relate to existing alternatives and the state-of-the-art? I Googled it
> > but failed to find anything useful...
>
> Grand Central Dispatch...

Thanks for the explanation! This seems to be attacking the same problem as=
=20
Cilk and the TPL but my immediate impression (based entirely upon your=20
description) is that Cilk and the TPL are probably better foundations.

Using one work stealing deque per core is much more efficient than the work=
=20
sharing queue you described for two reasons:

1. Less global syncronization.

2. Subtasks are likely to be executed on the core that spawned them, which=
=20
improves cache efficiency.

The parallel "for" loop you described is similar to the TPL's and leaves a =
lot=20
to be desired. Specifically, you want to execute clusters of consecutive=20
tasks on the same core for two reasons:

1. Using the same core improves cache efficiency because the end of one tas=
k=20
is often spatially related to the start of the next, e.g. parallel quicksor=
t,=20
linear algebra or image processing.

2. Executing clusters of tasks amortizes the cost of queueing tasks.

The former is easily solved with the Cilk/TPL design by using recursive=20
subdivision instead of the index sharing scheme you described because=20
subtasks are likely to be executed on the same core. However, that will not=
=20
work with a work sharing queue because subtasks are executed on random core=
s.=20
Moreover, a trivial extension to this higher-order function allows you to=20
pass in another function that describes the complexity of each task. This=20
allows the recursive function to more intelligently subdivide the given ran=
ge=20
into clusters of variable-complexity tasks with similar total running times=
=2E=20
This is the technique I use in our commercial F# libraries but I have not=20
seen it described elsewhere.

Cilk and the TPL also just rely upon the programmer to make tasks sufficien=
tly=20
complex that the time spent queueing and dequeueing them is insignificant. =
I=20
solved this problem by injecting run-time profiling code (with global=20
synchronizations per cluster of tasks) to measure the proportion of the tim=
e=20
being spent spawning rather than executing tasks and then use exponential=20
backoff to increase the cluster size until a sufficiently small proportion =
of=20
the time is spent spawning. Again, I have not seen this technique described=
=20
elsewhere.

=2D-=20
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e