Subject: Re: [Caml-list] JIT & HLVM, LLVM
From: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>
To: Jon Harrop
Cc: caml-list@yquem.inria.fr
Date: Wed, 30 Sep 2009 21:35:34 +0200

> Using one work stealing deque per core is much more efficient than the
> work sharing queue you described for two reasons:
>
> 1. Less global synchronization.
>
> 2. Subtasks are likely to be executed on the core that spawned them,
> which improves cache efficiency.

You could also create larger task chunks or schedule multiple tasks on
a serial queue to obtain some of the benefits of cache coherency, as in
the sketch below. I'm not saying that this is a better way than the one
you suggest.

Another issue is latency vs. throughput. If you have a lot of short
transactions, as in a trading system, you want things done now, not
more efficiently in a while; in that case you would rather add more
cores to the system and waste some of them.
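To illustrate the chunking idea, here is a minimal C sketch on top of
dispatch_apply_f (a real libdispatch call); CHUNK, work_on_index and
parallel_loop are names made up for the example, and the chunk size is
just an assumption that would need tuning per workload:

    /* Coarsen a parallel loop: each GCD task runs CHUNK consecutive
       iterations serially, which keeps neighbouring data on one core. */
    #include <stddef.h>
    #include <dispatch/dispatch.h>

    #define CHUNK 256  /* iterations per task; a tuning assumption */

    static void work_on_index(size_t i) { /* per-element work goes here */ }

    static void chunk_body(void *ctx, size_t c) {
        size_t n  = *(size_t *)ctx;          /* total iteration count */
        size_t lo = c * CHUNK;
        size_t hi = lo + CHUNK < n ? lo + CHUNK : n;
        for (size_t i = lo; i < hi; i++)     /* serial inner loop */
            work_on_index(i);
    }

    static void parallel_loop(size_t n) {
        size_t chunks = (n + CHUNK - 1) / CHUNK;
        /* dispatch_apply_f blocks until all chunks finish, so passing
           &n from the stack is safe. */
        dispatch_apply_f(chunks,
                         dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                         &n, chunk_body);
    }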
> The parallel "for" loop you described is similar to the TPL's and
> leaves a lot to be desired. Specifically, you want to execute clusters
> of consecutive tasks on the same core for two reasons:

Umm, yes - it is just a convenience for getting things done, I suppose,
but again, you could split the loop into nested loops to increase
granularity, much like the chunking sketch above. And in relation to my
other recent post, there is also map-reduce for addressing these
issues, which works well across network nodes.

> The former is easily solved with the Cilk/TPL design by using recursive
> subdivision instead of the index sharing scheme you described because
> subtasks are likely to be executed on the same core.

I also wondered why GCD insists on starting indices in order and even
wastes time syncing on the index counter, since there is no guarantee
of execution order other than start time. I guess it would be easy to
modify GCD with a different scheduler - it is about 10 lines of code in
the library - and you could set a property on a queue to suggest a
preferred scheduler. However, I think the benefit of the current
approach is precisely to ensure that as many early indices as possible
run in parallel, which you might want for low latency. (A sketch of the
recursive subdivision alternative follows at the end of this mail.)

Another issue that concerns me more is how serial queues behave when
more work keeps being added to them. If tasks keep feeding the serial
queue, it might starve other, more important work unless the serial
queue takes a break (see the second sketch below). Of course the thread
is preempted, but it will not give priority to older concurrent tasks
still waiting to be pulled from the global concurrent queue. Again, I
think this is easily fixed, and I may not have understood it fully.

For high-performance computing over many similar computations, such as
pixel processing, I think one should also have a look at OpenCL, and
possibly a map-reduce top layer that can distribute work to multiple
nodes - such a layer could be built with GCD without relying heavily on
GCD for anything other than coordination. Here, again, latency is
important: the sooner you can ship work off to OpenCL (which feeds work
to graphics coprocessors), the more work gets done, rather than having
work lining up on the same core so that GCD can maximize its own CPU
cache usage. The same applies to network communication: if you can send
data sooner, the other end can start computing earlier, especially if
there is a risk of delays.

So there is an argument both ways - I think GCD is intended more for
systems programming than for number crunching (or raytracing :-)

One should also remember that GCD is a programming abstraction; there
are many ways to implement it, and as time progresses some changes
could be made.
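Here is the promised sketch of Cilk/TPL-style recursive subdivision
expressed with GCD groups. dispatch_group_async_f, dispatch_group_wait
and dispatch_get_global_queue are real libdispatch calls; range_task,
GRAIN, process_element and parallel_for are hypothetical names:

    /* Recursively split [lo, hi) in half until ranges are small, then
       run them serially; each half re-enters the concurrent queue as
       its own task, so consecutive indices tend to stay together. */
    #include <dispatch/dispatch.h>
    #include <stdlib.h>

    #define GRAIN 1024  /* stop subdividing below this range size */

    struct range_task {
        size_t lo, hi;
        dispatch_group_t group;
        dispatch_queue_t queue;
    };

    static void process_element(size_t i) { /* application work here */ }

    static void run_range(void *ctx) {
        struct range_task *t = ctx;
        if (t->hi - t->lo <= GRAIN) {
            for (size_t i = t->lo; i < t->hi; i++)
                process_element(i);
        } else {
            size_t mid = t->lo + (t->hi - t->lo) / 2;
            struct range_task *l = malloc(sizeof *l);
            struct range_task *r = malloc(sizeof *r);
            *l = (struct range_task){ t->lo, mid, t->group, t->queue };
            *r = (struct range_task){ mid, t->hi, t->group, t->queue };
            dispatch_group_async_f(t->group, t->queue, l, run_range);
            dispatch_group_async_f(t->group, t->queue, r, run_range);
        }
        free(t);
    }

    static void parallel_for(size_t n) {
        dispatch_group_t g = dispatch_group_create();
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        struct range_task *root = malloc(sizeof *root);
        *root = (struct range_task){ 0, n, g, q };
        dispatch_group_async_f(g, q, root, run_range);
        dispatch_group_wait(g, DISPATCH_TIME_FOREVER); /* only root waits */
        dispatch_release(g);
    }

Note that only the root blocks in dispatch_group_wait; the subdividing
tasks never wait, which avoids tying up pool threads.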
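And for the serial-queue starvation point, one way a serial queue can
"take a break" is for a long job to re-enqueue its own continuation
instead of looping, so tasks submitted in the meantime get a turn
between steps. dispatch_async_f and dispatch_queue_create are real
libdispatch calls; job, drain_step, do_one_unit, start_job and the
queue label are hypothetical:

    /* A long-running job on a serial queue, split into bounded steps.
       Each step returns instead of looping, so other tasks queued on
       this queue (or elsewhere) can run before the next step. */
    #include <dispatch/dispatch.h>
    #include <stdlib.h>

    struct job {
        size_t next, total;
        dispatch_queue_t queue;   /* the serial queue the job runs on */
    };

    static void do_one_unit(size_t i) { /* one bounded piece of work */ }

    static void drain_step(void *ctx) {
        struct job *j = ctx;
        do_one_unit(j->next++);
        if (j->next < j->total)
            dispatch_async_f(j->queue, j, drain_step); /* yield, continue */
        else
            free(j);                                   /* job done */
    }

    static void start_job(size_t total) {
        /* NULL attr gives a serial queue; releasing it when the job
           finishes is omitted for brevity. */
        dispatch_queue_t q = dispatch_queue_create("com.example.drain", NULL);
        struct job *j = malloc(sizeof *j);
        *j = (struct job){ 0, total, q };
        dispatch_async_f(q, j, drain_step);
    }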