Subject: Re: [Caml-list] JIT & HLVM, LLVM
From: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>
To: Jon Harrop
Cc: caml-list@yquem.inria.fr
Date: Wed, 30 Sep 2009 21:35:34 +0200

> Using one work stealing deque per core is much more efficient than the
> work sharing queue you described for two reasons:
>
> 1. Less global synchronization.
>
> 2. Subtasks are likely to be executed on the core that spawned them,
> which improves cache efficiency.

You could also create larger task chunks or schedule multiple tasks on
a serial queue to obtain some of the benefits of cache coherency, as in
the sketch below. I'm not saying that this is a better way than the one
you suggest.

Another issue is latency vs. throughput. If you have a lot of short
transactions, as in a trading system, you want things done now, not
more efficiently in a while; in that case you would rather add more
cores to the system and waste some of them.
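To illustrate the chunking idea, here is a minimal C sketch on top of
dispatch_apply_f (a real libdispatch call); CHUNK, work_on_index and
parallel_loop are names made up for the example, and the chunk size is
just an assumption that would need tuning per workload:

    /* Coarsen a parallel loop: each GCD task runs CHUNK consecutive
       iterations serially, which keeps neighbouring data on one core. */
    #include <stddef.h>
    #include <dispatch/dispatch.h>

    #define CHUNK 256  /* iterations per task; a tuning assumption */

    static void work_on_index(size_t i) { /* per-element work goes here */ }

    static void chunk_body(void *ctx, size_t c) {
        size_t n  = *(size_t *)ctx;          /* total iteration count */
        size_t lo = c * CHUNK;
        size_t hi = lo + CHUNK < n ? lo + CHUNK : n;
        for (size_t i = lo; i < hi; i++)     /* serial inner loop */
            work_on_index(i);
    }

    static void parallel_loop(size_t n) {
        size_t chunks = (n + CHUNK - 1) / CHUNK;
        /* dispatch_apply_f blocks until all chunks finish, so passing
           &n from the stack is safe. */
        dispatch_apply_f(chunks,
                         dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                         &n, chunk_body);
    }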
> The parallel "for" loop you described is similar to the TPL's and
> leaves a lot to be desired. Specifically, you want to execute clusters
> of consecutive tasks on the same core for two reasons:

Umm, yes - it is just a convenience for getting things done, I suppose,
but again, you could split the loop into nested loops to increase
granularity, much like the chunking sketch above. And in relation to my
other recent post, there is also map-reduce for addressing these
issues, which works well across network nodes.

> The former is easily solved with the Cilk/TPL design by using recursive
> subdivision instead of the index sharing scheme you described because
> subtasks are likely to be executed on the same core.

I also wondered why GCD insists on starting indices in order and even
wastes time syncing on the index counter, since there is no guarantee
of execution order other than start time. I guess it would be easy to
modify GCD with a different scheduler - it is about 10 lines of code in
the library - and you could set a property on a queue to suggest a
preferred scheduler. However, I think the benefit of the current
approach is precisely to ensure that as many early indices as possible
run in parallel, which you might want for low latency. (A sketch of the
recursive subdivision alternative follows at the end of this mail.)

Another issue that concerns me more is how serial queues behave when
more work keeps being added to them. If tasks keep feeding the serial
queue, it might starve other, more important work unless the serial
queue takes a break (see the second sketch below). Of course the thread
is preempted, but it will not give priority to older concurrent tasks
still waiting to be pulled from the global concurrent queue. Again, I
think this is easily fixed, and I may not have understood it fully.

For high-performance computing over many similar computations, such as
pixel processing, I think one should also have a look at OpenCL, and
possibly a map-reduce top layer that can distribute work to multiple
nodes - such a layer could be built with GCD without relying heavily on
GCD for anything other than coordination. Here, again, latency is
important: the sooner you can ship work off to OpenCL (which feeds work
to graphics coprocessors), the more work gets done, rather than having
work lining up on the same core so that GCD can maximize its own CPU
cache usage. The same applies to network communication: if you can send
data sooner, the other end can start computing earlier, especially if
there is a risk of delays.

So there is an argument both ways - I think GCD is intended more for
systems programming than for number crunching (or raytracing :-)

One should also remember that GCD is a programming abstraction; there
are many ways to implement it, and as time progresses some changes
could be made.
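Here is the promised sketch of Cilk/TPL-style recursive subdivision
expressed with GCD groups. dispatch_group_async_f, dispatch_group_wait
and dispatch_get_global_queue are real libdispatch calls; range_task,
GRAIN, process_element and parallel_for are hypothetical names:

    /* Recursively split [lo, hi) in half until ranges are small, then
       run them serially; each half re-enters the concurrent queue as
       its own task, so consecutive indices tend to stay together. */
    #include <dispatch/dispatch.h>
    #include <stdlib.h>

    #define GRAIN 1024  /* stop subdividing below this range size */

    struct range_task {
        size_t lo, hi;
        dispatch_group_t group;
        dispatch_queue_t queue;
    };

    static void process_element(size_t i) { /* application work here */ }

    static void run_range(void *ctx) {
        struct range_task *t = ctx;
        if (t->hi - t->lo <= GRAIN) {
            for (size_t i = t->lo; i < t->hi; i++)
                process_element(i);
        } else {
            size_t mid = t->lo + (t->hi - t->lo) / 2;
            struct range_task *l = malloc(sizeof *l);
            struct range_task *r = malloc(sizeof *r);
            *l = (struct range_task){ t->lo, mid, t->group, t->queue };
            *r = (struct range_task){ mid, t->hi, t->group, t->queue };
            dispatch_group_async_f(t->group, t->queue, l, run_range);
            dispatch_group_async_f(t->group, t->queue, r, run_range);
        }
        free(t);
    }

    static void parallel_for(size_t n) {
        dispatch_group_t g = dispatch_group_create();
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        struct range_task *root = malloc(sizeof *root);
        *root = (struct range_task){ 0, n, g, q };
        dispatch_group_async_f(g, q, root, run_range);
        dispatch_group_wait(g, DISPATCH_TIME_FOREVER); /* only root waits */
        dispatch_release(g);
    }

Note that only the root blocks in dispatch_group_wait; the subdividing
tasks never wait, which avoids tying up pool threads.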
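And for the serial-queue starvation point, one way a serial queue can
"take a break" is for a long job to re-enqueue its own continuation
instead of looping, so tasks submitted in the meantime get a turn
between steps. dispatch_async_f and dispatch_queue_create are real
libdispatch calls; job, drain_step, do_one_unit, start_job and the
queue label are hypothetical:

    /* A long-running job on a serial queue, split into bounded steps.
       Each step returns instead of looping, so other tasks queued on
       this queue (or elsewhere) can run before the next step. */
    #include <dispatch/dispatch.h>
    #include <stdlib.h>

    struct job {
        size_t next, total;
        dispatch_queue_t queue;   /* the serial queue the job runs on */
    };

    static void do_one_unit(size_t i) { /* one bounded piece of work */ }

    static void drain_step(void *ctx) {
        struct job *j = ctx;
        do_one_unit(j->next++);
        if (j->next < j->total)
            dispatch_async_f(j->queue, j, drain_step); /* yield, continue */
        else
            free(j);                                   /* job done */
    }

    static void start_job(size_t total) {
        /* NULL attr gives a serial queue; releasing it when the job
           finishes is omitted for brevity. */
        dispatch_queue_t q = dispatch_queue_create("com.example.drain", NULL);
        struct job *j = malloc(sizeof *j);
        *j = (struct job){ 0, total, q };
        dispatch_async_f(q, j, drain_step);
    }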