From: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>
To: Jon Harrop
Cc: caml-list@yquem.inria.fr
Date: Wed, 30 Sep 2009 03:08:06 +0200
Subject: Re: [Caml-list] JIT & HLVM, LLVM

2009/9/27 Jon Harrop:
> On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote:
>> If Grand Central Dispatch makes its way
>> into *nix...
>
> Incidentally, what exactly (technically) is Grand Central and how does it
> relate to existing alternatives and the state-of-the-art? I Googled it but
> failed to find anything useful...

Grand Central Dispatch (GCD, with libdispatch as the client-side library) is a very simple piece of code with some non-trivial implementation details. Some would say it is just a thread queue, but it is clever in the same way map-reduce is clever and simple. It may help to read the Ars Technica article first, since I dig a little below the user-level API here.

Two important concepts: GCD does not require threads, and it does not require kernel support.
When they are present, GCD becomes efficient at handling multiple cores - or so the claim goes, and at least OS X 10.6 Snow Leopard seems much smoother than OS X 10.5 Leopard, so I guess there is something to it. But even without threads and kernel support it still seems to be a good programming model (I haven't burned my fingers yet...).

To better understand GCD, think of a single-threaded, event-driven application with support for nested event queues. We start with a single thread and later extend the concept to multiple threads and cores.

There is a set of global queues with different priorities, one of which is the default. Since we assume a single thread for now, we limit this to only one global queue. Tasks are pushed onto the global queue. A task is just a function with data, i.e. a closure. In C this is a function pointer and a void * context, or a block if you have the blocks extension. The blocks extension is strictly optional, but very useful: with a little syntax it turns a standard C block into a closure that captures a copy of the parent scope's state, which can then quickly be turned into a parallel loop. But this is just syntax, and not really important for the underpinnings of libdispatch.

A function can push other functions onto the global queue. This in effect creates a continuation, and the concept is similar to continuation-passing style or LWT threads, but requires no continuation arguments to drag around. A task should return soon and avoid blocking operations - especially in the single-threaded setup, but blocking is generally bad in GCD.

The interesting part is how this gets scheduled: there is no central scheduler. When our current task function returns (which it should do soon), the piece of calling code that earlier picked our task from a queue now picks the next task on the queue and calls it - ad infinitum, until it reaches an empty queue, at which point the control code can exit, i.e. exit the thread. So far this is your standard HTTP web server event loop.

Now we add a thread pool to the queue. This means that N threads are all either executing a task popped from the queue or popping the next task off the queue. Once a task completes, the next task is popped. Since there are N threads, we have contention on the queue, so it must be briefly locked with compare-exchange logic. Now we have a concurrent queue with the semantics that tasks are started in order but execute concurrently. Any and all synchronization between tasks that access shared resources must be managed by client code.

Which brings us to the next point: a task (or the main program thread) can create a new serial queue. A serial queue is really just a task pushed onto the default global queue. This is mostly invisible to the user, but internally it makes the control flow very simple and elegant (thread-local storage is used to identify the current parent queue in the queue creation function). Our active task can now push new tasks onto the serial queue, and these tasks are guaranteed to be isolated. The serial queue task will eventually be popped from the global concurrent queue, either after our function exits, or sooner if there are other vacant threads in the pool. Once the serial queue task is popped, it starts popping tasks off its own queue, still locking pop operations to avoid conflicts with new pushes.

Note that every single thread cooperates on scheduling by popping the next task from the queue it is currently assigned to.
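To make the queue and task model above concrete, here is a minimal plain-C sketch of mine against the libdispatch API in <dispatch/dispatch.h> (illustration only, not taken from the original discussion; the queue label and the hello function are made-up names). It should build with plain cc on OS X 10.6, no blocks extension needed:

    /* Illustrative sketch: tasks are just function pointer + void * context. */
    #include <dispatch/dispatch.h>
    #include <stdio.h>

    static void hello(void *ctx)
    {
        printf("task says: %s\n", (const char *)ctx);
    }

    int main(void)
    {
        /* The default-priority global (concurrent) queue. */
        dispatch_queue_t global =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* A private serial queue; tasks pushed here run one at a time, in order. */
        dispatch_queue_t serial = dispatch_queue_create("demo.serial", NULL);

        /* Push closures; no threads are visible to the programmer. */
        dispatch_async_f(global, "on the global queue", hello);
        dispatch_async_f(serial, "first on the serial queue", hello);
        dispatch_async_f(serial, "second on the serial queue", hello);

        /* Hand the main thread over to the dispatcher (never returns);
         * the tasks run on pool threads as they become available. */
        dispatch_main();
    }

The block-based dispatch_async variant does the same thing with an inline ^{ ... } closure instead of the function pointer plus context pair.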
When a task is a queue task (a serial queue pushed onto its parent queue), it temporarily redirects the thread to the new queue until that queue is exhausted; then the thread returns its focus to the previous queue. A task can choose to wait on the completion of another task it pushes, in which case it is easy to deadlock if one is not careful.

It is possible to create an insanely large number of queues in this system. The only real overhead is the cost of the memory barriers used when pushing and popping. Apple refers to this as islands of serialization in a sea of concurrency.

There is also the event signalling subsystem, which grabs descriptor events and holds a task until signalled, then pushes the task to a designated queue. This is based on kqueue on FreeBSD and OS X, and some promising work is being done to emulate kqueue on Linux for the purpose of libdispatch integration - this is somewhat related to libevent.

Now, it would be really simple to implement the dispatch API in OCaml, either single-threaded, or multithreaded to handle the odd blocking task. On top of this, Apple has built NSOperations, a user-level system that allows various constraints between tasks: task A cannot run before task B, and only if there is no yellow car parked in front of the hotel, or something. But GCD runs underneath, it is almost all queues in user space, and it is very simple. (Actually I think NSOperations came first, but it has now been rewritten to use GCD.)

OK, so how does this turn into a finely tuned, OS-controlled multi-core system?

Well, whenever a thread locks the queue, it does so through a compare-exchange op that first prevents other threads from taking control, then verifies that it has not itself been the victim of that prevention step. If it has, the queue is under heavy competition, which means there are too many threads, so our discouraged thread prepares for suicide.

During a task push operation to a concurrent queue, an attempt is made to allocate a new thread for that queue. This is done by calling a supporting thread creation function. The task is simply queued, so it does not matter if a new thread is not created; it just means less parallelism.

The supporting thread creation function is where it all happens. It is deeply buried in the client libdispatch logic, and it has two implementations: if the Apple work threads library is present, it calls a kernel function to inform it that there is now more work. This kernel function may or may not provide additional threads, based on a global view of the current system load and the available cores. Since this global view does not know how many of these threads we intend to leave blocking on whatever sync API, it is generally seen as a bad thing to have anything blocking. The event system makes it relatively easy to avoid synchronous calls for common I/O ops.

The portable alternative thread creation function is ugly: whenever a new task is pushed to a global async queue, a new thread is created, unless the thread count has reached 255. While this works, a more intelligent thread pool would be useful: one that knows about the number of cores, and possibly takes some application hint about how many threads we intend to have blocking.

So basically, all there is to libdispatch, aka Grand Central Dispatch, is the kernel work threads API, the kqueue event API, and a thin client library that provides async access to pushing and popping from queues.
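For the event signalling subsystem mentioned above, the client-side handle is a dispatch source: it watches a descriptor (via kqueue on OS X/FreeBSD) and pushes your handler onto a queue of your choice whenever the descriptor becomes ready. Here is a hedged sketch of mine, again plain C with made-up names, echoing lines from stdin:

    /* Illustrative sketch: a dispatch source turning fd readiness into tasks. */
    #include <dispatch/dispatch.h>
    #include <stdio.h>
    #include <unistd.h>

    static void on_stdin_ready(void *ctx)
    {
        (void)ctx;  /* the source's context, unused here */
        char buf[256];
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("got %zd bytes: %s", n, buf);
        }
    }

    int main(void)
    {
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* Hold a task until fd 0 is readable, then push it onto q. */
        dispatch_source_t src =
            dispatch_source_create(DISPATCH_SOURCE_TYPE_READ, STDIN_FILENO, 0, q);
        dispatch_source_set_event_handler_f(src, on_stdin_ready);
        dispatch_resume(src);   /* sources are created suspended */

        dispatch_main();        /* park the main thread; handlers run on pool threads */
    }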
A minor detail left out: when pushing a task, a small list element is allocated from the heap, but it is returned to a thread-local storage cache list, which is cleaned when a thread commits suicide.

There are some additional sugar ops, such as creating a parallel for loop by pushing tasks that take an index as argument (see the sketch at the end of this mail). A clever detail of this design is not to push the 10,000 array elements at once, but to limit it to a fair number based on the system core count, and then have these tasks dynamically restart themselves with a new iteration index, which they allocate from a critical-section-protected iteration counter. Such a list iterator is also a task, and the task creating the iteration blocks until all iterations have completed, which is somewhat similar to a thread join, but trivial to create. It is not clear to me how much of this kind of blocking is a good thing, because it depends on the underlying thread pool. At least this is how I understand it :-)

Now, I'm quite tempted to do this kind of programming in C rather than OCaml, since you get a lot of stuff for free, and you don't have to keep track of a lot of different processes. Oh, by the way: GCD also makes it easy to monitor process events via the kqueue system, so perhaps a C core server dispatching work to OCaml processes would be a way to go? (Process monitoring might be emulated via a separate thread on Linux; at least this is what the author of the kqueue emulation library currently considers.)

As to how GCD could be used in OCaml with multi-core support: there is still the issue of concurrent garbage collection. If the thread support were entirely removed and replaced by a GCD-like construct, the runtime would have a much better view of when scheduling happens, and it would know that there is an observable and limited number of real threads - thus making several large per-thread heaps feasible. But still, it doesn't really solve the concurrent GC problem fully.

I might have gotten some details wrong, but this is what I came up with after digging around the code and asking a few questions.
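P.S. A small sketch (again my own illustration, with made-up names) of the parallel for loop sugar mentioned above: dispatch_apply_f spreads the loop body across the thread pool and blocks the calling task until every iteration has completed, which is the cheap join I described.

    /* Illustrative sketch: parallel for loop over an array via dispatch_apply_f. */
    #include <dispatch/dispatch.h>
    #include <stdio.h>

    #define N 10000

    static double data[N];

    static void square(void *ctx, size_t i)
    {
        double *a = ctx;            /* the shared array, passed as context */
        a[i] = (double)i * (double)i;
    }

    int main(void)
    {
        dispatch_queue_t global =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* Runs the N iterations concurrently across the thread pool and
         * returns only when all of them are done. */
        dispatch_apply_f(N, global, data, square);

        printf("data[100] = %g\n", data[100]);
        return 0;
    }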