From: Mikkel Fahnøe Jørgensen <mikkelfj@gmail.com>
To: Jon Harrop
Cc: caml-list@yquem.inria.fr
Date: Wed, 30 Sep 2009 03:08:06 +0200
Subject: Re: [Caml-list] JIT & HLVM, LLVM

2009/9/27 Jon Harrop:
> On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote:
>> If Grand Central Dispatch makes its way
>> into *nix...
>
> Incidentally, what exactly (technically) is Grand Central and how does it
> relate to existing alternatives and the state-of-the-art? I Googled it but
> failed to find anything useful...

Grand Central Dispatch (GCD, with libdispatch as the client-side library) is a very simple piece of code with some non-trivial implementation details. Some would say it is just a thread queue, but it is clever in the same way map-reduce is clever and simple. It may help to read the Ars Technica article first, since I dig a little below the user-level API here.

Two important concepts: GCD does not require threads, and it does not require kernel support.
When they are present, GCD becomes efficient at handling multiple cores - or so the claim goes, and at least OS X 10.6 Snow Leopard seems much smoother than OS X 10.5 Leopard, so I guess there is something to it. But even without threads and kernel support it still seems to be a good programming model (I haven't burned my fingers yet...).

To better understand GCD, think of a single-threaded, event-driven application with support for nested event queues. We start with a single thread and later extend the concept to multiple threads and cores.

There is a set of global queues with different priorities, one of which is the default. Since we assume a single thread for now, we limit this to only one global queue. Tasks are pushed onto the global queue. A task is just a function with data, i.e. a closure. In C this is a function pointer and a void * context, or a block if you have the blocks extension. The blocks extension is strictly optional, but very useful: with a little syntax it turns a standard C block into a closure that captures a copy of the parent scope's state, which can then quickly be turned into a parallel loop. But this is just syntax, and not really important for the underpinnings of libdispatch.

A function can push other functions onto the global queue. This in effect creates a continuation, and the concept is similar to continuation-passing style or LWT threads, but requires no continuation arguments to drag around. A task should return soon and avoid blocking operations - especially in the single-threaded setup, but blocking is generally bad in GCD.

The interesting part is how this gets scheduled: there is no central scheduler. When our current task function returns (which it should do soon), the piece of calling code that earlier picked our task from a queue now picks the next task on the queue and calls it - ad infinitum, until it reaches an empty queue, at which point the control code can exit, i.e. exit the thread. So far this is your standard HTTP web server event loop.

Now we add a thread pool to the queue. This means that N threads are all either executing a task popped from the queue or popping the next task off the queue. Once a task completes, the next task is popped. Since there are N threads, we have contention on the queue, so it must be briefly locked with compare-exchange logic. Now we have a concurrent queue with the semantics that tasks are started in order but execute concurrently. Any and all synchronization between tasks that access shared resources must be managed by client code.

Which brings us to the next point: a task (or the main program thread) can create a new serial queue. A serial queue is really just a task pushed onto the default global queue. This is mostly invisible to the user, but internally it makes the control flow very simple and elegant (thread-local storage is used to identify the current parent queue in the queue creation function). Our active task can now push new tasks onto the serial queue, and these tasks are guaranteed to be isolated. The serial queue task will eventually be popped from the global concurrent queue, either after our function exits, or sooner if there are other vacant threads in the pool. Once the serial queue task is popped, it starts popping tasks off its own queue, still locking pop operations to avoid conflicts with new pushes.

Note that every single thread cooperates on scheduling by popping the next task from the queue it is currently assigned to.
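To make the queue and task model above concrete, here is a minimal plain-C sketch of mine against the libdispatch API in <dispatch/dispatch.h> (illustration only, not taken from the original discussion; the queue label and the hello function are made-up names). It should build with plain cc on OS X 10.6, no blocks extension needed:

    /* Illustrative sketch: tasks are just function pointer + void * context. */
    #include <dispatch/dispatch.h>
    #include <stdio.h>

    static void hello(void *ctx)
    {
        printf("task says: %s\n", (const char *)ctx);
    }

    int main(void)
    {
        /* The default-priority global (concurrent) queue. */
        dispatch_queue_t global =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* A private serial queue; tasks pushed here run one at a time, in order. */
        dispatch_queue_t serial = dispatch_queue_create("demo.serial", NULL);

        /* Push closures; no threads are visible to the programmer. */
        dispatch_async_f(global, "on the global queue", hello);
        dispatch_async_f(serial, "first on the serial queue", hello);
        dispatch_async_f(serial, "second on the serial queue", hello);

        /* Hand the main thread over to the dispatcher (never returns);
         * the tasks run on pool threads as they become available. */
        dispatch_main();
    }

The block-based dispatch_async variant does the same thing with an inline ^{ ... } closure instead of the function pointer plus context pair.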
When a task is a queue task (a serial queue pushed onto its parent queue), it temporarily redirects the thread to the new queue until that queue is exhausted; then the thread returns its focus to the previous queue. A task can choose to wait on the completion of another task it pushes, in which case it is easy to deadlock if one is not careful.

It is possible to create an insanely large number of queues in this system. The only real overhead is the cost of the memory barriers used when pushing and popping. Apple refers to this as islands of serialization in a sea of concurrency.

There is also the event signalling subsystem, which grabs descriptor events and holds a task until signalled, then pushes the task to a designated queue. This is based on kqueue on FreeBSD and OS X, and some promising work is being done to emulate kqueue on Linux for the purpose of libdispatch integration - this is somewhat related to libevent.

Now, it would be really simple to implement the dispatch API in OCaml, either single-threaded, or multithreaded to handle the odd blocking task. On top of this, Apple has built NSOperations, a user-level system that allows various constraints between tasks: task A cannot run before task B, and only if there is no yellow car parked in front of the hotel, or something. But GCD runs underneath, it is almost all queues in user space, and it is very simple. (Actually I think NSOperations came first, but it has now been rewritten to use GCD.)

OK, so how does this turn into a finely tuned, OS-controlled multi-core system?

Well, whenever a thread locks the queue, it does so through a compare-exchange op that first prevents other threads from taking control, then verifies that it has not itself been the victim of that prevention step. If it has, the queue is under heavy competition, which means there are too many threads, so our discouraged thread prepares for suicide.

During a task push operation to a concurrent queue, an attempt is made to allocate a new thread for that queue. This is done by calling a supporting thread creation function. The task is simply queued, so it does not matter if a new thread is not created; it just means less parallelism.

The supporting thread creation function is where it all happens. It is deeply buried in the client libdispatch logic, and it has two implementations: if the Apple work threads library is present, it calls a kernel function to inform it that there is now more work. This kernel function may or may not provide additional threads, based on a global view of the current system load and the available cores. Since this global view does not know how many of these threads we intend to leave blocking on whatever sync API, it is generally seen as a bad thing to have anything blocking. The event system makes it relatively easy to avoid synchronous calls for common I/O ops.

The portable alternative thread creation function is ugly: whenever a new task is pushed to a global async queue, a new thread is created, unless the thread count has reached 255. While this works, a more intelligent thread pool would be useful: one that knows about the number of cores, and possibly takes some application hint about how many threads we intend to have blocking.

So basically, all there is to libdispatch, aka Grand Central Dispatch, is the kernel work threads API, the kqueue event API, and a thin client library that provides async access to pushing and popping from queues.
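For the event signalling subsystem mentioned above, the client-side handle is a dispatch source: it watches a descriptor (via kqueue on OS X/FreeBSD) and pushes your handler onto a queue of your choice whenever the descriptor becomes ready. Here is a hedged sketch of mine, again plain C with made-up names, echoing lines from stdin:

    /* Illustrative sketch: a dispatch source turning fd readiness into tasks. */
    #include <dispatch/dispatch.h>
    #include <stdio.h>
    #include <unistd.h>

    static void on_stdin_ready(void *ctx)
    {
        (void)ctx;  /* the source's context, unused here */
        char buf[256];
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("got %zd bytes: %s", n, buf);
        }
    }

    int main(void)
    {
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* Hold a task until fd 0 is readable, then push it onto q. */
        dispatch_source_t src =
            dispatch_source_create(DISPATCH_SOURCE_TYPE_READ, STDIN_FILENO, 0, q);
        dispatch_source_set_event_handler_f(src, on_stdin_ready);
        dispatch_resume(src);   /* sources are created suspended */

        dispatch_main();        /* park the main thread; handlers run on pool threads */
    }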
A minor detail left out: when pushing a task, a small list element is allocated from the heap, but it is returned to a thread-local storage cache list, which is cleaned when a thread commits suicide.

There are some additional sugar ops, such as creating a parallel for loop by pushing tasks that take an index as argument (see the sketch at the end of this mail). A clever detail of this design is not to push the 10,000 array elements at once, but to limit it to a fair number based on the system core count, and then have these tasks dynamically restart themselves with a new iteration index, which they allocate from a critical-section-protected iteration counter. Such a list iterator is also a task, and the task creating the iteration blocks until all iterations have completed, which is somewhat similar to a thread join, but trivial to create. It is not clear to me how much of this kind of blocking is a good thing, because it depends on the underlying thread pool. At least this is how I understand it :-)

Now, I'm quite tempted to do this kind of programming in C rather than OCaml, since you get a lot of stuff for free, and you don't have to keep track of a lot of different processes. Oh, by the way: GCD also makes it easy to monitor process events via the kqueue system, so perhaps a C core server dispatching work to OCaml processes would be a way to go? (Process monitoring might be emulated via a separate thread on Linux; at least this is what the author of the kqueue emulation library currently considers.)

As to how GCD could be used in OCaml with multi-core support: there is still the issue of concurrent garbage collection. If the thread support were entirely removed and replaced by a GCD-like construct, the runtime would have a much better view of when scheduling happens, and it would know that there is an observable and limited number of real threads - thus making several large per-thread heaps feasible. But still, it doesn't really solve the concurrent GC problem fully.

I might have gotten some details wrong, but this is what I came up with after digging around the code and asking a few questions.
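P.S. A small sketch (again my own illustration, with made-up names) of the parallel for loop sugar mentioned above: dispatch_apply_f spreads the loop body across the thread pool and blocks the calling task until every iteration has completed, which is the cheap join I described.

    /* Illustrative sketch: parallel for loop over an array via dispatch_apply_f. */
    #include <dispatch/dispatch.h>
    #include <stdio.h>

    #define N 10000

    static double data[N];

    static void square(void *ctx, size_t i)
    {
        double *a = ctx;            /* the shared array, passed as context */
        a[i] = (double)i * (double)i;
    }

    int main(void)
    {
        dispatch_queue_t global =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* Runs the N iterations concurrently across the thread pool and
         * returns only when all of them are done. */
        dispatch_apply_f(N, global, data, square);

        printf("data[100] = %g\n", data[100]);
        return 0;
    }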