caml-list - the Caml user's mailing list
* JIT & HLVM, LLVM
@ 2009-09-27 17:18 David McClain
  2009-09-27 17:22 ` [Caml-list] " Vincent Aravantinos
  0 siblings, 1 reply; 18+ messages in thread
From: David McClain @ 2009-09-27 17:18 UTC (permalink / raw)
  To: caml-list

Ahh, I see from doing a bit more research that a JIT does *not* run
particularly faster than statically compiled code; rather, it runs
faster than interpreted byte-code.

I remember many years ago speaking with David Robson, over lunch,  
about the upcoming changes in Smalltalk, using a form of JIT to  
improve performance of their method dispatch, and attempting to gain  
multiple inheritance in that manner for Smalltalk. But there, again,  
it is a case of attempting to improve on an interpreted byte-code, and  
not a case of improving over statically compiled native code.

But with so many talented bodies working on LLVM, perhaps, in time, a  
way will be found to gain improvement over static native code.

Dr. David McClain
dbm@refined-audiometrics.com





* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-27 17:18 JIT & HLVM, LLVM David McClain
@ 2009-09-27 17:22 ` Vincent Aravantinos
  2009-09-27 17:35   ` Edgar Friendly
  0 siblings, 1 reply; 18+ messages in thread
From: Vincent Aravantinos @ 2009-09-27 17:22 UTC (permalink / raw)
  To: David McClain; +Cc: caml-list

Hi,

I think what Jon means is that, with a JIT, polymorphic functions can be
specialized at run-time, allowing optimizations that are not currently
achieved by the OCaml native-code compiler.
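For instance (a hypothetical sketch, not a description of any current
compiler): ocamlopt compiles the polymorphic function below once, with a
uniform boxed representation for its element type, so the float case pays
boxing overhead on every element; a JIT could instead emit a monomorphic
float version at run-time, equivalent to the hand-specialized variant:

  (* compiled once, generically: the element type stays boxed *)
  let rec sum_gen add zero = function
    | [] -> zero
    | x :: xs -> add x (sum_gen add zero xs)

  let total = sum_gen (+.) 0.0 [1.0; 2.0; 3.0]

  (* what a run-time specializer could produce: floats stay unboxed *)
  let rec sum_float = function
    | [] -> 0.0
    | x :: xs -> x +. sum_float xs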

V.

On 27 Sept 09, at 19:18, David McClain wrote:

> Ahh, I see from doing a bit more research that a JIT does *not* run
> particularly faster than statically compiled code; rather, it runs
> faster than interpreted byte-code.
>
> I remember many years ago speaking with David Robson, over lunch,  
> about the upcoming changes in Smalltalk, using a form of JIT to  
> improve performance of their method dispatch, and attempting to gain  
> multiple inheritance in that manner for Smalltalk. But there, again,  
> it is a case of attempting to improve on an interpreted byte-code,  
> and not a case of improving over statically compiled native code.
>
> But with so many talented bodies working on LLVM, perhaps, in time,  
> a way will be found to gain improvement over static native code.
>
> Dr. David McClain
> dbm@refined-audiometrics.com



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-27 17:22 ` [Caml-list] " Vincent Aravantinos
@ 2009-09-27 17:35   ` Edgar Friendly
  2009-09-27 18:51     ` Jon Harrop
  2009-09-28 12:25     ` Gerd Stolpmann
  0 siblings, 2 replies; 18+ messages in thread
From: Edgar Friendly @ 2009-09-27 17:35 UTC (permalink / raw)
  To: Vincent Aravantinos, caml-list

Vincent Aravantinos wrote:
> Hi,
> 
> I think what Jon means is that, with a JIT, polymorphic functions can be
> specialized at run-time, allowing optimizations that are not currently
> achieved by the OCaml native-code compiler.
> 
> V.

The alternative to specializing at runtime using JIT is to do it at
compile time (/ link time) using a form of whole-program analysis.  How
expensive would this be, and how hard would it be to still support
separate compilation?

And how much would the OCaml world cry if we didn't have fully-separate
compilation?  At the moment, and for the foreseeable future, anything
binary is bound tightly to the compiler version, and binary distribution
of modules seems pretty much impossible due to this and unknown other
factors.  How much easier would the above be given the source code /
some intermediate representation to generate specialized code with?

E



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-27 17:35   ` Edgar Friendly
@ 2009-09-27 18:51     ` Jon Harrop
  2009-09-27 19:07       ` Edgar Friendly
  2009-09-28 12:25     ` Gerd Stolpmann
  1 sibling, 1 reply; 18+ messages in thread
From: Jon Harrop @ 2009-09-27 18:51 UTC (permalink / raw)
  To: caml-list

On Sunday 27 September 2009 18:35:32 Edgar Friendly wrote:
> Vincent Aravantinos wrote:
> > I think what Jon means is that, with a JIT, polymorphic functions can be
> > specialized at run-time, allowing optimizations that are not currently
> > achieved by the OCaml native-code compiler.
>
> The alternative to specializing at runtime using JIT is to do it at
> compile time (/ link time) using a form of whole-program analysis.  How
> expensive would this be, and how hard would it be to still support
> separate compilation?

C++ has those features and they are blamed for slow compilation. However, 
separate compilation is not the issue so much as dynamic loading.

> And how much would the OCaml world cry if we didn't have fully-separate
> compilation?

I think it is a bad idea to consider how much the OCaml world would cry. 
Indeed, we cannot even agree on what constitutes the OCaml world. Xavier said 
that most OCaml users care about the performance of Coq which is, IMHO, 
insanely inaccurate. The important thing is what we can offer the whole 
world, not what little remains of the OCaml world (who are a self-selected 
group of people who, by definition, endure OCaml's shortcomings). I think 
that an OCaml-like language that addresses OCaml's performance (including 
parallelism) and FFI issues would be much more widely useful and is an 
entirely achievable goal.

> At the moment, and for the foreseeable future, anything 
> binary is bound tightly to the compiler version, and binary distribution
> of modules seems pretty much impossible due to this and unknown other
> factors.

Right. Separate compilation and dynamic linking are trivial with a JIT 
compiler.

> How much easier would the above be given the source code / 
> some intermediate representation to generate specialized code with?

Infinitely easier.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-27 18:51     ` Jon Harrop
@ 2009-09-27 19:07       ` Edgar Friendly
  2009-09-27 19:23         ` kcheung
  0 siblings, 1 reply; 18+ messages in thread
From: Edgar Friendly @ 2009-09-27 19:07 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

Jon Harrop wrote:
> C++ has those features and they are blamed for slow compilation. However, 
> separate compilation is not the issue so much as dynamic loading.
> 
Yes, dynamic loading doesn't mix with whole-program analysis.  Although
I wonder if it's possible to have our cake and eat it too, with a fast
bytecode compiler and a slower native code compiler that does great
specialization.

> I think 
> that an OCaml-like language that addresses OCaml's performance (including 
> parallelism) and FFI issues would be much more widely useful and is an 
> entirely achievable goal.
> 
I'm not as committed to abandoning OCaml as you seem to be, and I have
hope for incremental improvement of OCaml's weaknesses, although I
realize we'll have to break a number of things to get to where we both
(and likely many others) want to be.


E



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-27 19:07       ` Edgar Friendly
@ 2009-09-27 19:23         ` kcheung
  2009-09-27 19:33           ` Jon Harrop
  2009-09-27 21:10           ` Jon Harrop
  0 siblings, 2 replies; 18+ messages in thread
From: kcheung @ 2009-09-27 19:23 UTC (permalink / raw)
  To: caml-list

>> I think
>> that an OCaml-like language that addresses OCaml's performance
>> (including
>> parallelism) and FFI issues would be much more widely useful and is an
>> entirely achievable goal.
>>
> I'm not as committed to abandoning OCaml as you seem to be, and I have
> hope for incremental improvement of OCaml's weaknesses, although I
> realize we'll have to break a number of things to get to where we both
> (and likely many others) want to be.

Perhaps the future adoption of OCaml will
depend on what OCaml 4.0 is going to be like.
If Grand Central Dispatch makes its way
into *nix, then I think it is extremely
worthwhile for OCaml to have support for
something similar.

Kevin Cheung.



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-27 19:23         ` kcheung
@ 2009-09-27 19:33           ` Jon Harrop
  2009-09-27 21:10           ` Jon Harrop
  1 sibling, 0 replies; 18+ messages in thread
From: Jon Harrop @ 2009-09-27 19:33 UTC (permalink / raw)
  To: caml-list

On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote:
> > I'm not as committed to abandoning OCaml as you seem to be, and I have
> > hope for incremental improvement of OCaml's weaknesses, although I
> > realize we'll have to break a number of things to get to where we both
> > (and likely many others) want to be.
>
> Perhaps the future adoption of OCaml will
> depend on what OCaml 4.0 is going to be like.
> If Grand Central Dispatch makes its way
> into *nix, then I think it is extremely
> worthwhile for OCaml to have support for
> something similar.

Someone would have to inspire Xavier and/or Damien to start from scratch and 
build what we need (and not what Coq needs). Realistically, that is never 
going to happen. We should thank them for enlightening us but this will only 
ever get built if we build it ourselves. But, hey, at least we can build it 
in OCaml. ;-)

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-27 19:23         ` kcheung
  2009-09-27 19:33           ` Jon Harrop
@ 2009-09-27 21:10           ` Jon Harrop
  2009-09-27 21:45             ` Vincent Aravantinos
  2009-09-30  1:08             ` Mikkel Fahnøe Jørgensen
  1 sibling, 2 replies; 18+ messages in thread
From: Jon Harrop @ 2009-09-27 21:10 UTC (permalink / raw)
  To: caml-list

On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote:
> If Grand Central Dispatch makes its way
> into *nix...

Incidentally, what exactly (technically) is Grand Central and how does it 
relate to existing alternatives and the state-of-the-art? I Googled it but 
failed to find anything useful...

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-27 21:10           ` Jon Harrop
@ 2009-09-27 21:45             ` Vincent Aravantinos
  2009-09-30  1:08             ` Mikkel Fahnøe Jørgensen
  1 sibling, 0 replies; 18+ messages in thread
From: Vincent Aravantinos @ 2009-09-27 21:45 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

On 27 Sept 09, at 23:10, Jon Harrop wrote:

> On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote:
>> If Grand Central Dispatch makes its way
>> into *nix...
>
> Incidentally, what exactly (technically) is Grand Central and how  
> does it
> relate to existing alternatives and the state-of-the-art? I Googled  
> it but
> failed to find anything useful...

http://arstechnica.com/apple/reviews/2009/08/mac-os-x-10-6.ars

Not everything in those 23 pages is relevant, but you should find answers.

V.


* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-27 17:35   ` Edgar Friendly
  2009-09-27 18:51     ` Jon Harrop
@ 2009-09-28 12:25     ` Gerd Stolpmann
  1 sibling, 0 replies; 18+ messages in thread
From: Gerd Stolpmann @ 2009-09-28 12:25 UTC (permalink / raw)
  To: Edgar Friendly; +Cc: Vincent Aravantinos, caml-list


On Sunday, 27.09.2009, at 13:35 -0400, Edgar Friendly wrote:
> Vincent Aravantinos wrote:
> > Hi,
> > 
> > I think what Jon means is that, with a JIT, polymorphic functions can be
> > specialized at run-time, allowing optimizations that are not currently
> > achieved by the OCaml native-code compiler.
> > 
> > V.
> 
> The alternative to specializing at runtime using JIT is to do it at
> compile time (/ link time) using a form of whole-program analysis.  How
> expensive would this be, and how hard would it be to still support
> separate compilation?
> 
> And how much would the OCaml world cry if we didn't have fully-separate
> compilation? 

We don't have fully separate compilation anyway (in native mode). The
compiler can already dump an intermediate representation of functions
into .cmx files, and can use that for cross-module inlining. This
feature is optional for the user. Of course, it increases the number of
modules that need to be recompiled when something is changed.
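For instance (a made-up two-module sketch):

  (* a.ml *)
  let double x = 2 * x

  (* b.ml - compiled with ocamlopt, the body of A.double can be
     inlined here because a.cmx carries its intermediate
     representation *)
  let quad x = A.double (A.double x)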

If we keep that spirit of giving the user a choice: of course the "OCaml
world" would appreciate it if the actual code generation could be delayed
in order to gain performance improvements from whole-program analysis.
However, for projects of several hundred KLOC the compile time could
increase drastically, and I don't think users with such large programs
would actually use that feature.

Gerd

>  At the moment, and for the foreseeable future, anything
> binary is bound tightly to the compiler version, and binary distribution
> of modules seems pretty much impossible due to this and unknown other
> factors.  How much easier would the above be given the source code /
> some intermediate representation to generate specialized code with?
> 
> E
> 
-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-27 21:10           ` Jon Harrop
  2009-09-27 21:45             ` Vincent Aravantinos
@ 2009-09-30  1:08             ` Mikkel Fahnøe Jørgensen
  2009-09-30 11:57               ` Stéphane Glondu
  2009-09-30 15:22               ` Jon Harrop
  1 sibling, 2 replies; 18+ messages in thread
From: Mikkel Fahnøe Jørgensen @ 2009-09-30  1:08 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

2009/9/27 Jon Harrop <jon@ffconsultancy.com>:
> On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote:
>> If Grand Central Dispatch makes its way
>> into *nix...
>
> Incidentally, what exactly (technically) is Grand Central and how does it
> relate to existing alternatives and the state-of-the-art? I Googled it but
> failed to find anything useful...
>

Grand Central Dispatch (GCD, where libdispatch is the client-side
library) is a very simple piece of code (with some non-trivial
implementation details). Some would say it is just a thread queue, but
it is quite clever in the same way that map-reduce is clever and simple.

It may help to read the Ars Technica article first, since I dig a
little below the user-level API here.

Two important concepts: GCD does not require threads, and it does not
require kernel support. When they are present, GCD becomes efficient at
handling multiple cores - or so the claim goes; at least OS X 10.6
Snow Leopard seems to be much smoother than OS X 10.5 Leopard, so I
guess there is something to it. But even without threads and kernel
support, it still seems to be a good programming model (I haven't
burned my fingers yet...)


To better understand GCD, think of a single-threaded, event-driven
application with support for nested event queues.

We start with a single thread and later extend the concept to multiple
threads and cores:

There is a set of global queues with different priorities, one of which
is the default. Since we assume a single thread for now, we limit this
to one global queue.

Tasks are pushed onto the global queue.

A task is just a function with data, i.e. a closure.

In C this is a function pointer and a void * context, or a block if you
have the blocks extension. The blocks extension is really optional,
although very useful: with a little syntax it can turn a standard C
block into a closure carrying a copy of the enclosing scope's state,
which can then quickly be turned into a parallel loop. But this is just
syntax, really not too important for the underpinnings of libdispatch.

A function can push other functions to the global queue. This in
effect creates a continuation, and the concept is similar to
continuation passing style or LWT threads, but requires no
continuation arguments to drag around.

A task should return soon and avoid blocking operations - especially in
a single-threaded setup, but blocking is generally bad in GCD.

The interesting part is how this gets scheduled: There is no central scheduler.

When our current task function returns (which it should do soon), there
is another piece of calling code that earlier picked our task from a
queue. Upon task return, this code picks the next task from the queue
and calls it - ad infinitum, until we reach an empty queue, in which
case the control code can exit - i.e. exit the thread.

So far this is your standard HTTP web server event loop.

Now we add a thread pool to the queue.

This means that N threads are all either executing a task popped from
the queue or busy popping the next one. Once a task completes, the next
task is popped. Since there are N threads, we have contention on the
queue, so it must be briefly locked with compare-exchange logic.

Now we have a concurrent queue with the semantic that tasks are
started in order, but execute concurrently. Any and all
synchronization between tasks must be managed by client code if they
access shared resources.

Which brings us to the next point:

A task (or the main program thread) can create a new serial queue. A
serial queue is really just a task pushed onto the default global
queue. This is mostly invisible to the user, but internally it makes
the control flow very simple and elegant (thread-local storage is used
to identify the current parent queue in the queue-creation function).

Our active task can now push new tasks onto the serial queue. These
tasks are guaranteed to be isolated. The serial-queue task will
eventually be popped from the global concurrent queue, either after our
function exits or sooner if there are other vacant threads in the pool.
Once the serial-queue task is popped, it starts popping tasks off its
queue, still locking pop operations to avoid conflicts with new pushes.

Note that every single thread cooperates on scheduling by popping the
next task from the queue it is currently assigned to. When a task is a
queue task, it temporarily switches the thread to the new queue until
that queue is exhausted; then the thread returns focus to the previous
queue.

A task can choose to wait on the completion of another task it pushes,
in which case it is easy to deadlock if one is not careful.

It is possible to create an insanely huge number of queues in this
system. The only real overhead is the cost of the memory barriers used
when pushing and popping.

Apple refers to this as islands of serialization in a sea of concurrency.

There is also the event signalling subsystem that grabs descriptor
events and holds a task until signalled, then pushes the task to a
designated queue. This is based on kqueue on FreeBSD and OS X, and
some promising work is being done to emulate kqueue for Linux for the
purpose of libdispatch integration - this is somewhat related to
libevent.

Now, it would be really simple to implement the dispatch API in OCaml,
either single-threaded, or multi-threaded to handle the occasional
blocking task.
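Something like this minimal single-threaded sketch (the names are made
up, and a real version would add the thread pool and the locking
described above):

  module Dispatch = struct
    type queue = (unit -> unit) Queue.t

    let create () : queue = Queue.create ()

    (* push a task; it runs when its turn comes *)
    let async q task = Queue.push task q

    (* the control loop each worker would run: pop and call tasks
       until the queue is empty *)
    let run q =
      while not (Queue.is_empty q) do
        (Queue.pop q) ()  (* a task may push further tasks onto q *)
      done
  end

  let () =
    let q = Dispatch.create () in
    Dispatch.async q (fun () ->
      print_endline "task 1";
      (* instead of blocking, push a continuation *)
      Dispatch.async q (fun () -> print_endline "task 1b"));
    Dispatch.async q (fun () -> print_endline "task 2");
    Dispatch.run q  (* prints: task 1, task 2, task 1b *)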

On top of this, Apple has built NSOperations, a user-level system that
allows various constraints between tasks: task A cannot run before task
B, and only if there is no yellow car parked in front of the hotel, or
something. But GCD runs underneath, and it is almost all queues in user
space, and very simple. (Actually, I think NSOperations came first, but
it has now been rewritten to use GCD.)

OK, so how does this turn into a finely tuned, OS-controlled multi-core system?

Well, whenever a thread locks the queue, it does so through a
compare-exchange op that first prevents other threads from taking
control, and then verifies that it has not itself been a victim of the
prevention step. If it has, the queue is under heavy contention, which
means there are too many threads, so our discouraged thread will
prepare for suicide. During a task push to a concurrent queue, an
attempt is made to allocate a new thread for that queue, by calling a
supporting thread-creation function. The task is simply queued, so it
does not matter if a new thread is not created; it just means less
parallelism.

The supporting thread-creation function is where it all happens. It is
deeply buried in the client libdispatch logic, and it has two
implementations: if the Apple work-threads library is present, it calls
a kernel function to inform the kernel that there is now more work.
This kernel function may or may not provide additional threads, based
on a global view of current system load and available cores. Since this
global view does not know how many of these threads we intend to leave
blocking on some sync API, it is generally seen as a bad thing to have
anything blocking. The event system makes it relatively easy to avoid
synchronous calls for common I/O ops.

The portable alternative thread-creation function is ugly: whenever a
new task is pushed to a global async queue, a new thread is created
unless the thread count has reached 255. While this works, a more
intelligent thread pool would be useful: one that knows about the
number of cores, and possibly some application hint about how many
threads we intend to have blocking.

So basically, all there is to libdispatch, aka Grand Central Dispatch,
is the kernel work-threads API, the kqueue event API, and a thin client
library that provides async access to pushing and popping from queues.

A minor detail left out: when pushing a task, a small list element is
allocated from the heap, but it is returned to a thread-local cache
list, which is cleaned when a thread commits suicide.

There are some additional sugar ops, such as creating a parallel for
loop by pushing tasks that take an index as argument. A clever detail
of this design is not to push the 10,000 array elements at once, but to
limit the count to a fair number based on the system core count, and
then have these tasks dynamically restart themselves with a new
iteration index, which they allocate from a critical-section-protected
iteration counter.

Such a list iterator is also a task, and the task creating the
iteration will block until all iterations have completed, which is
somewhat similar to a thread join, but trivial to create. It is not
clear to me how much of this kind of blocking is a good thing because
it depends on the underlying thread pool.
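A rough OCaml sketch of that iteration scheme, using the threads
library and a mutex-protected counter in place of GCD's queues (all
names are made up):

  let parallel_for nthreads n body =
    let counter = ref 0 and m = Mutex.create () in
    let next () =
      Mutex.lock m;
      let i = !counter in
      incr counter;
      Mutex.unlock m;
      if i < n then Some i else None
    in
    (* each worker restarts itself with a freshly allocated index *)
    let rec worker () =
      match next () with
      | Some i -> body i; worker ()
      | None -> ()
    in
    let ts = Array.init nthreads (fun _ -> Thread.create worker ()) in
    Array.iter Thread.join ts  (* the blocking join mentioned above *)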

At least this is how I understand it :-)

Now, I'm quite tempted to do this kind of programming in C rather than
OCaml, since you get a lot of stuff for free and you don't have to keep
track of a lot of different processes. Oh, by the way: GCD also makes
it easy to monitor process events via the kqueue system, so perhaps a C
core server dispatching work to OCaml processes would be a way to go?
(Process monitoring might be emulated via a separate thread on Linux;
at least that is what the author of the kqueue emulation library is
currently considering.)

As to how GCD could be used in OCaml with multi-core support:

There is still the issue of concurrent garbage collection. If the
thread support were entirely removed and replaced by a GCD-like
construct, the runtime would have a much better view of when scheduling
happens, and it would know that there is an observable and limited
number of real threads - thus making several large per-thread heaps
feasible. But still, it doesn't really fully solve the concurrent GC
problem.

I might have gotten some details wrong, but this is what I came up
with after digging around the code and asking a few questions.



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-30  1:08             ` Mikkel Fahnøe Jørgensen
@ 2009-09-30 11:57               ` Stéphane Glondu
  2009-09-30 12:54                 ` Mikkel Fahnøe Jørgensen
  2009-09-30 15:22               ` Jon Harrop
  1 sibling, 1 reply; 18+ messages in thread
From: Stéphane Glondu @ 2009-09-30 11:57 UTC (permalink / raw)
  To: Mikkel Fahnøe Jørgensen; +Cc: caml-list

Mikkel Fahnøe Jørgensen wrote:
> A function can push other functions to the global queue. This in
> effect creates a continuation, and the concept is similar to
> continuation passing style or LWT threads, *but requires no
> continuation arguments to drag around*.

Could you elaborate?


Best regards,

-- 
Stéphane



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-30 11:57               ` Stéphane Glondu
@ 2009-09-30 12:54                 ` Mikkel Fahnøe Jørgensen
  2009-09-30 13:42                   ` Stéphane Glondu
  0 siblings, 1 reply; 18+ messages in thread
From: Mikkel Fahnøe Jørgensen @ 2009-09-30 12:54 UTC (permalink / raw)
  To: Stéphane Glondu; +Cc: caml-list

First, sorry for the rough post (it was late); I see some typos and
slightly more confusing mistakes, but hopefully nothing too bad -
please ask if something is unclear.

>> A function can push other functions to the global queue. This in
>> effect creates a continuation, and the concept is similar to
>> continuation passing style or LWT threads, *but requires no
>> continuation arguments to drag around*.

> Could you elaborate?

Yes. I'm probably not the best person to explain continuation-passing
style, but I'll give it a shot.

In OCaml you do not have first-class continuations, but you do have partial application.

This means the function

  let sum a b = a + b

can be called partially like

  let sum_creator a = sum a

This won't execute the sum code, but will have it pending until we get
the last argument b. In this sense the sum_creator function is like a
sleeping thread waiting to execute.

We can combine several functions in this way to get a complex
calculation which won't evaluate until the continuation argument b is
supplied.

For example we can create a complex boolean expression combining
several primitive and / or functions that take a variable x as
argument. The entire expression is suspended until x is given, and in
principle the OR subexpressions could execute in parallel. We can also
build parse expressions where it is not a variable x, but the rest of
the input string that is missing.

Instead of evaluating a sum or boolean expression, the function could
do something else, like filling a buffer from a socket. The k argument
does not need to be used at all, we just use it to avoid computation
before we need it.

We can use this to create threads by having an extra dummy argument
that we will call k. We don't really need k; we just need the
expression not to evaluate before we say so. This is the continuation
being dragged (passed) around.
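For example (a toy sketch):

  (* the print is suspended until the dummy argument k is supplied *)
  let suspended k = print_endline "working"; k

  let () = suspended ()  (* only now does the print run *)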

(In monadic stream parsers, the k continuation is replaced by IO state,
which is dragged around and, contrary to k, provides useful
information. These parsers allow parsing to sleep when the input stream
dries up temporarily, and they can explore several parse paths
concurrently (useful for error reporting), so there is an analogy to
concurrent threads consuming I/O.)

A function can also return a recursive call to itself, supplying new
state such as the number of bytes left to read, or a partial call to
another function - in both cases supplying new state information but
withholding the k argument, so that "the next task to execute" is
returned instead of actually being executed.

And now it starts to get complex, but you get hold of an entire tree
of partially applied functions that are parked waiting for the last
argument.

A clean way to put one function after another so they execute in
sequence as dependent tasks is to use a monadic bind operation:

 http://ocsigen.org/docu/last/Lwt.html#VALbind
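For example (a sketch assuming a recent Lwt, where Lwt_io and Lwt_main
are available):

  let (>>=) = Lwt.bind

  let program =
    Lwt_io.printl "first" >>= fun () ->
    Lwt_io.printl "second"

  let () = Lwt_main.run program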

But this requires the functions to be designed in a clean way and to
conform to certain monadic rules, and getting it wrong creates a mess
of type errors.

What we effectively achieve is a lot of functions queued up on the call
stack, in the form of closures allocated to hold the partially applied
parameters.

Now, we might as well just push a closure onto a queue instead of onto
the call stack. This avoids a lot of complexity in the function type
design, and we get a lot more flexibility in how we dispatch the tasks
(arguably we could do the same, or possibly more, in
continuation-passing style, but it will give you a headache).

We might lose some type safety by using queues, but I'm sure one
could dream up some queue type system to enforce certain conditions.

More importantly, it is much simpler to understand a closure on a
queue than continuation-passing style.

And finally, with queues you have an elegant way of extending this to
multiple OS threads (or ocaml green threads), although again this is
also doable in continuation passing style.

Someone could probably argue that a queue is also just a continuation
state or something, but we are dealing with making threaded
abstractions that are easy to use and implement, not mathematical
equivalence relations.

2009/9/30 Stéphane Glondu <steph@glondu.net>:
> Mikkel Fahnøe Jørgensen wrote:
>> A function can push other functions to the global queue. This in
>> effect creates a continuation, and the concept is similar to
>> continuation passing style or LWT threads, *but requires no
>> continuation arguments to drag around*.
>
> Could you elaborate?
>
>
> Best regards,
>
> --
> Stéphane
>
>



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-30 12:54                 ` Mikkel Fahnøe Jørgensen
@ 2009-09-30 13:42                   ` Stéphane Glondu
  2009-09-30 19:11                     ` Mikkel Fahnøe Jørgensen
  0 siblings, 1 reply; 18+ messages in thread
From: Stéphane Glondu @ 2009-09-30 13:42 UTC (permalink / raw)
  To: Mikkel Fahnøe Jørgensen; +Cc: caml-list

Mikkel Fahnøe Jørgensen wrote:
> But this requires the functions to be designed in a clean way and to
> conform to certain monadic rules, and getting it wrong creates a mess
> of type errors.

Actually, I find the typing discipline enforced by the monadic
abstraction very helpful (and elegant).

> Now, we might as well just push a closure onto a queue instead of onto
> the call stack. This avoids a lot of complexity in the function type
> design, and we get a lot more flexibility in how we dispatch the tasks
> (arguably we could do the same, or possibly more, in
> continuation-passing style, but it will give you a headache).

This sounds like a narrow (and C-ish) way to tackle things. The bind
operator is about composition of threads, not scheduling.

> More importantly, it is much simpler to understand a closure on a
> queue, than continuation passing style.

What you call the "call stack" is orthogonal to your "queues of
closures": one is about combining results of threaded computations
whereas the other is just spawning threads and relying on some external
mechanism to schedule them.

In other words, I think both can be used together, and achieve different
purposes.


Best regards,

-- 
Stéphane



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-30  1:08             ` Mikkel Fahnøe Jørgensen
  2009-09-30 11:57               ` Stéphane Glondu
@ 2009-09-30 15:22               ` Jon Harrop
  2009-09-30 19:35                 ` Mikkel Fahnøe Jørgensen
  1 sibling, 1 reply; 18+ messages in thread
From: Jon Harrop @ 2009-09-30 15:22 UTC (permalink / raw)
  To: caml-list

On Wednesday 30 September 2009 02:08:06 Mikkel Fahnøe Jørgensen wrote:
> 2009/9/27 Jon Harrop <jon@ffconsultancy.com>:
> > On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote:
> >> If Grand Central Dispatch makes its way
> >> into *nix...
> >
> > Incidentally, what exactly (technically) is Grand Central and how does it
> > relate to existing alternatives and the state-of-the-art? I Googled it
> > but failed to find anything useful...
>
> Grand Central Dispatch...

Thanks for the explanation! This seems to be attacking the same problem as 
Cilk and the TPL but my immediate impression (based entirely upon your 
description) is that Cilk and the TPL are probably better foundations.

Using one work stealing deque per core is much more efficient than the work 
sharing queue you described for two reasons:

1. Less global synchronization.

2. Subtasks are likely to be executed on the core that spawned them, which 
improves cache efficiency.

The parallel "for" loop you described is similar to the TPL's and leaves a lot 
to be desired. Specifically, you want to execute clusters of consecutive 
tasks on the same core for two reasons:

1. Using the same core improves cache efficiency because the end of one task 
is often spatially related to the start of the next, e.g. parallel quicksort, 
linear algebra or image processing.

2. Executing clusters of tasks amortizes the cost of queueing tasks.

The former is easily solved with the Cilk/TPL design by using recursive 
subdivision instead of the index sharing scheme you described because 
subtasks are likely to be executed on the same core. However, that will not 
work with a work sharing queue because subtasks are executed on random cores. 
Moreover, a trivial extension to this higher-order function allows you to 
pass in another function that describes the complexity of each task. This 
allows the recursive function to more intelligently subdivide the given range 
into clusters of variable-complexity tasks with similar total running times. 
This is the technique I use in our commercial F# libraries but I have not 
seen it described elsewhere.
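In OCaml, a sequential sketch of that recursive subdivision could look
like this (a real implementation would spawn one half as a stealable
subtask; OCaml itself offers no such primitive here):

  let rec par_for ?(grain = 1024) lo hi body =
    if hi - lo <= grain then
      for i = lo to hi - 1 do body i done
    else begin
      let mid = lo + (hi - lo) / 2 in
      (* spawn one half as a stealable subtask in a real version;
         consecutive indices then stay clustered on one core *)
      par_for ~grain lo mid body;
      par_for ~grain mid hi body
    end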

Cilk and the TPL also just rely upon the programmer to make tasks sufficiently 
complex that the time spent queueing and dequeueing them is insignificant. I 
solved this problem by injecting run-time profiling code (with global 
synchronizations per cluster of tasks) to measure the proportion of the time 
being spent spawning rather than executing tasks and then use exponential 
backoff to increase the cluster size until a sufficiently small proportion of 
the time is spent spawning. Again, I have not seen this technique described 
elsewhere.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-30 13:42                   ` Stéphane Glondu
@ 2009-09-30 19:11                     ` Mikkel Fahnøe Jørgensen
  0 siblings, 0 replies; 18+ messages in thread
From: Mikkel Fahnøe Jørgensen @ 2009-09-30 19:11 UTC (permalink / raw)
  To: Stéphane Glondu; +Cc: caml-list

2009/9/30 Stéphane Glondu <steph@glondu.net>:
> Mikkel Fahnøe Jørgensen wrote:
> Actually, I find the typing discipline enforced by the monadic
> abstraction very helpful (and elegant).

For some purposes - for example filtering and transforming large data
sets, but perhaps less so for ad hoc tasks like background reporting -
although of course it can be done.

>> Now, we might as well just push a closure onto a queue instead of onto
>> the call stack. This avoids a lot of complexity in the function type
>> design, and we get a lot more flexibility in how we dispatch the tasks
>> (arguably we could do the same, or possibly more, in
>> continuation-passing style, but it will give you a headache).
>
> This sounds like a narrow (and C-ish) way to tackle things. The bind
> operator is about composition of threads, not scheduling.

Perhaps, but sometimes this makes it easier to get things done,
especially for people not trained in the use of monads. Yes, bind
makes sure that one task is not scheduled at the same time, or before,
another task.

> What you call the "call stack" is orthogonal to your "queues of
> closures": one is about combining results of threaded computations
> whereas the other is just spawning threads and relying on some external
> mechanism to schedule them.

Well - yes and no. In general, monads are about combining results, but
they are also used for thread control where you do not rely on the
results. But it highlights an interesting point: queues do nothing to
help you communicate computation results, in parallel or not. Neither
does a monad-based thread library. But a map-reduce monadic setup very
much does combine results, and it can also be made to schedule for
parallel execution, which is very elegant - but not a good match for
more ad hoc concurrent tasks.

> In other words, I think both can be used together, and achieve different
> purposes.

I agree. It was not my intention to say that monads are bad in any
way, and indeed many of our daily collection types are very useful
monads with abstractions that make them easy to use.

But since monads tend to spread all over the code, I think they should
not be used as a solution to everything, and this is basically what I
meant by "dragging continuations around".

Regards,
Mikkel



* Re: [Caml-list] JIT & HLVM, LLVM
  2009-09-30 15:22               ` Jon Harrop
@ 2009-09-30 19:35                 ` Mikkel Fahnøe Jørgensen
  0 siblings, 0 replies; 18+ messages in thread
From: Mikkel Fahnøe Jørgensen @ 2009-09-30 19:35 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

> Using one work stealing deque per core is much more efficient than the work
> sharing queue you described for two reasons:
>
> 1. Less global syncronization.
>
> 2. Subtasks are likely to be executed on the core that spawned them, which
> improves cache efficiency.

You could also create larger task chunks or schedule multiple tasks on
a serial queue to obtain some of the benefits of cache coherency. I'm
not saying that this is better than the approach you suggest.

Another issue is latency vs. throughput.
If you have a lot of short transactions, as in a trading system, you
want things done now, not more efficiently in a while - then you'd
rather add more cores to the system and waste some.

> The parallel "for" loop you described is similar to the TPL's and leaves a lot
> to be desired. Specifically, you want to execute clusters of consecutive
> tasks on the same core for two reasons:

Umm, yes - it is just a convenience for getting things done, I suppose;
but again, you could split the loop into nested loops to increase
granularity. In relation to my other recent post, though, there is also
map-reduce for addressing these issues, which works well across network
nodes.


> The former is easily solved with the Cilk/TPL design by using recursive
> subdivision instead of the index sharing scheme you described because
> subtasks are likely to be executed on the same core.

I also wondered why GCD insists on starting indices in order and even
wastes time syncing on the index counter, since there is no guarantee
of execution order other than start time. I guess it would be easy to
modify GCD with a different scheduler - it is about 10 lines of code in
the library, and you could set a property on a queue to suggest a
preferred scheduler. However, I think the benefit of the current
approach is exactly to ensure that as many early indices as possible
run in parallel - which you might want for low latency.

Another issue that concerns me more is how serial queues behave when
more work is added to them. If tasks keep eating from the serial
queue, it might starve other, more important work, unless the serial
queue takes a break - of course the thread is preempted, but it will
not give priority to older concurrent tasks still waiting to be pulled
from the global concurrent queue. Again, I think this is easily fixed,
and I might not have understood it fully.

For high-performance computing of many similar computations, like pixel
processing, I think one should also have a look at OpenCL, and possibly
some map-reduce top layer that can distribute work to multiple nodes -
such a layer could be built with GCD, without relying heavily on GCD
for anything other than coordination.

Again, latency is important here - the sooner you can ship work off to
OpenCL (which feeds work to graphics coprocessors), the more work gets
done, rather than having work line up on the same core so that GCD can
maximize its own CPU cache use.

This also applies to network communication: if you can send data
sooner, the other end can start computing earlier - especially if
there is a risk of delays.

So there is an argument both ways - I think GCD is more intended for
systems programming than number crunching (or raytracing :-)

I think one should also remember that GCD is a programming abstraction;
there are many ways to implement it, and as time progresses some
changes could be made.



* Re: [Caml-list] JIT & HLVM, LLVM
       [not found] <20090927214558.D40C5BC5C@yquem.inria.fr>
@ 2009-09-27 21:59 ` CUOQ Pascal
  0 siblings, 0 replies; 18+ messages in thread
From: CUOQ Pascal @ 2009-09-27 21:59 UTC (permalink / raw)
  To: caml-list


> We should thank them for enlightening us but this will only 
> ever get built if we build it ourselves. But, hey, at least we can build it 
> in OCaml. ;-)

So that's why I was reminded of John Skaller all this time...

Pascal



end of thread

Thread overview: 18+ messages
2009-09-27 17:18 JIT & HLVM, LLVM David McClain
2009-09-27 17:22 ` [Caml-list] " Vincent Aravantinos
2009-09-27 17:35   ` Edgar Friendly
2009-09-27 18:51     ` Jon Harrop
2009-09-27 19:07       ` Edgar Friendly
2009-09-27 19:23         ` kcheung
2009-09-27 19:33           ` Jon Harrop
2009-09-27 21:10           ` Jon Harrop
2009-09-27 21:45             ` Vincent Aravantinos
2009-09-30  1:08             ` Mikkel Fahnøe Jørgensen
2009-09-30 11:57               ` Stéphane Glondu
2009-09-30 12:54                 ` Mikkel Fahnøe Jørgensen
2009-09-30 13:42                   ` Stéphane Glondu
2009-09-30 19:11                     ` Mikkel Fahnøe Jørgensen
2009-09-30 15:22               ` Jon Harrop
2009-09-30 19:35                 ` Mikkel Fahnøe Jørgensen
2009-09-28 12:25     ` Gerd Stolpmann
     [not found] <20090927214558.D40C5BC5C@yquem.inria.fr>
2009-09-27 21:59 ` CUOQ Pascal
