From: Eray Ozkural
To: caml-list@yquem.inria.fr
Date: Thu, 17 Dec 2009 08:45:28 +0200
Subject: Re: [Caml-list] How to write a CUDA kernel in ocaml?

On Thu, Dec 17, 2009 at 2:34 AM, Philippe Wang wrote:
> On Wed, Dec 16, 2009 at 2:47 PM, Eray Ozkural wrote:
>
>> One trivial and low-performance solution that comes to mind is: make
>> an OCaml bytecode interpreter into a CUDA kernel and then pass the
>> bytecode to it, and then voila, at least we have some 512-way
>> parallelism on the GT300. How does that sound? We'd be losing some
>> performance, but massive parallelism will make up for some of that.
>
> With parallel processors, you very quickly move the performance
> bottleneck from the processor(s) to memory bandwidth, such that:
> - it is hell to program, because you have to manage concurrency, and
>   that has a real cost;
> - it is useful only for very specific programs that make very few
>   memory accesses compared to the amount of computation (some
>   compression algorithms, for example; a more specific and very
>   easy-to-write example is matrix multiplication).
>
> Imagine you have 3000 MHz of memory bandwidth, which is extremely
> good today (I think). And imagine you have 100 processors sharing
> that bandwidth. If they all want to access memory at the same time,
> then even ignoring the cost of managing the concurrency, you get
> 3000/100 = 30 MHz per processor, which is very, very low. So think
> of 10 processors instead of 100 to be more realistic: that is still
> only 300 MHz per processor, which looks like what we had about a
> decade ago...
>
> (IMHO) A not-too-bad-but-still-realistic way to benefit from GPUs
> today with OCaml (or any high-level language) is to write the
> computation functions in C (possibly with some assembly) and the
> composition functions in OCaml. Or (less realistic in a short amount
> of time) to write a compiler that does the job for you, but that is
> not quite easy...
>
> Good luck,

First, the GT300 will have great memory bandwidth, probably 256 GB/s.
That is half a gigabyte per second per core, which is not bad, I
think. With a smart OCaml bytecode interpreter we could derive some
performance from this (still hypothetical!) baby. The GT300 is great;
it makes the reconfigurable computing project I worked on mostly
obsolete =)

Of course, you are right that the "memory wall" is a serious obstacle
for *any* parallel architecture, not just this or that architecture.
I haven't read very thoroughly, but in NVIDIA's Fermi architecture
the caches and local memory naturally have severe limitations: with
512 cores, you cannot give each of them a huge cache.

In the context of the following comments, assume that a PRAM
algorithm is given. Obviously, we may expect the performance of a
memory-bound algorithm to suffer on *both* multicore architectures
and GPUs (that is where reconfigurable computing might take over...).
However, if the algorithm is compute-bound and can make do with a
moderate memory bandwidth per processor, then I think this becomes an
ideal architecture. Not necessarily just "embarrassingly parallel"
algorithms, though as seen on NVIDIA's CUDA pages, those will work
great! My application is a perfect match for NVIDIA: it needs just
1-2 MB of storage per processor, and it spends more time computing
than accessing memory, so I think it will do well.

The OCaml bytecode interpreter is written in C. For a baseline
implementation we could try to port it to OpenCL:

http://caml.inria.fr/cgi-bin/viewcvs.cgi/ocaml/trunk/byterun/

It would be a cool experiment at least =)

What I want to do is run the OCaml bytecode interpreter on each core
and then feed the relevant bytecode to each of them. It can be done,
I suppose? Or am I missing something crucial? :) The runtime library
would have to be ported to OpenCL/CUDA as well; isn't that possible?

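Just to make the idea concrete, here is a very rough (and completely
untested) sketch of the shape such a kernel could take: each thread
runs its own little dispatch loop over its own bytecode program. The
opcode set, the data layout and all the names below are invented for
illustration only; the real byterun interpreter, the OCaml value
representation and the runtime/GC are of course far more involved.

/* gpu_interp.cu -- hypothetical sketch, not the real byterun port.
   Each CUDA thread interprets its own toy stack-machine program. */

enum { OP_CONST, OP_ADD, OP_MUL, OP_HALT };  /* invented opcodes */

#define PROG_LEN  16   /* instructions in each thread's program */
#define STACK_LEN 32   /* per-thread operand stack depth        */

__global__ void interp(const int *code, const int *args, int *result)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int *pc = code + tid * PROG_LEN;   /* this thread's program */
    int stack[STACK_LEN];
    int sp = 0;

    stack[sp++] = args[tid];                 /* per-thread input value */

    for (;;) {                               /* the dispatch loop */
        switch (*pc++) {
        case OP_CONST: stack[sp++] = *pc++;            break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_MUL:   sp--; stack[sp-1] *= stack[sp]; break;
        case OP_HALT:  result[tid] = stack[sp-1];      return;
        }
    }
}

/* Host side, roughly: interp<<<blocks, threads>>>(d_code, d_args, d_res); */

Whether the real interpreter loop, with the OCaml value representation
and the runtime primitives behind it, fits into this mold is exactly
the part I am unsure about.
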
Best,

PS: Sorry for having mailed this to you personally; I intended to
post it to the mailing list.

-- 
Eray Ozkural, PhD candidate. Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil
http://myspace.com/malfunct