From: Paul Lalonde
Subject: Re: [9fans] speaking of kenc
Date: Fri, 4 May 2007 15:27:00 -0700
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>

On May 4, 2007, at 2:58 PM, Dave Eckhardt wrote:

> One possible goal might be a language in which you could
> describe high-level algorithms of a certain class which
> could then be compiled to run well on a Cell (and, to be
> a cool result, on some other thing).  This would probably
> handle not just computation but also the necessary DMA
> to get the data ready.

FWIW, the C++ template goop that I use in my SPU code is all about
masking the data movements - you don't want virtual-function call
overhead in cache-lookup functions, nor do you want a different version
of the code for each data type you want to transfer.  There is a
relatively limited number of buffer usage patterns.  In approximate
best-to-worst performance order these are: double-buffered input and
output, block-random access, struct-sized random access, and general
pointer-chasing.  I can easily wrap a small language around these
operations (and have in the past - it's just more convenient right now
to let GCC maintain it for me).

> Failing that, it seems like what people will be doing for
> a while is writing code carefully tuned to run well on
> exactly one or two particular models of Cell, which seems
> to me likely to look like carefully optimized "inner loop"
> stuff wrapped by glue code which matters less.

Only partly true; the SPU architecture defines the latencies and stalls
of the various instructions fairly well.
Given my experience optimizing SPU code, the 40:1 to 100:1 improvements
from data restructuring and selective SIMD conversions are worth doing,
while the per-cycle stall management isn't - there might be another
factor of 2, or there might not - it's a difficult space for a small
reward.

> I have to
> wonder whether it would be less painful to learn the hardware
> and write the optimized code in assembly language or to learn
> the hardware *and* learn how to cajole a complicated compiler
> into emitting the assembly language you know it should be
> emitting.

Doing the streaming/caching/DMA code in assembly is a non-starter.
It's just that increment of too complicated.  And fortunately, IBM went
and defined the C language extensions as part of the SPU architecture,
which means it's not too hard to learn to use.  The restrict keyword
does gall me, though.

> With respect to kencc, I wonder how far you could get if
> each Cell vector instruction were a C-callable .s function
> of a few instructions and the SPU linker routinely inlined
> all small-instruction-count functions and had an optimizer
> explicitly designed for the SPU.

I think this could work quite well; I'm not sure how this interferes
with register allocation, though.  I'll give it some thought.  The
harder part will be the data movement operations from my first
paragraph.

Paul

> Dave Eckhardt