From: Paul Lalonde
Subject: Re: [9fans] speaking of kenc
Date: Fri, 4 May 2007 15:27:00 -0700
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>

On May 4, 2007, at 2:58 PM, Dave Eckhardt wrote:

> One possible goal might be a language in which you could
> describe high-level algorithms of a certain class which
> could then be compiled to run well on a Cell (and, to be
> a cool result, on some other thing).  This would probably
> handle not just computation but also the necessary DMA
> to get the data ready.

FWIW, the C++ template goop that I use in my SPU code is all about
masking the data movements - you don't want virtual-function call
overhead in cache-lookup functions, nor do you want a different version
of the code for each data type you want to transfer.  There is a
relatively limited number of buffer usage patterns.  In approximate
best-to-worst performance order these are: double-buffered input and
output, block-random access, struct-sized random access, and general
pointer-chasing.  I can easily wrap a small language around these
operations (and have in the past - it's just more convenient right now
to let GCC maintain it for me).

> Failing that, it seems like what people will be doing for
> a while is writing code carefully tuned to run well on
> exactly one or two particular models of Cell, which seems
> to me likely to look like carefully optimized "inner loop"
> stuff wrapped by glue code which matters less.

Only partly true; the SPU architecture defines the latencies and stalls
of the various instructions fairly well.
Given my experience optimizing SPU code, the 40:1 to 100:1 improvements
from data restructuring and selective SIMD conversions are worth doing,
while the per-cycle stall management isn't - there might be another
factor of 2, or there might not - it's a difficult space for a small
reward.

> I have to
> wonder whether it would be less painful to learn the hardware
> and write the optimized code in assembly language or to learn
> the hardware *and* learn how to cajole a complicated compiler
> into emitting the assembly language you know it should be
> emitting.

Doing the streaming/caching/DMA code in assembly is a non-starter.
It's just that increment of too complicated.  And fortunately, IBM went
and defined the C language extensions as part of the SPU architecture,
which means it's not too hard to learn to use.  The restrict keyword
does gall me, though.

> With respect to kencc, I wonder how far you could get if
> each Cell vector instruction were a C-callable .s function
> of a few instructions and the SPU linker routinely inlined
> all small-instruction-count functions and had an optimizer
> explicitly designed for the SPU.

I think this could work quite well; I'm not sure how this interferes
with register allocation, though.  I'll give it some thought.  The
harder part will be the data movement operations from my first
paragraph.

Paul

> Dave Eckhardt