From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sun, 2 Mar 2008 19:31:16 -0800
From: "Roman V. Shaposhnik"
Subject: Re: [9fans] GCC/G++: some stress testing
In-reply-to: 
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Message-id: <1204515076.27006.173.camel@goose.sun.com>
MIME-version: 1.0
Content-type: text/plain
Content-transfer-encoding: 7BIT
References: 
Topicbox-Message-UUID: 6d23796a-ead3-11e9-9d60-3106f5b1d025

On Sun, 2008-03-02 at 12:34 -0800, Paul Lalonde wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> CSP doesn't scale very well to hundreds of simultaneously executing
> threads (my claim, not, as far as I've found yet, anyone else's).

I take it that you really do mean "simultaneous", as in: you actually
have hundreds of cores available on the same system. I'm actually quite
curious: what is the highest number of cores on a single shared-memory
system that you, or anybody else, has seen in practice so far (mine
would be 32 == 8 CPUs x 4 cores AMD Barcelona)? Now, an even more
important question is -- what do we expect this number to be 5 years
from now?

> Programming the memory hierarchy is really a specific instance of
> programming for masking latency. This goes well beyond inserting
> prefetches in an optimization stage,

In fact, the more I think about it, the more it seems like having a
direct way of manipulating the L1/L2 caches would be more of a benefit
than a curse at this point. Prefetches are nothing but a patchwork over
the fundamental need to program the memory hierarchy efficiently. But,
I guess, there's more to it on the hardware side than just crafting
additional opcodes.
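Just to make the "patchwork" point concrete, here is roughly what we
are reduced to today with gcc/g++ -- a toy sketch only; the prefetch
distance and the hint arguments are made-up numbers that would have to
be re-tuned for every machine:

#include <cstddef>

// Sum a large array, manually prefetching some distance ahead.
// DIST is a guess; there is no portable way to derive it from the
// actual cache sizes, which is exactly the problem.
double sum(const double *a, std::size_t n)
{
    const std::size_t DIST = 64;  // elements ahead -- pure guesswork
    double s = 0.0;
    for (std::size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 1);  // read, low locality
        s += a[i];
    }
    return s;
}

Compare that to being able to say "this block of memory stays in L2
until I'm done with it" -- which is what I mean by manipulating the
caches directly.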
> presenting itself as problem
> decompositions that keep the current working set in cache (at L3, L2,
> or L1 granularity, depending), while simultaneously avoiding having
> multiple processors chewing on the same data (which leads to vast
> amounts of cache synchronization bus traffic). Successful algorithms
> in this space work on small bundles of data that either get flushed
> back to memory uncached (to keep more cache for streaming in), or in
> small bundles that can be passed from compute kernel to compute
> kernel cheaply. Having language structures to help with these
> decompositions and caching decisions is a great help - that's one of
> the reasons why functional programming keeps rearing its head in this
> space. Without aliasing and global (serializing) state it's much
> easier to analyze the program and choose how to break up the
> computation into kernels that can be streamed, pipelined, or
> otherwise separated to allow better cache utilization and parallelism.

Very well said, indeed! I couldn't agree more, especially on the issue
of functional programming being very well suited for parallelism. In
fact, here at Sun we've had a curious little experiment run by Tim
Bray: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder
with a very unexpected result as to what the ideal language would be
for what seems to be an embarrassingly parallel problem.

> Currently, the best performing programs I know for exploiting the
> memory hierarchy are C and C++ programs written in a "kernel
> composition" kind of model that the language supports poorly. You
> can do it, but it feels more like coding in assembly than like
> expressing your algorithms. Much of the template metaprogramming is
> about taking measures of cache spaces and data sets and turning out
> code (statically) tuned to those sizes. There's a huge opportunity
> for a JIT language to allow better adaptability to changing data sets
> and numbers of active threads.

These techniques help, no doubt about it (I've appended a toy sketch of
the kind of statically tuned kernel you are describing at the end of
this message). Still, I don't think we have much of a choice when the
# of compute units hits hundreds of thousands -- at that level, shared
state and the side effects propagated through it will be just too much
to stomach. We clearly need a very different way of thinking about
programming, and a very different way of expressing our ideas. None of
the mainstream languages seems to fit the bill, and as far as I can
tell this makes it one of the hottest research areas in the industry.
Everybody wants to hit that jackpot by inventing C for the parallel
world.

Thanks,
Roman.

P.S. We had Don Knuth lecturing at Sun Labs a couple of days ago, and
he gave us all a real scare when he officially proclaimed that he could
NOT see how his way of reasoning about computation and algorithms could
be applied to parallelism. In fact, in his latest volume he has an
explicit disclaimer to that effect.
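P.P.S. Since I referred to it above: here is a toy sketch (mine, purely
illustrative, not Paul's code) of the "statically tuned kernel
composition" style -- the tile size is a compile-time guess at the L1
size, which is exactly the part that the template machinery ends up
hard-coding per machine:

#include <cstddef>
#include <algorithm>

// Run two small kernels over one cache-sized tile while it is hot,
// instead of streaming the whole array through memory once per kernel.
template <std::size_t TILE_BYTES>
void scale_then_clamp(float *a, std::size_t n, float k, float lo, float hi)
{
    const std::size_t tile = TILE_BYTES / sizeof(float);
    for (std::size_t base = 0; base < n; base += tile) {
        const std::size_t end = std::min(n, base + tile);
        for (std::size_t i = base; i < end; i++)      // kernel 1: scale
            a[i] *= k;
        for (std::size_t i = base; i < end; i++)      // kernel 2: clamp,
            a[i] = std::max(lo, std::min(hi, a[i]));  // tile still in L1
    }
}

// e.g. scale_then_clamp<32 * 1024>(data, n, 2.0f, 0.0f, 1.0f);

Swap a JIT in for the template parameter and you get the point about
adapting to the data set and the number of active threads at run time.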