From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sun, 2 Mar 2008 19:31:16 -0800
From: "Roman V. Shaposhnik"
Subject: Re: [9fans] GCC/G++: some stress testing
In-reply-to: 
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Message-id: <1204515076.27006.173.camel@goose.sun.com>
MIME-version: 1.0
Content-type: text/plain
Content-transfer-encoding: 7BIT
References: 
Topicbox-Message-UUID: 6d23796a-ead3-11e9-9d60-3106f5b1d025

On Sun, 2008-03-02 at 12:34 -0800, Paul Lalonde wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> CSP doesn't scale very well to hundreds of simultaneously executing
> threads (my claim, not, as far as I've found yet, anyone else's).

I take it that you really do mean "simultaneous", as in: you actually
have hundreds of cores available on the same system. I'm actually quite
curious: what is the highest number of cores on a single shared-memory
system that you, or anybody else, has seen in practice so far (mine
would be 32 == 8 CPUs x 4 cores AMD Barcelona)? Now, an even more
important question is -- what do we expect this number to be 5 years
from now?

> Programming the memory hierarchy is really a specific instance of
> programming for masking latency. This goes well beyond inserting
> prefetches in an optimization stage,

In fact, the more I think about it, the more it seems like having a
direct way of manipulating the L1/L2 caches would be more of a benefit
than a curse at this point. Prefetches are nothing but a patchwork over
the fundamental need to program the memory hierarchy efficiently. But,
I guess, there's more to it on the hardware side than just crafting
additional opcodes.
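Just to make the "patchwork" point concrete, here is roughly what we
are reduced to today with gcc/g++ -- a toy sketch only; the prefetch
distance and the hint arguments are made-up numbers that would have to
be re-tuned for every machine:

#include <cstddef>

// Sum a large array, manually prefetching some distance ahead.
// DIST is a guess; there is no portable way to derive it from the
// actual cache sizes, which is exactly the problem.
double sum(const double *a, std::size_t n)
{
    const std::size_t DIST = 64;  // elements ahead -- pure guesswork
    double s = 0.0;
    for (std::size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 1);  // read, low locality
        s += a[i];
    }
    return s;
}

Compare that to being able to say "this block of memory stays in L2
until I'm done with it" -- which is what I mean by manipulating the
caches directly.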
> presenting itself as problem
> decompositions that keep the current working set in cache (at L3, L2,
> or L1 granularity, depending), while simultaneously avoiding having
> multiple processors chewing on the same data (which leads to vast
> amounts of cache synchronization bus traffic). Successful algorithms
> in this space work on small bundles of data that either get flushed
> back to memory uncached (to keep more cache for streaming in), or in
> small bundles that can be passed from compute kernel to compute
> kernel cheaply. Having language structures to help with these
> decompositions and caching decisions is a great help - that's one of
> the reasons why functional programming keeps rearing its head in this
> space. Without aliasing and global (serializing) state it's much
> easier to analyze the program and choose how to break up the
> computation into kernels that can be streamed, pipelined, or
> otherwise separated to allow better cache utilization and parallelism.

Very well said, indeed! I couldn't agree more, especially on the issue
of functional programming being very well suited for parallelism. In
fact, here at Sun we've had a curious little experiment run by Tim
Bray: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder
with a very unexpected result as to what the ideal language would be
for what seems to be an embarrassingly parallel problem.

> Currently, the best performing programs I know for exploiting the
> memory hierarchy are C and C++ programs written in a "kernel
> composition" kind of model that the language supports poorly. You
> can do it, but it feels more like coding in assembly than like
> expressing your algorithms. Much of the template metaprogramming is
> about taking measures of cache spaces and data sets and turning out
> code (statically) tuned to those sizes. There's a huge opportunity
> for a JIT language to allow better adaptability to changing data sets
> and numbers of active threads.

These techniques help, no doubt about it (I've appended a toy sketch of
the kind of statically tuned kernel you are describing at the end of
this message). Still, I don't think we have much of a choice when the
# of compute units hits hundreds of thousands -- at that level, shared
state and the side effects propagated through it will be just too much
to stomach. We clearly need a very different way of thinking about
programming, and a very different way of expressing our ideas. None of
the mainstream languages seems to fit the bill, and as far as I can
tell this makes it one of the hottest research areas in the industry.
Everybody wants to hit that jackpot by inventing C for the parallel
world.

Thanks,
Roman.

P.S. We had Don Knuth lecturing at Sun Labs a couple of days ago, and
he gave us all a real scare when he officially proclaimed that he could
NOT see how his way of reasoning about computation and algorithms could
be applied to parallelism. In fact, in his latest volume he has an
explicit disclaimer to that effect.
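P.P.S. Since I referred to it above: here is a toy sketch (mine, purely
illustrative, not Paul's code) of the "statically tuned kernel
composition" style -- the tile size is a compile-time guess at the L1
size, which is exactly the part that the template machinery ends up
hard-coding per machine:

#include <cstddef>
#include <algorithm>

// Run two small kernels over one cache-sized tile while it is hot,
// instead of streaming the whole array through memory once per kernel.
template <std::size_t TILE_BYTES>
void scale_then_clamp(float *a, std::size_t n, float k, float lo, float hi)
{
    const std::size_t tile = TILE_BYTES / sizeof(float);
    for (std::size_t base = 0; base < n; base += tile) {
        const std::size_t end = std::min(n, base + tile);
        for (std::size_t i = base; i < end; i++)      // kernel 1: scale
            a[i] *= k;
        for (std::size_t i = base; i < end; i++)      // kernel 2: clamp,
            a[i] = std::max(lo, std::min(hi, a[i]));  // tile still in L1
    }
}

// e.g. scale_then_clamp<32 * 1024>(data, n, 2.0f, 0.0f, 1.0f);

Swap a JIT in for the template parameter and you get the point about
adapting to the data set and the number of active threads at run time.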