Message-ID: <4487B351.8000808@lanl.gov>
Date: Wed, 7 Jun 2006 23:19:13 -0600
From: Ronald G Minnich
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] gcc on plan9
References: <44879B05.6050706@lanl.gov> <44879EC5.3050105@lanl.gov> <4487A269.40906@sun.com>
In-Reply-To: <4487A269.40906@sun.com>

Roman Shaposhnik wrote:
> One question that I still have, though, is what makes you think that
> once you're done with porting gcc (big task) and porting HPC apps to
> gcc/Plan9 (even bigger one!) they will *execute* faster than they do
> on Linux?

Excellent question. It's all about parallel performance: making sure your
1000 nodes run 1000 times as fast as 1 node, or, if they don't, that it's
Somebody Else's Problem.

The reason the OS can impact parallel performance boils down to this: OSes
run tasks at awkward times, and those tasks interfere with parallel
applications and degrade performance. (For another approach, see Cray's
synchronized scheduler work: make all nodes schedule the app at the same
time.)

Imagine you have one of these lovely apps on a 1000-node cluster with a
5-microsecond-latency network. Let us further imagine (this stuff exists;
see Quadrics) that you can do a broadcast/global-sum op in 5 microseconds.
After 1 millisecond of computation, all the nodes need to talk to each
other, and cannot proceed until they have all agreed on (say) the value of
a computed number -- e.g. some sort of global sum of a variable held by
each of the 1000 procs. The generic term for this kind of operation is
'global reduction': you reduce a vector to a scalar of some sort.

The math is pretty easy to do, but it boils down to this: OS activity can
interfere with, say, just one task, and kill the parallel performance of
the app, making your 1000-node app run like a 750-node app -- or worse.
Pretend you're delayed one microsecond; do the math; it's depressing. A
one-millisecond compute interval is a really extreme case, chosen for ease
of illustration, but ...
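To make that arithmetic concrete, here is a rough back-of-the-envelope
sketch in C. The 1000 nodes, the 5-microsecond reduction, and the
1-millisecond interval are the numbers above; the shorter compute intervals
and the delay lengths are just illustrative assumptions, not measurements.

    /* Model one step of a bulk-synchronous app: every node computes, then
     * all nodes do a global reduction, and nobody proceeds until the
     * slowest node arrives.  So a delay on ONE node is paid by all 1000.
     * Only the node count, reduction time, and 1 ms interval come from
     * the discussion above; the other values are made up for illustration.
     */
    #include <stdio.h>

    int main(void)
    {
        const double nodes     = 1000.0;
        const double reduce_us = 5.0;                         /* global sum */
        const double compute_us[] = { 1000.0, 100.0, 10.0 };  /* work per step */
        const double delay_us[]   = { 1.0, 50.0, 1000.0 };    /* noise on one node */

        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                double ideal  = compute_us[i] + reduce_us;
                double actual = compute_us[i] + delay_us[j] + reduce_us;
                printf("compute %6.0f us, delay %6.0f us -> runs like %4.0f nodes\n",
                       compute_us[i], delay_us[j], nodes * ideal / actual);
            }
        }
        return 0;
    }

The depressing rows are the ones where the compute interval shrinks toward
the reduction time: there, even a one-microsecond delay on a single node
starts to cost you a noticeable number of nodes, and bigger interruptions
cost you most of the machine.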
In the clustering world, what a lot of people do is run real heavy nodes
in clusters -- they have stuff like cron running, if you can believe it!
They pretty much do a full desktop install, then turn off a few daemons,
and away they go. Some really famous companies actually run clusters this
way -- you'd be surprised at who. So do some famous gov't labs. If they're
lucky, interference never hits them. If they're not, they get
less-than-ideal app performance.

Then, they draw a conjecture from the OS interference that comes with such
a bad configuration: you can't run a cluster node with anything but a
custom OS which has no clock interrupts and, for that matter, no ability
to run more than one process at a time. See the compute node kernel on
BG/L for one example, or the Catamount kernel on Red Storm. Those kernels
are really constrained; just running one proc at a time is only part of
the story.

Here at LANL, we run pretty light cluster nodes. Here is a cluster node
running xcpu (under busybox, as you can see):

    1 ?        S      0:00 /bin/ash /linuxrc
    2 ?        S      0:00 [migration/0]
    3 ?        SN     0:00 [ksoftirqd/0]
    4 ?        S      0:00 [watchdog/0]
    5 ?        S      0:00 [migration/1]
    6 ?        SN     0:00 [ksoftirqd/1]
    7 ?        S      0:00 [watchdog/1]
    8 ?        S      0:00 [migration/2]
    9 ?        SN     0:00 [ksoftirqd/2]
   10 ?        S      0:00 [watchdog/2]
   11 ?        S      0:00 [migration/3]
   12 ?        SN     0:00 [ksoftirqd/3]
   13 ?        S      0:00 [watchdog/3]
   14 ?        S<     0:00 [events/0]
   15 ?        S<     0:00 [events/1]
   16 ?        S<     0:00 [events/2]
   17 ?        S<     0:00 [events/3]
   18 ?        S<     0:00 [khelper]
   19 ?        S<     0:00 [kthread]
   26 ?        S<     0:00 [kblockd/0]
   27 ?        S<     0:00 [kblockd/1]
   28 ?        S<     0:00 [kblockd/2]
   29 ?        S<     0:00 [kblockd/3]
  105 ?        S      0:00 [pdflush]
  106 ?        S      0:00 [pdflush]
  107 ?        S      0:00 [kswapd1]
  109 ?        S<     0:00 [aio/0]
  108 ?        S      0:00 [kswapd0]
  110 ?        S<     0:00 [aio/1]
  111 ?        S<     0:00 [aio/2]
  112 ?        S<     0:00 [aio/3]
  697 ?        S<     0:00 [kseriod]
  855 ?        S      0:00 xsrv -D 0 tcp!*!20001
  857 ?        S      0:00 9pserve -u tcp!*!20001
  864 ?        S      0:00 u9fs -a none -u root -m 65560 -p 564
  865 ?        S      0:00 /bin/ash

See how little we have running? Oh, but wait, what's all that stuff in []?
It's the stuff we can't turn off. Note there is per-CPU stuff, and other
junk. Note that this node has been up for five hours, and this stuff is
pretty quiet (0 run time); our nodes are the quietest (in the OS
interference sense) Linux nodes I have yet seen. But, that said, all of
this can hit you. And, in Linux, there's a lot of stuff people are finding
you can't turn off. Lots of timers down there, lots of magic that goes on,
and you just can't turn it off, or adjust it, try as you might.

Plan 9, our conjecture goes, is a small, tight kernel with lots of
functionality moved to user mode (e.g. file systems); and we believe that
the Plan 9 architecture is a good match to future HPC (High Performance
Computing) systems, as typified by Red Storm and BG/L: small,
fixed-configuration nodes with memory, network, CPU, and nothing else. The
ability to not even have a file system on the node is a big plus. The
ability to have the file system transparently remote or local puts the
application in the driver's seat as to how the node is configured and what
tradeoffs are made; the system as a whole is incredibly flexible.

Our measurements, so far, do show that Plan 9 is "quieter" than Linux. A
full Plan 9 desktop has less OS noise than a Linux box at the login
prompt. This matters. But it only matters if people can run their apps.
Hence our concern about getting gcc-based cra-- er, application code
running.

I'm not really trying to make Plan 9 look like Linux. I just want to run
MPQC for a friend of mine :-)

thanks

ron