Message-ID: <4487B351.8000808@lanl.gov>
Date: Wed, 7 Jun 2006 23:19:13 -0600
From: Ronald G Minnich
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] gcc on plan9
References: <44879B05.6050706@lanl.gov> <44879EC5.3050105@lanl.gov> <4487A269.40906@sun.com>
In-Reply-To: <4487A269.40906@sun.com>

Roman Shaposhnik wrote:
> One question that I still have, though, is what makes you think that
> once you're done with porting gcc (big task) and porting HPC apps to
> gcc/Plan9 (even bigger one!) they will *execute* faster than they do
> on Linux?

Excellent question. It's all about parallel performance: making sure your
1000 nodes run 1000 times as fast as 1 node, or, if they don't, that it's
Somebody Else's Problem.

The reason the OS can impact parallel performance boils down to this: OSes
run tasks at awkward times, and those tasks interfere with parallel
applications and degrade performance. (For another approach, see Cray's
synchronized scheduler work: make all nodes schedule the app at the same
time.)

Imagine you have one of these lovely apps on a 1000-node cluster with a
5-microsecond-latency network. Let us further imagine (this stuff exists;
see Quadrics) that you can do a broadcast/global-sum op in 5 microseconds.
After 1 millisecond of computation, all the nodes need to talk to each
other, and cannot proceed until they have all agreed on (say) the value of
a computed number -- e.g. some sort of global sum of a variable held by
each of the 1000 procs. The generic term for this kind of operation is
'global reduction': you reduce a vector to a scalar of some sort.

The math is pretty easy to do, but it boils down to this: OS activity can
interfere with, say, just one task, and kill the parallel performance of
the app, making your 1000-node app run like a 750-node app -- or worse.
Pretend you're delayed one microsecond; do the math; it's depressing. A
one-millisecond compute interval is a really extreme case, chosen for ease
of illustration, but ...
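To make that arithmetic concrete, here is a rough back-of-the-envelope
sketch in C. The 1000 nodes, the 5-microsecond reduction, and the
1-millisecond interval are the numbers above; the shorter compute intervals
and the delay lengths are just illustrative assumptions, not measurements.

    /* Model one step of a bulk-synchronous app: every node computes, then
     * all nodes do a global reduction, and nobody proceeds until the
     * slowest node arrives.  So a delay on ONE node is paid by all 1000.
     * Only the node count, reduction time, and 1 ms interval come from
     * the discussion above; the other values are made up for illustration.
     */
    #include <stdio.h>

    int main(void)
    {
        const double nodes     = 1000.0;
        const double reduce_us = 5.0;                         /* global sum */
        const double compute_us[] = { 1000.0, 100.0, 10.0 };  /* work per step */
        const double delay_us[]   = { 1.0, 50.0, 1000.0 };    /* noise on one node */

        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                double ideal  = compute_us[i] + reduce_us;
                double actual = compute_us[i] + delay_us[j] + reduce_us;
                printf("compute %6.0f us, delay %6.0f us -> runs like %4.0f nodes\n",
                       compute_us[i], delay_us[j], nodes * ideal / actual);
            }
        }
        return 0;
    }

The depressing rows are the ones where the compute interval shrinks toward
the reduction time: there, even a one-microsecond delay on a single node
starts to cost you a noticeable number of nodes, and bigger interruptions
cost you most of the machine.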
In the clustering world, what a lot of people do is run real heavy nodes
in clusters -- they have stuff like cron running, if you can believe it!
They pretty much do a full desktop install, then turn off a few daemons,
and away they go. Some really famous companies actually run clusters this
way -- you'd be surprised at who. So do some famous gov't labs. If they're
lucky, interference never hits them. If they're not, they get
less-than-ideal app performance.

Then, they draw a conjecture from the OS interference that comes with such
a bad configuration: you can't run a cluster node with anything but a
custom OS which has no clock interrupts and, for that matter, no ability
to run more than one process at a time. See the compute node kernel on
BG/L for one example, or the Catamount kernel on Red Storm. Those kernels
are really constrained; just running one proc at a time is only part of
the story.

Here at LANL, we run pretty light cluster nodes. Here is a cluster node
running xcpu (under busybox, as you can see):

    1 ?        S      0:00 /bin/ash /linuxrc
    2 ?        S      0:00 [migration/0]
    3 ?        SN     0:00 [ksoftirqd/0]
    4 ?        S      0:00 [watchdog/0]
    5 ?        S      0:00 [migration/1]
    6 ?        SN     0:00 [ksoftirqd/1]
    7 ?        S      0:00 [watchdog/1]
    8 ?        S      0:00 [migration/2]
    9 ?        SN     0:00 [ksoftirqd/2]
   10 ?        S      0:00 [watchdog/2]
   11 ?        S      0:00 [migration/3]
   12 ?        SN     0:00 [ksoftirqd/3]
   13 ?        S      0:00 [watchdog/3]
   14 ?        S<     0:00 [events/0]
   15 ?        S<     0:00 [events/1]
   16 ?        S<     0:00 [events/2]
   17 ?        S<     0:00 [events/3]
   18 ?        S<     0:00 [khelper]
   19 ?        S<     0:00 [kthread]
   26 ?        S<     0:00 [kblockd/0]
   27 ?        S<     0:00 [kblockd/1]
   28 ?        S<     0:00 [kblockd/2]
   29 ?        S<     0:00 [kblockd/3]
  105 ?        S      0:00 [pdflush]
  106 ?        S      0:00 [pdflush]
  107 ?        S      0:00 [kswapd1]
  109 ?        S<     0:00 [aio/0]
  108 ?        S      0:00 [kswapd0]
  110 ?        S<     0:00 [aio/1]
  111 ?        S<     0:00 [aio/2]
  112 ?        S<     0:00 [aio/3]
  697 ?        S<     0:00 [kseriod]
  855 ?        S      0:00 xsrv -D 0 tcp!*!20001
  857 ?        S      0:00 9pserve -u tcp!*!20001
  864 ?        S      0:00 u9fs -a none -u root -m 65560 -p 564
  865 ?        S      0:00 /bin/ash

See how little we have running? Oh, but wait, what's all that stuff in []?
It's the stuff we can't turn off. Note there is per-CPU stuff, and other
junk. Note that this node has been up for five hours, and this stuff is
pretty quiet (0 run time); our nodes are the quietest (in the OS
interference sense) Linux nodes I have yet seen. But, that said, all of
this can hit you. And, in Linux, there's a lot of stuff people are finding
you can't turn off. Lots of timers down there, lots of magic that goes on,
and you just can't turn it off, or adjust it, try as you might.

Plan 9, our conjecture goes, is a small, tight kernel with lots of
functionality moved to user mode (e.g. file systems); and we believe that
the Plan 9 architecture is a good match to future HPC (High Performance
Computing) systems, as typified by Red Storm and BG/L: small,
fixed-configuration nodes with memory, network, CPU, and nothing else. The
ability to not even have a file system on the node is a big plus. The
ability to have the file system transparently remote or local puts the
application in the driver's seat as to how the node is configured and what
tradeoffs are made; the system as a whole is incredibly flexible.

Our measurements, so far, do show that Plan 9 is "quieter" than Linux. A
full Plan 9 desktop has less OS noise than a Linux box at the login
prompt. This matters. But it only matters if people can run their apps.
Hence our concern about getting gcc-based cra-- er, application code
running.

I'm not really trying to make Plan 9 look like Linux. I just want to run
MPQC for a friend of mine :-)

thanks

ron