Message-ID: <703e7fa5147ef911e7206d8a877e8b8f@psychobunny.homework.net>
To: rminnich@lanl.gov, 9fans@cse.psu.edu
Subject: Re: [9fans] xcpu note
Date: Tue, 18 Oct 2005 03:25:36 -0700
From: leimy2k@speakeasy.net
In-Reply-To: <4354602C.7060102@lanl.gov>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

> David Leimbach wrote:
>
>> Clustermatic is pretty cool, I think it's what was installed on one of
>> the other clusters I used at LANL as a contractor at the time. I
>> recall a companion tool for bproc to request nodes, sort of an ad-hoc
>> scheduler. I had to integrate support for this into our MPI's startup
>> that I was testing on that machine.
>
> the simple scheduler, bjs, was written by erik hendriks (now at Google,
> sigh) and was rock-solid. It ran on one cluster, unattended, scheduling
> 128 2-cpu nodes with a very diverse job mix, for one year. It was a
> great piece of software. It was far faster, and far more reliable, than
> any scheduler we have ever seen, then or now. In one test, we ran about
> 20,000 jobs through it in about an hour, on a 1024-node cluster, just to
> test. Note that it could probably have scheduled a lot more jobs, but
> the run-time of the job was non-zero. No other scheduler we have used
> comes close to this kind of performance. Scheduler overhead was
> basically insignificant.
>

Yeah, when I was at the lab last it was a "surprise" to find out that I
not only had to support bproc but bjs as well. Luckily it took about 10
minutes to figure it out and add support to our "mpirun" startup script.
It was pretty neat.

>>
>> I'm curious to see how this all fits together with xcpu, if there is
>> such a resource allocation setup needed etc.
>
> we're going to take bjs and have it schedule nodes to give to users.
>
> Note one thing we are going to do with xcpu: attach nodes to a user's
> desktop machine, rather than make users log in to the cluster. So users
> will get interactive clusters that look like they own them. This will,
> we hope, kill batch mode. Plan 9 ideas make this possible. It's going to
> be a big change, one we hope users will like.

Hmm, are you planning to create a multi-hosted xcpu resource all bound to
the user's namespace, or one host per set of files? And is there an easy
way to launch multiple jobs in one shot that way, a la MPI startup?
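To be concrete about what I mean by "one shot", here's a throwaway sketch.
The mount point, the per-node directory layout, and the exec/argv/ctl file
names are all guesses on my part, not the real xcpu interface:

#!/usr/bin/env python
# Hypothetical only: assumes the scheduler has bound my nodes into my
# namespace as /mnt/xcpu/<node>/, and that each node directory exposes
# "exec", "argv", and "ctl" files. Those names are guesses; the real
# xcpu layout may look nothing like this.
import os
import shutil

XCPU_ROOT = "/mnt/xcpu"    # assumed aggregation point for my nodes
BINARY = "./a.out"         # program to push to every node
ARGV = "a.out -n 4\n"      # argument line, MPI-style

for node in sorted(os.listdir(XCPU_ROOT)):
    sess = os.path.join(XCPU_ROOT, node)
    # copy the executable to the node
    with open(BINARY, "rb") as src:
        with open(os.path.join(sess, "exec"), "wb") as dst:
            shutil.copyfileobj(src, dst)
    # set the argument vector
    with open(os.path.join(sess, "argv"), "w") as f:
        f.write(ARGV)
    # ask the node to start the process
    with open(os.path.join(sess, "ctl"), "w") as f:
        f.write("exec\n")

If bjs hands me N nodes and they all show up in one place like that, then
MPI-style startup is basically just a loop over a directory.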
>
> If you look at how most clusters are used today, they closely resemble
> the batch world of the 1960s. It is actually kind of shocking. I
> downloaded a JCL manual a year or two ago, and compared what JCL did to
> what people wanted batch schedulers for clusters to do, and the
> correspondence was a little depressing. The Data General ad said it
> best: "Batch is a bitch".

Yeah, I've been comparing them to punch card systems for a while now.
Some are even almost the same size as those old machines now that we've
stacked them up. MPI jobs have turned modern machines into huge monoliths
that basically throw out the advantages of a multi-user system.

In fact, having worked with CPlant for a while with Ron Brightwell over
at SNL, they had a design optimized for one process per machine: one CPU
[no SMP hardware contention], Myrinet with Portals for RDMA and OS-bypass
reasons [low overheads], no threads [though I was somewhat taunted with
them at one point], and the Yod and Yod2 schedulers for job startup. It
was unique and very interesting to work on, but not a lot of fun to debug
running code on. :) The closest thing I've seen to this kind of design in
production has to be Blue Gene [which is a much different architecture,
of course, but similar in that it is custom-designed for a few purposes].

>
> Oh yeah, if anyone has a copy of that ad (Google does not), i'd like it
> in .pdf :-) It appeared in the late 70s IIRC.
>
> ron
> p.s. go ahead, google JCL, and you can find very recent manuals on how
> to use it. I will be happy to post the JCL for "sort + copy" if anyone
> wants to see it.

Please god no!!! :)

Dave