9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] xcpu note
@ 2005-10-17 16:43 Ronald G Minnich
  2005-10-17 21:41 ` David Leimbach
  2005-10-18  3:04 ` Kenji Okamoto
  0 siblings, 2 replies; 16+ messages in thread
From: Ronald G Minnich @ 2005-10-17 16:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

oh, yeah, you're going to see a lot of debugging crap from xcpusrv.

this is called: "A guy who's done select()-based threading for xx years 
tries to learn Plan 9 threads, and fails a lot, but is slowly getting 
it, sometimes"

sorry for any convenience (sic).

also, on Plan 9 ports,  you are going to need a linux kernel, e.g. 
2.6.14-rc2, to make it go, or use Newsham's python client code.

ron


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [9fans] xcpu note
  2005-10-17 16:43 [9fans] xcpu note Ronald G Minnich
@ 2005-10-17 21:41 ` David Leimbach
  2005-10-18  2:38   ` Ronald G Minnich
  2005-10-18  3:04 ` Kenji Okamoto
  1 sibling, 1 reply; 16+ messages in thread
From: David Leimbach @ 2005-10-17 21:41 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Congrats on another fine Linux Journal article, Ron.  I just got this
in the mail yesterday and read it today.

Clustermatic is pretty cool; I think it's what was installed on one of
the other clusters I used at LANL as a contractor at the time.  I
recall a companion tool for bproc to request nodes, sort of an ad-hoc
scheduler.  I had to integrate support for it into the startup of the
MPI I was testing on that machine.

I'm curious to see how this all fits together with xcpu, whether a
similar resource-allocation setup is needed, etc.

Dave

On 10/17/05, Ronald G Minnich <rminnich@lanl.gov> wrote:
> oh, yeah, you're going to see a lot of debugging crap from xcpusrv.
>
> this is called: "A guy who's done select()-based threading for xx years
> tries to learn Plan 9 threads, and fails a lot, but is slowly getting
> it, sometimes"
>
> sorry for any convenience (sic).
>
> also, on Plan 9 ports,  you are going to need a linux kernel, e.g.
> 2.6.14-rc2, to make it go, or use Newsham's python client code.
>
> ron
>



* Re: [9fans] xcpu note
  2005-10-17 21:41 ` David Leimbach
@ 2005-10-18  2:38   ` Ronald G Minnich
  2005-10-18  4:44     ` Scott Schwartz
                       ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Ronald G Minnich @ 2005-10-18  2:38 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

David Leimbach wrote:

> Clustermatic is pretty cool, I think it's what was installed on one of
> the other clusters I used at LANL as a contractor at the time.  I
> recall a companion tool for bproc to request nodes, sort of an ad-hoc
> scheduler.  I had to integrate support for this in our MPI's start up
> that I was testing on that machine.

the simple scheduler, bjs, was written by erik hendriks (now at Google, 
sigh) and was rock-solid. It ran on one cluster, unattended, scheduling 
128 2-cpu nodes with a very diverse job mix, for one year. It was a 
great piece of software. It was far faster, and far more reliable, than 
any scheduler we have ever seen, then or now. In one test, we ran about 
20,000 jobs through it in about an hour, on a 1024-node cluster, just to 
test. Note that it could probably have scheduled a lot more jobs, but 
the run-time of each job was non-zero. No other scheduler we have used 
comes close to this kind of performance. Scheduler overhead was 
basically insignificant.
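
A back-of-the-envelope check of those figures (taking "about an hour" as
3600 seconds; the per-node split is just an illustrative average, not a
measured quantity):

```python
# Rough throughput arithmetic for the bjs test run described above.
jobs = 20_000          # jobs pushed through bjs in the test
seconds = 3_600        # "about an hour"
nodes = 1_024          # cluster size in the test

rate = jobs / seconds      # scheduler throughput, jobs/second
per_node = jobs / nodes    # average jobs landing on each node

print(f"{rate:.1f} jobs/s overall, ~{per_node:.1f} jobs per node")
# prints: 5.6 jobs/s overall, ~19.5 jobs per node
```

So "scheduler overhead was basically insignificant" amounts to sustaining
better than five job placements per second across a thousand nodes.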

> 
> I'm curious to see how this all fits together with xcpu, if there is
> such a resource allocation setup needed etc.

we're going to take bjs and have it schedule nodes to give to users.

Note one thing we are going to do with xcpu: attach nodes to a user's 
desktop machine, rather than make users log in to the cluster. So users 
will get interactive clusters that look like they own them. This will, 
we hope, kill batch mode. Plan 9 ideas make this possible. It's going to 
be a big change, one we hope users will like.

If you look at how most clusters are used today, they closely resemble 
the batch world of the 1960s. It is actually kind of shocking. I 
downloaded a JCL manual a year or two ago, and compared what JCL did to 
what people wanted batch schedulers for clusters to do, and the 
correspondence was a little depressing. The Data General ad said it 
best: "Batch is a bitch".

Oh yeah, if anyone has a copy of that ad (Google does not), i'd like it 
in .pdf :-) It appeared in the late 70s IIRC.

ron
p.s. go ahead, google JCL, and you can find very recent manuals on how 
to use it. I will be happy to post the JCL for "sort + copy" if anyone 
wants to see it.



* Re: [9fans] xcpu note
  2005-10-17 16:43 [9fans] xcpu note Ronald G Minnich
  2005-10-17 21:41 ` David Leimbach
@ 2005-10-18  3:04 ` Kenji Okamoto
  2005-10-18  3:06   ` Ronald G Minnich
  2005-10-18  3:23   ` Eric Van Hensbergen
  1 sibling, 2 replies; 16+ messages in thread
From: Kenji Okamoto @ 2005-10-18  3:04 UTC (permalink / raw)
  To: 9fans

> also, on Plan 9 ports,  you are going to need a linux kernel, e.g. 
> 2.6.14-rc2, to make it go, or use Newsham's python client code.

The latest release candidate, which I checked just before writing this, is still 2.6.14-rc4.☺
Do you have any info on when it'll become a stable release?

Kenji




* Re: [9fans] xcpu note
  2005-10-18  3:04 ` Kenji Okamoto
@ 2005-10-18  3:06   ` Ronald G Minnich
  2005-10-18  3:23   ` Eric Van Hensbergen
  1 sibling, 0 replies; 16+ messages in thread
From: Ronald G Minnich @ 2005-10-18  3:06 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Kenji Okamoto wrote:
>>also, on Plan 9 ports,  you are going to need a linux kernel, e.g. 
>>2.6.14-rc2, to make it go, or use Newsham's python client code.
> 
> 
> The latest candidate of this, I checked just before, is still 2.6.14-rc4.☺
> Do you have any info when it'll become stable release?
> 
> Kenji
> 

usual rule with this most recent series is "pretty damn soon". 2.6.13 
stabilized quite fast. I am guessing we're close, not that I know any 
more than you do :-)

ron



* Re: [9fans] xcpu note
  2005-10-18  3:04 ` Kenji Okamoto
  2005-10-18  3:06   ` Ronald G Minnich
@ 2005-10-18  3:23   ` Eric Van Hensbergen
  1 sibling, 0 replies; 16+ messages in thread
From: Eric Van Hensbergen @ 2005-10-18  3:23 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs; +Cc: V9FS Developers

Should be any day now; there weren't that many patches in rc4.
Of course, lucho has a massive overhaul of the mux code waiting in the
wings and I have my fid-tracking rework, so stuff won't be stable for
long ;)  We'll keep that code in our development trees
until it has "cooked" a little, but lucho's code looks to fix a lot of
long-standing problems, and hopefully my new fid stuff will make Plan 9
things (p9p) and Ron's new synthetics work better.

    -eric


On 10/17/05, Kenji Okamoto <okamoto@granite.cias.osakafu-u.ac.jp> wrote:
> > also, on Plan 9 ports,  you are going to need a linux kernel, e.g.
> > 2.6.14-rc2, to make it go, or use Newsham's python client code.
>
> The latest candidate of this, I checked just before, is still 2.6.14-rc4.☺
> Do you have any info when it'll become stable release?
>
> Kenji
>
>


* Re: [9fans] xcpu note
  2005-10-18  2:38   ` Ronald G Minnich
@ 2005-10-18  4:44     ` Scott Schwartz
  2005-10-18  4:45       ` Ronald G Minnich
  2005-10-18  4:57       ` andrey mirtchovski
  2005-10-18 10:25     ` leimy2k
                       ` (3 subsequent siblings)
  4 siblings, 2 replies; 16+ messages in thread
From: Scott Schwartz @ 2005-10-18  4:44 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

| No other scheduler we have used 
| comes close to this kind of performance. Scheduler overhead was 
| basically insignificant.
 
Probably apples and oranges, but Jim Kent wrote a job scheduler for his
kilocluster that nicely handled about 1M jobs in six hours.  It's the
standard thing for whole genome sequence alignments at ucsc.

| If you look at how most clusters are used today, they closely resemble 
| the batch world of the 1960s. It is actually kind of shocking. 

On the other hand, sometimes that's just what you really want.




* Re: [9fans] xcpu note
  2005-10-18  4:44     ` Scott Schwartz
@ 2005-10-18  4:45       ` Ronald G Minnich
  2005-10-18  7:35         ` Scott Schwartz
  2005-10-18  4:57       ` andrey mirtchovski
  1 sibling, 1 reply; 16+ messages in thread
From: Ronald G Minnich @ 2005-10-18  4:45 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Scott Schwartz wrote:

> Probably apples and oranges, but Jim Kent wrote a job scheduler for his
> kilocluster that nicely handled about 1M jobs in six hours.  It's the
> standard thing for whole genome sequence alignments at ucsc.

I think that's neat, I would like to learn more. Was this scheduler for 
an arbitrary job mix, or specialized to that app?

> 
> | If you look at how most clusters are used today, they closely resemble 
> | the batch world of the 1960s. It is actually kind of shocking. 
> 
> On the other hand, sometimes that's just what you really want.
> 

true. Sometimes it is. I've found, more often, that it's what people 
will accept, but not what they want.

ron



* Re: [9fans] xcpu note
  2005-10-18  4:57       ` andrey mirtchovski
@ 2005-10-18  4:57         ` Ronald G Minnich
  2005-10-19 18:21           ` rog
  0 siblings, 1 reply; 16+ messages in thread
From: Ronald G Minnich @ 2005-10-18  4:57 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

andrey mirtchovski wrote:
>>| No other scheduler we have used 
>>| comes close to this kind of performance. Scheduler overhead was 
>>| basically insignificant.
>> 
>>Probably apples and oranges, but Jim Kent wrote a job scheduler for his
>>kilocluster that nicely handled about 1M jobs in six hours.  It's the
>>standard thing for whole genome sequence alignments at ucsc.
> 
> 
> the vitanuova guys probably have better numbers, but when we ran their
> grid code at ucalgary it executed over a million jobs in a 24-hour
> period.  the jobs were non-null (md5sum using inferno's dis code).  it
> ran on a 12 (or so) -node cluster :)
> 

man, all these schedulers that work MUCH better than the stuff we pay 
money for ... ah well. It shows how limited my experience is ... I'm 
used to schedulers that take 5-25 seconds to schedule jobs on 1000 or so 
nodes.

Oh, wait, 12 nodes. Hmm. That's cheating!

ron



* Re: [9fans] xcpu note
  2005-10-18  4:44     ` Scott Schwartz
  2005-10-18  4:45       ` Ronald G Minnich
@ 2005-10-18  4:57       ` andrey mirtchovski
  2005-10-18  4:57         ` Ronald G Minnich
  1 sibling, 1 reply; 16+ messages in thread
From: andrey mirtchovski @ 2005-10-18  4:57 UTC (permalink / raw)
  To: 9fans

> | No other scheduler we have used 
> | comes close to this kind of performance. Scheduler overhead was 
> | basically insignificant.
>  
> Probably apples and oranges, but Jim Kent wrote a job scheduler for his
> kilocluster that nicely handled about 1M jobs in six hours.  It's the
> standard thing for whole genome sequence alignments at ucsc.

the vitanuova guys probably have better numbers, but when we ran their
grid code at ucalgary it executed over a million jobs in a 24-hour
period.  the jobs were non-null (md5sum using inferno's dis code).  it
ran on a 12 (or so) -node cluster :)




* Re: [9fans] xcpu note
  2005-10-18  4:45       ` Ronald G Minnich
@ 2005-10-18  7:35         ` Scott Schwartz
  0 siblings, 0 replies; 16+ messages in thread
From: Scott Schwartz @ 2005-10-18  7:35 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

| > Probably apples and oranges, but Jim Kent wrote a job scheduler for his
| > kilocluster that nicely handled about 1M jobs in six hours.  It's the
| > standard thing for whole genome sequence alignments at ucsc.
| 
| I think that's neat, I would like to learn more. Was this scheduler for 
| an arbitrary job mix, or specialized to that app?
 
Well, it was designed to do what we needed and no more, but it's still
pretty general.  The input is a file of commands, and it runs them all
until they are all done (with a way to retry the ones that failed).

http://www.cse.ucsc.edu/~kent/
http://www.soe.ucsc.edu/~donnak/eng/parasol.htm
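
The model Scott describes can be sketched in a few lines. This is a toy,
single-machine illustration of the idea (read a file of commands, run them
all to completion, retry failures), not Parasol's actual code, which farms
the commands out to cluster nodes:

```python
import subprocess

def run_batch(command_file, max_retries=3):
    """Run every shell command listed in command_file (one per line),
    retrying failures up to max_retries times.  Returns a dict mapping
    each command to "done" or "failed"."""
    with open(command_file) as f:
        pending = [ln.strip() for ln in f if ln.strip()]
    status = {}
    for _ in range(max_retries):
        still_failing = []
        for cmd in pending:
            # A real scheduler would dispatch this to a free node;
            # here we just run it locally.
            if subprocess.run(cmd, shell=True).returncode == 0:
                status[cmd] = "done"
            else:
                still_failing.append(cmd)
        pending = still_failing   # retry only the ones that failed
        if not pending:
            break
    status.update({cmd: "failed" for cmd in pending})
    return status
```

The retry loop is the whole trick: commands that exit nonzero go back in
the queue, so transient node failures don't require restarting the batch.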



* Re: [9fans] xcpu note
  2005-10-18  2:38   ` Ronald G Minnich
  2005-10-18  4:44     ` Scott Schwartz
@ 2005-10-18 10:25     ` leimy2k
  2005-10-18 10:25     ` leimy2k
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: leimy2k @ 2005-10-18 10:25 UTC (permalink / raw)
  To: rminnich, 9fans

> David Leimbach wrote:
> 
>> Clustermatic is pretty cool, I think it's what was installed on one of
>> the other clusters I used at LANL as a contractor at the time.  I
>> recall a companion tool for bproc to request nodes, sort of an ad-hoc
>> scheduler.  I had to integrate support for this in our MPI's start up
>> that I was testing on that machine.
> 
> the simple scheduler, bjs, was written by erik hendriks (now at Google, 
> sigh) and was rock-solid. It ran on one cluster, unattended, scheduling 
> 128 2-cpu nodes with a very diverse job mix, for one year. It was a 
> great piece of software. It was far faster, and far more reliable, than 
> any scheduler we have ever seen, then or now. In one test, we ran about 
> 20,000 jobs through it on about an hour, on a 1024-node cluster, just to 
> test. Note that it could probably have scheduled a lot more jobs, but 
> the run-time of the job was non-zero. No other scheduler we have used 
> comes close to this kind of performance. Scheduler overhead was 
> basically insignificant.
> 

Yeah, when I came to the lab last it was a "surprise" to find out that I 
had to support not only bproc but also bjs.  Luckily it took about 10
minutes to figure it out and add support to our "mpirun" startup script.

It was pretty neat.

>> 
>> I'm curious to see how this all fits together with xcpu, if there is
>> such a resource allocation setup needed etc.
> 
> we're going to take bjs and have it schedule nodes to give to users.
> 
> Note one thing we are going to do with xcpu: attach nodes to a user's 
> desktop machine, rather than make users log in to the cluster. So users 
> will get interactive clusters that look like they own them. This will, 
> we hope, kill batch mode. Plan 9 ideas make this possible. It's going to 
> be a big change, one we hope users will like.

Hmm, planning to create a multi-hosted xcpu resource all bound to the 
user's namespace?  Or one host per set of files?  Is there an easy way
to launch multiple jobs in one shot, à la MPI startup?

> 
> If you look at how most clusters are used today, they closely resemble 
> the batch world of the 1960s. It is actually kind of shocking. I 
> downloaded a JCL manual a year or two ago, and compared what JCL did to 
> what people wanted batch schedulers for clusters to do, and the 
> correspondance was a little depressing. The Data General ad said it 
> best: "Batch is a bitch".

Yeah, I've been comparing them to punch card systems for a while now. 
Some are even almost the same size as those old machines now that we've 
stacked them up.

MPI jobs have turned modern machines into huge monoliths that basically 
throw out the advantages of a multi-user system.  In fact, having worked
with CPlant for a while with Ron Brightwell over at SNL, I saw a design
optimized for one process per machine: one CPU [no SMP hardware contention],
Myrinet with Portals for RDMA and OS-bypass reasons [low overheads], 
no threads [though I was somewhat taunted with them at one point], and the
Yod and Yod2 schedulers for job startup.

It was unique, very interesting to work on, and not a lot of fun to
debug running code on. :)

The closest thing I've seen to this kind of design in production has to be 
Blue Gene [which is a much different architecture of course but similar in 
that it is very custom designed for a few purposes].


> 
> Oh yeah, if anyone has a copy of that ad (Google does not), i'd like it 
> in .pdf :-) It appeared in the late 70s IIRC.
> 
> ron
> p.s. go ahead, google JCL, and you can find very recent manuals on how 
> to use it. I will be happy to post the JCL for "sort + copy" if anyone 
> wants to see it.

Please god no!!! :)

Dave





* Re: [9fans] xcpu note
  2005-10-18  2:38   ` Ronald G Minnich
                       ` (3 preceding siblings ...)
  2005-10-18 10:25     ` leimy2k
@ 2005-10-18 12:10     ` Brantley Coile
  4 siblings, 0 replies; 16+ messages in thread
From: Brantley Coile @ 2005-10-18 12:10 UTC (permalink / raw)
  To: 9fans

> p.s. go ahead, google JCL, and you can find very recent manuals on how 
> to use it. I will be happy to post the JCL for "sort + copy" if anyone 
> wants to see it.

no need.  i remember it.  i still have my Brown book.  (it's blue, actually.)

 brantley




* Re: [9fans] xcpu note
  2005-10-18  4:57         ` Ronald G Minnich
@ 2005-10-19 18:21           ` rog
  0 siblings, 0 replies; 16+ messages in thread
From: rog @ 2005-10-19 18:21 UTC (permalink / raw)
  To: 9fans

> Oh, wait, 12 nodes. Hmm. That's cheating!

unfortunately, we haven't been able to run the inferno grid stuff on
any more than about 300 nodes.  it works fairly quickly on that
number, but task takeup slows down considerably when it's pumping out
a lot of data (this is better now that nodes cache data).

things are slowed down quite a bit by logging constraints (it stores
a lot of ongoing data on disk, both to reduce memory consumption and so
that if the server crashes or is turned off, things can resume with
virtually nothing lost).  running on top of a ram disk can speed
things up by at least an order of magnitude. this probably makes
sense for short-lived jobs.

i'd love to try it out on a larger cluster (one could use an existing
scheduler to leverage the initial installation).





Thread overview: 16+ messages
2005-10-17 16:43 [9fans] xcpu note Ronald G Minnich
2005-10-17 21:41 ` David Leimbach
2005-10-18  2:38   ` Ronald G Minnich
2005-10-18  4:44     ` Scott Schwartz
2005-10-18  4:45       ` Ronald G Minnich
2005-10-18  7:35         ` Scott Schwartz
2005-10-18  4:57       ` andrey mirtchovski
2005-10-18  4:57         ` Ronald G Minnich
2005-10-19 18:21           ` rog
2005-10-18 10:25     ` leimy2k
2005-10-18 12:10     ` Brantley Coile
2005-10-18  3:04 ` Kenji Okamoto
2005-10-18  3:06   ` Ronald G Minnich
2005-10-18  3:23   ` Eric Van Hensbergen
