Date: Tue, 30 Dec 1997 11:05:22 -0800
From: Eric Dorman <eld@jewel.ucsd.edu>
Subject: [9fans] Re: etherelnk3.c
Message-ID: <19971230190522.f_duBuPJ9BGAWGwlwzmUo4yKSCTdSXCSMjZWeZa0a6k@z>

> From gdb@dbSystems.com Mon Dec 29 17:36:38 1997
> Date: Mon, 29 Dec 1997 19:20:46 -0600
> From: eld@jewel.ucsd.edu (Eric Dorman)
>
> >Definitely worth a look; I've got a 9pcfs that allows the use
> >of IDE disks for filesystems
>
> Why?  I agree IDE disks are very attractive from a $$/MB point
> of view, but you can't get enough of them on a machine.  The
> overall $$/MB is less with SCSI since you can aggregate the cost
> of the CPU/RAM over more disks.  Also, IDE is PIO and that
> is a bad place to waste CPU resources.

Well, one doesn't *have* to use PIO mode on IDEs; one can use the
supposed DMA modes :)  Consider as well that it's unlikely that CPU
is at a premium in the fs; more likely it's bounded by network or
disk bandwidth (offhand).  Besides, CPU is cheap on PCs, and fs's
aren't doing anything else.

With 8-9G EIDEs readily available, 32-36G is a goodly-sized magnetic
level, and considerably cheaper than SCSI in that size range.  Around
here SCSIs run ~2x-2.5x per G for ultrawide, and 1.5-2x for fast.
For the scope I'm working in, 32G is fine even for primary storage.
Certainly the long-run cost amortization for SCSI is better, but
that's only relevant if it's going to be realized...

Of course the current support in the fs is for pretty slow SCSI
controllers, from what I've seen.  One could stick a PCI controller
that looks like an AHA1542 in there, but the driver wouldn't use
32-bit addressing, forcing the use of bounce buffers... hmm, didn't
some of the VLB cards have a larger address space?

> > and have found that the network
> >is far and away the limiting factor (10Mbit/sec 10BaseT on NE2000s)
>
> If you use two transmit buffers, sending one and filling the other,
> you will find much more "network" in your NE2000.  [I don't use
> the NE2000 anymore now that I have the 3c515 working!  Not for the
> 100bT, but for the 64k of RAM and busmaster transfers!]

Well, whatever NE2000 mode the cdrom fs uses is what I'm using; I
haven't dredged into it to see how it's running the wire.  I don't
recall offhand seeing 3C515 support in the code; is there a boddle
for it?

> >and 100BaseT is practically free these days; I'd love to use 100BaseT.
>
> Not when you look at the price/performance of 10bT full duplex
> switches and 100bT hubs.  100bT full duplex switches are nice, and
> expensive.

Sure, if you're building your own WAN from the ground up.  We have a
lot of switched 10bT infrastructure already in place and some
switched 100bT sections (fiber backbone), so it doesn't cost me
anything :) (at least at work).  100bT dumb hubs are plenty cheap
($50-70), fine for local runs and (hidden behind routers)
tightly-coupled fileservers and workstations.  [xx]

> > Might even be able to interleave across primary and
> >secondary IDEs (if the braindead chipsets will support it..).
>
> No need, filsys main [h0h1] will interleave for you.

If h0 and h1 are on the same IDE controller you won't get overlapping
operations (unlike SCSI), since a single IDE controller can't do more
than one thing at a time.  Without overlapping operations the
interleave [h0h1] isn't a big win over (h0h1) (though better use of
the on-disk caches is a help).  On the other hand, let's say you have
controllers h0 and h1, each with disks 0 and 1.  Then you can say
[h0.0h1.0] and get two operations running at the same time (one on
controller 0 and another on controller 1).  The question, however, is
whether the chipsets will do it.  I suspect the win of [h0.0h1.0]
over [h0h1] will be smaller than that of [h0h1] over (h0h1), because
of the on-disk caches...
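(To make the overlap point concrete, here's a toy sketch of how I
picture an interleave spreading blocks round-robin across its
members; the names are invented and this isn't the real fs code.
With [h0h1] both members hang off one controller, so the alternating
requests still get serviced one at a time; with [h0.0h1.0] each
member has its own controller, so two really can be in flight at
once.)

	/* toy model of an n-way interleave: fs block b goes to member
	 * b%n, at block b/n within that member, so consecutive blocks
	 * land on alternating devices.  Invented names, sketch only. */
	typedef unsigned long ulong;	/* as in the fs source */

	enum { Nmember = 2 };

	int
	ilvmember(ulong fsblock)
	{
		return fsblock % Nmember;	/* which member gets the block */
	}

	ulong
	ilvblock(ulong fsblock)
	{
		return fsblock / Nmember;	/* block offset on that member */
	}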
> >So far I've had to scramble around in the fs code changing 'long's
> >to 'ulong's in bytewise size computations and changing the type
> >of disk block addresses to ulongs; the matched-pair of 3.5G disks
> >breaks 'long's.  I'm worrying though that this may have caused
> >my block tag bug.
>
> Very possible.  I looked at doing a global long to ulong change
> but found some places that weren't easy to fix (I don't remember
> where now), so I left it alone.  I was thinking of reducing the
> block size and needed more blocks.  If you leave the block size
> at 4K, and handle the multiplier like devwren does, then you
> shouldn't need to make that change.

Well, I found out what the problem was; stupidity on my part =:)
I forgot that the fs is basically multithreaded and ended up reusing
a buffer at the driver level before the fs level got to it... ick.
Eliminated an intermediate copy too.  Now everything seems to work
(for h0): big files, root fs, etc., with ulong block addresses and
file sizes.  Several builds and boots so far and no trouble.  It's
off to h0.1 and h1.? land :)  I have to kick it in the butt with big
files before I'm happy with it, though.

> > I chose the former as an [h typecode in port/sub.c]
> >easier solution but it's, well, icky; changing stuff in
> >fs/port is evil since I'd have to stub out 'ideread/idewrite'
> >in all other architectures.
>
> Why?  Use different names.  [xx]

What do you mean by 'use different names'?  Seems to me that if
port/sub.c references ideread/idewrite for type Devide, you'll have
to have ideread/idewrite stubs in every set of architecture-specific
drivers (not just pc), since you have to satisfy that reference
somewhere.  Unless, of course, you compile port/sub.c (with #ifdefs
or something, more yuk) differently for, say, sparcs than for pcs.
Am I missing something?

> >Seems to me a better solution would be to have the hardware-specific
> >initialization stuff build a table describing the disks connected
> >to the box (complete with codeletters, size, traps into the
> >hardware driver, etc) and have fs/port/sub.c go indirect
> >through the table to the hardware.
>
> Maybe.  I like the way it is now because my mirror code likes
> to know that a disk is missing (for whatever reason) and then
> know that it is available again later (a reboot cleared the error,
> or the drive was replaced).  The config block is written to a
> mirror set with "config {w0.0w1.0}" and it tells you what the system
> is supposed to look like even if it doesn't look like that now.
> I have caused drive and controller failures and the system
> just takes the drive (or all the drives on a failed controller)
> off line and keeps running.  When the system is fixed and rebooted,
> it finds the mirrors needing recovery and does it.  Even if it
> is booted with the drives sick, it does the right thing.

Yeah; if OTOH the table indexed controllers (which are the real
problem part anyway) rather than the disks themselves, the only time
the fs would have trouble is if it booted with a broken controller.

> David Butler
> gdb@dbSystems.com

Regards,

Eric Dorman
edorman@ucsd.edu
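P.S.  Roughly what I was picturing for going indirect through a table
instead of hardwiring ideread/idewrite into port/sub.c; the names and
layout here are invented for the sake of illustration, a sketch of
the idea rather than working code:

	/* one entry per controller, filled in by the hardware-specific
	 * init code on each architecture; port code only ever walks the
	 * table, so nothing in fs/port has to name ideread/idewrite (or
	 * wrenread/wrenwrite) directly.  Invented names, sketch only. */
	typedef unsigned long ulong;	/* as in the fs source */
	typedef struct Ctlrtab Ctlrtab;

	struct Ctlrtab
	{
		char	code;		/* config letter, e.g. 'h' or 'w' */
		int	ndrive;		/* drives found at init time */
		ulong	nblock[4];	/* size of each drive, in fs blocks (4 drives assumed) */
		int	(*read)(int drive, ulong block, void *buf);
		int	(*write)(int drive, ulong block, void *buf);
	};

	extern Ctlrtab ctlrtab[];	/* built at boot by the pc (etc.) code */
	extern int nctlrtab;

A controller that doesn't come up just gets left out of (or marked
dead in) the table, which is the broken-controller case above.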