9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] bug in disk/format
@ 2002-05-26  4:13 rsc
  2002-05-26  5:11 ` Mike Haertel
  0 siblings, 1 reply; 12+ messages in thread
From: rsc @ 2002-05-26  4:13 UTC (permalink / raw)
  To: 9fans

I can't reproduce your problem.

Was this the first 9fat ever present on the disk?
Was there an earlier 9fat that you had mounted
before running the format and the remount?

The only thing I can think of is that you had run
9fat: earlier, so dossrv had the FATs in its
buffer cache.  When you reformatted, dossrv kept
using the old FATs, hence the breakage of
files on sector boundaries.

I found a different problem, though, which is
fixed on sources.  The amount of space required by
the FATs depends on the number of bits per FAT
entry, which depends on the number of FAT entries,
which depends on the amount of disk space left
over after subtracting out the space used by the FATs.

Oh, and even if you somehow divine the number of bits
per FAT entry, the amount of space required by the FATs
still depends on the number of FAT entries, which
depends on the number of clusters, which depends
on the amount of disk space left over after
subtracting out the space used by the FATs.

Before, we had an approximation that didn't work
when you were close to the dividing line between
12-bit and 16-bit FAT entries (having an 8MB 9fat
would put you close enough to cause problems).
Now we just guess until we find a fixed point.
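
The guessing can be sketched roughly as follows.  This is a
minimal illustration in Python, not the code in disk/format:
the function name, the default reserved and root directory
sizes, and the 4085-cluster cutoff between 12-bit and 16-bit
entries (the usual FAT convention) are all assumptions.

```python
def fat_layout(total_sectors, sector_size=512, sectors_per_cluster=4,
               reserved_sectors=2, nfats=2, root_dir_sectors=32,
               max_iterations=64):
    """Guess the FAT size repeatedly until the guess is self-consistent.

    Returns (bits_per_entry, sectors_per_fat, clusters).
    All default parameter values are illustrative assumptions.
    """
    fat_sectors = 1  # initial guess
    for _ in range(max_iterations):
        # Space left for data once the FATs have taken their share.
        data_sectors = (total_sectors - reserved_sectors
                        - root_dir_sectors - nfats * fat_sectors)
        if data_sectors <= 0:
            raise ValueError("partition too small for this layout")
        clusters = data_sectors // sectors_per_cluster
        # Assumed cutoff: fewer than 4085 clusters means 12-bit entries.
        bits = 12 if clusters < 4085 else 16
        # Two extra entries are reserved at the start of each FAT.
        fat_bytes = ((clusters + 2) * bits + 7) // 8
        needed = -(-fat_bytes // sector_size)  # ceiling division
        if needed == fat_sectors:
            return bits, fat_sectors, clusters  # fixed point reached
        fat_sectors = needed
    raise RuntimeError("no fixed point found")


if __name__ == "__main__":
    print(fat_layout(16384))  # an 8MB partition, near the 12/16-bit line
    print(fat_layout(20482))  # the size of the 9fat partition in this thread
```

With these assumed defaults, a 16384-sector (8MB) partition
looks on the first pass as if it needs 16-bit entries, but
the guess settles on 12-bit entries once the space taken by
the FATs themselves is accounted for.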

This isn't your problem, though -- this bug
makes reads fail much earlier than 55000 bytes
into the file.

Russ




* Re: [9fans] bug in disk/format
  2002-05-26  4:13 [9fans] bug in disk/format rsc
@ 2002-05-26  5:11 ` Mike Haertel
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Haertel @ 2002-05-26  5:11 UTC (permalink / raw)
  To: 9fans

>I can't reproduce your problem.
>
>Was this the first 9fat ever present on the disk?
>Was there an earlier 9fat that you had mounted
>before running the format and the remount?
>
>The only thing I can think of is that you had run
>9fat: earlier, so dossrv had the FATs in its
>buffer cache.  When you reformatted, dossrv kept
>using the old FATs, hence the breakage of
>files on sector boundaries.

Nope, I had already thought of that.  Watch this:

# take care to make sure there is no stale data in dossrv
term% kill dossrv | rc

# zero out the 9fat partition
term% dd -if /dev/zero -of '#S/sd01/9fat' -count 1
1+0 records in
1+0 records out
term% dd -if /dev/zero -of '#S/sd01/9fat' -seek 2
write: i/o error
20481+0 records in
20481+0 records out

# prove the 9fat partition contains nothing but the Plan 9 partition table
term% cat '#S/sd01/9fat'
part 9fat 0 20482
part fs 20482 16971204
part swap 16971204 17912412

# format the 9fat partition and install 9load et al.
# n.b. this is the same version of disk/format as my previous email.
term% disk/format -b /386/pbs -d -r 2 '#S/sd01/9fat' /386/9load /386/9pcdisk /tmp/plan9.ini
Initialising FAT file system
type hard, 10 tracks, 64 heads, 32 sectors/track, 512 bytes/sec
Adding file /386/9load, length 177320
Adding file /386/9pcdisk, length 1744105
Adding file /tmp/plan9.ini, length 185
used 1927168 bytes

# start up a fresh dossrv and look at the 9fat partition.
term% dossrv
dossrv: serving #s/dos
term% mount /srv/dos /n/9fat '#S/sd01/9fat'
term% cmp /n/9fat/9load /386/9load
term% cmp /n/9fat/9pcdisk /386/9pcdisk
/n/9fat/9pcdisk /386/9pcdisk differ: char 169985

But: Hmm, that's very interesting.  This time there is no cmp error
in 9load, but 9pcdisk still got copied incorrectly.  The error is at
a different offset than last time.

Just for good measure I tried the exact same sequence of commands
again, and got the exact same result: a cmp difference in 9pcdisk at
offset 169985.
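
The offsets cmp has reported in this thread can be checked
against the 512-byte sector size shown in the format output.
A small sketch in Python, assuming cmp numbers characters
from 1; the function name is made up for illustration.

```python
SECTOR_SIZE = 512  # bytes/sec as printed by disk/format above


def cmp_char_to_sector(char_number, sector_size=SECTOR_SIZE):
    """Convert a 1-based cmp 'char N' into (sector index, byte within sector)."""
    offset = char_number - 1  # assumed: cmp counts the first byte as char 1
    return divmod(offset, sector_size)


if __name__ == "__main__":
    for char_number in (55809, 169985):
        sector, within = cmp_char_to_sector(char_number)
        print(char_number, "-> sector", sector, "byte", within)
```

Under that assumption both differences start exactly on a
sector boundary: char 55809 is the first byte of sector 109
of 9load, and char 169985 is the first byte of sector 332
of 9pcdisk.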

Here is the md5 checksum of my /bin/disk/format binary: hopefully it
will be identical to what was on sources before your fix.  (Is there
a dump filesystem on sources that the rest of us can look at?)

term% md5sum /bin/disk/format
73e3a9480e3c973097e95dfcfd015a85	/bin/disk/format

Finally, just to rule out the possibility of a corrupt executable
text cache on my machine, I copied /bin/disk/format to /tmp/format
and ran the same example again using /tmp/format, with the same
result.

By the way, in your attempt to reproduce this problem did you use
a system with a SCSI disk?  Note that the behavior of opendisk()
is different for IDE vs. SCSI disks, particularly how it guesses
the disk geometry.  In the past I submitted a bugfix to disk.c that
would greatly increase the likelihood of SCSI geometry being correctly
guessed, and I see my patch was ignored...

In fact, disk/format guessed the wrong geometry for my SCSI disk:
the BIOS geometry is 255 heads and 63 sectors per track.  But even
if disk/format assumes the wrong geometry when it creates a DOS
filesystem, dossrv should work just fine since dossrv uses linear
addressing and so doesn't know or care about c/h/s geometry.



* Re: [9fans] bug in disk/format
@ 2002-05-27 17:49 rsc
  0 siblings, 0 replies; 12+ messages in thread
From: rsc @ 2002-05-27 17:49 UTC (permalink / raw)
  To: 9fans

> umm, isn't this a bug in the driver?  the man page for write(2) says,
>
> The number of characters actually written is returned.
> It should be regarded as an error if this is not the same as requested.
>
> so yes, disk/format should check for short writes, but it should fail
> with a write error if it gets one, no?

and indeed it does, now.  it also only tries to write 8k
at a time to avoid aggravating the sd53c8xx problem.

> question, if it is an error, should errstr() be set?

it won't be, because the system call didn't return -1.
not sure whether it should be.

russ




* Re: [9fans] bug in disk/format
@ 2002-05-27  8:32 nigel
  0 siblings, 0 replies; 12+ messages in thread
From: nigel @ 2002-05-27  8:32 UTC (permalink / raw)
  To: 9fans

> rsc writes:
>> For better or worse, the sd53c8xx driver
>> truncates very large writes (like an entire kernel!).  It returns
>> the correct short write count, but format wasn't checking
>> for short writes.
>
> umm, isn't this a bug in the driver?  the man page for write(2) says,
>

Yep, and I think "for better or worse" acknowledges that.

I fixed this over a year ago, but somehow it never made it into the 3rd,
and hence not the 4th, edition.

I've posted the fix to 9trouble, but anyone wanting to read/write more than
128K at a time to a Symbios-controlled disk can pick up the files at

http://www.9fs.org/dist/ncr/sd53c8xx.tgz




* Re: [9fans] bug in disk/format
  2002-05-26 22:13 rsc
@ 2002-05-27  6:31 ` Michael Baldwin
  0 siblings, 0 replies; 12+ messages in thread
From: Michael Baldwin @ 2002-05-27  6:31 UTC (permalink / raw)
  To: 9fans

rsc writes:
> For better or worse, the sd53c8xx driver
> truncates very large writes (like an entire kernel!).  It returns
> the correct short write count, but format wasn't checking
> for short writes.

umm, isn't this a bug in the driver?  the man page for write(2) says,

The number of characters actually written is returned.
It should be regarded as an error if this is not the same as requested.

so yes, disk/format should check for short writes, but it should fail
with a write error if it gets one, no?

question, if it is an error, should errstr() be set?




* Re: [9fans] bug in disk/format
@ 2002-05-26 22:13 rsc
  2002-05-27  6:31 ` Michael Baldwin
  0 siblings, 1 reply; 12+ messages in thread
From: rsc @ 2002-05-26 22:13 UTC (permalink / raw)
  To: 9fans

This is fixed.  For better or worse, the sd53c8xx driver
truncates very large writes (like an entire kernel!).  It returns
the correct short write count, but format wasn't checking
for short writes.

Now disk/format breaks the write into 8k chunks to be
nicer to the disk subsystem, and watches for short writes.
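
The idea can be sketched like this.  It is an illustration
in Python rather than the actual change to disk/format; the
names write_all and fd_writer are hypothetical, and only the
8k chunk size comes from the message above.

```python
import os


def write_all(write, data, chunk=8192):
    """Write data in chunk-sized pieces and fail loudly on a short write.

    `write` is any callable that takes bytes and returns the number of
    bytes it actually wrote, like the write system call does.
    """
    for offset in range(0, len(data), chunk):
        piece = data[offset:offset + chunk]
        written = write(piece)
        if written != len(piece):
            # A short count is treated as an error, not silently accepted.
            raise OSError(
                f"short write at offset {offset}: "
                f"wrote {written} of {len(piece)} bytes"
            )


def fd_writer(fd):
    """Adapt a file descriptor to the callable expected by write_all."""
    return lambda piece: os.write(fd, piece)
```

A caller would use something like write_all(fd_writer(fd),
data).  Taking a callable rather than a file descriptor makes
it easy to substitute a fake writer that truncates large
requests, which is one way to exercise the error path without
the real controller.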

Update is on sources, along with new binaries.

Russ




* Re: [9fans] bug in disk/format
  2002-05-26  6:19 rsc
@ 2002-05-26 20:40 ` Mike Haertel
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Haertel @ 2002-05-26 20:40 UTC (permalink / raw)
  To: 9fans

>If you update, you'll get a new format that
>has both your patch and my fix from earlier
>tonight.  Do you still see the problem when
>you run that format?  If so, run format with
>the -v flag and let me know what you see.

Yup, the problem is still there.  And, curiously enough, the file
offset at which the miscomparison occurs has gone back to match
that in my *original* bug report.  This suggests that:

1) in my original bug report, I had neglected to first zero the
previous contents of the 9fat partition, so the old version of
disk/format was getting its idea of the geometry from sector 0 of
the 9fat partition, which at the time held the correct geometry
(255 heads, 63 sectors) for the disk.

2) the behavior of the bug does indeed depend on what format's idea
of the geometry is.

So, on the theory that the problem has something to do with format's
notion of the geometry, I tried to set up a fake "sd01" directory
containing ctl, data, and 9fat files constructed to fool format
into thinking it is looking at a real disk, and run format on that.

I discovered that in this case the problem does not occur.

Just to be sure that I had enough bits and pieces in my fake
sd01 directory to fool format into thinking it was looking at a
real disk, I ran format on the real disk drive under "iostats -d"
to see all its I/O.

I discovered in this case the problem also does not occur.

So I can only see the bug when format is *directly* accessing
the real SCSI driver.  Watch this:

# first, we run format -v on the real scsi disk and record its output.
# just to make things r
term% dd -if /dev/zero -of /dev/sd01/9fat -count 1
1+0 records in
1+0 records out
term% dd -if /dev/zero -of /dev/sd01/9fat -seek 2 -count 20480
20480+0 records in
20480+0 records out
term% disk/format -v -b /386/pbs -d -r 2 /dev/sd01/9fat /386/9load /386/9pcdisk /tmp/plan9.ini > format.out1 >[2] format.err1

(at this point, starting up a fresh dossrv and running cmp verifies
that /n/9fat/9load has bogus data starting at byte 55809 and /n/9fat/9pcdisk
has bogus data starting at byte 1)

# now, we run the exact same command under "iostats -d"
term% dd -if /dev/zero -of /dev/sd01/9fat -count 1
1+0 records in
1+0 records out
term% dd -if /dev/zero -of /dev/sd01/9fat -seek 2 -count 20480
20480+0 records in
20480+0 records out
term% iostats -d disk/format -v -b /386/pbs -d -r 2 /dev/sd01/9fat /386/9load /386/9pcdisk /tmp/plan9.ini > format.out2 >[2] format.err2

(at this point, starting up a fresh dossrv and running cmp verifies
that /n/9fat/9load and /n/9fat/9pcdisk were both copied correctly)

# and here we see that disk/format produces different output when it
# is directly accessing the real scsi disk vs. when it is accessing
# the scsi disk proxied through iostats -d.
# (we also see that iostats eats standard error entirely; grr)
term% diff format.out1 format.out2
33c33
< plan9.ini @1BEE00
---
> plan9.ini @1DEC00
term% diff format.err1 format.err2
1,6d0
< add 9load at clust 2
< add 9pcdisk at clust 59
< add plan9.ini at clust 3ad
< add 9load at clust 2
< add 9pcdisk at clust 59
< add plan9.ini at clust 3ad

So at this point my hypothesis is that there is no problem with disk/format
at all, but rather that there is a bug of some kind in the SCSI driver.



* Re: [9fans] bug in disk/format
@ 2002-05-26  6:19 rsc
  2002-05-26 20:40 ` Mike Haertel
  0 siblings, 1 reply; 12+ messages in thread
From: rsc @ 2002-05-26  6:19 UTC (permalink / raw)
  To: 9fans

I stand by "just missed".  Those changes are
too familiar to have been ignored.  Probably I applied
them on my laptop and they never got back to the
real file server.  I wrote the replica(8) tools for
a reason.

In any event, the patch is applied now.

If you update, you'll get a new format that
has both your patch and my fix from earlier
tonight.  Do you still see the problem when
you run that format?  If so, run format with
the -v flag and let me know what you see.

Russ




* Re: [9fans] bug in disk/format
  2002-05-26  5:27 rsc
@ 2002-05-26  6:01 ` Mike Haertel
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Haertel @ 2002-05-26  6:01 UTC (permalink / raw)
  To: 9fans

>> By the way, in your attempt to reproduce this problem did you use
>> a system with a SCSI disk?  Note that the behavior of opendisk()
>
>No, just a zeroed file appropriately sized.
>I don't think format actually cares about
>the geometry.  I don't have any SCSI disks handy.

Format *does* care about the geometry.  If it puts the wrong values
into the boot sector, the boot sector /386/pbs won't work, because
the function BIOSread needs to know the geometry in order to convert
linear offsets to the C/H/S addressing used in Int 13 BIOS calls.
(I use /386/pbslba on my SCSI disk to avoid this problem.)
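
The conversion the boot sector has to perform can be sketched
like this, assuming the conventional C/H/S layout in which
sectors within a track are numbered from 1.  The function
name is hypothetical and this is not the pbs code; it only
shows why the recorded geometry matters.

```python
def lba_to_chs(lba, heads, sectors_per_track):
    """Convert a linear sector number to (cylinder, head, sector).

    Cylinders and heads count from 0; sectors count from 1.
    """
    cylinder, remainder = divmod(lba, heads * sectors_per_track)
    head, sector_index = divmod(remainder, sectors_per_track)
    return cylinder, head, sector_index + 1


if __name__ == "__main__":
    # Same linear sector, two different ideas of the geometry.
    print(lba_to_chs(63, heads=255, sectors_per_track=63))
    print(lba_to_chs(63, heads=64, sectors_per_track=32))
```

Linear sector 63 maps to cylinder 0, head 1, sector 1 with
255 heads and 63 sectors per track, but to cylinder 0, head 1,
sector 32 with 64 heads and 32 sectors per track, so a boot
sector holding the wrong geometry asks the BIOS for the wrong
place on the disk.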

>> is different for IDE vs. SCSI disks, particularly how it guesses
>> the disk geometry.  In the past I submitted a bugfix to disk.c that
>> would greatly increase the likelihood of SCSI geometry being correctly
>> guessed, and I see my patch was ignored...
>
>More likely just missed.  Where is this patch?

https://lists.cse.psu.edu/archives/9fans/2000-October/008039.html

Two messages later (in .../008042.html) you replied with "Thanks for the fix".
Therefore I stand by "was ignored" or perhaps even "was brutally discarded" :-)



* Re: [9fans] bug in disk/format
@ 2002-05-26  5:27 rsc
  2002-05-26  6:01 ` Mike Haertel
  0 siblings, 1 reply; 12+ messages in thread
From: rsc @ 2002-05-26  5:27 UTC (permalink / raw)
  To: 9fans

> Here is the md5 checksum of my /bin/disk/format binary: hopefully it
> will be identical to what was on sources before your fix.  (Is there
> a dump filesystem on sources that the rest of us can look at?)

Sources is a kfs, but I do have dump CDs.  I'll see if I can
make them available on a separate tcp port.

> term% md5sum /bin/disk/format
> 73e3a9480e3c973097e95dfcfd015a85	/bin/disk/format

That's correct.

> Finally, just to rule out the possibility of a corrupt executable
> text cache on my machine, I copied /bin/disk/format to /tmp/format
> and ran the same example again using /tmp/format, with the same
> result.
>
> By the way, in your attempt to reproduce this problem did you use
> a system with a SCSI disk?  Note that the behavior of opendisk()

No, just a zeroed file appropriately sized.
I don't think format actually cares about
the geometry.  I don't have any SCSI disks handy.

> is different for IDE vs. SCSI disks, particularly how it guesses
> the disk geometry.  In the past I submitted a bugfix to disk.c that
> would greatly increase the likelihood of SCSI geometry being correctly
> guessed, and I see my patch was ignored...

More likely just missed.  Where is this patch?

> In fact, disk/format guessed the wrong geometry for my SCSI disk:
> the BIOS geometry is 255 heads and 63 sectors per track.  But even
> if disk/format assumes the wrong geometry when it creates a DOS
> filesystem, dossrv should work just fine since dossrv uses linear
> addressing and so doesn't know or care about c/h/s geometry.

Right.




* Re: [9fans] bug in disk/format
@ 2002-05-26  5:22 Geoff Collyer
  0 siblings, 0 replies; 12+ messages in thread
From: Geoff Collyer @ 2002-05-26  5:22 UTC (permalink / raw)
  To: 9fans

I've had problems with 4e dossrv on scsi disks too.  It seems to be
worse if the files being copied into have the "al" bits set.  I've
taken to running scandisk in dos when I reboot after modifying a dos
fs.  Sometimes it finds damage, usually in the files copied into.




* [9fans] bug in disk/format
@ 2002-05-26  3:01 Mike Haertel
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Haertel @ 2002-05-26  3:01 UTC (permalink / raw)
  To: 9fans

Here is a report of a bug in disk/format that results
in broken 9fat partitions that won't boot.

The test system is using a 9GB IBM SCSI disk drive
connected to an 53c875 based controller.  Plan 9
is the only operating system on this computer.

term% cat /dev/sd01/ctl
inquiry IBM     DGHS09Y         03E0682B91A1GAGSPMT03E
geometry 17916240 512
part data 0 17916240
part plan9 63 17912475
part 9fat 63 20545
part fs 20545 16971267
part swap 16971267 17912475

In this demonstration, /dev/sd01/9fat initially contains all
bytes 0 except for the Plan 9 partition table in block 1.

Now we run disk/format following the example in the
man page:

term% disk/format -b /386/pbs -d -r 2 /dev/sd01/9fat /386/9load /386/9pcdisk /tmp/plan9.ini
Initialising FAT file system
type hard, 10 tracks, 64 heads, 32 sectors/track, 512 bytes/sec
Adding file /386/9load, length 177320
Adding file /386/9pcdisk, length 1744105
Adding file /tmp/plan9.ini, length 185
used 1927168 bytes

Now, using dossrv, we unfortunately find that 9load and
9pcdisk were not copied correctly!

term% 9fat:
term% cmp /386/9load /n/9fat/9load
/386/9load /n/9fat/9load differ: char 55809
term% cmp /386/9pcdisk /n/9fat/9pcdisk
/386/9pcdisk /n/9fat/9pcdisk differ: char 1

The system crashes in 9load when I try to boot from
the resulting 9fat partition.

Using dossrv and cp I was able to manually recopy 9load
and 9pcdisk to the 9fat partition to make a bootable system,
but this would be a bad experience for a new user.


