9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] Uh oh, disk problems.
@ 2001-04-12 13:04 jmk
  2001-04-12 18:06 ` Dan Cross
  0 siblings, 1 reply; 10+ messages in thread
From: jmk @ 2001-04-12 13:04 UTC
  To: 9fans

On Thu Apr 12 02:07:24 EDT 2001, cross@math.psu.edu wrote:
> In article <001c01c0c235$6bf8da20$16096887@cs.research.belllabs.com> you write:
> >
> >If you are really seeing a NOP then that means the drive command didn't
> >complete in 30s and the driver tried to abort it. That's pretty unusual and
> >that code hasn't been exercised much and it's perfectly plausible for the
> >NOP interrupt to come after the driver thinks the command is finished. You
> >could try putting in some debugging for that condition and see what the
> >state really is before the nop happens.
>
> Thanks for the information, Jim.  I found out through further instrumenting
> the code that it is indeed a bad block on the disk.  Sigh.  I think I'll
> replace the disk before it gets any worse.
>
> >Are you using DMA?
>
> Yes, I am....
>
> 	- Dan C.

Although it's not good that your drive is going bad, it's good that
it's not the driver.

Next time I delve into the driver I'm going to make it more interrupt
driven so you won't see the Inil interrupt.

--jim



* Re: [9fans] Uh oh, disk problems.
  2001-04-12 13:04 [9fans] Uh oh, disk problems jmk
@ 2001-04-12 18:06 ` Dan Cross
  0 siblings, 0 replies; 10+ messages in thread
From: Dan Cross @ 2001-04-12 18:06 UTC
  To: 9fans

In article <20010412130455.A012D19A01@mail.cse.psu.edu> you write:
>Although it's not good that your drive is going bad, it's good that
>it's not the driver.

Ha!  Maybe for you.  ;-)

Nah, I, too, am glad that the driver isn't at fault---software problems
always seem far more insidious than a simple case of bad hardware.  My
only real complaint about the driver is that it didn't emit a more
informative diagnostic, but that's largely an aesthetic issue.

>Next time I delve into the driver I'm going to make it more interrupt
>driven so you won't see the Inil interrupt.

Cool.  btw- the Inil seems to be printed in response to a call to
atanop() which is called after ataready() times out in
atastartgenio().  Why the Inil message is printed out is sort of
unclear to me.  My suspicion is atastartgenio() is attempting to abort
an operation that has ceased to exist.  I suppose a simple way to get
rid of the message would be to add a state flag to the data structure
which describes the controller state that says, ``I'm aborting an
operation....'' and then test for that in atainterrupt(), but I'm very
unfamiliar with the issue, so I doubt that would be correct.  ;-)
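
A minimal sketch of the state-flag idea, for illustration only.  It is
not the code in /sys/src/9/pc/sdata.c: the Ctlr and Drive structures
here are cut-down stand-ins, and the ``aborting'' field and the
ctlrabort() and ctlrinterrupt() functions are hypothetical names.  It
assumes the abort path clears curdrive and that the NOP's interrupt may
arrive afterwards.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Drive Drive;
typedef struct Ctlr Ctlr;

struct Drive {
	int	dev;
};

struct Ctlr {
	Drive*	curdrive;	/* drive with a command in flight, or NULL */
	int	aborting;	/* hypothetical: a NOP was issued to abort a command */
	int	spurious;	/* interrupts that nothing accounts for */
};

enum {
	Handled,	/* interrupt completed the current command */
	AbortAck,	/* interrupt belongs to the aborting NOP */
	Spurious,	/* no command and no abort pending */
};

/* Timeout path: forget the command and note that a NOP is outstanding. */
void
ctlrabort(Ctlr *c)
{
	assert(c != NULL);
	c->curdrive = NULL;
	c->aborting = 1;
}

/* Interrupt path: a nil curdrive is only a surprise if no abort is pending. */
int
ctlrinterrupt(Ctlr *c)
{
	assert(c != NULL);
	if(c->curdrive == NULL){
		if(c->aborting){
			c->aborting = 0;
			return AbortAck;
		}
		c->spurious++;
		return Spurious;
	}
	c->curdrive = NULL;
	return Handled;
}
```

With this, the late NOP interrupt is absorbed quietly as AbortAck, and
only an interrupt with neither a current drive nor a pending abort would
be counted (and presumably reported) as Spurious.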

	- Dan C.




* Re: [9fans] Uh oh, disk problems.
  2001-04-12 18:15 jmk
@ 2001-04-12 18:56 ` Dan Cross
  0 siblings, 0 replies; 10+ messages in thread
From: Dan Cross @ 2001-04-12 18:56 UTC
  To: 9fans

In article <20010412181514.75CC319A19@mail.cse.psu.edu> you write:
>I'd have to familiarise myself with the driver again but I think the NOP
>is issued and it doesn't wait for the interrupt for that command. So, it's
>possible that the interrupt for the NOP appears after the Ctlr state is
>all tidied up after aborting.

Ahh, that makes sense.

>That's why more of this stuff needs to be interrupt driven.

Yes, I see what you were driving at now.

	- Dan C.




* Re: [9fans] Uh oh, disk problems.
@ 2001-04-12 18:15 jmk
  2001-04-12 18:56 ` Dan Cross
  0 siblings, 1 reply; 10+ messages in thread
From: jmk @ 2001-04-12 18:15 UTC
  To: 9fans

On Thu Apr 12 14:07:22 EDT 2001, cross@math.psu.edu wrote:
> Cool.  btw- the Inil seems to be printed in response to a call to
> atanop() which is called after ataready() times out in
> atastartgenio().  Why the Inil message is printed out is sort of
> unclear to me.  My suspicion is atastartgenio() is attempting to abort
> an operation that has ceased to exist.  I suppose a simple way to get
> rid of the message would be to add a state flag to the data structure
> which describes the controller state that says, ``I'm aborting an
> operation....'' and then test for that in atainterrupt(), but I'm very
> unfamiliar with the issue, so I doubt that would be correct.  ;-)

I'd have to familiarise myself with the driver again but I think the NOP
is issued and it doesn't wait for the interrupt for that command. So, it's
possible that the interrupt for the NOP appears after the Ctlr state is
all tidied up after aborting. That's why more of this stuff needs to be
interrupt driven.

--jim



* Re: [9fans] Uh oh, disk problems.
  2001-04-11  3:13   ` Jim McKie
@ 2001-04-12  6:06     ` Dan Cross
  0 siblings, 0 replies; 10+ messages in thread
From: Dan Cross @ 2001-04-12  6:06 UTC
  To: 9fans

In article <001c01c0c235$6bf8da20$16096887@cs.research.belllabs.com> you write:
>
>If you are really seeing a NOP then that means the drive command didn't
>complete in 30s and the driver tried to abort it. That's pretty unusual and
>that code hasn't been exercised much and it's perfectly plausible for the
>NOP interrupt to come after the driver thinks the command is finished. You
>could try putting in some debugging for that condition and see what the
>state really is before the nop happens.

Thanks for the information, Jim.  I found out through further instrumenting
the code that it is indeed a bad block on the disk.  Sigh.  I think I'll
replace the disk before it gets any worse.

>Are you using DMA?

Yes, I am....

	- Dan C.




* Re: [9fans] Uh oh, disk problems.
  2001-04-11  2:57 ` Dan Cross
@ 2001-04-11  3:13   ` Jim McKie
  2001-04-12  6:06     ` Dan Cross
  0 siblings, 1 reply; 10+ messages in thread
From: Jim McKie @ 2001-04-11  3:13 UTC
  To: 9fans


If you are really seeing a NOP then that means the drive command didn't
complete in 30s and the driver tried to abort it. That's pretty unusual and
that code hasn't been exercised much and it's perfectly plausible for the
NOP interrupt to come after the driver thinks the command is finished. You
could try putting in some debugging for that condition and see what the
state really is before the nop happens.

Are you using DMA?




* Re: [9fans] Uh oh, disk problems.
  2001-04-11  0:40 jmk
  2001-04-11  2:00 ` Dan Cross
@ 2001-04-11  2:57 ` Dan Cross
  2001-04-11  3:13   ` Jim McKie
  1 sibling, 1 reply; 10+ messages in thread
From: Dan Cross @ 2001-04-11  2:57 UTC
  To: 9fans

In article <20010411004058.78B7D19B35@mail.cse.psu.edu> you write:
>do you have some type of power management on that causes the
>disc to spin down?

Aha!  Apparently, I did have something of the kind enabled.  I've now
disabled it.  The i/o errors and problems with disk/kfscmd check still
show up, of course.  Looking through the source, one sees that the
``wrenread failed'' error messages are generated by KFS's devwren.c
module; either a seek or a read is failing.  I modified devwren.c to
tell me which, and it appears as though it's the read.

The Inil error is kind of strange....  It seems to be prognosticating a
hardware error (at least, the Err bit is set in the status byte that is
brought in from the ATA controller in atainterrupt() in
/sys/src/9/pc/sdata.c).  The ``curdrive'' pointer in the ctlr is null,
however, hence the Inil message being printed out.  Decomposing the
bits from the status byte, one sees that the device is ready, and
either service or ``device seek complete'' is set.  I'm guessing the
former.  The command that's being responded to appears to be a nop,
however.  Weird.  I guess I'm expecting to see something slightly
different come back from the ATA controller (i.e., something which says,
``Device #0 had an error.'') as opposed to Inil....
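
The decomposition can be sketched as below.  This assumes that the
``51'' in ``Inil00/51+'' is the status byte in hex and that the bits
follow the standard ATA status register layout; the enum names mirror
common driver usage, and statusnames() is a hypothetical helper, not a
function from sdata.c.

```c
#include <assert.h>
#include <string.h>

/* ATA status register bits (standard layout; names are illustrative). */
enum {
	Err	= 0x01,		/* error */
	Drq	= 0x08,		/* data request */
	Dsc	= 0x10,		/* device seek complete; same bit as service */
	Df	= 0x20,		/* device fault */
	Drdy	= 0x40,		/* device ready */
	Bsy	= 0x80,		/* busy */
};

/* Write the names of the bits set in status into buf, space separated. */
char*
statusnames(unsigned status, char *buf, size_t n)
{
	static const struct {
		unsigned	bit;
		const char*	name;
	} tab[] = {
		{Bsy, "Bsy"}, {Drdy, "Drdy"}, {Df, "Df"},
		{Dsc, "Dsc/Serv"}, {Drq, "Drq"}, {Err, "Err"},
	};
	size_t i;

	assert(buf != NULL && n > 0);
	buf[0] = '\0';
	for(i = 0; i < sizeof tab/sizeof tab[0]; i++){
		if((status & tab[i].bit) == 0)
			continue;
		if(buf[0] != '\0')
			strncat(buf, " ", n - strlen(buf) - 1);
		strncat(buf, tab[i].name, n - strlen(buf) - 1);
	}
	return buf;
}
```

Under those assumptions, 0x51 is 0x40|0x10|0x01, so statusnames(0x51,
...) yields ``Drdy Dsc/Serv Err'': ready, seek complete or service, and
the error bit.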

I'm way out of my depth when it comes to these weird buses.  :-)

	- Dan C.




* Re: [9fans] Uh oh, disk problems.
  2001-04-11  0:40 jmk
@ 2001-04-11  2:00 ` Dan Cross
  2001-04-11  2:57 ` Dan Cross
  1 sibling, 0 replies; 10+ messages in thread
From: Dan Cross @ 2001-04-11  2:00 UTC
  To: 9fans

In article <20010411004058.78B7D19B35@mail.cse.psu.edu> you write:
>do you have some type of power management on that causes the
>disc to spin down?

Possibly.  I'll comb through the BIOS and check (I thought that I had
disabled all such things at one point).  I should note that the errors
appeared during or immediately after a period of pretty high disk
activity.  In particular, I was building the ARM libraries so as to
build a Bitsy kernel.  mk had just exited when the first error messages
popped up, and the second set was during disk/kfscmd check.

Hmm, listening very carefully now, it sounds as if the disk *has* spun
down (I booted off of the network and copied all my local data to the
file server as soon as I got the first errors).  However, after
booting once more and running disk/kfscmd check again, I see the
``wrenread ...'' errors again.  :-(

	- Dan C.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [9fans] Uh oh, disk problems.
@ 2001-04-11  0:40 jmk
  2001-04-11  2:00 ` Dan Cross
  2001-04-11  2:57 ` Dan Cross
  0 siblings, 2 replies; 10+ messages in thread
From: jmk @ 2001-04-11  0:40 UTC
  To: 9fans

do you have some type of power management on that causes the
disc to spin down?



* [9fans] Uh oh, disk problems.
@ 2001-04-11  0:33 Dan Cross
  0 siblings, 0 replies; 10+ messages in thread
From: Dan Cross @ 2001-04-11  0:33 UTC
  To: 9fans

Recently, I saw these strange errors, written to the `console' on
my machine:

wrenread failed: i/o error
wrenwrite failed: i/o error
Inil00/51+

This set off a mental red flag.  `disk/kfscmd check' produces the
following:

wrenread failed: i/o error
wrenread failed: i/o error
Inil00/51+

(the above was written to the console, i.e., as ^T^Tp would be.)

term% disk/kfscmd check
checking file system: main
check: "/29000/include/": xtag: p null
check: "/960/": xtag: p null
check free list
lo = 220699; hi = 667811
   28549 files
  667811 blocks in the file system
  220693 used blocks
  447113 free blocks
       5 missing blocks
   38413 maximum qid path
missing: 24702
missing: 24701
missing: 24700
missing: 14199
missing: 14198
term%
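
As a sanity check on the summary above, the block counts are internally
consistent: 220693 used + 447113 free + 5 missing = 667811 blocks.  The
sketch below only restates that arithmetic; it assumes that used, free
and missing blocks partition the file system, which the output suggests
but does not state, and the Fscheck structure and unaccounted() are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Fscheck Fscheck;
struct Fscheck {
	long	total;		/* blocks in the file system */
	long	used;		/* used blocks */
	long	free;		/* free blocks */
	long	missing;	/* blocks neither used nor on the free list */
};

/* Blocks not covered by any category; 0 means the summary adds up. */
long
unaccounted(const Fscheck *f)
{
	assert(f != NULL);
	return f->total - (f->used + f->free + f->missing);
}
```

Filling in the figures from the check output, unaccounted() returns 0,
so the five ``missing'' blocks are the whole discrepancy.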

What I'm really concerned about is whether or not my disk is going
bad.  Can anyone confirm that this is or isn't the case?  If not,
I'll be very happy, but what corrective action should I take?
Sorry if these questions seem naive, but disk errors make me
skittish.  Thanks a lot!

	- Dan C.

(ps- It'd be really nice if there were a searchable interface to
the 9fans archives....)


