[9fans] sick cpu server; 9load hates my SATA controller

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] sick cpu server; 9load hates my SATA controller
@ 2009-01-23 23:08 Anthony Sorace
From: Anthony Sorace @ 2009-01-23 23:08 UTC
  To: Fans of the OS Plan 9 from Bell Labs

About a month ago, the motherboard in my CPU server went bad (visibly
bulging capacitors!). I finally got the replacement part on RMA from
the manufacturer and tried getting things going again yesterday. No
joy, and the problems are strange. The symptoms differ depending on
whether I have drives on sdE[0,1,3] (as was the case before) or
sdE[0,1,2].

When I have drives on sdE[0,1,3], 9load starts and proceeds normally
until halfway through probing my SATA drives. The lines are:

sb600: sata-II with 4 ports
sdiahci: drive 0 in state ready after 0 resets
sdiahci: drive 1 in state ready after 0 resets
sdiahci: drive 2 in state missing after 0 resets
sdiahci: drive 3 in state ready after 0 resets
sdE3: i/o error 50 @0
sdE3: i/o error 50 @1

but (as best I can tell) after the "state missing" line all I/O becomes
dog slow. Characters print at what looks like maybe 300 baud, and
newlines take a few seconds to redraw the screen. Despite the extreme
slowness of printing, it prints the 9load menu I had set up, responds to
my menu choice, and loads the kernel. It prints the "cpu0:" and "apm"
lines as expected (but, again, very slowly), and then "sdE3: i/o error
50 @0" three times. It then finds the kernel, and I get the expected
".886899....." and so on, with the dots printing very slowly (fewer than
one per second, suggesting a more general I/O problem rather than just
slow printing). Once the kernel has finished loading, it prints "entry:
0xf0100020" and becomes totally unresponsive (no ^t^tp, and random typing
produces no characters).

I've disabled what peripherals I can in the BIOS, tried different BIOS
settings for the SATA mode (although I'm sure it was running AHCI
before), and tried different kernels from my boot menu, with no
substantial change. (Loading a gzip'd kernel seems to print each dot
faster, but then hangs after the "=>".)

I've tried booting off an ISO downloaded about two weeks ago and get
similar results: things seem okay until it probes the SATA controller,
when it reports "sb600: sata-II with 4 ports" and then hangs (although
in this case it does respond to ^p).

Note that the part is indeed a 4-port sb600 and there are indeed three
disks attached (although the BIOS and 9load disagree on whether the
second or the third is missing).

If I have drives on sdE[0,1,2], the CD behaves the same way, but the
on-disk kernel gets past asking where root is from and then yields
"panic: fault: 0x11c" as it probes the drives. All the on-disk kernels
behave the same way.


* Re: [9fans] sick cpu server; 9load hates my SATA controller
From: erik quanstrom @ 2009-01-23 23:26 UTC
  To: 9fans

i have been using an sb600-based machine as a terminal for a couple of
years without trouble, so i think that your bios configuration might
have a lot to do with your problems.  it is probably tickling bugs in
the driver.

would you be able and willing to try booting the 9load at
/n/sources/contrib/quanstro/src/9loadaoe
and try the updated kernel driver at
/n/sources/contrib/quanstro/src/9/pc/*ahci*
?

> About a month ago, the motherboard in my CPU server went bad (visibly
> bulging capacitors!). I finally got the replacement part on RMA from
> the manufacturer and tried getting things going again yesterday. No
> joy, and the problems are strange. The symptoms differ depending on
> whether I have drives on sdE[0,1,3] (as was the case before) or
> sdE[0,1,2].

that's definitely a clue.  your bios configuration should
be for no raid, all ahci all the time.

> sb600: sata-II with 4 ports
> sdiahci: drive 0 in state ready after 0 resets
> sdiahci: drive 1 in state ready after 0 resets
> sdiahci: drive 2 in state missing after 0 resets
> sdiahci: drive 3 in state ready after 0 resets
> sdE3: i/o error 50 @0
> sdE3: i/o error 50 @1
>
> but (as best I can tell) after "state missing" line all I/O becomes
> dog slow. Characters print at what looks like maybe 300 baud, newlines
> take a few seconds to redraw the screen.

likely howling interrupts due to the i/o error.  with ahci,
this has often been a power management issue.

> If I have drives on sdE[0,1,2], the case for the CD is the same, but
> the on-disk kernel gets through asking where root is from, and then
> yields "panic: fault: 0x11c" as it probes the drives. All the on-disk
> kernels perform the same way.

sometimes this is caused by sdata getting spurious interrupts.

if you continue to have trouble, i would be more than
happy to debug this problem offline.  just let me know
how it goes.

- erik
