9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] fossil/venti: diskReadRaw failed
@ 2003-03-19 20:34 Axel Belinfante
  2003-03-19 20:39 ` Russ Cox
  0 siblings, 1 reply; 7+ messages in thread
From: Axel Belinfante @ 2003-03-19 20:34 UTC (permalink / raw)
  To: 9fans

I just got the following error (output at the end).
It seems that some things continue to work, at least
fossil is still happily copying. To avoid fossil disk
buffer overflow I just stopped mkext.
I did not stop venti or fossil.

While I was typing this it seemed that fossil tried another
venti access, because I heard the disk and saw another similar
diskReadRaw/archWalk error block (part of the errors at the end).

I don't know whether the last snap -a just finished or
whether the result had already been there for a while,
but there is a fresh vac:.. fingerprint.

The mkfs -a|mkext pipe, of which I stopped mkext, has done
some 12 GB and still has some 15 GB to go.

While I was typing the rest of the message above, stopping
mkext seems to have effectively stopped all activity, as
shown by stats and audible from the lack of disk noise
(though I did not touch the venti or fossil processes).

What's the best course of action now?

Axel.


Here follows all the error output I saw.

term% cat error.diskReadRaw
[...]
disk: io=10002 at 8.397ms
disk: io=10000 at 4.040ms
disk: io=10000 at 3.604ms
disk: io=10000 at 3.489ms
disk: io=10000 at 4.882ms
disk: io=10012 at 8.411ms
diskReadRaw failed: part=3 addr=ee65: i/o error
archive(0, 0xee65): cannot find block: i/o error
archWalk 0xee65 failed; ptr is in 0xece8 offset 190
archWalk 0xece8 failed; ptr is in 0xd5c0 offset 7
archWalk 0xd5c0 failed; ptr is in 0xe877 offset 41
archWalk 0xe877 failed; ptr is in 0xe876 offset 4
archWalk 0xe876 failed; ptr is in 0xe875 offset 2
archWalk 0xe875 failed; ptr is in 0xe874 offset 4
archWalk 0xe874 failed; ptr is in 0xe873 offset 5
archWalk 0xe873 failed; ptr is in 0xe872 offset 2
archWalk 0xe872 failed; ptr is in 0xe871 offset 8
archWalk 0xe871 failed; ptr is in 0xe870 offset 42
archWalk 0xe870 failed; ptr is in 0xf61f offset 0
archWalk 0xf61f failed; ptr is in 0xf61e offset 0
archWalk 0xf61e failed; ptr is in 0xf61d offset 0
archiveBlock 0xf61d: i/o error
diskReadRaw failed: part=3 addr=ee65: i/o error
archive(0, 0xee65): cannot find block: i/o error
archWalk 0xee65 failed; ptr is in 0xece8 offset 190
archWalk 0xece8 failed; ptr is in 0xd5c0 offset 7
archWalk 0xd5c0 failed; ptr is in 0xe877 offset 41
archWalk 0xe877 failed; ptr is in 0xe876 offset 4
archWalk 0xe876 failed; ptr is in 0xe875 offset 2
archWalk 0xe875 failed; ptr is in 0xe874 offset 4
archWalk 0xe874 failed; ptr is in 0xe873 offset 5
archWalk 0xe873 failed; ptr is in 0xe872 offset 2
archWalk 0xe872 failed; ptr is in 0xe871 offset 8
archWalk 0xe871 failed; ptr is in 0xe870 offset 42
archWalk 0xe870 failed; ptr is in 0xf61f offset 0
archWalk 0xf61f failed; ptr is in 0xf61e offset 0
archWalk 0xf61e failed; ptr is in 0xf61d offset 0
archiveBlock 0xf61d: i/o error
disk: io=10027 at 8.080ms
diskReadRaw failed: part=3 addr=ee65: i/o error
archive(0, 0xee65): cannot find block: i/o error
archWalk 0xee65 failed; ptr is in 0xece8 offset 190
archWalk 0xece8 failed; ptr is in 0xd5c0 offset 7
archWalk 0xd5c0 failed; ptr is in 0xe877 offset 41
archWalk 0xe877 failed; ptr is in 0xe876 offset 4
archWalk 0xe876 failed; ptr is in 0xe875 offset 2
archWalk 0xe875 failed; ptr is in 0xe874 offset 4
archWalk 0xe874 failed; ptr is in 0xe873 offset 5
archWalk 0xe873 failed; ptr is in 0xe872 offset 2
archWalk 0xe872 failed; ptr is in 0xe871 offset 8
archWalk 0xe871 failed; ptr is in 0xe870 offset 42
archWalk 0xe870 failed; ptr is in 0xf61f offset 0
archWalk 0xf61f failed; ptr is in 0xf61e offset 0
archWalk 0xf61e failed; ptr is in 0xf61d offset 0
archiveBlock 0xf61d: i/o error
diskReadRaw failed: part=3 addr=ee65: i/o error
archive(0, 0xee65): cannot find block: i/o error
archWalk 0xee65 failed; ptr is in 0xece8 offset 190
archWalk 0xece8 failed; ptr is in 0xd5c0 offset 7
archWalk 0xd5c0 failed; ptr is in 0xe877 offset 41
archWalk 0xe877 failed; ptr is in 0xe876 offset 4
archWalk 0xe876 failed; ptr is in 0xe875 offset 2
archWalk 0xe875 failed; ptr is in 0xe874 offset 4
archWalk 0xe874 failed; ptr is in 0xe873 offset 5
archWalk 0xe873 failed; ptr is in 0xe872 offset 2
archWalk 0xe872 failed; ptr is in 0xe871 offset 8
archWalk 0xe871 failed; ptr is in 0xe870 offset 42
archWalk 0xe870 failed; ptr is in 0xf61f offset 0
archWalk 0xf61f failed; ptr is in 0xf61e offset 0
archWalk 0xf61e failed; ptr is in 0xf61d offset 0
archiveBlock 0xf61d: i/o error
diskReadRaw failed: part=3 addr=efe1: i/o error
archive(0, 0xefe1): cannot find block: i/o error
archWalk 0xefe1 failed; ptr is in 0xef26 offset 158
archWalk 0xef26 failed; ptr is in 0xd5c0 offset 8
archWalk 0xd5c0 failed; ptr is in 0xe877 offset 41
archWalk 0xe877 failed; ptr is in 0xe876 offset 4
archWalk 0xe876 failed; ptr is in 0xe875 offset 2
archWalk 0xe875 failed; ptr is in 0xe874 offset 4
archWalk 0xe874 failed; ptr is in 0xe873 offset 5
archWalk 0xe873 failed; ptr is in 0xe872 offset 2
archWalk 0xe872 failed; ptr is in 0xe871 offset 8
archWalk 0xe871 failed; ptr is in 0xe870 offset 42
archWalk 0xe870 failed; ptr is in 0xf61f offset 0
archWalk 0xf61f failed; ptr is in 0xf61e offset 0
archWalk 0xf61e failed; ptr is in 0xf61d offset 0
archiveBlock 0xf61d: i/o error
diskReadRaw failed: part=3 addr=efe1: i/o error
archive(0, 0xefe1): cannot find block: i/o error
archWalk 0xefe1 failed; ptr is in 0xef26 offset 158
archWalk 0xef26 failed; ptr is in 0xd5c0 offset 8
archWalk 0xd5c0 failed; ptr is in 0xe877 offset 41
archWalk 0xe877 failed; ptr is in 0xe876 offset 4
archWalk 0xe876 failed; ptr is in 0xe875 offset 2
archWalk 0xe875 failed; ptr is in 0xe874 offset 4
archWalk 0xe874 failed; ptr is in 0xe873 offset 5
archWalk 0xe873 failed; ptr is in 0xe872 offset 2
archWalk 0xe872 failed; ptr is in 0xe871 offset 8
archWalk 0xe871 failed; ptr is in 0xe870 offset 42
archWalk 0xe870 failed; ptr is in 0xf61f offset 0
archWalk 0xf61f failed; ptr is in 0xf61e offset 0
archWalk 0xf61e failed; ptr is in 0xf61d offset 0
archiveBlock 0xf61d: i/o error

term% 
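
The log above repeats the same error block several times, but
only two distinct block addresses fail: ee65 (four times) and
efe1 (twice). A minimal Python sketch for tallying this from a
saved log follows. The function name and the sample text are
illustrative assumptions; only the line format is taken from
the output above.

```python
import re
from collections import Counter

# Matches lines such as: diskReadRaw failed: part=3 addr=ee65: i/o error
READ_FAIL = re.compile(r"diskReadRaw failed: part=(\d+) addr=([0-9a-f]+):")

def count_read_failures(log_text):
    """Count diskReadRaw failures per (partition, block address)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = READ_FAIL.match(line.strip())
        if m:
            part = int(m.group(1))
            addr = int(m.group(2), 16)  # the address is printed in hex without 0x
            counts[(part, addr)] += 1
    return counts

# Shortened, hypothetical excerpt in the same format as the log above.
sample = """\
disk: io=10012 at 8.411ms
diskReadRaw failed: part=3 addr=ee65: i/o error
archive(0, 0xee65): cannot find block: i/o error
diskReadRaw failed: part=3 addr=ee65: i/o error
diskReadRaw failed: part=3 addr=efe1: i/o error
"""

for (part, addr), n in sorted(count_read_failures(sample).items()):
    print(f"part={part} addr=0x{addr:x}: {n} failure(s)")
```

A small set of addresses that fail repeatedly, as here, is the
kind of pattern one would check before deciding whether the
problem is localized on the disk.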




* Re: [9fans] fossil/venti: diskReadRaw failed
  2003-03-19 20:34 [9fans] fossil/venti: diskReadRaw failed Axel Belinfante
@ 2003-03-19 20:39 ` Russ Cox
  2003-03-19 20:52   ` Axel Belinfante
  0 siblings, 1 reply; 7+ messages in thread
From: Russ Cox @ 2003-03-19 20:39 UTC (permalink / raw)
  To: 9fans

this is giving you the path from the block with the
i/o error all the way back up to the root block.
given that 0xf61d is readable yet 0xee65 is not,
i'd say you have a real disk error rather than
an overflow.

try running cp /dev/sdC0/fossil /dev/null and
see if you get i/o errors then.

diskReadRaw failed: part=3 addr=ee65: i/o error
archive(0, 0xee65): cannot find block: i/o error
archWalk 0xee65 failed; ptr is in 0xece8 offset 190
archWalk 0xece8 failed; ptr is in 0xd5c0 offset 7
archWalk 0xd5c0 failed; ptr is in 0xe877 offset 41
archWalk 0xe877 failed; ptr is in 0xe876 offset 4
archWalk 0xe876 failed; ptr is in 0xe875 offset 2
archWalk 0xe875 failed; ptr is in 0xe874 offset 4
archWalk 0xe874 failed; ptr is in 0xe873 offset 5
archWalk 0xe873 failed; ptr is in 0xe872 offset 2
archWalk 0xe872 failed; ptr is in 0xe871 offset 8
archWalk 0xe871 failed; ptr is in 0xe870 offset 42
archWalk 0xe870 failed; ptr is in 0xf61f offset 0
archWalk 0xf61f failed; ptr is in 0xf61e offset 0
archWalk 0xf61e failed; ptr is in 0xf61d offset 0
archiveBlock 0xf61d: i/o error
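
To make the path described above easier to read, the archWalk
lines of one error block can be folded into a single chain from
the topmost block down to the failing one. This is only a
sketch: the function names are made up for illustration, and it
assumes it is fed exactly one error block in the format shown.

```python
import re

# Matches lines such as: archWalk 0xee65 failed; ptr is in 0xece8 offset 190
ARCH_WALK = re.compile(
    r"archWalk 0x([0-9a-f]+) failed; ptr is in 0x([0-9a-f]+) offset (\d+)"
)

def walk_steps(log_text):
    """Return [(child, parent, offset), ...] in log order (failing block first)."""
    steps = []
    for line in log_text.splitlines():
        m = ARCH_WALK.match(line.strip())
        if m:
            steps.append((int(m.group(1), 16), int(m.group(2), 16), int(m.group(3))))
    return steps

def is_single_chain(steps):
    """True if each step's parent is the child of the next step."""
    return all(a[1] == b[0] for a, b in zip(steps, steps[1:]))

def format_path(steps):
    """Render the chain top-down: parent [offset] -> child ..."""
    if not steps:
        return ""
    parts = [f"0x{steps[-1][1]:x}"]  # topmost block mentioned in the log
    for child, _parent, offset in reversed(steps):
        parts.append(f"[{offset}] -> 0x{child:x}")
    return " ".join(parts)

# First three archWalk lines of the error block quoted above.
sample = """\
archWalk 0xee65 failed; ptr is in 0xece8 offset 190
archWalk 0xece8 failed; ptr is in 0xd5c0 offset 7
archWalk 0xd5c0 failed; ptr is in 0xe877 offset 41
"""

steps = walk_steps(sample)
print(is_single_chain(steps))
print(format_path(steps))
```

Run over the full block, the chain would start at 0xf61d and
end at 0xee65, the block whose read failed.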




* Re: [9fans] fossil/venti: diskReadRaw failed
  2003-03-19 20:39 ` Russ Cox
@ 2003-03-19 20:52   ` Axel Belinfante
  2003-03-19 21:01     ` Axel Belinfante
  2003-03-19 21:01     ` [9fans] fossil/venti: diskReadRaw failed Russ Cox
  0 siblings, 2 replies; 7+ messages in thread
From: Axel Belinfante @ 2003-03-19 20:52 UTC (permalink / raw)
  To: 9fans

> this is giving you the path from the block with the
> i/o error all the way back up to the root block.
> given that 0xf61d is readable yet 0xee65 is not,
> i'd say you have a real disk error rather than
> an overflow.
> 
> try running cp /dev/sdC0/fossil /dev/null and
> see if you get i/o errors then.

term% cp /dev/sdD1/fossil /dev/null
cp: error reading /dev/sdD1/fossil: i/o error

Guess that's what you mean.
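
The cp test stops at the first error and does not say where it
happened. On a system where Python is available, a sequential
scan that keeps going and records the failing offsets could
look like the sketch below. The 8192-byte block size is an
assumption, not something taken from this thread, and
scan_for_read_errors is a hypothetical helper name.

```python
import os
import sys

def scan_for_read_errors(path, block_size=8192):
    """Read path block by block; return (blocks_read_ok, failed_offsets).

    block_size is an assumed value; adjust it to the real layout.
    """
    failed = []
    ok = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            try:
                data = os.pread(fd, block_size, offset)
            except OSError:
                failed.append(offset)  # remember the offset, keep scanning
            else:
                if not data:
                    break  # unexpected end of file
                ok += 1
            offset += block_size
    finally:
        os.close(fd)
    return ok, failed

if __name__ == "__main__" and len(sys.argv) > 1:
    ok, failed = scan_for_read_errors(sys.argv[1])
    print(f"{ok} blocks read, {len(failed)} failed")
    for off in failed:
        print(f"read error at byte offset {off}")
```

Unlike cp, this continues past a bad block, so it gives a rough
picture of how many areas of the partition are affected.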




* Re: [9fans] fossil/venti: diskReadRaw failed
  2003-03-19 20:52   ` Axel Belinfante
@ 2003-03-19 21:01     ` Axel Belinfante
  2003-03-19 21:12       ` Russ Cox
  2003-03-19 21:01     ` [9fans] fossil/venti: diskReadRaw failed Russ Cox
  1 sibling, 1 reply; 7+ messages in thread
From: Axel Belinfante @ 2003-03-19 21:01 UTC (permalink / raw)
  To: 9fans

Just remembered that, in an attempt to speed things up,
I had echoed 'dma on' and 'rwm on' to /dev/sdD1/ctl.
To be sure, I switched both off and repeated the cp;
it took longer but the result was the same (i/o error).

I wrote in response to Russ:
> > this is giving you the path from the block with the
> > i/o error all the way back up to the root block.
> > given that 0xf61d is readable yet 0xee65 is not,
> > i'd say you have a real disk error rather than
> > an overflow.
> > 
> > try running cp /dev/sdC0/fossil /dev/null and
> > see if you get i/o errors then.
> 
> term% cp /dev/sdD1/fossil /dev/null
> cp: error reading /dev/sdD1/fossil: i/o error
> 
> Guess that's what you mean.



* Re: [9fans] fossil/venti: diskReadRaw failed
  2003-03-19 20:52   ` Axel Belinfante
  2003-03-19 21:01     ` Axel Belinfante
@ 2003-03-19 21:01     ` Russ Cox
  1 sibling, 0 replies; 7+ messages in thread
From: Russ Cox @ 2003-03-19 21:01 UTC (permalink / raw)
  To: 9fans

There you go.  Not fossil's problem (for once).

You might try toggling DMA and see if
the errors go away.  I kind of doubt it.




* Re: [9fans] fossil/venti: diskReadRaw failed
  2003-03-19 21:01     ` Axel Belinfante
@ 2003-03-19 21:12       ` Russ Cox
  2003-03-19 23:07         ` [9fans] [OT] heat-stroken disk? (was fossil/venti: diskReadRaw failed) Axel Belinfante
  0 siblings, 1 reply; 7+ messages in thread
From: Russ Cox @ 2003-03-19 21:12 UTC (permalink / raw)
  To: 9fans

go buy a new disk.




* [9fans] [OT] heat-stroken disk? (was fossil/venti: diskReadRaw failed)
  2003-03-19 21:12       ` Russ Cox
@ 2003-03-19 23:07         ` Axel Belinfante
  0 siblings, 0 replies; 7+ messages in thread
From: Axel Belinfante @ 2003-03-19 23:07 UTC (permalink / raw)
  To: 9fans

> go buy a new disk.

Hmmm... getting off-topic... and long-winded...
the short summary is: watch out, don't overheat your disks...


The disk is a new (half a year old) Maxtor 7200 rpm 80 GB disk,
never used much until now.

Similar disks were offered for a reduced price a while
ago at the local student computer shop at the campus.
I remember that a colleague who bought a disk then
mentioned people who came back to the shop with problems,
where the problems seemed to be due to overheated disks.

So far, the computer my disk is in has seldom been
on for more than a couple of hours at a time.
Now it has been on for more than 24 hours at a time,
with more or less constant disk usage/activity -
it contains the fossil and venti partitions.
Can overheating be the problem?

I just opened the computer case, and noticed that
the disk is, actually, not really in any airflow whatsoever
(in addition to being on and used longer than ever before).

I switched dma and rwm on again. The first time
(while repeating the 'cp .../fossil /dev/null' experiment)
this gave me some (kernel?) message about dma printed to
the console, and the machine (or at least rio, when
changing active windows) seemed to slow down
quite a bit. The dmactl and rwmctl fields in the output
of cat /dev/sdD1/ctl were reset to 0.

After a while I retried switching dma/rwm on, and it
stayed on.  I repeated the 'cp /dev/sdD1/fossil /dev/null'
experiment a couple of times.
After a few failures as before, it succeeded!
The first time it succeeded, about halfway through, stats
showed a brief but almost complete drop in context switches,
syscalls and interrupts. After the first success,
repeated experiments (just a handful, and then 10+10 more)
were all successful.
I also just ran another 'snap -a' and the first blocks
seem to have been written o.k. (for what it's worth),
if that's what the disk: io=10000 at ... lines tell me.

So, what does this give me?
A disk I fear to really trust?
A reason to reconsider the location of
components and the airflow in the case?
A reason to buy an additional fan, just to be sure?

The funny (ahem) thing is that when I had just opened
the case, the disk was indeed warmer than it is now,
but still, that temperature is in no way comparable to
the much higher temperature of the scsi disks I use for
the fs at the office: those scsi disks are in their own
disk cabinet, and even there they heat up enough to bake
an egg on -- or so it seems, but they seem to be able to
stand that (have not given problems, and they have been
on constantly for the last 2 years).


Oh well, and then I still have to redo the experiment
with increasing the amount of RAM in a laptop,
to report the complete error message I get from aux/vga
after the increase ('not enough free address space').
With the original memory (32 MB) I did not get the error;
with 80 MB (and no changes made, apart from changing the RAM)
I do get the error. I searched the archive and found a few
hits, but I was not able to figure out how to solve
the problem from them.
As I started to say, I'll redo the experiment and
write down the exact error messages -- maybe tomorrow.

Axel.


