From mboxrd@z Thu Jan  1 00:00:00 1970
Message-Id: <200303192307.h2JN7u305651@zamenhof.cs.utwente.nl>
To: 9fans@cse.psu.edu
Subject: [9fans] [OT] heat-stroken disk? (was fossil/venti: diskReadRaw failed)
In-reply-to: Your message of "Wed, 19 Mar 2003 16:12:46 -0500."
             <6e2594b6d7f0bd77f330740aa143cf09@plan9.bell-labs.com> 
References: <6e2594b6d7f0bd77f330740aa143cf09@plan9.bell-labs.com> 
From: Axel Belinfante <Axel.Belinfante@cs.utwente.nl>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <5645.1048115276.1@zamenhof.cs.utwente.nl>
Date: Thu, 20 Mar 2003 00:07:56 +0100
Topicbox-Message-UUID: 81f0bed2-eacb-11e9-9e20-41e7f4b1d025

> go buy a new disk.

Hmmm... getting off-topic... and long-winded...
the short summary is: watch out, don't overheat your disks...


The disk is a fairly new (half a year old) Maxtor 7200 rpm 80GB disk,
never used much until now.

Similar disks were offered for a reduced price a while
ago at the local student computer shop at the campus.
I remember that a colleague who bought a disk then
mentioned people who came back to the shop with problems,
where the problems seemed to be due to overheated disks.

So far, the computer my disk is in has seldom been
on for more than a couple of hours at a time.
Now it has been on for more than 24 hours at a time,
with more or less constant disk usage/activity -
it contains the fossil and venti partitions.
Can overheating be the problem?

I just opened the computer case, and noticed that
the disk is, actually, not really in any airflow whatsoever
(in addition to being on and used longer than ever before).

I switched dma and rwm on again. The first time
(while repeating the 'cp .../fossil /dev/null' experiment)
this gave me some (kernel?) message about dma printed to
the console, and the machine (or at least rio, when
changing active windows) seemed to slow down
quite a bit. The dmactl and rwmctl fields in the output
of cat /dev/sdD1/ctl were reset to 0.

After a while I retried switching dma/rwm on, and it
stayed on.  I repeated the 'cp /dev/sdD1/fossil /dev/null'
experiment a couple of times.
After a few failures as before, it succeeded!
The first time it succeeded, about halfway through, stats
showed a brief but almost complete drop in context switches,
syscalls and interrupts. After the first success,
repeated experiments (just a handful, and then 10+10 more)
were all successful.
I also just gave another 'snap -a' and the first blocks
seem to have been written o.k. (for what it's worth),
if that's what the disk: io=10000 at ... lines tell me.
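
The repeated read test above can be sketched as a small script.
This is only a hypothetical Python sketch, not the commands used
above (those were plain cp runs on Plan 9); the function names
read_through and repeat_read and the chunk size are assumptions
made for illustration. It reads a file or device from start to
end, throws the data away, and counts how many passes fail with
an I/O error.

```python
import sys


def read_through(path, chunk_size=1 << 20):
    """Read path once from start to end, discarding the data.

    Roughly what 'cp path /dev/null' does. Returns the byte count.
    Raises OSError if the open or any read fails.
    """
    total = 0
    # buffering=0 so every read goes straight to the OS.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def repeat_read(path, runs=10, chunk_size=1 << 20):
    """Run read_through 'runs' times; return (successes, failures)."""
    ok = 0
    failed = 0
    for i in range(runs):
        try:
            read_through(path, chunk_size)
            ok += 1
        except OSError as e:
            failed += 1
            print(f"run {i + 1}: read failed: {e}", file=sys.stderr)
    return ok, failed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python repeat_read.py /path/to/partition 20
    runs = int(sys.argv[2]) if len(sys.argv) > 2 else 10
    ok, failed = repeat_read(sys.argv[1], runs)
    print(f"{ok} ok, {failed} failed out of {runs} runs")
```

Counting failures over a fixed number of runs makes it easier to
compare a warm disk against a cooled-down one than a handful of
manual attempts does.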

So, what does this give me?
A disk I fear to really trust?
A reason to reconsider the location of
components and the airflow in the case?
A reason to buy an additional fan, just to be sure?

The funny (ahem) thing is that when I had just opened
the case, the disk was indeed warmer than it is now,
but even so, that temperature is in no way comparable to
the much higher temperature of the scsi disks I use for
the fs at the office. Those scsi disks are in their own
disk cabinet, and even there they heat up enough to bake
an egg on -- or so it seems. Yet they seem to be able to
stand that (they have not given problems, and they have been
on constantly for the last 2 years).


Oh well, and then I still have to redo the experiment
with increasing the amount of RAM in a laptop,
to report the complete error message I get from aux/vga
after the increase ('not enough free address space').
With the original memory (32MB) I did not get the error;
with 80MB (and no changes made, apart from changing the RAM)
I do get the error. I searched the archive and found a few
hits, but I was not able to figure out from them how to
solve the problem.
As I started to say, I'll redo the experiment and
write down the exact error messages -- maybe tomorrow.

Axel.