From mboxrd@z Thu Jan 1 00:00:00 1970
Message-Id: <200303192307.h2JN7u305651@zamenhof.cs.utwente.nl>
To: 9fans@cse.psu.edu
Subject: [9fans] [OT] heat-stroken disk? (was fossil/venti: diskReadRaw failed)
In-reply-to: Your message of "Wed, 19 Mar 2003 16:12:46 -0500." <6e2594b6d7f0bd77f330740aa143cf09@plan9.bell-labs.com>
References: <6e2594b6d7f0bd77f330740aa143cf09@plan9.bell-labs.com>
From: Axel Belinfante
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <5645.1048115276.1@zamenhof.cs.utwente.nl>
Date: Thu, 20 Mar 2003 00:07:56 +0100
Topicbox-Message-UUID: 81f0bed2-eacb-11e9-9e20-41e7f4b1d025

> go buy a new disk.

Hmmm... getting off-topic... and long-winded...
the short summary is: watch out, don't overheat your disks...

The disk is a new (half a year old) Maxtor 7200 rpm 80 GB disk,
never used much until now. Similar disks were offered at a reduced
price a while ago at the local student computer shop on campus.
I remember that a colleague who bought a disk then mentioned people
who came back to the shop with problems, and that those problems
seemed to be due to overheated disks.

So far, the computer my disk is in has seldom been on for more than
a couple of hours at a time. Now it has been on for more than 24 hours
at a stretch, with more or less constant disk activity: it contains
the fossil and venti partitions.

Can overheating be the problem? I just opened the computer case and
noticed that the disk is actually not in any airflow whatsoever
(in addition to being on and used longer than ever before).

I switched dma and rwm on again. The first time (while repeating the
'cp .../fossil /dev/null' experiment) this gave me some (kernel?)
message about dma printed to the console, and the machine (or at
least rio, when changing active windows) seemed to slow down quite
a bit. The dmactl and rwmctl fields in the output of
cat /dev/sdD1/ctl were reset to 0.
After a while I retried switching dma/rwm on, and it stayed on.
I repeated the 'cp /dev/sdD1/fossil /dev/null' experiment a couple
of times. After a few failures as before, it succeeded!
The first time it succeeded, about halfway through, stats showed a
brief but almost complete drop in context, syscalls and interrupts.
After the first success, repeated experiments (just a handful, and
then 10+10 more) were all successful.

I also just gave another 'snap -a', and the first blocks seem to
have been written o.k. (for what it's worth), if that's what the
'disk: io=10000 at ...' lines tell me.

So, what does this give me? A disk I fear to really trust?
A reason to reconsider the location of components and the airflow
in the case? A reason to buy an additional fan, just to be sure?

The funny (ahem) thing is that when I had just opened the case,
the disk was indeed warmer than it is now, but even so, that
temperature is in no way comparable to the much higher temperature
of the SCSI disks I use for the fs at the office. Those SCSI disks
are in their own disk cabinet, and even there they heat up enough
to bake an egg on -- or so it seems -- but they seem to be able to
stand that (they have not given problems, and they have been on
constantly for the last 2 years).

Oh well, and then I still have to redo the experiment with
increasing the amount of RAM in a laptop, to report the complete
error message I get from aux/vga after the increase
('not enough free address space').
With the original memory (32 MB) I did not get the error; with
80 MB (and no changes made, apart from changing the RAM) I do get
the error. I searched the archive and found a few hits, but I was
not able to figure out from them how to solve the problem.
As I started to say, I'll redo the experiment and write down the
exact error messages -- maybe tomorrow.

Axel.