From mboxrd@z Thu Jan  1 00:00:00 1970
From: Christopher Nielsen <cnielsen@pobox.com>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] venti+fossil woes
Message-ID: <20031118124023.GF65844@cassie.foobarbaz.net>
References: <20031114231842.GC834@cassie.foobarbaz.net> <20031116013757.GO834@cassie.foobarbaz.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20031116013757.GO834@cassie.foobarbaz.net>
User-Agent: Mutt/1.5.3i
Date: Tue, 18 Nov 2003 04:40:23 -0800
Topicbox-Message-UUID: 8e6fe01a-eacc-11e9-9e20-41e7f4b1d025

Here's an update for anyone interested, since I can't
manage to get to sleep for some reason.

I bought some better-quality ata cables yesterday. That
helped to the point that I thought my troubles were over.
No such luck.

Now, whenever a venti arena becomes full and is being
sealed, the screen fills with IBsy+ repeated ad infinitum,
which I know is from the ata driver. Eventually, fossil
gives an error from diskReadRaw() saying something like:

archive(0, <block addr>): cannot find block: i/o error

followed by a dump that I presume could be useful for
diagnostics.

My guess is that there is so much contention in the
controller that reads, and sometimes writes, time out.
This eventually causes fossil to just fall over dead, at
which point I reboot from a CD, run venti/checkarenas -vf
on the arena partition, and then reboot so that fossil can
continue where it left off with the snapshot.

Wash, rinse, repeat.

Anyway, the saga continues. We'll see if I end up losing
data. I'm still guessing not. My only comment is that it
would be nice if fossil would handle such error conditions
more gracefully.

Regardless, I am going to dig around for another ata
controller to spread the disks across.

On Sat, Nov 15, 2003 at 05:37:57PM -0800, Christopher Nielsen wrote:
> this is looking more and more like it was a hardware
> problem. reseating all the connections eliminated most
> of the errors i was seeing. now i am getting errors
> from diskRawWrite, which leads me to believe that one
> of the disks is going bad. i can't really tell which
> one, though. the error message from diskRawWrite gives
> some diagnostic info, but i don't know how to interpret
> it. admittedly, i haven't dived into the source as much
> as i could, but maybe someone can provide some insight
> before i go ahead and do that.
>
> thanks to everyone that has provided input so far.
>
> i have to say, it doesn't look like i'm going to lose
> any data. it's not certain yet, but it's looking good.
> the paranoia in fossil and venti are good.
>
> On Fri, Nov 14, 2003 at 03:18:42PM -0800, Christopher Nielsen wrote:
> > fossil crashed in the middle of an archival snapshot.
> > now, i'm getting
> >
> > err 4: no space left in arenas
> > failed to write lump for <vac score>: no space left in arenas
> >
> > there's plenty of space left in the arenas. a whole other
> > 167G disc, in fact.
> >
> > i've run venti/checkarenas and venti/checkindex to fix any
> > inconsistencies. they were both successful according to the
> > output.
> >
> > any ideas about what is going on and how to fix it?
> >
> > also, is there any way to tell fossil to stop trying to do
> > the snapshot?
>
> --
> Christopher Nielsen
> "They who can give up essential liberty for temporary
> safety, deserve neither liberty nor safety." --Benjamin Franklin

--
Christopher Nielsen
"They who can give up essential liberty for temporary
safety, deserve neither liberty nor safety." --Benjamin Franklin