[9fans] more fossil woes

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] more fossil woes
@ 2003-11-01  0:24 andrey mirtchovski
  2003-11-01  4:18 ` jmk
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: andrey mirtchovski @ 2003-11-01  0:24 UTC (permalink / raw)
  To: 9fans

I never thought I'd get to that point, but here it is:

	Fossil is unable to initialize a partition with flfmt.

Here's the whole story:

This morning, after successfully checking my email from home, I arrived at
school only to find that fossil had died with the familiar:

	 assert failed: b->nlock == 1
	fossil 44: suicide: sys: trap: fault read addr=0x0 pc=0x0002b6b7

It was the first crash in a long time, but unfortunately I had no way of
finding out who/what had caused it, because Plan 9 does not allow me to
examine processes' activity based on utilization of a particular resource.
(Interestingly enough, when I suggested such "features" be added to the
system there was an outrage, especially from people who never use Plan 9,
telling me I'm just polluting the beautiful system :)...

I didn't give much thought to the problem and ran fossil/flchk, which,
surprisingly, discovered many more errors than I thought I had. Here's
how many blocks it could no longer access (I run a 3-day wide epoch
window) and suggested that I bfree:

	mirtchov@fbsd$ cat flchk | sed '/^[^b]/d' | wc -l
	 365357
	mirtchov@fbsd$

that's 3 gigs of broken data... For comparison, my entire venti archive
weighs in at 1.3GB.
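
The 3 gig figure can be sanity-checked with a little arithmetic. The
sketch below assumes fossil's default 8K (8192-byte) block size, which
the post does not state, and it counts only lines that begin with
"bfree" (the sed filter above keeps any line starting with "b").
count_bfree, lost_bytes and BLOCK_SIZE are names made up for the example.

```python
BLOCK_SIZE = 8192  # assumed: fossil's default block size; not stated in the post


def count_bfree(lines):
    """Count lines that look like 'bfree ...' suggestions in flchk output."""
    return sum(1 for line in lines if line.startswith("bfree"))


def lost_bytes(nblocks, block_size=BLOCK_SIZE):
    """Total size of nblocks blocks, in bytes."""
    return nblocks * block_size


if __name__ == "__main__":
    n = 365357  # the count reported by wc -l above
    total = lost_bytes(n)
    # prints: 365357 blocks = 2993004544 bytes = 2.99 GB
    print(f"{n} blocks = {total} bytes = {total / 1e9:.2f} GB")
```

Under that assumption the count works out to just under 3 GB, which
matches the estimate above.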

I examined the blocks for any obvious errors and cat'ed them to the fossil
console, which immediately came back with the somewhat new:

	cacheLocalData: addr=7840 type got 16 exp 8: tag got 0 exp 65afd613
	fossil 94: suicide: sys: trap: fault read addr=0x0 pc=0x0002b6b7

A reboot or two later, and I had a running system that was good for checking
email. Only much later, when I needed to do some real work with Plan 9 did I
find out that /acme/bin/* was corrupted! It was showing binaries as
existing, but no file operations could be done on them. At this point I
decided that it was best to reinit fossil with the latest venti score and just
forget about it, but fossil thought differently:

	plan9# fossil/flfmt -v ff96c3967c7815e15a8a4c09196221b01a8bba3d /dev/sdD0/fossil
	cacheLocalData: addr=7841 type got 16 exp 0: tag got 0 exp 6669fe74
	fossil 90: suicide: sys: trap: fault read addr=0x0 pc=0x0002b6b7

exactly the same happens if I try to format the drive:

	plan9# fossil/flfmt /dev/sdD0/fossil
	cacheLocalData: addr=7841 type got 16 exp 0: tag got 0 exp 6669fe74
	fossil 89: suicide: sys: trap: fault read addr=0x0 pc=0x0002b807

for what it's worth, reading from and writing to sdD0 work fine...

Anyway, I have another fossil disk that I can boot and with venti's help
will reinitialize the system. Others may not be that lucky :)

andrey


* Re: [9fans] more fossil woes
  2003-11-01  0:24 [9fans] more fossil woes andrey mirtchovski
@ 2003-11-01  4:18 ` jmk
  2003-11-01  7:56   ` andrey mirtchovski
  2003-11-01  5:35 ` Russ Cox
  2003-11-11  4:59 ` Russ Cox
  2 siblings, 1 reply; 12+ messages in thread
From: jmk @ 2003-11-01  4:18 UTC (permalink / raw)
  To: 9fans

I'd say you had something more fundamental wrong, or else you're not telling
the whole story. If you do the 2nd flfmt as described below you should
get a message like
	fs header block already exists; are you sure? [y/n]:
unless you have the '-y' option.

On Fri Oct 31 19:25:36 EST 2003, mirtchov@cpsc.ucalgary.ca wrote:
> I never thought I'd get to that point, but here it is:
>
> 	Fossil is unable to initialize a partition with flfmt.
>
> Here's the whole story:
>
>
> This morning, after successfully checking my email from home, I arrived at
> school only to find that fossil had died with the familiar:
>
> 	 assert failed: b->nlock == 1
> 	fossil 44: suicide: sys: trap: fault read addr=0x0 pc=0x0002b6b7
>
> It was the first crash in a long time, but unfortunately I had no way of
> finding out who/what had caused it, because Plan 9 does not allow me to
> examine process' activity based on utilization of a particular resource.
> (Interestingly enough, when I suggested such "features" are added to the
> system there was an outrage, especially from people who never use Plan 9,
> telling me I'm just polluting the beautiful system :)...
>
> I didn't give much thought to the problem and ran fossil/flchk, which,
> surprisingly, discovered many more errors than I thought I had. Here's
> how many blocks it could no longer access (I run a 3-day wide epoch
> window) and suggested that I bfree:
>
> 	mirtchov@fbsd$ cat flchk | sed '/^[^b]/d' | wc -l
> 	 365357
> 	mirtchov@fbsd$
>
>
> that's 3 gigs of broken data... For comparison, my entire venti archive
> weighs in at 1.3GB.
>
> I examined the blocks for any obvious errors and cat them to the fossil
> console, which immediately came back with the somewhat new:
>
> 	cacheLocalData: addr=7840 type got 16 exp 8: tag got 0 exp 65afd613
> 	fossil 94: suicide: sys: trap: fault read addr=0x0 pc=0x0002b6b7
>
> A reboot or two later, and I had a running system that was good for checking
> email. Only much later, when I needed to do some real work with Plan 9 did I
> find out that /acme/bin/* was corrupted! It was showing binaries as
> existing, but no file operations could be done on them. At this point I
> decided that it's best to reinit fossil with the latest venti score and just
> forget about it, but fossil thought differently:
>
> 	plan9# fossil/flfmt -v ff96c3967c7815e15a8a4c09196221b01a8bba3d /dev/sdD0/fossil
> 	cacheLocalData: addr=7841 type got 16 exp 0: tag got 0 exp 6669fe74
> 	fossil 90: suicide: sys: trap: fault read addr=0x0 pc=0x0002b6b7
>
> exactly the same happens if I try to format the drive:
>
> 	plan9# fossil/flfmt /dev/sdD0/fossil
> 	cacheLocalData: addr=7841 type got 16 exp 0: tag got 0 exp 6669fe74
> 	fossil 89: suicide: sys: trap: fault read addr=0x0 pc=0x0002b807
>
> for what it's worth, reading from and writing to sdD0 work fine...
>
> Anyway, I have another fossil disk that I can boot and with venti's help
> will reinitialize the system. Others may not be that lucky :)
>
> andrey



* Re: [9fans] more fossil woes
  2003-11-01  0:24 [9fans] more fossil woes andrey mirtchovski
  2003-11-01  4:18 ` jmk
@ 2003-11-01  5:35 ` Russ Cox
  2003-11-01  7:50   ` andrey mirtchovski
  2003-11-11  4:59 ` Russ Cox
  2 siblings, 1 reply; 12+ messages in thread
From: Russ Cox @ 2003-11-01  5:35 UTC (permalink / raw)
  To: 9fans

There are definitely some problems with fossil that I'd like to fix,
but I have very little time these days.  The robustness of flchk is
high on that list.

In any event, if you zero the beginning of the fossil partition
you should be able to start afresh without problems.  Just
cp /dev/zero /dev/sdC0/fossil and then hit rubout after a few
seconds.
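
The same idea with a fixed amount instead of a timed rubout, as a rough
sketch only: zero_prefix and its 16 MB default are invented for this
example, and nothing in the thread says how much of the partition needs
to be cleared. On Plan 9 itself, dd(1) with a block count would do the
same job.

```python
import os


def zero_prefix(path, nbytes=16 * 1024 * 1024, chunk=8192):
    """Overwrite the first nbytes of path with zeros, in place.

    Destructive: whatever was stored in that range is gone afterwards.
    Returns the number of bytes written.
    """
    zeros = bytes(chunk)
    written = 0
    # "r+b" opens for update without truncating, so data past nbytes survives
    with open(path, "r+b") as f:
        while written < nbytes:
            n = min(chunk, nbytes - written)
            f.write(zeros[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())
    return written
```

Try it on a scratch file first; pointing it at a real partition wipes
the start of that partition.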

Sorry.
Russ


* Re: [9fans] more fossil woes
  2003-11-01  5:35 ` Russ Cox
@ 2003-11-01  7:50   ` andrey mirtchovski
  0 siblings, 0 replies; 12+ messages in thread
From: andrey mirtchovski @ 2003-11-01  7:50 UTC (permalink / raw)
  To: 9fans

On Sat, 1 Nov 2003, Russ Cox wrote:

> In any event, if you zero the beginning of the fossil partition
> you should be able to start afresh without problems.  Just
> cp /dev/zero /dev/sdC0/fossil and then hit rubout after a few
> seconds.

i did exactly what you suggested and got:

	cacheLocalData: addr=7841 type got 0 exp 0: tag got 0 exp 6669fe74
	fossil 55: suicide: sys: trap: fault read addr=0x0 pc=0x0002b6b7

is there anything else i can do to debug? the last few retries were done
without the devfs, even though I'm normally using it to mirror two disks.

andrey




* Re: [9fans] more fossil woes
  2003-11-01  4:18 ` jmk
@ 2003-11-01  7:56   ` andrey mirtchovski
  0 siblings, 0 replies; 12+ messages in thread
From: andrey mirtchovski @ 2003-11-01  7:56 UTC (permalink / raw)
  To: 9fans

On Fri, 31 Oct 2003 jmk@plan9.bell-labs.com wrote:

> I'd say you had something more fundamental wrong, or else you're not telling
> the whole story. If you do the 2nd flfmt as described below you should
> get a message like
> 	fs header block already exists; are you sure? [y/n]:
> unless you have the '-y' option.
>

there is no message, fossil dies absolutely immediately after I type the
command (and before I had done the 'cat /dev/zero > /dev/sdC0/fossil' there
was no corruption on the disk whatsoever, so after each crash I rebooted
safely from either devfs or the first disk on the box).

the only thing suspect for me is the earlier crash on the b->nlock == 1
assert, which I have no way of debugging anymore -- there was no log trace
of that crash, except what I had logged on a different machine from the
serial console.

if you think i'm omitting something important -- tell me what it could be, so i
can try and help...

andrey


* Re: [9fans] more fossil woes
  2003-11-11  4:59 ` Russ Cox
@ 2003-11-11  4:35   ` mirtchov
  2003-11-11  6:08     ` Russ Cox
                       ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: mirtchov @ 2003-11-11  4:35 UTC (permalink / raw)
  To: 9fans

>>	assert failed: b->nlock == 1
>>	fossil 44: suicide: sys: trap: fault read addr=0x0 pc=0x0002b6b7
>
> I believe I have just fixed this bug.  Jmk caught one and held it down for me.
> Sources are there now, binary tomorrow.

Several months ago I offered an 8 megabyte log of fossil crashing with
this problem while running in debug mode.  Could've been helpful, but I'm
glad you fixed it :)

>
>> It was the first crash in a long time, but unfortunately I had no way of
>> finding out who/what had caused it, because Plan 9 does not allow me to
>> examine process' activity based on utilization of a particular resource.
>
> I don't understand what you mean here.  What query would you have
> asked the system to help isolate the problem?

Something similar to lunix' "top" command -- what process is taking
the most out of a particular resource.  It can be done currently for
things like memory usage and cputime, but it's a bit difficult when I
want to know which process used the most out of the cpu in the last
second.  Things like interrupts and most of what stats(8) displays are
good to link with process ids too.

(Just realized that I could possibly deduce the most active process
from its scheduling priority. I need to look into that.)
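
The "most cpu in the last second" query comes down to sampling
per-process cpu time twice and ranking by the difference. A minimal
sketch, with the snapshots passed in as plain dicts of pid to cumulative
cpu milliseconds; how the numbers are collected (for example from each
process's status file under /proc) is left out, and busiest is a made-up
name, not an existing tool.

```python
def busiest(before, after, top=5):
    """Rank pids by cpu time consumed between two snapshots.

    before, after: dicts mapping pid -> cumulative cpu time in ms.
    Processes that exited in between are ignored; processes that started
    in between are charged their whole cpu time.
    Returns a list of (pid, delta_ms), largest delta first.
    """
    deltas = []
    for pid, now in after.items():
        used = now - before.get(pid, 0)
        if used > 0:
            deltas.append((pid, used))
    # largest delta first; ties broken by pid so the order is stable
    deltas.sort(key=lambda item: (-item[1], item[0]))
    return deltas[:top]


if __name__ == "__main__":
    t0 = {44: 1200, 57: 300, 90: 10}   # snapshot at time t
    t1 = {44: 1950, 57: 310, 94: 40}   # snapshot one second later
    for pid, ms in busiest(t0, t1):
        print(pid, ms)
```

With the sample snapshots, pid 44 comes out on top with 750 ms, followed
by 94 and 57.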

>> (Interestingly enough, when I suggested such "features" are added to the
>> system there was an outrage, especially from people who never use Plan 9,
>> telling me I'm just polluting the beautiful system :)...
>
> And I definitely don't understand what you mean here.
> I have all sorts of trippy acid to look at who is using
> what.  If you identified an interesting set of questions
> that could be answered by exporting some kernel information
> in a new format, I would be all ears.

There are certain people who like Plan 9 for what (they
think) it stands for (being the anti-lunix) and not for what it is (a
decent OS which is comfortable to work with, if a bit spartan).

This comment was directed at those "don't want any features in Plan 9;
frame the source and put it on the wall" people :)

andrey


* Re: [9fans] more fossil woes
  2003-11-01  0:24 [9fans] more fossil woes andrey mirtchovski
  2003-11-01  4:18 ` jmk
  2003-11-01  5:35 ` Russ Cox
@ 2003-11-11  4:59 ` Russ Cox
  2003-11-11  4:35   ` mirtchov
  2 siblings, 1 reply; 12+ messages in thread
From: Russ Cox @ 2003-11-11  4:59 UTC (permalink / raw)
  To: 9fans

>	assert failed: b->nlock == 1
>	fossil 44: suicide: sys: trap: fault read addr=0x0 pc=0x0002b6b7

I believe I have just fixed this bug.  Jmk caught one and held it down for me.
Sources are there now, binary tomorrow.

> It was the first crash in a long time, but unfortunately I had no way of
> finding out who/what had caused it, because Plan 9 does not allow me to
> examine process' activity based on utilization of a particular resource.

I don't understand what you mean here.  What query would you have
asked the system to help isolate the problem?

I needed to use acid to look around at the various fossil processes
involved in causing the crash.  Given that your fossil was gone,
I have a hard time believing acid would have run.  And I can't
imagine any general questions the system might answer that would
have helped.

> (Interestingly enough, when I suggested such "features" are added to the
> system there was an outrage, especially from people who never use Plan 9,
> telling me I'm just polluting the beautiful system :)...

And I definitely don't understand what you mean here.
I have all sorts of trippy acid to look at who is using
what.  If you identified an interesting set of questions
that could be answered by exporting some kernel information
in a new format, I would be all ears.

Russ


* Re: [9fans] more fossil woes
  2003-11-11  4:35   ` mirtchov
@ 2003-11-11  6:08     ` Russ Cox
  2003-11-11  6:08     ` andrey mirtchovski
  2003-11-11  9:16     ` Richard Miller
  2 siblings, 0 replies; 12+ messages in thread
From: Russ Cox @ 2003-11-11  6:08 UTC (permalink / raw)
  To: 9fans

> >>	assert failed: b->nlock == 1
> >>	fossil 44: suicide: sys: trap: fault read addr=0x0 pc=0x0002b6b7
> >
> > I believe I have just fixed this bug.  Jmk caught one and held it down for me.
> > Sources are there now, binary tomorrow.
>
> Several months ago I offered an 8 megabyte log of fossil crashing with
> this problem while running in debug mode.  Could've been helpful, but I'm
> glad you fixed it :)

Sadly, debug logs are worthless.  That just tells me all the messages,
not exactly what happened.  I'm not sure I've had any bugs
that I solved without looking at the broken fossil directly
(and I've had a LOT of hard bugs).




* Re: [9fans] more fossil woes
  2003-11-11  4:35   ` mirtchov
  2003-11-11  6:08     ` Russ Cox
@ 2003-11-11  6:08     ` andrey mirtchovski
  2003-11-11  9:16     ` Richard Miller
  2 siblings, 0 replies; 12+ messages in thread
From: andrey mirtchovski @ 2003-11-11  6:08 UTC (permalink / raw)
  To: 9fans

On Mon, 10 Nov 2003 mirtchov@cpsc.ucalgary.ca wrote:

> This comment was directed to them "don't want any features in Plan 9;
> frame the source and put it on the wall" people :)

To clear any possible misunderstandings: nobody posting on this list
(including even Choate) is part of this "group" :)




* Re: [9fans] more fossil woes
  2003-11-11  4:35   ` mirtchov
  2003-11-11  6:08     ` Russ Cox
  2003-11-11  6:08     ` andrey mirtchovski
@ 2003-11-11  9:16     ` Richard Miller
  2003-11-11  9:59       ` Lucio De Re
  2 siblings, 1 reply; 12+ messages in thread
From: Richard Miller @ 2003-11-11  9:16 UTC (permalink / raw)
  To: 9fans

> (a
> decent OS which is comfortable to work with, if a bit spartan)

s/if/because/

spartan => easy to understand => comfortable to work with

IMHO.




* Re: [9fans] more fossil woes
  2003-11-11  9:16     ` Richard Miller
@ 2003-11-11  9:59       ` Lucio De Re
  2003-11-11 10:36         ` David Lukes
  0 siblings, 1 reply; 12+ messages in thread
From: Lucio De Re @ 2003-11-11  9:59 UTC (permalink / raw)
  To: 9fans

On Tue, Nov 11, 2003 at 09:16:46AM +0000, Richard Miller wrote:
>
> > (a
> > decent OS which is comfortable to work with, if a bit spartan)
>
> s/if/because/
>
> spartan => easy to understand => comfortable to work with
>
> IMHO.

You're entitled to your opinion, of course, but spartan is more or
less the opposite of opulent, implying a shortage of comforts.

The bit about "easy to understand" does not follow from "spartan",
nor, to be frank, does it make much of a premiss for "comfortable
to work with".

So which is it: Spartan or comfortable?  I think the latter, because
the real needs are addressed and no attempt at satisfying the
artificial needs (curse that browser!) has come along to bloat the
base architecture.

I guess that means that Plan 9 may seem spartan to those who are
not entirely sold on its fundamentals.  No offence intended, of
course.

++L


* Re: [9fans] more fossil woes
  2003-11-11  9:59       ` Lucio De Re
@ 2003-11-11 10:36         ` David Lukes
  0 siblings, 0 replies; 12+ messages in thread
From: David Lukes @ 2003-11-11 10:36 UTC (permalink / raw)
  To: 9fans

Lucio De Re wrote:

>On Tue, Nov 11, 2003 at 09:16:46AM +0000, Richard Miller wrote:
>
>
>>spartan => easy to understand => comfortable to work with
>>
>>IMHO.
>>
>>
IMHO spartan has implications of pain and suffering.
Why not call it "minimalist": then all the fashion/design junkies could
get on board?:-)





