9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] file server trouble
@ 2002-04-02 20:57 forsyth
  0 siblings, 0 replies; 9+ messages in thread
From: forsyth @ 2002-04-02 20:57 UTC (permalink / raw)
  To: 9fans

i'd try recover main in config mode.  for whatever reason, some blocks
were not written out.  alternatively, they were, but the fworm bitmap blocks
weren't, and if they aren't right, you still can't read the data blocks.
in some cases, you might need to build a kernel to allow you to go further
back than `recover main' (with a pseudo-worm), but i've not actually had to do that.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [9fans] file server trouble
@ 2002-04-03 20:28 forsyth
  0 siblings, 0 replies; 9+ messages in thread
From: forsyth @ 2002-04-03 20:28 UTC (permalink / raw)
  To: 9fans

normally, if you recover to a dump, it should be fine, BUT
the system now needs to account for the fact that you've
used more blocks with a partially-attempted dump.
it needs to set the basis for future allocations to
the base of the unwritten material.  that's what the commented line does.
(see cwgrow.)

actually, what it ought to do instead of guessing that 100 will be enough
is to trundle down the blocks from where it thinks it can start
until it finds an unwritten block.  it can tell that.
i've got something like it somewhere; i needed
to copy an fworm to a worm.
	from = f1->dev;
	if(from->type == Devcw)
		from = from->cw.w;	/* for a cached worm, use the underlying worm device */
	devinit(from);
	/* find last valid block, in case it's an fworm */
	lim = devsize(from);
	for(;;){
		if(lim == 0)
			panic("no blocks to copy on %D", from);
		p = getbuf(from, lim-1, Bread);
		if(p){
			/* block lim-1 read successfully: it's the last written block */
			putbuf(p);
			break;
		}
		lim--;
	}
	print("limit %ld\n", lim);

h->fsize should then be set to lim+1.



* Re: [9fans] file server trouble
@ 2002-04-03 20:08 anothy
  0 siblings, 0 replies; 9+ messages in thread
From: anothy @ 2002-04-03 20:08 UTC (permalink / raw)
  To: 9fans

// i wouldn't necessarily suspect the drive.

good to know, and thanks for explaining what's likely going on.

i'm a little reluctant to fiddle with the file server kernel without
understanding what's going on (particularly given the nature of
the comment on that line!), and i'm afraid the file server code is
just a bit dense (as in compact, tightly-written) for me to get a
good handle on so far. what's this line _do_? it looks like it's setting
the cache size based on the superblock size, but i don't get
what that _means_ given the cache size is defined at config.

the file server's currently returning a few "Ephase" errors (and
let me just tell you how happy i am to get errors where the
diagnostic is "can't happen"!), which seems consistent with what
you're saying. but if i recover to a dump with these phase errors
in them, are they likely to ever go away? should i re-create the
files in question and allow the fs to do a dump before trying to
do the recover?

much continued gratitude...
ア




* Re: [9fans] file server trouble
@ 2002-04-03 16:56 forsyth
  0 siblings, 0 replies; 9+ messages in thread
From: forsyth @ 2002-04-03 16:56 UTC (permalink / raw)
  To: 9fans


i wouldn't necessarily suspect the drive.

it's saying that the fworm bitmap claims a
block has already been written, which isn't allowed
on a write-once structure.  that's fine, except
it might have been interrupted at some point when
the fworm bitmap had been written but pointers
in the super block and other structures updating
the allocation addresses had not been written.  the data might
or might not be in those blocks.
actually, i'd suspect it had crashed or rebooted
during a dump.

do the recover again, but try a kernel that uses a larger value in cw.c for
	h->fsize = s->fsize + 100;		/* this must be conservative */

you could also just comment out the fworm diagnostic.
i've done that in the past but i knew it was safe in those
cases because i had a clearer picture of its state when it crashed.




* Re: [9fans] file server trouble
@ 2002-04-03 16:39 anothy
  0 siblings, 0 replies; 9+ messages in thread
From: anothy @ 2002-04-03 16:39 UTC (permalink / raw)
  To: 9fans

thanks to geoff's pointers, i was able to build a kernel that
correctly recovered my file system. the file server is back
up, and stays running in the face of normal use. but things
still aren't "right". i'm now getting lots of messages of the
following form on my file server console:

fworm: write 152116
cwio: write induced dump error - w worm
fworm: write 152117
cwio: write induced dump error - w worm
fworm: write 152151
cwio: write induced dump error - w worm
fworm: write 152152
cwio: write induced dump error - w worm

this started after a dump was attempted. the above block
would repeat every few seconds (~10). after the next
dump, a new set was generated, with about twice the
number of errors, printing about every 20 seconds. this
looks to me like a bad disk (which is really annoying, since
it's a few months old). do others agree?

between now and when i can get the disk replaced, is
there any mechanism in the fs kernel for marking blocks
"bad"? and if not, am i correct in my belief that all my
dumps into the future will be unsafe, since dumps include
all the previous blocks (until modification)?
ア




* Re: [9fans] file server trouble
@ 2002-04-02 23:18 Geoff Collyer
  0 siblings, 0 replies; 9+ messages in thread
From: Geoff Collyer @ 2002-04-02 23:18 UTC (permalink / raw)
  To: 9fans

And obviously you'll need a running Plan 9 system somewhere to build a
new kernel.  I can build one and put it somewhere for you to fetch;
send me mail.  But the distribution really should be fixed too so that
recover just works.




* Re: [9fans] file server trouble
@ 2002-04-02 23:06 Geoff Collyer
  0 siblings, 0 replies; 9+ messages in thread
From: Geoff Collyer @ 2002-04-02 23:06 UTC (permalink / raw)
  To: 9fans

plan9pc/9pcfs.c contains

Startsb	startsb[] =
{
	"main",		2892792,
	0
};

The number is the block address of the first super block to consult
when performing a recovery.  For worms, this can be safely set to 2,
the address of the very first super block (SUPER_ADDR).  Larger values
are presumably optimisations, but this shouldn't matter much unless
you've actually got a jukebox.  emelie/9pcfs.c contains this instead:

Startsb	startsb[] =
{
	"main",		2,
	0
};

and that's probably what plan9pc/9pcfs.c should contain too.  2892792
looks like a left-over optimisation from some past file server.
clone/clone.c has yet a different (probably incorrect) value:

Startsb	startsb[] =
{
	"main",	3066839,
	0
};

So try using a value of 2.  Build & boot that kernel and try recover
again.  You can also override this value for the purpose of recovery
by setting conf.firstsb in localconfinit().  emelie/pc.c does this:

	conf.firstsb = 13219302;

and my copy of plan9pc/pc.c does this:

	conf.firstsb = 2;

but it looks like I added that.  I have done successful recovers with
firstsb == 2.  Out of paranoia, I also set

	conf.dumpreread = 1;	/* read and compare in dump copy */




* Re: [9fans] file server trouble
@ 2002-04-02 22:10 anothy
  0 siblings, 0 replies; 9+ messages in thread
From: anothy @ 2002-04-02 22:10 UTC (permalink / raw)
  To: 9fans

tried recover main; no joy, as seen below. the system then
reboots into the exact state i described earlier.
ア


sysinit: main
recover: cp0.10D5
        devinit p0.10
        devinit w0
        drive w0:
                71132958 blocks at 512 bytes each
                4445809 logical blocks at 8192 bytes each
                16 multiplier
        devinit D5
fworm init
        devinit p10.90
        devinit w0
        drive w0:
                71132958 blocks at 512 bytes each
                4445809 logical blocks at 8192 bytes each
                16 multiplier
fworm: read 2892792
stack trace of 2
0x8012325b 0x80136072 0x80110ab9 0x801272d4 0x80106252 0x80106160 0x80113f23 0x8011f49c
0x80126de0 0x80106160 0x80105e57 0x8011f49c 0x80105e57 0x80106160 0x80105e57 0x80123992
0x801314f0 0x80126d08 0x801314f0 0x80129986 0x80117ad6 0x80117ad6 0x801099b9 0x80117aa1
0x8011772f 0x80100396 0x80110008 0x8010024f 0x80101194 0x8012452e 0x8010968f 0x80123dee
0x801241a5 0x80100665 0x8010000a 0x80117aa1 0x8011772f 0x801073ff 0x80123b54 0x801175a8
0x8010024f 0x80101194 0x80123dee 0x801241a5 0x80123cf0 0x80117785 0x8010d01d 0x80105f21
0x80105e57 0x80123cf0 0x80123cf0 1036 stack used out of 4000

panic: recover: no superblock

cpu 0 exiting




* [9fans] file server trouble
@ 2002-04-02 20:36 anothy
  0 siblings, 0 replies; 9+ messages in thread
From: anothy @ 2002-04-02 20:36 UTC (permalink / raw)
  To: 9fans

my file server's not the least bit happy. details follow. please consider
this a distress call and plea for help!

the first indication of trouble is the message:
	WORM SUPER BLOCK READ FAILED
printed upon bootup. this sounds very bad to me. the file server does,
however, boot, accept connections, and serve files as expected. for a
little while, anyway. things go bad when i try to actually _do_ things.
every time i boot my terminal, i get a message of the form:

ark: il: allocating il!10.0.1.105!22788
fworm: read 151856
stack trace of 18
0x8012325b 0x801230c5 0x80105e57 0x80105e57 0x8011ad3a 0x8012586a 0x8011f19d 0x80116eee
0x80106160 0x80105e57 0x80105e57 0x80105e57 0x801132cb 0x8011f49c 0x80126de0 0x80106160
0x80105e57 0x80117ad6 0x80106160 0x80106160 0x80106160 0x801099b9 0x80105e57 0x80105e57
0x80129986 0x801314f0 0x80130c82 0x801321fc 0x80117aa1 0x80117aa1 0x8011772f 0x801073ff
0x80110008 0x8010129d 0x801011f0 0x80101194 0x80123dee 0x801241a5 0x80123cf0 0x8012a7c7
0x80126944 0x80117785 0x80105f0e 0x8012a728 0x80105e57 0x80105e57 0x80129e65 0x80105e57
0x80105eeb 0x80105f0e 0x80105e57 1416 stack used out of 4000

panic: newqid: super block
cpu 0 exiting

things from the "stack trace of" line onward vary, so i've included another below.

ark: il: allocating il!10.0.1.105!25780
fworm: read 151856
stack trace of 13
0x8012325b 0x801230c5 0x80105e57 0x80105e57 0x8011ad3a 0x8012586a 0x8011f19d 0x80116eee
0x80106160 0x80105e57 0x80105e57 0x80105e57 0x801132cb 0x8011f49c 0x80126de0 0x80106160
0x80105e57 0x80117ad6 0x80106160 0x80106160 0x80106160 0x801099b9 0x80105e57 0x80105e57
0x80129986 0x801314f0 0x80130c82 0x801321fc 0x80117aa1 0x80100396 0x80117aa1 0x8011772f
0x80100396 0x801073ff 0x80123b54 0x801175a8 0x8010129d 0x801011f0 0x80101194 0x8010d2f8
0x8010968f 0x80123dee 0x801241a5 0x80123cf0 0x80132414 0x80117785 0x8010d01d 0x8012425f
0x80105e57 0x80129e65 0x80105eeb 0x80105f0e 0x80105e57 1416 stack used out of 4000

panic: newqid: super block
cpu 0 exiting

the terminal gets about half way through booting before the fs dies.
my cpu server boots fine, and runs (at least for a while). shortly after
booting, however, i get a bunch of messages on the fs console:

il: allocating il!10.0.1.100!16428
fworm: read 151856
bufalloc: super block
fworm: read 151856
bufalloc: super block
[...last two messages repeated 14 more times...]

the fs and cpu server continue operating as before. wanting to figure
out what's going on, i did this:

ark: check
fworm: read 151856
FLAGS=10246 TRAP=e ECODE=0 CS=10 PC=801376e9
  AX 00000000  BX 00000073  CX ffffffff  DX 80080688
  SI 809b6c64  DI 00000000  BP 8013fdd8
  DS 0008  ES 0008  FS 0008  GS 0008
  CR0 80010011 CR2 00000000
  ur 800805f4
FLAGS=246 TRAP=1c ECODE=0 CS=10 PC=8012426e
  AX 00000046  BX 8014fc50  CX 00000000  DX 8014fc50
  SI 8014fc50  DI 80144ae2  BP 8014fc78
  DS 0008  ES 0008  FS 0008  GS 0008
  CR0 80010011 CR2 00000000
  lastur 80150ba0
stack trace of 23
0x8012325b 0x8010d519 0x8011795e 0x80117a19 0x80105e57 0x80105e57 0x8010d737 0x8010e210
0x8010b393 0x801270b5 0x8010c8b4 0x80112dbd 0x8011f19d 0x80105e57 0x80106160 0x80105e57
0x8010129d 0x80105f21 0x8010d01d 0x801132cb 0x8011f49c 0x80126de0 0x80106160 0x80105e57
0x80117a85 0x80123b54 0x80123bde 0x801011f0 0x80123f7a 0x80123c8e 0x801376e9 0x801002a4
0x8010973d 0x801098cb 0x8010987f 0x8011772f 0x801073ff 0x80123b54 0x801175a8 0x8010129d
0x801011f0 0x80101194 0x80123dee 0x80123cf0 0x80117ad6 0x80117ad6 0x801099b9 0x8011772f
0x801073ff 0x80123b54 0x801175a8 0x8010129d 0x801011f0 0x80101194 0x80123dee 0x80123cf0
0x80123cf0 0x80117aa1 0x8011772f 0x801073ff 0x80123b54 0x801175a8 0x8010129d 0x801011f0
0x80101194 0x80123dee 0x80123cf0 1616 stack used out of 4000

panic: page fault
cpu 0 exiting

whoops. forcing a dump results in a similar print. the disk is a nice
relatively new (few months) Seagate SCSI thingy, so i'd be somewhat
surprised if it was a hardware issue. the systems didn't lose power
uncleanly. could i recover back to the last dump? other things to try?
i'm in the dark. any help much appreciated.
ア



