9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] cwfs(4) failing: phase error after recover or suicide after normal startup
From: Anthony Sorace @ 2007-09-18 22:28 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


Having played around with cwfs for a week or so now, I'm trying to use
it to migrate my old kenfs. The relevant config line is 'filsys main
cw4f[w<0-3>]'; w4 no longer works. I've gotten w0-3 hooked up to my
cpu server and have created a devmap mapping w4 to a 30GB disk file
(the original w4 disk was slightly larger than that).

Starting up cwfs with -f (and other appropriate invocations) seems to
work fine; I give it the config, it reports the correct mapping, and I
end the conversation with 'recover main' and 'end'. The recover seems
to work fine - it reports the block numbers of a few hundred dumps -
but the process ends with this (I have CHAT(cp) always return true):
	next dump at Wed Sep 19 05:00:00 2007
	c_session 0
	c_attach 0
		fid = 1
		uid = adm
		arg = main
	fworm: read 1400715
		error: phase error -- directory entry not allocated
	panic: FID1 attach to root
	halted at Tue Sep 18 14:16:07 2007.
That phase error is Ealloc in 9p1.c^f_attach.

If I omit the -f from cwfs's invocation after doing the recover, I get
this, instead:
	next dump at Wed Sep 19 05:00:00 2007
	c_session 0
	c_attach 0
		fid = 1
		uid = adm
		arg = main
	cwfs:10685: suicide: sys: trap: divide error pc=0x000131cd
PC there points to /sys/src/cmd/cwfs/cw.c:562 - cwio(); I've got acid
traces in, but I don't see anything obviously wrong (no
divide-by-zero, h->msize is positive, &c).

In this case, only the first cwfs proc has suicided; the rest are
running along, getting 9p requests (although not managing to actually
do anything with them).

I'm going to do more tracing after dinner or tomorrow, but I'm
reasonably stumped at this point. Any pointers or other help is
greatly appreciated. Particularly intriguing is why the behavior
differs, when the first case successfully completes the recover and
moves on. Attached is a summary of an acid debugging run on the
suicided process from the second form, should anyone want to take a
look.

[-- Attachment #2: cwfs.acid.txt --]
[-- Type: text/plain, Size: 2366 bytes --]

: sophia; acid 10685
/proc/10685/text:386 plan 9 executable

/sys/lib/acid/port
/sys/lib/acid/386
acid: lstk()
cwio(dev=0x16f408,addr=0x0,buf=0x8250cc0,opcode=0x8)+0x98 /sys/src/cmd/cwfs/cw.c:562
	cw=0x85fb188
	cb=0xbfd750
	h=0x80a8cc0
	a1=0x25dd9
	bn=0xbfd750
	a2=0x34000001
	max=0x0
	newmax=0x1
	p=0xbfd750
	b=0x20267
	c=0x375bc
	state=0x1
	p1=0x25dd9
	p2=0xbfd750
cwread(dev=0x16f408,b=0x0,c=0x8250cc0)+0x28 /sys/src/cmd/cwfs/cw.c:496
devread(c=0x8250cc0,b=0x0,d=0x16f408)+0x1b6 /sys/src/cmd/cwfs/sub.c:988
	e=0x29743
getbuf(d=0x16f408,addr=0x0,flag=0x1)+0x1de /sys/src/cmd/cwfs/iobuf.c:109
	hp=0xb5a258
	p=0xbffbc0
f_attach(cp=0x85fad28,in=0xdfffebcc,ou=0xdfffeb30)+0x29b /sys/src/cmd/cwfs/9p1.c:241
	p=0x0
	f=0x1762e0
	fs=0x4d3b0
	raddr=0x0
	d=0x202af
fcall9p1(in=0xdfffebcc,ou=0xdfffeb30,cp=0x85fad28)+0x95 /sys/src/cmd/cwfs/console.c:21
	t=0x56
con_attach(fid=0x1,uid=0x4058f,arg=0x1734e8)+0x84 /sys/src/cmd/cwfs/console.c:48
	in=0x1ec56
	ou=0x1ea57
cmd_cfs(argc=0x1,argv=0xdfffeca8)+0x70 /sys/src/cmd/cwfs/con.c:618
	name=0x40571
	fs=0x4d3b0
cmd_exec(arg=0x401a0)+0xdc /sys/src/cmd/cwfs/con.c:118
	line=0x736663
	argv=0xdfffecd4
	argc=0x1
	i=0x1
consserve()+0x3d /sys/src/cmd/cwfs/con.c:20
	i=0x29d0
main(argv=0xdfffef88,argc=0x1)+0x2b0 /sys/src/cmd/cwfs/main.c:331
	nets=0x1
	_argc=0x6d
	_args=0x3fd64
	ann=0x0
	i=0xf
_main+0x31 /sys/src/libc/386/main9.s:16
acid: print(pcfile(0x000131cd))
/sys/src/cmd/cwfs/cw.c
acid: print(pcline(0x000131cd))
562
acid: include("/sys/src/cmd/cwfs/arkive/cw.acid")
acid: cwio:dev
	type	0x08
	init	0xf4
	link	0x00000000
	dlink	0x08250cc0
	private	0x00000008
	size	582328845860864
_10_ {
_2_ wren {
	ctrl	1504264
	targ	0
	lun	136645824
	mapped	11903580
	file	0x00029845
	fd	12581824
	sddir	0x00029743
	sddata	0x0001841f
}
_3_ cat {
	first	0x0016f408
	last	0x00000000
	ndev	136645824
}
_4_ cw {
	c	0x0016f408
	w	0x00000000
	ro	0x08250cc0
}
_5_ j {
	j	0x0016f408
	m	0x00000000
}
_6_ ro {
	parent	0x0016f408
}
_7_ fw {
	fw	0x0016f408
}
_8_ part {
	d	0x0016f408
	base	0
	size	136645824
}
_9_ swab {
	d	0x0016f408
}
}

acid: cwio:h
	maddr	134909120
	msize	12572496
	caddr	12572496
	csize	155097
	fsize	12572496
	wsize	77718
	wmax	1504264
	sbaddr	0
	cwraddr	136645824
	roraddr	8
	toytime	0
	time	135584

acid: cwio:addr
0xdfffea60
acid: cwio:bn
0xdfffea38

acid: rc("cat /dev/text > /mnt/term/Users/anthony/Desktop/cwfs.acid.txt")


* Re: [9fans] cwfs(4) failing: phase error after recover or suicide after normal startup
From: erik quanstrom @ 2007-09-18 23:13 UTC (permalink / raw)
  To: 9fans

	bn = addr % h->msize;

msize must be zero.
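
For anyone following along: on the 386 an integer remainder with a zero divisor raises the divide trap, which matches the suicide message in the original report. Below is a minimal sketch of a guarded version of that line. The `Cache` struct is a one-field stand-in and `bucket_index` is a hypothetical helper, not a function in cwfs.

```c
#include <stdint.h>

/* Stand-in for the cwfs cache header; only the field used here. */
typedef struct Cache {
	int64_t msize;	/* number of map buckets; 0 means the header is not usable */
} Cache;

/*
 * Hypothetical helper: return the bucket for addr, or -1 when the
 * header is missing or msize is not positive, instead of letting
 * addr % 0 trap.
 */
int64_t
bucket_index(const Cache *h, int64_t addr)
{
	if(h == 0 || h->msize <= 0)
		return -1;
	return addr % h->msize;
}
```

A caller would treat -1 as "cache header not initialised" and report it, rather than dying inside the remainder.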

- erik



* Re: [9fans] cwfs(4) failing: phase error after recover or suicide after normal startup
From: Anthony Sorace @ 2007-09-20 23:13 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

that was the first thing i checked. in acid, print(cwio:h) showed
seemingly useful non-0 numbers. but apparently i wasn't paying very
close attention: h in cwio is a Cache*, not a Cache, so i needed
print(*cwio:h). yup, msize is zero.
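
The mix-up is visible in the attached trace: lstk() reports h=0x80a8cc0, and the first field of the bogus dump, maddr 134909120, is that same number in decimal. A small C sketch of the distinction, assuming a two-field stand-in for `Cache` and two hypothetical helpers that are not part of cwfs:

```c
#include <stdint.h>
#include <string.h>

/* Two-field stand-in for the cwfs cache header. */
typedef struct Cache {
	int64_t maddr;
	int64_t msize;
} Cache;

/*
 * What you see when the storage of the pointer variable itself is
 * read as data: the pointer value, which looks like a plausible
 * non-zero number.
 */
uintptr_t
first_word_of_slot(Cache * const *slot)
{
	uintptr_t w;

	memcpy(&w, slot, sizeof w);
	return w;
}

/* What you wanted: follow the pointer, then read the field. */
int64_t
real_msize(Cache * const *slot)
{
	return (*slot)->msize;
}
```

With `Cache c = {0, 0}; Cache *h = &c;`, the first helper returns the address held in h, while the second returns 0.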

geoff's been providing suggestions and thinks that the recover didn't
actually succeed, so i'm focusing on that case for now. more info as
it presents.



* Re: [9fans] cwfs(4) failing: phase error after recover or suicide
From: erik quanstrom @ 2007-09-21  0:03 UTC (permalink / raw)
  To: 9fans

> that was the first thing i checked. in acid, print(cwio:h) showed
> seemingly useful non-0 numbers. but apparently i wasn't paying very
> close attention: h in cwio is a Cache*, not a Cache, so i needed
> print(*cwio:h). yup, msize is zero.
>
> geoff's been providing suggestions and thinks that the recover didn't
> actually succeed, so i'm focusing on that case for now. more info as
> it presents.

that's what i guessed your problem was.  (since msize just had to be zero.)

i would guess that your new fworm is not exactly the same (calculated)
size as your old worm.  i think you can fix this by simply dropping the "f"
from your device string.  this will inhibit the maintenance of the bitmap
at the end of the fake worm.

anyway, what is the compelling reason to move to cwfs?  i have been
spending a lot of time with it in the last 6 weeks.  i've made some substantial
performance improvements.  and i've added aoe.

- erik





* Re: [9fans] cwfs(4) failing: phase error after recover or suicide
From: Anthony Sorace @ 2007-09-21  1:53 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 9/20/07, erik quanstrom <quanstro@quanstro.net> wrote:
// i would guess that your new fworm is not exactly the same (calculated)
// size as your old worm.

the fworm, not the cache? hrm, interesting. it's exactly the same
disks, but i suppose that could be it. i'll take a look at that and
how the bitmap is maintained. i'd expect problems there to show up in
the explicit recover phase (which cwfs's prints say has completed),
but it's worth a check. dropping the "f" is non-destructive in the
face of recover?

i've been looking at auth issues for some of the evening, since it's
complaining about things related to attach. maybe that's a red
herring. i'll take a look at the bitmap tomorrow.

// anyway, what is the compelling reason to move to cwfs?

it's prompted by something in my fs hardware going funny. i suspect
it's just the terminator i have to use on the somewhat odd setup in
that box, but it led to the whole "gee, i'd really like fewer PCs to
maintain" line of thought. the kenfs is also quite old now, and the
size reflects that; i'm considering just moving everything on it over
to venti and putting the box in storage. not to mention a desire to
reduce my power consumption and noise production.

i still think the stand-alone fs has its place, but i don't think my
garage is it.



* Re: [9fans] cwfs(4) failing: phase error after recover or suicide
From: erik quanstrom @ 2007-09-21  2:00 UTC (permalink / raw)
  To: 9fans

> the fworm, not the cache? hrm, interesting. it's exactly the same
> disks, but i suppose that could be it. i'll take a look at that and
> how the bitmap is maintained. i'd expect problems there to show up in
> the explicit recover phase (which cwfs's prints say has completed),
> but it's worth a check. dropping the "f" is non-destructive in the
> face of recover?

yes.  recover doesn't touch the w part of the device.  it just checks the
block after the last block in each dump to see if it's a superblock.  if it
is, it loops.  if it is not, then you're at the end and the cache is cleared.
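
The walk described above can be sketched roughly as follows. This is only an illustration under assumed data structures: `Block`, its `tag` and `next` fields, and `walk_dumps` are hypothetical and do not mirror the real cwfs on-disk layout.

```c
#include <stddef.h>

enum { Tnone, Tsuper, Tdata };

/* Hypothetical in-memory model of a worm block. */
typedef struct Block {
	int tag;	/* Tsuper marks a superblock */
	size_t next;	/* for a superblock: address just past this dump */
} Block;

/*
 * Follow the chain of superblocks beginning at start.  Returns the
 * number of dumps found and stores the address of the last valid
 * superblock in *last.  *last is left untouched if none is found.
 */
size_t
walk_dumps(const Block *worm, size_t nblocks, size_t start, size_t *last)
{
	size_t n, a;

	n = 0;
	a = start;
	while(a < nblocks && worm[a].tag == Tsuper){
		*last = a;
		n++;
		if(worm[a].next <= a)	/* refuse to go backwards or spin */
			break;
		a = worm[a].next;
	}
	return n;
}
```

Nothing here writes to the worm; the walk only reads tags, which is consistent with recover being non-destructive for the w part.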

> maintain" line of thought. the kenfs is also quite old now, and the
> size reflects that; i'm considering just moving everything on it over
> to venti and putting the box in storage. not to mention a desire to
> reduce my power consumption and noise production.

kenfs does run on new hardware.  i'm currently running it on an
intel 5000-series processor and a brand new mb at coraid.  it also
does great with my valinux pIII at home.

> i still think the stand-alone fs has its place, but i don't think my
> garage is it.

electricity:  $5/month.
noise: too much.
not doing maintenance on the fs: priceless. ☺

- erik




* Re: [9fans] cwfs(4) failing: phase error after recover or suicide
From: Anthony Sorace @ 2007-09-21  3:10 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

perfect. removing the f from the config did the trick exactly. i've
got my fs back. i'd still like to understand more why the bitmap is
wrong on the same disks, but that's for another day now. very much
thanks.

i agree having a file server "just run" is worth quite a bit; the
problem is the hardware in mine no longer fits that description, and
i'm temporarily budget constrained.

again, much thanks.
a



* Re: [9fans] cwfs(4) failing: phase error after recover or suicide
From: erik quanstrom @ 2007-09-21  3:39 UTC (permalink / raw)
  To: 9fans

> perfect. removing the f from the config did the trick exactly. i've
> got my fs back. i'd still like to understand more why the bitmap is
> wrong on the same disks, but that's for another day now. very much
> thanks.

cool..

the problem is that the calculation of the device size is subject
to rounding error and if it's one sector off, your bitmap won't
be where it's supposed to be.
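
To see why a one-unit difference matters, consider a layout where the bitmap occupies the tail of the device and its start is derived from the device size. The sketch below assumes such a layout; the block size, the one-bit-per-block bitmap, and `bitmap_start` are illustrative assumptions, not the actual fworm code.

```c
#include <stdint.h>

/* Assumed geometry, for illustration only. */
enum {
	Blocksize = 4096,
	Bitsperblock = Blocksize * 8
};

/*
 * Hypothetical layout: one bit per device block, with the bitmap
 * stored in the last blocks of the device.  Returns the block
 * address where the bitmap begins.
 */
int64_t
bitmap_start(int64_t devblocks)
{
	int64_t bmblocks;

	bmblocks = (devblocks + Bitsperblock - 1) / Bitsperblock;
	return devblocks - bmblocks;
}
```

Under these assumptions a device reported as 1000000 blocks puts the bitmap at block 999969, while one reported as 999999 blocks puts it at 999968, so the same disks read with a slightly different computed size would look for the bitmap in the wrong place.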

> i agree having a file server "just run" is worth quite a bit; the
> problem is the hardware in mine no longer fits that description, and
> i'm temporarily budget constrained.

you can get a valinux box on ebay for $100.

- erik



