[9fans] fossil caching venti errors

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] fossil caching venti errors
@ 2010-01-08  6:53 Josef Artur
  0 siblings, 0 replies; 24+ messages in thread
From: Josef Artur @ 2010-01-08  6:53 UTC (permalink / raw)
  To: 9fans


I'm having a problem on a fresh installation. During archival snapshots
I get failures in archWalk (see attached file fossil.err). You can find the
output of Erik's sossd script attached.

Non-archival snapshots are enabled.

- Josef

[-- Attachment #2: fossil.err --]
[-- Type: application/octet-stream, Size: 2466 bytes --]

fossil: diskReadRaw failed: /dev/sdC0/fossil: score 0x00303cd8: part=label block 3161304: illegal block address
archive(0, 0x6e3b0a0a): cannot find block: error reading block 0x00303cd8
archWalk 0x6e3b0a0a failed; ptr is in 0x264d offset 1
archWalk 0x264d failed; ptr is in 0x2633 offset 48
archWalk 0x2633 failed; ptr is in 0x5b8c offset 0
archWalk 0x5b8c failed; ptr is in 0x5b8b offset 0
archWalk 0x5b8b failed; ptr is in 0x5b8a offset 0
archiveBlock 0x5b8a: error reading block 0x00303cd8
fossil: diskReadRaw failed: /dev/sdC0/fossil: score 0x00303cd8: part=label block 3161304: illegal block address
archive(0, 0x6e3b0a0a): cannot find block: error reading block 0x00303cd8
archWalk 0x6e3b0a0a failed; ptr is in 0x264d offset 1
archWalk 0x264d failed; ptr is in 0x2633 offset 48
archWalk 0x2633 failed; ptr is in 0x5b8c offset 0
archWalk 0x5b8c failed; ptr is in 0x5b8b offset 0
archWalk 0x5b8b failed; ptr is in 0x5b8a offset 0
archiveBlock 0x5b8a: error reading block 0x00303cd8
fossil: diskReadRaw failed: /dev/sdC0/fossil: score 0x00303cd8: part=label block 3161304: illegal block address
archive(0, 0x6e3b0a0a): cannot find block: error reading block 0x00303cd8
archWalk 0x6e3b0a0a failed; ptr is in 0x264d offset 1
archWalk 0x264d failed; ptr is in 0x2633 offset 48
archWalk 0x2633 failed; ptr is in 0x5b8c offset 0
archWalk 0x5b8c failed; ptr is in 0x5b8b offset 0
archWalk 0x5b8b failed; ptr is in 0x5b8a offset 0
archiveBlock 0x5b8a: error reading block 0x00303cd8
fossil: diskReadRaw failed: /dev/sdC0/fossil: score 0x00303cd8: part=label block 3161304: illegal block address
archive(0, 0x6e3b0a0a): cannot find block: error reading block 0x00303cd8
archWalk 0x6e3b0a0a failed; ptr is in 0x264d offset 1
archWalk 0x264d failed; ptr is in 0x2633 offset 48
archWalk 0x2633 failed; ptr is in 0x5b8c offset 0
archWalk 0x5b8c failed; ptr is in 0x5b8b offset 0
archWalk 0x5b8b failed; ptr is in 0x5b8a offset 0
archiveBlock 0x5b8a: error reading block 0x00303cd8
fossil: diskReadRaw failed: /dev/sdC0/fossil: score 0x00303cd8: part=label block 3161304: illegal block address
archive(0, 0x6e3b0a0a): cannot find block: error reading block 0x00303cd8
archWalk 0x6e3b0a0a failed; ptr is in 0x264d offset 1
archWalk 0x264d failed; ptr is in 0x2633 offset 48
archWalk 0x2633 failed; ptr is in 0x5b8c offset 0
archWalk 0x5b8c failed; ptr is in 0x5b8b offset 0
archWalk 0x5b8b failed; ptr is in 0x5b8a offset 0
archiveBlock 0x5b8a: error reading block 0x00303cd8


[-- Attachment #3: sossd.out --]
[-- Type: application/octet-stream, Size: 855 bytes --]

pci
0.31.1:	disk 01.01.8a 8086/27df   7 0:0000f131 16 1:0000f121 16 2:0000f111 16 3:0000f101 16 4:0000f0f1 16
0.31.2:	disk 01.01.8f 8086/27c4  11 0:0000f0e1 16 1:0000f0d1 16 2:0000f0c1 16 3:0000f0b1 16 4:0000f0a1 16 5:ffe40000 1024

sdctl
sdC ata port F0E0 ctl F0D0 irq 11

sdC0

inquiry Hitachi HTS543232L9A300
config 045A capabilities 0F00 dma 00550020 dmactl 00550020 rwm 16 rwmctl 0 lba48always on
model	Hitachi HTS543232L9A300
serial	091028FB3400CEJD266A
firm	FB4OC40C
feat	lba llba smart power nop ata8 sct 
geometry 625142448 512
missirq	0
sloop	0
irq	297603 171443
bsy	14057 14057
nildrive	126160
part data 0 625142448
part plan9 63 625137345
part 9fat 63 204863
part nvram 204863 204864
part fossil 204864 100026288
part arenas 100026288 599133412
part isect 599133412 624088769
part swap 624088769 625137345


identify device | atazz /dev/sdC0



* Re: [9fans] fossil caching venti errors
  2009-04-08 17:41         ` cinap_lenrek
@ 2009-04-08 18:18           ` Richard Miller
  0 siblings, 0 replies; 24+ messages in thread
From: Richard Miller @ 2009-04-08 18:18 UTC (permalink / raw)
  To: 9fans

> uniprocessor machine

That eliminates some lines of investigation, thank you.

> could it be that the archiver for some reason is looking
> at rolled back blocks?

I'll put in a console message when blocks are rolled back and
see if there's a correlation.





* Re: [9fans] fossil caching venti errors
  2009-04-08 17:28       ` Richard Miller
@ 2009-04-08 17:41         ` cinap_lenrek
  2009-04-08 18:18           ` Richard Miller
  0 siblings, 1 reply; 24+ messages in thread
From: cinap_lenrek @ 2009-04-08 17:41 UTC (permalink / raw)
  To: 9fans

uniprocessor machine

--
cinap





* Re: [9fans] fossil caching venti errors
  2009-04-08 14:44   ` Richard Miller
                       ` (2 preceding siblings ...)
  2009-04-08 17:01     ` cinap_lenrek
@ 2009-04-08 17:36     ` Steve Simon
  3 siblings, 0 replies; 24+ messages in thread
From: Steve Simon @ 2009-04-08 17:36 UTC (permalink / raw)
  To: 9fans

>  cd /n/dump/2009
>  for (i in *) { test -d $i$home/tmp || ls -d $i$home/tmp }
>  for (i in *) { test -f $i/mail/box/$user/mbox || ls $i/mail/box/$user/mbox }

no problems here, and my server is a dual cpu PIII.

I last built a kernel on the 11th of Feb, so if this is a very recent bug I may
have been too lazy to install it...

-Steve




* Re: [9fans] fossil caching venti errors
  2009-04-08 17:01     ` cinap_lenrek
@ 2009-04-08 17:28       ` Richard Miller
  2009-04-08 17:41         ` cinap_lenrek
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Miller @ 2009-04-08 17:28 UTC (permalink / raw)
  To: 9fans

> ls: 0216/mail/box/cinap_lenrek/mbox: '0216/mail' does not exist
> ls: 0323/mail/box/cinap_lenrek/mbox: '0323/mail/box' does not exist
>
> oops... :(

Schade.

Is this a uni- or multi-processor?





* Re: [9fans] fossil caching venti errors
  2009-04-08 14:44   ` Richard Miller
  2009-04-08 14:56     ` ron minnich
  2009-04-08 16:46     ` cinap_lenrek
@ 2009-04-08 17:01     ` cinap_lenrek
  2009-04-08 17:28       ` Richard Miller
  2009-04-08 17:36     ` Steve Simon
  3 siblings, 1 reply; 24+ messages in thread
From: cinap_lenrek @ 2009-04-08 17:01 UTC (permalink / raw)
  To: 9fans

term% for (i in *) { test -f $i/mail/box/$user/mbox || ls $i/mail/box/$user/mbox }
ls: 0216/mail/box/cinap_lenrek/mbox: '0216/mail' does not exist
ls: 0323/mail/box/cinap_lenrek/mbox: '0323/mail/box' does not exist

oops... :(

--
cinap





* Re: [9fans] fossil caching venti errors
  2009-04-08 14:44   ` Richard Miller
  2009-04-08 14:56     ` ron minnich
@ 2009-04-08 16:46     ` cinap_lenrek
  2009-04-08 17:01     ` cinap_lenrek
  2009-04-08 17:36     ` Steve Simon
  3 siblings, 0 replies; 24+ messages in thread
From: cinap_lenrek @ 2009-04-08 16:46 UTC (permalink / raw)
  To: 9fans

hm... could it be that the archiver for some reason is looking
at rolled back blocks?

to ron:

are you using temporal snapshots?  i've seen the same symptom but didn't
have the nerve to look deeper into it...  just reformatted the fossil from
the last score and disabled snapshots and snaptime.  :(

it happened while i dumped the blocks from another fossil's venti to the
same venti that the failing fossil was archiving to...  with snapshots
enabled.

have no idea if this is relevant...

keep on debugging!

--
cinap


* Re: [9fans] fossil caching venti errors
  2009-04-08 14:56     ` ron minnich
  2009-04-08 15:36       ` C H Forsyth
@ 2009-04-08 15:55       ` Richard Miller
  1 sibling, 0 replies; 24+ messages in thread
From: Richard Miller @ 2009-04-08 15:55 UTC (permalink / raw)
  To: 9fans

> archive(0, 0x4d385643): cannot find block: error reading block 0x0021cac2
> archWalk 0x4d385643 failed; ptr is in 0x2075 offset 3

0x4d385643 is supposedly a block number, which seems unlikely unless
you have a >10 terabyte disk.  Looks like you have a VtEntry or pointer
block with a corrupted score (to be precise, entry 3 in block 0x2075),
but in this case it should be the score for a local (fossil) block,
not a venti block.
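
The size estimate above can be checked with a little arithmetic. This is only a sketch: it assumes fossil's usual 8192-byte block size, which the thread does not state, and the helper name is made up for illustration.

```python
# Assumption: 8192-byte fossil blocks; a file system formatted with a
# different block size changes the result proportionally.
BLOCK_SIZE = 8192


def implied_disk_bytes(block_addr: int, block_size: int = BLOCK_SIZE) -> int:
    """Smallest partition size in bytes that could contain block_addr."""
    return (block_addr + 1) * block_size


addr = 0x4d385643  # the address archive() was asked to walk
size = implied_disk_bytes(addr)
print(f"block {addr:#x} implies a partition of at least {size / 1e12:.1f} TB")
```

Under that assumption the address only makes sense on a partition larger than 10 TB, which supports reading it as a corrupted pointer rather than a real local block number.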


* Re: [9fans] fossil caching venti errors
  2009-04-08 14:56     ` ron minnich
@ 2009-04-08 15:36       ` C H Forsyth
  2009-04-08 15:55       ` Richard Miller
  1 sibling, 0 replies; 24+ messages in thread
From: C H Forsyth @ 2009-04-08 15:36 UTC (permalink / raw)
  To: 9fans

>I'm having a continuous problem, symptom being failures in archWalk,
>but had assumed it was a hard disk getting ready to die.

i see those on one of my old drives, but i don't think
it's the drive.  i thought it might have been
caused by a power failure catching out fossil or venti,
but if you've got the same problem, perhaps not.
worse, reading the source it looked to me as though archives
might then be incomplete, because the error stops the walk,
but i haven't yet tried initialising a new fossil (perhaps
a separate, temporary one) from the venti to find out.


* Re: [9fans] fossil caching venti errors
  2009-04-08 14:44   ` Richard Miller
@ 2009-04-08 14:56     ` ron minnich
  2009-04-08 15:36       ` C H Forsyth
  2009-04-08 15:55       ` Richard Miller
  2009-04-08 16:46     ` cinap_lenrek
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 24+ messages in thread
From: ron minnich @ 2009-04-08 14:56 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I'm having a continuous problem, symptom being failures in archWalk,
but had assumed it was a hard disk getting ready to die.

fossil: diskReadRaw failed: /dev/sdC0/fossil: score 0x0021cac2: part=label block 2214594: illegal block address
archive(0, 0x4d385643): cannot find block: error reading block 0x0021cac2
archWalk 0x4d385643 failed; ptr is in 0x2075 offset 3
archWalk 0x2075 failed; ptr is in 0x2072 offset 62
archWalk 0x2072 failed; ptr is in 0x2071 offset 0
archWalk 0x2071 failed; ptr is in 0x20d2 offset 40
archWalk 0x20d2 failed; ptr is in 0x20d6 offset 0
archWalk 0x20d6 failed; ptr is in 0x20d5 offset 0
archWalk 0x20d5 failed; ptr is in 0x20d4 offset 0
archiveBlock 0x20d4: error reading block 0x0021cac2
fossil: diskReadRaw failed: /dev/sdC0/fossil: score 0x0021cac2: part=label block 2214594: illegal block address

I should know what to do about this but do not.
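
Each archWalk line in a log like the one above names a failing block and the parent block and offset that point to it, so the lines can be read as a chain from the outermost block down to the bad pointer. A minimal sketch for extracting that chain follows; the regular expression is inferred from the log lines shown here, not from the fossil source, and the function name is hypothetical.

```python
import re

# Matches lines such as: archWalk 0x2075 failed; ptr is in 0x2072 offset 62
WALK_RE = re.compile(
    r"archWalk (0x[0-9a-fA-F]+) failed; ptr is in (0x[0-9a-fA-F]+) offset (\d+)"
)


def pointer_chain(log: str):
    """Return (parent, offset, child) triples, outermost parent first."""
    links = [
        (m.group(2), int(m.group(3)), m.group(1)) for m in WALK_RE.finditer(log)
    ]
    links.reverse()  # the log prints the innermost failure first
    return links


SAMPLE = """\
archWalk 0x4d385643 failed; ptr is in 0x2075 offset 3
archWalk 0x2075 failed; ptr is in 0x2072 offset 62
archWalk 0x2072 failed; ptr is in 0x2071 offset 0
"""

for parent, offset, child in pointer_chain(SAMPLE):
    print(f"{parent}[{offset}] -> {child}")
```

The last triple printed is the pointer that holds the implausible address.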

ron.




* Re: [9fans] fossil caching venti errors
  2009-03-28 11:11 ` Charles Forsyth
                     ` (2 preceding siblings ...)
  2009-03-28 16:27   ` Nathaniel W Filardo
@ 2009-04-08 14:44   ` Richard Miller
  2009-04-08 14:56     ` ron minnich
                       ` (3 more replies)
  3 siblings, 4 replies; 24+ messages in thread
From: Richard Miller @ 2009-04-08 14:44 UTC (permalink / raw)
  To: 9fans

I've been seeing corruption of fossil /archive data, at a rate of about
once a week.  The symptom is that venti dir entries for /archive copies
of some frequently referenced directories in my main fossil fs (generally
it's /usr, /usr/miller or /mail/box/miller) contain incorrect venti scores.
Sometimes the score is for a nonexistent block, and sometimes it's a valid
score but for a different block (e.g. a data or dir block when a meta block
is expected).

Almost always it's the dir entry for a metadata block which gets a bad
score, but occasionally it's the dir entry for a sub-dir block.  This
morning both the sub-dir and meta entries for one directory were written
with the scores of the corresponding entries for a different directory:

term% ls -lqd /n/dump/2009/0408/mail /n/dump/2009/0408/usr/miller/disk
(0000000000001646  9 80) d-rwxrwxr-x M 36 upas   upas   0 Jan  1  2007 /n/dump/2009/0408/mail
(000000000000694e 23 80) d-rwxr-xr-x M 36 miller miller 0 Dec 31  2003 /n/dump/2009/0408/usr/miller/disk
term% ls /n/dump/2009/0408/mail /n/dump/2009/0408/usr/miller/disk
/n/dump/2009/0408/mail/fdisk.c
/n/dump/2009/0408/mail/fdisk.c_ok
/n/dump/2009/0408/mail/fdisk.c_try
/n/dump/2009/0408/mail/mkfs.c
/n/dump/2009/0408/usr/miller/disk/fdisk.c
/n/dump/2009/0408/usr/miller/disk/fdisk.c_ok
/n/dump/2009/0408/usr/miller/disk/fdisk.c_try
/n/dump/2009/0408/usr/miller/disk/mkfs.c
term% diff -r /n/dump/2009/0408/mail /n/dump/2009/0408/usr/miller/disk
term%

Here are the pairs of VtEntry structures for the two files as retrieved
from venti:

/n/dump/2009/0408/mail:
0000000  00000000 1ff41fe0 03000000 00000000
0000010  00000280 0830e9ae 3023dce8 3ddf6138
0000020  4fe87d59 7257e3b7 00000000 1ff42000
0000030  01000000 00000000 0000039f 21ee2227
0000040  695a8ea2 fbbd136e dcb435e0 64fbcdfb
0000050

/n/dump/2009/0408/usr/miller/disk:
0000000  00000000 1ff41fe0 03000000 00000000
0000010  000000a0 0830e9ae 3023dce8 3ddf6138
0000020  4fe87d59 7257e3b7 00000000 1ff42000
0000030  01000000 00000000 00000296 21ee2227
0000040  695a8ea2 fbbd136e dcb435e0 64fbcdfb
0000050

Note that it's just the score fields for /mail which are wrong; the
size fields (280 and 39f) are correct, matching yesterday's dump:

/n/dump/2009/0407/mail:
0000000  00000000 1ff41fe0 03000000 00000000
0000010  00000280 ...
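
The comparison above can be repeated mechanically. The sketch below decodes the two dumps and compares sizes and scores. It assumes a 40-byte packed entry laid out as gen (4 bytes), psize (2), dsize (2), flags (1), 5 zero bytes, size (6), score (20), all big-endian, and that the hex words in the dump are in byte order. That layout is my reading of the dumps and is not stated in the thread; the function names are illustrative.

```python
import struct

VT_ENTRY_SIZE = 40  # assumed packed size of one VtEntry

MAIL = """\
0000000  00000000 1ff41fe0 03000000 00000000
0000010  00000280 0830e9ae 3023dce8 3ddf6138
0000020  4fe87d59 7257e3b7 00000000 1ff42000
0000030  01000000 00000000 0000039f 21ee2227
0000040  695a8ea2 fbbd136e dcb435e0 64fbcdfb
0000050
"""

DISK = """\
0000000  00000000 1ff41fe0 03000000 00000000
0000010  000000a0 0830e9ae 3023dce8 3ddf6138
0000020  4fe87d59 7257e3b7 00000000 1ff42000
0000030  01000000 00000000 00000296 21ee2227
0000040  695a8ea2 fbbd136e dcb435e0 64fbcdfb
0000050
"""


def parse_dump(dump: str) -> bytes:
    """Turn dump lines (offset column, then hex words) into raw bytes."""
    out = bytearray()
    for line in dump.strip().splitlines():
        words = line.split()[1:]  # drop the leading offset column
        out += bytes.fromhex("".join(words))
    return bytes(out)


def unpack_entry(raw: bytes) -> dict:
    """Decode one entry under the assumed layout described above."""
    gen, psize, dsize, flags = struct.unpack(">IHHB", raw[:9])
    size = int.from_bytes(raw[14:20], "big")  # 48-bit size after 5 zero bytes
    return {"gen": gen, "psize": psize, "dsize": dsize,
            "flags": flags, "size": size, "score": raw[20:40].hex()}


def entries(dump: str) -> list:
    raw = parse_dump(dump)
    return [unpack_entry(raw[i:i + VT_ENTRY_SIZE])
            for i in range(0, len(raw), VT_ENTRY_SIZE)]


for mail, disk in zip(entries(MAIL), entries(DISK)):
    same = "same" if mail["score"] == disk["score"] else "different"
    print(f"sizes {mail['size']:#x} vs {disk['size']:#x}, scores {same}")
```

Read this way, each pair has different sizes but identical scores, which is the mismatch described in the message.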

So, what's going on?  My intuition says a fossil bug - I can't think
of any disk hardware error which would lead to this kind of corruption.
I have of course studied /sys/src/cmd/fossil/archive.c looking for
races (my fossil+venti machine has two processors), but everything
appears to be protected by locks.

I would be interested to know if anyone else's archive is being quietly
messed up in this way - you wouldn't necessarily know until you tried
something like a history(1) command and found pieces missing.  This
is a quick test which may show if you have a similar problem:

  cd /n/dump/2009
  for (i in *) { test -d $i$home/tmp || ls -d $i$home/tmp }
  for (i in *) { test -f $i/mail/box/$user/mbox || ls $i/mail/box/$user/mbox }
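
For anyone checking a dump tree from a host without rc, the same test can be sketched in Python. The function name, the user name glenda and the throwaway directory tree are all illustrative; on a real system the first argument would be the mounted dump directory such as /n/dump/2009.

```python
import os
import tempfile


def missing_in_dumps(dump_root: str, relpath: str, want_dir: bool) -> list:
    """List paths under each dump day where relpath is absent or the wrong type."""
    check = os.path.isdir if want_dir else os.path.isfile
    bad = []
    for day in sorted(os.listdir(dump_root)):
        target = os.path.join(dump_root, day, relpath.lstrip("/"))
        if not check(target):
            bad.append(target)
    return bad


# Demonstration on a temporary tree standing in for the dump file system.
with tempfile.TemporaryDirectory() as root:
    good_box = os.path.join(root, "0407", "mail", "box", "glenda")
    os.makedirs(good_box)
    open(os.path.join(good_box, "mbox"), "w").close()
    os.makedirs(os.path.join(root, "0408"))  # this day has lost /mail entirely
    demo_missing = [
        os.path.relpath(p, root)
        for p in missing_in_dumps(root, "mail/box/glenda/mbox", want_dir=False)
    ]

print(demo_missing)
```

Like the rc loops, this only detects entries that have vanished or changed type; a directory that silently points at another directory's contents would still pass.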

In a post a while ago, Russ said
> The amazing thing to me about fossil is how indestructible
> it is when used with venti.
> ... Once you see the data in the archive
> tree, you can be very sure it's not going away.

I agree with this, but I'd like a way to be reassured that my daily
data has actually gone into the archive correctly.

-- Richard


* Re: [9fans] fossil caching venti errors
  2009-03-30 19:06     ` lucio
@ 2009-03-30 19:15       ` erik quanstrom
  0 siblings, 0 replies; 24+ messages in thread
From: erik quanstrom @ 2009-03-30 19:15 UTC (permalink / raw)
  To: lucio, 9fans

On Mon Mar 30 15:10:25 EDT 2009, lucio@proxima.alt.za wrote:
> > never mind. i think it's not a sign of the problem we were discussing,
> > but possibly something is simply down.
>
> Then the error message needs some tidying up.  It happened to me too
> and coincided with a total failure for "replica".  Sigh!

i submitted some changes to replica several years ago which should
cause replica to abort when the remote fs has gone into casters
up mode.  the changes are worth considering if you depend on
replica.  normally, contrib/install quanstro/replica but also
http://sources.coraid.com/sources/contrib/quanstro/replica/replica
http://sources.coraid.com/sources/contrib/quanstro/root/sys/src/replica

- erik


* Re: [9fans] fossil caching venti errors
  2009-03-30 11:55   ` C H Forsyth
@ 2009-03-30 19:06     ` lucio
  2009-03-30 19:15       ` erik quanstrom
  0 siblings, 1 reply; 24+ messages in thread
From: lucio @ 2009-03-30 19:06 UTC (permalink / raw)
  To: 9fans

> never mind. i think it's not a sign of the problem we were discussing,
> but possibly something is simply down.

Then the error message needs some tidying up.  It happened to me too
and coincided with a total failure for "replica".  Sigh!

++L





* Re: [9fans] fossil caching venti errors
@ 2009-03-30 16:19 Pavel Klinkovsky
  0 siblings, 0 replies; 24+ messages in thread
From: Pavel Klinkovsky @ 2009-03-30 16:19 UTC (permalink / raw)
  To: 9fans

> md5sum: error reading /n/sources/plan9/sys/src/boot/pc/ether82563.c: venti i/o error or wrong score, block 72804bdc64d1a772cd4b2eaeda6f1e1b8f175b21

I have a similar problem with 'pull'.
And the last update even ruined the NDB binaries... :-(

Pavel




* Re: [9fans] fossil caching venti errors
  2009-03-30 11:45 ` C H Forsyth
@ 2009-03-30 11:55   ` C H Forsyth
  2009-03-30 19:06     ` lucio
  0 siblings, 1 reply; 24+ messages in thread
From: C H Forsyth @ 2009-03-30 11:55 UTC (permalink / raw)
  To: 9fans

never mind. i think it's not a sign of the problem we were discussing,
but possibly something is simply down.


* Re: [9fans] fossil caching venti errors
  2009-03-28 20:38 erik quanstrom
@ 2009-03-30 11:45 ` C H Forsyth
  2009-03-30 11:55   ` C H Forsyth
  0 siblings, 1 reply; 24+ messages in thread
From: C H Forsyth @ 2009-03-30 11:45 UTC (permalink / raw)
  To: 9fans

doppio% 9fs sources
post...
doppio% md5sum /n/sources/plan9/sys/src/boot/pc/ether82563.c
md5sum: error reading /n/sources/plan9/sys/src/boot/pc/ether82563.c: venti i/o error or wrong score, block 72804bdc64d1a772cd4b2eaeda6f1e1b8f175b21

hmmm.  it's not just my system.


* Re: [9fans] fossil caching venti errors
@ 2009-03-28 20:38 erik quanstrom
  2009-03-30 11:45 ` C H Forsyth
  0 siblings, 1 reply; 24+ messages in thread
From: erik quanstrom @ 2009-03-28 20:38 UTC (permalink / raw)
  To: 9fans

sorry if this is a dupe.  my original reply seems to
have gone missing due to a local mishap.

> venti/read should be doing this check automatically since libventi/client.c
> builds with "int ventidoublechecksha1 = 1;" by default.

yet you're reporting that fossil thinks the score does not
match.  this is a conundrum.  either fossil is wrong or
venti is.  it would be good to have some data to help sort
this out.

- erik


* Re: [9fans] fossil caching venti errors
  2009-03-28 17:31     ` erik quanstrom
@ 2009-03-28 18:40       ` Nathaniel W Filardo
  0 siblings, 0 replies; 24+ messages in thread
From: Nathaniel W Filardo @ 2009-03-28 18:40 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sat, Mar 28, 2009 at 01:31:01PM -0400, erik quanstrom wrote:
> it would be interesting to know if the score
> of the block returned by venti/read is correct.

venti/read should be doing this check automatically since libventi/client.c
builds with "int ventidoublechecksha1 = 1;" by default.

--nwf;


* Re: [9fans] fossil caching venti errors
  2009-03-28 16:27   ` Nathaniel W Filardo
@ 2009-03-28 17:31     ` erik quanstrom
  2009-03-28 18:40       ` Nathaniel W Filardo
  0 siblings, 1 reply; 24+ messages in thread
From: erik quanstrom @ 2009-03-28 17:31 UTC (permalink / raw)
  To: 9fans

> AFAIK the disk is doing just fine.  Moreover, even during the period when
> fossil is complaining, venti/read on 9fs's score works just fine.  So I
> don't believe the fault is venti's.

i don't believe that conclusion is warranted.
/sys/src/cmd/fossil/cache.c:683,684
is where this condition gets set.  so either
the read fails or the score or length is bad.
%r is not set (see a few lines down) so when
combined with this report:

> This is likely too large a hammer, but when this happens I rebuild the venti index
> so that I can get past the issue.  I see this more under Plan 9 than p9p.  The
> block in error always exists in an arena and a checkarenas reports no errors.
> The problem usually persists across reboots until I reconstitute the index.

it's reasonable to guess that the block returned
might not be the right one.

in principle, this could be a drive failure,
bad memory or a venti bug. i don't have a
lot of venti experience, but i think this
/sys/src/cmd/venti/srv/lump.c:226,230
is where venti reads and it seems to ensure
that the initial read double-checks scores.
it would be 1e-80 hard for a drive error
to sneak by, so that leaves us with memory
errors or venti cache bugs.

it's hard to see how reindexing would fix
a cache bug, though.  so maybe i'm all wet.

it would be interesting to know if the score
of the block returned by venti/read is correct.
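
One way to gather that data point is to hash the returned block and compare it with the score that was requested. The sketch below assumes a score is the SHA-1 of the block bytes exactly as returned; the function names are made up, and getting the block into a byte string (for example by saving the output of venti/read to a file) is left to the reader.

```python
import hashlib


def score_of(block: bytes) -> str:
    """Hex SHA-1 of a block, in the 40-digit form the error messages print."""
    return hashlib.sha1(block).hexdigest()


def matches_score(block: bytes, expected: str) -> bool:
    """True when the block hashes to the expected score (case-insensitive)."""
    return score_of(block) == expected.strip().lower()


# Illustration with a well-known SHA-1 test vector rather than real venti data.
print(matches_score(b"abc", "a9993e364706816aba3e25717850c26c9cd0d89d"))
```

A mismatch on a block that venti/read returned without complaint would point at the server or its index; a match would shift suspicion toward fossil or the path between the two.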

- erik


* Re: [9fans] fossil caching venti errors
  2009-03-28 11:11 ` Charles Forsyth
  2009-03-28 14:47   ` david bulkow
  2009-03-28 15:13   ` erik quanstrom
@ 2009-03-28 16:27   ` Nathaniel W Filardo
  2009-03-28 17:31     ` erik quanstrom
  2009-04-08 14:44   ` Richard Miller
  3 siblings, 1 reply; 24+ messages in thread
From: Nathaniel W Filardo @ 2009-03-28 16:27 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sat, Mar 28, 2009 at 11:11:25AM +0000, Charles Forsyth wrote:
> i've seen that just recently, but thought it might have been
> a failing (very old) drive, or a power failure beyond the endurance of the UPS.
> if neither of those are true in your case, it might be worth a deeper look.
> i also found there is a persistent problem that `check fix' won't fix.
> (since it's my principal file server at home, i can't easily investigate more
> until i can transfer the service to a new drive, leaving the old drive
> for experiment.)

AFAIK the disk is doing just fine.  Moreover, even during the period when
fossil is complaining, venti/read on 9fs's score works just fine.  So I
don't believe the fault is venti's.

--nwf;


* Re: [9fans] fossil caching venti errors
  2009-03-28 11:11 ` Charles Forsyth
  2009-03-28 14:47   ` david bulkow
@ 2009-03-28 15:13   ` erik quanstrom
  2009-03-28 16:27   ` Nathaniel W Filardo
  2009-04-08 14:44   ` Richard Miller
  3 siblings, 0 replies; 24+ messages in thread
From: erik quanstrom @ 2009-03-28 15:13 UTC (permalink / raw)
  To: 9fans

On Sat Mar 28 06:54:35 EDT 2009, forsyth@terzarima.net wrote:
> i've seen that just recently, but thought it might have been
> a failing (very old) drive, or a power failure beyond the endurance of the UPS.
> if neither of those are true in your case, it might be worth a deeper look.
> i also found there is a persistent problem that `check fix' won't fix.
> (since it's my principal file server at home, i can't easily investigate more
> until i can transfer the service to a new drive, leaving the old drive
> for experiment.)

i hope this is useful information, but i fear it might not be.

if the very old disk in question is a scsi disk, disk/smart in my
contrib area should be gentle enough to run on a live server
with any kernel.

if the very old disk in question is an ata disk, disk/smart requires
the sd changes in my contrib area to run raw ata commands
like smart return status.  it should also be gentle enough to run
on a live server.

unfortunately, disk/smart is not smart enough to access ata
general purpose logging information (gpl), yet.  manufacturer
diagnostic tools are still helpful here but require booting into
dos.

by the way, i think there are a number of interesting gsoc projects
related to devsd and ata.  for example, a devsd device that adds
checksumming to another devsd device.  gpl support could be
done entirely outside the kernel.

- erik


* Re: [9fans] fossil caching venti errors
  2009-03-28 11:11 ` Charles Forsyth
@ 2009-03-28 14:47   ` david bulkow
  2009-03-28 15:13   ` erik quanstrom
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: david bulkow @ 2009-03-28 14:47 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

This is likely too large a hammer, but when this happens I rebuild the venti index
so that I can get past the issue.  I see this more under Plan 9 than p9p.  The
block in error always exists in an arena and a checkarenas reports no errors.
The problem usually persists across reboots until I reconstitute the index.


* Re: [9fans] fossil caching venti errors
  2009-03-28  3:48 Nathaniel W Filardo
@ 2009-03-28 11:11 ` Charles Forsyth
  2009-03-28 14:47   ` david bulkow
                     ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Charles Forsyth @ 2009-03-28 11:11 UTC (permalink / raw)
  To: 9fans

i've seen that just recently, but thought it might have been
a failing (very old) drive, or a power failure beyond the endurance of the UPS.
if neither of those are true in your case, it might be worth a deeper look.
i also found there is a persistent problem that `check fix' won't fix.
(since it's my principal file server at home, i can't easily investigate more
until i can transfer the service to a new drive, leaving the old drive
for experiment.)


* [9fans] fossil caching venti errors
@ 2009-03-28  3:48 Nathaniel W Filardo
  2009-03-28 11:11 ` Charles Forsyth
  0 siblings, 1 reply; 24+ messages in thread
From: Nathaniel W Filardo @ 2009-03-28  3:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Entertaining.  Indented lines are fossil console; nonindented are at a
normal CPU prompt.

cpu% 9fs
9fs: venti i/o error or wrong score, block 558b88fbae4e0aa894c614fb3eeccf4d2f7492ca

        main: venti tcp!xxx.xxx.xxx.xxx!xxxxx

cpu% 9fs
9fs: venti i/o error or wrong score, block 558b88fbae4e0aa894c614fb3eeccf4d2f7492ca

        main: df
          main: 741,801,984 used + 38,548,643,840 free = 39,290,445,824 (1% used)

cpu% 9fs
usage: 9fs service [mountpoint]

Something about this seems wrong.  Suggestions?
--nwf;

