9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] fossil/venti falling down?
@ 2009-06-18 16:01 John Floren
  2009-06-18 16:21 ` John Floren
  2009-06-24 17:43 ` John Floren
  0 siblings, 2 replies; 23+ messages in thread
From: John Floren @ 2009-06-18 16:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Our Coraid device recently lost two disks from the RAID5
configuration; while we were able to rebuild from instructions given
by support, I suspect some small amount of data was corrupted.

Since rebuilding the device a few days ago, every morning I have
returned to work to find my CPU/auth/file server in a classic "lost my
file system" state--not locked, but trying to run any command causes
it to hang. Also, files have been corrupted--here's the top bit of a
copy of /sys/src/cmd/rio/fsys.c that I was working on:
\x02�\x12
�\x1e\x10\x104\x01"\x1e\a�C�/� TEXBASE1ENCODING + DVIPSENCODING\vTEX-PTMRI8R\x02�H
}\x0fm\x06�����\x0f\x0f�\x14a\x14a\x0f�\x01�\x1a�@\x1c�8\x16�\x01T�\x01
\f�(\x0f�X\x0f�\x19�\x18�\x01�\x0f�\x15�\x03&\x010\x01\x16\x03!\x01\x17\x06��\x0f�\x0f�\x0f�\x0f�\x0f�\x0f�\x0f�<\x0f�T\x0f�\x0ff\x15�\x15P\x15�\x0f�\x01\x19^[�\x13�\x01\x1a\x13�\x01*\x14�@\x16�\x01.\x13�D\x13�Q4\x16�\x01<\x16��t\x0e�m>\x14�}E\x12�
J\x19�X\x14��Q\x16�\x01T\x13�\x01\\x16�\x01b\x13�\x01e\x0f�\x1dk\x12��m\x16�a}\x13���\x19���\x13�d\x12���\x12�p

Any ideas? I thought maybe fossil was choking somewhere, maybe bad
info on venti?

Thanks

John
-- 
"I've tried programming Ruby on Rails, following TechCrunch in my RSS
reader, and drinking absinthe. It doesn't work. I'm going back to C,
Hunter S. Thompson, and cheap whiskey." -- Ted Dziuba


* Re: [9fans] fossil/venti falling down?
  2009-06-18 16:01 [9fans] fossil/venti falling down? John Floren
@ 2009-06-18 16:21 ` John Floren
  2009-06-18 16:25   ` erik quanstrom
  2009-06-21 11:57   ` Richard Miller
  2009-06-24 17:43 ` John Floren
  1 sibling, 2 replies; 23+ messages in thread
From: John Floren @ 2009-06-18 16:21 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 18, 2009 at 9:01 AM, John Floren <slawmaster@gmail.com> wrote:
>
> Our Coraid device recently lost two disks from the RAID5
> configuration; while we were able to rebuild from instructions given
> by support, I suspect some small amount of data was corrupted.
>
> Since rebuilding the device a few days ago, every morning I have
> returned to work to find my CPU/auth/file server in a classic "lost my
> file system" state--not locked, but trying to run any command causes
> it to hang. Also, files have been corrupted--here's the top bit of a
> copy of /sys/src/cmd/rio/fsys.c that I was working on:
> [corrupted binary data; readable fragments: TEXBASE1ENCODING + DVIPSENCODING, TEX-PTMRI8R]
>
> Any ideas? I thought maybe fossil was choking somewhere, maybe bad
> info on venti?
>
> Thanks
>
> John

Forgot to add that I've only seen one error on the console during all of this:
/boot/fossil: could not write super block; waiting 10 seconds
/boot/fossil: blistAlloc: called on clean block.

John

* Re: [9fans] fossil/venti falling down?
  2009-06-18 16:21 ` John Floren
@ 2009-06-18 16:25   ` erik quanstrom
  2009-06-18 16:30     ` John Floren
  2009-06-21 11:57   ` Richard Miller
  1 sibling, 1 reply; 23+ messages in thread
From: erik quanstrom @ 2009-06-18 16:25 UTC (permalink / raw)
  To: 9fans

> Forgot to add that I've only seen one error on the console during all of this:
> /boot/fossil: could not write super block; waiting 10 seconds
> /boot/fossil: blistAlloc: called on clean block.

is that once, or every time?

- erik




* Re: [9fans] fossil/venti falling down?
  2009-06-18 16:25   ` erik quanstrom
@ 2009-06-18 16:30     ` John Floren
  2009-06-18 16:45       ` erik quanstrom
  0 siblings, 1 reply; 23+ messages in thread
From: John Floren @ 2009-06-18 16:30 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 18, 2009 at 9:25 AM, erik quanstrom <quanstro@quanstro.net> wrote:
>
> > Forgot to add that I've only seen one error on the console during all of this:
> > /boot/fossil: could not write super block; waiting 10 seconds
> > /boot/fossil: blistAlloc: called on clean block.
>
> is that once, or every time?
>
> - erik
>

It seems to only happen once per boot, but not necessarily when fossil
starts responding--I've seen it a couple hours after booting, whereas
the filesystem tends to go away at night.

John

* Re: [9fans] fossil/venti falling down?
  2009-06-18 16:30     ` John Floren
@ 2009-06-18 16:45       ` erik quanstrom
  2009-06-18 17:10         ` John Floren
  0 siblings, 1 reply; 23+ messages in thread
From: erik quanstrom @ 2009-06-18 16:45 UTC (permalink / raw)
  To: 9fans

> It seems to only happen once per boot, but not necessarily when fossil
> starts responding--I've seen it a couple hours after booting, whereas
> the filesystem tends to go away at night.

the failure is somewhere in blockWrite.  since blockWrite
calls diskWrite and diskWrite just queues up i/o to send
to the disk, it's not possible to get i/o errors directly from
blockWrite.

there are two cases that do return errors.

one is if the block can't be locked.  a runaway periodic function
would make that more likely, since we don't wait for the lock.
but it seems more likely in this case that some of fossil's data is
corrupted since this started after the double-failure.
see http://9fans.net/archive/2009/03/487

the other case is a funny dependency.  there's a fprint there
that's commented out.
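
The first case can be pictured with a try-lock. The sketch below is not
fossil's code; it is a small Python analogy, assuming threading.Lock
stands in for the block lock and the hypothetical block_write stands in
for a write that gives up instead of waiting:

```python
import threading

def block_write(lock, queue, block):
    """Queue a block for writing, but only if its lock is free right now.

    Returns False without waiting when someone else holds the lock,
    which is the "can't be locked" failure described above.
    """
    if not lock.acquire(blocking=False):   # try-lock: never waits
        return False
    try:
        queue.append(block)                # stand-in for queueing i/o to the disk
        return True
    finally:
        lock.release()

lock = threading.Lock()
queue = []

# Lock is free: the write is queued.
ok_free = block_write(lock, queue, "super")

# Lock is held by another party (simulated here): the write is refused.
lock.acquire()
ok_held = block_write(lock, queue, "super")
lock.release()

print(ok_free, ok_held, queue)   # True False ['super']
```

A caller that gets False back has to retry later, which would fit the
"waiting 10 seconds" in the console message, though that reading is a
guess from the message text alone.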

- erik




* Re: [9fans] fossil/venti falling down?
  2009-06-18 16:45       ` erik quanstrom
@ 2009-06-18 17:10         ` John Floren
  2009-06-24 17:06           ` John Floren
  0 siblings, 1 reply; 23+ messages in thread
From: John Floren @ 2009-06-18 17:10 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 18, 2009 at 9:45 AM, erik quanstrom <quanstro@quanstro.net> wrote:
>
> > It seems to only happen once per boot, but not necessarily when fossil
> > starts responding--I've seen it a couple hours after booting, whereas
> > the filesystem tends to go away at night.
>
> the failure is somewhere in blockWrite.  since blockWrite
> calls diskWrite and diskWrite just queues up i/o to send
> to the disk, it's not possible to get i/o errors directly from
> blockWrite.
>
> there are two cases that do return errors.
>
> one is if the block can't be locked.  a runaway periodic function
> would make that more likely, since we don't wait for the lock.
> but it seems more likely in this case that some of fossil's data is
> corrupted since this started after the double-failure.
> see http://9fans.net/archive/2009/03/487
>
> the other case is a funny dependency.  there's a fprint there
> that's commented out.
>
> - erik
>

Here's another message that may be of interest. I ran fshalt before
rebooting (to test the periodicthread patch) and saw this:

syncing.../srv/fscons...prompt: sourceRoot: fs->ehi = 5395, b->l =
BtDir,3,Copied,e=5394,-1,tag=0x1
venti...
halting.../srv/fscons...archive vac:a9d9b0b9fe0db783fe618f680804a18df532a67a

I don't remember seeing that "sourceRoot: ..." stuff before; as soon
as the system comes back up I guess I'll take a look at source.

John

* Re: [9fans] fossil/venti falling down?
  2009-06-18 16:21 ` John Floren
  2009-06-18 16:25   ` erik quanstrom
@ 2009-06-21 11:57   ` Richard Miller
  2009-06-21 12:12     ` Steve Simon
  2009-06-21 14:11     ` erik quanstrom
  1 sibling, 2 replies; 23+ messages in thread
From: Richard Miller @ 2009-06-21 11:57 UTC (permalink / raw)
  To: 9fans

> Forgot to add that I've only seen one error on the console during all of this:
> /boot/fossil: could not write super block; waiting 10 seconds
> /boot/fossil: blistAlloc: called on clean block.

I get a few of these nearly every day.  I've been assuming they are benign.





* Re: [9fans] fossil/venti falling down?
  2009-06-21 11:57   ` Richard Miller
@ 2009-06-21 12:12     ` Steve Simon
  2009-06-21 14:11     ` erik quanstrom
  1 sibling, 0 replies; 23+ messages in thread
From: Steve Simon @ 2009-06-21 12:12 UTC (permalink / raw)
  To: 9fans

> /boot/fossil: could not write super block; waiting 10 seconds
> /boot/fossil: blistAlloc: called on clean block.

I have had a few a day for the last 5 years on my home server, and one a week
on the work machine... I have always ignored them.

-Steve




* Re: [9fans] fossil/venti falling down?
  2009-06-21 11:57   ` Richard Miller
  2009-06-21 12:12     ` Steve Simon
@ 2009-06-21 14:11     ` erik quanstrom
  2009-06-21 19:50       ` Josh Wood
  1 sibling, 1 reply; 23+ messages in thread
From: erik quanstrom @ 2009-06-21 14:11 UTC (permalink / raw)
  To: 9fans

On Sun Jun 21 07:59:52 EDT 2009, 9fans@hamnavoe.com wrote:
> > Forgot to add that I've only seen one error on the console during all of this:
> > /boot/fossil: could not write super block; waiting 10 seconds
> > /boot/fossil: blistAlloc: called on clean block.
>
> I get a few of these nearly every day.  I've been assuming they are benign.

this error sets off alarms for me.  the comment in the
code is "BUG", and naively, i can't work out
a) why the author added that comment,
b) what superWrite would be competing with for
a lock on the superblock (blockWrite returns 0 if
it !vtCanLock from _cacheLocalLookup), and
c) why superWrite isn't doing a vtLock rather than
a vtCanLock, and
d) what happens if b is still dirty (as per comment)
and a crash occurs.

can someone explain why these are all okay?

- erik




* Re: [9fans] fossil/venti falling down?
  2009-06-21 14:11     ` erik quanstrom
@ 2009-06-21 19:50       ` Josh Wood
  0 siblings, 0 replies; 23+ messages in thread
From: Josh Wood @ 2009-06-21 19:50 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


On Jun 21, 2009, at 7:11 AM, erik quanstrom wrote:

> On Sun Jun 21 07:59:52 EDT 2009, 9fans@hamnavoe.com wrote:
>>> Forgot to add that I've only seen one error on the console during
>>> all of this:
>>> /boot/fossil: could not write super block; waiting 10 seconds
>>> /boot/fossil: blistAlloc: called on clean block.
>>
>> I get a few of these nearly every day.  I've been assuming they are
>> benign.
>
> this error sets off alarms for me.

This diagnostic has concerned me in the past as well. It is mentioned
here:
http://plan9.bell-labs.com/wiki/plan9/Console_messages/index.html
but that page does not explain why it "should not be cause for concern,"
or otherwise address erik's questions a-d. If we work out those
answers, adding them to that page would be good.

-Josh





* Re: [9fans] fossil/venti falling down?
  2009-06-18 17:10         ` John Floren
@ 2009-06-24 17:06           ` John Floren
  0 siblings, 0 replies; 23+ messages in thread
From: John Floren @ 2009-06-24 17:06 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 18, 2009 at 10:10 AM, John Floren<slawmaster@gmail.com> wrote:
> On Thu, Jun 18, 2009 at 9:45 AM, erik quanstrom <quanstro@quanstro.net> wrote:
>>
>> > It seems to only happen once per boot, but not necessarily when fossil
>> > starts responding--I've seen it a couple hours after booting, whereas
>> > the filesystem tends to go away at night.
>>
>> the failure is somewhere in blockWrite.  since blockWrite
>> calls diskWrite and diskWrite just queues up i/o to send
>> to the disk, it's not possible to get i/o errors directly from
>> blockWrite.
>>
>> there are two cases that do return errors.
>>
>> one is if the block can't be locked.  a runaway periodic function
>> would make that more likely, since we don't wait for the lock.
>> but it seems more likely in this case that some of fossil's data is
>> corrupted since this started after the double-failure.
>> see http://9fans.net/archive/2009/03/487
>>
>> the other case is a funny dependency.  there's a fprint there
>> that's commented out.
>>
>> - erik
>>
>
> Here's another message that may be of interest. I ran fshalt before
> rebooting (to test the periodicthread patch) and saw this:
>
> syncing.../srv/fscons...prompt: sourceRoot: fs->ehi = 5395, b->l =
> BtDir,3,Copied,e=5394,-1,tag=0x1
> venti...
> halting.../srv/fscons...archive vac:a9d9b0b9fe0db783fe618f680804a18df532a67a
>
> I don't remember seeing that "sourceRoot: ..." stuff before; as soon
> as the system comes back up I guess I'll take a look at source.
>

After replacing the problematic server and moving the fossil disk to
the new machine, we're not getting random hangs any more.

However, I've seen this a few times on the console:

/boot/fossil: cacheLocalData: addr=78989 type got 0 exp 0: tag got e63eb942 exp 663eb942
archive(0, 0x1348d): cannot find block: block label mismatch

and

/boot/fossil: cacheLocalData: addr=134772 type got 0 exp 0: tag got 7795335e exp 7715335e

is this something to worry about?

John

* Re: [9fans] fossil/venti falling down?
  2009-06-18 16:01 [9fans] fossil/venti falling down? John Floren
  2009-06-18 16:21 ` John Floren
@ 2009-06-24 17:43 ` John Floren
  2009-06-24 19:09   ` cinap_lenrek
  1 sibling, 1 reply; 23+ messages in thread
From: John Floren @ 2009-06-24 17:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 18, 2009 at 9:01 AM, John Floren <slawmaster@gmail.com> wrote:
>
> Our Coraid device recently lost two disks from the RAID5
> configuration; while we were able to rebuild from instructions given
> by support, I suspect some small amount of data was corrupted.
>
> Since rebuilding the device a few days ago, every morning I have
> returned to work to find my CPU/auth/file server in a classic "lost my
> file system" state--not locked, but trying to run any command causes
> it to hang. Also, files have been corrupted--here's the top bit of a
> copy of /sys/src/cmd/rio/fsys.c that I was working on:
> [corrupted binary data; readable fragments: TEXBASE1ENCODING + DVIPSENCODING, TEX-PTMRI8R]
>
> Any ideas? I thought maybe fossil was choking somewhere, maybe bad
> info on venti?
>
> Thanks
>
> John
> --

Wow. Did a "fsys main check pdir fix" on fossil console and saw this
over the serial line:

/boot/fossil: cacheLocalData: addr=78989 type got 0 exp 0: tag got e63eb942 exp 663eb942
/boot/fossil: cacheLocalData: addr=99457 type got 0 exp 0: tag got 150daf85 exp 150daf05
/boot/fossil: cacheLocalData: addr=68651 type got 0 exp 0: tag got 66be7fe5 exp 663e7fe5
/boot/fossil: cacheLocalData: addr=166723 type got 0 exp 0: tag got 3eabf0f9 exp 3eabf079
/boot/fossil: cacheLocalData: addr=134772 type got 0 exp 0: tag got 7795335e exp 7715335e
/boot/fossil: cacheLocalData: addr=155039 type got 0 exp 0: tag got 19383bf exp 11383bf
/boot/fossil: cacheLocalData: addr=155167 type got 0 exp 0: tag got 19383bf exp 11383bf
/boot/fossil: cacheLocalData: addr=155231 type got 0 exp 0: tag got 19383bf exp 11383bf
/boot/fossil: cacheLocalData: addr=57147 type got 0 exp 8: tag got 7bea015b exp 1ef0c892
/boot/fossil: labelUnpack: bad label: 0x09 0x80 0x00000071 0x00000071 0x75a49da9
2009/0624 17:36:15 err 2: pre-copy sha1 wrong at arenas09 352136039:
expected=0577f95b51dd0ecd0b922e4a9ddb9e825a0f8422
got=52b4f746ce46c9eaf5ef5f3f50b8a90ca5368e6d
2009/0624 17:36:15 err 2: loading clump: corrupted at arenas09
352136039; expected=0577f95b51dd0ecd0b922e4a9ddb9e825a0f8422
got=52b4f746ce46c9eaf5ef5f3f50b8a90ca5368e6d
2009/0624 17:36:15 err 2: pre-copy sha1 wrong at arenas09 352284179:
expected=b255ea67eae36e0957cb92eec3aa5a6ee5e4b090
got=89adc89d298be7a0a921dd9117a3ecdd2ae39693
2009/0624 17:36:15 err 2: loading clump: corrupted at arenas09
352284179; expected=b255ea67eae36e0957cb92eec3aa5a6ee5e4b090
got=89adc89d298be7a0a921dd9117a3ecdd2ae39693
2009/0624 17:36:15 err 2: pre-copy sha1 wrong at arenas09 352465239:
expected=68899de88e1ef0c70a477dbf0caa6b71f8beaec4
got=d9f4b4fcfadf87a892a683e71eb80bd7c7742cb3
2009/0624 17:36:15 err 2: loading clump: corrupted at arenas09
352465239; expected=68899de88e1ef0c70a477dbf0caa6b71f8beaec4
got=d9f4b4fcfadf87a892a683e71eb80bd7c7742cb3
2009/0624 17:36:15 err 2: pre-copy sha1 wrong at arenas09 354350879:
expected=9cf5f41c091edacc68352017712c71d7a0dfcf5a
got=1f11661042a123d1a9808c868845d498fb339fcc
2009/0624 17:36:15 err 2: loading clump: corrupted at arenas09
354350879; expected=9cf5f41c091edacc68352017712c71d7a0dfcf5a
got=1f11661042a123d1a9808c868845d498fb339fcc
2009/0624 17:36:15 err 2: pre-copy sha1 wrong at arenas09 354959899:
expected=caf5ffc383d2eaf02fc96efb87760ec28448eebc
got=15dd16b127d15d072c2403cf99d72c5c2b60ca80
2009/0624 17:36:15 err 2: loading clump: corrupted at arenas09
354959899; expected=caf5ffc383d2eaf02fc96efb87760ec28448eebc
got=15dd16b127d15d072c2403cf99d72c5c2b60ca80
2009/0624 17:36:15 err 2: pre-copy sha1 wrong at arenas09 355470147:
expected=f6a723292b8262cff81fd582747d8ca5e347f45d
got=16cadb318198307d543df7f3ad14a95f0758672b
2009/0624 17:36:15 err 2: loading clump: corrupted at arenas09
355470147; expected=f6a723292b8262cff81fd582747d8ca5e347f45d
got=16cadb318198307d543df7f3ad14a95f0758672b
2009/0624 17:36:16 err 2: pre-copy sha1 wrong at arenas09 357371276:
expected=0c5b77207e98339d8a7a32365cf0a8a832dd516c
got=f757d80f9517d188c3141b310208498659d378e9
2009/0624 17:36:16 err 2: loading clump: corrupted at arenas09
357371276; expected=0c5b77207e98339d8a7a32365cf0a8a832dd516c
got=f757d80f9517d188c3141b310208498659d378e9
2009/0624 17:36:16 err 2: pre-copy sha1 wrong at arenas09 357544105:
expected=b353ac1d52f935b2d85f19e918bad0656f12bbb7
got=ed170cf4168bbee1c145b174664e505dcb679bc7
2009/0624 17:36:16 err 2: loading clump: corrupted at arenas09
357544105; expected=b353ac1d52f935b2d85f19e918bad0656f12bbb7
got=ed170cf4168bbee1c145b174664e505dcb679bc7
2009/0624 17:36:16 err 2: pre-copy sha1 wrong at arenas09 357634635:
expected=604aa628095f64176d4c5a9ec438d472b707e878
got=6d22e6b1c95675676907fc001eac9ba97e0cd983
2009/0624 17:36:16 err 2: loading clump: corrupted at arenas09
357634635; expected=604aa628095f64176d4c5a9ec438d472b707e878
got=6d22e6b1c95675676907fc001eac9ba97e0cd983
2009/0624 17:36:16 err 2: pre-copy sha1 wrong at arenas09 357659325:
expected=637ae3d28a3634ca58d30e55e7766f29faefaab4
got=dfeabe2212f7f9f598b8624ba33f6f5bfba51fb0
2009/0624 17:36:16 err 2: loading clump: corrupted at arenas09
357659325; expected=637ae3d28a3634ca58d30e55e7766f29faefaab4
got=dfeabe2212f7f9f598b8624ba33f6f5bfba51fb0
2009/0624 17:36:16 err 2: pre-copy sha1 wrong at arenas09 359083103:
expected=63cab5c3b87aade05fa242e9bf5ec1a44241bd03
got=0d232292c70bf214b570fbce354802022ab4c829
2009/0624 17:36:16 err 2: loading clump: corrupted at arenas09
359083103; expected=63cab5c3b87aade05fa242e9bf5ec1a44241bd03
got=0d232292c70bf214b570fbce354802022ab4c829
2009/0624 17:36:16 err 2: pre-copy sha1 wrong at arenas09 359330003:
expected=1e90a1c479d65bb18533f21451c986fabc8eb9a5
got=874407a20841a923569e8c9fe1e2e65eb4712a31
2009/0624 17:36:16 err 2: loading clump: corrupted at arenas09
359330003; expected=1e90a1c479d65bb18533f21451c986fabc8eb9a5
got=874407a20841a923569e8c9fe1e2e65eb4712a31
2009/0624 17:36:16 err 2: pre-copy sha1 wrong at arenas09 359856723:
expected=194f966e4c0814592c38ff70ad137deb6b2cb806
got=56aeec2fbf2f22db26dfe63e920fdcd64bc39273
2009/0624 17:36:16 err 2: loading clump: corrupted at arenas09
359856723; expected=194f966e4c0814592c38ff70ad137deb6b2cb806
got=56aeec2fbf2f22db26dfe63e920fdcd64bc39273
2009/0624 17:36:16 err 2: pre-copy sha1 wrong at arenas09 360852553:
expected=d79a08f04802893e2c4f408daf3d7a49901d3b91
got=8f344e9c738483c8d7d29cb46608b4e5434198c7
2009/0624 17:36:16 err 2: loading clump: corrupted at arenas09
360852553; expected=d79a08f04802893e2c4f408daf3d7a49901d3b91
got=8f344e9c738483c8d7d29cb46608b4e5434198c7
2009/0624 17:36:16 err 2: pre-copy sha1 wrong at arenas09 361428653:
expected=a066a808d8ccee2a0badd981ac82cb0e64efa65c
got=16b4c0d1e339279833b1387b6e58c051c282f59b
2009/0624 17:36:16 err 2: loading clump: corrupted at arenas09
361428653; expected=a066a808d8ccee2a0badd981ac82cb0e64efa65c
got=16b4c0d1e339279833b1387b6e58c051c282f59b
2009/0624 17:36:16 err 2: pre-copy sha1 wrong at arenas09 361938913:
expected=bef39df83a60e8ef9322def9eeeba0d501a0d960
got=9e375c5d1b58830a565dd2ab06e53b242a11c89b
2009/0624 17:36:16 err 2: loading clump: corrupted at arenas09
361938913; expected=bef39df83a60e8ef9322def9eeeba0d501a0d960
got=9e375c5d1b58830a565dd2ab06e53b242a11c89b
2009/0624 17:36:17 err 2: pre-copy sha1 wrong at arenas09 365938668:
expected=455abfee403540c43658d3367ae6c9e5f471e2af
got=fb18fe046102f9de3aed2a4d8e09323fdaa47f3a
2009/0624 17:36:17 err 2: loading clump: corrupted at arenas09
365938668; expected=455abfee403540c43658d3367ae6c9e5f471e2af
got=fb18fe046102f9de3aed2a4d8e09323fdaa47f3a
2009/0624 17:36:17 err 2: pre-copy sha1 wrong at arenas09 366506538:
expected=04be253f41e6d9c31d92489d11fb2fdb3522df60
got=f0ded93449a5f797e6a3aae2b596a6fb9c5d6fb6
2009/0624 17:36:17 err 2: loading clump: corrupted at arenas09
366506538; expected=04be253f41e6d9c31d92489d11fb2fdb3522df60
got=f0ded93449a5f797e6a3aae2b596a6fb9c5d6fb6
2009/0624 17:36:17 err 2: pre-copy sha1 wrong at arenas09 367000337:
expected=6d40d3af6e51409ab2e5826da7739f703f6afd25
got=693990c9bd7ca0aaec9abadd739653d244c865fb
2009/0624 17:36:17 err 2: loading clump: corrupted at arenas09
367000337; expected=6d40d3af6e51409ab2e5826da7739f703f6afd25
got=693990c9bd7ca0aaec9abadd739653d244c865fb
[shortened to save space]
2009/0624 17:36:45 err 2: pre-copy sha1 wrong at arenas09 408124319:
expected=1fe38fd7cdddcfa87917dd8efd53f20730e7cc45
got=b895cd3117141832bb1cd1b9da08f9e847940bbe
2009/0624 17:36:45 err 2: loading clump: corrupted at arenas09
408124319; expected=1fe38fd7cdddcfa87917dd8efd53f20730e7cc45
got=b895cd3117141832bb1cd1b9da08f9e847940bbe
2009/0624 17:36:45 err 2: pre-copy sha1 wrong at arenas09 409037849:
expected=5c9b78b5a64b6b4926e20ec5a9d4aa9156283f57
got=087d67c8f43351b84282146959f9a45042b9e1a7
2009/0624 17:36:45 err 2: loading clump: corrupted at arenas09
409037849; expected=5c9b78b5a64b6b4926e20ec5a9d4aa9156283f57
got=087d67c8f43351b84282146959f9a45042b9e1a7
2009/0624 17:36:45 err 2: pre-copy sha1 wrong at arenas09 409243191:
expected=e8301bea2b2f2b20bd32b2035a210322481c6061
got=f5a0211a2dded4fd91f190b27c71fc2452928754
2009/0624 17:36:45 err 2: loading clump: corrupted at arenas09
409243191; expected=e8301bea2b2f2b20bd32b2035a210322481c6061
got=f5a0211a2dded4fd91f190b27c71fc2452928754
assert failed: b->part != PartVenti
73 fossil: checked 7866 page table entries
fossil 73: suicide: sys: trap: fault read addr=0x0 pc=0x0002e638


Yow. Do I need to reinitialize my venti and fossil, just start from scratch?
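
For context on the venti lines above: venti names each block by the
SHA-1 hash (the "score") of its contents, so "expected=... got=..."
means the data read back from the arena no longer hashes to the score
it was stored under. A rough Python illustration of that check, using
made-up block contents and hypothetical helper names (score, verify)
rather than venti's actual code:

```python
import hashlib

def score(data: bytes) -> str:
    """SHA-1 of the block contents, as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()

def verify(data: bytes, expected: str) -> bool:
    """True if the data still hashes to the score it was stored under."""
    return score(data) == expected

block = b"example block contents"   # made-up data for illustration
expected = score(block)             # score recorded at write time

# Simulate one flipped bit (top bit of the first byte) coming back from disk.
damaged = bytes([block[0] ^ 0x80]) + block[1:]

print(verify(block, expected))      # True
print(verify(damaged, expected))    # False
```

Because the score covers the whole block, even a single flipped bit
gives an unrelated hash, so the mismatched scores show that the data is
damaged but not how badly.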


John.

* Re: [9fans] fossil/venti falling down?
  2009-06-24 17:43 ` John Floren
@ 2009-06-24 19:09   ` cinap_lenrek
  2009-06-24 23:33     ` John Floren
  2009-06-25  0:59     ` erik quanstrom
  0 siblings, 2 replies; 23+ messages in thread
From: cinap_lenrek @ 2009-06-24 19:09 UTC (permalink / raw)
  To: 9fans

/boot/fossil: cacheLocalData: addr=78989 type got 0 exp 0: tag got e63eb942 exp 663eb942
/boot/fossil: cacheLocalData: addr=99457 type got 0 exp 0: tag got 150daf85 exp 150daf05
/boot/fossil: cacheLocalData: addr=68651 type got 0 exp 0: tag got 66be7fe5 exp 663e7fe5
/boot/fossil: cacheLocalData: addr=166723 type got 0 exp 0: tag got 3eabf0f9 exp 3eabf079
/boot/fossil: cacheLocalData: addr=134772 type got 0 exp 0: tag got 7795335e exp 7715335e
/boot/fossil: cacheLocalData: addr=155039 type got 0 exp 0: tag got 19383bf exp 11383bf
/boot/fossil: cacheLocalData: addr=155167 type got 0 exp 0: tag got 19383bf exp 11383bf
/boot/fossil: cacheLocalData: addr=155231 type got 0 exp 0: tag got 19383bf exp 11383bf
/boot/fossil: cacheLocalData: addr=57147 type got 0 exp 8: tag got 7bea015b exp 1ef0c892

why do i see the bits 0x800000 and 0x80 set in the bad tags, but
the rest seems to be identical?
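
One way to check that observation is to XOR each got/exp pair; a
result with exactly one bit set means a single flipped bit. A minimal
Python sketch, assuming the pairs are copied correctly from the
cacheLocalData lines above (the three lines with identical tags are
listed once, and the helper name flipped_bits is made up for
illustration):

```python
# (got, exp) tag pairs copied from the cacheLocalData messages above.
pairs = [
    (0xe63eb942, 0x663eb942),
    (0x150daf85, 0x150daf05),
    (0x66be7fe5, 0x663e7fe5),
    (0x3eabf0f9, 0x3eabf079),
    (0x7795335e, 0x7715335e),
    (0x019383bf, 0x011383bf),
    (0x7bea015b, 0x1ef0c892),
]

def flipped_bits(got, exp):
    """Return the XOR mask of two tags and the number of differing bits."""
    mask = got ^ exp
    return mask, bin(mask).count("1")

for got, exp in pairs:
    mask, n = flipped_bits(got, exp)
    print(f"got {got:08x} exp {exp:08x} xor {mask:08x} bits {n}")
```

For the first six pairs the mask comes out as 0x80000000, 0x00800000 or
0x00000080, that is, one flipped bit, and always the top bit of a byte.
Only the addr=57147 entry differs in many bits, and there the block
type is wrong as well.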

--
cinap

2009/0624 17:36:17 err 2: loading clump: corrupted at arenas09
366506538; expected=04be253f41e6d9c31d92489d11fb2fdb3522df60
got=f0ded93449a5f797e6a3aae2b596a6fb9c5d6fb6
2009/0624 17:36:17 err 2: pre-copy sha1 wrong at arenas09 367000337:
expected=6d40d3af6e51409ab2e5826da7739f703f6afd25
got=693990c9bd7ca0aaec9abadd739653d244c865fb
2009/0624 17:36:17 err 2: loading clump: corrupted at arenas09
367000337; expected=6d40d3af6e51409ab2e5826da7739f703f6afd25
got=693990c9bd7ca0aaec9abadd739653d244c865fb
[shortened to save space]
2009/0624 17:36:45 err 2: pre-copy sha1 wrong at arenas09 408124319:
expected=1fe38fd7cdddcfa87917dd8efd53f20730e7cc45
got=b895cd3117141832bb1cd1b9da08f9e847940bbe
2009/0624 17:36:45 err 2: loading clump: corrupted at arenas09
408124319; expected=1fe38fd7cdddcfa87917dd8efd53f20730e7cc45
got=b895cd3117141832bb1cd1b9da08f9e847940bbe
2009/0624 17:36:45 err 2: pre-copy sha1 wrong at arenas09 409037849:
expected=5c9b78b5a64b6b4926e20ec5a9d4aa9156283f57
got=087d67c8f43351b84282146959f9a45042b9e1a7
2009/0624 17:36:45 err 2: loading clump: corrupted at arenas09
409037849; expected=5c9b78b5a64b6b4926e20ec5a9d4aa9156283f57
got=087d67c8f43351b84282146959f9a45042b9e1a7
2009/0624 17:36:45 err 2: pre-copy sha1 wrong at arenas09 409243191:
expected=e8301bea2b2f2b20bd32b2035a210322481c6061
got=f5a0211a2dded4fd91f190b27c71fc2452928754
2009/0624 17:36:45 err 2: loading clump: corrupted at arenas09
409243191; expected=e8301bea2b2f2b20bd32b2035a210322481c6061
got=f5a0211a2dded4fd91f190b27c71fc2452928754
assert failed: b->part != PartVenti
73 fossil: checked 7866 page table entries
fossil 73: suicide: sys: trap: fault read addr=0x0 pc=0x0002e638


Yow. Do I need to reinitialize my venti and fossil, just start from scratch?


John.


* Re: [9fans] fossil/venti falling down?
  2009-06-24 19:09   ` cinap_lenrek
@ 2009-06-24 23:33     ` John Floren
  2009-06-24 23:39       ` erik quanstrom
  2009-06-25  0:59     ` erik quanstrom
  1 sibling, 1 reply; 23+ messages in thread
From: John Floren @ 2009-06-24 23:33 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Wed, Jun 24, 2009 at 12:09 PM, <cinap_lenrek@gmx.de> wrote:
>  /boot/fossil: cacheLocalData: addr=78989 type got 0 exp 0: tag got
> e63eb942 exp 663eb942
> /boot/fossil: cacheLocalData: addr=99457 type got 0 exp 0: tag got
> 150daf85 exp 150daf05
> /boot/fossil: cacheLocalData: addr=68651 type got 0 exp 0: tag got
> 66be7fe5 exp 663e7fe5
> /boot/fossil: cacheLocalData: addr=166723 type got 0 exp 0: tag got
> 3eabf0f9 exp 3eabf079
> /boot/fossil: cacheLocalData: addr=134772 type got 0 exp 0: tag got
> 7795335e exp 7715335e
> /boot/fossil: cacheLocalData: addr=155039 type got 0 exp 0: tag got
> 19383bf exp 11383bf
> /boot/fossil: cacheLocalData: addr=155167 type got 0 exp 0: tag got
> 19383bf exp 11383bf
> /boot/fossil: cacheLocalData: addr=155231 type got 0 exp 0: tag got
> 19383bf exp 11383bf
> /boot/fossil: cacheLocalData: addr=57147 type got 0 exp 8: tag got
> 7bea015b exp 1ef0c892
>
> why do i see the bits 0x800000 and 0x80 set in the bad tags, but
> the rest seems to be identical?
>
> --
> cinap
>

That's interesting...

So I went ahead and reinstalled fossil and venti--this time I went
with a RAID-10 configuration on the Coraid.

Now, on the first archival snapshot to venti, I'm seeing these errors:

/boot/fossil: cacheLocalData: addr=4149 type got 0 exp 0: tag got
fa5c83d5 exp 7a5c83d5
archive(0, 0x1035): cannot find block: block label mismatch
/boot/fossil: cacheLocalData: addr=12511 type got 0 exp 0: tag got
5afe0bcb exp 5a7e0bcb
archive(0, 0x30df): cannot find block: block label mismatch

In both cases, the "got" tag differs from the "expected" tag in exactly
one bit: the high bit (0x80, binary 1000 0000) of a single byte is set.
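
To make the pattern concrete, here is a minimal sketch (Python, not part
of fossil) that XORs each "got" tag against its "exp" tag. The values are
copied from the cacheLocalData messages in this thread; the helper name
flipped_bits is made up for illustration.

```python
# Tag pairs (got, exp) copied from the cacheLocalData messages in this thread.
PAIRS = [
    (0xFA5C83D5, 0x7A5C83D5),  # addr=4149
    (0x5AFE0BCB, 0x5A7E0BCB),  # addr=12511
    (0xE63EB942, 0x663EB942),  # addr=78989
    (0x150DAF85, 0x150DAF05),  # addr=99457
    (0x019383BF, 0x011383BF),  # addr=155039
]


def flipped_bits(got, exp):
    """Return the bit positions (0 = least significant) where got and exp differ."""
    diff = got ^ exp
    return [bit for bit in range(32) if diff & (1 << bit)]


for got, exp in PAIRS:
    bits = flipped_bits(got, exp)
    # A position with bit % 8 == 7 is the top (0x80) bit of some byte.
    print(f"got {got:08x} exp {exp:08x} xor {got ^ exp:08x} bits {bits}")
```

For the pairs listed, each XOR has a single bit set, that bit is always
the 0x80 bit of some byte, and it is set in "got" and clear in "exp".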

Russ, can you shed some light on this?

John




* Re: [9fans] fossil/venti falling down?
  2009-06-24 23:33     ` John Floren
@ 2009-06-24 23:39       ` erik quanstrom
  2009-06-25  0:00         ` Venkatesh Srinivas
  0 siblings, 1 reply; 23+ messages in thread
From: erik quanstrom @ 2009-06-24 23:39 UTC (permalink / raw)
  To: 9fans

> So I went ahead and reinstalled fossil and venti--this time I went
> with a RAID-10 configuration on the Coraid.

for data integrity, raid 5 is a better solution because
on a raid 10, if one block is wrong, it's a coin flip as
to which one is correct (if any).  with raid 5, it's possible
to determine which disk has the incorrect information.

- erik




* Re: [9fans] fossil/venti falling down?
  2009-06-24 23:39       ` erik quanstrom
@ 2009-06-25  0:00         ` Venkatesh Srinivas
  2009-06-25  0:42           ` erik quanstrom
  0 siblings, 1 reply; 23+ messages in thread
From: Venkatesh Srinivas @ 2009-06-25  0:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Wed, Jun 24, 2009 at 7:39 PM, erik quanstrom<quanstro@coraid.com> wrote:
>> So I went ahead and reinstalled fossil and venti--this time I went
>> with a RAID-10 configuration on the Coraid.
>
> for data integrity, raid 5 is a better solution because
> on a raid 10, if one block is wrong, it's a coin flip as
> to which one is correct (if any).  with raid 5, it's possible
> to determine which disk has the incorrect information.
>

Not directly related to the topic here, but this has always bugged me
about running Venti on mirrored or raided disks.

When a block on a mirrored pair doesn't match the block on its
partner, the mirroring layer has no idea which one is right, but Venti
does. Some way to export this read failure to it and give it a chance
to decide which block to pick would be pretty neat.
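
A rough sketch of that idea, in Python rather than anything resembling
Venti's actual code: Venti addresses a block by the SHA-1 of its contents
(its score), so given both mirror copies it could keep whichever one
hashes to the expected score. The function name pick_copy and the sample
data are hypothetical, and the sketch assumes the mirroring layer can hand
back both copies, which is exactly the missing piece described above.

```python
import hashlib


def pick_copy(score_hex, copies):
    """Return the first copy whose SHA-1 equals the expected score, else None."""
    for data in copies:
        if hashlib.sha1(data).hexdigest() == score_hex:
            return data
    return None


good = b"block contents as originally written"
bad = bytearray(good)
bad[3] ^= 0x80  # simulate one flipped bit on one side of the mirror
score = hashlib.sha1(good).hexdigest()

chosen = pick_copy(score, [bytes(bad), good])
print("recovered intact copy" if chosen == good else "no copy matches the score")
```

If neither copy matches, the function returns None, and the block has to
be treated as lost rather than guessed at.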

Alternatively, run one Venti per disk and run something like Inferno's
vcache in front of each of them, each one naming the other as the
'remote' server...

For more protection, I have a (currently stalled) 'devrs' started. It
is based on Inferno's devds (Plan 9's devfs) in RAID1 mode, but
protects each disk block with a (255,223) Reed-Solomon code and
interleaves the coded blocks around the disks somewhat. Email me for
sources; does this seem like a reasonable approach?
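
For reference, the basic arithmetic of a (255,223) Reed-Solomon code over
8-bit symbols can be worked out as below. These are standard properties of
the code, not figures taken from the devrs sources; the variable names are
only for illustration.

```python
n, k = 255, 223          # codeword length and data length, in bytes

parity = n - k           # parity bytes added to each codeword
t = parity // 2          # byte errors correctable at unknown positions
erasures = parity        # byte erasures correctable at known positions
overhead = parity / k    # extra storage relative to the data

print(f"parity bytes per codeword: {parity}")
print(f"correctable byte errors:   {t}")
print(f"correctable erasures:      {erasures}")
print(f"storage overhead:          {overhead:.1%}")
```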

Take care,
-- vs




* Re: [9fans] fossil/venti falling down?
  2009-06-25  0:00         ` Venkatesh Srinivas
@ 2009-06-25  0:42           ` erik quanstrom
  2009-06-25 16:13             ` Russ Cox
  0 siblings, 1 reply; 23+ messages in thread
From: erik quanstrom @ 2009-06-25  0:42 UTC (permalink / raw)
  To: 9fans

> Not directly related to the topic here, but this has always bugged me
> about running Venti on mirrored or raided disks.
>
> When a block on a mirrored pair doesn't match the block on its
> partner, the mirroring layer has no idea which one is right, but Venti
> does. Some way to export this read failure to it and give it a chance
> to decide which block to pick would be pretty neat.

it's even neater to use a raid level that doesn't require venti
intervention.

does venti even keep scores on the bloom filter blocks and the icache?

- erik




* Re: [9fans] fossil/venti falling down?
  2009-06-24 19:09   ` cinap_lenrek
  2009-06-24 23:33     ` John Floren
@ 2009-06-25  0:59     ` erik quanstrom
  2009-06-25 16:13       ` Russ Cox
  1 sibling, 1 reply; 23+ messages in thread
From: erik quanstrom @ 2009-06-25  0:59 UTC (permalink / raw)
  To: 9fans

> /boot/fossil: cacheLocalData: addr=155039 type got 0 exp 0: tag got
> 19383bf exp 11383bf
> /boot/fossil: cacheLocalData: addr=155167 type got 0 exp 0: tag got
> 19383bf exp 11383bf

am i wrong in thinking that it would be an error to have the same tag at
two different addresses?

- erik




* Re: [9fans] fossil/venti falling down?
  2009-06-25  0:59     ` erik quanstrom
@ 2009-06-25 16:13       ` Russ Cox
  0 siblings, 0 replies; 23+ messages in thread
From: Russ Cox @ 2009-06-25 16:13 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Wed, Jun 24, 2009 at 5:59 PM, erik quanstrom<quanstro@quanstro.net> wrote:
>> /boot/fossil: cacheLocalData: addr=155039 type got 0 exp 0: tag got
>> 19383bf exp 11383bf
>> /boot/fossil: cacheLocalData: addr=155167 type got 0 exp 0: tag got
>> 19383bf exp 11383bf
>
> am i wrong in thinking that it would be an error to have the same tag at
> two different addresses?

the tag is more or less an inode number.
every data block in a file has the same tag.
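
As an illustration of that point, the sketch below (Python, not fossil
code) groups block addresses from the cacheLocalData messages by their
expected tag. The log lines are copied from this thread with the e-mail
line wrapping undone; the regular expression and function name are
assumptions made for this example.

```python
import re
from collections import defaultdict

# Log lines from this thread, rejoined after e-mail wrapping.
LOG = """\
/boot/fossil: cacheLocalData: addr=155039 type got 0 exp 0: tag got 19383bf exp 11383bf
/boot/fossil: cacheLocalData: addr=155167 type got 0 exp 0: tag got 19383bf exp 11383bf
/boot/fossil: cacheLocalData: addr=155231 type got 0 exp 0: tag got 19383bf exp 11383bf
/boot/fossil: cacheLocalData: addr=99457 type got 0 exp 0: tag got 150daf85 exp 150daf05
"""

LINE = re.compile(
    r"addr=(\d+) type got \d+ exp \d+: tag got ([0-9a-f]+) exp ([0-9a-f]+)"
)


def addrs_by_expected_tag(log):
    """Map each expected tag to the list of block addresses that carry it."""
    groups = defaultdict(list)
    for m in LINE.finditer(log):
        addr, _got, exp = m.groups()
        groups[int(exp, 16)].append(int(addr))
    return dict(groups)


groups = addrs_by_expected_tag(LOG)
for tag, addrs in groups.items():
    print(f"tag {tag:x}: blocks {addrs}")
```

Several addresses sharing one tag is then what one would expect for the
data blocks of a single file, not a second error.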

russ



* Re: [9fans] fossil/venti falling down?
  2009-06-25  0:42           ` erik quanstrom
@ 2009-06-25 16:13             ` Russ Cox
  2009-06-25 16:24               ` erik quanstrom
  0 siblings, 1 reply; 23+ messages in thread
From: Russ Cox @ 2009-06-25 16:13 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> it's even neater to use a raid level that doesn't require venti
> intervention.

agreed.

> does venti even keep scores on the bloom filter blocks and the icache?

no, but those are soft data and can be reconstructed.

russ



* Re: [9fans] fossil/venti falling down?
  2009-06-25 16:13             ` Russ Cox
@ 2009-06-25 16:24               ` erik quanstrom
  2009-06-25 16:47                 ` Russ Cox
  0 siblings, 1 reply; 23+ messages in thread
From: erik quanstrom @ 2009-06-25 16:24 UTC (permalink / raw)
  To: 9fans

> > does venti even keep scores on the bloom filter blocks and the icache?
>
> no, but those are soft data and can be reconstructed.

being the paranoid type, i worry about this.  does the
rebuild rate on a large (say, 1tb) venti make this a
practical strategy?

- erik




* Re: [9fans] fossil/venti falling down?
  2009-06-25 16:24               ` erik quanstrom
@ 2009-06-25 16:47                 ` Russ Cox
  2009-06-25 16:51                   ` erik quanstrom
  0 siblings, 1 reply; 23+ messages in thread
From: Russ Cox @ 2009-06-25 16:47 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Jun 25, 2009 at 9:24 AM, erik quanstrom<quanstro@quanstro.net> wrote:
>> > does venti even keep scores on the bloom filter blocks and the icache?
>>
>> no, but those are soft data and can be reconstructed.
>
> being the paranoid type, i worry about this.  does the
> rebuild rate on a large (say, 1tb) venti make this a
> practical strategy?

there's no question that a better strategy is to
use a 100% reliable underlying storage device.

if that's not available and one must cope with
disk failures some other way, it is very nice
that venti can use the sha1 checksums to check
the integrity of the core data and rebuild the rest.
this is what i do when a disk fails on the mit venti
backup server, which has about 5tb of data right now.
it takes about an hour to rebuild everything with
venti/buildindex.

russ



* Re: [9fans] fossil/venti falling down?
  2009-06-25 16:47                 ` Russ Cox
@ 2009-06-25 16:51                   ` erik quanstrom
  0 siblings, 0 replies; 23+ messages in thread
From: erik quanstrom @ 2009-06-25 16:51 UTC (permalink / raw)
  To: 9fans

> there's no question that a better strategy is to
> use a 100% reliable underlying storage device.

let me know when you find one.

- erik



