From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <7d3530220906181010l25557061k774bb250a4a2e6dd@mail.gmail.com>
References: <7d3530220906180930p575fcb4bk473decb7d1a89c27@mail.gmail.com>
	<7d3530220906181010l25557061k774bb250a4a2e6dd@mail.gmail.com>
Date: Wed, 24 Jun 2009 10:06:22 -0700
Message-ID: <7d3530220906241006t7e9799f8r17f09f57c1c41831@mail.gmail.com>
From: John Floren
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] fossil/venti falling down?
Topicbox-Message-UUID: 0e94c898-ead5-11e9-9d60-3106f5b1d025

On Thu, Jun 18, 2009 at 10:10 AM, John Floren wrote:
> On Thu, Jun 18, 2009 at 9:45 AM, erik quanstrom wrote:
>>
>> > It seems to only happen once per boot, but not necessarily when fossil
>> > starts responding--I've seen it a couple hours after booting, while
>> > the filesystem tends to go away at night.
>>
>> the failure is somewhere in blockWrite. since blockWrite
>> calls diskWrite and diskWrite just queues up i/o to send
>> to the disk, it's not possible to get i/o errors directly from
>> blockWrite.
>>
>> there are two cases that do return errors.
>>
>> one is if the block can't be locked. a runaway periodic function
>> would make that more likely, since we don't wait for the lock.
>> but it seems more likely in this case that some of fossil's data is
>> corrupted since this started after the double-failure.
>> see http://9fans.net/archive/2009/03/487
>>
>> the other case is a funny dependency. there's a fprint there
>> that's commented out.
>>
>> - erik
>>
>
> Here's another message that may be of interest. I ran fshalt before
> rebooting (to test the periodicthread patch) and saw this:
>
> syncing.../srv/fscons...prompt: sourceRoot: fs->ehi = 5395, b->l = BtDir,3,Copied,e=5394,-1,tag=0x1
> venti...
> halting.../srv/fscons...archive vac:a9d9b0b9fe0db783fe618f680804a18df532a67a
>
> I don't remember seeing that "sourceRoot: ..." stuff before; as soon
> as the system comes back up I guess I'll take a look at source.
>

After replacing the problematic server and moving the fossil disk to
the new machine, we're not getting random hangs any more. However,
I've seen this a few times on the console:

/boot/fossil: cacheLocalData: addr=78989 type got 0 exp 0: tag got e63eb942 exp 663eb942
archive(0, 0x1348d): cannot find block: block label mismatch

and

/boot/fossil: cacheLocalData: addr=134772 type got 0 exp 0: tag got 7795335e exp 7715335e

Is this something to worry about?

John
--
"I've tried programming Ruby on Rails, following TechCrunch in my RSS
reader, and drinking absinthe. It doesn't work. I'm going back to C,
Hunter S. Thompson, and cheap whiskey." -- Ted Dziuba
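
A note on the two tag mismatches in the console messages above: one quick
way to see how the "got" and "exp" tags differ is to XOR them. The sketch
below is plain Python, not anything from fossil; the helper name tag_diff
is made up for illustration, and the 32-bit tag width is an assumption
based on the eight hex digits printed.

```python
def tag_diff(got, exp):
    """Return the bit positions (0 = least significant) where two tags differ.

    Assumes 32-bit tags, matching the eight hex digits in the console output.
    """
    x = got ^ exp
    return [i for i in range(32) if (x >> i) & 1]


# The two (got, exp) pairs copied from the cacheLocalData messages.
pairs = [(0xe63eb942, 0x663eb942), (0x7795335e, 0x7715335e)]

for got, exp in pairs:
    print(f"{got:08x} vs {exp:08x}: xor={got ^ exp:08x} bits={tag_diff(got, exp)}")
```

In both messages the tags differ in exactly one bit (bit 31 in the first,
bit 23 in the second). That is only an observation about the numbers, not
a diagnosis of where the difference comes from.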