From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: 
To: 9fans@9fans.net
Date: Wed, 8 Apr 2009 18:46:44 +0200
From: cinap_lenrek@gmx.de
In-Reply-To: <18a0a7d320f2c6799a99d3d3b64c2f4a@hamnavoe.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="upas-smoigxihuljneeazzdpycwwddz"
Subject: Re: [9fans] fossil caching venti errors
Topicbox-Message-UUID: d66f7cb0-ead4-11e9-9d60-3106f5b1d025

This is a multi-part message in MIME format.
--upas-smoigxihuljneeazzdpycwwddz
Content-Disposition: inline
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

hm... could it be that the archiver is for some reason looking at
rolled-back blocks?

to ron: are you using temporal snapshots?

i've seen the same symptom but didn't have the nerve to look deeper
into it... just reformatted the fossil from the last score and
disabled snapshots and snaptime. :(

it happened while i dumped the blocks from another fossil's venti to
the same venti that the failing fossil was archiving to... with
snapshots enabled. i have no idea if this is relevant...

keep on debugging!

--
cinap

--upas-smoigxihuljneeazzdpycwwddz
Content-Type: message/rfc822
Content-Disposition: inline

Message-ID: <18a0a7d320f2c6799a99d3d3b64c2f4a@hamnavoe.com>
To: 9fans@9fans.net
From: Richard Miller <9fans@hamnavoe.com>
Date: Wed, 8 Apr 2009 15:44:00 +0100
In-Reply-To: <9795467bbf361247c67dd50a1d03ac4f@terzarima.net>
Subject: Re: [9fans] fossil caching venti errors

I've been seeing corruption of fossil /archive data, at a rate of
about once a week.  The symptom is that venti dir entries for
/archive copies of some frequently referenced directories in my main
fossil fs (generally it's /usr, /usr/miller or /mail/box/miller)
contain incorrect venti scores.  Sometimes the score is for a
nonexistent block, and sometimes it's a valid score but for a
different block (e.g. a data or dir block when a meta block is
expected).
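[A note from the editor, not part of the original mail: a venti score
is the SHA-1 hash of a block's contents, so the two failure modes just
described can be told apart mechanically: vtread fails outright on a
score that names no block, while a valid-but-wrong score reads back
cleanly and only fails once fossil interprets the contents.  Below is
a minimal sketch along those lines, using libventi as shipped with
Plan 9 and plan9port; the program name is invented and the numeric
type argument is an assumption.]

	#include <u.h>
	#include <libc.h>
	#include <venti.h>

	static uchar buf[VtMaxLumpSize];

	void
	main(int argc, char **argv)
	{
		VtConn *z;
		uchar score[VtScoreSize];
		int n;

		if(argc != 3){
			fprint(2, "usage: scorecheck type score\n");
			exits("usage");
		}
		if(vtparsescore(argv[2], nil, score) < 0)
			sysfatal("bad score: %r");

		if((z = vtdial(nil)) == nil)	/* nil: dial $venti */
			sysfatal("vtdial: %r");
		if(vtconnect(z) < 0)
			sysfatal("vtconnect: %r");

		/* a score that names no block fails here... */
		if((n = vtread(z, score, atoi(argv[1]), buf, sizeof buf)) < 0)
			sysfatal("vtread: %r");

		/* ...while a wrong-but-valid score reads back a block
		   that still hashes to its own score, so this check
		   passes and the damage only shows up when fossil
		   interprets the contents */
		if(vtsha1check(score, buf, n) < 0)
			sysfatal("score mismatch");
		print("ok: %d bytes\n", n);
		vthangup(z);
		exits(nil);
	}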
Almost always it's the dir entry for a metadata block which gets a
bad score, but occasionally it's the dir entry for a sub-dir block.
This morning both the sub-dir and meta entries for one directory were
written with the scores of the corresponding entries for a different
directory:

term% ls -lqd /n/dump/2009/0408/mail /n/dump/2009/0408/usr/miller/disk
(0000000000001646 9 80) d-rwxrwxr-x M 36 upas upas 0 Jan  1  2007 /n/dump/2009/0408/mail
(000000000000694e 23 80) d-rwxr-xr-x M 36 miller miller 0 Dec 31  2003 /n/dump/2009/0408/usr/miller/disk
term% ls /n/dump/2009/0408/mail /n/dump/2009/0408/usr/miller/disk
/n/dump/2009/0408/mail/fdisk.c
/n/dump/2009/0408/mail/fdisk.c_ok
/n/dump/2009/0408/mail/fdisk.c_try
/n/dump/2009/0408/mail/mkfs.c
/n/dump/2009/0408/usr/miller/disk/fdisk.c
/n/dump/2009/0408/usr/miller/disk/fdisk.c_ok
/n/dump/2009/0408/usr/miller/disk/fdisk.c_try
/n/dump/2009/0408/usr/miller/disk/mkfs.c
term% diff -r /n/dump/2009/0408/mail /n/dump/2009/0408/usr/miller/disk
term%

Here are the pairs of VtEntry structures for the two directories as
retrieved from venti:

/n/dump/2009/0408/mail:
0000000 00000000 1ff41fe0 03000000 00000000
0000010 00000280 0830e9ae 3023dce8 3ddf6138
0000020 4fe87d59 7257e3b7 00000000 1ff42000
0000030 01000000 00000000 0000039f 21ee2227
0000040 695a8ea2 fbbd136e dcb435e0 64fbcdfb
0000050

/n/dump/2009/0408/usr/miller/disk:
0000000 00000000 1ff41fe0 03000000 00000000
0000010 000000a0 0830e9ae 3023dce8 3ddf6138
0000020 4fe87d59 7257e3b7 00000000 1ff42000
0000030 01000000 00000000 00000296 21ee2227
0000040 695a8ea2 fbbd136e dcb435e0 64fbcdfb
0000050

Note that it's just the score fields for /mail which are wrong; the
size fields (280 and 39f) are correct, matching yesterday's dump:

/n/dump/2009/0407/mail:
0000000 00000000 1ff41fe0 03000000 00000000
0000010 00000280 ...

So, what's going on?  My intuition says a fossil bug - I can't think
of any disk hardware error which would lead to this kind of
corruption.  I have of course studied /sys/src/cmd/fossil/archive.c
looking for races (my fossil+venti machine has two processors), but
everything appears to be protected by locks.

I would be interested to know if anyone else's archive is being
quietly messed up in this way - you wouldn't necessarily know until
you tried something like a history(1) command and found pieces
missing.  This is a quick test which may show if you have a similar
problem:

cd /n/dump/2009
for (i in *) {
	test -d $i$home/tmp || ls -d $i$home/tmp
}
for (i in *) {
	test -f $i/mail/box/$user/mbox || ls $i/mail/box/$user/mbox
}

In a post a while ago, Russ said

> The amazing thing to me about fossil is how indestructable
> it is when used with venti.
> ... Once you see the data in the archive
> tree, you can be very sure it's not going away.

I agree with this, but I'd like a way to be reassured that my daily
data has actually gone into the archive correctly.

--
Richard

--upas-smoigxihuljneeazzdpycwwddz--
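[An editorial footnote on reading the hex dumps above, not part of
the original mail: each 0x50-byte dump holds two consecutive entries
(the sub-dir entry, then the meta entry) in venti's 40-byte packed
entry format, the one libventi's vtentryunpack decodes.  A sketch of
that layout follows; the byte offsets and the struct name are my
annotation.]

	/* layout of one packed venti entry, 40 bytes */
	typedef unsigned char uchar;

	struct PackedEntry {
		uchar gen[4];    /* 0-3:   generation (00000000 above) */
		uchar psize[2];  /* 4-5:   pointer block size (1ff4 = 8180) */
		uchar dsize[2];  /* 6-7:   data block size (1fe0 / 2000) */
		uchar flags;     /* 8:     03 = active+dir, 01 = active */
		uchar zero[5];   /* 9-13:  padding */
		uchar size[6];   /* 14-19: size (0x280 and 0x39f for /mail) */
		uchar score[20]; /* 20-39: SHA-1 score of the pointed-to block */
	};

[Read this way, the scores in the /mail dump (bytes 20-39 and 60-79)
are byte-for-byte the same as in the /usr/miller/disk dump, while the
size fields (0x280/0x39f versus 0xa0/0x296) still differ: exactly the
"only the score fields are wrong" state described above.]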