From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu,  8 Sep 2011 13:48:23 +0100
To: 9fans@9fans.net
User-Agent: Heirloom mailx 12.4 7/29/08
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <20110908124823.9235665D38@server.hemiola.co.uk>
From: rod@hemiola.co.uk
Subject: [9fans] fossil deadlock
Topicbox-Message-UUID: 1bc7b44c-ead7-11e9-9d60-3106f5b1d025

I can easily get fossil to deadlock. It would be great if anyone with
fossil internals expertise could comment. As far as I can tell from
a stack dump, it usually goes like this:


The snapshot thread has called cacheFlush with "wait" set to 1 to
ensure all blocks are flushed to disk before continuing. It acquires
"dirtylk" and then loops waiting for ndirty to become zero.  In the
loop it wakes "flush" to kick the flush thread and then sleeps on
"flushwait".

Now the unlink thread tries to mark a block as dirty, and it gets
stuck in blockDirty. It needs to increment ndirty, but can't acquire
the dirtylk, as the snapshot thread is holding it.

The flush thread is in _cacheLocalLookup.  The block it was trying to
return was in the process of being written by the disk thread so it
waited for the I/O to complete.  Subsequently the write finished and
the block was declared "clean", but it was then grabbed by the unlink thread.
The flush thread now can't reacquire the block lock on return from
vtSleep(b->ioready).
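
To make the suspected cycle explicit, here is a minimal Python sketch of
the wait-for relationships as described above. It is only a model of my
reading of the stack dump, not fossil code: the thread labels and the
find_cycle helper are hypothetical names chosen for illustration.

```python
# Wait-for graph taken from the description above: each thread maps to
# the thread it appears to be blocked on. The keys are labels only,
# not identifiers from the fossil source.
WAITS_FOR = {
    "snapshot": "flush",   # sleeps on flushwait until flush drains ndirty
    "flush": "unlink",     # cannot reacquire the block lock unlink holds
    "unlink": "snapshot",  # stuck in blockDirty waiting for dirtylk
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    seen = {}  # thread -> index in path where it was first visited
    node = start
    while node is not None and node not in seen:
        seen[node] = len(path)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended at a thread that is not blocked
    return path[seen[node]:]


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR, "snapshot"))
```

If any one of the three edges is wrong (for example, if dirtylk is not
actually held while the snapshot thread sleeps), the chain ends and no
cycle is reported, so each edge is worth checking against the source.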

Does this sound feasible? Even if so, I'm not sure how to fix it.


- rod


Further notes:

Probably not related, but occasionally I get an error similar to the
following:

fossil: diskReadRaw failed: /usr/glenda/test/fossil: score 0x00643e42: part=label block 6569538: illegal block address
archive(0, 0xe51246a5): cannot find block: error reading block 0x00643e42
archWalk 0xe51246a5 failed; ptr is in 0x173d offset 0
archiveBlock 0x173d: error reading block 0x00643e42