From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 8 Sep 2011 13:48:23 +0100
To: 9fans@9fans.net
User-Agent: Heirloom mailx 12.4 7/29/08
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <20110908124823.9235665D38@server.hemiola.co.uk>
From: rod@hemiola.co.uk
Subject: [9fans] fossil deadlock
Topicbox-Message-UUID: 1bc7b44c-ead7-11e9-9d60-3106f5b1d025

I can easily get fossil to deadlock. It would be great if anyone with
fossil internals expertise could comment.

As far as I can tell from a stack dump, it usually goes like this:

The snapshot thread has called cacheFlush with "wait" set to 1 to ensure
all blocks are flushed to disk before continuing. It acquires "dirtylk"
and then loops waiting for ndirty to become zero. In the loop it wakes
"flush" to kick the flush thread and then sleeps on "flushwait".

Now the unlink thread tries to mark a block as dirty, and it gets stuck
in blockDirty. It needs to increment ndirty, but can't acquire "dirtylk",
as the snapshot thread is holding it.

The flush thread is in _cacheLocalLookup. The block it was trying to
return was in the process of being written by the disk thread, so it
waited for the I/O to complete. Subsequently the write finished and the
block was declared "clean", but it was then grabbed by the unlink thread.
The flush thread now can't reacquire the block lock on return from
vtSleep(b->ioready).

Does this sound feasible? Even if so, I'm not sure how to fix it.

- rod

Further notes:

Probably not related, but occasionally I get an error similar to the
following:

fossil: diskReadRaw failed: /usr/glenda/test/fossil: score 0x00643e42: part=label block 6569538: illegal block address
archive(0, 0xe51246a5): cannot find block: error reading block 0x00643e42
archWalk 0xe51246a5 failed; ptr is in 0x173d offset 0
archiveBlock 0x173d: error reading block 0x00643e42
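
Returning to the deadlock itself: the three-thread description can be restated as a wait-for graph, which makes it easier to check whether the scenario really forms a cycle. The following is only a minimal sketch in Python, not fossil code. The edges are a reading of the stack-dump description in this message, and the names WAITS_FOR and find_cycle are made up for the illustration.

```python
# Wait-for graph of the three threads described in the message.
# An edge A -> B means "A cannot make progress until B does".
# ASSUMPTION: these edges are a reading of the description above,
# not something extracted from the fossil source.
WAITS_FOR = {
    "snapshot": ["flush"],   # holds dirtylk, sleeps on flushwait until ndirty == 0
    "flush": ["unlink"],     # back from vtSleep(b->ioready), needs the block lock
    "unlink": ["snapshot"],  # holds the block, stuck in blockDirty wanting dirtylk
}


def find_cycle(graph):
    """Return one cycle as a list of nodes (first node repeated last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: that is a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for start in graph:
        if colour[start] == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR))
```

If the three edges are right, the function reports a cycle through all three threads, which matches the hang described. Removing any single edge from the table makes the cycle disappear, but the sketch says nothing about which of those dependencies could safely be broken inside fossil.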