From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <635e0a367d595f2c6e2b38c2480d2cf9@hamnavoe.com> To: 9fans@9fans.net From: Richard Miller <9fans@hamnavoe.com> Date: Thu, 22 Mar 2012 11:57:10 +0000 In-Reply-To: <20110908124823.9235665D38@server.hemiola.co.uk> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Subject: Re: [9fans] fossil deadlock Topicbox-Message-UUID: 6dee0906-ead7-11e9-9d60-3106f5b1d025 Thanks to detective work by Rod at hemiola and David du Colombier, I've been able to find & fix the long-standing deadlock in fossil's snapshot code. The problem is the lock c->dirtylk, which prevents any thread from dirtying new blocks while flushThread is writing old dirty blocks to disk. Deadlock occurs when a thread T tries to dirty a block B while holding a lock on an already-dirty block A. T can't proceed until c->dirtylk is unlocked, which won't happen until flushThread has written out block A; but block A can't be written out until T unlocks it. The deadlock is usually observed when a fossil 'sync' occurs (eg at the beginning of a periodic temporary snapshot) while 'snap -a' is running; in this case thread T is archThread, updating the label in block B for a dirty block A which it has just sent to venti. But there are many other cases where a thread locks two blocks and marks one dirty, which are vulnerable to this deadlock if a snapshot occurs just at the wrong time. The solution is simple (though I only realised it after trying and rejecting several complicated ones). It turns out that c->dirtylk serves no useful purpose and can be eliminated altogether. Its use is explained by a comment in cacheFlush: /* * Lock c->dirtylk so that more blocks aren't being dirtied * while we try to write out what's already here. * Otherwise we might not ever finish! */ But whenever cacheFlush is called with wait=1, a write lock is already held on the "epoch lock" fs->elk, which prevents any new 9p requests from touching the file system. Any dirtying of blocks is limited to internal fossil activities for requests which were already in progress when cacheFlush was called, and thus strictly bounded even without c->dirtylk. Patch is fossil-snap-deadlock.