[9fans] fossil deadlock

9fans - fans of the OS Plan 9 from Bell Labs
 help / color / mirror / Atom feed

* [9fans] fossil deadlock
@ 2011-09-08 12:48 rod
  2011-09-15 14:28 ` Ethan Grammatikidis
  2012-03-22 11:57 ` Richard Miller
  0 siblings, 2 replies; 6+ messages in thread
From: rod @ 2011-09-08 12:48 UTC (permalink / raw)
  To: 9fans

I can easily get fossil to deadlock. It would be great if anyone with
fossil internals expertise could comment. As far as I can it tell from
a stack dump it usually goes like:

The snapshot thread has called cacheFlush with "wait" set to 1 to
ensure all blocks are flushed to disk before continuing. It acquires
"dirtylk" and then loops waiting for ndirty to become zero.  In the
loop it wakes "flush" to kick the flush thread and then sleeps on
"flushwait".

Now the unlink thread tries to mark a block as dirty, and it gets
stuck in blockDirty. It needs to increment ndirty, but can't acquire
the dirtylk, as the snapshot thread is holding it.

The flush thread is in _cacheLocalLookup.  The block it was trying to
return was in the process of being written by the disk thread so it
waited for the io to complete.  Subsequently the write finished and
the block was declared "clean" but was grabbed by the unlink thread.
The flush thread now can't reacquire the block lock on return from
vtSleep(b->ioready).

Does this sound feasible? Even if so, I'm not sure how to fix it.

- rod

Further notes:

Probably not related, but occasionally I get an error similar to the
following:

fossil: diskReadRaw failed: /usr/glenda/test/fossil: score 0x00643e42: part=label block 6569538: illegal block address
archive(0, 0xe51246a5): cannot find block: error reading block 0x00643e42
archWalk 0xe51246a5 failed; ptr is in 0x173d offset 0
archiveBlock 0x173d: error reading block 0x00643e42

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] fossil deadlock
  2011-09-08 12:48 [9fans] fossil deadlock rod
@ 2011-09-15 14:28 ` Ethan Grammatikidis
  2011-09-15 14:32   ` erik quanstrom
  2012-03-22 11:57 ` Richard Miller
  1 sibling, 1 reply; 6+ messages in thread
From: Ethan Grammatikidis @ 2011-09-15 14:28 UTC (permalink / raw)
  To: 9fans

On Thu, 08 Sep 2011 13:48:23 +0100
rod@hemiola.co.uk wrote:

> I can easily get fossil to deadlock. It would be great if anyone with
> fossil internals expertise could comment.

Every now and then someone studies fossil, having the belief that if only it were understood it could be made to work properly. No-one has succeeded in gaining "fossil internals expertise" yet, and the filesystem is over 10 years old.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] fossil deadlock
  2011-09-15 14:28 ` Ethan Grammatikidis
@ 2011-09-15 14:32   ` erik quanstrom
  0 siblings, 0 replies; 6+ messages in thread
From: erik quanstrom @ 2011-09-15 14:32 UTC (permalink / raw)
  To: 9fans

On Thu Sep 15 10:31:09 EDT 2011, eekee57@fastmail.fm wrote:
> On Thu, 08 Sep 2011 13:48:23 +0100 rod@hemiola.co.uk wrote:
>
> > I can easily get fossil to deadlock.  It would be great if anyone
> > with fossil internals expertise could comment.
> >
> Every now and then someone studies fossil, having the belief that if
> only it were understood it could be made to work properly.  No-one has
> succeeded in gaining "fossil internals expertise" yet, and the
> filesystem is over 10 years old.

i think that's an excellent summary of the issues.

- erik



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] fossil deadlock
  2011-09-08 12:48 [9fans] fossil deadlock rod
  2011-09-15 14:28 ` Ethan Grammatikidis
@ 2012-03-22 11:57 ` Richard Miller
  2012-03-22 13:13   ` Gorka Guardiola
  2012-03-22 13:41   ` David du Colombier
  1 sibling, 2 replies; 6+ messages in thread
From: Richard Miller @ 2012-03-22 11:57 UTC (permalink / raw)
  To: 9fans

Thanks to detective work by Rod at hemiola and David du Colombier,
I've been able to find & fix the long-standing deadlock in fossil's
snapshot code.

The problem is the lock c->dirtylk, which prevents any thread from
dirtying new blocks while flushThread is writing old dirty blocks to
disk.  Deadlock occurs when a thread T tries to dirty a block B while
holding a lock on an already-dirty block A. T can't proceed until
c->dirtylk is unlocked, which won't happen until flushThread has written
out block A; but block A can't be written out until T unlocks it.

The deadlock is usually observed when a fossil 'sync' occurs (eg at the
beginning of a periodic temporary snapshot) while 'snap -a' is running;
in this case thread T is archThread, updating the label in block B for a
dirty block A which it has just sent to venti.  But there are many other
cases where a thread locks two blocks and marks one dirty, which are
vulnerable to this deadlock if a snapshot occurs just at the wrong time.

The solution is simple (though I only realised it after trying and
rejecting several complicated ones).  It turns out that c->dirtylk
serves no useful purpose and can be eliminated altogether.  Its use
is explained by a comment in cacheFlush:
	/*
	 * Lock c->dirtylk so that more blocks aren't being dirtied
	 * while we try to write out what's already here.
	 * Otherwise we might not ever finish!
	 */
But whenever cacheFlush is called with wait=1, a write lock is already
held on the "epoch lock" fs->elk, which prevents any new 9p requests
from touching the file system.  Any dirtying of blocks is limited to
internal fossil activities for requests which were already in progress
when cacheFlush was called, and thus strictly bounded even without
c->dirtylk.

Patch is fossil-snap-deadlock.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] fossil deadlock
  2012-03-22 11:57 ` Richard Miller
@ 2012-03-22 13:13   ` Gorka Guardiola
  2012-03-22 13:41   ` David du Colombier
  1 sibling, 0 replies; 6+ messages in thread
From: Gorka Guardiola @ 2012-03-22 13:13 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mar 22, 2012, at 12:57 PM, Richard Miller <9fans@hamnavoe.com> wrote:

> Thanks to detective work by Rod at hemiola and David du Colombier,
> I've been able to find & fix the long-standing deadlock in fossil's
> snapshot code.
>
> Patch is fossil-snap-deadlock.

Kudos to you all!!

G.




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [9fans] fossil deadlock
  2012-03-22 11:57 ` Richard Miller
  2012-03-22 13:13   ` Gorka Guardiola
@ 2012-03-22 13:41   ` David du Colombier
  1 sibling, 0 replies; 6+ messages in thread
From: David du Colombier @ 2012-03-22 13:41 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Thanks for your work. I'm glad we've come to the same conclusion.

In my side, I've also removed the c->dirtylk lock. It solved the
deadlock problem, but I was still unsure it was the right thing to do.

I've tried to obtain more information from the Fossil authors,
but without success.

Since we both fixed it the same way, this is probably the correct
solution.

--
David du Colombier

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-03-22 13:41 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-08 12:48 [9fans] fossil deadlock rod
2011-09-15 14:28 ` Ethan Grammatikidis
2011-09-15 14:32   ` erik quanstrom
2012-03-22 11:57 ` Richard Miller
2012-03-22 13:13   ` Gorka Guardiola
2012-03-22 13:41   ` David du Colombier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).