From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, HTML_MESSAGE,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 23294 invoked from network); 4 Apr 2023 17:16:14 -0000 Received: from tb-ob1.topicbox.com (64.147.108.173) by inbox.vuxu.org with ESMTPUTF8; 4 Apr 2023 17:16:14 -0000 Received: from tb-mx1.topicbox.com (tb-mx1.nyi.icgroup.com [10.90.30.61]) by tb-ob1.topicbox.com (Postfix) with ESMTP id 5B59929E76 for ; Tue, 4 Apr 2023 13:16:10 -0400 (EDT) (envelope-from bounce.mMa4bc3ee3ea80defbb239f382.r522be890-2105-11eb-b15e-8d699134e1fa@9fans.bounce.topicbox.com) Received: by tb-mx1.topicbox.com (Postfix, from userid 1132) id 7A5C717A732A; Tue, 4 Apr 2023 13:16:10 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed; d=9fans.net; h=from:to :message-id:date:mime-version:content-type :content-transfer-encoding:list-help:list-id:list-post :list-subscribe:reply-to:subject:list-unsubscribe; s=dkim-1; t= 1680628570; x=1680714970; bh=tEVURg8/zQvqtD/OKupRMoQUbA+hNE2NOTN IKVSDVg8=; b=DaB/Wa+6a/7RKAzigoyHpNnRgDq6U5yZk5E5f2iqlN3R82zmHLx aXU+cK78qgyqU7eNjW+AfhCQIvkLgMvJq0zgV9YXNwn/RuQOOQii7R5myL5PjS5L SFwWSA8RedqpbLkRnofrWxmGaWtjE6g+hr4NpVALBZFlaI1QhaGiMUu8= From: noam@pixelhero.dev To: 9fans <9fans@9fans.net> Message-Id: <16806285560.EA879D18D.575584@composer.9fans.topicbox.com> Date: Tue, 4 Apr 2023 13:15:56 -0400 MIME-Version: 1.0 Content-Type: multipart/alternative; boundary=16806285561.Ea12aFb.575584 Content-Transfer-Encoding: 7bit Topicbox-Policy-Reasoning: allow: sender is a member Topicbox-Message-UUID: 5d060da2-d30c-11ed-81d5-5344262d11b0 Archived-At: =?UTF-8?B?PGh0dHBzOi8vOWZhbnMudG9waWNib3guY29tL2dyb3Vwcy85?= =?UTF-8?B?ZmFucy9UMzU0ZmU3MDJlMWU5ZDVlOS1NYTRiYzNlZTNlYTgwZGVmYmIyMzlm?= =?UTF-8?B?MzgyPg==?= List-Help: List-Id: "9fans" <9fans.9fans.net> List-Post: List-Software: Topicbox v0 List-Subscribe: Precedence: list Reply-To: 9fans <9fans@9fans.net> Subject: [9fans] [PATCH] fossil: fix a deadlock in the caching logic List-Unsubscribe: , Topicbox-Delivery-ID: 2:9fans:437d30aa-c441-11e9-8a57-d036212d11b0:522be890-2105-11eb-b15e-8d699134e1fa:Ma4bc3ee3ea80defbb239f382:1:gl-z95Sf8qbeFDDeghPzgmqJ6qSVSCptz8mWI7Qh6AE --16806285561.Ea12aFb.575584 Date: Tue, 4 Apr 2023 13:15:56 -0400 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable I've sporadically encountered a deadlock in fossil. Naturally, when your ro= ot file system crashes, it can be hard to debug. My solution: stop having a= root file system. Was able to attach acid using mycroft's tooling from ANT= S, and get a clean stack trace (https://pixelhero.dev/notebook/fossil/stack= s/2023-04-03.1). After a few hours yesterday (https://pixelhero.dev/notebook/fossil/2023-04-= 03.html ), I eventually track= ed down the deadlock. When blockWrite is told to flush a clean block to dis= k - i.e. one which is already flushed - it removes the block from the cache= 's free list, locks the block, detects that it's clean, and then... drops t= he reference. While keeping the block locked. And in the cache. This leak of the lock, of course, means that the *next* access to the block= - which is still in the cache! - hangs indefinitely. This is seen exactly = in the stack trace: _cacheLocal grabs the block from the cache, tries to lock it, and hangs ind= efinitely. Worse, it does so under a call to fileWalk, which holds a differ= ent lock, so the effect spreads out and makes even more of the file system = inaccessible as well (the fileMetaFlush proc hangs waiting on this file loc= k). This patch just ensures we call blockPut on the BioClean path as well, thus= unlocking the block and readding it to the cache's free lists. The patch is on my branch - https://git.sr.ht/~pixelherodev/plan9/commit/1b= f8bd4f44e058261da7e89d87527b12073c9e0f - but I figured I should probably po= st it here as well. If anyone has any other patches that weren't in the 9legacy download as of = ~2018, please let me know! :) --- sys/src/cmd/fossil/cache.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/sys/src/cmd/fossil/cache.c b/sys/src/cmd/fossil/cache.c index f473d211e..2fec44949 100644 --- a/sys/src/cmd/fossil/cache.c +++ b/sys/src/cmd/fossil/cache.c @@ -1203,8 +1203,10 @@ blockWrite(Block *b, int waitlock) fprint(2, "%s: %d:%x:%d iostate is %d in blockWrite\n", argv0, bb->part, bb->addr, bb->l.type, bb->iostate); /* probably BioWriting if it happens? */ - if(bb->iostate =3D=3D BioClean) + if(bb->iostate =3D=3D BioClean){ + blockPut(bb); goto ignblock; + } } blockPut(bb); -- ------------------------------------------ 9fans: 9fans Permalink: https://9fans.topicbox.com/groups/9fans/T354fe702e1e9d5e9-Ma4bc3= ee3ea80defbb239f382 Delivery options: https://9fans.topicbox.com/groups/9fans/subscription --16806285561.Ea12aFb.575584 Date: Tue, 4 Apr 2023 13:15:56 -0400 MIME-Version: 1.0 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
I've sporadically encountered a deadlock i= n fossil. Naturally, when your root file system crashes, it can be hard to = debug. My solution: stop having a root file system. Was able to attach acid= using mycroft's tooling from ANTS, and get a clean stack trace (https://pix= elhero.dev/notebook/fossil/stacks/2023-04-03.1).

=
After a few hours yesterday (https://pixelhero.dev/notebook/fossil/2023-04-03.html= ), I eventually tracked down the deadlock. When blockWrite is told to f= lush a clean block to disk - i.e. one which is already flushed - it removes= the block from the cache's free list, locks the block, detects that it= 's clean, and then... drops the reference. While keeping the block lock= ed. And in the cache.

This leak of the loc= k, of course, means that the *next* access to the block - which is still in= the cache! - hangs indefinitely. This is seen exactly in the stack trace:<= br />

_cacheLocal grabs the block from the cache= , tries to lock it, and hangs indefinitely. Worse, it does so under a call = to fileWalk, which holds a different lock, so the effect spreads out and ma= kes even more of the file system inaccessible as well (the fileMetaFlush pr= oc hangs waiting on this file lock).

This = patch just ensures we call blockPut on the BioClean path as well, thus unlo= cking the block and readding it to the cache's free lists.
<= div>
The patch is on my branch - htt= ps://git.sr.ht/~pixelherodev/plan9/commit/1bf8bd4f44e058261da7e89d87527b120= 73c9e0f - but I figured I should probably post it here as well.

If anyone has any other patches that weren't = in the 9legacy download as of ~2018, please let me know! :)

---
sys/src/cmd/fossil/cache.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/sys/src/cmd/fossil/cache.c b/sys/src/cmd/f= ossil/cache.c
index f473d211e..2fec44949 100644
=
--- a/sys/src/cmd/fossil/cache.c
+++ b/sys/src/cmd/fos= sil/cache.c
@@ -1203,8 +1203,10 @@ blockWrite(Block *b, int= waitlock)
fprint(2, "%s: %d:%x:%d iostate is %d in bl= ockWrite\n",
argv0, bb->part, bb->addr, bb->l= .type, bb->iostate);
/* probably BioWriting if it happen= s? */
- if(bb->iostate =3D=3D BioClean)
+ if(bb->iostate =3D=3D BioClean){
+ blockPut(bb)= ;
goto ignblock;
+ }
}

blockPut(bb);
--
=

= --16806285561.Ea12aFb.575584--