From: Alexey Izbyshev
Subject: Re: __synccall: deadlock and reliance on racy /proc/self/task
Date: Sat, 09 Feb 2019 21:33:32 +0300
To: musl@lists.openwall.com, Szabolcs Nagy
Message-ID: <6e0306699add531af519843de20c343a@ispras.ru>
In-Reply-To: <20190209162101.GN21289@port70.net>
References: <1cc54dbe2e4832d804184f33cda0bdd1@ispras.ru> <20190207183626.GQ23599@brightrain.aerifal.cx> <20190208183357.GX23599@brightrain.aerifal.cx> <20190209162101.GN21289@port70.net>

On 2019-02-09 19:21, Szabolcs Nagy wrote:
> * Rich Felker [2019-02-08 13:33:57 -0500]:
>> On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
>> > On 2/7/19 9:36 PM, Rich Felker wrote:
>> > >Does it work if we force two iterations of the readdir loop with no
>> > >tasks missed, rather than just one, to catch the case of missed
>> > >concurrent additions? I'm not sure. But all this makes me really
>> > >uncomfortable with the current approach.
>> >
>> > I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt
>> > == 0. The test eventually failed in all cases, with 0 retries
>> > requiring only a handful of iterations, 1 -- on the order of 100, 2
>> > -- on the order of 10000 and 3 -- on the order of 100000.
>>
>> Do you have a theory on the mechanism of failure here? I'm guessing
>> it's something like this: there's a thread that goes unseen in the
>> first round, and during the second round, it creates a new thread and
>> exits itself. The exit gets seen (again, it doesn't show up in the
>> dirents) but the new thread it created still doesn't. Is that right?
>>
>> In any case, it looks like the whole mechanism we're using is
>> unreliable, so something needs to be done. My leaning is to go with
>> the global thread list and atomicity of list-unlock with exit.
>
> yes that sounds possible, i added some instrumentation to musl
> and the trace shows situations like that before the deadlock,
> exiting threads can even cause old (previously seen) entries to
> disappear from the dir.
>

Thanks for the thorough instrumentation! Your traces confirm both my theory about the deadlock and the unreliability of /proc/self/task.

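(Aside, for anyone reading along without the code handy: __synccall's readdir loop drives everything off this directory, so the whole mechanism is only as good as the enumeration itself. Below is a standalone sketch, not the musl code, of what that enumeration amounts to; the kernel gives no consistent snapshot, so tasks created or exiting during the scan may simply not show up.)

#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *idle_thread(void *arg)
{
    sleep(1);                /* just exist while main scans the directory */
    return 0;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], 0, idle_thread, 0);

    DIR *d = opendir("/proc/self/task");
    if (!d) return 1;
    struct dirent *de;
    while ((de = readdir(d))) {
        if (de->d_name[0] == '.') continue;          /* skip "." and ".." */
        int tid = atoi(de->d_name);
        if (tid != getpid())
            printf("other task: tid %d\n", tid);     /* what __synccall would signal */
    }
    closedir(d);

    for (int i = 0; i < 2; i++)
        pthread_join(t[i], 0);
    return 0;
}

(Compile with -pthread; this toy scan doesn't try to provoke the racy cases, it just shows the primitive everything else rests on.)
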
I'd also done a very light instrumentation just before I got your email, but it took me a while to understand the output I got (see below). I've used test-setuid-mismatch.c from my first email, with musl modified to avoid the deadlock (by removing the kill() loop from __synccall) and to retry the readdir() loop several times when miss_cnt == 0. I've also added tracing at several points:

* pthread_exit (just before __unmapself)
* __synccall: before the readdir() loop (prints the retry count)
* __synccall: the tid of the current dentry (only if tid != pid and we haven't seen it in the chain)
* __synccall: errors from tgkill() and futex() in the readdir() loop

A couple of traces with max retry == 1:

--iter: 43
retry 0
tid 21089
tid 21091
retry 1
exit 21091
--iter: 44
retry 0
tid 21091
futex: ESRCH
tid 21093
exit 21093
retry 1
mismatch: tid 21094: 0 != 23517
------------------
--iter: 3
retry 0
tid 15974
futex: ESRCH
tid 15977
retry 1
tid 15978
--iter: 4
exit 15977
retry 0
tid 15977
tid 15978
exit 15978
retry 1
tid 15978
tgkill: ESRCH
mismatch: tid 15979: 0 != 23517

And with max retry == 2:

--iter: 19812
retry 0
tid 29606
retry 1
retry 2
exit 29606
--iter: 19813
retry 0
tid 29606
futex: ESRCH
tid 29609
exit 29609
retry 1
tid 29609
tgkill: ESRCH
retry 2
mismatch: tid 29610: 23517 != 0
------------------
--iter: 9762
retry 0
tid 14859
tid 14860
retry 1
retry 2
--iter: 9763
exit 14859
retry 0
tid 14859
exit 14860
tid 14860
retry 1
retry 2
mismatch: tid 14862: 23517 != 0

So it's clear that /proc/self/task can easily miss concurrently created threads, at least when other threads are exiting at the same time. And Szabolcs's traces indicate that even threads already running in userspace can be missed.

Now, about the strange output I mentioned. Consider one of the fragments above:

--iter: 4
exit 15977
retry 0
tid 15977
tid 15978
exit 15978
retry 1
tid 15978
tgkill: ESRCH
mismatch: tid 15979: 0 != 23517

Note that "tid 15978" is printed twice. Recall that it's printed only if we haven't seen the tid in the chain. But how is it possible that we hadn't seen it *either* time? Since we waited on the futex the first time and got the lock, the signal handler must have unlocked it. There is even a comment before the futex() call:

/* Obtaining the lock means the thread responded. ESRCH
 * means the target thread exited, which is okay too. */

If the signal handler reached the futex unlock code, it must have updated the chain, and we should see the tid in the chain on the next retry and not print it. Apparently, there is another reason why futex(FUTEX_LOCK_PI) can succeed: the owner is exiting concurrently (which is also indicated by the subsequent failure of tgkill() with ESRCH). So obtaining the lock doesn't necessarily mean that the owner responded: it may also mean that the owner is (about to be?) dead.

Alexey
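
P.S. If I'm reading the PI-futex machinery right, that last guess matches the owner-died handling described in futex(2): when the task that owns a PI futex exits, the kernel hands the lock to the next waiter and sets FUTEX_OWNER_DIED in the futex word, so the waiter's FUTEX_LOCK_PI returns success even though nobody ever ran the unlock path. Here is a small, timing-based sketch (not the musl code; the names and the 200 ms sleep are just for illustration) that tries to show this in isolation:

#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int futex_word;   /* PI futex word: holds the owner's tid while "locked" */
static sem_t locked;

static void *owner(void *arg)
{
    /* Take the lock in userspace by storing our tid, tell main, then
     * exit without ever unlocking -- like a thread dying mid-__synccall. */
    futex_word = syscall(SYS_gettid);
    sem_post(&locked);
    usleep(200000);       /* give main time to block in FUTEX_LOCK_PI */
    return 0;             /* the kernel task exits while still owning the futex */
}

int main(void)
{
    pthread_t t;
    sem_init(&locked, 0, 0);
    pthread_create(&t, 0, owner, 0);
    sem_wait(&locked);

    /* The owner never unlocks, so if this returns 0 it can only be the
     * kernel's owner-died handling granting us the lock. */
    int r = syscall(SYS_futex, &futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
    printf("lock_pi: %s, word=%#x, OWNER_DIED=%d\n",
           r ? strerror(errno) : "acquired",
           (unsigned)futex_word, !!(futex_word & FUTEX_OWNER_DIED));

    pthread_join(t, 0);
    return 0;
}

On my reading, the futex word afterwards should contain the main thread's tid with FUTEX_OWNER_DIED set; if the owner thread happens to have fully exited before main blocks, the call fails with ESRCH instead, which is the other case the __synccall comment mentions.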