From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Fixing multithreaded set*id() AS-safety
Date: Sat, 10 Jan 2015 00:33:21 -0500
Message-ID: <20150110053321.GD4574@brightrain.aerifal.cx>
In-Reply-To: <20141220033918.GA3273@brightrain.aerifal.cx>
To: musl@lists.openwall.com

On Fri, Dec 19, 2014 at 10:39:18PM -0500, Rich Felker wrote:
> Strategy 2: Use the kernel, via /proc/self/task, to enumerate
> threads. Multithreaded set*id() would have to fail on systems with
> /proc not mounted or non-Linux systems that don't emulate
> /proc/self/task. The benefit of this approach is that we can
> eliminate basically all need for synchronization between important,
> performance-critical operations (pthread_create and pthread_exit)
> and the ugly set*id() mess. The only requirement seems to be
> ensuring that unboundedly many new threads can't be created during
> set*id() (imagine a chain of threads creating new threads and
> terminating, which reading /proc/self/task might never catch up
> with), but this can easily be precluded with a global flag that
> blocks forward progress in pthread_create (via a futex wait) if it's
> set, and no synchronization would be required to access the flag
> since it only needs to block unboundedly many new threads. There may
> also be some complicated logic for handling threads that exit after
> they're signaled but before the signal is delivered, but I think
> it's doable.

I think we _have_ to go with strategy 2 to guarantee a security
invariant I want: the impossibility of code sharing the process's
memory space retaining elevated privileges after setuid() has
reported that privileges were successfully dropped.

With strategy 1 (and the current code), __synccall happily ignores
threads that have already "passed the point of no return" in exiting;
signaling them is impossible since they've already blocked all
signals, and they may not even have a stack anymore. But such a
thread with elevated privileges is still dangerous: while it's in
this state, the unprivileged, post-setuid() process could mmap() over
top of the code it's running and thereby execute arbitrary code with
the elevated privileges which were supposed to have been fully
relinquished.

To avoid this issue, __synccall MUST ensure that it has captured all
kernel threads which are still part of the process (the set of
threads sharing memory). It's not enough to know that the logical
thread (from a userspace perspective) is dead.
This could be solved via the kernel's clearing of the tid futex when
a thread exits, but that would require having a reaper thread for
detached threads rather than having them unmap their own stacks, and
it would impose lots of expensive bookkeeping and design constraints
on the implementation. I don't think that's remotely acceptable. So
we're left with the alternative: using /proc/self/task to observe the
set of live kernel threads.

Unfortunately, this is not easy either. I was actually almost to the
point of ruling out strategy 2 because it's so hard. What makes it
hard is that we lack any good way to ensure that other threads make
forward progress. The thread calling set*id() has no way to
relinquish control until the signaled thread(s) respond, because it
doesn't even know whether there are any left to respond. It's
possible that all the threads it signaled were in the exiting state
and exit before they get the signal. So it has to keep re-scanning
the list until it's caught them all. However, if any threads have
priority lower than the signaling thread's, they may never get a
chance to run. Sleeping is the only obvious way to let them run, but
that's rather hideous and going to be observably slow.

So here's the best alternative I came up with:

1. Pick any thread from /proc/self/task that hasn't yet responded,
   write its tid to a futex, and use FUTEX_LOCK_PI to donate priority
   to it. However (!) it's possible that the thread already exited
   and the tid was recycled by another process. So use a very short
   timeout so that, in the worst case, we just get a minor slowdown
   if this (extremely rare) tid-reuse race happens.

2. When a target thread gets the signal, it uses FUTEX_UNLOCK_PI to
   relinquish the donated priority and notify the signaling thread
   that it got the signal.

3. Repeat until there are no threads in the /proc/self/task list that
   haven't already been caught.

There should be a lot of opportunities to make this more efficient in
typical cases, e.g. just signaling all threads without any waiting
first, and only doing the fancy stuff for threads that haven't
already responded.

Rich