From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Fixing multithreaded set*id() AS-safety
Date: Sat, 10 Jan 2015 00:33:21 -0500
Message-ID: <20150110053321.GD4574@brightrain.aerifal.cx>
In-Reply-To: <20141220033918.GA3273@brightrain.aerifal.cx>
To: musl@lists.openwall.com

On Fri, Dec 19, 2014 at 10:39:18PM -0500, Rich Felker wrote:
> Strategy 2: Use the kernel, via /proc/self/task, to enumerate
> threads. Multithreaded set*id() would have to fail on systems with
> /proc not mounted or non-Linux systems that don't emulate
> /proc/self/task. The benefit of this approach is that we can
> eliminate basically all need for synchronization between important,
> performance-critical operations (pthread_create and pthread_exit)
> and the ugly set*id() mess. The only requirement seems to be
> ensuring that unboundedly many new threads can't be created during
> set*id() (imagine a chain of threads creating new threads and
> terminating, which reading /proc/self/task might never catch up
> with), but this can easily be precluded with a global flag that
> blocks forward progress in pthread_create (via a futex wait) if it's
> set, and no synchronization would be required to access the flag
> since it only needs to block unboundedly many new threads. There may
> also be some complicated logic for handling threads that exit after
> they're signaled but before the signal is delivered, but I think
> it's doable.

I think we _have_ to go with strategy 2 to guarantee a security
invariant I want: the impossibility of code sharing the process's
memory space retaining elevated privileges after setuid() has
reported that privileges were successfully dropped.

With strategy 1 (and the current code), __synccall happily ignores
threads that have already "passed the point of no return" in exiting;
signaling them is impossible since they've already blocked all
signals, and they may not even have a stack anymore. But such a
thread with elevated privileges is still dangerous: while it's in
this state, the unprivileged, post-setuid() process could mmap() over
top of the code it's running and thereby execute arbitrary code with
the elevated privileges which were supposed to have been fully
relinquished.

To avoid this issue, __synccall MUST ensure that it has captured all
kernel threads which are still part of the process (the set of
threads sharing memory). It's not enough to know that the logical
thread (from a userspace perspective) is dead.
This could be solved via the kernel's clearing of the tid futex when
a thread exits, but that would require having a reaper thread for
detached threads rather than having them unmap their own stacks, and
it would impose lots of expensive bookkeeping and design constraints
on the implementation. I don't think that's remotely acceptable. So
we're left with the alternative: using /proc/self/task to observe the
set of live kernel threads.

Unfortunately, this is not easy either. I was actually almost to the
point of ruling out strategy 2 because it's so hard. What makes it
hard is that we lack any good way to ensure that other threads make
forward progress. The thread calling set*id() has no way to
relinquish control until the signaled thread(s) respond, because it
doesn't even know whether there are any left to respond. It's
possible that all the threads it signaled were in the exiting state
and exit before they get the signal. So it has to keep re-scanning
the list until it's caught them all. However, if any threads have
priority lower than the signaling thread's, they may never get a
chance to run. Sleeping is the only obvious way to let them run, but
that's rather hideous and going to be observably slow.

So here's the best alternative I came up with:

1. Pick any thread from /proc/self/task that hasn't yet responded,
   write its tid to a futex, and use FUTEX_LOCK_PI to donate priority
   to it. However (!) it's possible that the thread already exited
   and the tid was recycled by another process. So use a very short
   timeout so that, in the worst case, we just get a minor slowdown
   if this (extremely rare) tid-reuse race happens.

2. When a target thread gets the signal, it uses FUTEX_UNLOCK_PI to
   relinquish the donated priority and notify the signaling thread
   that it got the signal.

3. Repeat until there are no threads in the /proc/self/task list that
   haven't already been caught.

There should be a lot of opportunities to make this more efficient in
typical cases, e.g. just signaling all threads without any waiting
first, and only doing the fancy stuff for threads that haven't
already responded.

Rich