From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.3 required=5.0 tests=MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_MED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,
	T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4
Received: (qmail 27230 invoked from network); 10 Jan 2024 21:24:53 -0000
Received: from second.openwall.net (193.110.157.125)
  by inbox.vuxu.org with ESMTPUTF8; 10 Jan 2024 21:24:53 -0000
Received: (qmail 30108 invoked by uid 550); 10 Jan 2024 21:23:17 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 30073 invoked from network); 10 Jan 2024 21:23:16 -0000
Date: Wed, 10 Jan 2024 16:24:53 -0500
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Cc: linux-api@vger.kernel.org
Message-ID: <20240110212453.GR4163@brightrain.aerifal.cx>
References: <ec138086-c5b9-4ca2-9da5-bef8b14de27d@dustri.org>
 <20240109190726.GO4163@brightrain.aerifal.cx>
 <20240110015550.GP4163@brightrain.aerifal.cx>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20240110015550.GP4163@brightrain.aerifal.cx>
User-Agent: Mutt/1.5.21 (2010-09-15)
Subject: Re: [musl] Protect pthreads' mutexes against use-after-destroy

On Tue, Jan 09, 2024 at 08:55:50PM -0500, Rich Felker wrote:
> On Tue, Jan 09, 2024 at 02:07:26PM -0500, Rich Felker wrote:
> > On Tue, Jan 09, 2024 at 03:37:17PM +0100, jvoisin wrote:
> > > Ohai,
> > > 
> > > as discussed on irc, Android's bionic has a check to prevent
> > > use-after-destroy on phtread mutexes
> > > (https://github.com/LineageOS/android_bionic/blob/e0aac7df6f58138dae903b5d456c947a3f8092ea/libc/bionic/pthread_mutex.cpp#L803),
> > > and musl doesn't.
> > > 
> > > While odds are that this is a super-duper common bug, it would still be
> > > nice to have this kind of protection, since it's cheap, and would
> > > prevent/make it easy to diagnose weird states.
> > > 
> > > Is this something that should/could be implemented?
> > > 
> > > o/
> > 
> > I think you meant that the odds are it's not common. There's already
> > enough complexity in the code paths for supporting all the different
> > mutex types that my leaning would be, if we do any hardening for
> > use-after-destroy, that it should probably just take the form of
> > putting the object in a state that will naturally deadlock or error
> > rather than adding extra checks to every path where it's used.
> > 
> > If OTOH we do want it to actually trap in all cases where it's used
> > after destroy, the simplest way to achieve that is probably to set it
> > up as a non-robust non-PI recursive or errorchecking mutex with
> > invalid prev/next pointers and owner of 0x3fffffff. Then the only
> > place that would actually have to have an explicit trap is trylock in
> > the code path:
> > 
> >         if (own == 0x3fffffff) return ENOTRECOVERABLE;
> > 
> > where it could trap if type isn't robust. The unlock code path would
> > trap on accessing invalid prev/next pointers.
> 
> Unfortunately I discovered a problem we need to deal with in
> researching for this: at some point Linux quietly changed the futex
> ABI, so that bit 29 is no longer reserved but potentially a tid bit.
> This was documented in 9c40365a65d62d7c06a95fb331b3442cb02d2fd9 but
> apparently actually happened at the source level a long time before
> that. So, we cannot assume 0x3fffffff is not a valid tid, and thereby
> cannot assume 0x7fffffff is not equal to ownerdead|valid_tid.
> 
> This probably means we need to find a way to encode "not recoverable"
> as 0x40000000, as 0 is now the _only_ value in the low-30-bits that
> can't potentially be a valid tid.
> 
> I'll look at this more over the next day or two. It's probably fixable
> but requires fiddling with delicate logic.
> 
> Note that the only in-the-wild breakage possible is on systems where
> the pid/tid limit has been set extremely high, where attempts to lock
> a recursive or errorchecking mutex owned by a thread with tid
> 0x3fffffff could malfunction.

OK, more fun. The kernel documentation (locking/robust-futex-ABI.rst)
is *wrong*. It claims that, when processing the robust list for a
dying task, the only actions are:

 1) if bit 31 (0x80000000) is set in that word, then attempt a futex
    wakeup on that address, which will waken the next thread that has
    used to the futex mechanism to wait on that address, and
 2) atomically set  bit 30 (0x40000000) in the 'lock word'.

However, 2 is incorrect. There is no "set bit 30" operation in the
kernel. Rather, it atomically replaces the old lock value (tid of the
dying task, possibly with bit 31 added on for "waiters present") with
0x40000000 (no waiters) or 0xc0000000 (maybe waiters), clearing out
the tid of the dying task.

Thus, we cannot use 0x40000000 as the "not recoverable" value; this is
the only value the kernel ever used for "owner died" state.

At first this sounds good: we should be able to use any value with bit
30 set, including something like 0x7fffffff like we're using now, as
the "not recoverable" state. Unfortunately, that's not quite true. The
kernel's robust list processing at task death masks off bit 30, so if
a task with tid 0x3fffffff were to die with the unrecoverable mutex in
its robust list or or pending slot, it would wrongly get converted
back by the kernel from "not recoverable" state to "owner died" state.

Naively one would expect an already not-recoverable not to be able to
be added to the robust list, and this is correct; however, it can be
added to the pending slot if a thread attempting to lock it was
suspended after checking the old state, before it was in
not-recoverable state, but before attempting the atomic CAS to take
the lock, seeing that fail, and removing it from the pending slot.

I don't see any solution to this problem without having one or more
values of the low 30 bits (bits 0-29) that are reserved and guaranteed
never to match a real task. AIUI, Linux presently doesn't support more
than 22 bits of tid anyway, so this probably is not a real issue, but
it is an issue for future-proofing the user-kernel interface. What
we're doing now (using 0x7fffffff as the unrecoverable value) is
unsafe if Linux or any kernel purporting to honor the Linux syscall
API/ABI ever supports full 30-bit tids.

I think we should probably ask the kernel folks to reinstate the
original specification that bit 29 (0x20000000) is reserved and
guaranteed not to be present in a valid tid. If that's deemed
inappropriate, just reserving a single value 0x3fffffff that's
guaranteed not to be a real tid should suffice. Either would make what
we're doing now (based on the old pre-2020 documentation) valid and
future-proof once again.

Rich