Date: Wed, 11 Nov 2020 19:00:19 -0500
From: Rich Felker
To: musl@lists.openwall.com
Subject: Re: [musl] Getting rid of vmlock
Message-ID: <20201112000018.GO534@brightrain.aerifal.cx>
In-Reply-To: <20201111204119.GN534@brightrain.aerifal.cx>

On Wed, Nov 11, 2020 at 03:41:20PM -0500, Rich Felker wrote:
> The vmlock mechanism is a horrible hack to work around Linux's robust
> mutex system being broken; it inherently has a race of the following
> form:
>
>        thread A                 thread B
> 1.     put M in pending
> 2.     unlock M
> 3.                              lock M
> 4.                              unlock M
> 5.                              destroy M
> 6.                              unmap M's memory
> 7.                              mmap something new at same address
> 8.     remove M from pending
>
> whereby, if the process is async terminated (e.g. by SIGKILL) right
> before line 8 (removing M from pending), the kernel will clobber
> unrelated memory. Normally this is harmless since the process is
> dying, but it's possible this unrelated memory is a new MAP_SHARED
> PROT_WRITE file mapping, and now you have file corruption.
>
> musl's vmlock mechanism avoids this by taking a sort of lock (just a
> usage counter) while a robust list pending slot is in use, and
> disallowing munmap or mmap/mremap with MAP_FIXED until the counter
> reaches zero. This has the unfortunate consequence of making the
> mmap family non-AS-safe. Thus I'd really like to get rid of it.
>
> Right now we're preventing the problem by forcing the munmap
> operation (line 6) to wait for the delisting from pending (line 8).
> But we could instead just delay the lock (line 3) of a
> process-shared robust mutex until all pending operations in the
> current process finish. Since the window (between lines 2 and 8)
> only exists until pthread_mutex_unlock returns in thread A, there is
> no way for thread B to observe M being ready to
> destroy/unmap/whatever without a lock or trylock operation on M. So
> this seems to work.

This isn't entirely true. As soon as the atomic release of M happens
in thread A, thread C in a different process can see it and can
inform thread B via other channels, bypassing the lock in line 3 but
still creating a viable synchronization path by which B would be
entitled to unmap/destroy/reuse M's memory.

I hope this can be reasonably solved if not just pthread_mutex_lock
(for the pshared+robust case) but also pthread_mutex_destroy and the
mmap family do the following:

> takes the lock (line 3), rather than waiting on the pending count
> to reach zero, if it sees a nonzero count, it can walk the list and
> *perform the removal itself* of M from any pending slot it finds M
> in.
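
Concretely, the idea is something like the following rough sketch
(made-up names throughout -- tcb, all_threads, pending_count,
clear_pending_for are not the actual musl internals; each thread's
"pending slot" stands in for the list_op_pending field of the
robust_list_head it registered via set_robust_list(2)):

    #include <stdatomic.h>

    struct tcb {                       /* hypothetical thread descriptor */
        _Atomic(void *) pending;       /* this thread's pending slot */
        struct tcb *next;
    };

    extern struct tcb *all_threads;    /* hypothetical global thread list */
    extern _Atomic int pending_count;  /* nonzero while any slot is live */

    /* Called from pthread_mutex_lock (pshared+robust case) or
     * pthread_mutex_destroy once m has already been observed unlocked:
     * instead of waiting for pending_count to reach zero, delist m
     * from any pending slot ourselves. */
    static void clear_pending_for(void *m)
    {
        if (atomic_load(&pending_count) == 0) return;
        for (struct tcb *t = all_threads; t; t = t->next) {
            /* A slot still pointing at m is stale: the unlock of m
             * already completed, or we couldn't have acquired it.
             * Clearing it is idempotent with the owner's own clear. */
            if (atomic_load(&t->pending) == m)
                atomic_store(&t->pending, (void *)0);
        }
    }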
This isn't quite as good as I'd like in terms of getting rid of the
interaction, but at least it gets rid of the AS-unsafety of the mmap
family and the potential for arbitrarily long delays.

One slight difficulty here is that, unlike pthread_mutex_lock or
pthread_mutex_destroy, mmap-family functions don't have a particular
mutex whose exit from pending status they're waiting for. You can't
just clear all pending slots, because one of them might not yet be
unlocked, and then there'd be a window where it would be left locked
if the process died. You also can't just skip mutexes that still have
an owner on the assumption that they're not yet unlocked, since a
thread in another process might have acquired them after the thread
with the pending slot unlocked.

At first glance, it looks like it suffices to just check whether the
mutex pointed to by thread X's pending slot is still owned by thread
X. If not, it no longer belongs in the pending slot, and the slot can
be cleared. However, this is essentially the same check the kernel
does during robust_list processing, and it seems to have the same
race: if thread C from another process locks the mutex, then unlocks
and destroys it and reuses the shared memory it lived in for another
purpose, the memory might end up containing a value that matches the
old owner's tid. Then we skip removing it from the pending slot, and
we can end up with the original memory clobbering by the kernel all
over again.

I don't see any good out:

- There's a high-tech solution involving a custom asm robust-unlock
  primitive akin to the cancellation mechanism, whereby a signal
  context can determine whether the thread has performed the pending
  unlock yet, but that's really way too much machinery.

- There's the option of making the pending-slot critical section
  block signals, so that it can't delay execution if we just wait for
  it like we're doing now, but that imposes 2+ syscalls on every
  robust mutex unlock.

- With the signal handler wrapping idea I proposed, that becomes a
  lot less expensive: signal blocking can be replaced with a
  signal-deferring critical section at almost no cost. But this is
  contingent on a big change.

- We could just punt on this: do as described above to get rid of the
  AS-unsafe blocking of the mmap family, and deem the Rube Goldberg
  machine needed to set up the erroneous kernel behavior sufficiently
  unrealistic that it doesn't matter.

- We could just leave it as-is. I don't like this, because the mmap
  family is AS-unsafe and subject to blocking arbitrarily long due to
  a signal handler in another thread.

This is all rather disappointing, as I thought I had a nice path
towards getting rid of this mess. But it does make the signal handler
wrapping idea seem increasingly appealing...

There is another awful idea that imposes syscall cost on every robust
mutex unlock but eliminates the need for this pending-slot juggling
entirely: make the unlock and the pending-slot clear atomic with each
other with respect to process exit. This means using a syscall to
perform both writes, in a way that can't sleep or return to userspace
with only one of them written. I was hoping FUTEX_WAKE_OP could do
this, but it doesn't look like it can (this function is appealing
because it also does the wake, and it's present on any kernel that
can do robust mutexes). The other plausible option is
process_vm_writev, but it's only available on newer kernels and may
be blocked by policy. So this seems hard to make work reliably.
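
For what it's worth, the process_vm_writev variant would look
something like this (a hypothetical sketch, not a worked-out
implementation; whether the kernel really performs both writes
atomically with respect to process exit, rather than completing
partially, is part of what's in question above):

    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <unistd.h>

    /* Ask the kernel to perform both stores -- the unlock of the
     * futex word and the clearing of the pending slot -- in a single
     * syscall, so the process can't be torn down with only one of
     * them written.  A FUTEX_WAKE would still be needed afterwards
     * if there were waiters. */
    static int unlock_and_clear_pending(int *futexword, void **pending_slot)
    {
        static const int unlocked = 0;
        static void *const cleared = 0;
        struct iovec local[2] = {
            { (void *)&unlocked, sizeof unlocked },
            { (void *)&cleared, sizeof cleared },
        };
        struct iovec remote[2] = {
            { futexword, sizeof *futexword },
            { pending_slot, sizeof *pending_slot },
        };
        ssize_t want = sizeof unlocked + sizeof cleared;
        return process_vm_writev(getpid(), local, 2, remote, 2, 0)
            == want ? 0 : -1;
    }

Rich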