mailing list of musl libc
* [musl] Getting rid of vmlock
@ 2020-11-11 20:41 Rich Felker
  2020-11-12  0:00 ` Rich Felker
  0 siblings, 1 reply; 2+ messages in thread
From: Rich Felker @ 2020-11-11 20:41 UTC (permalink / raw)
  To: musl

The vmlock mechanism is a horrible hack to work around Linux's robust
mutex system being broken; it inherently has a race of the following
form:

	thread A		thread B
1.	put M in pending
2.	unlock M
3.				lock M
4.				unlock M
5.				destroy M
6.				unmap M's memory
7.				mmap something new at same address
8.	remove M from pending

whereby, if the process is asynchronously terminated (e.g. by SIGKILL)
right before line 8 (removing M from pending), the kernel will clobber
unrelated memory. Normally this is harmless since the process is
dying, but it's possible that this unrelated memory is a new MAP_SHARED
PROT_WRITE file mapping, and now you have file corruption.
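
For reference, the window between lines 2 and 8 is inherent to the
robust list protocol: before the atomic release, the unlocking thread
advertises the lock word's address to the kernel (the list_op_pending
slot registered via set_robust_list), and retracts it only after the
release. A minimal sketch of that sequence, with made-up names
(robust_pending, futex_wake_one) standing in for the real internals:

    #include <stdatomic.h>

    struct mutex { _Atomic int lock; };  /* lock word: owner tid or 0 */

    /* per-thread slot the kernel inspects if the process dies;
     * really the list_op_pending field of the registered robust list */
    static _Thread_local _Atomic int *robust_pending;

    static void futex_wake_one(_Atomic int *addr) { (void)addr; /* FUTEX_WAKE, elided */ }

    void robust_unlock(struct mutex *m)
    {
            robust_pending = &m->lock;   /* 1. put M in pending          */
            atomic_store(&m->lock, 0);   /* 2. unlock M; lines 3-7 may   */
            futex_wake_one(&m->lock);    /*    now happen in any thread  */
            robust_pending = 0;          /* 8. remove M from pending     */
    }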

musl's vmlock mechanism avoids this by taking a sort of lock (just a
usage counter) while a robust list pending slot is in use, and
disallowing munmap or mmap/mremap with MAP_FIXED until the counter
reaches zero. This has the unfortunate consequence of making the mmap
family non-AS-safe. Thus I'd really like to get rid of it.
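
Roughly, the mechanism amounts to the following. This is a
self-contained approximation using raw futex syscalls; the real
vmlock.c is written with musl's internal atomics and __wait/__wake
helpers and differs in detail:

    #include <limits.h>
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static _Atomic int vm_users;  /* threads with an armed pending slot */

    static void vm_lock(void)     /* before arming a pending slot */
    {
            atomic_fetch_add(&vm_users, 1);
    }

    static void vm_unlock(void)   /* after clearing the pending slot */
    {
            if (atomic_fetch_sub(&vm_users, 1) == 1)
                    syscall(SYS_futex, &vm_users, FUTEX_WAKE, INT_MAX, 0, 0, 0);
    }

    /* munmap, and mmap/mremap with MAP_FIXED, call this first */
    static void vm_wait(void)
    {
            int v;
            while ((v = atomic_load(&vm_users)))
                    syscall(SYS_futex, &vm_users, FUTEX_WAIT, v, 0, 0, 0);
    }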

Right now we're preventing the problem by forcing the munmap operation
(line 6) to wait for the delisting from pending (line 8). But we could
instead just delay the lock (line 3) of a process-shared robust mutex
until all pending operations in the current process finish. Since the
window (between line 2 and 8) only exists until pthread_mutex_unlock
returns in thread A, there is no way for thread B to observe M being
ready to destroy/unmap/whatever without a lock or trylock operation on
M. So this seems to work.
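
Concretely, the pshared+robust lock path would take over the waiting
that the mmap family does today. A sketch, reusing struct mutex and
vm_wait() from the sketches above and using a plain CAS spin where the
real code would sleep on a futex:

    int robust_pshared_lock(struct mutex *m, int self_tid)
    {
            /* line 3: wait out every pending-slot window in this
             * process before the mutex can be acquired */
            vm_wait();
            for (;;) {
                    int expect = 0;
                    if (atomic_compare_exchange_weak(&m->lock, &expect,
                                                     self_tid))
                            return 0;
                    /* real code: FUTEX_WAIT on m->lock here */
            }
    }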

Both before and after such a change, however, we have a nasty problem.
If thread A is interrupted by a signal handler, or has low priority
(think PI, where the unlock drops its priority), between lines 2 and
8, it's possible for the pending slot removal to be delayed
arbitrarily. This is really bad.

I think the problem is solvable by not just keeping a count of threads
with pending robust list slots, but making them register in a list (or
simply using the existing global thread list). Then when thread B
takes the lock (line 3), rather than waiting on the pending count to
reach zero, if it sees a nonzero count, it can walk the list and
*perform the removal itself* of M from any pending slot it finds M in.
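
A sketch of that idea, with a hypothetical registry (struct tcb,
thread_list_head) standing in for whatever list, or the existing
global thread list, ends up being used; struct mutex is as in the
first sketch:

    struct tcb {
            _Atomic(_Atomic int *) pending;  /* this thread's pending slot */
            struct tcb *next;
    };
    static struct tcb *thread_list_head;     /* assumed stable while walking */

    /* thread B at line 3: instead of waiting for the count to drain,
     * retire any pending slot that still points at M */
    static void clear_pending_refs_to(struct mutex *m)
    {
            for (struct tcb *t = thread_list_head; t; t = t->next) {
                    _Atomic int *expect = &m->lock;
                    atomic_compare_exchange_strong(&t->pending, &expect,
                                                   (_Atomic int *)0);
            }
    }

The CAS only clears a slot that still points at M, so it can't stomp a
slot the owning thread has already cleared or reused for another mutex.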

The vmlock is also used (rather gratuitously, in the same way) for
process-shared barriers. These can be fixed to perform __vm_wait
before returning, rather than imposing that it happen later at unmap
time, and then the vmlock can be replaced with a "threads exiting
pshared barrier" lock that's functionally equivalent to the old
vmlock.


Rich


* Re: [musl] Getting rid of vmlock
  2020-11-11 20:41 [musl] Getting rid of vmlock Rich Felker
@ 2020-11-12  0:00 ` Rich Felker
  0 siblings, 0 replies; 2+ messages in thread
From: Rich Felker @ 2020-11-12  0:00 UTC (permalink / raw)
  To: musl

On Wed, Nov 11, 2020 at 03:41:20PM -0500, Rich Felker wrote:
> The vmlock mechanism is a horrible hack to work around Linux's robust
> mutex system being broken; it inherently has a race of the following
> form:
> 
> 	thread A		thread B
> 1.	put M in pending
> 2.	unlock M
> 3.				lock M
> 4.				unlock M
> 5.				destroy M
> 6.				unmap M's memory
> 7.				mmap something new at same address
> 8.	remove M from pending
> 
> whereby, if the process is asynchronously terminated (e.g. by SIGKILL)
> right before line 8 (removing M from pending), the kernel will clobber
> unrelated memory. Normally this is harmless since the process is
> dying, but it's possible that this unrelated memory is a new MAP_SHARED
> PROT_WRITE file mapping, and now you have file corruption.
> 
> musl's vmlock mechanism avoids this by taking a sort of lock (just a
> usage counter) while a robust list pending slot is in use, and
> disallowing munmap or mmap/mremap with MAP_FIXED until the counter
> reaches zero. This has the unfortunate consequence of making the mmap
> family non-AS-safe. Thus I'd really like to get rid of it.
> 
> Right now we're preventing the problem by forcing the munmap operation
> (line 6) to wait for the delisting from pending (line 8). But we could
> instead just delay the lock (line 3) of a process-shared robust mutex
> until all pending operations in the current process finish. Since the
> window (between line 2 and 8) only exists until pthread_mutex_unlock
> returns in thread A, there is no way for thread B to observe M being
> ready to destroy/unmap/whatever without a lock or trylock operation on
> M. So this seems to work.

This isn't entirely true. As soon as the atomic release of M happens
in thread A, thread C in a different process can see it and can inform
thread B via other channels, bypassing the lock in line 3 but still
creating a viable synchronization path by which B would be entitled to
unmap/destroy/reuse memory from M.

I hope this can be reasonably solved if not just pthread_mutex_lock
(for the pshared+robust case) but also pthread_mutex_destroy and the
mmap family do the following:

> takes the lock (line 3), rather than waiting on the pending count to
> reach zero, if it sees a nonzero count, it can walk the list and
> *perform the removal itself* of M from any pending slot it finds M in.

This isn't quite as good as I'd like in terms of getting rid of the
interaction, but at least it gets rid of the AS-unsafety of the mmap
family and the potential for arbitrarily-long delays.

One slight difficulty here is that, unlike pthread_mutex_lock or
pthread_mutex_destroy, mmap-family functions don't have a particular
mutex whose exit from pending status they're waiting for. You can't
just clear all pending slots, because one of them might not yet be
unlocked, and then there'd be a window where it would be left locked
if the process died. You also can't just skip mutexes that still have
an owner on the assumption that they're not yet unlocked, since a
thread in another process might have acquired them after the thread
with the pending slot unlocked them.

At first glance, it looks like it suffices to just check whether the
mutex pointed to by thread X's pending slot is still owned by thread
X. If not, it no longer belongs in the pending slot and the slot can
be cleared. However, this is essentially the same check the kernel
does during robust_list processing, and it seems to have the same
race: if thread C from another process locks the mutex, then unlocks
and destroys it and reuses the shared memory it lived in for another
purpose, that memory might end up containing a value that matches the
old owner's tid. Then we skip removing it from the pending slot, and
we end up with the kernel clobbering unrelated memory all over again.
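
Spelled out, the check in question is something like the following
(hypothetical tcb type carrying the pending slot and tid; 0x3fffffff
is the kernel's FUTEX_TID_MASK, the bits of the lock word holding the
owner's tid), and the first branch is exactly where it goes wrong:

    #include <stdatomic.h>

    struct tcb { _Atomic(_Atomic int *) pending; int tid; };  /* hypothetical */

    #define TID_MASK 0x3fffffff   /* FUTEX_TID_MASK */

    static void try_retire_pending(struct tcb *t)
    {
            _Atomic int *lk = atomic_load(&t->pending);
            if (!lk)
                    return;
            if ((atomic_load(lk) & TID_MASK) == t->tid) {
                    /* apparently still owned by t, so leave the slot
                     * armed; but if another process has destroyed the
                     * mutex and reused its memory, this word may merely
                     * happen to contain t's tid, and leaving the slot
                     * armed lets the kernel clobber the reused memory
                     * at exit, the same race the kernel itself has */
                    return;
            }
            /* no longer owned by t: the slot can be retired */
            _Atomic int *expect = lk;
            atomic_compare_exchange_strong(&t->pending, &expect,
                                           (_Atomic int *)0);
    }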

I don't see any good out:

- There's a high-tech solution involving a custom asm robust-unlock
  primitive akin to the cancellation mechanism, whereby a signal
  context can determine if the thread has performed the pending unlock
  yet, but that's really way too much machinery.
  
- There's the option of having the pending-slot critical section block
  signals, so that waiting for it (as we do now) can't be delayed
  arbitrarily by a signal handler, but that imposes 2+ syscalls for
  every robust mutex unlock.

- With the possible signal handler wrapping idea I proposed, that
  becomes a lot less expensive. Signal blocking can be replaced with a
  signal-deferring critical section at almost no cost. But this is
  contingent on a big change.

- We could just punt on this. Do as described above to get rid of the
  AS-unsafe blocking of the mmap family, and deem the Rube Goldberg
  machine needed to set up the erroneous kernel behavior sufficiently
  unrealistic that it doesn't matter.

- We could just leave it as-is. I don't like this because the mmap
  family is AS-unsafe and subject to blocking arbitrarily long due to
  a signal handler in another thread.

This is all rather disappointing, as I thought I had a nice path
towards getting rid of this mess. But it does make the signal handler
wrapping idea seem increasingly appealing...

There is another awful idea that imposes syscall cost on every robust
mutex unlock, but eliminates the need for this pending-slot juggling
entirely: make the unlock and the pending-slot clear atomic with each
other with respect to process exit. This means using a syscall to
perform both writes in a way that can't sleep or return to userspace
with only one of them written. I was hoping FUTEX_WAKE_OP could do
this, but it doesn't look like it (that function is appealing because
it also does the wake, and it's present on any kernel that can do
robust mutexes). The other plausible option is process_vm_writev, but
it's only available on newer kernels and may be blocked by policy. So
this seems hard to make work reliably.
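
For concreteness, the process_vm_writev shape of that idea is a single
syscall on our own pid carrying both stores, along the lines of the
hypothetical sketch below. It ignores the futex wake and error
handling, and takes for granted that the kernel cannot complete just
one of the two iovecs, which is exactly the part I'm not sure of:

    #define _GNU_SOURCE
    #include <stdatomic.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* write 0 to the lock word and NULL to the pending slot in one
     * syscall, so process death can't land between the two stores */
    static int robust_unlock_oneshot(_Atomic int *lockword,
                                     _Atomic int **pending_slot)
    {
            int zero = 0;
            _Atomic int *nullp = 0;
            struct iovec local[2] = {
                    { &zero,  sizeof zero  },
                    { &nullp, sizeof nullp },
            };
            struct iovec remote[2] = {
                    { (void *)lockword,     sizeof zero  },
                    { (void *)pending_slot, sizeof nullp },
            };
            ssize_t want = sizeof zero + sizeof nullp;
            return process_vm_writev(getpid(), local, 2, remote, 2, 0)
                    == want ? 0 : -1;
    }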

Rich

