Date: Wed, 11 Nov 2020 19:00:19 -0500
From: Rich Felker
To: musl@lists.openwall.com
Subject: Re: [musl] Getting rid of vmlock
Message-ID: <20201112000018.GO534@brightrain.aerifal.cx>
In-Reply-To: <20201111204119.GN534@brightrain.aerifal.cx>

On Wed, Nov 11, 2020 at 03:41:20PM -0500, Rich Felker wrote:
> The vmlock mechanism is a horrible hack to work around Linux's robust
> mutex system being broken; it inherently has a race of the following
> form:
>
>        thread A                 thread B
> 1.     put M in pending
> 2.     unlock M
> 3.                              lock M
> 4.                              unlock M
> 5.                              destroy M
> 6.                              unmap M's memory
> 7.                              mmap something new at same address
> 8.     remove M from pending
>
> whereby, if the process is async terminated (e.g. by SIGKILL) right
> before line 8 (removing M from pending), the kernel will clobber
> unrelated memory. Normally this is harmless since the process is
> dying, but it's possible this unrelated memory is a new MAP_SHARED
> PROT_WRITE file mapping, and now you have file corruption.
>
> musl's vmlock mechanism avoids this by taking a sort of lock (just a
> usage counter) while a robust list pending slot is in use, and
> disallowing munmap or mmap/mremap with MAP_FIXED until the counter
> reaches zero. This has the unfortunate consequence of making the
> mmap family non-AS-safe. Thus I'd really like to get rid of it.
>
> Right now we're preventing the problem by forcing the munmap
> operation (line 6) to wait for the delisting from pending (line 8).
> But we could instead just delay the lock (line 3) of a
> process-shared robust mutex until all pending operations in the
> current process finish. Since the window (between lines 2 and 8)
> only exists until pthread_mutex_unlock returns in thread A, there is
> no way for thread B to observe M being ready to
> destroy/unmap/whatever without a lock or trylock operation on M. So
> this seems to work.

This isn't entirely true. As soon as the atomic release of M happens
in thread A, thread C in a different process can see it and can
inform thread B via other channels, bypassing the lock in line 3 but
still creating a viable synchronization path by which B would be
entitled to unmap/destroy/reuse M's memory.

I hope this can be reasonably solved if not just pthread_mutex_lock
(for the pshared+robust case) but also pthread_mutex_destroy and the
mmap family do the following:

> takes the lock (line 3), rather than waiting on the pending count
> to reach zero, if it sees a nonzero count, it can walk the list and
> *perform the removal itself* of M from any pending slot it finds M
> in.
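
Concretely, the idea is something like the following rough sketch
(made-up names throughout -- tcb, all_threads, pending_count,
clear_pending_for are not the actual musl internals; each thread's
"pending slot" stands in for the list_op_pending field of the
robust_list_head it registered via set_robust_list(2)):

    #include <stdatomic.h>

    struct tcb {                       /* hypothetical thread descriptor */
        _Atomic(void *) pending;       /* this thread's pending slot */
        struct tcb *next;
    };

    extern struct tcb *all_threads;    /* hypothetical global thread list */
    extern _Atomic int pending_count;  /* nonzero while any slot is live */

    /* Called from pthread_mutex_lock (pshared+robust case) or
     * pthread_mutex_destroy once m has already been observed unlocked:
     * instead of waiting for pending_count to reach zero, delist m
     * from any pending slot ourselves. */
    static void clear_pending_for(void *m)
    {
        if (atomic_load(&pending_count) == 0) return;
        for (struct tcb *t = all_threads; t; t = t->next) {
            /* A slot still pointing at m is stale: the unlock of m
             * already completed, or we couldn't have acquired it.
             * Clearing it is idempotent with the owner's own clear. */
            if (atomic_load(&t->pending) == m)
                atomic_store(&t->pending, (void *)0);
        }
    }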
This isn't quite as good as I'd like in terms of getting rid of the
interaction, but at least it gets rid of the AS-unsafety of the mmap
family and the potential for arbitrarily long delays.

One slight difficulty here is that, unlike pthread_mutex_lock or
pthread_mutex_destroy, mmap-family functions don't have a particular
mutex whose exit from pending status they're waiting for. You can't
just clear all pending slots, because one of them might not yet be
unlocked, and then there'd be a window where it would be left locked
if the process died. You also can't just skip mutexes that still have
an owner on the assumption that they're not yet unlocked, since a
thread in another process might have acquired them after the thread
with the pending slot unlocked.

At first glance, it looks like it suffices to just check whether the
mutex pointed to by thread X's pending slot is still owned by thread
X. If not, it no longer belongs in the pending slot, and the slot can
be cleared. However, this is essentially the same check the kernel
does during robust_list processing, and it seems to have the same
race: if thread C from another process locks the mutex, then unlocks
and destroys it and reuses the shared memory it lived in for another
purpose, the memory might end up containing a value that matches the
old owner's tid. Then we skip removing it from the pending slot, and
we can end up with the original memory clobbering by the kernel all
over again.

I don't see any good out:

- There's a high-tech solution involving a custom asm robust-unlock
  primitive akin to the cancellation mechanism, whereby a signal
  context can determine whether the thread has performed the pending
  unlock yet, but that's really way too much machinery.

- There's the option of making the pending-slot critical section
  block signals, so that it can't delay execution if we just wait for
  it like we're doing now, but that imposes 2+ syscalls on every
  robust mutex unlock.

- With the signal handler wrapping idea I proposed, that becomes a
  lot less expensive: signal blocking can be replaced with a
  signal-deferring critical section at almost no cost. But this is
  contingent on a big change.

- We could just punt on this: do as described above to get rid of the
  AS-unsafe blocking of the mmap family, and deem the Rube Goldberg
  machine needed to set up the erroneous kernel behavior sufficiently
  unrealistic that it doesn't matter.

- We could just leave it as-is. I don't like this, because the mmap
  family is AS-unsafe and subject to blocking arbitrarily long due to
  a signal handler in another thread.

This is all rather disappointing, as I thought I had a nice path
towards getting rid of this mess. But it does make the signal handler
wrapping idea seem increasingly appealing...

There is another awful idea that imposes syscall cost on every robust
mutex unlock but eliminates the need for this pending-slot juggling
entirely: make the unlock and the pending-slot clear atomic with each
other with respect to process exit. This means using a syscall to
perform both writes, in a way that can't sleep or return to userspace
with only one of them written. I was hoping FUTEX_WAKE_OP could do
this, but it doesn't look like it can (this function is appealing
because it also does the wake, and it's present on any kernel that
can do robust mutexes). The other plausible option is
process_vm_writev, but it's only available on newer kernels and may
be blocked by policy. So this seems hard to make work reliably.
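
For what it's worth, the process_vm_writev variant would look
something like this (a hypothetical sketch, not a worked-out
implementation; whether the kernel really performs both writes
atomically with respect to process exit, rather than completing
partially, is part of what's in question above):

    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <unistd.h>

    /* Ask the kernel to perform both stores -- the unlock of the
     * futex word and the clearing of the pending slot -- in a single
     * syscall, so the process can't be torn down with only one of
     * them written.  A FUTEX_WAKE would still be needed afterwards
     * if there were waiters. */
    static int unlock_and_clear_pending(int *futexword, void **pending_slot)
    {
        static const int unlocked = 0;
        static void *const cleared = 0;
        struct iovec local[2] = {
            { (void *)&unlocked, sizeof unlocked },
            { (void *)&cleared, sizeof cleared },
        };
        struct iovec remote[2] = {
            { futexword, sizeof *futexword },
            { pending_slot, sizeof *pending_slot },
        };
        ssize_t want = sizeof unlocked + sizeof cleared;
        return process_vm_writev(getpid(), local, 2, remote, 2, 0)
            == want ? 0 : -1;
    }

Rich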