mailing list of musl libc

* [musl] Serious missing synchronization bug between internal locks and threads_minus_1 1->0 transition
From: Rich Felker @ 2020-05-22 16:21 UTC
  To: musl

This was reported on #musl yesterday as a problem in malloc
(oldmalloc). When the second-to-last thread exits (leaving the process
single-threaded again), it decrements libc.threads_minus_1. The
remaining thread can then observe this store with relaxed ordering,
which is fine in itself, but it then uses the observed
single-threadedness to skip locks, which means it also skips
synchronization with any changes the exiting thread made to memory.
The race goes like this (L = exiting thread, M = remaining thread):

M accesses and caches data D
                                    L takes lock protecting D
                                    L modifies D
                                    L releases lock protecting D
                                    L calls pthread_exit
                                    L locks thread list
                                    L decrements threads_minus_1
                                    L exits & releases list
M observes threads_minus_1==0
M skips taking lock protecting D
M accesses D, using cached value
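
Distilled, the unsafe fast path is the lock check in oldmalloc (the
same code patch 2 below modifies); annotated:

    static inline void lock(volatile int *lk)
    {
        /* relaxed read of the thread count; may observe 0 as soon
         * as the exiting thread decrements it */
        if (libc.threads_minus_1)
            while(a_swap(lk, 1)) __wait(lk, lk+1, 1, 1);
        /* when the branch is not taken there is no atomic op here,
         * hence neither a memory barrier nor a compiler barrier */
    }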

Note that while one might expect this not to happen on x86, with its
strong memory ordering, the lack of a memory barrier also ends up
being a lack of a *compiler barrier*, and in the observed breakage it
was actually the compiler, not the CPU, that cached D between the
first and last lines of the above example.
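
For reference, the compiler barrier normally comes from the atomic
itself. On i386, musl's a_swap is (roughly, quoting from memory) an
xchg with a "memory" clobber, and that clobber is exactly what the
skipped branch loses:

    static inline int a_swap(volatile int *p, int v)
    {
        __asm__ __volatile__ ( "xchg %0, %1"
            : "=r"(v), "=m"(*p) : "0"(v)
            : "memory" ); /* the clobber is the compiler barrier */
        return v;
    }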

The simple, safe fix for this is to stop using libc.threads_minus_1
to decide whether locking can be skipped and instead use
libc.threaded, which is permanently true once set. This is the first
change I plan to commit, to have a working baseline, but it will be
something of a performance regression for mostly-single-threaded
programs that only occasionally use threads, since they'll be stuck
taking locks all the time.
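
Concretely, the two flags differ only in lifecycle: threads_minus_1
drops back to zero when the process returns to single-threaded state,
while libc.threaded is set on the first pthread_create and never
cleared. In the lock fast path (see patch 2 below) the check changes
from:

    if (!libc.threads_minus_1) return; /* skips again after 1->0 */

to:

    if (!libc.threaded) return;        /* permanent once set */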

I've been trying to work out a better fix based on SYS_membarrier
issued from the exiting thread, but that's complicated by the fact
that membarrier can only impose memory barriers, not compiler
barriers.
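
A rough sketch of that direction (not a committed fix; note that
MEMBARRIER_CMD_PRIVATE_EXPEDITED requires prior registration via
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, and this still leaves the
compiler-barrier problem unsolved):

    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* exiting thread: force a memory barrier on all other running
     * threads of the process before the 1->0 count transition */
    static void broadcast_barrier(void)
    {
        syscall(SYS_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
    }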

Anyway, first fix coming soon. This will be important for distros to
pick up.

Rich


* [musl] [PATCH] (series) Fix serious missing synchronization bug between internal locks and threads_minus_1 1->0 transition
From: Rich Felker @ 2020-05-22 21:56 UTC
  To: musl

[-- Attachment #1: Type: text/plain, Size: 478 bytes --]

On Fri, May 22, 2020 at 12:21:42PM -0400, Rich Felker wrote:
> Anyway, first fix coming soon. This will be important for distros to
> pick up.

And here's a patch series.

I found and fixed a second bug, independent of the first, where
threads_minus_1 was being decremented too early, causing the dlerror
cleanup (__dl_thread_cleanup) to skip locking.

I also have a solution for returning to skipping locks after the
process becomes single-threaded again.

See attached series.

[-- Attachment #2: 0001-reorder-thread-list-unlink-in-pthread_exit-after-all.patch --]
[-- Type: text/plain, Size: 2223 bytes --]

From 4d5aa20a94a2d3fae3e69289dc23ecafbd0c16c4 Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Fri, 22 May 2020 17:35:14 -0400
Subject: [PATCH 1/4] reorder thread list unlink in pthread_exit after all
 locks

since the backend for LOCK() skips locking if single-threaded, it's
unsafe to make the process appear single-threaded before the last use
of any lock.

this fixes potential unsynchronized access to a linked list via
__dl_thread_cleanup.
---
 src/thread/pthread_create.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/thread/pthread_create.c b/src/thread/pthread_create.c
index 5f491092..6a3b0c21 100644
--- a/src/thread/pthread_create.c
+++ b/src/thread/pthread_create.c
@@ -90,14 +90,7 @@ _Noreturn void __pthread_exit(void *result)
 		exit(0);
 	}
 
-	/* At this point we are committed to thread termination. Unlink
-	 * the thread from the list. This change will not be visible
-	 * until the lock is released, which only happens after SYS_exit
-	 * has been called, via the exit futex address pointing at the lock. */
-	libc.threads_minus_1--;
-	self->next->prev = self->prev;
-	self->prev->next = self->next;
-	self->prev = self->next = self;
+	/* At this point we are committed to thread termination. */
 
 	/* Process robust list in userspace to handle non-pshared mutexes
 	 * and the detached thread case where the robust list head will
@@ -121,6 +114,16 @@ _Noreturn void __pthread_exit(void *result)
 	__do_orphaned_stdio_locks();
 	__dl_thread_cleanup();
 
+	/* Last, unlink thread from the list. This change will not be visible
+	 * until the lock is released, which only happens after SYS_exit
+	 * has been called, via the exit futex address pointing at the lock.
+	 * This needs to happen after any possible calls to LOCK() that might
+	 * skip locking if libc.threads_minus_1 is zero. */
+	libc.threads_minus_1--;
+	self->next->prev = self->prev;
+	self->prev->next = self->next;
+	self->prev = self->next = self;
+
 	/* This atomic potentially competes with a concurrent pthread_detach
 	 * call; the loser is responsible for freeing thread resources. */
 	int state = a_cas(&self->detach_state, DT_JOINABLE, DT_EXITING);
-- 
2.21.0
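
For context, the linked list in question is dlerror's free-buffer
queue; the pattern in src/ldso/dlerror.c is roughly as follows
(paraphrased from memory, details may differ), and its LOCK() is
exactly the kind of call that must not run after the process has been
made to appear single-threaded:

    static volatile int freebuf_queue_lock[1];
    static void **freebuf_queue;

    void __dl_thread_cleanup(void)
    {
        pthread_t self = __pthread_self();
        if (self->dlerror_buf && self->dlerror_buf != (void *)-1) {
            LOCK(freebuf_queue_lock);
            void **p = (void **)self->dlerror_buf;
            *p = freebuf_queue; /* push buffer onto the free queue */
            freebuf_queue = p;
            UNLOCK(freebuf_queue_lock);
        }
    }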


[-- Attachment #3: 0002-don-t-use-libc.threads_minus_1-as-relaxed-atomic-for.patch --]
[-- Type: text/plain, Size: 2802 bytes --]

From e01b5939b38aea5ecbe41670643199825874b26c Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Thu, 21 May 2020 23:32:45 -0400
Subject: [PATCH 2/4] don't use libc.threads_minus_1 as relaxed atomic for
 skipping locks

after all but the last thread exits, the next thread to observe
libc.threads_minus_1==0 and conclude that it can skip locking fails to
synchronize with any changes to memory that were made by the
last-exiting thread. this can produce data races.

on some archs, at least x86, memory synchronization is unlikely to be
a problem; however, with the inline locks in malloc, skipping the lock
also eliminated the compiler barrier, and caused code that needed to
re-check chunk in-use bits after obtaining the lock to reuse a stale
value, possibly from before the process became single-threaded. this
in turn produced corruption of the heap state.

some uses of libc.threads_minus_1 remain, especially for allocation of
new TLS in the dynamic linker; otherwise, it could be removed
entirely. it's made non-volatile to reflect that the remaining
accesses are only made under lock on the thread list.

instead of libc.threads_minus_1, libc.threaded is now used for
skipping locks. the difference is that libc.threaded is permanently
true once an additional thread has been created. this will produce
some performance regression in processes that are mostly
single-threaded but occasionally create threads. in the future it
may be possible to bring back the full lock-skipping, but more care
needs to be taken to produce a safe design.
---
 src/internal/libc.h | 2 +-
 src/malloc/malloc.c | 2 +-
 src/thread/__lock.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/internal/libc.h b/src/internal/libc.h
index ac97dc7e..c0614852 100644
--- a/src/internal/libc.h
+++ b/src/internal/libc.h
@@ -21,7 +21,7 @@ struct __libc {
 	int can_do_threads;
 	int threaded;
 	int secure;
-	volatile int threads_minus_1;
+	int threads_minus_1;
 	size_t *auxv;
 	struct tls_module *tls_head;
 	size_t tls_size, tls_align, tls_cnt;
diff --git a/src/malloc/malloc.c b/src/malloc/malloc.c
index 96982596..2553a62e 100644
--- a/src/malloc/malloc.c
+++ b/src/malloc/malloc.c
@@ -26,7 +26,7 @@ int __malloc_replaced;
 
 static inline void lock(volatile int *lk)
 {
-	if (libc.threads_minus_1)
+	if (libc.threaded)
 		while(a_swap(lk, 1)) __wait(lk, lk+1, 1, 1);
 }
 
diff --git a/src/thread/__lock.c b/src/thread/__lock.c
index 45557c88..5b9b144e 100644
--- a/src/thread/__lock.c
+++ b/src/thread/__lock.c
@@ -18,7 +18,7 @@
 
 void __lock(volatile int *l)
 {
-	if (!libc.threads_minus_1) return;
+	if (!libc.threaded) return;
 	/* fast path: INT_MIN for the lock, +1 for the congestion */
 	int current = a_cas(l, 0, INT_MIN + 1);
 	if (!current) return;
-- 
2.21.0
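
The stale-value reuse described in the commit message can be
reproduced in miniature; a toy example (hypothetical names, not musl
code) where an inlined no-op "lock" leaves the compiler free to fold
the post-lock re-check:

    struct chunk { int csize; };

    /* inlined no-op: neither a memory nor a compiler barrier */
    static inline void lock_skipped(void) {}

    int realloc_path(struct chunk *c)
    {
        if (!(c->csize & 1)) return 0; /* speculative in-use check */
        lock_skipped();                /* nothing forces a reload */
        return c->csize & 1;           /* may be folded to 1 */
    }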


[-- Attachment #4: 0003-cut-down-size-of-some-libc-struct-members.patch --]
[-- Type: text/plain, Size: 738 bytes --]

From f12888e9eb9eed60cc266b899dcafecb4752964a Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Fri, 22 May 2020 17:25:38 -0400
Subject: [PATCH 3/4] cut down size of some libc struct members

these are all flags that can be single-byte values.
---
 src/internal/libc.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/internal/libc.h b/src/internal/libc.h
index c0614852..d47f58e0 100644
--- a/src/internal/libc.h
+++ b/src/internal/libc.h
@@ -18,9 +18,9 @@ struct tls_module {
 };
 
 struct __libc {
-	int can_do_threads;
-	int threaded;
-	int secure;
+	char can_do_threads;
+	char threaded;
+	char secure;
 	int threads_minus_1;
 	size_t *auxv;
 	struct tls_module *tls_head;
-- 
2.21.0
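
A useful side effect: shrinking these flags to single bytes leaves
padding before threads_minus_1, which the next patch can use for the
new need_locks byte without growing the struct. Assuming typical
alignment, the layout becomes:

    struct __libc {
        char can_do_threads;  /* offset 0 */
        char threaded;        /* offset 1 */
        char secure;          /* offset 2 */
        /* one byte of padding here; patch 4 puts need_locks in it */
        int threads_minus_1;  /* offset 4 */
        size_t *auxv;
        struct tls_module *tls_head;
        size_t tls_size, tls_align, tls_cnt;
    };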


[-- Attachment #5: 0004-restore-lock-skipping-for-processes-that-return-to-s.patch --]
[-- Type: text/plain, Size: 3617 bytes --]

From 8d81ba8c0bc6fe31136cb15c9c82ef4c24965040 Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Fri, 22 May 2020 17:45:47 -0400
Subject: [PATCH 4/4] restore lock-skipping for processes that return to
 single-threaded state

the design used here relies on the barrier provided by the first lock
operation after the process returns to single-threaded state to
synchronize with actions by the last thread that exited. by storing
the intent to change modes in the same object used to detect whether
locking is needed, it's possible to avoid an extra (possibly costly)
memory load after the lock is taken.
---
 src/internal/libc.h         | 1 +
 src/malloc/malloc.c         | 5 ++++-
 src/thread/__lock.c         | 4 +++-
 src/thread/pthread_create.c | 8 ++++----
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/src/internal/libc.h b/src/internal/libc.h
index d47f58e0..619bba86 100644
--- a/src/internal/libc.h
+++ b/src/internal/libc.h
@@ -21,6 +21,7 @@ struct __libc {
 	char can_do_threads;
 	char threaded;
 	char secure;
+	volatile signed char need_locks;
 	int threads_minus_1;
 	size_t *auxv;
 	struct tls_module *tls_head;
diff --git a/src/malloc/malloc.c b/src/malloc/malloc.c
index 2553a62e..a803d4c9 100644
--- a/src/malloc/malloc.c
+++ b/src/malloc/malloc.c
@@ -26,8 +26,11 @@ int __malloc_replaced;
 
 static inline void lock(volatile int *lk)
 {
-	if (libc.threaded)
+	int need_locks = libc.need_locks;
+	if (need_locks) {
 		while(a_swap(lk, 1)) __wait(lk, lk+1, 1, 1);
+		if (need_locks < 0) libc.need_locks = 0;
+	}
 }
 
 static inline void unlock(volatile int *lk)
diff --git a/src/thread/__lock.c b/src/thread/__lock.c
index 5b9b144e..60eece49 100644
--- a/src/thread/__lock.c
+++ b/src/thread/__lock.c
@@ -18,9 +18,11 @@
 
 void __lock(volatile int *l)
 {
-	if (!libc.threaded) return;
+	int need_locks = libc.need_locks;
+	if (!need_locks) return;
 	/* fast path: INT_MIN for the lock, +1 for the congestion */
 	int current = a_cas(l, 0, INT_MIN + 1);
+	if (need_locks < 0) libc.need_locks = 0;
 	if (!current) return;
 	/* A first spin loop, for medium congestion. */
 	for (unsigned i = 0; i < 10; ++i) {
diff --git a/src/thread/pthread_create.c b/src/thread/pthread_create.c
index 6a3b0c21..6bdfb44f 100644
--- a/src/thread/pthread_create.c
+++ b/src/thread/pthread_create.c
@@ -118,8 +118,8 @@ _Noreturn void __pthread_exit(void *result)
 	 * until the lock is released, which only happens after SYS_exit
 	 * has been called, via the exit futex address pointing at the lock.
 	 * This needs to happen after any possible calls to LOCK() that might
-	 * skip locking if libc.threads_minus_1 is zero. */
-	libc.threads_minus_1--;
+	 * skip locking if process appears single-threaded. */
+	if (!--libc.threads_minus_1) libc.need_locks = -1;
 	self->next->prev = self->prev;
 	self->prev->next = self->next;
 	self->prev = self->next = self;
@@ -339,7 +339,7 @@ int __pthread_create(pthread_t *restrict res, const pthread_attr_t *restrict att
 		~(1UL<<((SIGCANCEL-1)%(8*sizeof(long))));
 
 	__tl_lock();
-	libc.threads_minus_1++;
+	if (!libc.threads_minus_1++) libc.need_locks = 1;
 	ret = __clone((c11 ? start_c11 : start), stack, flags, args, &new->tid, TP_ADJ(new), &__thread_list_lock);
 
 	/* All clone failures translate to EAGAIN. If explicit scheduling
@@ -363,7 +363,7 @@ int __pthread_create(pthread_t *restrict res, const pthread_attr_t *restrict att
 		new->next->prev = new;
 		new->prev->next = new;
 	} else {
-		libc.threads_minus_1--;
+		if (!--libc.threads_minus_1) libc.need_locks = 0;
 	}
 	__tl_unlock();
 	__restore_sigs(&set);
-- 
2.21.0
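
Collecting the pieces of the series, need_locks acts as a three-state
flag (0: single-threaded, skip locks; 1: threaded, take locks; -1:
just went single-threaded, take the lock once more for its barrier).
The transitions, pulled from the hunks above:

    /* in pthread_create: going multi-threaded, 0 -> 1 */
    if (!libc.threads_minus_1++) libc.need_locks = 1;

    /* in pthread_exit: back to single-threaded, 1 -> -1 */
    if (!--libc.threads_minus_1) libc.need_locks = -1;

    /* in lock()/__lock(): the first lock taken afterwards provides
     * the synchronizing barrier, then -1 -> 0 and subsequent locks
     * are skipped again */
    if (need_locks < 0) libc.need_locks = 0;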

