Date: Fri, 22 May 2020 12:21:42 -0400
From: Rich Felker
To: musl@lists.openwall.com
Message-ID: <20200522162142.GV1079@brightrain.aerifal.cx>
Subject: [musl] Serious missing synchronization bug between internal locks and threads_minus_1 1->0 transition

This was reported on #musl yesterday as a problem in malloc (oldmalloc).
When the second-to-last thread exits (leaving the process
single-threaded again), it decrements libc.threads_minus_1. The
remaining thread can then observe this as a relaxed-order atomic, which
is fine in itself, but it then uses the observed single-threadedness to
skip locks. This means it also skips synchronization with any changes
made to memory by the exiting thread.

The race goes like this (L exiting, M remaining):

M accesses and caches data D
L takes lock protecting D
L modifies D
L releases lock protecting D
L calls pthread_exit
L locks thread list
L decrements threads_minus_1
L exits & releases list
M observes threads_minus_1==0
M skips taking lock protecting D
M accesses D, using cached value

Note that while one might expect this not to happen on x86 with its
strong memory ordering, the missing memory barrier also ends up being a
missing *compiler barrier*, and in the observed breakage it was
actually the compiler, not the CPU, caching D between the first and
last lines of the above example.

The simple, safe fix is to stop using libc.threads_minus_1 to decide
whether to skip locking and instead use libc.threaded, which is
permanent once set. This is the first change I plan to commit, to have
a working baseline, but it will be something of a performance
regression for mostly-single-threaded programs that only occasionally
use threads, since they'll be stuck taking locks all the time.

I've been trying to work out a better fix, based on SYS_membarrier from
the exiting thread, but that's complicated by the fact that membarrier
can only impose memory barriers, not compiler barriers.

Anyway, first fix coming soon. This will be important for distros to
pick up.

Rich
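
[Editorial sketch, for illustration only] The lock-skipping pattern the
message describes looks roughly like the following. This is modeled on
musl's oldmalloc lock of that era but is simplified, not a verbatim
excerpt; libc.threads_minus_1, libc.threaded, a_swap, a_store, __wait
and __wake are musl internals, and the headers are musl's internal
headers.

#include "libc.h"          /* libc.threads_minus_1, libc.threaded */
#include "atomic.h"        /* a_swap, a_store */
#include "pthread_impl.h"  /* __wait, __wake */

static inline void lock(volatile int *lk)
{
	/* BUG: relaxed-order read of threads_minus_1. Observing 0 here
	 * establishes no happens-before edge with the stores the exiting
	 * thread made while it held lk, so skipping the lock also skips
	 * the memory barrier -- and compiler barrier -- that taking the
	 * lock would have implied. */
	if (libc.threads_minus_1)
		while (a_swap(lk, 1)) __wait(lk, lk+1, 1, 1);
}

static inline void unlock(volatile int *lk)
{
	if (lk[0]) {
		a_store(lk, 0);
		if (lk[1]) __wake(lk, 1, 1);
	}
}

The baseline fix described in the message amounts to changing the
fast-path condition to something that can never be observed going back
to "single-threaded":

	/* libc.threaded is set once the first thread is ever created and
	 * never cleared, so this test cannot race with a thread exiting;
	 * the cost is that such programs take the lock from then on. */
	if (libc.threaded)
		while (a_swap(lk, 1)) __wait(lk, lk+1, 1, 1);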
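
[Editorial sketch, for illustration only] The SYS_membarrier direction
mentioned at the end is left unresolved in the message; the following
is only an assumption about how the exiting-thread side could look. The
helper name is hypothetical and not part of musl, and as the message
notes, this supplies only a memory barrier: a remaining thread whose
compiler has cached D across the threads_minus_1 check would still need
a compiler barrier on the reader side.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Hypothetical: run by the exiting thread after it decrements
 * threads_minus_1, so other threads scheduled afterwards see its prior
 * writes.  MEMBARRIER_CMD_PRIVATE_EXPEDITED requires a one-time
 * MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED; fall back to the slower
 * global command if the expedited one is unavailable. */
static void broadcast_barrier_on_exit(void)
{
	if (syscall(SYS_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) != 0)
		syscall(SYS_membarrier, MEMBARRIER_CMD_GLOBAL, 0);
}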