From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: atomic.h cleanup
Date: Thu, 21 Jan 2016 19:09:45 -0500
Message-ID: <20160122000945.GW238@brightrain.aerifal.cx>
References: <20160110122139.GF2016@debian>
 <20160110165718.GR238@brightrain.aerifal.cx>
In-Reply-To: <20160110165718.GR238@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.5.21 (2010-09-15)

On Sun, Jan 10, 2016 at 11:57:18AM -0500, Rich Felker wrote:
> On Sun, Jan 10, 2016 at 01:21:39PM +0100, Markus Wichmann wrote:
> > Hi all,
> >
> > The development roadmap on the musl wiki lists the ominous point
> > "atomic.h cleanup" for 1.2.0.
> >
> > I assume you mean a sort of simplification and unification. I
> > noticed that for the RISC archs there are rather liberal amounts
> > of inline assembly for the atomic operations. And I have always
> > been taught that as soon as you start copying code, you are
> > probably doing it wrong.
> >
> > So the first thing I'd do: add a new file, let's call it
> > atomic_debruijn.h. It contains an implementation of a_ctz() and
> > a_ctz_64() based on the de Bruijn number. That way, all the
> > architectures currently implementing a_ctz() in this manner can
> > just include that file, and a lot of duplicate code goes out the
> > window.
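
For reference, the logic these archs all duplicate is the usual de
Bruijn multiply-and-lookup ctz. A minimal sketch of what such a
shared header could hoist (constant and table as in the existing
per-arch copies; a 64-bit variant would follow the same pattern):

#include <stdint.h>

static inline int a_ctz(uint32_t x)
{
        static const char debruijn32[32] = {
                0, 1, 23, 2, 29, 24, 19, 3, 30, 27, 25, 11, 20, 8, 4, 13,
                31, 22, 28, 18, 26, 10, 7, 12, 21, 17, 9, 6, 16, 5, 15, 14
        };
        /* x&-x isolates the lowest set bit; multiplying by the de
         * Bruijn constant puts a distinct 5-bit code for each bit
         * position into the top bits, which then index the table. */
        return debruijn32[(x&-x)*0x076be629 >> 27];
}
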
> > Second thing: We can reduce the inline assembly footprint and the
> > amount of duplicate code by adding a new file, let's call it
> > atomic_llsc.h, that implements a_cas(), a_cas_p(), a_swap(),
> > a_fetch_add(), a_inc(), a_dec(), a_and() and a_or() in terms of
> > new functions that would have to be defined, namely:
> >
> > static inline void a_presync(void) - execute any barrier needed
> > before attempting an atomic operation, like "dmb ish" for ARM, or
> > "sync" for PPC.
> >
> > static inline void a_postsync(void) - execute any barrier needed
> > afterwards, like "isync" for PPC, or, again, "dmb ish" for ARM.
> >
> > static inline int a_ll(int*) - perform an LL on the given pointer
> > and return the value there. This would be "lwarx" for PPC, or
> > "ldrex" for ARM.
> >
> > static inline int a_sc(int*, int) - perform an SC on the given
> > pointer with the given value. Return zero iff that failed.
> >
> > static inline void* a_ll_p(void*) - same as a_ll(), but with
> > machine words instead of int, if that's a difference.
> >
> > static inline int a_sc_p(void*, void*) - same as a_sc(), but with
> > machine words.
> >
> > With these functions we can implement e.g. CAS as:
> >
> > static inline int a_cas(volatile int *p, int t, int s)
> > {
> >         int v;
> >         do {
> >                 v = a_ll(p);
> >                 if (v != t)
> >                         break;
> >         } while (!a_sc(p, s));
> >         return v;
> > }
> >
> > Add some #ifdefs to only activate the pointer variations if
> > they're needed (i.e. if we're on 64 bits) and Bob's your uncle.
> >
> > The only hardship would be in implementing a_sc(), but that can
> > be solved by using a feature often referenced but rarely seen in
> > the wild: asm goto. How that works is that, if the arch's SC
> > instruction returns success or failure in a flag and the CPU can
> > jump on that flag (unlike, say, microblaze, which can only jump
> > on comparisons), then you encode the jump in the assembly snippet
> > but let the compiler handle the targets for you. Since in all
> > cases we want to jump on failure, that's what the assembly should
> > do, so for instance for PowerPC:
> >
> > static inline int a_sc(volatile int* p, int x)
> > {
> >         __asm__ goto ("stwcx. %0, 0, %1\n\tbne- %l2"
> >                 : : "r"(x), "r"(p) : "cc", "memory" : fail);
> >         return 1;
> > fail:
> >         return 0;
> > }
> >
> > I already checked the compiler output for such a design, but I
> > never tried running it for lack of hardware.
> >
> > Anyway, this code makes it possible for the compiler to redirect
> > the conditional jump on failure to the top of the loop in
> > a_cas(). Since the return value isn't used otherwise, the values
> > 1 and 0 never appear in the generated assembly.
> >
> > What do you say to this design?
>
> Have you read this thread? :)
>
> http://www.openwall.com/lists/musl/2015/05/20/1
>
> I thought at one point it was linked from the wiki but maybe it got
> lost.
>
> Basically I have this done already outside of musl as an
> experiment, but there are minor details that were holding it up.
> One annoyance is that, on some archs, success/failure of "sc" comes
> via a condition flag which the C caller can't easily branch on, so
> there's an extra conversion to a boolean result inside the asm and
> an extra conversion back to a test/branch outside the asm. In
> practice we probably don't care.
>
> One other issue is that risc-v seems to offer, at least on some
> implementations, stronger forward-progress guarantees than normal
> ll/sc as long as the ll/sc are in order, within a few instruction
> slots of each other, with no branches between. Such conditions
> cannot be met without putting them in the same asm block, so we
> might need to do a custom version for risc-v if we want to take
> advantage of the stronger properties.
>
> Anyway, at this point the main obstacle to finishing the task is
> doing the actual merging and testing, not any new coding, I think.

Most of this is committed now! The original commit introducing the
new system is:

http://git.musl-libc.org/cgit/musl/commit/?id=1315596b510189b5159e742110b504177bdd4932

Subsequent commits are converting archs over one by one to make best
use of the new system.

I'd really like to see some before-and-after benchmarks on real
hardware. For Timo's cond_bench test, which I've been using as a
quick check to make sure the new atomics have no obvious breakage,
performance under qemu user-level emulation has roughly doubled for
most of the ll/sc archs, thanks to gcc's improved ability to inline
(it refuses to inline the largeish ll/sc loops written in asm, but
is happy to inline the tiny a_ll and a_sc asm).
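
To give a sense of what "tiny" means here: under the new system a
per-arch file only supplies the bare primitives, and the retry loops
are written once in C in the shared header. As a rough sketch, this
is approximately the shape the ARMv7 case takes (illustrative only;
exact names and constraints are whatever the committed version uses):

/* the a_ll/a_sc macros tell the shared header the arch provides
 * these primitives, so it builds the generic loops on top of them */
#define a_ll a_ll
static inline int a_ll(volatile int *p)
{
        int v;
        /* load-exclusive: begin the monitored access */
        __asm__ __volatile__ ("ldrex %0, %1" : "=r"(v) : "Q"(*p));
        return v;
}

#define a_sc a_sc
static inline int a_sc(volatile int *p, int v)
{
        int r;
        /* store-exclusive: strex sets r to 0 on success, 1 on failure */
        __asm__ __volatile__ ("strex %0,%2,%1"
                : "=&r"(r), "=Q"(*p) : "r"(v) : "memory");
        return !r;
}

Each of these is a single instruction in a single-statement asm, so
gcc is willing to inline them and generate the surrounding loop
itself.
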
Please report any regressions, etc.

Rich