From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Refactoring atomics as llsc?
Date: Wed, 20 May 2015 01:11:08 -0400
Message-ID: <20150520051108.GA28347@brightrain.aerifal.cx>
To: musl@lists.openwall.com

In the inline sh4a atomics thread, I discussed an idea of refactoring
atomics for llsc-style archs so that the arch would just need to
provide inline asm ll() and sc() functions (with more namespace-clean
names of course), and a shared atomic.h could build the a_* functions
on top of that, as in:

static inline int a_cas(volatile int *p, int t, int s)
{
	int old;
	do old = ll(p);
	while (old == t && !sc(p, s));
	return old;
}

(Note: I've omitted barriers for simplicity; they could be in the ll
and sc functions but it would
probably make more sense to have them outside the loop.)

In the sh4a thread, I was somewhat discouraged with this approach
because there's no way to model the output of the sc asm coming out in
a condition flag; the inline asm would have to convert the flag to a
value in a register. However, looking at the archs we have now, only
or1k, powerpc, and sh put the sc output in a condition flag. The rest
of the llsc-type archs leave the result as a boolean value in a
general-purpose register. So only a few archs would be negatively
affected, and only in a very minor way.

On the other hand, the benefits of doing this would be pretty
significant:

First of all, basically all archs could share all the implementation
logic for atomics, massively reducing the amount of per-arch asm and
the risk of subtle errors. Right now the only way to keep per-arch asm
down is to implement just a_cas and have everything else be a wrapper
for a_cas, but that leads to all the other atomics being a pair of
nested do/while loops, which is larger and slower than having a
"native" version of each op.

Second, this approach would allow us to add some really nice new
atomic primitives without having to write them per-arch: things like
atomic-dec-if-positive. Normally these would be written with a CAS
retry loop, which is a pair of nested do/while loops on llsc archs,
but with the above approach it would be a single loop.

Of course the big outlier is x86, which is not llsc based but has
actual atomic primitives at the instruction level. If we defined the
sc() primitive to take 3 args instead of 2 (address, old value from
ll, new value to conditionally store; most archs would ignore the old
value argument) then we could model x86 with ll being a plain load and
sc being cmpxchg to allow any new custom primitives to work using
cmpxchg.
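
To make the three-argument sc() idea concrete, here is a rough sketch of atomic-dec-if-positive written as a single loop. It is only an illustration under assumptions: the names a_ll, a_sc and a_dec_if_positive are hypothetical (the mail only speaks of ll() and sc()), and the GCC builtin __sync_bool_compare_and_swap stands in for a hand-written cmpxchg. On a real llsc arch, a_ll and a_sc would be inline asm and a_sc would ignore its old argument.

```c
/* Hypothetical names; the mail only calls these ll() and sc(). */

/* ll modeled as a plain load, as proposed for CAS-only archs like x86. */
static inline int a_ll(volatile int *p)
{
	return *p;
}

/* Three-argument sc: store new only if *p still holds old.
 * Returns nonzero if the store happened. The __sync builtin is an
 * assumption standing in for arch-specific cmpxchg asm. */
static inline int a_sc(volatile int *p, int old, int new)
{
	return __sync_bool_compare_and_swap(p, old, new);
}

/* Decrement *p only if it is positive. Returns the value observed,
 * so a result > 0 means the decrement took place. Single retry loop,
 * not a CAS loop nested around an ll/sc loop. */
static inline int a_dec_if_positive(volatile int *p)
{
	int old;
	do old = a_ll(p);
	while (old > 0 && !a_sc(p, old, old - 1));
	return old;
}
```

The loop exits either because the observed value was not positive (nothing is stored) or because the conditional store succeeded, which is the same shape as the a_cas example earlier in this mail.
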
For x86 itself, we would then just continue providing custom versions
of all the old a_* ops (a_cas, a_fetch_add, a_inc, a_dec, a_and, a_or,
a_swap) to take advantage of the native instructions. These versions
could probably be shared by all x86 variants (i386, x86_64, x32) since
they're operating on 32-bit values and the asm should be the same.

If we decide to go this way, it would replace the atomic.h refactoring
work I already sent to the list (Deduplicating atomics written in
terms of CAS).

For the few archs that would be adversely affected (albeit in a very
minor way) by this approach due to the inability to model condition
flags in asm outputs, we could in principle still keep some or all of
the old asm (probably cutting it down to the primitives that actually
matter for performance) if desired.

We could also keep the concept of atomic_generic.h using __sync
builtins, but offer ll() and sc() using a plain load for ll and
__sync_bool_compare_and_swap for sc, and then define a few other a_*
functions that the __sync primitives allow us to define directly.

Comments?

Rich
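
As a rough illustration of the atomic_generic.h point above, the a_* functions that the __sync primitives allow defining directly might look something like the following. This is a sketch under assumptions, not the actual header: the function signatures are assumed to follow the existing a_* conventions, the ll()/sc() wrappers are left out, and a_swap is omitted because __sync_lock_test_and_set is only an acquire barrier rather than a full one.

```c
/* Sketch only: direct definitions on top of GCC __sync builtins. */

/* Compare-and-swap: returns the value seen in *p before the attempt. */
static inline int a_cas(volatile int *p, int t, int s)
{
	return __sync_val_compare_and_swap(p, t, s);
}

/* Atomically add v to *p and return the previous value. */
static inline int a_fetch_add(volatile int *p, int v)
{
	return __sync_fetch_and_add(p, v);
}

/* Atomically AND v into *p; the old value is discarded. */
static inline void a_and(volatile int *p, int v)
{
	__sync_fetch_and_and(p, v);
}

/* Atomically OR v into *p; the old value is discarded. */
static inline void a_or(volatile int *p, int v)
{
	__sync_fetch_and_or(p, v);
}
```

Anything not covered this way (for example new primitives like atomic-dec-if-positive) would fall back to the shared ll()/sc() loops.
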