From: Rich Felker
Subject: Re: [PATCH] inline llsc atomics when compiling for sh4a
Date: Mon, 18 May 2015 20:30:45 -0400
To: musl@lists.openwall.com
Reply-To: musl@lists.openwall.com
Message-ID: <20150519003045.GC17573@brightrain.aerifal.cx>
In-Reply-To: <20150518225617.GA1905@duality.lan>
References: <20150517185516.GA32020@duality.lan>
 <20150518023402.GS17573@brightrain.aerifal.cx>
 <20150518225617.GA1905@duality.lan>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Mon, May 18, 2015 at 05:56:18PM -0500, Bobby Bingham wrote:
> On Sun, May 17, 2015 at 10:34:02PM -0400, Rich Felker wrote:
> > On Sun, May 17, 2015 at 01:55:16PM -0500, Bobby Bingham wrote:
> > > If we're building for sh4a, the compiler is already free to use
> > > instructions only available on sh4a, so we can do the same and inline the
> > > llsc atomics. If we're building for an older processor, we still do the
> > > same runtime atomics selection as before.
> >
> > Thanks! I think it's ok for commit as-is, but based on re-reading this
> > code I have some ideas for improving it that are orthogonal to this
> > change. See comments inline:
>
> Would you prefer I resend this patch to remove the LLSC_* macros at the
> same time, or send another patch to remove them separately?

No, let's do that separately. I like keeping independent changes
separate in commits. I've tested the current patch already and it seems
to be fine. If we do the second LLSC_* removal patch, it shouldn't
affect the generated binaries, which is easy to verify.

> > >
> > > ---
> > >  arch/sh/atomic.h     |  83 +++++++++++++++++++++++++++++++
> > >  arch/sh/src/atomic.c | 135 +++++++++++++++++----------------------------
> > >  2 files changed, 128 insertions(+), 90 deletions(-)
> > >
> > > diff --git a/arch/sh/atomic.h b/arch/sh/atomic.h
> > > index a1d22e4..f2e6dac 100644
> > > --- a/arch/sh/atomic.h
> > > +++ b/arch/sh/atomic.h
> > > @@ -22,6 +22,88 @@ static inline int a_ctz_64(uint64_t x)
> > >  	return a_ctz_l(y);
> > >  }
> > >
> > > +#define LLSC_CLOBBERS "r0", "t", "memory"
> > > +#define LLSC_START(mem) "synco\n" \
> > > +	"0:	movli.l @" mem ", r0\n"
> > > +#define LLSC_END(mem) \
> > > +	"1:	movco.l r0, @" mem "\n" \
> > > +	"	bf 0b\n" \
> > > +	"	synco\n"
> > > +
> > > +static inline int __sh_cas_llsc(volatile int *p, int t, int s)
> > > +{
> > > +	int old;
> > > +	__asm__ __volatile__(
> > > +		LLSC_START("%1")
> > > +		"	mov r0, %0\n"
> > > +		"	cmp/eq %0, %2\n"
> > > +		"	bf 1f\n"
> > > +		"	mov %3, r0\n"
> > > +		LLSC_END("%1")
> > > +		: "=&r"(old) : "r"(p), "r"(t), "r"(s) : LLSC_CLOBBERS);
> > > +	return old;
> > > +}
> >
> > The mov from r0 to %0 seems unnecessary here. Presumably it's because
> > you didn't have a constraint to force old to come out via r0. Could
> > you do the following?
>
> No, because the "mov %3, r0" a couple instructions down clobbers r0, and
> this is necessary because movco.l only accepts r0 as the source operand.

Oh, I see. We lose the old value that the function needs to return. So
a version of cas that just returns success/failure rather than the old
value would be able to omit it, but musl's can't, right? I should keep
that in mind, since at some point, if I can determine that the old
value isn't important anywhere, I might consider changing the a_cas API
to just have success/failure as its result. This is mildly cheaper on
some other archs too, I think.

> > I've actually always wondered about the value of having the LLSC_*
> > macros. I usually prefer for the whole asm to be written out
> > explicitly and readable unless there's a compelling reason to wrap it
> > up in macros. Then it would look like:
>
> I think it made a bigger difference for the gusa version, so I mostly
> did it with llsc for consistency. And before this patch, when the gusa
> and llsc versions were side by side in the same function, it made it
> easier for me to verify both versions were doing the same thing as I
> wrote them.
>
> Now that the llsc version is moving, I'm less attached to the LLSC_*
> macros. I do think the gusa stuff is ugly and magical enough that I'd
> still prefer to keep it hidden away, if you don't object.

Yeah, I don't mind.

> > static inline int __sh_cas_llsc(volatile int *p, int t, int s)
> > {
> > 	register int old __asm__("r0");
> > 	__asm__ __volatile__(
> > 		"	synco\n"
> > 		"0:	movli.l @%1, r0\n"
> > 		"	cmp/eq r0, %2\n"
> > 		"	bf 1f\n"
> > 		"	mov %3, r0\n"
> > 		"1:	movco.l r0, @%1\n"
> > 		"	bf 0b\n"
> > 		"	synco\n"
> > 		: "=&r"(old) : "r"(p), "r"(t), "r"(s) : "t", "memory");
> > 	return old;
> > }
> >
> > and similar for other functions. Part of the motivation of not hiding
> > the outer logic in macros is that it might make it possible to
> > fold/simplify some special cases like I did above for CAS.
>
> I don't mind in principle, but I think the fact that movco.l requires
> its input be in r0 is going to mean there's not actually any
> simplification you can do.

I think you may be right.
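
Just to illustrate the success/failure idea from above, something like
the following (hypothetical and untested, not something I'm proposing
for musl right now) could drop the extra mov, at the cost of a movt
after the loop to turn the T flag into the return value:

static inline int __sh_cas_bool_llsc(volatile int *p, int t, int s)
{
	int r;
	__asm__ __volatile__(
		"	synco\n"
		"0:	movli.l @%1, r0\n"
		"	cmp/eq r0, %2\n"
		"	bf 1f\n"		/* mismatch: skip the store, report failure */
		"	mov %3, r0\n"
		"	movco.l r0, @%1\n"
		"	bf 0b\n"		/* store failed: retry from the load */
		"1:	movt %0\n"		/* T flag -> 0/1 result */
		"	synco\n"
		: "=&r"(r) : "r"(p), "r"(t), "r"(s) : "r0", "t", "memory");
	return r;
}

Note that unlike the current code this skips the movco.l entirely on a
mismatch, since there's no old value to hand back to the caller.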

> > Another idea is letting the compiler simplify, with something like the
> > following, which could actually be used cross-platform for all
> > llsc-type archs:
> >
> > static inline int __sh_cas_llsc(volatile int *p, int t, int s)
> > {
> > 	do old = llsc_start(p);
> > 	while (*p == t && !llsc_end(p, s));
> > 	return old;
> > }
> >
> > Here llsc_start and llsc_end would be inline functions using asm with
> > appropriate constraints. Unfortunately I don't see a way to model
> > using the value of the truth flag "t" as the output of the asm for
> > llsc_end, though. I suspect this would be a problem on a number of
> > other archs too; the asm would have to waste an instruction (or
> > several) converting the flag to an integer. Unless there's a solution
> > to that problem, it makes an approach like this less appealing.
>
> I agree this would be even nicer if we could make it work.

Yes. FWIW the above code has a bug. It should be:

static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	do old = llsc_start(p);
	while (old == t && !llsc_end(p, s));
	return old;
}

I think there would need to be a barrier before the return statement
too. Embedding the barrier in llsc_end would not work because (1) it
won't get executed when old != t, and (2) it should be avoided when
llsc_end fails, since llsc_start will do a barrier anyway; but if we
put the conditional inside llsc_end's asm then the condition would get
branched on twice. Issue (2) may be solvable by using a C conditional
inside llsc_end, but that still leaves issue (1), so I think something
like this would be needed:

static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	do old = llsc_start(p);
	while (old == t && !llsc_end(p, s));
	a_barrier();
	return old;
}

But that's suboptimal on archs where the 'sc' part of llsc has an
implicit barrier (microblaze and or1k, IIRC). So perhaps the ideal
general version would be:

static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	do old = llsc_start(p);
	while (old == t ? !llsc_end(p, s) : (a_barrier(), 0));
	return old;
}

with a barrier in the success path for llsc_end. Alternatively we could
aim to always do the sc:

static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	do old = llsc_start(p);
	while (!llsc_end(p, old==t ? s : old));
	return old;
}

This version is structurally analogous to the non-CAS atomics, but
perhaps more costly in the old != t case. Anyway, I don't see an
efficient way to do the conditional, so it's mostly a theoretical
topic at this point.

> > For the GUSA stuff, do you really need the ODD/EVEN macros? I think
> > you could just add an appropriate .align inside that would cause the
> > assembler to insert a nop if necessary.
>
> Should be possible. I'll work on it and send another patch.

Thanks!

Rich
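
P.S. Just for concreteness, here's very roughly how I'd picture
llsc_start/llsc_end on sh -- an untested sketch only, not something I'm
proposing. The movt needed to get the T flag into a general register is
exactly the wasted instruction mentioned above, and barrier placement
(plus whether splitting the ll/sc pair across two asm blocks is even
acceptable) is left out of the picture:

static inline int llsc_start(volatile int *p)
{
	/* movli.l can only load into r0, so pin the result there */
	register int v __asm__("r0");
	__asm__ __volatile__(
		"	synco\n"
		"	movli.l @%1, r0\n"
		: "=&r"(v) : "r"(p) : "memory");
	return v;
}

static inline int llsc_end(volatile int *p, int v)
{
	/* movco.l can only store from r0; movt is the extra instruction
	 * converting the T flag into the 0/1 result */
	register int r0v __asm__("r0") = v;
	int ok;
	__asm__ __volatile__(
		"	movco.l r0, @%1\n"
		"	movt %0\n"
		: "=&r"(ok) : "r"(p), "r"(r0v) : "t", "memory");
	return ok;
}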