From: Rich Felker
Subject: Re: [PATCH] inline llsc atomics when compiling for sh4a
Date: Sun, 17 May 2015 22:34:02 -0400
To: musl@lists.openwall.com
Message-ID: <20150518023402.GS17573@brightrain.aerifal.cx>
In-Reply-To: <20150517185516.GA32020@duality.lan>

On Sun, May 17, 2015 at 01:55:16PM -0500, Bobby Bingham wrote:
> If we're building for sh4a, the compiler is already free to use
> instructions only available on sh4a, so we can do the same and inline the
> llsc atomics. If we're building for an older processor, we still do the
> same runtime atomics selection as before.

Thanks! I think it's ok for commit as-is, but based on re-reading this
code I have some ideas for improving it that are orthogonal to this
change. See comments inline:

> ---
>  arch/sh/atomic.h     |  83 +++++++++++++++++++++++++++++++
>  arch/sh/src/atomic.c | 135 +++++++++++++++++----------------------------------
>  2 files changed, 128 insertions(+), 90 deletions(-)
>
> diff --git a/arch/sh/atomic.h b/arch/sh/atomic.h
> index a1d22e4..f2e6dac 100644
> --- a/arch/sh/atomic.h
> +++ b/arch/sh/atomic.h
> @@ -22,6 +22,88 @@ static inline int a_ctz_64(uint64_t x)
>  	return a_ctz_l(y);
>  }
>
> +#define LLSC_CLOBBERS "r0", "t", "memory"
> +#define LLSC_START(mem) "synco\n" \
> +	"0:	movli.l @" mem ", r0\n"
> +#define LLSC_END(mem) \
> +	"1:	movco.l r0, @" mem "\n" \
> +	"	bf 0b\n" \
> +	"	synco\n"
> +
> +static inline int __sh_cas_llsc(volatile int *p, int t, int s)
> +{
> +	int old;
> +	__asm__ __volatile__(
> +		LLSC_START("%1")
> +		"	mov r0, %0\n"
> +		"	cmp/eq %0, %2\n"
> +		"	bf 1f\n"
> +		"	mov %3, r0\n"
> +		LLSC_END("%1")
> +		: "=&r"(old) : "r"(p), "r"(t), "r"(s) : LLSC_CLOBBERS);
> +	return old;
> +}

The mov from r0 to %0 seems unnecessary here. Presumably it's because
you didn't have a constraint to force old to come out via r0. Could
you do the following?
static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	register int old __asm__("r0");
	__asm__ __volatile__(
		LLSC_START("%1")
		"	cmp/eq r0, %2\n"
		"	bf 1f\n"
		"	mov %3, r0\n"
		LLSC_END("%1")
		: "=&r"(old) : "r"(p), "r"(t), "r"(s) : "t", "memory");
	return old;
}

Note that I had to drop LLSC_CLOBBERS because "r0" was in it and you
can't clobber an output register.

I've actually always wondered about the value of having the LLSC_*
macros. I usually prefer for the whole asm to be written out explicitly
and readable unless there's a compelling reason to wrap it up in
macros. Then it would look like:

static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	register int old __asm__("r0");
	__asm__ __volatile__(
		"	synco\n"
		"0:	movli.l @%1, r0\n"
		"	cmp/eq r0, %2\n"
		"	bf 1f\n"
		"	mov %3, r0\n"
		"1:	movco.l r0, @%1\n"
		"	bf 0b\n"
		"	synco\n"
		: "=&r"(old) : "r"(p), "r"(t), "r"(s) : "t", "memory");
	return old;
}

and similar for other functions. Part of the motivation for not hiding
the outer logic in macros is that it might make it possible to
fold/simplify some special cases like I did above for CAS.

Another idea is letting the compiler simplify, with something like the
following, which could actually be used cross-platform for all
llsc-type archs:

static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	int old;
	do old = llsc_start(p);
	while (old == t && !llsc_end(p, s));
	return old;
}

Here llsc_start and llsc_end would be inline functions using asm with
appropriate constraints. Unfortunately I don't see a way to model using
the value of the truth flag "t" as the output of the asm for llsc_end,
though. I suspect this would be a problem on a number of other archs
too; the asm would have to waste an instruction (or several) converting
the flag to an integer. Unless there's a solution to that problem, it
makes an approach like this less appealing.

>  int __sh_cas(volatile int *p, int t, int s)
>  {
> +	if (__hwcap & CPU_HAS_LLSC) return __sh_cas_llsc(p, t, s);
> +
>  	int old;
> -	if (__hwcap & CPU_HAS_LLSC) {
> -		__asm__ __volatile__(
> -			LLSC_START("%1")
> -			"	mov r0, %0\n"
> -			"	cmp/eq %0, %2\n"
> -			"	bf 1f\n"
> -			"	mov %3, r0\n"
> -			LLSC_END("%1")
> -			: "=&r"(old) : "r"(p), "r"(t), "r"(s) : LLSC_CLOBBERS);
> -	} else {
> -		__asm__ __volatile__(
> -			GUSA_START_EVEN("%1", "%0")
> -			"	cmp/eq %0, %2\n"
> -			"	bf 1f\n"
> -			GUSA_END("%1", "%3")
> -			: "=&r"(old) : "r"(p), "r"(t), "r"(s) : GUSA_CLOBBERS, "t");
> -	}
> +	__asm__ __volatile__(
> +		GUSA_START_EVEN("%1", "%0")
> +		"	cmp/eq %0, %2\n"
> +		"	bf 1f\n"
> +		GUSA_END("%1", "%3")
> +		: "=&r"(old) : "r"(p), "r"(t), "r"(s) : GUSA_CLOBBERS, "t");
>  	return old;
>  }

For the GUSA stuff, do you really need the ODD/EVEN macros? I think you
could just add an appropriate .align inside that would cause the
assembler to insert a nop if necessary.

Rich
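
A rough, untested sketch of the llsc_start/llsc_end helpers described
above might look like the following on sh4a. The helper names and
constraints here are hypothetical, and the movt in llsc_end is exactly
the wasted flag-to-integer conversion mentioned:

/* Hypothetical helpers for the "let the compiler simplify" idea.
 * Untested sketch; it also assumes nothing the compiler emits between
 * the two asm blocks breaks the movli/movco reservation. */
static inline int llsc_start(volatile int *p)
{
	int v;
	__asm__ __volatile__(
		"	synco\n"
		"	movli.l @%1, r0\n"	/* load-linked; destination must be r0 */
		"	mov r0, %0\n"
		: "=&r"(v) : "r"(p) : "r0", "memory");
	return v;
}

static inline int llsc_end(volatile int *p, int v)
{
	int ok;
	__asm__ __volatile__(
		"	mov %2, r0\n"		/* movco.l's source must be r0 */
		"	movco.l r0, @%1\n"	/* store-conditional; sets T on success */
		"	movt %0\n"		/* copy the T flag into a register */
		"	synco\n"
		: "=&r"(ok) : "r"(p), "r"(v) : "r0", "t", "memory");
	return ok;
}

With helpers along these lines the do/while version of __sh_cas_llsc
above would compile, at the cost of the extra movt (and mov) compared
with keeping the whole sequence in a single asm block.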