mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
@ 2021-12-06 23:43 Stijn Tintel
  2021-12-07  0:37 ` Florian Weimer
  0 siblings, 1 reply; 21+ messages in thread
From: Stijn Tintel @ 2021-12-06 23:43 UTC (permalink / raw)
  To: musl

On machines without AltiVec, the lvx and stvx instructions are not
supported. Use __hwcap to test if AltiVec is supported.

Fixes SIGILL on PowerPC 64 processors without AltiVec support.
Runtime-tested on e5500 and e6500.

Signed-off-by: Stijn Tintel <stijn@linux-ipv6.be>
---
 src/setjmp/powerpc64/longjmp.s | 13 ++++++++++++-
 src/setjmp/powerpc64/setjmp.s  | 13 ++++++++++++-
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/src/setjmp/powerpc64/longjmp.s b/src/setjmp/powerpc64/longjmp.s
index 81d45ff6..da7172af 100644
--- a/src/setjmp/powerpc64/longjmp.s
+++ b/src/setjmp/powerpc64/longjmp.s
@@ -56,7 +56,17 @@ longjmp:
 	lfd 30, 38*8(3)
 	lfd 31, 39*8(3)
 
-	# 6) restore vector registers v20-v31
+	# 6) restore vector registers v20-v31 if hardware supports AltiVec
+	mflr 0
+	bl 1f
+	.hidden __hwcap
+	.long __hwcap-.
+1:      mflr 4
+	lwz 5, 0(4)
+	add 4, 4, 5
+	ld 4, 0(4)
+	andis. 4, 4, 0x1000
+	beq 1f
 	addi 3, 3, 40*8
 	lvx 20, 0, 3 ; addi 3, 3, 16
 	lvx 21, 0, 3 ; addi 3, 3, 16
@@ -70,6 +80,7 @@ longjmp:
 	lvx 29, 0, 3 ; addi 3, 3, 16
 	lvx 30, 0, 3 ; addi 3, 3, 16
 	lvx 31, 0, 3
+1:	mtlr 0
 
 	# 7) return r4 ? r4 : 1
 	mr    3,   4
diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
index 37683fda..32853693 100644
--- a/src/setjmp/powerpc64/setjmp.s
+++ b/src/setjmp/powerpc64/setjmp.s
@@ -69,7 +69,17 @@ __setjmp_toc:
 	stfd 30, 38*8(3)
 	stfd 31, 39*8(3)
 
-	# 5) store vector registers v20-v31
+	# 5) store vector registers v20-v31 if hardware supports AltiVec
+	mflr 0
+	bl 1f
+	.hidden __hwcap
+	.long __hwcap-.
+1:	mflr 4
+	lwz 5, 0(4)
+	add 4, 4, 5
+	ld 4, 0(4)
+	andis. 4, 4, 0x1000
+	beq 1f
 	addi  3, 3, 40*8
 	stvx 20, 0, 3 ; addi 3, 3, 16
 	stvx 21, 0, 3 ; addi 3, 3, 16
@@ -83,6 +93,7 @@ __setjmp_toc:
 	stvx 29, 0, 3 ; addi 3, 3, 16
 	stvx 30, 0, 3 ; addi 3, 3, 16
 	stvx 31, 0, 3
+1:	mtlr 0
 
 	# 6) return 0
 	li 3, 0
-- 
2.32.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-06 23:43 [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp Stijn Tintel
@ 2021-12-07  0:37 ` Florian Weimer
  2021-12-07  0:59   ` Rich Felker
  0 siblings, 1 reply; 21+ messages in thread
From: Florian Weimer @ 2021-12-07  0:37 UTC (permalink / raw)
  To: Stijn Tintel; +Cc: musl

* Stijn Tintel:

> diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> index 37683fda..32853693 100644
> --- a/src/setjmp/powerpc64/setjmp.s
> +++ b/src/setjmp/powerpc64/setjmp.s
> @@ -69,7 +69,17 @@ __setjmp_toc:
>  	stfd 30, 38*8(3)
>  	stfd 31, 39*8(3)
>  
> -	# 5) store vector registers v20-v31
> +	# 5) store vector registers v20-v31 if hardware supports AltiVec
> +	mflr 0
> +	bl 1f
> +	.hidden __hwcap
> +	.long __hwcap-.
> +1:	mflr 4

This de-balances the return stack and probably has quite severe
performance impact.  The ISA manual says to use

  bcl 20,31,$+4

and you'll have to store the __hwcap offset somewhere else.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07  0:37 ` Florian Weimer
@ 2021-12-07  0:59   ` Rich Felker
  2021-12-07  1:15     ` David Edelsohn
  2021-12-08  8:43     ` Stijn Tintel
  0 siblings, 2 replies; 21+ messages in thread
From: Rich Felker @ 2021-12-07  0:59 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Stijn Tintel, musl

On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> * Stijn Tintel:
> 
> > diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> > index 37683fda..32853693 100644
> > --- a/src/setjmp/powerpc64/setjmp.s
> > +++ b/src/setjmp/powerpc64/setjmp.s
> > @@ -69,7 +69,17 @@ __setjmp_toc:
> >  	stfd 30, 38*8(3)
> >  	stfd 31, 39*8(3)
> >  
> > -	# 5) store vector registers v20-v31
> > +	# 5) store vector registers v20-v31 if hardware supports AltiVec
> > +	mflr 0
> > +	bl 1f
> > +	.hidden __hwcap
> > +	.long __hwcap-.
> > +1:	mflr 4
> 
> This de-balances the return stack and probably has quite severe
> performance impact.  The ISA manual says to use
> 
>   bcl 20,31,$+4
> 
> and you'll have to store the __hwcap offset somewhere else.

To begin with, let's change the .s files to .S files and put the whole
branch logic inside #ifndef __ALTIVEC__ so that it does not impact
normal builds with an ISA level where Altivec can be assumed to be
present.

I'm not sufficiently familiar with the PowerPC ISA to know how bcl
works, but if there's a less expensive solution along those lines
that's compatible with all ISA levels, by all means let's use it. The
same could be done for powerpc-sf (32-bit) and its SPE branches, too.

Also the add and lwz can be used into lwzx (indexed load).

Rich

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07  0:59   ` Rich Felker
@ 2021-12-07  1:15     ` David Edelsohn
  2021-12-07  1:39       ` Rich Felker
  2021-12-08  8:43     ` Stijn Tintel
  1 sibling, 1 reply; 21+ messages in thread
From: David Edelsohn @ 2021-12-07  1:15 UTC (permalink / raw)
  To: musl; +Cc: Florian Weimer, Stijn Tintel

On Mon, Dec 6, 2021 at 7:59 PM Rich Felker <dalias@libc.org> wrote:
>
> On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> > * Stijn Tintel:
> >
> > > diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> > > index 37683fda..32853693 100644
> > > --- a/src/setjmp/powerpc64/setjmp.s
> > > +++ b/src/setjmp/powerpc64/setjmp.s
> > > @@ -69,7 +69,17 @@ __setjmp_toc:
> > >     stfd 30, 38*8(3)
> > >     stfd 31, 39*8(3)
> > >
> > > -   # 5) store vector registers v20-v31
> > > +   # 5) store vector registers v20-v31 if hardware supports AltiVec
> > > +   mflr 0
> > > +   bl 1f
> > > +   .hidden __hwcap
> > > +   .long __hwcap-.
> > > +1: mflr 4
> >
> > This de-balances the return stack and probably has quite severe
> > performance impact.  The ISA manual says to use
> >
> >   bcl 20,31,$+4
> >
> > and you'll have to store the __hwcap offset somewhere else.
>
> To begin with, let's change the .s files to .S files and put the whole
> branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> normal builds with an ISA level where Altivec can be assumed to be
> present.
>
> I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> works, but if there's a less expensive solution along those lines
> that's compatible with all ISA levels, by all means let's use it. The
> same could be done for powerpc-sf (32-bit) and its SPE branches, too.

bl = branch and link
bcl = branch conditional and link

link means place the next instruction address in the link register.
Normally a branch and link would be used for a matching "return"
instruction, but in this case it is being used to compute a position
independent code address.  As Florian correctly points out, the "bl"
will corrupt the link stack in the processor used to predict return
addresses and the recommended sequence is the one that he suggests.

bcl 20,31,addr

which means branch always and, because the condition register bits are
irrelevant, a special value that instructs the processor to not  push
the address onto the link stack so that the "calls" and "returns"
remain matched.

Thanks, David

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07  1:15     ` David Edelsohn
@ 2021-12-07  1:39       ` Rich Felker
  2021-12-07  1:44         ` David Edelsohn
  0 siblings, 1 reply; 21+ messages in thread
From: Rich Felker @ 2021-12-07  1:39 UTC (permalink / raw)
  To: David Edelsohn; +Cc: musl, Florian Weimer, Stijn Tintel

On Mon, Dec 06, 2021 at 08:15:48PM -0500, David Edelsohn wrote:
> On Mon, Dec 6, 2021 at 7:59 PM Rich Felker <dalias@libc.org> wrote:
> >
> > On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> > > * Stijn Tintel:
> > >
> > > > diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> > > > index 37683fda..32853693 100644
> > > > --- a/src/setjmp/powerpc64/setjmp.s
> > > > +++ b/src/setjmp/powerpc64/setjmp.s
> > > > @@ -69,7 +69,17 @@ __setjmp_toc:
> > > >     stfd 30, 38*8(3)
> > > >     stfd 31, 39*8(3)
> > > >
> > > > -   # 5) store vector registers v20-v31
> > > > +   # 5) store vector registers v20-v31 if hardware supports AltiVec
> > > > +   mflr 0
> > > > +   bl 1f
> > > > +   .hidden __hwcap
> > > > +   .long __hwcap-.
> > > > +1: mflr 4
> > >
> > > This de-balances the return stack and probably has quite severe
> > > performance impact.  The ISA manual says to use
> > >
> > >   bcl 20,31,$+4
> > >
> > > and you'll have to store the __hwcap offset somewhere else.
> >
> > To begin with, let's change the .s files to .S files and put the whole
> > branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> > normal builds with an ISA level where Altivec can be assumed to be
> > present.
> >
> > I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> > works, but if there's a less expensive solution along those lines
> > that's compatible with all ISA levels, by all means let's use it. The
> > same could be done for powerpc-sf (32-bit) and its SPE branches, too.
> 
> bl = branch and link
> bcl = branch conditional and link
> 
> link means place the next instruction address in the link register.
> Normally a branch and link would be used for a matching "return"
> instruction, but in this case it is being used to compute a position
> independent code address.  As Florian correctly points out, the "bl"
> will corrupt the link stack in the processor used to predict return
> addresses and the recommended sequence is the one that he suggests.
> 
> bcl 20,31,addr
> 
> which means branch always and, because the condition register bits are
> irrelevant, a special value that instructs the processor to not  push
> the address onto the link stack so that the "calls" and "returns"
> remain matched.

Thanks. Am I correct in understanding then that we don't need $+4, but
can instead use the 1f just as now, with inline .long __hwcap-. -- in
other words that "bcl 20,31," is a drop-in replacement for "bl"
without the link stack impact?

Rich

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07  1:39       ` Rich Felker
@ 2021-12-07  1:44         ` David Edelsohn
  2021-12-07 13:25           ` Rich Felker
  2021-12-07 18:27           ` James Y Knight
  0 siblings, 2 replies; 21+ messages in thread
From: David Edelsohn @ 2021-12-07  1:44 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl, Florian Weimer, Stijn Tintel

On Mon, Dec 6, 2021 at 8:39 PM Rich Felker <dalias@libc.org> wrote:
>
> On Mon, Dec 06, 2021 at 08:15:48PM -0500, David Edelsohn wrote:
> > On Mon, Dec 6, 2021 at 7:59 PM Rich Felker <dalias@libc.org> wrote:
> > >
> > > On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> > > > * Stijn Tintel:
> > > >
> > > > > diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> > > > > index 37683fda..32853693 100644
> > > > > --- a/src/setjmp/powerpc64/setjmp.s
> > > > > +++ b/src/setjmp/powerpc64/setjmp.s
> > > > > @@ -69,7 +69,17 @@ __setjmp_toc:
> > > > >     stfd 30, 38*8(3)
> > > > >     stfd 31, 39*8(3)
> > > > >
> > > > > -   # 5) store vector registers v20-v31
> > > > > +   # 5) store vector registers v20-v31 if hardware supports AltiVec
> > > > > +   mflr 0
> > > > > +   bl 1f
> > > > > +   .hidden __hwcap
> > > > > +   .long __hwcap-.
> > > > > +1: mflr 4
> > > >
> > > > This de-balances the return stack and probably has quite severe
> > > > performance impact.  The ISA manual says to use
> > > >
> > > >   bcl 20,31,$+4
> > > >
> > > > and you'll have to store the __hwcap offset somewhere else.
> > >
> > > To begin with, let's change the .s files to .S files and put the whole
> > > branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> > > normal builds with an ISA level where Altivec can be assumed to be
> > > present.
> > >
> > > I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> > > works, but if there's a less expensive solution along those lines
> > > that's compatible with all ISA levels, by all means let's use it. The
> > > same could be done for powerpc-sf (32-bit) and its SPE branches, too.
> >
> > bl = branch and link
> > bcl = branch conditional and link
> >
> > link means place the next instruction address in the link register.
> > Normally a branch and link would be used for a matching "return"
> > instruction, but in this case it is being used to compute a position
> > independent code address.  As Florian correctly points out, the "bl"
> > will corrupt the link stack in the processor used to predict return
> > addresses and the recommended sequence is the one that he suggests.
> >
> > bcl 20,31,addr
> >
> > which means branch always and, because the condition register bits are
> > irrelevant, a special value that instructs the processor to not  push
> > the address onto the link stack so that the "calls" and "returns"
> > remain matched.
>
> Thanks. Am I correct in understanding then that we don't need $+4, but
> can instead use the 1f just as now, with inline .long __hwcap-. -- in
> other words that "bcl 20,31," is a drop-in replacement for "bl"
> without the link stack impact?

It should work, but it's slightly preferred to use $+4 because one
explicitly wants the address of the next instruction and labels of the
form "1f" are not supported by all assemblers.

Thanks, David

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07  1:44         ` David Edelsohn
@ 2021-12-07 13:25           ` Rich Felker
  2021-12-07 13:39             ` David Edelsohn
  2021-12-07 18:39             ` Markus Wichmann
  2021-12-07 18:27           ` James Y Knight
  1 sibling, 2 replies; 21+ messages in thread
From: Rich Felker @ 2021-12-07 13:25 UTC (permalink / raw)
  To: David Edelsohn; +Cc: musl, Florian Weimer, Stijn Tintel

On Mon, Dec 06, 2021 at 08:44:47PM -0500, David Edelsohn wrote:
> On Mon, Dec 6, 2021 at 8:39 PM Rich Felker <dalias@libc.org> wrote:
> >
> > On Mon, Dec 06, 2021 at 08:15:48PM -0500, David Edelsohn wrote:
> > > On Mon, Dec 6, 2021 at 7:59 PM Rich Felker <dalias@libc.org> wrote:
> > > >
> > > > On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> > > > > * Stijn Tintel:
> > > > >
> > > > > > diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> > > > > > index 37683fda..32853693 100644
> > > > > > --- a/src/setjmp/powerpc64/setjmp.s
> > > > > > +++ b/src/setjmp/powerpc64/setjmp.s
> > > > > > @@ -69,7 +69,17 @@ __setjmp_toc:
> > > > > >     stfd 30, 38*8(3)
> > > > > >     stfd 31, 39*8(3)
> > > > > >
> > > > > > -   # 5) store vector registers v20-v31
> > > > > > +   # 5) store vector registers v20-v31 if hardware supports AltiVec
> > > > > > +   mflr 0
> > > > > > +   bl 1f
> > > > > > +   .hidden __hwcap
> > > > > > +   .long __hwcap-.
> > > > > > +1: mflr 4
> > > > >
> > > > > This de-balances the return stack and probably has quite severe
> > > > > performance impact.  The ISA manual says to use
> > > > >
> > > > >   bcl 20,31,$+4
> > > > >
> > > > > and you'll have to store the __hwcap offset somewhere else.
> > > >
> > > > To begin with, let's change the .s files to .S files and put the whole
> > > > branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> > > > normal builds with an ISA level where Altivec can be assumed to be
> > > > present.
> > > >
> > > > I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> > > > works, but if there's a less expensive solution along those lines
> > > > that's compatible with all ISA levels, by all means let's use it. The
> > > > same could be done for powerpc-sf (32-bit) and its SPE branches, too.
> > >
> > > bl = branch and link
> > > bcl = branch conditional and link
> > >
> > > link means place the next instruction address in the link register.
> > > Normally a branch and link would be used for a matching "return"
> > > instruction, but in this case it is being used to compute a position
> > > independent code address.  As Florian correctly points out, the "bl"
> > > will corrupt the link stack in the processor used to predict return
> > > addresses and the recommended sequence is the one that he suggests.
> > >
> > > bcl 20,31,addr
> > >
> > > which means branch always and, because the condition register bits are
> > > irrelevant, a special value that instructs the processor to not  push
> > > the address onto the link stack so that the "calls" and "returns"
> > > remain matched.
> >
> > Thanks. Am I correct in understanding then that we don't need $+4, but
> > can instead use the 1f just as now, with inline .long __hwcap-. -- in
> > other words that "bcl 20,31," is a drop-in replacement for "bl"
> > without the link stack impact?
> 
> It should work, but it's slightly preferred to use $+4 because one
> explicitly wants the address of the next instruction and labels of the

In this case we don't want the address of the next instruction. We
want the address of the constant __hwcap-.

Rich

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07 13:25           ` Rich Felker
@ 2021-12-07 13:39             ` David Edelsohn
  2021-12-07 14:43               ` Rich Felker
  2021-12-07 18:39             ` Markus Wichmann
  1 sibling, 1 reply; 21+ messages in thread
From: David Edelsohn @ 2021-12-07 13:39 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl, Florian Weimer, Stijn Tintel

On Tue, Dec 7, 2021 at 8:25 AM Rich Felker <dalias@libc.org> wrote:
>
> On Mon, Dec 06, 2021 at 08:44:47PM -0500, David Edelsohn wrote:
> > On Mon, Dec 6, 2021 at 8:39 PM Rich Felker <dalias@libc.org> wrote:
> > >
> > > On Mon, Dec 06, 2021 at 08:15:48PM -0500, David Edelsohn wrote:
> > > > On Mon, Dec 6, 2021 at 7:59 PM Rich Felker <dalias@libc.org> wrote:
> > > > >
> > > > > On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> > > > > > * Stijn Tintel:
> > > > > >
> > > > > > > diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> > > > > > > index 37683fda..32853693 100644
> > > > > > > --- a/src/setjmp/powerpc64/setjmp.s
> > > > > > > +++ b/src/setjmp/powerpc64/setjmp.s
> > > > > > > @@ -69,7 +69,17 @@ __setjmp_toc:
> > > > > > >     stfd 30, 38*8(3)
> > > > > > >     stfd 31, 39*8(3)
> > > > > > >
> > > > > > > -   # 5) store vector registers v20-v31
> > > > > > > +   # 5) store vector registers v20-v31 if hardware supports AltiVec
> > > > > > > +   mflr 0
> > > > > > > +   bl 1f
> > > > > > > +   .hidden __hwcap
> > > > > > > +   .long __hwcap-.
> > > > > > > +1: mflr 4
> > > > > >
> > > > > > This de-balances the return stack and probably has quite severe
> > > > > > performance impact.  The ISA manual says to use
> > > > > >
> > > > > >   bcl 20,31,$+4
> > > > > >
> > > > > > and you'll have to store the __hwcap offset somewhere else.
> > > > >
> > > > > To begin with, let's change the .s files to .S files and put the whole
> > > > > branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> > > > > normal builds with an ISA level where Altivec can be assumed to be
> > > > > present.
> > > > >
> > > > > I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> > > > > works, but if there's a less expensive solution along those lines
> > > > > that's compatible with all ISA levels, by all means let's use it. The
> > > > > same could be done for powerpc-sf (32-bit) and its SPE branches, too.
> > > >
> > > > bl = branch and link
> > > > bcl = branch conditional and link
> > > >
> > > > link means place the next instruction address in the link register.
> > > > Normally a branch and link would be used for a matching "return"
> > > > instruction, but in this case it is being used to compute a position
> > > > independent code address.  As Florian correctly points out, the "bl"
> > > > will corrupt the link stack in the processor used to predict return
> > > > addresses and the recommended sequence is the one that he suggests.
> > > >
> > > > bcl 20,31,addr
> > > >
> > > > which means branch always and, because the condition register bits are
> > > > irrelevant, a special value that instructs the processor to not  push
> > > > the address onto the link stack so that the "calls" and "returns"
> > > > remain matched.
> > >
> > > Thanks. Am I correct in understanding then that we don't need $+4, but
> > > can instead use the 1f just as now, with inline .long __hwcap-. -- in
> > > other words that "bcl 20,31," is a drop-in replacement for "bl"
> > > without the link stack impact?
> >
> > It should work, but it's slightly preferred to use $+4 because one
> > explicitly wants the address of the next instruction and labels of the
>
> In this case we don't want the address of the next instruction. We
> want the address of the constant __hwcap-.

.hidden __hwcap

is not an instruction.  It will not emit any data.

Thanks, David

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07 13:39             ` David Edelsohn
@ 2021-12-07 14:43               ` Rich Felker
  2021-12-07 14:48                 ` David Edelsohn
  0 siblings, 1 reply; 21+ messages in thread
From: Rich Felker @ 2021-12-07 14:43 UTC (permalink / raw)
  To: David Edelsohn; +Cc: musl, Florian Weimer, Stijn Tintel

On Tue, Dec 07, 2021 at 08:39:20AM -0500, David Edelsohn wrote:
> On Tue, Dec 7, 2021 at 8:25 AM Rich Felker <dalias@libc.org> wrote:
> >
> > On Mon, Dec 06, 2021 at 08:44:47PM -0500, David Edelsohn wrote:
> > > On Mon, Dec 6, 2021 at 8:39 PM Rich Felker <dalias@libc.org> wrote:
> > > >
> > > > On Mon, Dec 06, 2021 at 08:15:48PM -0500, David Edelsohn wrote:
> > > > > On Mon, Dec 6, 2021 at 7:59 PM Rich Felker <dalias@libc.org> wrote:
> > > > > >
> > > > > > On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> > > > > > > * Stijn Tintel:
> > > > > > >
> > > > > > > > diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> > > > > > > > index 37683fda..32853693 100644
> > > > > > > > --- a/src/setjmp/powerpc64/setjmp.s
> > > > > > > > +++ b/src/setjmp/powerpc64/setjmp.s
> > > > > > > > @@ -69,7 +69,17 @@ __setjmp_toc:
> > > > > > > >     stfd 30, 38*8(3)
> > > > > > > >     stfd 31, 39*8(3)
> > > > > > > >
> > > > > > > > -   # 5) store vector registers v20-v31
> > > > > > > > +   # 5) store vector registers v20-v31 if hardware supports AltiVec
> > > > > > > > +   mflr 0
> > > > > > > > +   bl 1f
> > > > > > > > +   .hidden __hwcap
> > > > > > > > +   .long __hwcap-.
> > > > > > > > +1: mflr 4
> > > > > > >
> > > > > > > This de-balances the return stack and probably has quite severe
> > > > > > > performance impact.  The ISA manual says to use
> > > > > > >
> > > > > > >   bcl 20,31,$+4
> > > > > > >
> > > > > > > and you'll have to store the __hwcap offset somewhere else.
> > > > > >
> > > > > > To begin with, let's change the .s files to .S files and put the whole
> > > > > > branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> > > > > > normal builds with an ISA level where Altivec can be assumed to be
> > > > > > present.
> > > > > >
> > > > > > I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> > > > > > works, but if there's a less expensive solution along those lines
> > > > > > that's compatible with all ISA levels, by all means let's use it. The
> > > > > > same could be done for powerpc-sf (32-bit) and its SPE branches, too.
> > > > >
> > > > > bl = branch and link
> > > > > bcl = branch conditional and link
> > > > >
> > > > > link means place the next instruction address in the link register.
> > > > > Normally a branch and link would be used for a matching "return"
> > > > > instruction, but in this case it is being used to compute a position
> > > > > independent code address.  As Florian correctly points out, the "bl"
> > > > > will corrupt the link stack in the processor used to predict return
> > > > > addresses and the recommended sequence is the one that he suggests.
> > > > >
> > > > > bcl 20,31,addr
> > > > >
> > > > > which means branch always and, because the condition register bits are
> > > > > irrelevant, a special value that instructs the processor to not  push
> > > > > the address onto the link stack so that the "calls" and "returns"
> > > > > remain matched.
> > > >
> > > > Thanks. Am I correct in understanding then that we don't need $+4, but
> > > > can instead use the 1f just as now, with inline .long __hwcap-. -- in
> > > > other words that "bcl 20,31," is a drop-in replacement for "bl"
> > > > without the link stack impact?
> > >
> > > It should work, but it's slightly preferred to use $+4 because one
> > > explicitly wants the address of the next instruction and labels of the
> >
> > In this case we don't want the address of the next instruction. We
> > want the address of the constant __hwcap-.
> 
> ..hidden __hwcap
> 
> is not an instruction.  It will not emit any data.

Of course it won't. .long __hwcap-. is the directive that does, on the
next line, which you seem to have missed.

Rich

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07 14:43               ` Rich Felker
@ 2021-12-07 14:48                 ` David Edelsohn
  0 siblings, 0 replies; 21+ messages in thread
From: David Edelsohn @ 2021-12-07 14:48 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl, Florian Weimer, Stijn Tintel

On Tue, Dec 7, 2021 at 9:43 AM Rich Felker <dalias@libc.org> wrote:
>
> On Tue, Dec 07, 2021 at 08:39:20AM -0500, David Edelsohn wrote:
> > On Tue, Dec 7, 2021 at 8:25 AM Rich Felker <dalias@libc.org> wrote:
> > >
> > > On Mon, Dec 06, 2021 at 08:44:47PM -0500, David Edelsohn wrote:
> > > > On Mon, Dec 6, 2021 at 8:39 PM Rich Felker <dalias@libc.org> wrote:
> > > > >
> > > > > On Mon, Dec 06, 2021 at 08:15:48PM -0500, David Edelsohn wrote:
> > > > > > On Mon, Dec 6, 2021 at 7:59 PM Rich Felker <dalias@libc.org> wrote:
> > > > > > >
> > > > > > > On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> > > > > > > > * Stijn Tintel:
> > > > > > > >
> > > > > > > > > diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> > > > > > > > > index 37683fda..32853693 100644
> > > > > > > > > --- a/src/setjmp/powerpc64/setjmp.s
> > > > > > > > > +++ b/src/setjmp/powerpc64/setjmp.s
> > > > > > > > > @@ -69,7 +69,17 @@ __setjmp_toc:
> > > > > > > > >     stfd 30, 38*8(3)
> > > > > > > > >     stfd 31, 39*8(3)
> > > > > > > > >
> > > > > > > > > -   # 5) store vector registers v20-v31
> > > > > > > > > +   # 5) store vector registers v20-v31 if hardware supports AltiVec
> > > > > > > > > +   mflr 0
> > > > > > > > > +   bl 1f
> > > > > > > > > +   .hidden __hwcap
> > > > > > > > > +   .long __hwcap-.
> > > > > > > > > +1: mflr 4
> > > > > > > >
> > > > > > > > This de-balances the return stack and probably has quite severe
> > > > > > > > performance impact.  The ISA manual says to use
> > > > > > > >
> > > > > > > >   bcl 20,31,$+4
> > > > > > > >
> > > > > > > > and you'll have to store the __hwcap offset somewhere else.
> > > > > > >
> > > > > > > To begin with, let's change the .s files to .S files and put the whole
> > > > > > > branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> > > > > > > normal builds with an ISA level where Altivec can be assumed to be
> > > > > > > present.
> > > > > > >
> > > > > > > I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> > > > > > > works, but if there's a less expensive solution along those lines
> > > > > > > that's compatible with all ISA levels, by all means let's use it. The
> > > > > > > same could be done for powerpc-sf (32-bit) and its SPE branches, too.
> > > > > >
> > > > > > bl = branch and link
> > > > > > bcl = branch conditional and link
> > > > > >
> > > > > > link means place the next instruction address in the link register.
> > > > > > Normally a branch and link would be used for a matching "return"
> > > > > > instruction, but in this case it is being used to compute a position
> > > > > > independent code address.  As Florian correctly points out, the "bl"
> > > > > > will corrupt the link stack in the processor used to predict return
> > > > > > addresses and the recommended sequence is the one that he suggests.
> > > > > >
> > > > > > bcl 20,31,addr
> > > > > >
> > > > > > which means branch always and, because the condition register bits are
> > > > > > irrelevant, a special value that instructs the processor to not  push
> > > > > > the address onto the link stack so that the "calls" and "returns"
> > > > > > remain matched.
> > > > >
> > > > > Thanks. Am I correct in understanding then that we don't need $+4, but
> > > > > can instead use the 1f just as now, with inline .long __hwcap-. -- in
> > > > > other words that "bcl 20,31," is a drop-in replacement for "bl"
> > > > > without the link stack impact?
> > > >
> > > > It should work, but it's slightly preferred to use $+4 because one
> > > > explicitly wants the address of the next instruction and labels of the
> > >
> > > In this case we don't want the address of the next instruction. We
> > > want the address of the constant __hwcap-.
> >
> > ..hidden __hwcap
> >
> > is not an instruction.  It will not emit any data.
>
> Of course it won't. .long __hwcap-. is the directive that does, on the
> next line, which you seem to have missed.

I'm sorry that you don't understand what I am expressing.  Snide
comments are not productive.  Do what you want.

Thanks, David

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07  1:44         ` David Edelsohn
  2021-12-07 13:25           ` Rich Felker
@ 2021-12-07 18:27           ` James Y Knight
  2021-12-07 18:57             ` Markus Wichmann
  1 sibling, 1 reply; 21+ messages in thread
From: James Y Knight @ 2021-12-07 18:27 UTC (permalink / raw)
  To: musl; +Cc: Rich Felker, Florian Weimer, Stijn Tintel

[-- Attachment #1: Type: text/plain, Size: 3791 bytes --]

The important question at hand is whether the hardware treats "next
instruction" as a critical part of the special case. The recommended
sequence is:
  bcl 20,31,$+4
  next-instructions...

But, does the hardware _also_ trigger the expected special-cased effect on
the return stack when jumping to locations other than the next instruction?
E.g. is this OK w.r.t. return-stack?
  bcl 20,31,$+8
  .long __hwcap-.
  next-instructions...

On X86, calling *exactly* the next instruction is how you trigger the
special-case in the return-stack-predictor. But, it sounds like
potentially on PPC, the address is not part of what triggers the
special-case. Is that correct?


On Mon, Dec 6, 2021 at 8:45 PM David Edelsohn <dje.gcc@gmail.com> wrote:

> On Mon, Dec 6, 2021 at 8:39 PM Rich Felker <dalias@libc.org> wrote:
> >
> > On Mon, Dec 06, 2021 at 08:15:48PM -0500, David Edelsohn wrote:
> > > On Mon, Dec 6, 2021 at 7:59 PM Rich Felker <dalias@libc.org> wrote:
> > > >
> > > > On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> > > > > * Stijn Tintel:
> > > > >
> > > > > > diff --git a/src/setjmp/powerpc64/setjmp.s
> b/src/setjmp/powerpc64/setjmp.s
> > > > > > index 37683fda..32853693 100644
> > > > > > --- a/src/setjmp/powerpc64/setjmp.s
> > > > > > +++ b/src/setjmp/powerpc64/setjmp.s
> > > > > > @@ -69,7 +69,17 @@ __setjmp_toc:
> > > > > >     stfd 30, 38*8(3)
> > > > > >     stfd 31, 39*8(3)
> > > > > >
> > > > > > -   # 5) store vector registers v20-v31
> > > > > > +   # 5) store vector registers v20-v31 if hardware supports
> AltiVec
> > > > > > +   mflr 0
> > > > > > +   bl 1f
> > > > > > +   .hidden __hwcap
> > > > > > +   .long __hwcap-.
> > > > > > +1: mflr 4
> > > > >
> > > > > This de-balances the return stack and probably has quite severe
> > > > > performance impact.  The ISA manual says to use
> > > > >
> > > > >   bcl 20,31,$+4
> > > > >
> > > > > and you'll have to store the __hwcap offset somewhere else.
> > > >
> > > > To begin with, let's change the .s files to .S files and put the
> whole
> > > > branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> > > > normal builds with an ISA level where Altivec can be assumed to be
> > > > present.
> > > >
> > > > I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> > > > works, but if there's a less expensive solution along those lines
> > > > that's compatible with all ISA levels, by all means let's use it. The
> > > > same could be done for powerpc-sf (32-bit) and its SPE branches, too.
> > >
> > > bl = branch and link
> > > bcl = branch conditional and link
> > >
> > > link means place the next instruction address in the link register.
> > > Normally a branch and link would be used for a matching "return"
> > > instruction, but in this case it is being used to compute a position
> > > independent code address.  As Florian correctly points out, the "bl"
> > > will corrupt the link stack in the processor used to predict return
> > > addresses and the recommended sequence is the one that he suggests.
> > >
> > > bcl 20,31,addr
> > >
> > > which means branch always and, because the condition register bits are
> > > irrelevant, a special value that instructs the processor to not  push
> > > the address onto the link stack so that the "calls" and "returns"
> > > remain matched.
> >
> > Thanks. Am I correct in understanding then that we don't need $+4, but
> > can instead use the 1f just as now, with inline .long __hwcap-. -- in
> > other words that "bcl 20,31," is a drop-in replacement for "bl"
> > without the link stack impact?
>
> It should work, but it's slightly preferred to use $+4 because one
> explicitly wants the address of the next instruction and labels of the
> form "1f" are not supported by all assemblers.
>
> Thanks, David
>

[-- Attachment #2: Type: text/html, Size: 5251 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07 13:25           ` Rich Felker
  2021-12-07 13:39             ` David Edelsohn
@ 2021-12-07 18:39             ` Markus Wichmann
  2021-12-07 18:57               ` David Edelsohn
  2021-12-07 19:28               ` Florian Weimer
  1 sibling, 2 replies; 21+ messages in thread
From: Markus Wichmann @ 2021-12-07 18:39 UTC (permalink / raw)
  To: musl

On Tue, Dec 07, 2021 at 08:25:09AM -0500, Rich Felker wrote:
> On Mon, Dec 06, 2021 at 08:44:47PM -0500, David Edelsohn wrote:
> > It should work, but it's slightly preferred to use $+4 because one
> > explicitly wants the address of the next instruction and labels of the
>
> In this case we don't want the address of the next instruction. We
> want the address of the constant __hwcap-.
>
> Rich

According to at least one source I found at some point (and can't seem
to find right away), "bcl 20,31,+4" is the one special form of the
instruction that is most likely to circumvent the shadow stack. I have
seen "bcl 20,31,label" in the wild, even in cases where the label didn't
directly follow the instruction, so maybe it works, maybe it doesn't.

That said, architecturally it will work either way. We are only talking
about an implementation detail, and both IBM's and Freescale's/NXP's
documentation is very cagey about revealing any of those. The
instruction is specified to exist and do the right thing (namely to
branch with linking unconditionally) all the way back to the first
PowerPC implementations from the early nineties, but whether such a
thing as a branch predictor even exists, or if it uses shadow stacks to
predict the "blr" target, is entirely unspecified.

So yeah, you might want to restructure the code to move the hwcap offset
somewhere else.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07 18:39             ` Markus Wichmann
@ 2021-12-07 18:57               ` David Edelsohn
  2021-12-07 19:28               ` Florian Weimer
  1 sibling, 0 replies; 21+ messages in thread
From: David Edelsohn @ 2021-12-07 18:57 UTC (permalink / raw)
  To: musl

On Tue, Dec 7, 2021 at 1:39 PM Markus Wichmann <nullplan@gmx.net> wrote:
>
> On Tue, Dec 07, 2021 at 08:25:09AM -0500, Rich Felker wrote:
> > On Mon, Dec 06, 2021 at 08:44:47PM -0500, David Edelsohn wrote:
> > > It should work, but it's slightly preferred to use $+4 because one
> > > explicitly wants the address of the next instruction and labels of the
> >
> > In this case we don't want the address of the next instruction. We
> > want the address of the constant __hwcap-.
> >
> > Rich
>
> According to at least one source I found at some point (and can't seem
> to find right away), "bcl 20,31,+4" is the one special form of the
> instruction that is most likely to circumvent the shadow stack. I have
> seen "bcl 20,31,label" in the wild, even in cases where the label didn't
> directly follow the instruction, so maybe it works, maybe it doesn't.
>
> That said, architecturally it will work either way. We are only talking
> about an implementation detail, and both IBM's and Freescale's/NXP's
> documentation is very cagey about revealing any of those. The
> instruction is specified to exist and do the right thing (namely to
> branch with linking unconditionally) all the way back to the first
> PowerPC implementations from the early nineties, but whether such a
> thing as a branch predictor even exists, or if it uses shadow stacks to
> predict the "blr" target, is entirely unspecified.
>
> So yeah, you might want to restructure the code to move the hwcap offset
> somewhere else.

Only "$+4" is documented.  As you wote, it probably would be best to
place the offset somewhere else.  It also is not good practice to have
arbitrary bit patterns of data that potentially are not valid
instructions in the instruction stream.

Thanks, David

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07 18:27           ` James Y Knight
@ 2021-12-07 18:57             ` Markus Wichmann
  0 siblings, 0 replies; 21+ messages in thread
From: Markus Wichmann @ 2021-12-07 18:57 UTC (permalink / raw)
  To: musl

On Tue, Dec 07, 2021 at 01:27:08PM -0500, James Y Knight wrote:
> The important question at hand is whether the hardware treats "next
> instruction" as a critical part of the special case. The recommended
> sequence is:
>   bcl 20,31,$+4
>   next-instructions...
>
> But, does the hardware _also_ trigger the expected special-cased effect on
> the return stack when jumping to locations other than the next instruction?
> E.g. is this OK w.r.t. return-stack?
>   bcl 20,31,$+8
>   .long __hwcap-.
>   next-instructions...
>
> On X86, calling *exactly* the next instruction is how you trigger the
> special-case in the return-stack-predictor. But, it sounds like
> potentially on PPC, the address is not part of what triggers the
> special-case. Is that correct?
>

In all the code I've read, people seem to gravitate towards the +4 form
if they can possibly help it. So I guess it really is the entire
instruction that is special. That said, architecturally the right thing
will happen either way, and if any kind of shadow stack is even involved
or successfully circumvented is in the hands of the implementation, and
all implementers whose documentation I have read so far have been very
stingy on implementation details like this.

The difference with X86 is that in case of PPC we are using a different
instruction entirely to get the instruction pointer. X86 only has the
one call instruction. Also, I'd thought the return stack was the reason
for GNU to add linkonce capability to the linker. Because at some point
I started seeing linkonce functions that read the return address into a
register and return crop up in assembler listings generated by GCC. I
didn't know there was a way to circumvent that stack.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07 18:39             ` Markus Wichmann
  2021-12-07 18:57               ` David Edelsohn
@ 2021-12-07 19:28               ` Florian Weimer
  2021-12-07 20:15                 ` Markus Wichmann
  1 sibling, 1 reply; 21+ messages in thread
From: Florian Weimer @ 2021-12-07 19:28 UTC (permalink / raw)
  To: Markus Wichmann; +Cc: musl

* Markus Wichmann:

> That said, architecturally it will work either way. We are only talking
> about an implementation detail, and both IBM's and Freescale's/NXP's
> documentation is very cagey about revealing any of those.

We do have source code for one implementation.

| -- bcl 20,31,$+4 is special case.  not a subroutine call, used to get next instruction address, should not be placed on link stack.
| iu4_bo_d( 6 to 10)              <= iu3_instr_pri( 6 to 10);
| iu4_bi_d(11 to 15)              <= iu3_instr_pri(11 to 15);
| 
| iu4_getNIA                      <= iu4_opcode_q(0 to 5)         = "010000"      and
|                                    iu4_bo_q(6 to 10)            = "10100"       and
|                                    iu4_bi_q(11 to 15)           = "11111"       and
|                                    iu4_bd(EFF_IFAR'left to 61)  = 1             and
|                                    iu4_aa_q                     = '0'           and
|                                    iu4_lk_q                     = '1'           ;

<https://github.com/openpower-cores/a2i/blob/96299300abca65a074c635204a163e10569ee9b7/rel/src/vhdl/work/iuq_bp.vhdl#L880>

I suspect “iu4_bd(EFF_IFAR'left to 61) = 1” matches 4 exactly (the
lowest four bits of the offset are not encoded in the instruction
because they are always zero).  But I don't know any VHDL.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07 19:28               ` Florian Weimer
@ 2021-12-07 20:15                 ` Markus Wichmann
  2021-12-07 20:29                   ` Rich Felker
  0 siblings, 1 reply; 21+ messages in thread
From: Markus Wichmann @ 2021-12-07 20:15 UTC (permalink / raw)
  To: Florian Weimer; +Cc: musl

On Tue, Dec 07, 2021 at 08:28:28PM +0100, Florian Weimer wrote:
> We do have source code for one implementation.
>
> | -- bcl 20,31,$+4 is special case.  not a subroutine call, used to get next instruction address, should not be placed on link stack.
> | iu4_bo_d( 6 to 10)              <= iu3_instr_pri( 6 to 10);
> | iu4_bi_d(11 to 15)              <= iu3_instr_pri(11 to 15);
> |
> | iu4_getNIA                      <= iu4_opcode_q(0 to 5)         = "010000"      and
> |                                    iu4_bo_q(6 to 10)            = "10100"       and
> |                                    iu4_bi_q(11 to 15)           = "11111"       and
> |                                    iu4_bd(EFF_IFAR'left to 61)  = 1             and
> |                                    iu4_aa_q                     = '0'           and
> |                                    iu4_lk_q                     = '1'           ;
>
> <https://github.com/openpower-cores/a2i/blob/96299300abca65a074c635204a163e10569ee9b7/rel/src/vhdl/work/iuq_bp.vhdl#L880>
>
> I suspect “iu4_bd(EFF_IFAR'left to 61) = 1” matches 4 exactly (the
> lowest four bits of the offset are not encoded in the instruction
> because they are always zero).  But I don't know any VHDL.
>

Me neither but I do recognize a few of those words. The opcode obviously
refers to the most significant six bits, encoding the primary opcode,
and "bo", "bi", "bd", "aa", and "lk" are what the PPC books call the
various fields of this particular instruction (that being "bc", branch
conditional). So this matches exactly the "+4" form of the instruction
discussed so far.

BTW, musl's PPC code contains a few more instances of getting NIA with
"bl", in the CRT code and in GETFUNCSYM() at least.  So if we're
spending this much time finding out the optimal way to get the NIA, we
should probably do the same there, for consistency if nothing else.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07 20:15                 ` Markus Wichmann
@ 2021-12-07 20:29                   ` Rich Felker
  2021-12-08  5:02                     ` Markus Wichmann
  0 siblings, 1 reply; 21+ messages in thread
From: Rich Felker @ 2021-12-07 20:29 UTC (permalink / raw)
  To: Markus Wichmann; +Cc: Florian Weimer, musl

On Tue, Dec 07, 2021 at 09:15:05PM +0100, Markus Wichmann wrote:
> On Tue, Dec 07, 2021 at 08:28:28PM +0100, Florian Weimer wrote:
> > We do have source code for one implementation.
> >
> > | -- bcl 20,31,$+4 is special case.  not a subroutine call, used to get next instruction address, should not be placed on link stack.
> > | iu4_bo_d( 6 to 10)              <= iu3_instr_pri( 6 to 10);
> > | iu4_bi_d(11 to 15)              <= iu3_instr_pri(11 to 15);
> > |
> > | iu4_getNIA                      <= iu4_opcode_q(0 to 5)         = "010000"      and
> > |                                    iu4_bo_q(6 to 10)            = "10100"       and
> > |                                    iu4_bi_q(11 to 15)           = "11111"       and
> > |                                    iu4_bd(EFF_IFAR'left to 61)  = 1             and
> > |                                    iu4_aa_q                     = '0'           and
> > |                                    iu4_lk_q                     = '1'           ;
> >
> > <https://github.com/openpower-cores/a2i/blob/96299300abca65a074c635204a163e10569ee9b7/rel/src/vhdl/work/iuq_bp.vhdl#L880>
> >
> > I suspect “iu4_bd(EFF_IFAR'left to 61) = 1” matches 4 exactly (the
> > lowest four bits of the offset are not encoded in the instruction
> > because they are always zero).  But I don't know any VHDL.
> >
> 
> Me neither but I do recognize a few of those words. The opcode obviously
> refers to the most significant six bits, encoding the primary opcode,
> and "bo", "bi", "bd", "aa", and "lk" are what the PPC books call the
> various fields of this particular instruction (that being "bc", branch
> conditional). So this matches exactly the "+4" form of the instruction
> discussed so far.

Thanks for digging this up!

> BTW, musl's PPC code contains a few more instances of getting NIA with
> "bl", in the CRT code and in GETFUNCSYM() at least.  So if we're
> spending this much time finding out the optimal way to get the NIA, we
> should probably do the same there, for consistency if nothing else.

In general I would prefer the "obvious what it's doing" form over the
"special cased for performance" form in places where performance can't
matter -- for example, the ones you cited that execute once per
program invocation. But if it's easy to read either way, fine -- and
it probably can be made so.

Note that if the __hwcap-. constant is moved out of line, I think it's
possible to avoid any added cost. Something along the lines of the
following:

	bcl 20,31,1f
1:	mflr 4
	lwz 5,2f-1b(4)
	lwzx 4,4,5
	...
2:	.long __hwcap-1b

Does this look right?

Rich

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07 20:29                   ` Rich Felker
@ 2021-12-08  5:02                     ` Markus Wichmann
  0 siblings, 0 replies; 21+ messages in thread
From: Markus Wichmann @ 2021-12-08  5:02 UTC (permalink / raw)
  To: Rich Felker; +Cc: Florian Weimer, musl

On Tue, Dec 07, 2021 at 03:29:21PM -0500, Rich Felker wrote:
> In general I would prefer the "obvious what it's doing" form over the
> "special cased for performance" form in places where performance can't
> matter -- for example, the ones you cited that execute once per
> program invocation. But if it's easy to read either way, fine -- and
> it probably can be made so.
>

I foresee no issue with readability. Indeed most avid PPC assembly
readers will recognize "bcl 20,31" as "just getting the instruction
pointer" sooner than "bl", but the functions in question are so small it
doesn't really matter either way.

> Note that if the __hwcap-. constant is moved out of line, I think it's
> possible to avoid any added cost. Something along the lines of the
> following:
>
> 	bcl 20,31,1f
> 1:	mflr 4
> 	lwz 5,2f-1b(4)
> 	lwzx 4,4,5
> 	...
> 2:	.long __hwcap-1b
>
> Does this look right?

Seems right to me.

David's warning made me remember an article I read once about branch
prediction and cache instructions: Basically, cache instructions have no
execution phase, I mean, architecturally they have no effect. They
change no memory and no registers, they change an implementation detail
that ought to be transparent to the programmer.

So if a branch is mispredicted to hit a given cache instruction, that
cache instruction will be executed to the fullest even if the pipeline
is flushed (pipeline flush simply skips execution phase, which cache
instructions don't have). Now, the XBox 360 CPU had a special cache
instruction (I believe it was "xdcbl" or so), which could circumvent the
L2 cache. Unfortunately, all access synchronization between CPUs
happens through the L2 cache. Therefore this instruction should not be
used on memory that can be shared between CPUs, which is pretty much all
memory in user space (any thread might be preempted and migrated at any
time, so not even stack is safe).

Unfortunately, with the above mentioned branch prediction drama, the
instruction can cause issues if it merely shows up in the instruction
stream, even if it is ultimately never executed. They had to remove any
instance of this instruction from their programs to get the issues to
disappear.

Now with your hwcap pointer, you have no idea what instruction it will
end up looking like. But if we put the pointer into .rodata, the offset
between labels 2 and 1 might be larger than 32kB, making the code more
complicated. You could put "b ." in front of it, to stop any branch
misprediction before it. I don't know, you figure it out.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-07  0:59   ` Rich Felker
  2021-12-07  1:15     ` David Edelsohn
@ 2021-12-08  8:43     ` Stijn Tintel
  2021-12-08 13:37       ` Rich Felker
  1 sibling, 1 reply; 21+ messages in thread
From: Stijn Tintel @ 2021-12-08  8:43 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On 7/12/2021 02:59, Rich Felker wrote:
> On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
>> * Stijn Tintel:
>>
>>> diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
>>> index 37683fda..32853693 100644
>>> --- a/src/setjmp/powerpc64/setjmp.s
>>> +++ b/src/setjmp/powerpc64/setjmp.s
>>> @@ -69,7 +69,17 @@ __setjmp_toc:
>>>  	stfd 30, 38*8(3)
>>>  	stfd 31, 39*8(3)
>>>  
>>> -	# 5) store vector registers v20-v31
>>> +	# 5) store vector registers v20-v31 if hardware supports AltiVec
>>> +	mflr 0
>>> +	bl 1f
>>> +	.hidden __hwcap
>>> +	.long __hwcap-.
>>> +1:	mflr 4
>> This de-balances the return stack and probably has quite severe
>> performance impact.  The ISA manual says to use
>>
>>   bcl 20,31,$+4
>>
>> and you'll have to store the __hwcap offset somewhere else.
> To begin with, let's change the .s files to .S files and put the whole
> branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> normal builds with an ISA level where Altivec can be assumed to be
> present.
>
> I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> works, but if there's a less expensive solution along those lines
> that's compatible with all ISA levels, by all means let's use it. The
> same could be done for powerpc-sf (32-bit) and its SPE branches, too.
>
> Also the add and lwz can be used into lwzx (indexed load).
>
The code for ppc64 uses ld after add, not lwz. This is required to make
it work on both big and little endian systems. We therefore cannot use
lwzx, but have to use ldx.

Stijn


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-08  8:43     ` Stijn Tintel
@ 2021-12-08 13:37       ` Rich Felker
  2021-12-08 15:36         ` Rich Felker
  0 siblings, 1 reply; 21+ messages in thread
From: Rich Felker @ 2021-12-08 13:37 UTC (permalink / raw)
  To: Stijn Tintel; +Cc: musl

On Wed, Dec 08, 2021 at 10:43:05AM +0200, Stijn Tintel wrote:
> On 7/12/2021 02:59, Rich Felker wrote:
> > On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> >> * Stijn Tintel:
> >>
> >>> diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> >>> index 37683fda..32853693 100644
> >>> --- a/src/setjmp/powerpc64/setjmp.s
> >>> +++ b/src/setjmp/powerpc64/setjmp.s
> >>> @@ -69,7 +69,17 @@ __setjmp_toc:
> >>>  	stfd 30, 38*8(3)
> >>>  	stfd 31, 39*8(3)
> >>>  
> >>> -	# 5) store vector registers v20-v31
> >>> +	# 5) store vector registers v20-v31 if hardware supports AltiVec
> >>> +	mflr 0
> >>> +	bl 1f
> >>> +	.hidden __hwcap
> >>> +	.long __hwcap-.
> >>> +1:	mflr 4
> >> This de-balances the return stack and probably has quite severe
> >> performance impact.  The ISA manual says to use
> >>
> >>   bcl 20,31,$+4
> >>
> >> and you'll have to store the __hwcap offset somewhere else.
> > To begin with, let's change the .s files to .S files and put the whole
> > branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> > normal builds with an ISA level where Altivec can be assumed to be
> > present.
> >
> > I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> > works, but if there's a less expensive solution along those lines
> > that's compatible with all ISA levels, by all means let's use it. The
> > same could be done for powerpc-sf (32-bit) and its SPE branches, too.
> >
> > Also the add and lwz can be used into lwzx (indexed load).
> >
> The code for ppc64 uses ld after add, not lwz. This is required to make
> it work on both big and little endian systems. We therefore cannot use
> lwzx, but have to use ldx.

OK, I don't understand why endianness would matter, but I do see a
problem here: ld expects to load a 64-bit value, but the value is only
32-bit (.long). Unless I'm missing something, we need to either make
it 64-bit (.llong, and with proper alignment) or use a sign-extending
32-bit load. The latter would assume a model where the whole program
(for static linking) or libc.so (for dynamic) fits in ±2GB. This is
clearly valid for dynamic but dubious for static (although maybe GCC
already assumes this with how it loads the GOT address and DSO-local
globals?).

Rich

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp
  2021-12-08 13:37       ` Rich Felker
@ 2021-12-08 15:36         ` Rich Felker
  0 siblings, 0 replies; 21+ messages in thread
From: Rich Felker @ 2021-12-08 15:36 UTC (permalink / raw)
  To: Stijn Tintel; +Cc: musl

On Wed, Dec 08, 2021 at 08:37:13AM -0500, Rich Felker wrote:
> On Wed, Dec 08, 2021 at 10:43:05AM +0200, Stijn Tintel wrote:
> > On 7/12/2021 02:59, Rich Felker wrote:
> > > On Tue, Dec 07, 2021 at 01:37:12AM +0100, Florian Weimer wrote:
> > >> * Stijn Tintel:
> > >>
> > >>> diff --git a/src/setjmp/powerpc64/setjmp.s b/src/setjmp/powerpc64/setjmp.s
> > >>> index 37683fda..32853693 100644
> > >>> --- a/src/setjmp/powerpc64/setjmp.s
> > >>> +++ b/src/setjmp/powerpc64/setjmp.s
> > >>> @@ -69,7 +69,17 @@ __setjmp_toc:
> > >>>  	stfd 30, 38*8(3)
> > >>>  	stfd 31, 39*8(3)
> > >>>  
> > >>> -	# 5) store vector registers v20-v31
> > >>> +	# 5) store vector registers v20-v31 if hardware supports AltiVec
> > >>> +	mflr 0
> > >>> +	bl 1f
> > >>> +	.hidden __hwcap
> > >>> +	.long __hwcap-.
> > >>> +1:	mflr 4
> > >> This de-balances the return stack and probably has quite severe
> > >> performance impact.  The ISA manual says to use
> > >>
> > >>   bcl 20,31,$+4
> > >>
> > >> and you'll have to store the __hwcap offset somewhere else.
> > > To begin with, let's change the .s files to .S files and put the whole
> > > branch logic inside #ifndef __ALTIVEC__ so that it does not impact
> > > normal builds with an ISA level where Altivec can be assumed to be
> > > present.
> > >
> > > I'm not sufficiently familiar with the PowerPC ISA to know how bcl
> > > works, but if there's a less expensive solution along those lines
> > > that's compatible with all ISA levels, by all means let's use it. The
> > > same could be done for powerpc-sf (32-bit) and its SPE branches, too.
> > >
> > > Also the add and lwz can be used into lwzx (indexed load).
> > >
> > The code for ppc64 uses ld after add, not lwz. This is required to make
> > it work on both big and little endian systems. We therefore cannot use
> > lwzx, but have to use ldx.
> 
> OK, I don't understand why endianness would matter, but I do see a
> problem here: ld expects to load a 64-bit value, but the value is only
> 32-bit (.long). Unless I'm missing something, we need to either make
> it 64-bit (.llong, and with proper alignment) or use a sign-extending
> 32-bit load. The latter would assume a model where the whole program
> (for static linking) or libc.so (for dynamic) fits in ±2GB. This is
> clearly valid for dynamic but dubious for static (although maybe GCC
> already assumes this with how it loads the GOT address and DSO-local
> globals?).

OK, I see now -- I was mixing up the load of __hwcap (which is
necessarily 64-bit) and the load of the relative address constant
(which could be done either way). The comment about lwzx not being
appropriate was just because __hwcap is 64-bit not 32-bit, and that's
fixed by using ldx. But I'm still unclear on whether we should use a
full 64-bit relative address constant or a 32-bit one like we've been
using (which assumes everything in ±2GB).

Rich

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-12-08 15:37 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-06 23:43 [musl] [PATCH] ppc64: check for AltiVec in setjmp/longjmp Stijn Tintel
2021-12-07  0:37 ` Florian Weimer
2021-12-07  0:59   ` Rich Felker
2021-12-07  1:15     ` David Edelsohn
2021-12-07  1:39       ` Rich Felker
2021-12-07  1:44         ` David Edelsohn
2021-12-07 13:25           ` Rich Felker
2021-12-07 13:39             ` David Edelsohn
2021-12-07 14:43               ` Rich Felker
2021-12-07 14:48                 ` David Edelsohn
2021-12-07 18:39             ` Markus Wichmann
2021-12-07 18:57               ` David Edelsohn
2021-12-07 19:28               ` Florian Weimer
2021-12-07 20:15                 ` Markus Wichmann
2021-12-07 20:29                   ` Rich Felker
2021-12-08  5:02                     ` Markus Wichmann
2021-12-07 18:27           ` James Y Knight
2021-12-07 18:57             ` Markus Wichmann
2021-12-08  8:43     ` Stijn Tintel
2021-12-08 13:37       ` Rich Felker
2021-12-08 15:36         ` Rich Felker

Code repositories for project(s) associated with this inbox:

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).