mailing list of musl libc
* [musl] Powerpc Linux 'scv' system call ABI proposal take 2
@ 2020-04-15 21:45 Nicholas Piggin
  2020-04-15 22:55 ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-15 21:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: libc-alpha, libc-dev, musl, Segher Boessenkool

I would like to enable Linux support for the powerpc 'scv' instruction,
as a faster system call instruction.

This requires two things to be defined. First, a way to advertise to
userspace that the kernel supports scv, and a way to allocate and
advertise support for individual scv vectors. Second, a calling
convention ABI for this new instruction.

Thanks to those who commented last time. Since then I have removed my
answered questions and unpopular alternatives, but you can find them
here:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html

Let me try once more with a wider cc list, and then we'll get something
merged. Any questions or counter-opinions are welcome.

System Call Vectored (scv) ABI
==============================

The scv instruction was introduced with POWER9 / ISA3, and it comes with
an rfscv counterpart. The benefit of these instructions is performance
(trading slower SRR0/1 for faster LR/CTR registers, and entering the
kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR
updates). The scv instruction has 128 interrupt entry points (not enough
to cover the Linux system call space).

The proposal is to assign scv numbers very conservatively and allocate 
them as individual HWCAP features as we add support for more. The zero 
vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.

Advertisement

Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
SIGILL in current environments. Linux has defined a HWCAP2 bit 
PPC_FEATURE2_SCV for SCV support, but does not set it.

When scv instruction support and the scv 0 vector for system calls are 
added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
should not be used without future HWCAP bits indicating support, which is
how we will allocate them. (Should unallocated ones generate SIGILL, or
return -ENOSYS in r3?)
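
A minimal sketch of how userspace might test for this advertisement. It
assumes the bit value 0x00100000 that the kernel's uapi cputable.h header
assigns to PPC_FEATURE2_SCV; the helper name has_scv0 is hypothetical and
not part of the proposal. On a powerpc system the argument would normally
come from getauxval(AT_HWCAP2).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value, copied from the kernel's uapi <asm/cputable.h>;
 * defined here only so the sketch builds on any host. */
#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000UL
#endif

/* The check below relies on the feature being a single-bit mask. */
static_assert((PPC_FEATURE2_SCV & (PPC_FEATURE2_SCV - 1)) == 0,
              "PPC_FEATURE2_SCV is expected to be a single bit");

/* Hypothetical helper: true when the HWCAP2 word advertises the scv
 * instruction together with the 'scv 0' system call vector. */
static bool has_scv0(unsigned long hwcap2)
{
    return (hwcap2 & PPC_FEATURE2_SCV) != 0;
}
```

Any other vector would need its own, not yet allocated, HWCAP bit before
a similar test could be written for it.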

Calling convention

The proposal is for scv 0 to provide the standard Linux system call ABI 
with the following differences from sc convention[1]:

- LR is to be volatile across scv calls. This is necessary because the 
  scv instruction clobbers LR. From previous discussion, this should be 
  possible to deal with in GCC clobbers and CFI.

- CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
  kernel system call exit to avoid restoring the CR register (although 
  we probably still would anyway to avoid information leak).

- Error handling: I think the consensus has been to move to using negative
  return value in r3 rather than CR0[SO]=1 to indicate error, which matches
  most other architectures and is closer to a function call.
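
The difference in error handling can be sketched in C. This is only a
model: struct syscall_ret and both helpers are hypothetical names, and
the 4095 bound is the usual Linux MAX_ERRNO limit, assumed here rather
than stated in this proposal. Under the existing sc convention an error
is flagged by CR0[SO]=1 with a positive errno in r3.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state that matters after a system call. */
struct syscall_ret {
    long r3;      /* return value register */
    bool cr0_so;  /* CR0[SO]; only meaningful for 'sc' */
};

/* Existing 'sc' convention: CR0[SO]=1 flags failure and r3 holds a
 * positive errno. Normalise that to the negative-errno form. */
static long decode_sc(struct syscall_ret ret)
{
    assert(!ret.cr0_so || ret.r3 > 0);
    return ret.cr0_so ? -ret.r3 : ret.r3;
}

/* Proposed 'scv 0' convention: r3 alone carries the result, and a
 * value in [-4095, -1] is treated as an error (assumed MAX_ERRNO). */
static bool scv_is_error(long r3)
{
    return (unsigned long)r3 >= (unsigned long)-4095L;
}
```

With the second form the caller needs no access to the condition
register, which is what makes it closer to an ordinary function call.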

The number of scratch registers (r9-r12) at kernel entry seems
sufficient that we don't have any costly spilling; the patch is here[2].

[1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
[2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html



* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-15 21:45 [musl] Powerpc Linux 'scv' system call ABI proposal take 2 Nicholas Piggin
@ 2020-04-15 22:55 ` Rich Felker
  2020-04-16  0:16   ` Nicholas Piggin
                     ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Rich Felker @ 2020-04-15 22:55 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linuxppc-dev, libc-alpha, libc-dev, musl, Segher Boessenkool

On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
> I would like to enable Linux support for the powerpc 'scv' instruction,
> as a faster system call instruction.
> 
> This requires two things to be defined: Firstly a way to advertise to 
> userspace that kernel supports scv, and a way to allocate and advertise
> support for individual scv vectors. Secondly, a calling convention ABI
> for this new instruction.
> 
> Thanks to those who commented last time, since then I have removed my
> answered questions and unpopular alternatives but you can find them
> here
> 
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
> 
> Let me try one more with a wider cc list, and then we'll get something
> merged. Any questions or counter-opinions are welcome.
> 
> System Call Vectored (scv) ABI
> ==============================
> 
> The scv instruction is introduced with POWER9 / ISA3, it comes with an
> rfscv counter-part. The benefit of these instructions is performance
> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
> updates. The scv instruction has 128 interrupt entry points (not enough 
> to cover the Linux system call space).
> 
> The proposal is to assign scv numbers very conservatively and allocate 
> them as individual HWCAP features as we add support for more. The zero 
> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
> 
> Advertisement
> 
> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
> SIGILL in current environments. Linux has defined a HWCAP2 bit 
> PPC_FEATURE2_SCV for SCV support, but does not set it.
> 
> When scv instruction support and the scv 0 vector for system calls are 
> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
> should not be used without future HWCAP bits indicating support, which is
> how we will allocate them. (Should unallocated ones generate SIGILL, or
> return -ENOSYS in r3?)
> 
> Calling convention
> 
> The proposal is for scv 0 to provide the standard Linux system call ABI 
> with the following differences from sc convention[1]:
> 
> - LR is to be volatile across scv calls. This is necessary because the 
>   scv instruction clobbers LR. From previous discussion, this should be 
>   possible to deal with in GCC clobbers and CFI.
> 
> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>   kernel system call exit to avoid restoring the CR register (although 
>   we probably still would anyway to avoid information leak).
> 
> - Error handling: I think the consensus has been to move to using negative
>   return value in r3 rather than CR0[SO]=1 to indicate error, which matches
>   most other architectures and is closer to a function call.
> 
> The number of scratch registers (r9-r12) at kernel entry seems 
> sufficient that we don't have any costly spilling, patch is here[2].  
> 
> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html

My preference would be that it work just like the i386 AT_SYSINFO
where you just replace "int $128" with "call *%%gs:16" and the kernel
provides a stub in the vdso that performs either scv or the old
mechanism with the same calling convention. Then if the kernel doesn't
provide it (because the kernel is too old) libc would have to provide
its own stub that uses the legacy method and matches the calling
convention of the one the kernel is expected to provide.

Note that any libc that actually makes use of the new functionality is
not going to be able to make clobbers conditional on support for it;
branching around different clobbers is going to defeat any gains vs
always just treating anything clobbered by either method as clobbered.
Likewise, it's not useful to have different error return mechanisms
because the caller just has to branch to support both (or the
kernel-provided stub just has to emulate one for it; that could work
if you really want to change the bad existing convention).
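
A rough C sketch of the fallback described above, under stated
assumptions: all names are hypothetical, the function bodies are
stand-ins for real entry sequences, and a NULL argument models an old
kernel that exports no stub.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*syscall_entry_fn)(long nr, long arg);

/* Records which stand-in ran, for illustration only. */
static const char *last_entry = "none";

/* Stand-in for libc's own stub around the legacy 'sc' sequence. */
static long libc_legacy_stub(long nr, long arg)
{
    last_entry = "libc-legacy";
    return nr + arg;
}

/* Stand-in for a kernel-provided vdso stub using the same convention. */
static long fake_vdso_stub(long nr, long arg)
{
    last_entry = "kernel-vdso";
    return nr + arg;
}

/* Prefer the kernel's stub; fall back to libc's when none is exported. */
static syscall_entry_fn pick_entry(syscall_entry_fn kernel_stub)
{
    syscall_entry_fn entry = kernel_stub ? kernel_stub : libc_legacy_stub;
    assert(entry != NULL);
    return entry;
}
```

Both stubs share one signature, which is the point: the caller sees a
single calling convention whichever mechanism sits behind the pointer.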

Thoughts?

Rich


* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-15 22:55 ` Rich Felker
@ 2020-04-16  0:16   ` Nicholas Piggin
  2020-04-16  0:48     ` Rich Felker
                       ` (2 more replies)
  2020-04-16  4:48   ` Florian Weimer
  2020-04-16 14:16   ` Adhemerval Zanella
  2 siblings, 3 replies; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-16  0:16 UTC (permalink / raw)
  To: Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool

Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
> On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
>> I would like to enable Linux support for the powerpc 'scv' instruction,
>> as a faster system call instruction.
>> 
>> This requires two things to be defined: Firstly a way to advertise to 
>> userspace that kernel supports scv, and a way to allocate and advertise
>> support for individual scv vectors. Secondly, a calling convention ABI
>> for this new instruction.
>> 
>> Thanks to those who commented last time, since then I have removed my
>> answered questions and unpopular alternatives but you can find them
>> here
>> 
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
>> 
>> Let me try one more with a wider cc list, and then we'll get something
>> merged. Any questions or counter-opinions are welcome.
>> 
>> System Call Vectored (scv) ABI
>> ==============================
>> 
>> The scv instruction is introduced with POWER9 / ISA3, it comes with an
>> rfscv counter-part. The benefit of these instructions is performance
>> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
>> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
>> updates. The scv instruction has 128 interrupt entry points (not enough 
>> to cover the Linux system call space).
>> 
>> The proposal is to assign scv numbers very conservatively and allocate 
>> them as individual HWCAP features as we add support for more. The zero 
>> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
>> 
>> Advertisement
>> 
>> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
>> SIGILL in current environments. Linux has defined a HWCAP2 bit 
>> PPC_FEATURE2_SCV for SCV support, but does not set it.
>> 
>> When scv instruction support and the scv 0 vector for system calls are 
>> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
>> should not be used without future HWCAP bits indicating support, which is
>> how we will allocate them. (Should unallocated ones generate SIGILL, or
>> return -ENOSYS in r3?)
>> 
>> Calling convention
>> 
>> The proposal is for scv 0 to provide the standard Linux system call ABI 
>> with the following differences from sc convention[1]:
>> 
>> - LR is to be volatile across scv calls. This is necessary because the 
>>   scv instruction clobbers LR. From previous discussion, this should be 
>>   possible to deal with in GCC clobbers and CFI.
>> 
>> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>>   kernel system call exit to avoid restoring the CR register (although 
>>   we probably still would anyway to avoid information leak).
>> 
>> - Error handling: I think the consensus has been to move to using negative
>>   return value in r3 rather than CR0[SO]=1 to indicate error, which matches
>>   most other architectures and is closer to a function call.
>> 
>> The number of scratch registers (r9-r12) at kernel entry seems 
>> sufficient that we don't have any costly spilling, patch is here[2].  
>> 
>> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
>> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html
> 
> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention. Then if the kernel doesn't
> provide it (because the kernel is too old) libc would have to provide
> its own stub that uses the legacy method and matches the calling
> convention of the one the kernel is expected to provide.

I'm not sure if that's necessary. That's done on x86-32 because they
select different sequences to use based on the CPU running and if the host
kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP
bits and select the right sequence in libc as well I suppose.

> Note that any libc that actually makes use of the new functionality is
> not going to be able to make clobbers conditional on support for it;
> branching around different clobbers is going to defeat any gains vs
> always just treating anything clobbered by either method as clobbered.

Well it would have to test HWCAP and patch in or branch to two 
completely different sequences including register save/restores yes.
You could have the same asm and matching clobbers to put the sequence
inline and then you could patch the one sc/scv instruction I suppose.

A bit of logic to select between them doesn't defeat gains though,
it's about a 90 cycle improvement, which is a handful of branch mispredicts,
so it really is an improvement. Eventually userspace will stop 
supporting the old variant too.

> Likewise, it's not useful to have different error return mechanisms
> because the caller just has to branch to support both (or the
> kernel-provided stub just has to emulate one for it; that could work
> if you really want to change the bad existing convention).
> 
> Thoughts?

The existing convention has to change somewhat because of the clobbers,
so I thought we could change the error return at the same time. I'm
open to not changing it and using CR0[SO], but others liked the idea.
Pro: it matches sc and vsyscall. Con: it's different from other common
archs. Performance-wise it would really be a wash -- cost of conditional
branch is not the cmp but the mispredict.

Thanks,
Nick


* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  0:16   ` Nicholas Piggin
@ 2020-04-16  0:48     ` Rich Felker
  2020-04-16  2:24       ` Nicholas Piggin
  2020-04-16  9:58     ` Szabolcs Nagy
  2020-04-16 15:21     ` Jeffrey Walton
  2 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-16  0:48 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool

On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
> > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
> >> I would like to enable Linux support for the powerpc 'scv' instruction,
> >> as a faster system call instruction.
> >> 
> >> This requires two things to be defined: Firstly a way to advertise to 
> >> userspace that kernel supports scv, and a way to allocate and advertise
> >> support for individual scv vectors. Secondly, a calling convention ABI
> >> for this new instruction.
> >> 
> >> Thanks to those who commented last time, since then I have removed my
> >> answered questions and unpopular alternatives but you can find them
> >> here
> >> 
> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
> >> 
> >> Let me try one more with a wider cc list, and then we'll get something
> >> merged. Any questions or counter-opinions are welcome.
> >> 
> >> System Call Vectored (scv) ABI
> >> ==============================
> >> 
> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an
> >> rfscv counter-part. The benefit of these instructions is performance
> >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
> >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
> >> updates. The scv instruction has 128 interrupt entry points (not enough 
> >> to cover the Linux system call space).
> >> 
> >> The proposal is to assign scv numbers very conservatively and allocate 
> >> them as individual HWCAP features as we add support for more. The zero 
> >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
> >> 
> >> Advertisement
> >> 
> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
> >> SIGILL in current environments. Linux has defined a HWCAP2 bit 
> >> PPC_FEATURE2_SCV for SCV support, but does not set it.
> >> 
> >> When scv instruction support and the scv 0 vector for system calls are 
> >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
> >> should not be used without future HWCAP bits indicating support, which is
> >> how we will allocate them. (Should unallocated ones generate SIGILL, or
> >> return -ENOSYS in r3?)
> >> 
> >> Calling convention
> >> 
> >> The proposal is for scv 0 to provide the standard Linux system call ABI 
> >> with the following differences from sc convention[1]:
> >> 
> >> - LR is to be volatile across scv calls. This is necessary because the 
> >>   scv instruction clobbers LR. From previous discussion, this should be 
> >>   possible to deal with in GCC clobbers and CFI.
> >> 
> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
> >>   kernel system call exit to avoid restoring the CR register (although 
> >>   we probably still would anyway to avoid information leak).
> >> 
> >> - Error handling: I think the consensus has been to move to using negative
> >>   return value in r3 rather than CR0[SO]=1 to indicate error, which matches
> >>   most other architectures and is closer to a function call.
> >> 
> >> The number of scratch registers (r9-r12) at kernel entry seems 
> >> sufficient that we don't have any costly spilling, patch is here[2].  
> >> 
> >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
> >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html
> > 
> > My preference would be that it work just like the i386 AT_SYSINFO
> > where you just replace "int $128" with "call *%%gs:16" and the kernel
> > provides a stub in the vdso that performs either scv or the old
> > mechanism with the same calling convention. Then if the kernel doesn't
> > provide it (because the kernel is too old) libc would have to provide
> > its own stub that uses the legacy method and matches the calling
> > convention of the one the kernel is expected to provide.
> 
> I'm not sure if that's necessary. That's done on x86-32 because they
> select different sequences to use based on the CPU running and if the host
> kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP
> bits and select the right sequence in libc as well I suppose.

It's not just a HWCAP. It's a contract between the kernel and
userspace to support a particular calling convention that's not
exposed except as the public entry point the kernel exports via
AT_SYSINFO.

> > Note that any libc that actually makes use of the new functionality is
> > not going to be able to make clobbers conditional on support for it;
> > branching around different clobbers is going to defeat any gains vs
> > always just treating anything clobbered by either method as clobbered.
> 
> Well it would have to test HWCAP and patch in or branch to two 
> completely different sequences including register save/restores yes.
> You could have the same asm and matching clobbers to put the sequence
> inline and then you could patch the one sc/scv instruction I suppose.
> 
> A bit of logic to select between them doesn't defeat gains though,
> it's about 90 cycle improvement which is a handful of branch mispredicts 
> so it really is an improvement. Eventually userspace will stop 
> supporting the old variant too.

Oh, I didn't mean it would neutralize the benefit of scv. Rather, I
meant it would be worse to do:

	if (hwcap & X) {
		__asm__(... with some clobbers);
	} else {
		__asm__(... with different clobbers);
	}

instead of just

	__asm__("indirect call" ... with common clobbers);

where the indirect call is to an address ideally provided like on
i386, or otherwise initialized to one of two or more code addresses in
libc based on hwcap bits.
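
The second shape could look roughly like the following C sketch. The
names are hypothetical and the two entry functions are stand-ins for the
real sc and scv sequences; the point is only that the hwcap test runs
once at startup and every call site is then a single indirect call.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*syscall_entry_fn)(long nr, long arg);

enum entry_path { PATH_NONE, PATH_SC, PATH_SCV };

/* Records which stand-in ran, for illustration only. */
static enum entry_path last_path = PATH_NONE;

/* Stand-ins for the two real entry sequences. */
static long entry_sc(long nr, long arg)
{
    last_path = PATH_SC;
    return nr + arg;
}

static long entry_scv(long nr, long arg)
{
    last_path = PATH_SCV;
    return nr + arg;
}

/* Set once at startup; defaults to the legacy path. */
static syscall_entry_fn syscall_entry = entry_sc;

/* scv_bit is passed in so the sketch assumes no particular HWCAP value. */
static void init_syscall_entry(unsigned long hwcap2, unsigned long scv_bit)
{
    syscall_entry = (hwcap2 & scv_bit) ? entry_scv : entry_sc;
}

/* Every syscall point: one indirect call, one common set of clobbers. */
static long do_syscall(long nr, long arg)
{
    assert(syscall_entry != NULL);
    return syscall_entry(nr, arg);
}
```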

> > Likewise, it's not useful to have different error return mechanisms
> > because the caller just has to branch to support both (or the
> > kernel-provided stub just has to emulate one for it; that could work
> > if you really want to change the bad existing convention).
> > 
> > Thoughts?
> 
> The existing convention has to change somewhat because of the clobbers,
> so I thought we could change the error return at the same time. I'm
> open to not changing it and using CR0[SO], but others liked the idea.
> Pro: it matches sc and vsyscall. Con: it's different from other common
> archs. Performance-wise it would really be a wash -- cost of conditional
> branch is not the cmp but the mispredict.

If you do the branch on hwcap at each syscall, then you significantly
increase code size of every syscall point, likely turning a bunch of
trivial functions that didn't need stack frames into ones that do. You
also potentially make them need a TOC pointer. Making them all just do
an indirect call unconditionally (with pointer in TLS like i386?) is a
lot more efficient in code size and at least as good for performance.

Rich


* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  0:48     ` Rich Felker
@ 2020-04-16  2:24       ` Nicholas Piggin
  2020-04-16  2:35         ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-16  2:24 UTC (permalink / raw)
  To: Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool

Excerpts from Rich Felker's message of April 16, 2020 10:48 am:
> On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
>> > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
>> >> I would like to enable Linux support for the powerpc 'scv' instruction,
>> >> as a faster system call instruction.
>> >> 
>> >> This requires two things to be defined: Firstly a way to advertise to 
>> >> userspace that kernel supports scv, and a way to allocate and advertise
>> >> support for individual scv vectors. Secondly, a calling convention ABI
>> >> for this new instruction.
>> >> 
>> >> Thanks to those who commented last time, since then I have removed my
>> >> answered questions and unpopular alternatives but you can find them
>> >> here
>> >> 
>> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
>> >> 
>> >> Let me try one more with a wider cc list, and then we'll get something
>> >> merged. Any questions or counter-opinions are welcome.
>> >> 
>> >> System Call Vectored (scv) ABI
>> >> ==============================
>> >> 
>> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an
>> >> rfscv counter-part. The benefit of these instructions is performance
>> >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
>> >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
>> >> updates. The scv instruction has 128 interrupt entry points (not enough 
>> >> to cover the Linux system call space).
>> >> 
>> >> The proposal is to assign scv numbers very conservatively and allocate 
>> >> them as individual HWCAP features as we add support for more. The zero 
>> >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
>> >> 
>> >> Advertisement
>> >> 
>> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
>> >> SIGILL in current environments. Linux has defined a HWCAP2 bit 
>> >> PPC_FEATURE2_SCV for SCV support, but does not set it.
>> >> 
>> >> When scv instruction support and the scv 0 vector for system calls are 
>> >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
>> >> should not be used without future HWCAP bits indicating support, which is
>> >> how we will allocate them. (Should unallocated ones generate SIGILL, or
>> >> return -ENOSYS in r3?)
>> >> 
>> >> Calling convention
>> >> 
>> >> The proposal is for scv 0 to provide the standard Linux system call ABI 
>> >> with the following differences from sc convention[1]:
>> >> 
>> >> - LR is to be volatile across scv calls. This is necessary because the 
>> >>   scv instruction clobbers LR. From previous discussion, this should be 
>> >>   possible to deal with in GCC clobbers and CFI.
>> >> 
>> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>> >>   kernel system call exit to avoid restoring the CR register (although 
>> >>   we probably still would anyway to avoid information leak).
>> >> 
>> >> - Error handling: I think the consensus has been to move to using negative
>> >>   return value in r3 rather than CR0[SO]=1 to indicate error, which matches
>> >>   most other architectures and is closer to a function call.
>> >> 
>> >> The number of scratch registers (r9-r12) at kernel entry seems 
>> >> sufficient that we don't have any costly spilling, patch is here[2].  
>> >> 
>> >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
>> >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html
>> > 
>> > My preference would be that it work just like the i386 AT_SYSINFO
>> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> > provides a stub in the vdso that performs either scv or the old
>> > mechanism with the same calling convention. Then if the kernel doesn't
>> > provide it (because the kernel is too old) libc would have to provide
>> > its own stub that uses the legacy method and matches the calling
>> > convention of the one the kernel is expected to provide.
>> 
>> I'm not sure if that's necessary. That's done on x86-32 because they
>> select different sequences to use based on the CPU running and if the host
>> kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP
>> bits and select the right sequence in libc as well I suppose.
> 
> It's not just a HWCAP. It's a contract between the kernel and
> userspace to support a particular calling convention that's not
> exposed except as the public entry point the kernel exports via
> AT_SYSINFO.

Right.

>> > Note that any libc that actually makes use of the new functionality is
>> > not going to be able to make clobbers conditional on support for it;
>> > branching around different clobbers is going to defeat any gains vs
>> > always just treating anything clobbered by either method as clobbered.
>> 
>> Well it would have to test HWCAP and patch in or branch to two 
>> completely different sequences including register save/restores yes.
>> You could have the same asm and matching clobbers to put the sequence
>> inline and then you could patch the one sc/scv instruction I suppose.
>> 
>> A bit of logic to select between them doesn't defeat gains though,
>> it's about 90 cycle improvement which is a handful of branch mispredicts 
>> so it really is an improvement. Eventually userspace will stop 
>> supporting the old variant too.
> 
> Oh, I didn't mean it would neutralize the benefit of scv. Rather, I
> meant it would be worse to do:
> 
> 	if (hwcap & X) {
> 		__asm__(... with some clobbers);
> 	} else {
> 		__asm__(... with different clobbers);
> 	}
> 
> instead of just
> 
> 	__asm__("indirect call" ... with common clobbers);

Ah okay. Well that's debatable but if you didn't have an indirect call,
rather a runtime-patched sequence, then yes saving the LR clobber or
whatever wouldn't be worth a branch.

> where the indirect call is to an address ideally provided like on
> i386, or otherwise initialized to one of two or more code addresses in
> libc based on hwcap bits.

Right, I'm just skeptical we need the indirect call or need to provide
it in the vdso. The "clever" reason to add it on x86-32 was because of
the bugs and different combinations needed, that doesn't really apply
to scv 0 and was not necessarily a great choice.

> 
>> > Likewise, it's not useful to have different error return mechanisms
>> > because the caller just has to branch to support both (or the
>> > kernel-provided stub just has to emulate one for it; that could work
>> > if you really want to change the bad existing convention).
>> > 
>> > Thoughts?
>> 
>> The existing convention has to change somewhat because of the clobbers,
>> so I thought we could change the error return at the same time. I'm
>> open to not changing it and using CR0[SO], but others liked the idea.
>> Pro: it matches sc and vsyscall. Con: it's different from other common
>> archs. Performance-wise it would really be a wash -- cost of conditional
>> branch is not the cmp but the mispredict.
> 
> If you do the branch on hwcap at each syscall, then you significantly
> increase code size of every syscall point, likely turning a bunch of
> trivial functions that didn't need stack frames into ones that do. You
> also potentially make them need a TOC pointer. Making them all just do
> an indirect call unconditionally (with pointer in TLS like i386?) is a
> lot more efficient in code size and at least as good for performance.

I disagree. Doing the long vdso indirect call *necessarily* requires
touching a new icache line, and even a new TLB entry. Indirect branches
also tend to be more costly and/or less accurate to predict than
direct even without spectre (generally fewer indirect predictor entries
than direct, far branches in particular require a lot of bits for 
target). And with spectre we're flushing the indirect predictors
on context switch or even disabling indirect prediction or flushing
across privilege domains in the same context.

And finally, the HWCAP test can eventually go away in future. A vdso
call can not.

If you really want to select with an indirect branch rather than
direct conditional, you can do that all within the library.

Thanks,
Nick


* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  2:24       ` Nicholas Piggin
@ 2020-04-16  2:35         ` Rich Felker
  2020-04-16  2:53           ` Nicholas Piggin
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-16  2:35 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool

On Thu, Apr 16, 2020 at 12:24:16PM +1000, Nicholas Piggin wrote:
> >> > Likewise, it's not useful to have different error return mechanisms
> >> > because the caller just has to branch to support both (or the
> >> > kernel-provided stub just has to emulate one for it; that could work
> >> > if you really want to change the bad existing convention).
> >> > 
> >> > Thoughts?
> >> 
> >> The existing convention has to change somewhat because of the clobbers,
> >> so I thought we could change the error return at the same time. I'm
> >> open to not changing it and using CR0[SO], but others liked the idea.
> >> Pro: it matches sc and vsyscall. Con: it's different from other common
> >> archs. Performance-wise it would really be a wash -- cost of conditional
> >> branch is not the cmp but the mispredict.
> > 
> > If you do the branch on hwcap at each syscall, then you significantly
> > increase code size of every syscall point, likely turning a bunch of
> > trivial functions that didn't need stack frames into ones that do. You
> > also potentially make them need a TOC pointer. Making them all just do
> > an indirect call unconditionally (with pointer in TLS like i386?) is a
> > lot more efficient in code size and at least as good for performance.
> 
> I disagree. Doing the long vdso indirect call *necessarily* requires
> touching a new icache line, and even a new TLB entry. Indirect branches

The increase in number of icache lines from the branch at every
syscall point is far greater than the use of a single extra icache
line shared by all syscalls. Not to mention the dcache line to access
__hwcap or whatever, and the icache lines to set up TOC-relative
access to it. (Of course you could put a copy of its value in TLS at a
fixed offset, which would somewhat mitigate both.)

> And finally, the HWCAP test can eventually go away in future. A vdso
> call can not.

We support nearly arbitrarily old kernels (with limited functionality)
and hardware (with full functionality) and don't intend for that to
change, ever. But indeed glibc might want to eventually drop the
check.

> If you really want to select with an indirect branch rather than
> direct conditional, you can do that all within the library.

OK. It's a little bit more work if that's not the interface the kernel
will give us, but it's no big deal.

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  2:35         ` Rich Felker
@ 2020-04-16  2:53           ` Nicholas Piggin
  2020-04-16  3:03             ` Rich Felker
  2020-04-16 20:18             ` Florian Weimer
  0 siblings, 2 replies; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-16  2:53 UTC (permalink / raw)
  To: Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool

Excerpts from Rich Felker's message of April 16, 2020 12:35 pm:
> On Thu, Apr 16, 2020 at 12:24:16PM +1000, Nicholas Piggin wrote:
>> >> > Likewise, it's not useful to have different error return mechanisms
>> >> > because the caller just has to branch to support both (or the
>> >> > kernel-provided stub just has to emulate one for it; that could work
>> >> > if you really want to change the bad existing convention).
>> >> > 
>> >> > Thoughts?
>> >> 
>> >> The existing convention has to change somewhat because of the clobbers,
>> >> so I thought we could change the error return at the same time. I'm
>> >> open to not changing it and using CR0[SO], but others liked the idea.
>> >> Pro: it matches sc and vsyscall. Con: it's different from other common
>> >> archs. Performance-wise it would really be a wash -- cost of conditional
>> >> branch is not the cmp but the mispredict.
>> > 
>> > If you do the branch on hwcap at each syscall, then you significantly
>> > increase code size of every syscall point, likely turning a bunch of
>> > trivial functions that didn't need stack frames into ones that do. You
>> > also potentially make them need a TOC pointer. Making them all just do
>> > an indirect call unconditionally (with pointer in TLS like i386?) is a
>> > lot more efficient in code size and at least as good for performance.
>> 
>> I disagree. Doing the long vdso indirect call *necessarily* requires
>> touching a new icache line, and even a new TLB entry. Indirect branches
> 
> The increase in number of icache lines from the branch at every
> syscall point is far greater than the use of a single extra icache
> line shared by all syscalls.

That's true, I was thinking of a single function that does the test and 
calls syscalls, which might be the fair comparison.

> Not to mention the dcache line to access
> __hwcap or whatever, and the icache lines to set up TOC-relative
> access to it. (Of course you could put a copy of its value in TLS at a
> fixed offset, which would somewhat mitigate both.)
> 
>> And finally, the HWCAP test can eventually go away in future. A vdso
>> call can not.
> 
> We support nearly arbitrarily old kernels (with limited functionality)
> and hardware (with full functionality) and don't intend for that to
> change, ever. But indeed glibc might want to eventually drop the
> check.

Ah, cool. Any build-time flexibility there?

We may or may not be getting a new ABI that will use instructions not 
supported by old processors.

https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html

The current ABI continues to work of course and will be the default for
some time, but building for the new one would give some opportunity to
drop such support for old procs, at least for glibc.

> 
>> If you really want to select with an indirect branch rather than
>> direct conditional, you can do that all within the library.
> 
> OK. It's a little bit more work if that's not the interface the kernel
> will give us, but it's no big deal.

Okay.

Thanks,
Nick

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  2:53           ` Nicholas Piggin
@ 2020-04-16  3:03             ` Rich Felker
  2020-04-16  3:41               ` Nicholas Piggin
  2020-04-16 20:18             ` Florian Weimer
  1 sibling, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-16  3:03 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool

On Thu, Apr 16, 2020 at 12:53:31PM +1000, Nicholas Piggin wrote:
> > Not to mention the dcache line to access
> > __hwcap or whatever, and the icache lines to set up TOC-relative
> > access to it. (Of course you could put a copy of its value in TLS at a
> > fixed offset, which would somewhat mitigate both.)
> > 
> >> And finally, the HWCAP test can eventually go away in future. A vdso
> >> call can not.
> > 
> > We support nearly arbitrarily old kernels (with limited functionality)
> > and hardware (with full functionality) and don't intend for that to
> > change, ever. But indeed glibc might want to eventually drop the
> > check.
> 
> Ah, cool. Any build-time flexibility there?
> 
> We may or may not be getting a new ABI that will use instructions not 
> supported by old processors.
> 
> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html
> 
> Current ABI continues to work of course and be the default for some 
> time, but building for new one would give some opportunity to drop
> such support for old procs, at least for glibc.

What does "new ABI" entail to you? In the terminology I use with musl,
"new ABI" and "new ISA level" are different things. You can compile
(explicit -march or compiler default) binaries that won't run on older
cpus due to use of new insns etc., but we consider it the same ABI if
you can link code for an older/baseline ISA level with the
newer-ISA-level object files, i.e. if the interface surface for
linkage remains compatible. We also try to avoid gratuitous
proliferation of different ABIs unless there's a strong underlying
need (like addition of softfloat ABIs for archs that usually have FPU,
or vice versa).

In principle the same could be done for kernels except it's a bigger
silent gotcha (possible ENOSYS in places where it shouldn't be able to
happen rather than a trapping SIGILL or similar) and there's rarely
any serious performance or size benefit to dropping support for older
kernels.

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  3:03             ` Rich Felker
@ 2020-04-16  3:41               ` Nicholas Piggin
  0 siblings, 0 replies; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-16  3:41 UTC (permalink / raw)
  To: Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool

Excerpts from Rich Felker's message of April 16, 2020 1:03 pm:
> On Thu, Apr 16, 2020 at 12:53:31PM +1000, Nicholas Piggin wrote:
>> > Not to mention the dcache line to access
>> > __hwcap or whatever, and the icache lines to set up TOC-relative
>> > access to it. (Of course you could put a copy of its value in TLS at a
>> > fixed offset, which would somewhat mitigate both.)
>> > 
>> >> And finally, the HWCAP test can eventually go away in future. A vdso
>> >> call can not.
>> > 
>> > We support nearly arbitrarily old kernels (with limited functionality)
>> > and hardware (with full functionality) and don't intend for that to
>> > change, ever. But indeed glibc might want to eventually drop the
>> > check.
>> 
>> Ah, cool. Any build-time flexibility there?
>> 
>> We may or may not be getting a new ABI that will use instructions not 
>> supported by old processors.
>> 
>> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html
>> 
>> Current ABI continues to work of course and be the default for some 
>> time, but building for new one would give some opportunity to drop
>> such support for old procs, at least for glibc.
> 
> What does "new ABI" entail to you? In the terminology I use with musl,
> "new ABI" and "new ISA level" are different things. You can compile
> (explicit -march or compiler default) binaries that won't run on older
> cpus due to use of new insns etc., but we consider it the same ABI if
> you can link code for an older/baseline ISA level with the
> newer-ISA-level object files, i.e. if the interface surface for
> linkage remains compatible. We also try to avoid gratuitous
> proliferation of different ABIs unless there's a strong underlying
> need (like addition of softfloat ABIs for archs that usually have FPU,
> or vice versa).

Yeah it will be a new ABI type that also requires a new ISA level.
As far as I know (and I'm not on the toolchain side) there will be
some call compatibility between the two, so it may be fine to
continue with the existing ABI for libc. But it's just something that
comes to mind as a build-time cutover where we might be able to
assume particular features.

> In principle the same could be done for kernels except it's a bigger
> silent gotcha (possible ENOSYS in places where it shouldn't be able to
> happen rather than a trapping SIGILL or similar) and there's rarely
> any serious performance or size benefit to dropping support for older
> kernels.

Right, I don't think it'd be a huge problem whatever way we go,
compared with the cost of the system call.

Thanks,
Nick

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-15 22:55 ` Rich Felker
  2020-04-16  0:16   ` Nicholas Piggin
@ 2020-04-16  4:48   ` Florian Weimer
  2020-04-16 15:35     ` Rich Felker
  2020-04-16 14:16   ` Adhemerval Zanella
  2 siblings, 1 reply; 62+ messages in thread
From: Florian Weimer @ 2020-04-16  4:48 UTC (permalink / raw)
  To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev

* Rich Felker:

> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention.

The i386 mechanism has received some criticism because it provides an
effective means to redirect execution flow to anyone who can write to
the TCB.  I am not sure if it makes sense to copy it.

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  0:16   ` Nicholas Piggin
  2020-04-16  0:48     ` Rich Felker
@ 2020-04-16  9:58     ` Szabolcs Nagy
  2020-04-20  0:27       ` Nicholas Piggin
  2020-04-16 15:21     ` Jeffrey Walton
  2 siblings, 1 reply; 62+ messages in thread
From: Szabolcs Nagy @ 2020-04-16  9:58 UTC (permalink / raw)
  To: Nicholas Piggin via Libc-alpha; +Cc: Rich Felker, libc-dev, linuxppc-dev, musl

* Nicholas Piggin via Libc-alpha <libc-alpha@sourceware.org> [2020-04-16 10:16:54 +1000]:
> Well it would have to test HWCAP and patch in or branch to two 
> completely different sequences including register save/restores yes.
> You could have the same asm and matching clobbers to put the sequence
> inline and then you could patch the one sc/scv instruction I suppose.

how would that 'patch' work?

there are many reasons why you don't
want libc to write its .text

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-15 22:55 ` Rich Felker
  2020-04-16  0:16   ` Nicholas Piggin
  2020-04-16  4:48   ` Florian Weimer
@ 2020-04-16 14:16   ` Adhemerval Zanella
  2020-04-16 15:37     ` Rich Felker
  2 siblings, 1 reply; 62+ messages in thread
From: Adhemerval Zanella @ 2020-04-16 14:16 UTC (permalink / raw)
  To: Rich Felker, Nicholas Piggin; +Cc: libc-alpha, musl, linuxppc-dev, libc-dev



On 15/04/2020 19:55, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
>> I would like to enable Linux support for the powerpc 'scv' instruction,
>> as a faster system call instruction.
>>
>> This requires two things to be defined: Firstly a way to advertise to 
>> userspace that kernel supports scv, and a way to allocate and advertise
>> support for individual scv vectors. Secondly, a calling convention ABI
>> for this new instruction.
>>
>> Thanks to those who commented last time, since then I have removed my
>> answered questions and unpopular alternatives but you can find them
>> here
>>
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
>>
>> Let me try one more with a wider cc list, and then we'll get something
>> merged. Any questions or counter-opinions are welcome.
>>
>> System Call Vectored (scv) ABI
>> ==============================
>>
>> The scv instruction is introduced with POWER9 / ISA3, it comes with an
>> rfscv counter-part. The benefit of these instructions is performance
>> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
>> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
>> updates. The scv instruction has 128 interrupt entry points (not enough 
>> to cover the Linux system call space).
>>
>> The proposal is to assign scv numbers very conservatively and allocate 
>> them as individual HWCAP features as we add support for more. The zero 
>> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
>>
>> Advertisement
>>
>> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
>> SIGILL in current environments. Linux has defined a HWCAP2 bit 
>> PPC_FEATURE2_SCV for SCV support, but does not set it.
>>
>> When scv instruction support and the scv 0 vector for system calls are 
>> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
>> should not be used without future HWCAP bits indicating support, which is
>> how we will allocate them. (Should unallocated ones generate SIGILL, or
>> return -ENOSYS in r3?)
>>
>> Calling convention
>>
>> The proposal is for scv 0 to provide the standard Linux system call ABI 
>> with the following differences from sc convention[1]:
>>
>> - LR is to be volatile across scv calls. This is necessary because the 
>>   scv instruction clobbers LR. From previous discussion, this should be 
>>   possible to deal with in GCC clobbers and CFI.
>>
>> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>>   kernel system call exit to avoid restoring the CR register (although 
>>   we probably still would anyway to avoid information leak).
>>
>> - Error handling: I think the consensus has been to move to using negative
>>   return value in r3 rather than CR0[SO]=1 to indicate error, which matches
>>   most other architectures and is closer to a function call.
>>
>> The number of scratch registers (r9-r12) at kernel entry seems 
>> sufficient that we don't have any costly spilling, patch is here[2].  
>>
>> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
>> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html
> 
> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention. Then if the kernel doesn't
> provide it (because the kernel is too old) libc would have to provide
> its own stub that uses the legacy method and matches the calling
> convention of the one the kernel is expected to provide.

What about pthread cancellation and the requirement of checking the
cancellable syscall anchors in asynchronous cancellation? My plan is
still to use the musl strategy on glibc (BZ#12683), and for i686 it
requires always using the old "int $128" for programs that use
cancellation (static case) or just threads (dynamic mode, which should
be more common on glibc).

Using the i686 strategy of a vDSO bridge symbol would require always
falling back to 'sc' to keep the same cancellation strategy (thus
defeating this optimization in such cases).

> Note that any libc that actually makes use of the new functionality is
> not going to be able to make clobbers conditional on support for it;
> branching around different clobbers is going to defeat any gains vs
> always just treating anything clobbered by either method as clobbered.
> Likewise, it's not useful to have different error return mechanisms
> because the caller just has to branch to support both (or the
> kernel-provided stub just has to emulate one for it; that could work
> if you really want to change the bad existing convention).
> 
> Thoughts?
> 
> Rich
> 

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  0:16   ` Nicholas Piggin
  2020-04-16  0:48     ` Rich Felker
  2020-04-16  9:58     ` Szabolcs Nagy
@ 2020-04-16 15:21     ` Jeffrey Walton
  2020-04-16 15:40       ` Rich Felker
  2 siblings, 1 reply; 62+ messages in thread
From: Jeffrey Walton @ 2020-04-16 15:21 UTC (permalink / raw)
  To: musl; +Cc: libc-alpha, libc-dev, linuxppc-dev

On Wed, Apr 15, 2020 at 8:17 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
> > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
> >> I would like to enable Linux support for the powerpc 'scv' instruction,
> >> as a faster system call instruction.
> >>
> >> This requires two things to be defined: Firstly a way to advertise to
> >> userspace that kernel supports scv, and a way to allocate and advertise
> >> support for individual scv vectors. Secondly, a calling convention ABI
> >> for this new instruction.
> >> ...
> > Note that any libc that actually makes use of the new functionality is
> > not going to be able to make clobbers conditional on support for it;
> > branching around different clobbers is going to defeat any gains vs
> > always just treating anything clobbered by either method as clobbered.
>
> Well it would have to test HWCAP and patch in or branch to two
> completely different sequences including register save/restores yes.
> You could have the same asm and matching clobbers to put the sequence
> inline and then you could patch the one sc/scv instruction I suppose.

Could GCC function multiversioning work here?
https://gcc.gnu.org/wiki/FunctionMultiVersioning

It seems like selecting a runtime version of a function is the sort of
thing you are trying to do.

Jeff

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  4:48   ` Florian Weimer
@ 2020-04-16 15:35     ` Rich Felker
  2020-04-16 16:42       ` Florian Weimer
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-16 15:35 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev

On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> > My preference would be that it work just like the i386 AT_SYSINFO
> > where you just replace "int $128" with "call *%%gs:16" and the kernel
> > provides a stub in the vdso that performs either scv or the old
> > mechanism with the same calling convention.
> 
> The i386 mechanism has received some criticism because it provides an
> effective means to redirect execution flow to anyone who can write to
> the TCB.  I am not sure if it makes sense to copy it.

Indeed that's a good point. Do you have ideas for making it equally
efficient without use of a function pointer in the TCB?

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 14:16   ` Adhemerval Zanella
@ 2020-04-16 15:37     ` Rich Felker
  2020-04-16 17:50       ` Adhemerval Zanella
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-16 15:37 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev

On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
> > My preference would be that it work just like the i386 AT_SYSINFO
> > where you just replace "int $128" with "call *%%gs:16" and the kernel
> > provides a stub in the vdso that performs either scv or the old
> > mechanism with the same calling convention. Then if the kernel doesn't
> > provide it (because the kernel is too old) libc would have to provide
> > its own stub that uses the legacy method and matches the calling
> > convention of the one the kernel is expected to provide.
> 
> What about pthread cancellation and the requirement of checking the
> cancellable syscall anchors in asynchronous cancellation? My plan is
> still to use musl strategy on glibc (BZ#12683) and for i686 it 
> requires to always use old int$128 for program that uses cancellation
> (static case) or just threads (dynamic mode, which should be more
> common on glibc).
> 
> Using the i686 strategy of a vDSO bridge symbol would require to always
> fallback to 'sc' to still use the same cancellation strategy (and
> thus defeating this optimization in such cases).

Yes, I assumed it would be the same, ignoring the new syscall
mechanism for cancellable syscalls. While there are some exceptions,
cancellable syscalls are generally not hot paths but things that are
expected to block and to have significant amounts of work to do in
kernelspace, so saving a few tens of cycles is rather pointless.

It's possible to do a branch/multiple versions of the syscall asm for
cancellation, but that would require extending the cancellation handler
to support checking against multiple independent address ranges or
using some alternate markup of them.
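
To make the "multiple independent address ranges" option concrete, here
is a minimal sketch. The struct and function names are hypothetical and
not taken from musl or glibc; the assumption is that each syscall
variant (one using 'sc', one using 'scv 0') exports a start and end
label, and the cancellation signal handler compares the interrupted PC
against a small table instead of a single pair of labels:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one entry per cancellable syscall sequence.
 * start is the first address of the window in which cancellation
 * may be acted upon, end is the address just past that window. */
struct cancel_range {
    uintptr_t start;
    uintptr_t end;
};

/* Return 1 if pc lies in any half-open range [start, end), else 0. */
static int pc_in_cancel_region(uintptr_t pc,
                               const struct cancel_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (pc >= r[i].start && pc < r[i].end)
            return 1;
    return 0;
}
```

With only two variants the loop is a couple of compares, so the cost
sits in the extra markup rather than in the handler itself.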

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 15:21     ` Jeffrey Walton
@ 2020-04-16 15:40       ` Rich Felker
  0 siblings, 0 replies; 62+ messages in thread
From: Rich Felker @ 2020-04-16 15:40 UTC (permalink / raw)
  To: Jeffrey Walton; +Cc: musl, libc-alpha, libc-dev, linuxppc-dev

On Thu, Apr 16, 2020 at 11:21:56AM -0400, Jeffrey Walton wrote:
> On Wed, Apr 15, 2020 at 8:17 PM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
> > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
> > >> I would like to enable Linux support for the powerpc 'scv' instruction,
> > >> as a faster system call instruction.
> > >>
> > >> This requires two things to be defined: Firstly a way to advertise to
> > >> userspace that kernel supports scv, and a way to allocate and advertise
> > >> support for individual scv vectors. Secondly, a calling convention ABI
> > >> for this new instruction.
> > >> ...
> > > Note that any libc that actually makes use of the new functionality is
> > > not going to be able to make clobbers conditional on support for it;
> > > branching around different clobbers is going to defeat any gains vs
> > > always just treating anything clobbered by either method as clobbered.
> >
> > Well it would have to test HWCAP and patch in or branch to two
> > completely different sequences including register save/restores yes.
> > You could have the same asm and matching clobbers to put the sequence
> > inline and then you could patch the one sc/scv instruction I suppose.
> 
> Could GCC function multiversioning work here?
> https://gcc.gnu.org/wiki/FunctionMultiVersioning
> 
> It seems like selecting a runtime version of a function is the sort of
> thing you are trying to do.

On glibc it potentially could. This is ifunc-based functionality
though and musl explicitly does not (and will not) support ifunc
because of lots of fundamental problems it entails. But even on glibc
the underlying mechanisms for ifunc are just the same as a normal
indirect call and there's no real reason to prefer implementing it
with ifunc/multiversioning vs directly.
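
As a rough sketch of what "directly" could look like: the resolver is
an ordinary function run once at startup, and every call goes through a
plain function pointer. All names here are hypothetical, the two stubs
only return a tag so the selection is observable, and the feature bit
is a placeholder rather than the real HWCAP2 value:

```c
#include <stddef.h>

typedef long (*syscall_fn)(long nr, long a0);

/* Stand-ins for the two entry sequences; real code would contain the
 * 'sc' or 'scv 0' instruction in inline asm. */
static long stub_sc(long nr, long a0)  { (void)nr; (void)a0; return 1; }
static long stub_scv(long nr, long a0) { (void)nr; (void)a0; return 2; }

#define HYPOTHETICAL_SCV_BIT 0x1UL /* placeholder, not the real bit */

/* The job an ifunc resolver would do: pick one implementation. */
static syscall_fn resolve_syscall(unsigned long hwcap2)
{
    return (hwcap2 & HYPOTHETICAL_SCV_BIT) ? stub_scv : stub_sc;
}

static syscall_fn syscall_entry = stub_sc; /* safe default */

/* Assumed to run once at startup, before any threads exist. */
static void init_syscall_entry(unsigned long hwcap2)
{
    syscall_entry = resolve_syscall(hwcap2);
}

static long do_syscall(long nr, long a0)
{
    return syscall_entry(nr, a0); /* plain indirect call */
}
```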

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 15:35     ` Rich Felker
@ 2020-04-16 16:42       ` Florian Weimer
  2020-04-16 16:52         ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Florian Weimer @ 2020-04-16 16:42 UTC (permalink / raw)
  To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev

* Rich Felker:

> On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote:
>> * Rich Felker:
>> 
>> > My preference would be that it work just like the i386 AT_SYSINFO
>> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> > provides a stub in the vdso that performs either scv or the old
>> > mechanism with the same calling convention.
>> 
>> The i386 mechanism has received some criticism because it provides an
>> effective means to redirect execution flow to anyone who can write to
>> the TCB.  I am not sure if it makes sense to copy it.
>
> Indeed that's a good point. Do you have ideas for making it equally
> efficient without use of a function pointer in the TCB?

We could add a shared non-writable mapping at a 64K offset from the
thread pointer and store the function pointer or the code there.  Then
it would be safe.

However, since this is apparently tied to POWER9 and we already have a
POWER9 multilib, and assuming that we are going to backport the kernel
change, I would tweak the selection criterion for that multilib to
include the new HWCAP2 flag.  If a user runs this glibc on a kernel
which does not have support, they will get the baseline (POWER8)
multilib, which still works.  This way, outside the dynamic loader, no
run-time dispatch is needed at all.  I guess this is not at all the
answer you were looking for. 8-)

If a single binary is needed, I would perhaps follow what Arm did for
-moutline-atomics: lay out the code so that it's easy to execute for
the non-POWER9 case, assuming that POWER9 machines will be better at
predicting things than their predecessors.

Or you could also put the function pointer into a RELRO segment.  Then
there's overlap with the __libc_single_threaded discussion, where
people objected to this kind of optimization (although I did not
propose to change the TCB ABI, that would be required for
__libc_single_threaded because it's an external interface).

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 16:42       ` Florian Weimer
@ 2020-04-16 16:52         ` Rich Felker
  2020-04-16 18:12           ` Florian Weimer
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-16 16:52 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev

On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote:
> >> * Rich Felker:
> >> 
> >> > My preference would be that it work just like the i386 AT_SYSINFO
> >> > where you just replace "int $128" with "call *%%gs:16" and the kernel
> >> > provides a stub in the vdso that performs either scv or the old
> >> > mechanism with the same calling convention.
> >> 
> >> The i386 mechanism has received some criticism because it provides an
> >> effective means to redirect execution flow to anyone who can write to
> >> the TCB.  I am not sure if it makes sense to copy it.
> >
> > Indeed that's a good point. Do you have ideas for making it equally
> > efficient without use of a function pointer in the TCB?
> 
> We could add a shared non-writable mapping at a 64K offset from the
> thread pointer and store the function pointer or the code there.  Then
> it would be safe.
> 
> However, since this is apparently tied to POWER9 and we already have a
> POWER9 multilib, and assuming that we are going to backport the kernel
> change, I would tweak the selection criterion for that multilib to
> include the new HWCAP2 flag.  If a user runs this glibc on a kernel
> which does not have support, they will get the baseline (POWER8)
> multilib, which still works.  This way, outside the dynamic loader, no
> run-time dispatch is needed at all.  I guess this is not at all the
> answer you were looking for. 8-)

How does this work with -static? :-)

> If a single binary is needed, I would perhaps follow what Arm did for
> -moutline-atomics: lay out the code so that it's easy to execute for
> the non-POWER9 case, assuming that POWER9 machines will be better at
> predicting things than their predecessors.
> 
> Or you could also put the function pointer into a RELRO segment.  Then
> there's overlap with the __libc_single_threaded discussion, where
> people objected to this kind of optimization (although I did not
> propose to change the TCB ABI, that would be required for
> __libc_single_threaded because it's an external interface).

Of course you can use a normal global, but now every call point needs
to set up a TOC pointer (= two entry points and more icache lines for
otherwise trivial functions).

I think my choice would be just making the inline syscall a single
call insn to an asm source file that out-of-lines the loading of the
TOC pointer and either calls through it or branches based on hwcap, so
that it's not repeated all over the place.

Alternatively, it would perhaps work to just put hwcap in the TCB and
branch on it rather than making an indirect call to a function pointer
in the TCB, so that the worst you could do by clobbering it is execute
the wrong syscall insn and thereby get SIGILL.
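
A small sketch of why the flag is the safer thing to keep in the TCB.
The struct layout and names are hypothetical; the PPC_FEATURE2_SCV
value is the one I believe the Linux powerpc uapi headers use, so treat
it as an assumption. Whatever gets written into the field, the selector
can only ever yield one of two fixed entry sequences:

```c
/* Hypothetical per-thread control block holding a copy of HWCAP2. */
struct tcb_sketch {
    unsigned long hwcap2_copy;
};

#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000UL /* assumed value, see above */
#endif

enum entry_kind { ENTRY_SC = 0, ENTRY_SCV = 1 };

/* Only two outcomes are reachable, whatever value the field holds.
 * Real code would branch to an 'sc' or 'scv 0' sequence here. */
static enum entry_kind pick_entry(const struct tcb_sketch *tcb)
{
    return (tcb->hwcap2_copy & PPC_FEATURE2_SCV) ? ENTRY_SCV : ENTRY_SC;
}
```

Compare that with a function pointer in the same slot, where a single
overwritten word sends every subsequent syscall to an address of the
attacker's choosing.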

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 15:37     ` Rich Felker
@ 2020-04-16 17:50       ` Adhemerval Zanella
  2020-04-16 17:59         ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Adhemerval Zanella @ 2020-04-16 17:50 UTC (permalink / raw)
  To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev



On 16/04/2020 12:37, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>>> My preference would be that it work just like the i386 AT_SYSINFO
>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
>>> provides a stub in the vdso that performs either scv or the old
>>> mechanism with the same calling convention. Then if the kernel doesn't
>>> provide it (because the kernel is too old) libc would have to provide
>>> its own stub that uses the legacy method and matches the calling
>>> convention of the one the kernel is expected to provide.
>>
>> What about pthread cancellation and the requirement of checking the
>> cancellable syscall anchors in asynchronous cancellation? My plan is
>> still to use musl strategy on glibc (BZ#12683) and for i686 it 
>> requires to always use old int$128 for program that uses cancellation
>> (static case) or just threads (dynamic mode, which should be more
>> common on glibc).
>>
>> Using the i686 strategy of a vDSO bridge symbol would require to always
>> fallback to 'sc' to still use the same cancellation strategy (and
>> thus defeating this optimization in such cases).
> 
> Yes, I assumed it would be the same, ignoring the new syscall
> mechanism for cancellable syscalls. While there are some exceptions,
> cancellable syscalls are generally not hot paths but things that are
> expected to block and to have significant amounts of work to do in
> kernelspace, so saving a few tens of cycles is rather pointless.
> 
> It's possible to do a branch/multiple versions of the syscall asm for
> cancellation but would require extending the cancellation handler to
> support checking against multiple independent address ranges or using
> some alternate markup of them.

The main issue is that, at least for glibc, dynamic linking is far more
common than static linking, and once the program becomes multithreaded
the fallback will always be used.

And besides the cancellation performance issue, a new vDSO bridge
mechanism will still require setting up an extra bridge for older
kernels.  In the scheme you suggested:

  __asm__("indirect call" ... with common clobbers);

The indirect call will target either the vDSO bridge or a libc-provided
stub that falls back to 'sc' for !PPC_FEATURE2_SCV. I am not sure this
is really a gain over:

   if (hwcap & PPC_FEATURE2_SCV) {
     __asm__(... with some clobbers);
   } else {
     __asm__(... with different clobbers);
   }

Especially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a
TCB member (as we do on glibc) and if we could make the asm clever
enough to not require different clobbers (although I am not sure
that would be possible).
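
For illustration, the branch-on-hwcap variant could be sketched in
portable C roughly as follows. This is only a sketch under stated
assumptions: `syscall_insn`, `pick_syscall_insn` and the
`cached_hwcap2` parameter (standing in for the TCB member) are
hypothetical names, and the PPC_FEATURE2_SCV value is assumed from the
kernel's powerpc uapi header rather than taken from this thread.

```c
/* Sketch only: choose the syscall instruction from a cached HWCAP2 word.
 * All names here are hypothetical; the bit value is an assumption based
 * on the kernel's powerpc uapi cputable.h. */
#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000UL
#endif

enum syscall_insn { USE_SC, USE_SCV };

/* cached_hwcap2 stands in for a copy of AT_HWCAP2 kept in the TCB. */
static enum syscall_insn pick_syscall_insn(unsigned long cached_hwcap2)
{
	/* If the cached word is clobbered, the worst outcome is picking
	 * the wrong instruction (SIGILL), not an arbitrary jump. */
	return (cached_hwcap2 & PPC_FEATURE2_SCV) ? USE_SCV : USE_SC;
}
```

In a real libc each arm would wrap the corresponding inline asm with
its own clobber list; the sketch shows only the selection logic.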

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 17:50       ` Adhemerval Zanella
@ 2020-04-16 17:59         ` Rich Felker
  2020-04-16 18:18           ` Adhemerval Zanella
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-16 17:59 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev

On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 16/04/2020 12:37, Rich Felker wrote:
> > On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
> >>> My preference would be that it work just like the i386 AT_SYSINFO
> >>> where you just replace "int $128" with "call *%%gs:16" and the kernel
> >>> provides a stub in the vdso that performs either scv or the old
> >>> mechanism with the same calling convention. Then if the kernel doesn't
> >>> provide it (because the kernel is too old) libc would have to provide
> >>> its own stub that uses the legacy method and matches the calling
> >>> convention of the one the kernel is expected to provide.
> >>
> >> What about pthread cancellation and the requirement of checking the
> >> cancellable syscall anchors in asynchronous cancellation? My plan is
> >> still to use musl strategy on glibc (BZ#12683) and for i686 it 
> >> requires to always use old int$128 for program that uses cancellation
> >> (static case) or just threads (dynamic mode, which should be more
> >> common on glibc).
> >>
> >> Using the i686 strategy of a vDSO bridge symbol would require to always
> >> fallback to 'sc' to still use the same cancellation strategy (and
> >> thus defeating this optimization in such cases).
> > 
> > Yes, I assumed it would be the same, ignoring the new syscall
> > mechanism for cancellable syscalls. While there are some exceptions,
> > cancellable syscalls are generally not hot paths but things that are
> > expected to block and to have significant amounts of work to do in
> > kernelspace, so saving a few tens of cycles is rather pointless.
> > 
> > It's possible to do a branch/multiple versions of the syscall asm for
> > cancellation but would require extending the cancellation handler to
> > support checking against multiple independent address ranges or using
> > some alternate markup of them.
> 
> The main issue is at least for glibc dynamic linking is way more common
> than static linking and once the program become multithread the fallback
> will be always used.

I'm not relying on static linking optimizing out the cancellable
version. I'm talking about how cancellable syscalls are pretty much
all "heavy" operations to begin with where a few tens of cycles are in
the realm of "measurement noise" relative to the dominating time
costs.

> And besides the cancellation performance issue, a new bridge vDSO mechanism
> will still require to setup some extra bridge for the case of the older
> kernel.  In the scheme you suggested:
> 
>   __asm__("indirect call" ... with common clobbers);
> 
> The indirect call will be either the vDSO bridge or an libc provided that
> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
> against:
> 
>    if (hwcap & PPC_FEATURE2_SCV) {
>      __asm__(... with some clobbers);
>    } else {
>      __asm__(... with different clobbers);
>    }

If the indirect call can be made roughly as efficiently as the sc
sequence now (which already has some cost due to handling the nasty
error return convention, making the indirect call likely just as small
or smaller), it's O(1) additional code size (and thus icache usage)
rather than O(n) where n is number of syscall points.

Of course it would work just as well (for avoiding O(n) growth) to
have a direct call to out-of-line branch like you suggested.
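
As a rough sketch of the out-of-line variant (not actual musl or glibc
code): every call site emits a single direct call, and the hwcap test
exists in exactly one helper. All names below are hypothetical, the
PPC_FEATURE2_SCV value is assumed, and the two backends are plain C
stubs standing in for the real sc and scv asm sequences.

```c
/* Sketch only: the branch on hwcap lives in one out-of-line helper, so
 * adding scv support costs O(1) code rather than O(n) in call sites. */
#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000UL /* assumed value */
#endif

static unsigned long cached_hwcap2; /* would be filled in once at startup */
static int last_path;               /* 1 = sc stub ran, 2 = scv stub ran */

/* Stubs standing in for the two asm sequences; they just echo nr. */
static long backend_sc(long nr, long a)  { (void)a; last_path = 1; return nr; }
static long backend_scv(long nr, long a) { (void)a; last_path = 2; return nr; }

/* The only place that knows about both mechanisms. */
static long syscall_out_of_line(long nr, long a)
{
	if (cached_hwcap2 & PPC_FEATURE2_SCV)
		return backend_scv(nr, a);
	return backend_sc(nr, a);
}

/* Two example call sites: neither repeats the hwcap branch. */
static long site_one(long a) { return syscall_out_of_line(1, a); }
static long site_two(long a) { return syscall_out_of_line(2, a); }
```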

> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a 
> TCB member (as we do on glibc) and if we could make the asm clever
> enough to not require different clobbers (although not sure if
> it would be possible).

The easy way not to require different clobbers is just using the union
of the clobbers, no? Does the proposed new method clobber any
call-saved registers that would make it painful (requiring new call
frames to save them in)?

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 16:52         ` Rich Felker
@ 2020-04-16 18:12           ` Florian Weimer
  2020-04-16 23:02             ` Segher Boessenkool
  0 siblings, 1 reply; 62+ messages in thread
From: Florian Weimer @ 2020-04-16 18:12 UTC (permalink / raw)
  To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev

* Rich Felker:

> On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote:
>> * Rich Felker:
>> 
>> > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote:
>> >> * Rich Felker:
>> >> 
>> >> > My preference would be that it work just like the i386 AT_SYSINFO
>> >> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> >> > provides a stub in the vdso that performs either scv or the old
>> >> > mechanism with the same calling convention.
>> >> 
>> >> The i386 mechanism has received some criticism because it provides an
>> >> effective means to redirect execution flow to anyone who can write to
>> >> the TCB.  I am not sure if it makes sense to copy it.
>> >
>> > Indeed that's a good point. Do you have ideas for making it equally
>> > efficient without use of a function pointer in the TCB?
>> 
>> We could add a shared non-writable mapping at a 64K offset from the
>> thread pointer and store the function pointer or the code there.  Then
>> it would be safe.
>> 
>> However, since this is apparently tied to POWER9 and we already have a
>> POWER9 multilib, and assuming that we are going to backport the kernel
>> change, I would tweak the selection criterion for that multilib to
>> include the new HWCAP2 flag.  If a user runs this glibc on a kernel
>> which does not have support, they will get set baseline (POWER8)
>> multilib, which still works.  This way, outside the dynamic loader, no
>> run-time dispatch is needed at all.  I guess this is not at all the
>> answer you were looking for. 8-)
>
> How does this work with -static? :-)

-static is not supported. 8-) (If you use the unsupported static
libraries, you get POWER8 code.)

(Just to be clear, in case someone doesn't get the joke: This is about
a potential approach for a heavily constrained, vertically integrated
environment.  It does not reflect general glibc recommendations.)

>> If a single binary is needed, I would perhaps follow what Arm did for
>> -moutline-atomics: lay out the code so that its easy to execute for
>> the non-POWER9 case, assuming that POWER9 machines will be better at
>> predicting things than their predecessors.
>> 
>> Or you could also put the function pointer into a RELRO segment.  Then
>> there's overlap with the __libc_single_threaded discussion, where
>> people objected to this kind of optimization (although I did not
>> propose to change the TCB ABI, that would be required for
>> __libc_single_threaded because it's an external interface).
>
> Of course you can use a normal global, but now every call point needs
> to setup a TOC pointer (= two entry points and more icache lines for
> otherwise trivial functions).
>
> I think my choice would be just making the inline syscall be a single
> call insn to an asm source file that out-of-lines the loading of TOC
> pointer and call through it or branch based on hwcap so that it's not
> repeated all over the place.

I don't know how problematic control flow out of an inline asm is on
POWER.  But this is basically the -moutline-atomics approach.

> Alternatively, it would perhaps work to just put hwcap in the TCB and
> branch on it rather than making an indirect call to a function pointer
> in the TCB, so that the worst you could do by clobbering it is execute
> the wrong syscall insn and thereby get SIGILL.

The HWCAP is already in the TCB.  I expect this is what generic glibc
builds are going to use (perhaps with a bit of tweaking favorable to
POWER8 implementations, but we'll see).

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 17:59         ` Rich Felker
@ 2020-04-16 18:18           ` Adhemerval Zanella
  2020-04-16 18:31             ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Adhemerval Zanella @ 2020-04-16 18:18 UTC (permalink / raw)
  To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev



On 16/04/2020 14:59, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 16/04/2020 12:37, Rich Felker wrote:
>>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>>>>> My preference would be that it work just like the i386 AT_SYSINFO
>>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
>>>>> provides a stub in the vdso that performs either scv or the old
>>>>> mechanism with the same calling convention. Then if the kernel doesn't
>>>>> provide it (because the kernel is too old) libc would have to provide
>>>>> its own stub that uses the legacy method and matches the calling
>>>>> convention of the one the kernel is expected to provide.
>>>>
>>>> What about pthread cancellation and the requirement of checking the
>>>> cancellable syscall anchors in asynchronous cancellation? My plan is
>>>> still to use musl strategy on glibc (BZ#12683) and for i686 it 
>>>> requires to always use old int$128 for program that uses cancellation
>>>> (static case) or just threads (dynamic mode, which should be more
>>>> common on glibc).
>>>>
>>>> Using the i686 strategy of a vDSO bridge symbol would require to always
>>>> fallback to 'sc' to still use the same cancellation strategy (and
>>>> thus defeating this optimization in such cases).
>>>
>>> Yes, I assumed it would be the same, ignoring the new syscall
>>> mechanism for cancellable syscalls. While there are some exceptions,
>>> cancellable syscalls are generally not hot paths but things that are
>>> expected to block and to have significant amounts of work to do in
>>> kernelspace, so saving a few tens of cycles is rather pointless.
>>>
>>> It's possible to do a branch/multiple versions of the syscall asm for
>>> cancellation but would require extending the cancellation handler to
>>> support checking against multiple independent address ranges or using
>>> some alternate markup of them.
>>
>> The main issue is at least for glibc dynamic linking is way more common
>> than static linking and once the program become multithread the fallback
>> will be always used.
> 
> I'm not relying on static linking optimizing out the cancellable
> version. I'm talking about how cancellable syscalls are pretty much
> all "heavy" operations to begin with where a few tens of cycles are in
> the realm of "measurement noise" relative to the dominating time
> costs.

Yes, I am aware, but at the same time I am not sure how it plays out in
the real world. For instance, some workloads might issue kernel query
syscalls, such as recv, where buffer copying might not be the dominant
factor. So if the idea is to optimize the syscall mechanism, I think we
should try to leverage it as a whole in libc.

> 
>> And besides the cancellation performance issue, a new bridge vDSO mechanism
>> will still require to setup some extra bridge for the case of the older
>> kernel.  In the scheme you suggested:
>>
>>   __asm__("indirect call" ... with common clobbers);
>>
>> The indirect call will be either the vDSO bridge or an libc provided that
>> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
>> against:
>>
>>    if (hwcap & PPC_FEATURE2_SCV) {
>>      __asm__(... with some clobbers);
>>    } else {
>>      __asm__(... with different clobbers);
>>    }
> 
> If the indirect call can be made roughly as efficiently as the sc
> sequence now (which already have some cost due to handling the nasty
> error return convention, making the indirect call likely just as small
> or smaller), it's O(1) additional code size (and thus icache usage)
> rather than O(n) where n is number of syscall points.
> 
> Of course it would work just as well (for avoiding O(n) growth) to
> have a direct call to out-of-line branch like you suggested.

Yes, but does it really matter to optimize this specific use case
for size? glibc, for instance, tries to leverage the syscall mechanism
by adding some complex pre-processor asm directives.  It optimizes
the syscall code size in most cases.  For instance, kill in the static
case generates on x86_64:

0000000000000000 <__kill>:
   0:   b8 3e 00 00 00          mov    $0x3e,%eax
   5:   0f 05                   syscall 
   7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
   d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
  13:   c3                      retq   

While on musl:

0000000000000000 <kill>:
   0:	48 83 ec 08          	sub    $0x8,%rsp
   4:	48 63 ff             	movslq %edi,%rdi
   7:	48 63 f6             	movslq %esi,%rsi
   a:	b8 3e 00 00 00       	mov    $0x3e,%eax
   f:	0f 05                	syscall 
  11:	48 89 c7             	mov    %rax,%rdi
  14:	e8 00 00 00 00       	callq  19 <kill+0x19>
  19:	5a                   	pop    %rdx
  1a:	c3                   	retq   

But I hardly think it pays off the required code complexity.  Same
for providing an O(1) bridge: it will require additional complexity
to write and set up correctly.
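
For reference, the check that follows `syscall` in the glibc listing
(cmp against -4095, then jae) can be written in C roughly as below;
the out-of-line call in the musl listing is presumably to a helper of
this kind. This is a sketch, and the function name is hypothetical.

```c
#include <errno.h>

/* Sketch of the Linux x86_64 error-return convention: raw return values
 * in [-4095, -1] are negated errno codes, anything else is a result. */
static long syscall_ret_sketch(unsigned long raw)
{
	if (raw > -4096UL) {       /* same unsigned test as cmp/jae above */
		errno = (int)-raw; /* e.g. raw == -3 gives errno == 3 */
		return -1;
	}
	return (long)raw;
}
```

Inlining this test at every call site is what the glibc macros buy;
calling a shared helper trades a few cycles for smaller call sites.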

> 
>> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a 
>> TCB member (as we do on glibc) and if we could make the asm clever
>> enough to not require different clobbers (although not sure if
>> it would be possible).
> 
> The easy way not to require different clobbers is just using the union
> of the clobbers, no? Does the proposed new method clobber any
> call-saved registers that would make it painful (requiring new call
> frames to save them in)?

As far as I can tell, it should be ok.

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 18:18           ` Adhemerval Zanella
@ 2020-04-16 18:31             ` Rich Felker
  2020-04-16 18:44               ` Rich Felker
                                 ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Rich Felker @ 2020-04-16 18:31 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev

On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 16/04/2020 14:59, Rich Felker wrote:
> > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 16/04/2020 12:37, Rich Felker wrote:
> >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
> >>>>> My preference would be that it work just like the i386 AT_SYSINFO
> >>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
> >>>>> provides a stub in the vdso that performs either scv or the old
> >>>>> mechanism with the same calling convention. Then if the kernel doesn't
> >>>>> provide it (because the kernel is too old) libc would have to provide
> >>>>> its own stub that uses the legacy method and matches the calling
> >>>>> convention of the one the kernel is expected to provide.
> >>>>
> >>>> What about pthread cancellation and the requirement of checking the
> >>>> cancellable syscall anchors in asynchronous cancellation? My plan is
> >>>> still to use musl strategy on glibc (BZ#12683) and for i686 it 
> >>>> requires to always use old int$128 for program that uses cancellation
> >>>> (static case) or just threads (dynamic mode, which should be more
> >>>> common on glibc).
> >>>>
> >>>> Using the i686 strategy of a vDSO bridge symbol would require to always
> >>>> fallback to 'sc' to still use the same cancellation strategy (and
> >>>> thus defeating this optimization in such cases).
> >>>
> >>> Yes, I assumed it would be the same, ignoring the new syscall
> >>> mechanism for cancellable syscalls. While there are some exceptions,
> >>> cancellable syscalls are generally not hot paths but things that are
> >>> expected to block and to have significant amounts of work to do in
> >>> kernelspace, so saving a few tens of cycles is rather pointless.
> >>>
> >>> It's possible to do a branch/multiple versions of the syscall asm for
> >>> cancellation but would require extending the cancellation handler to
> >>> support checking against multiple independent address ranges or using
> >>> some alternate markup of them.
> >>
> >> The main issue is at least for glibc dynamic linking is way more common
> >> than static linking and once the program become multithread the fallback
> >> will be always used.
> > 
> > I'm not relying on static linking optimizing out the cancellable
> > version. I'm talking about how cancellable syscalls are pretty much
> > all "heavy" operations to begin with where a few tens of cycles are in
> > the realm of "measurement noise" relative to the dominating time
> > costs.
> 
> Yes I am aware, but at same time I am not sure how it plays on real world.
> For instance, some workloads might issue kernel query syscalls, such as
> recv, where buffer copying might not be dominant factor. So I see that if
> the idea is optimizing syscall mechanism, we should try to leverage it
> as whole in libc.

Have you timed a minimal recv? I'm not assuming buffer copying is the
dominant factor. I'm assuming the overhead of all the kernel layers
involved is dominant.

> >> And besides the cancellation performance issue, a new bridge vDSO mechanism
> >> will still require to setup some extra bridge for the case of the older
> >> kernel.  In the scheme you suggested:
> >>
> >>   __asm__("indirect call" ... with common clobbers);
> >>
> >> The indirect call will be either the vDSO bridge or an libc provided that
> >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
> >> against:
> >>
> >>    if (hwcap & PPC_FEATURE2_SCV) {
> >>      __asm__(... with some clobbers);
> >>    } else {
> >>      __asm__(... with different clobbers);
> >>    }
> > 
> > If the indirect call can be made roughly as efficiently as the sc
> > sequence now (which already have some cost due to handling the nasty
> > error return convention, making the indirect call likely just as small
> > or smaller), it's O(1) additional code size (and thus icache usage)
> > rather than O(n) where n is number of syscall points.
> > 
> > Of course it would work just as well (for avoiding O(n) growth) to
> > have a direct call to out-of-line branch like you suggested.
> 
> Yes, but does it really matter to optimize this specific usage case
> for size? glibc, for instance, tries to leverage the syscall mechanism 
> by adding some complex pre-processor asm directives.  It optimizes
> the syscall code size in most cases.  For instance, kill in static case 
> generates on x86_64:
> 
> 0000000000000000 <__kill>:
>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
>    5:   0f 05                   syscall 
>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
>   13:   c3                      retq   
> 
> While on musl:
> 
> 0000000000000000 <kill>:
>    0:	48 83 ec 08          	sub    $0x8,%rsp
>    4:	48 63 ff             	movslq %edi,%rdi
>    7:	48 63 f6             	movslq %esi,%rsi
>    a:	b8 3e 00 00 00       	mov    $0x3e,%eax
>    f:	0f 05                	syscall 
>   11:	48 89 c7             	mov    %rax,%rdi
>   14:	e8 00 00 00 00       	callq  19 <kill+0x19>
>   19:	5a                   	pop    %rdx
>   1a:	c3                   	retq   

Wow that's some extraordinarily bad codegen going on by gcc... The
sign-extension is semantically needed and I don't see a good way
around it (glibc's asm is kinda a hack taking advantage of kernel not
looking at high bits, I think), but the gratuitous stack adjustment
and refusal to generate a tail call isn't. I'll see if we can track
down what's going on and get it fixed.

> But I hardly think it pays off the required code complexity.  Some
> for providing a O(1) bridge: this will require additional complexity
> to write it and setup correctly.

In some sense I agree, but inline instructions are a lot more
expensive on ppc (being 32-bit each), and it might take out-of-lining
anyway to get rid of stack frame setups if that ends up being a
problem.

> >> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a 
> >> TCB member (as we do on glibc) and if we could make the asm clever
> >> enough to not require different clobbers (although not sure if
> >> it would be possible).
> > 
> > The easy way not to require different clobbers is just using the union
> > of the clobbers, no? Does the proposed new method clobber any
> > call-saved registers that would make it painful (requiring new call
> > frames to save them in)?
> 
> As far I can tell, it should be ok.

Note that because lr is clobbered we need at least one normally
call-clobbered register that's not syscall clobbered to save lr in.
Otherwise stack frame setup is required to spill it. (And I'm not even
sure if gcc does things right to avoid it by using a register -- we
should check that I guess...)

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 18:31             ` Rich Felker
@ 2020-04-16 18:44               ` Rich Felker
  2020-04-16 18:52               ` Adhemerval Zanella
  2020-04-20  1:10               ` Nicholas Piggin
  2 siblings, 0 replies; 62+ messages in thread
From: Rich Felker @ 2020-04-16 18:44 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev

On Thu, Apr 16, 2020 at 02:31:51PM -0400, Rich Felker wrote:
> > While on musl:
> > 
> > 0000000000000000 <kill>:
> >    0:	48 83 ec 08          	sub    $0x8,%rsp
> >    4:	48 63 ff             	movslq %edi,%rdi
> >    7:	48 63 f6             	movslq %esi,%rsi
> >    a:	b8 3e 00 00 00       	mov    $0x3e,%eax
> >    f:	0f 05                	syscall 
> >   11:	48 89 c7             	mov    %rax,%rdi
> >   14:	e8 00 00 00 00       	callq  19 <kill+0x19>
> >   19:	5a                   	pop    %rdx
> >   1a:	c3                   	retq   
> 
> Wow that's some extraordinarily bad codegen going on by gcc... The
> sign-extension is semantically needed and I don't see a good way
> around it (glibc's asm is kinda a hack taking advantage of kernel not
> looking at high bits, I think), but the gratuitous stack adjustment
> and refusal to generate a tail call isn't. I'll see if we can track
> down what's going on and get it fixed.

It seems to be https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14441
which I've updated with a comment about the above.

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 18:31             ` Rich Felker
  2020-04-16 18:44               ` Rich Felker
@ 2020-04-16 18:52               ` Adhemerval Zanella
  2020-04-20  0:46                 ` Nicholas Piggin
  2020-04-20  1:10               ` Nicholas Piggin
  2 siblings, 1 reply; 62+ messages in thread
From: Adhemerval Zanella @ 2020-04-16 18:52 UTC (permalink / raw)
  To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev



On 16/04/2020 15:31, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 16/04/2020 14:59, Rich Felker wrote:
>>> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 16/04/2020 12:37, Rich Felker wrote:
>>>>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>>>>>>> My preference would be that it work just like the i386 AT_SYSINFO
>>>>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
>>>>>>> provides a stub in the vdso that performs either scv or the old
>>>>>>> mechanism with the same calling convention. Then if the kernel doesn't
>>>>>>> provide it (because the kernel is too old) libc would have to provide
>>>>>>> its own stub that uses the legacy method and matches the calling
>>>>>>> convention of the one the kernel is expected to provide.
>>>>>>
>>>>>> What about pthread cancellation and the requirement of checking the
>>>>>> cancellable syscall anchors in asynchronous cancellation? My plan is
>>>>>> still to use musl strategy on glibc (BZ#12683) and for i686 it 
>>>>>> requires to always use old int$128 for program that uses cancellation
>>>>>> (static case) or just threads (dynamic mode, which should be more
>>>>>> common on glibc).
>>>>>>
>>>>>> Using the i686 strategy of a vDSO bridge symbol would require to always
>>>>>> fallback to 'sc' to still use the same cancellation strategy (and
>>>>>> thus defeating this optimization in such cases).
>>>>>
>>>>> Yes, I assumed it would be the same, ignoring the new syscall
>>>>> mechanism for cancellable syscalls. While there are some exceptions,
>>>>> cancellable syscalls are generally not hot paths but things that are
>>>>> expected to block and to have significant amounts of work to do in
>>>>> kernelspace, so saving a few tens of cycles is rather pointless.
>>>>>
>>>>> It's possible to do a branch/multiple versions of the syscall asm for
>>>>> cancellation but would require extending the cancellation handler to
>>>>> support checking against multiple independent address ranges or using
>>>>> some alternate markup of them.
>>>>
>>>> The main issue is at least for glibc dynamic linking is way more common
>>>> than static linking and once the program become multithread the fallback
>>>> will be always used.
>>>
>>> I'm not relying on static linking optimizing out the cancellable
>>> version. I'm talking about how cancellable syscalls are pretty much
>>> all "heavy" operations to begin with where a few tens of cycles are in
>>> the realm of "measurement noise" relative to the dominating time
>>> costs.
>>
>> Yes I am aware, but at same time I am not sure how it plays on real world.
>> For instance, some workloads might issue kernel query syscalls, such as
>> recv, where buffer copying might not be dominant factor. So I see that if
>> the idea is optimizing syscall mechanism, we should try to leverage it
>> as whole in libc.
> 
> Have you timed a minimal recv? I'm not assuming buffer copying is the
> dominant factor. I'm assuming the overhead of all the kernel layers
> involved is dominant.

Not really, but the stated advantages of 'scv' over 'sc' also do not
outline the real expected gain.  Taking into consideration that this
should be a micro-optimization (focused on the syscall entry path), I
think we should use it where possible.

> 
>>>> And besides the cancellation performance issue, a new bridge vDSO mechanism
>>>> will still require to setup some extra bridge for the case of the older
>>>> kernel.  In the scheme you suggested:
>>>>
>>>>   __asm__("indirect call" ... with common clobbers);
>>>>
>>>> The indirect call will be either the vDSO bridge or an libc provided that
>>>> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
>>>> against:
>>>>
>>>>    if (hwcap & PPC_FEATURE2_SCV) {
>>>>      __asm__(... with some clobbers);
>>>>    } else {
>>>>      __asm__(... with different clobbers);
>>>>    }
>>>
>>> If the indirect call can be made roughly as efficiently as the sc
>>> sequence now (which already have some cost due to handling the nasty
>>> error return convention, making the indirect call likely just as small
>>> or smaller), it's O(1) additional code size (and thus icache usage)
>>> rather than O(n) where n is number of syscall points.
>>>
>>> Of course it would work just as well (for avoiding O(n) growth) to
>>> have a direct call to out-of-line branch like you suggested.
>>
>> Yes, but does it really matter to optimize this specific usage case
>> for size? glibc, for instance, tries to leverage the syscall mechanism 
>> by adding some complex pre-processor asm directives.  It optimizes
>> the syscall code size in most cases.  For instance, kill in static case 
>> generates on x86_64:
>>
>> 0000000000000000 <__kill>:
>>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
>>    5:   0f 05                   syscall 
>>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
>>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
>>   13:   c3                      retq   
>>
>> While on musl:
>>
>> 0000000000000000 <kill>:
>>    0:	48 83 ec 08          	sub    $0x8,%rsp
>>    4:	48 63 ff             	movslq %edi,%rdi
>>    7:	48 63 f6             	movslq %esi,%rsi
>>    a:	b8 3e 00 00 00       	mov    $0x3e,%eax
>>    f:	0f 05                	syscall 
>>   11:	48 89 c7             	mov    %rax,%rdi
>>   14:	e8 00 00 00 00       	callq  19 <kill+0x19>
>>   19:	5a                   	pop    %rdx
>>   1a:	c3                   	retq   
> 
> Wow that's some extraordinarily bad codegen going on by gcc... The
> sign-extension is semantically needed and I don't see a good way
> around it (glibc's asm is kinda a hack taking advantage of kernel not
> looking at high bits, I think), but the gratuitous stack adjustment
> and refusal to generate a tail call isn't. I'll see if we can track
> down what's going on and get it fixed.

Wrt glibc, that is most likely the case, and it has bitten us on the
x32 port recently (where some types were being passed incorrectly).  In
any case, my long-term plan is also to get rid of this nasty assembly
pre-processing for syscall argument passing.

> 
>> But I hardly think it pays off the required code complexity.  Some
>> for providing a O(1) bridge: this will require additional complexity
>> to write it and setup correctly.
> 
> In some sense I agree, but inline instructions are a lot more
> expensive on ppc (being 32-bit each), and it might take out-of-lining
> anyway to get rid of stack frame setups if that ends up being a
> problem.

Indeed, I haven't started prototyping what would be required to make
this change in glibc. Maybe an out-of-line helper would make sense.

> 
>>>> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a 
>>>> TCB member (as we do on glibc) and if we could make the asm clever
>>>> enough to not require different clobbers (although not sure if
>>>> it would be possible).
>>>
>>> The easy way not to require different clobbers is just using the union
>>> of the clobbers, no? Does the proposed new method clobber any
>>> call-saved registers that would make it painful (requiring new call
>>> frames to save them in)?
>>
>> As far I can tell, it should be ok.
> 
> Note that because lr is clobbered we need at least once normally
> call-clobbered register that's not syscall clobbered to save lr in.
> Otherwise stack frame setup is required to spill it. (And I'm not even
> sure if gcc does things right to avoid it by using a register -- we
> should check that I guess...)

If I recall correctly Florian has found some issue in lr clobbering.

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  2:53           ` Nicholas Piggin
  2020-04-16  3:03             ` Rich Felker
@ 2020-04-16 20:18             ` Florian Weimer
  1 sibling, 0 replies; 62+ messages in thread
From: Florian Weimer @ 2020-04-16 20:18 UTC (permalink / raw)
  To: Nicholas Piggin via Libc-alpha
  Cc: Rich Felker, Nicholas Piggin, libc-dev, linuxppc-dev, musl

* Nicholas Piggin via Libc-alpha:

> We may or may not be getting a new ABI that will use instructions not 
> supported by old processors.
>
> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html
>
> Current ABI continues to work of course and be the default for some 
> time, but building for new one would give some opportunity to drop
> such support for old procs, at least for glibc.

If I recall correctly, it was pretty clear during last year's GNU Tools
Cauldron that this was only to be used for intra-DSO ABIs, not
cross-DSO optimization.  Relocatable object files have an ABI, too,
of course, so that's why ABI documentation is needed.

For cross-DSO optimization, the link editor would look at the DSO
being linked in, check if it uses the -mfuture ABI, and apply some
shortcuts.  But at that point, if the DSO is swapped back to a version
built without -mfuture, it no longer works with those newly linked
binaries against the -mfuture version.  Such a thing is a clear ABI
bump, and based on what I remember from Cauldron, that is not the plan
here.

(I don't have any insider knowledge -- I just don't want people to read
this and think: gosh, yet another POWER ABI bump.  But the PCREL stuff
*is* exciting!)

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 18:12           ` Florian Weimer
@ 2020-04-16 23:02             ` Segher Boessenkool
  2020-04-17  0:34               ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Segher Boessenkool @ 2020-04-16 23:02 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Rich Felker, musl, libc-alpha, linuxppc-dev, Nicholas Piggin, libc-dev

On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote:
> > I think my choice would be just making the inline syscall be a single
> > call insn to an asm source file that out-of-lines the loading of TOC
> > pointer and call through it or branch based on hwcap so that it's not
> > repeated all over the place.
> 
> I don't know how problematic control flow out of an inline asm is on
> POWER.  But this is basically the -moutline-atomics approach.

Control flow out of inline asm (other than with "asm goto") is not
allowed at all, just like on any other target (and will not work in
practice, either -- just like on any other target).  But the suggestion
was to use actual assembler code, not inline asm?


Segher

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 23:02             ` Segher Boessenkool
@ 2020-04-17  0:34               ` Rich Felker
  2020-04-17  1:48                 ` Segher Boessenkool
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-17  0:34 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Florian Weimer, musl, libc-alpha, linuxppc-dev, Nicholas Piggin,
	libc-dev

On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote:
> On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote:
> > > I think my choice would be just making the inline syscall be a single
> > > call insn to an asm source file that out-of-lines the loading of TOC
> > > pointer and call through it or branch based on hwcap so that it's not
> > > repeated all over the place.
> > 
> > I don't know how problematic control flow out of an inline asm is on
> > POWER.  But this is basically the -moutline-atomics approach.
> 
> Control flow out of inline asm (other than with "asm goto") is not
> allowed at all, just like on any other target (and will not work in
> practice, either -- just like on any other target).  But the suggestion
> was to use actual assembler code, not inline asm?

Calling it control flow out of inline asm is something of a misnomer.
The enclosing state is not discarded or altered; the asm statement
exits normally, reaching the next instruction in the enclosing
block/function as soon as the call from the asm statement returns,
with all register/clobber constraints satisfied.

Control flow out of inline asm would be more like longjmp, and it can
be valid -- for instance, you can implement coroutines this way
(assuming you switch stack correctly) or do longjmp this way (jumping
to the location saved by setjmp). But it's not what'd be happening
here.
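
The distinction can be seen in plain C, without any asm. A minimal
sketch with made-up function names: the first call returns to the
statement after it, which is the situation described above for a call
made from inside an asm statement; the second leaves via longjmp and
never reaches the statement after it.

```c
#include <setjmp.h>

static jmp_buf env;
static int steps;   /* static storage, so its value is well defined after longjmp */

static void returns_normally(void) { steps++; }
static void jumps_away(void)       { steps++; longjmp(env, 1); }

/* Hypothetical demo: counts which statements actually execute. */
static int demo(void)
{
	steps = 0;
	returns_normally();      /* ordinary call: control comes back here */
	steps++;                 /* reached */
	if (setjmp(env) == 0) {
		jumps_away();        /* does not return to this point */
		steps += 100;        /* never reached */
	}
	return steps;
}
```

Only the second pattern bypasses the enclosing code's normal flow.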

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-17  0:34               ` Rich Felker
@ 2020-04-17  1:48                 ` Segher Boessenkool
  2020-04-17  8:34                   ` Florian Weimer
  0 siblings, 1 reply; 62+ messages in thread
From: Segher Boessenkool @ 2020-04-17  1:48 UTC (permalink / raw)
  To: Rich Felker
  Cc: Florian Weimer, musl, libc-alpha, linuxppc-dev, Nicholas Piggin,
	libc-dev

On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote:
> > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote:
> > > > I think my choice would be just making the inline syscall be a single
> > > > call insn to an asm source file that out-of-lines the loading of TOC
> > > > pointer and call through it or branch based on hwcap so that it's not
> > > > repeated all over the place.
> > > 
> > > I don't know how problematic control flow out of an inline asm is on
> > > POWER.  But this is basically the -moutline-atomics approach.
> > 
> > Control flow out of inline asm (other than with "asm goto") is not
> > allowed at all, just like on any other target (and will not work in
> > practice, either -- just like on any other target).  But the suggestion
> > was to use actual assembler code, not inline asm?
> 
> Calling it control flow out of inline asm is something of a misnomer.
> The enclosing state is not discarded or altered; the asm statement
> exits normally, reaching the next instruction in the enclosing
> block/function as soon as the call from the asm statement returns,
> with all register/clobber constraints satisfied.

Ah.  That should always Just Work, then -- our ABIs guarantee you can.

> Control flow out of inline asm would be more like longjmp, and it can
> be valid -- for instance, you can implement coroutines this way
> (assuming you switch stack correctly) or do longjmp this way (jumping
> to the location saved by setjmp). But it's not what'd be happening
> here.

Yeah, you cannot do that in C, not without making assumptions about what
machine code the compiler generates.  GCC explicitly disallows it, too:

     'asm' statements may not perform jumps into other 'asm' statements,
     only to the listed GOTOLABELS.  GCC's optimizers do not know about
     other jumps; therefore they cannot take account of them when
     deciding how to optimize.


Segher

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-17  1:48                 ` Segher Boessenkool
@ 2020-04-17  8:34                   ` Florian Weimer
  0 siblings, 0 replies; 62+ messages in thread
From: Florian Weimer @ 2020-04-17  8:34 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Rich Felker, musl, libc-alpha, linuxppc-dev, Nicholas Piggin, libc-dev

* Segher Boessenkool:

> On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote:
>> On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote:
>> > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote:
>> > > > I think my choice would be just making the inline syscall be a single
>> > > > call insn to an asm source file that out-of-lines the loading of TOC
>> > > > pointer and call through it or branch based on hwcap so that it's not
>> > > > repeated all over the place.
>> > > 
>> > > I don't know how problematic control flow out of an inline asm is on
>> > > POWER.  But this is basically the -moutline-atomics approach.
>> > 
>> > Control flow out of inline asm (other than with "asm goto") is not
>> > allowed at all, just like on any other target (and will not work in
>> > practice, either -- just like on any other target).  But the suggestion
>> > was to use actual assembler code, not inline asm?
>> 
>> Calling it control flow out of inline asm is something of a misnomer.
>> The enclosing state is not discarded or altered; the asm statement
>> exits normally, reaching the next instruction in the enclosing
>> block/function as soon as the call from the asm statement returns,
>> with all register/clobber constraints satisfied.
>
> Ah.  That should always Just Work, then -- our ABIs guarantee you can.

After thinking about it, I agree: GCC will handle spilling of the link
register.  Branch-and-link instructions do not clobber the protected
zone, so no stack adjustment is needed (which would be problematic to
reflect in the unwind information).

Of course, the target function has to be written in assembler because
it must not use a regular stack frame.

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16  9:58     ` Szabolcs Nagy
@ 2020-04-20  0:27       ` Nicholas Piggin
  2020-04-20  1:29         ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-20  0:27 UTC (permalink / raw)
  To: Nicholas Piggin via Libc-alpha, Szabolcs Nagy
  Cc: Rich Felker, libc-dev, linuxppc-dev, musl

Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
> * Nicholas Piggin via Libc-alpha <libc-alpha@sourceware.org> [2020-04-16 10:16:54 +1000]:
>> Well it would have to test HWCAP and patch in or branch to two 
>> completely different sequences including register save/restores yes.
>> You could have the same asm and matching clobbers to put the sequence
>> inline and then you could patch the one sc/scv instruction I suppose.
> 
> how would that 'patch' work?
> 
> there are many reasons why you don't
> want libc to write its .text

I guess I don't know what I'm talking about when it comes to libraries. 
Shame if there is no good way to load-time patch libc. It's orthogonal
to the scv selection though -- if you don't patch, you have to use a
conditional or indirect branch however you implement it.
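
As a rough illustration of the indirect-branch variant, here is a
minimal sketch that picks an entry point once at startup. The function
names are hypothetical, and both entry points are placeholders that go
through syscall(2); a real libc would issue 'sc' in one and 'scv 0' in
the other. The PPC_FEATURE2_SCV value is the one in the Linux powerpc
uapi headers and is defined here only so the sketch builds elsewhere.

```c
#define _GNU_SOURCE
#include <sys/auxv.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000   /* only meaningful on powerpc */
#endif

/* Placeholders: both paths behave identically in this sketch. */
static long enter_legacy(long nr) { return syscall(nr); }
static long enter_scv(long nr)    { return syscall(nr); }

/* Chosen once; each call site then needs only an indirect branch. */
static long (*syscall_entry)(long) = enter_legacy;

static void select_syscall_entry(void)
{
	unsigned long hwcap2 = getauxval(AT_HWCAP2);

	/* On non-powerpc hosts this bit means something else (or nothing);
	 * the sketch still runs because both entry points do the same thing. */
	if (hwcap2 & PPC_FEATURE2_SCV)
		syscall_entry = enter_scv;
}
```

After select_syscall_entry() has run, a call site is just
syscall_entry(nr), with no per-site hwcap test.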

Thanks,
Nick

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 18:52               ` Adhemerval Zanella
@ 2020-04-20  0:46                 ` Nicholas Piggin
  0 siblings, 0 replies; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-20  0:46 UTC (permalink / raw)
  To: Adhemerval Zanella, Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl

Excerpts from Adhemerval Zanella's message of April 17, 2020 4:52 am:
> 
> 
> On 16/04/2020 15:31, Rich Felker wrote:
>> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
>>>
>>>
>>> On 16/04/2020 14:59, Rich Felker wrote:
>>>> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>>>>>
>>>>>
>>>>> On 16/04/2020 12:37, Rich Felker wrote:
>>>>>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>>>>>>>> My preference would be that it work just like the i386 AT_SYSINFO
>>>>>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
>>>>>>>> provides a stub in the vdso that performs either scv or the old
>>>>>>>> mechanism with the same calling convention. Then if the kernel doesn't
>>>>>>>> provide it (because the kernel is too old) libc would have to provide
>>>>>>>> its own stub that uses the legacy method and matches the calling
>>>>>>>> convention of the one the kernel is expected to provide.
>>>>>>>
>>>>>>> What about pthread cancellation and the requirement of checking the
>>>>>>> cancellable syscall anchors in asynchronous cancellation? My plan is
>>>>>>> still to use musl strategy on glibc (BZ#12683) and for i686 it 
>>>>>>> requires to always use old int$128 for program that uses cancellation
>>>>>>> (static case) or just threads (dynamic mode, which should be more
>>>>>>> common on glibc).
>>>>>>>
>>>>>>> Using the i686 strategy of a vDSO bridge symbol would require to always
>>>>>>> fallback to 'sc' to still use the same cancellation strategy (and
>>>>>>> thus defeating this optimization in such cases).
>>>>>>
>>>>>> Yes, I assumed it would be the same, ignoring the new syscall
>>>>>> mechanism for cancellable syscalls. While there are some exceptions,
>>>>>> cancellable syscalls are generally not hot paths but things that are
>>>>>> expected to block and to have significant amounts of work to do in
>>>>>> kernelspace, so saving a few tens of cycles is rather pointless.
>>>>>>
>>>>>> It's possible to do a branch/multiple versions of the syscall asm for
>>>>>> cancellation but would require extending the cancellation handler to
>>>>>> support checking against multiple independent address ranges or using
>>>>>> some alternate markup of them.
>>>>>
>>>>> The main issue is at least for glibc dynamic linking is way more common
>>>>> than static linking and once the program become multithread the fallback
>>>>> will be always used.
>>>>
>>>> I'm not relying on static linking optimizing out the cancellable
>>>> version. I'm talking about how cancellable syscalls are pretty much
>>>> all "heavy" operations to begin with where a few tens of cycles are in
>>>> the realm of "measurement noise" relative to the dominating time
>>>> costs.
>>>
>>> Yes I am aware, but at same time I am not sure how it plays on real world.
>>> For instance, some workloads might issue kernel query syscalls, such as
>>> recv, where buffer copying might not be dominant factor. So I see that if
>>> the idea is optimizing syscall mechanism, we should try to leverage it
>>> as whole in libc.
>> 
>> Have you timed a minimal recv? I'm not assuming buffer copying is the
>> dominant factor. I'm assuming the overhead of all the kernel layers
>> involved is dominant.
> 
> Not really, but reading the advantages of using 'scv' over 'sc' also does
> not outline the real expect gain.  Taking in consideration this should
> be a micro-optimization (focused on entry syscall patch), I think we should
> use where it possible.

It's around a 90-cycle improvement. Depending on config options and the
speculative-execution mitigations in place, this may be roughly 5-20%
of a gettid syscall, which itself probably bears little relationship to
what a recv syscall doing real work would do; it's easy to swamp it
with other work.

But it's a pretty big win in terms of how much we try to optimise this
path.
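
For anyone wanting to reproduce this kind of number, a minimal sketch
of a gettid timing loop plus the arithmetic behind the percentages.
The helper names are made up, and wall-clock nanoseconds are only a
stand-in for cycle counts:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average wall-clock nanoseconds per raw gettid syscall over 'iters' calls. */
static double gettid_ns_per_call(long iters)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++)
		syscall(SYS_gettid);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	return ns / iters;
}

/* Total cost implied by "the saving is 'pct' percent of the call". */
static double implied_total_cycles(double saved_cycles, double pct)
{
	return saved_cycles * 100.0 / pct;
}
```

With a 90-cycle saving, 20% implies a 450-cycle gettid and 5% implies
an 1800-cycle one, which gives a feel for how much the config and
mitigation settings move the baseline.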

Thanks,
Nick

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-16 18:31             ` Rich Felker
  2020-04-16 18:44               ` Rich Felker
  2020-04-16 18:52               ` Adhemerval Zanella
@ 2020-04-20  1:10               ` Nicholas Piggin
  2020-04-20  1:34                 ` Rich Felker
  2020-04-21 12:28                 ` David Laight
  2 siblings, 2 replies; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-20  1:10 UTC (permalink / raw)
  To: Adhemerval Zanella, Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl

Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
>> 
>> 
>> On 16/04/2020 14:59, Rich Felker wrote:
>> > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>> >>
>> >>
>> >> On 16/04/2020 12:37, Rich Felker wrote:
>> >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>> >>>>> My preference would be that it work just like the i386 AT_SYSINFO
>> >>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
>> >>>>> provides a stub in the vdso that performs either scv or the old
>> >>>>> mechanism with the same calling convention. Then if the kernel doesn't
>> >>>>> provide it (because the kernel is too old) libc would have to provide
>> >>>>> its own stub that uses the legacy method and matches the calling
>> >>>>> convention of the one the kernel is expected to provide.
>> >>>>
>> >>>> What about pthread cancellation and the requirement of checking the
>> >>>> cancellable syscall anchors in asynchronous cancellation? My plan is
>> >>>> still to use musl strategy on glibc (BZ#12683) and for i686 it 
>> >>>> requires to always use old int$128 for program that uses cancellation
>> >>>> (static case) or just threads (dynamic mode, which should be more
>> >>>> common on glibc).
>> >>>>
>> >>>> Using the i686 strategy of a vDSO bridge symbol would require to always
>> >>>> fallback to 'sc' to still use the same cancellation strategy (and
>> >>>> thus defeating this optimization in such cases).
>> >>>
>> >>> Yes, I assumed it would be the same, ignoring the new syscall
>> >>> mechanism for cancellable syscalls. While there are some exceptions,
>> >>> cancellable syscalls are generally not hot paths but things that are
>> >>> expected to block and to have significant amounts of work to do in
>> >>> kernelspace, so saving a few tens of cycles is rather pointless.
>> >>>
>> >>> It's possible to do a branch/multiple versions of the syscall asm for
>> >>> cancellation but would require extending the cancellation handler to
>> >>> support checking against multiple independent address ranges or using
>> >>> some alternate markup of them.
>> >>
>> >> The main issue is at least for glibc dynamic linking is way more common
>> >> than static linking and once the program become multithread the fallback
>> >> will be always used.
>> > 
>> > I'm not relying on static linking optimizing out the cancellable
>> > version. I'm talking about how cancellable syscalls are pretty much
>> > all "heavy" operations to begin with where a few tens of cycles are in
>> > the realm of "measurement noise" relative to the dominating time
>> > costs.
>> 
>> Yes I am aware, but at same time I am not sure how it plays on real world.
>> For instance, some workloads might issue kernel query syscalls, such as
>> recv, where buffer copying might not be dominant factor. So I see that if
>> the idea is optimizing syscall mechanism, we should try to leverage it
>> as whole in libc.
> 
> Have you timed a minimal recv? I'm not assuming buffer copying is the
> dominant factor. I'm assuming the overhead of all the kernel layers
> involved is dominant.
> 
>> >> And besides the cancellation performance issue, a new bridge vDSO mechanism
>> >> will still require to setup some extra bridge for the case of the older
>> >> kernel.  In the scheme you suggested:
>> >>
>> >>   __asm__("indirect call" ... with common clobbers);
>> >>
>> >> The indirect call will be either the vDSO bridge or an libc provided that
>> >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
>> >> against:
>> >>
>> >>    if (hwcap & PPC_FEATURE2_SCV) {
>> >>      __asm__(... with some clobbers);
>> >>    } else {
>> >>      __asm__(... with different clobbers);
>> >>    }
>> > 
>> > If the indirect call can be made roughly as efficiently as the sc
>> > sequence now (which already have some cost due to handling the nasty
>> > error return convention, making the indirect call likely just as small
>> > or smaller), it's O(1) additional code size (and thus icache usage)
>> > rather than O(n) where n is number of syscall points.
>> > 
>> > Of course it would work just as well (for avoiding O(n) growth) to
>> > have a direct call to out-of-line branch like you suggested.
>> 
>> Yes, but does it really matter to optimize this specific usage case
>> for size? glibc, for instance, tries to leverage the syscall mechanism 
>> by adding some complex pre-processor asm directives.  It optimizes
>> the syscall code size in most cases.  For instance, kill in static case 
>> generates on x86_64:
>> 
>> 0000000000000000 <__kill>:
>>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
>>    5:   0f 05                   syscall 
>>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
>>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
>>   13:   c3                      retq   
>> 
>> While on musl:
>> 
>> 0000000000000000 <kill>:
>>    0:	48 83 ec 08          	sub    $0x8,%rsp
>>    4:	48 63 ff             	movslq %edi,%rdi
>>    7:	48 63 f6             	movslq %esi,%rsi
>>    a:	b8 3e 00 00 00       	mov    $0x3e,%eax
>>    f:	0f 05                	syscall 
>>   11:	48 89 c7             	mov    %rax,%rdi
>>   14:	e8 00 00 00 00       	callq  19 <kill+0x19>
>>   19:	5a                   	pop    %rdx
>>   1a:	c3                   	retq   
> 
> Wow that's some extraordinarily bad codegen going on by gcc... The
> sign-extension is semantically needed and I don't see a good way
> around it (glibc's asm is kinda a hack taking advantage of kernel not
> looking at high bits, I think), but the gratuitous stack adjustment
> and refusal to generate a tail call isn't. I'll see if we can track
> down what's going on and get it fixed.
> 
>> But I hardly think it pays off the required code complexity.  Some
>> for providing a O(1) bridge: this will require additional complexity
>> to write it and setup correctly.
> 
> In some sense I agree, but inline instructions are a lot more
> expensive on ppc (being 32-bit each), and it might take out-of-lining
> anyway to get rid of stack frame setups if that ends up being a
> problem.
> 
>> >> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a 
>> >> TCB member (as we do on glibc) and if we could make the asm clever
>> >> enough to not require different clobbers (although not sure if
>> >> it would be possible).
>> > 
>> > The easy way not to require different clobbers is just using the union
>> > of the clobbers, no? Does the proposed new method clobber any
>> > call-saved registers that would make it painful (requiring new call
>> > frames to save them in)?
>> 
>> As far I can tell, it should be ok.
> 
> Note that because lr is clobbered we need at least once normally
> call-clobbered register that's not syscall clobbered to save lr in.
> Otherwise stack frame setup is required to spill it.

The kernel would like to use r9-r12 for itself. We could do with fewer 
registers, but we have some delay establishing the stack (depends on a
load which depends on a mfspr), and entry code tends to be quite store
heavy whereas on the caller side you have r1 set up (modulo stack 
updates), and the system call is a long delay during which time the 
store queue has significant time to drain.

My feeling is it would be better for the kernel to have these scratch
registers.
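
To show what that means on the caller side, here is a sketch of a
zero-argument wrapper. The powerpc64 branch follows the shape of the
existing 'sc' inline asm (r0 holds the number, r3 the result, r4-r12
clobbered) and adds lr and ctr to the clobber list, since this thread
says the new instruction overwrites them. The wrapper name is
hypothetical, the instruction is still 'sc', and any further registers
a final scv ABI declares volatile would have to be added.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper, not the code of any libc. */
static long sketch_syscall0(long n)
{
#if defined(__powerpc64__)
	register long r0 __asm__("r0") = n;
	register long r3 __asm__("r3");

	__asm__ __volatile__(
		"sc ; bns+ 1f ; neg %1, %1 ; 1:"  /* cr0.SO set means error: negate r3 */
		: "+r"(r0), "=r"(r3)
		:
		: "memory", "cr0", "r4", "r5", "r6", "r7", "r8",
		  "r9", "r10", "r11", "r12", "ctr", "lr");
	return r3;
#else
	/* Other architectures: use the libc wrapper so the sketch still runs.
	 * Note it reports errors as -1 plus errno, not as a negative errno. */
	return syscall(n);
#endif
}
```

With r9-r12 in the clobber list the compiler cannot park lr in any of
them across the call, which is the cost being weighed here.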

Thanks,
Nick

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20  0:27       ` Nicholas Piggin
@ 2020-04-20  1:29         ` Rich Felker
  2020-04-20  2:08           ` Nicholas Piggin
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-20  1:29 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Nicholas Piggin via Libc-alpha, Szabolcs Nagy, libc-dev,
	linuxppc-dev, musl

On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote:
> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
> > * Nicholas Piggin via Libc-alpha <libc-alpha@sourceware.org> [2020-04-16 10:16:54 +1000]:
> >> Well it would have to test HWCAP and patch in or branch to two 
> >> completely different sequences including register save/restores yes.
> >> You could have the same asm and matching clobbers to put the sequence
> >> inline and then you could patch the one sc/scv instruction I suppose.
> > 
> > how would that 'patch' work?
> > 
> > there are many reasons why you don't
> > want libc to write its .text
> 
> I guess I don't know what I'm talking about when it comes to libraries. 
> Shame if there is no good way to load-time patch libc. It's orthogonal
> to the scv selection though -- if you don't patch you have to 
> conditional or indirect branch however you implement it.

Patched pages cannot be shared. The whole design of PIC and shared
libraries is that the code("text")/rodata is immutable and shared and
that only a minimal amount of data, packed tightly together (the GOT)
has to exist per-instance.

Also, allowing patching of executable pages is generally frowned upon
these days because W^X is a desirable hardening property.

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20  1:10               ` Nicholas Piggin
@ 2020-04-20  1:34                 ` Rich Felker
  2020-04-20  2:32                   ` Nicholas Piggin
  2020-04-21 12:28                 ` David Laight
  1 sibling, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-20  1:34 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl

On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> > Note that because lr is clobbered we need at least once normally
> > call-clobbered register that's not syscall clobbered to save lr in.
> > Otherwise stack frame setup is required to spill it.
> 
> The kernel would like to use r9-r12 for itself. We could do with fewer 
> registers, but we have some delay establishing the stack (depends on a
> load which depends on a mfspr), and entry code tends to be quite store
> heavy whereas on the caller side you have r1 set up (modulo stack 
> updates), and the system call is a long delay during which time the 
> store queue has significant time to drain.
> 
> My feeling is it would be better for kernel to have these scratch 
> registers.

If your new kernel syscall mechanism requires the caller to make a
whole stack frame it otherwise doesn't need and spill registers to it,
it becomes a lot less attractive. Some of those 90 cycles saved are
immediately lost on the userspace side, plus you either waste icache
at the call point or require the syscall to go through a
userspace-side helper function that performs the spill and restore.

The right way to do this is to have the kernel preserve enough
registers that userspace can avoid having any spills. It doesn't have
to preserve everything, probably just enough to save lr. (BTW are
syscall arg registers still preserved? If not, this is a major cost on
the userspace side, since any call point that has to loop-and-retry
(e.g. futex) now needs to make its own place to store the original
values.)
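
A loop-and-retry call point looks roughly like the following sketch
(the helper name is made up). The address, the op code and the
expected value are needed again on every iteration, so if the syscall
does not preserve its argument registers they must be kept somewhere
else and reloaded each time round the loop.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: wait while *addr == expected, retrying on EINTR.
 * Returns 0 once woken, or a negative errno for any other failure. */
static int futex_wait_retry(uint32_t *addr, uint32_t expected)
{
	for (;;) {
		long r = syscall(SYS_futex, addr, FUTEX_WAIT, expected,
		                 NULL, NULL, 0);
		if (r == 0)
			return 0;
		if (errno != EINTR)
			return -errno;   /* e.g. -EAGAIN when *addr != expected */
	}
}
```

The sketch goes through the libc syscall() wrapper, so it only shows
the shape of the loop, not the register-level cost.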

Rich

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20  1:29         ` Rich Felker
@ 2020-04-20  2:08           ` Nicholas Piggin
  2020-04-20 21:17             ` Szabolcs Nagy
  0 siblings, 1 reply; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-20  2:08 UTC (permalink / raw)
  To: Rich Felker
  Cc: Nicholas Piggin via Libc-alpha, libc-dev, linuxppc-dev, musl,
	Szabolcs Nagy

Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
> On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote:
>> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
>> > * Nicholas Piggin via Libc-alpha <libc-alpha@sourceware.org> [2020-04-16 10:16:54 +1000]:
>> >> Well it would have to test HWCAP and patch in or branch to two 
>> >> completely different sequences including register save/restores yes.
>> >> You could have the same asm and matching clobbers to put the sequence
>> >> inline and then you could patch the one sc/scv instruction I suppose.
>> > 
>> > how would that 'patch' work?
>> > 
>> > there are many reasons why you don't
>> > want libc to write its .text
>> 
>> I guess I don't know what I'm talking about when it comes to libraries. 
>> Shame if there is no good way to load-time patch libc. It's orthogonal
>> to the scv selection though -- if you don't patch you have to 
>> conditional or indirect branch however you implement it.
> 
> Patched pages cannot be shared. The whole design of PIC and shared
> libraries is that the code("text")/rodata is immutable and shared and
> that only a minimal amount of data, packed tightly together (the GOT)
> has to exist per-instance.

Yeah the pages which were patched couldn't be shared across exec, which
is a significant downside, unless you could group all patch sites into
their own section and similarly pack it together (which has issues of
being out of line).

> 
> Also, allowing patching of executable pages is generally frowned upon
> these days because W^X is a desirable hardening property.

Right, it would want to be write-protected after being patched.
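
A minimal sketch of that ordering, using a private anonymous page as a
stand-in for a page of library text (the function name and the use of
an anonymous mapping are assumptions for illustration). The page is
writable but never executable while it is being patched, and write
permission is dropped afterwards.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical load-time patch step. Returns 0 on success, -1 on failure. */
static int patch_then_protect(const unsigned char *insn, size_t len)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	if (pagesz <= 0 || len > (size_t)pagesz)
		return -1;

	/* Writable, not executable, while patching. */
	unsigned char *page = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
	                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;

	memcpy(page, insn, len);

	/* Drop write permission; real text would become PROT_READ | PROT_EXEC. */
	int rc = mprotect(page, (size_t)pagesz, PROT_READ);

	/* The patched bytes must still be readable afterwards. */
	if (rc == 0 && memcmp(page, insn, len) != 0)
		rc = -1;

	munmap(page, (size_t)pagesz);
	return rc;
}
```

With a real file-backed text page the write would trigger
copy-on-write, leaving the process with its own private copy of that
page, which is the sharing cost discussed above.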

Thanks,
Nick

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20  1:34                 ` Rich Felker
@ 2020-04-20  2:32                   ` Nicholas Piggin
  2020-04-20  4:09                     ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-20  2:32 UTC (permalink / raw)
  To: Rich Felker; +Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl

Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
> On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> > Note that because lr is clobbered we need at least once normally
>> > call-clobbered register that's not syscall clobbered to save lr in.
>> > Otherwise stack frame setup is required to spill it.
>> 
>> The kernel would like to use r9-r12 for itself. We could do with fewer 
>> registers, but we have some delay establishing the stack (depends on a
>> load which depends on a mfspr), and entry code tends to be quite store
>> heavy whereas on the caller side you have r1 set up (modulo stack 
>> updates), and the system call is a long delay during which time the 
>> store queue has significant time to drain.
>> 
>> My feeling is it would be better for kernel to have these scratch 
>> registers.
> 
> If your new kernel syscall mechanism requires the caller to make a
> whole stack frame it otherwise doesn't need and spill registers to it,
> it becomes a lot less attractive. Some of those 90 cycles saved are
> immediately lost on the userspace side, plus you either waste icache
> at the call point or require the syscall to go through a
> userspace-side helper function that performs the spill and restore.

You would be surprised how few cycles that takes on a high-end CPU; it
might be a couple of percent. I am one for counting cycles, mind you;
I'm not being flippant about it. If we can come up with something
faster I'd be up for it.

> 
> The right way to do this is to have the kernel preserve enough
> registers that userspace can avoid having any spills. It doesn't have
> to preserve everything, probably just enough to save lr. (BTW are

Again, the problem is the kernel doesn't have its dependencies 
immediately ready to spill, and spilling (may be) more costly 
immediately after the call because we're doing a lot of stores.

I could try to measure this. Unfortunately our pipeline simulator tool 
doesn't model system calls properly, so it's hard to see what's happening 
across the user/kernel horizon. I might check if that can be improved,
or I can hack it by putting some isync in there or something.

> syscall arg registers still preserved? If not, this is a major cost on
> the userspace side, since any call point that has to loop-and-retry
> (e.g. futex) now needs to make its own place to store the original
> values.)

Powerpc system calls never did. We could have scv preserve them, but 
you'd still need to restore r3. We could make an ABI which does not
clobber r3 but puts the return value in r9, say. I'd like to see what
the user side code looks like to take advantage of such a thing though.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20  2:32                   ` Nicholas Piggin
@ 2020-04-20  4:09                     ` Rich Felker
  2020-04-20  4:31                       ` Nicholas Piggin
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-20  4:09 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl

On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> >> > Note that because lr is clobbered we need at least once normally
> >> > call-clobbered register that's not syscall clobbered to save lr in.
> >> > Otherwise stack frame setup is required to spill it.
> >> 
> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
> >> registers, but we have some delay establishing the stack (depends on a
> >> load which depends on a mfspr), and entry code tends to be quite store
> >> heavy whereas on the caller side you have r1 set up (modulo stack 
> >> updates), and the system call is a long delay during which time the 
> >> store queue has significant time to drain.
> >> 
> >> My feeling is it would be better for kernel to have these scratch 
> >> registers.
> > 
> > If your new kernel syscall mechanism requires the caller to make a
> > whole stack frame it otherwise doesn't need and spill registers to it,
> > it becomes a lot less attractive. Some of those 90 cycles saved are
> > immediately lost on the userspace side, plus you either waste icache
> > at the call point or require the syscall to go through a
> > userspace-side helper function that performs the spill and restore.
> 
> You would be surprised how few cycles that takes on a high end CPU. Some 
> might be a couple of %. I am one for counting cycles mind you, I'm not 
> being flippant about it. If we can come up with something faster I'd be 
> up for it.

If the cycle count is trivial then just do it on the kernel side.

> > The right way to do this is to have the kernel preserve enough
> > registers that userspace can avoid having any spills. It doesn't have
> > to preserve everything, probably just enough to save lr. (BTW are
> 
> Again, the problem is the kernel doesn't have its dependencies 
> immediately ready to spill, and spilling (may be) more costly 
> immediately after the call because we're doing a lot of stores.
> 
> I could try measure this. Unfortunately our pipeline simulator tool 
> doesn't model system calls properly so it's hard to see what's happening 
> across the user/kernel horizon, I might check if that can be improved
> or I can hack it by putting some isync in there or something.

I think it's unlikely to make any real difference to the total number
of cycles spent which side it happens on, but putting it on the kernel
side makes it easier to avoid wasting size/icache at each syscall
site.

> > syscall arg registers still preserved? If not, this is a major cost on
> > the userspace side, since any call point that has to loop-and-retry
> > (e.g. futex) now needs to make its own place to store the original
> > values.)
> 
> Powerpc system calls never did. We could have scv preserve them, but 
> you'd still need to restore r3. We could make an ABI which does not
> clobber r3 but puts the return value in r9, say. I'd like to see what
> the user side code looks like to take advantage of such a thing though.

Oh wow, I hadn't realized that, but indeed the code we have now is
allowing for the kernel to clobber them all. So at least this isn't
getting any worse I guess. I think it was a very poor choice of
behavior though and a disadvantage vs what other archs do (some of
them preserve all registers; others preserve only normally call-saved
ones plus the syscall arg ones and possibly a few other specials).

Rich

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20  4:09                     ` Rich Felker
@ 2020-04-20  4:31                       ` Nicholas Piggin
  2020-04-20 17:27                         ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-20  4:31 UTC (permalink / raw)
  To: Rich Felker; +Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl

Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
> On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> > Note that because lr is clobbered we need at least once normally
>> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> > Otherwise stack frame setup is required to spill it.
>> >> 
>> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
>> >> registers, but we have some delay establishing the stack (depends on a
>> >> load which depends on a mfspr), and entry code tends to be quite store
>> >> heavy whereas on the caller side you have r1 set up (modulo stack 
>> >> updates), and the system call is a long delay during which time the 
>> >> store queue has significant time to drain.
>> >> 
>> >> My feeling is it would be better for kernel to have these scratch 
>> >> registers.
>> > 
>> > If your new kernel syscall mechanism requires the caller to make a
>> > whole stack frame it otherwise doesn't need and spill registers to it,
>> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> > immediately lost on the userspace side, plus you either waste icache
>> > at the call point or require the syscall to go through a
>> > userspace-side helper function that performs the spill and restore.
>> 
>> You would be surprised how few cycles that takes on a high end CPU. Some 
>> might be a couple of %. I am one for counting cycles mind you, I'm not 
>> being flippant about it. If we can come up with something faster I'd be 
>> up for it.
> 
> If the cycle count is trivial then just do it on the kernel side.

The cycle count for user is trivial, because you have r1 ready. The kernel
does not have its stack ready; it has to do mfspr rX ; ld rY,N(rX) to get a
stack to save into.

Which is also wasted work for userspace.

Now that I think about it, no stack frame is even required! lr is saved 
into the caller's stack when it's clobbered with an asm, just as when 
it's used for a function call.

>> > The right way to do this is to have the kernel preserve enough
>> > registers that userspace can avoid having any spills. It doesn't have
>> > to preserve everything, probably just enough to save lr. (BTW are
>> 
>> Again, the problem is the kernel doesn't have its dependencies 
>> immediately ready to spill, and spilling (may be) more costly 
>> immediately after the call because we're doing a lot of stores.
>> 
>> I could try measure this. Unfortunately our pipeline simulator tool 
>> doesn't model system calls properly so it's hard to see what's happening 
>> across the user/kernel horizon, I might check if that can be improved
>> or I can hack it by putting some isync in there or something.
> 
> I think it's unlikely to make any real difference to the total number
> of cycles spent which side it happens on, but putting it on the kernel
> side makes it easier to avoid wasting size/icache at each syscall
> site.
> 
>> > syscall arg registers still preserved? If not, this is a major cost on
>> > the userspace side, since any call point that has to loop-and-retry
>> > (e.g. futex) now needs to make its own place to store the original
>> > values.)
>> 
>> Powerpc system calls never did. We could have scv preserve them, but 
>> you'd still need to restore r3. We could make an ABI which does not
>> clobber r3 but puts the return value in r9, say. I'd like to see what
>> the user side code looks like to take advantage of such a thing though.
> 
> Oh wow, I hadn't realized that, but indeed the code we have now is
> allowing for the kernel to clobber them all. So at least this isn't
> getting any worse I guess. I think it was a very poor choice of
> behavior though and a disadvantage vs what other archs do (some of
> them preserve all registers; others preserve only normally call-saved
> ones plus the syscall arg ones and possibly a few other specials).

Well, we could change it. Does the generated code improve significantly
if we take those clobbers away?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20  4:31                       ` Nicholas Piggin
@ 2020-04-20 17:27                         ` Rich Felker
  2020-04-22  6:18                           ` Nicholas Piggin
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-20 17:27 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl

On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
> > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
> >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
> >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
> >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> >> >> > Note that because lr is clobbered we need at least once normally
> >> >> > call-clobbered register that's not syscall clobbered to save lr in.
> >> >> > Otherwise stack frame setup is required to spill it.
> >> >> 
> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
> >> >> registers, but we have some delay establishing the stack (depends on a
> >> >> load which depends on a mfspr), and entry code tends to be quite store
> >> >> heavy whereas on the caller side you have r1 set up (modulo stack 
> >> >> updates), and the system call is a long delay during which time the 
> >> >> store queue has significant time to drain.
> >> >> 
> >> >> My feeling is it would be better for kernel to have these scratch 
> >> >> registers.
> >> > 
> >> > If your new kernel syscall mechanism requires the caller to make a
> >> > whole stack frame it otherwise doesn't need and spill registers to it,
> >> > it becomes a lot less attractive. Some of those 90 cycles saved are
> >> > immediately lost on the userspace side, plus you either waste icache
> >> > at the call point or require the syscall to go through a
> >> > userspace-side helper function that performs the spill and restore.
> >> 
> >> You would be surprised how few cycles that takes on a high end CPU. Some 
> >> might be a couple of %. I am one for counting cycles mind you, I'm not 
> >> being flippant about it. If we can come up with something faster I'd be 
> >> up for it.
> > 
> > If the cycle count is trivial then just do it on the kernel side.
> 
> The cycle count for user is, because you have r1 ready. Kernel does not 
> have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to 
> save into.
> 
> Which is also wasted work for a userspace.
> 
> Now that I think about it, no stack frame is even required! lr is saved 
> into the caller's stack when its clobbered with an asm, just as when 
> it's used for a function call.

No. If there is a non-clobbered register, lr can be moved to the
non-clobbered register rather than saved to the stack. However it
looks like (1) gcc doesn't take advantage of that possibility, but (2)
the caller already arranged for there to be space on the stack to save
lr, so the cost is only one store and one load, not any stack
adjustment or other frame setup. So it's probably not a really big
deal. However, just adding "lr" clobber to existing syscall in musl
increased the size of a simple syscall function (getuid) from 20 bytes
to 36 bytes.
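
A minimal sketch of the kind of wrapper being measured above, assuming a
powerpc64 'sc' style inline-asm syscall with "lr" added to the clobber
list. The name sketch_getuid and the exact clobber list are illustrative
assumptions, not the actual musl source; the non-powerpc branch exists
only so the file builds elsewhere.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: a zero-argument syscall whose asm statement also
 * declares lr as clobbered, as scv would require. */
long sketch_getuid(void)
{
#if defined(__powerpc64__)
	register long r0 __asm__("r0") = SYS_getuid; /* syscall number */
	register long r3 __asm__("r3");              /* return value */
	__asm__ __volatile__(
		"sc ; bns+ 1f ; neg %1, %1 ; 1:"
		: "+r"(r0), "=r"(r3)
		:
		: "memory", "cr0", "r4", "r5", "r6", "r7", "r8",
		  "r9", "r10", "r11", "r12",
		  "lr" /* the extra clobber: forces the compiler to save lr */);
	return r3;
#else
	/* Fallback for other architectures: go through libc's syscall(). */
	return syscall(SYS_getuid);
#endif
}
```

Comparing the disassembly of this function with and without the "lr"
entry is one way to see the extra store and load of lr at the call site.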

> >> > syscall arg registers still preserved? If not, this is a major cost on
> >> > the userspace side, since any call point that has to loop-and-retry
> >> > (e.g. futex) now needs to make its own place to store the original
> >> > values.)
> >> 
> >> Powerpc system calls never did. We could have scv preserve them, but 
> >> you'd still need to restore r3. We could make an ABI which does not
> >> clobber r3 but puts the return value in r9, say. I'd like to see what
> >> the user side code looks like to take advantage of such a thing though.
> > 
> > Oh wow, I hadn't realized that, but indeed the code we have now is
> > allowing for the kernel to clobber them all. So at least this isn't
> > getting any worse I guess. I think it was a very poor choice of
> > behavior though and a disadvantage vs what other archs do (some of
> > them preserve all registers; others preserve only normally call-saved
> > ones plus the syscall arg ones and possibly a few other specials).
> 
> Well, we could change it. Does the generated code improve significantly
> we take those clobbers away?

I'd have to experiment a bit more to see. It's not going to help at
all in functions which are pure syscall wrappers that just do the
syscall and return, since the arg regs are dead after the syscall
anyway (the caller must assume they were clobbered). But where
syscalls are inlined and used in a loop, like a futex wait, it might
make a nontrivial difference.

Unfortunately even if you did change it for the new scv mechanism, it
would be hard to take advantage of the change while also supporting
sc, unless we used a helper function that just did scv directly, but
saved/restored all the arg regs when using the legacy sc mechanism.
Just inlining the hwcap conditional and clobbering more regs in one
code path than in the other likely would not help; gcc won't
shrink-wrap the clobbered/non-clobbered paths separately, and even if
it did, when this were inlined somewhere like a futex loop, it'd end
up having to lift the conditional out of the loop to be very
advantageous, then making the code much larger by producing two copies
of the loop. So I think just behaving similarly to the old sc method
is probably the best option we have...

Rich

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20  2:08           ` Nicholas Piggin
@ 2020-04-20 21:17             ` Szabolcs Nagy
  2020-04-21  9:57               ` Florian Weimer
  0 siblings, 1 reply; 62+ messages in thread
From: Szabolcs Nagy @ 2020-04-20 21:17 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Rich Felker, Nicholas Piggin via Libc-alpha, libc-dev,
	linuxppc-dev, musl

* Nicholas Piggin <npiggin@gmail.com> [2020-04-20 12:08:36 +1000]:
> Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
> > Also, allowing patching of executable pages is generally frowned upon
> > these days because W^X is a desirable hardening property.
> 
> Right, it would want to be write-protected after being patched.

"frowned upon" means that users may have to update
their security policy setting in pax, selinux, apparmor,
seccomp bpf filters and who knows what else that may
monitor and flag W&X mprotect.

libc update can break systems if the new libc does W&X.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20 21:17             ` Szabolcs Nagy
@ 2020-04-21  9:57               ` Florian Weimer
  0 siblings, 0 replies; 62+ messages in thread
From: Florian Weimer @ 2020-04-21  9:57 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Rich Felker, Nicholas Piggin via Libc-alpha, libc-dev,
	linuxppc-dev, musl

* Szabolcs Nagy:

> * Nicholas Piggin <npiggin@gmail.com> [2020-04-20 12:08:36 +1000]:
>> Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
>> > Also, allowing patching of executable pages is generally frowned upon
>> > these days because W^X is a desirable hardening property.
>> 
>> Right, it would want be write-protected after being patched.
>
> "frowned upon" means that users may have to update
> their security policy setting in pax, selinux, apparmor,
> seccomp bpf filters and who knows what else that may
> monitor and flag W&X mprotect.
>
> libc update can break systems if the new libc does W&X.

It's possible to map over pre-compiled alternative implementations,
though.  Basically, we would do the patching at build time and store
the results in the file.

It works best if the variance is concentrated on a few pages, and
there are very few alternatives.  For example, having two syscall APIs
and supporting threading and no-threading versions would need four
code versions in total, which is likely excessive.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20  1:10               ` Nicholas Piggin
  2020-04-20  1:34                 ` Rich Felker
@ 2020-04-21 12:28                 ` David Laight
  2020-04-21 14:39                   ` Rich Felker
  1 sibling, 1 reply; 62+ messages in thread
From: David Laight @ 2020-04-21 12:28 UTC (permalink / raw)
  To: 'Nicholas Piggin', Adhemerval Zanella, Rich Felker
  Cc: libc-dev, libc-alpha, linuxppc-dev, musl

From: Nicholas Piggin
> Sent: 20 April 2020 02:10
...
> >> Yes, but does it really matter to optimize this specific usage case
> >> for size? glibc, for instance, tries to leverage the syscall mechanism
> >> by adding some complex pre-processor asm directives.  It optimizes
> >> the syscall code size in most cases.  For instance, kill in static case
> >> generates on x86_64:
> >>
> >> 0000000000000000 <__kill>:
> >>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
> >>    5:   0f 05                   syscall
> >>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
> >>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>

Hmmm... that cmp + jae is unnecessary here.
It is also a 32bit offset jump.
I also suspect it gets predicted very badly.

> >>   13:   c3                      retq
> >>
> >> While on musl:
> >>
> >> 0000000000000000 <kill>:
> >>    0:	48 83 ec 08          	sub    $0x8,%rsp
> >>    4:	48 63 ff             	movslq %edi,%rdi
> >>    7:	48 63 f6             	movslq %esi,%rsi
> >>    a:	b8 3e 00 00 00       	mov    $0x3e,%eax
> >>    f:	0f 05                	syscall
> >>   11:	48 89 c7             	mov    %rax,%rdi
> >>   14:	e8 00 00 00 00       	callq  19 <kill+0x19>
> >>   19:	5a                   	pop    %rdx
> >>   1a:	c3                   	retq
> >
> > Wow that's some extraordinarily bad codegen going on by gcc... The
> > sign-extension is semantically needed and I don't see a good way
> > around it (glibc's asm is kinda a hack taking advantage of kernel not
> > looking at high bits, I think), but the gratuitous stack adjustment
> > and refusal to generate a tail call isn't. I'll see if we can track
> > down what's going on and get it fixed.

A suitable cast might get rid of the sign extension.
Possibly just (unsigned int).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-21 12:28                 ` David Laight
@ 2020-04-21 14:39                   ` Rich Felker
  2020-04-21 15:00                     ` Adhemerval Zanella
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-21 14:39 UTC (permalink / raw)
  To: David Laight
  Cc: 'Nicholas Piggin',
	Adhemerval Zanella, libc-dev, libc-alpha, linuxppc-dev, musl

On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote:
> From: Nicholas Piggin
> > Sent: 20 April 2020 02:10
> ...
> > >> Yes, but does it really matter to optimize this specific usage case
> > >> for size? glibc, for instance, tries to leverage the syscall mechanism
> > >> by adding some complex pre-processor asm directives.  It optimizes
> > >> the syscall code size in most cases.  For instance, kill in static case
> > >> generates on x86_64:
> > >>
> > >> 0000000000000000 <__kill>:
> > >>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
> > >>    5:   0f 05                   syscall
> > >>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
> > >>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
> 
> Hmmm... that cmp + jae is unnecessary here.

It's not. Rather, the objdump was just mistakenly done without -r, so
it looks like a nop jump rather than a conditional tail call to the
function that sets errno.

> It is also a 32bit offset jump.
> I also suspect it gets predicted very badly.

I doubt that. This is a very standard idiom and the size of the offset
(which is necessarily 32-bit because it has a relocation on it) is
orthogonal to the condition on the jump.

FWIW a syscall like kill takes global kernel-side locks to be able to
address a target process by pid, and the rate of meaningful calls you
can make to it is very low (since it's bounded by time for target
process to act on the signal). Trying to optimize it for speed is
pointless, and even size isn't important locally (although in
aggregate, lots of wasted small size can add up to more pages = more
TLB entries = ...).

> > >>   13:   c3                      retq
> > >>
> > >> While on musl:
> > >>
> > >> 0000000000000000 <kill>:
> > >>    0:	48 83 ec 08          	sub    $0x8,%rsp
> > >>    4:	48 63 ff             	movslq %edi,%rdi
> > >>    7:	48 63 f6             	movslq %esi,%rsi
> > >>    a:	b8 3e 00 00 00       	mov    $0x3e,%eax
> > >>    f:	0f 05                	syscall
> > >>   11:	48 89 c7             	mov    %rax,%rdi
> > >>   14:	e8 00 00 00 00       	callq  19 <kill+0x19>
> > >>   19:	5a                   	pop    %rdx
> > >>   1a:	c3                   	retq
> > >
> > > Wow that's some extraordinarily bad codegen going on by gcc... The
> > > sign-extension is semantically needed and I don't see a good way
> > > around it (glibc's asm is kinda a hack taking advantage of kernel not
> > > looking at high bits, I think), but the gratuitous stack adjustment
> > > and refusal to generate a tail call isn't. I'll see if we can track
> > > down what's going on and get it fixed.
> 
> A suitable cast might get rid of the sign extension.
> Possibly just (unsigned int).

No, it won't. The problem is that there is no representation of the
fact that the kernel is only going to inspect the low 32 bits (by
declaring the kernel-side function as taking an int argument). The
external kill function receives arguments by the ABI, where the upper
bits of int args can contain junk, and the asm register constraints
for syscalls use longs (or rather an abstract syscall-arg type). It
wouldn't even work to have macro magic detect that the expressions
passed are ints and use hacks to avoid that, since it's perfectly
valid to pass an int to a syscall that expects a long argument (e.g.
offset to mmap), in which case it needs to be sign-extended.

The only way to avoid this is encoding somewhere the syscall-specific
knowledge of what arg size the kernel function expects. That's way too
much redundant effort and too error-prone for the incredibly minuscule
size benefit you'd get out of it.
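
To make the sign-extension point concrete, here is a small sketch in
portable C. The type sketch_arg_t and both function names are
hypothetical stand-ins for the abstract syscall-arg type and the
conversion a generic wrapper performs; they are not musl identifiers.

```c
#include <stdint.h>

/* Hypothetical stand-in for the abstract syscall-arg type: one 64-bit
 * register's worth of value. */
typedef int64_t sketch_arg_t;

/* What a generic wrapper has to do with an int argument: convert by
 * value, which sign-extends it into the full register. */
sketch_arg_t arg_sign_extended(int v)
{
	return (sketch_arg_t)v;
}

/* What an (unsigned int) cast would do instead: zero-extend, which
 * changes the value of any negative argument. */
sketch_arg_t arg_zero_extended(int v)
{
	return (sketch_arg_t)(unsigned int)v;
}
```

For a negative int passed where the kernel expects a long, only the
first form delivers the same numeric value; the second hands over a
large positive number.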

Rich

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-21 14:39                   ` Rich Felker
@ 2020-04-21 15:00                     ` Adhemerval Zanella
  2020-04-21 15:31                       ` David Laight
  2020-04-22  6:54                       ` [musl] " Nicholas Piggin
  0 siblings, 2 replies; 62+ messages in thread
From: Adhemerval Zanella @ 2020-04-21 15:00 UTC (permalink / raw)
  To: Rich Felker, David Laight
  Cc: 'Nicholas Piggin', libc-dev, libc-alpha, linuxppc-dev, musl



On 21/04/2020 11:39, Rich Felker wrote:
> On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote:
>> From: Nicholas Piggin
>>> Sent: 20 April 2020 02:10
>> ...
>>>>> Yes, but does it really matter to optimize this specific usage case
>>>>> for size? glibc, for instance, tries to leverage the syscall mechanism
>>>>> by adding some complex pre-processor asm directives.  It optimizes
>>>>> the syscall code size in most cases.  For instance, kill in static case
>>>>> generates on x86_64:
>>>>>
>>>>> 0000000000000000 <__kill>:
>>>>>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
>>>>>    5:   0f 05                   syscall
>>>>>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
>>>>>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
>>
>> Hmmm... that cmp + jae is unnecessary here.
> 
> It's not.. Rather the objdump was just mistakenly done without -r so
> it looks like a nop jump rather than a conditional tail call to the
> function that sets errno.
> 

Indeed, the output with -r is:

0000000000000000 <__kill>:
   0:   b8 3e 00 00 00          mov    $0x3e,%eax
   5:   0f 05                   syscall 
   7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
   d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
                        f: R_X86_64_PLT32       __syscall_error-0x4
  13:   c3                      retq   

And for x86_64 __syscall_error is defined as:

0000000000000000 <__syscall_error>:
   0:   48 f7 d8                neg    %rax

0000000000000003 <__syscall_error_1>:
   3:   64 89 04 25 00 00 00    mov    %eax,%fs:0x0
   a:   00
                        7: R_X86_64_TPOFF32     errno
   b:   48 83 c8 ff             or     $0xffffffffffffffff,%rax
   f:   c3                      retq

Unlike musl, glibc has each architecture define its own error handling
mechanism (some embed the errno setting in the syscall itself, others branch
to a __syscall_error-like function, as x86_64 does).

This is most likely due to glibc's long history.  One of my long-term
plans is to simplify this: get rid of the assembly pre-processor,
implement all syscalls in C code, and set errno in a platform-neutral
way using a tail call (most likely as you do on musl).
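
A minimal sketch of such a platform-neutral error path in C, assuming
the usual Linux convention that raw syscall results in [-4095, -1]
encode an error number. The name sketch_syscall_ret is hypothetical; the
range test matches the cmp $0xfffffffffffff001 / jae pair in the
disassembly above.

```c
#include <errno.h>

/* Hypothetical helper: turn a raw kernel return value into the libc
 * convention of "-1 and errno set" on failure. */
long sketch_syscall_ret(unsigned long r)
{
	if (r > -4096UL) {        /* r is in [-4095, -1] when viewed as signed */
		errno = (int)-r;  /* negate to recover the positive error code */
		return -1;
	}
	return (long)r;           /* success: pass the value through */
}
```

A wrapper can then end with a call to this helper in tail position,
leaving it to the compiler to emit a tail call.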

>> It is also a 32bit offset jump.
>> I also suspect it gets predicted very badly.
> 
> I doubt that. This is a very standard idiom and the size of the offset
> (which is necessarily 32-bit because it has a relocation on it) is
> orthogonal to the condition on the jump.
> 
> FWIW a syscall like kill takes global kernel-side locks to be able to
> address a target process by pid, and the rate of meaningful calls you
> can make to it is very low (since it's bounded by time for target
> process to act on the signal). Trying to optimize it for speed is
> pointless, and even size isn't important locally (although in
> aggregate, lots of wasted small size can add up to more pages = more
> TLB entries = ...).

I agree, and I would prefer to focus on code simplicity, with a
platform-neutral way to handle errors that lets the compiler optimize
it, rather than mess with assembly macros to squeeze out this kind of
micro-optimization.

> 
>>>>>   13:   c3                      retq
>>>>>
>>>>> While on musl:
>>>>>
>>>>> 0000000000000000 <kill>:
>>>>>    0:	48 83 ec 08          	sub    $0x8,%rsp
>>>>>    4:	48 63 ff             	movslq %edi,%rdi
>>>>>    7:	48 63 f6             	movslq %esi,%rsi
>>>>>    a:	b8 3e 00 00 00       	mov    $0x3e,%eax
>>>>>    f:	0f 05                	syscall
>>>>>   11:	48 89 c7             	mov    %rax,%rdi
>>>>>   14:	e8 00 00 00 00       	callq  19 <kill+0x19>
>>>>>   19:	5a                   	pop    %rdx
>>>>>   1a:	c3                   	retq
>>>>
>>>> Wow that's some extraordinarily bad codegen going on by gcc... The
>>>> sign-extension is semantically needed and I don't see a good way
>>>> around it (glibc's asm is kinda a hack taking advantage of kernel not
>>>> looking at high bits, I think), but the gratuitous stack adjustment
>>>> and refusal to generate a tail call isn't. I'll see if we can track
>>>> down what's going on and get it fixed.
>>
>> A suitable cast might get rid of the sign extension.
>> Possibly just (unsigned int).
> 
> No, it won't. The problem is that there is no representation of the
> fact that the kernel is only going to inspect the low 32 bits (by
> declaring the kernel-side function as taking an int argument). The
> external kill function receives arguments by the ABI, where the upper
> bits of int args can contain junk, and the asm register constraints
> for syscalls use longs (or rather an abstract syscall-arg type). It
> wouldn't even work to have macro magic detect that the expressions
> passed are ints and use hacks to avoid that, since it's perfectly
> valid to pass an int to a syscall that expects a long argument (e.g.
> offset to mmap), in which case it needs to be sign-extended.
> 
> The only way to avoid this is encoding somewhere the syscall-specific
> knowledge of what arg size the kernel function expects. That's way too
> much redundant effort and too error-prone for the incredibly miniscule
> size benefit you'd get out of it.
> 
> Rich
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-21 15:00                     ` Adhemerval Zanella
@ 2020-04-21 15:31                       ` David Laight
  2020-04-22  6:54                       ` [musl] " Nicholas Piggin
  1 sibling, 0 replies; 62+ messages in thread
From: David Laight @ 2020-04-21 15:31 UTC (permalink / raw)
  To: 'Adhemerval Zanella', Rich Felker
  Cc: 'Nicholas Piggin', libc-dev, libc-alpha, linuxppc-dev, musl

From: Adhemerval Zanella
> Sent: 21 April 2020 16:01
> 
> On 21/04/2020 11:39, Rich Felker wrote:
> > On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote:
> >> From: Nicholas Piggin
> >>> Sent: 20 April 2020 02:10
> >> ...
> >>>>> Yes, but does it really matter to optimize this specific usage case
> >>>>> for size? glibc, for instance, tries to leverage the syscall mechanism
> >>>>> by adding some complex pre-processor asm directives.  It optimizes
> >>>>> the syscall code size in most cases.  For instance, kill in static case
> >>>>> generates on x86_64:
> >>>>>
> >>>>> 0000000000000000 <__kill>:
> >>>>>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
> >>>>>    5:   0f 05                   syscall
> >>>>>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
> >>>>>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
> >>
> >> Hmmm... that cmp + jae is unnecessary here.
> >
> > It's not.. Rather the objdump was just mistakenly done without -r so
> > it looks like a nop jump rather than a conditional tail call to the
> > function that sets errno.
> >
> 
> Indeed, the output with -r is:
> 
> 0000000000000000 <__kill>:
>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
>    5:   0f 05                   syscall
>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
>                         f: R_X86_64_PLT32       __syscall_error-0x4
>   13:   c3                      retq

Yes, I probably should have remembered it looked like that :-)
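
For anyone following along, the cmp/jae pair in the listing above is the
usual Linux test for a raw return value in the range [-4095, -1]. A
minimal C sketch of the same check follows. The name syscall_ret_sketch
is made up for illustration; it only stands in for whatever the libc
really tail-calls (__syscall_error in the disassembly), whose actual
implementation is not shown in this thread.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper, not taken from glibc or musl: turn a raw kernel
 * return value into the C convention of -1 plus errno.
 */
static long syscall_ret_sketch(unsigned long r)
{
    /* Same test as `cmp $-4095,%rax ; jae`: r lies in [-4095, -1]. */
    if (r > -4096UL) {
        errno = (int)-r;
        return -1;
    }
    return (long)r;
}

/* Small self-check showing the intended behaviour. */
static void syscall_ret_selfcheck(void)
{
    assert(syscall_ret_sketch(0) == 0);
    assert(syscall_ret_sketch((unsigned long)-(long)EPERM) == -1);
    assert(errno == EPERM);
}
```

The unsigned comparison is what lets one compare-and-branch cover the
whole error range, which is why the asm needs only the single cmp.
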
...
> >> I also suspect it gets predicted very badly.
> >
> > I doubt that. This is a very standard idiom and the size of the offset
> > (which is necessarily 32-bit because it has a relocation on it) is
> > orthogonal to the condition on the jump.

Yes, it only gets mispredicted as badly as any other conditional jump.
I believe modern Intel x86 will randomly predict it taken (regardless
of the direction) and then hit a TLB fault on text.unlikely :-)

> > FWIW a syscall like kill takes global kernel-side locks to be able to
> > address a target process by pid, and the rate of meaningful calls you
> > can make to it is very low (since it's bounded by time for target
> > process to act on the signal). Trying to optimize it for speed is
> > pointless, and even size isn't important locally (although in
> > aggregate, lots of wasted small size can add up to more pages = more
> > TLB entries = ...).
> 
> I agree and I would prefer to focus on code simplicity to have a
> platform neutral way to handle error and let the compiler optimize
> it than messy with assembly macros to squeeze this kind of
> micro-optimizations.

syscall entry does get micro-optimised.
Real speed-ups can probably be found by optimising other places.
I've a patch I need to resubmit that should improve the reading
of iov[] from user space.

	David


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-20 17:27                         ` Rich Felker
@ 2020-04-22  6:18                           ` Nicholas Piggin
  2020-04-22  6:29                             ` Nicholas Piggin
  2020-04-23  2:36                             ` Rich Felker
  0 siblings, 2 replies; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-22  6:18 UTC (permalink / raw)
  To: Rich Felker; +Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl

Excerpts from Rich Felker's message of April 21, 2020 3:27 am:
> On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
>> > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> >> > Note that because lr is clobbered we need at least one normally
>> >> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> >> > Otherwise stack frame setup is required to spill it.
>> >> >> 
>> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
>> >> >> registers, but we have some delay establishing the stack (depends on a
>> >> >> load which depends on a mfspr), and entry code tends to be quite store
>> >> >> heavy whereas on the caller side you have r1 set up (modulo stack 
>> >> >> updates), and the system call is a long delay during which time the 
>> >> >> store queue has significant time to drain.
>> >> >> 
>> >> >> My feeling is it would be better for kernel to have these scratch 
>> >> >> registers.
>> >> > 
>> >> > If your new kernel syscall mechanism requires the caller to make a
>> >> > whole stack frame it otherwise doesn't need and spill registers to it,
>> >> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> >> > immediately lost on the userspace side, plus you either waste icache
>> >> > at the call point or require the syscall to go through a
>> >> > userspace-side helper function that performs the spill and restore.
>> >> 
>> >> You would be surprised how few cycles that takes on a high end CPU. Some 
>> >> might be a couple of %. I am one for counting cycles mind you, I'm not 
>> >> being flippant about it. If we can come up with something faster I'd be 
>> >> up for it.
>> > 
>> > If the cycle count is trivial then just do it on the kernel side.
>> 
>> The cycle count for user is, because you have r1 ready. Kernel does not 
>> have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to 
>> save into.
>> 
>> Which is also wasted work for a userspace.
>> 
>> Now that I think about it, no stack frame is even required! lr is saved 
>> into the caller's stack when it's clobbered with an asm, just as when 
>> it's used for a function call.
> 
> No. If there is a non-clobbered register, lr can be moved to the
> non-clobbered register rather than saved to the stack. However it
> looks like (1) gcc doesn't take advantage of that possibility, but (2)
> the caller already arranged for there to be space on the stack to save
> lr, so the cost is only one store and one load, not any stack
> adjustment or other frame setup. So it's probably not a really big
> deal. However, just adding "lr" clobber to existing syscall in musl
> increased the size of a simple syscall function (getuid) from 20 bytes
> to 36 bytes.

Yeah I had a bit of a play around with musl (which is very nice code I
must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
Fortunately adding it doesn't change code generation for me, but it 
should be fixed. glibc had the same bug at one point I think (probably 
due to syscall ABI documentation not existing -- something now lives in 
linux/Documentation/powerpc/syscall64-abi.rst).

Yes lr needs to be saved, I didn't see any new requirement for stack
frames, and it was often already saved, but it does hurt the small 
wrapper functions.

I did look at entirely replacing sc with scv though, just as an 
experiment. One day you might make sc optional! Text size improves by 
about 3kB with the proposed ABI.

Mostly seems to be the bns+ ; neg sequence. __syscall1/2/3 get
out-of-lined by the compiler in a lot of cases. Linux's bloat-o-meter 
says:

add/remove: 0/5 grow/shrink: 24/260 up/down: 220/-3428 (-3208)
Function                                     old     new   delta
fcntl                                        400     424     +24
popen                                        600     620     +20
times                                         32      40      +8
[...]
alloc_rev                                    816     784     -32
alloc_fwd                                    812     780     -32
__syscall1.constprop                          32       -     -32
__fdopen                                     504     472     -32
__expand_heap                                628     592     -36
__syscall2                                    40       -     -40
__syscall3                                    44       -     -44
fchmodat                                     372     324     -48
__wake.constprop                             408     360     -48
child                                       1116    1064     -52
checker                                      220     156     -64
__bin_chunk                                 1576    1512     -64
malloc                                      1940    1860     -80
__syscall3.constprop                          96       -     -96
__syscall1                                   108       -    -108
Total: Before=613379, After=610171, chg -0.52%
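
As a rough C model of what the bns+ ; neg sequence computes: with 'sc'
the error is flagged in cr0[SO] and r3 holds a positive errno, so every
inlined call site needs a conditional negate to get the -errno form,
while the proposed scv ABI returns -errno in r3 directly. The struct and
function names below are invented for illustration; the real code does
this in inline asm, not in C.

```c
#include <assert.h>

/* Illustrative model only: the state a legacy 'sc' leaves for the caller. */
struct sc_result_sketch {
    long r3;      /* return value, or a positive errno on failure */
    int  cr0_so;  /* nonzero when cr0[SO] reports an error */
};

/* What `bns+ 1f ; neg r3,r3 ; 1:` computes after every 'sc'. */
static long sc_fixup_sketch(struct sc_result_sketch res)
{
    /* On failure the kernel is expected to supply a positive errno. */
    assert(!res.cr0_so || res.r3 > 0);
    return res.cr0_so ? -res.r3 : res.r3;
}

/* With the proposed scv ABI the kernel hands back -errno in r3 itself. */
static long scv_fixup_sketch(long r3)
{
    return r3; /* nothing left to do at the call site */
}
```

Dropping the branch and negate at each inlined call site is consistent
with the many small per-function decreases in the table above.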

Now if we go a step further we could preserve r0,r4-r8. That gives the 
kernel r9-r12 as scratch while leaving userspace with some spare 
volatile GPRs except in the uncommon syscall6 case.

  static inline long __syscall0(long n)
  {
      register long r0 __asm__("r0") = n;
      register long r3 __asm__("r3");
      __asm__ __volatile__("scv 0"
      : "=r"(r3)
      : "r"(r0)
      : "memory", "cr0", "cr1", "cr5", "cr6", "cr7", "lr", "ctr", "r9", "r10", "r11", "r12");
      return r3;
  }

That saves another ~400 bytes, reducing some of the register shuffling 
for futex loops etc:

[...]
__pthread_cond_timedwait                     964     944     -20
__expand_heap                                592     572     -20
socketpair                                   292     268     -24
__wake.constprop                             360     336     -24
malloc                                      1860    1828     -32
__bin_chunk                                 1512    1472     -40
fcntl                                        424     376     -48
Total: Before=610171, After=609723, chg -0.07%

As you say, the compiler doesn't do a good job of saving lr in a spare 
GPR unfortunately. Saving it ourselves to eliminate the lr clobber is no 
good because it's almost always already saved. At least having 
non-clobbered volatile GPRs could let a future smarter compiler take 
advantage.

If we go further and try to preserve r3 as well by putting the return 
value in r9 or r0, we go backwards about 300 bytes. It's good for the 
lock loops and complex functions, but hurts a lot of simpler functions 
that have to add 'mr r3,r9' etc.  

Most of the time there are saved non-volatile GPRs around anyway though, 
so not sure which way to go on this. Text size savings can't be ignored
and it's pretty easy for the kernel to do (we already save r3-r8 and
zero them on exit, so we could load them instead from a cache line that
should be hot).

So I may be inclined to go this way, even if we won't see benefit now.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-22  6:18                           ` Nicholas Piggin
@ 2020-04-22  6:29                             ` Nicholas Piggin
  2020-04-23  2:36                             ` Rich Felker
  1 sibling, 0 replies; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-22  6:29 UTC (permalink / raw)
  To: Rich Felker; +Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl

Excerpts from Nicholas Piggin's message of April 22, 2020 4:18 pm:
> If we go further and try to preserve r3 as well by putting the return 
> value in r9 or r0, we go backwards about 300 bytes. It's good for the 
> lock loops and complex functions, but hurts a lot of simpler functions 
> that have to add 'mr r3,r9' etc.  
> 
> Most of the time there are saved non-volatile GPRs around anyway though, 
> so not sure which way to go on this. Text size savings can't be ignored
> and it's pretty easy for the kernel to do (we already save r3-r8 and
> zero them on exit, so we could load them instead from a cache line that
> should be hot).
> 
> So I may be inclined to go this way, even if we won't see benefit now.

By "this way" I don't mean r9 or r0 return value (which is larger code),
but r3 return value with r0,r4-r8 preserved.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-21 15:00                     ` Adhemerval Zanella
  2020-04-21 15:31                       ` David Laight
@ 2020-04-22  6:54                       ` Nicholas Piggin
  2020-04-22  7:15                         ` Florian Weimer
  1 sibling, 1 reply; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-22  6:54 UTC (permalink / raw)
  To: Adhemerval Zanella, Rich Felker, David Laight
  Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool

Let me try to summarise what we have.

- vdso style call is ruled out as unnecessary with possible security 
  concerns. Caller can internally use indirect branch to select variant 
  if it wants to use that mechanism to select.

- LR clobber seems to be handled okay by gcc. It can increase the size of
  small leaf wrapper functions, though not by a huge amount, since they
  can save LR in the caller's stack frame (and even use the red zone to
  save other things if necessary).

- -ve error return seems to be favoured by everyone. Experimentally, 
  it's better for musl (but musl could probably improve cr0[SO] error 
  handling a bit with 'asm goto').

- Preserving syscall args and volatiles up to r8 is a small but 
  noticeable help for cases that inline the call rather than always call 
  wrappers. This is unlikely to be helpful unless 'sc' support is 
  compiled out but I'll consider doing it for the long term. Next step 
  is to trace and test on real hardware.

- One thing that nobody has really asked about is error handling for 
  unsupported scv vectors, so I would like to just go over it:

Today, the scv facility is disabled by the kernel (FSCR[SCV] is 
cleared), which makes any `scv` instruction take a facility
unavailable interrupt. That ends up printing a kernel message about
the SCV facility being unavailable, and SIGILLs the process with
ILL_ILLOPC.

Enabling 'scv 0' will enable 1-127 as well, so the kernel has to handle 
those somehow.

What we are saying is that we will allocate HWCAP bits in future if we 
implement more scv vectors, so userspace is not *supposed* to rely on 
this, but kernel has to choose some behaviour for invalid vectors.
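
A minimal sketch of the userspace side of that advertisement, assuming
the HWCAP2 bit ends up being set once 'scv 0' is supported. The function
name is made up, and the fallback value of PPC_FEATURE2_SCV is quoted
from memory of the powerpc uapi cputable.h, so it should be checked
against the kernel headers rather than trusted.

```c
#include <assert.h>
#include <sys/auxv.h>

/*
 * Assumed value, recalled from the powerpc uapi cputable.h; verify it
 * against the kernel headers before relying on it.
 */
#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000
#endif

/* Returns 1 if the kernel advertises 'scv 0' syscalls, 0 otherwise. */
static int have_scv0_sketch(void)
{
#if defined(__powerpc__) || defined(__powerpc64__)
    return (getauxval(AT_HWCAP2) & PPC_FEATURE2_SCV) != 0;
#else
    return 0; /* the bit only has this meaning on powerpc */
#endif
}
```

A libc would presumably test this once at startup and choose between
'sc' and 'scv 0' from then on; further vectors would each need their
own bit, as described above.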

My proposal was to do the same SIGILL (with no kernel facility message),
so it appears to behave the same way to userspace as it does now. There 
is also the ILL_ILLOPN code that could be used for an invalid operand,
but powerpc does not use this much. Operands that are statically encoded
in the instruction (e.g., an invalid mfspr) already generate ILL_ILLOPC,
so we could consider the entire instruction as the opcode, and input
register values as operands.

Now I don't know why a process would want to distinguish between 
FSCR[SCV]=0 and the case where it is enabled but kernel doesn't 
implement the vector, but maybe it does?

Another option would be to use a different signal. I don't see that any 
are more suitable.

Or return without a signal but -ENOSYS or something in r3. This doesn't 
seem so good because an invalid scv vector is not a system call, and a 
failure ABI would constrain any future implementation just a little bit.

Any objections to SIGILL ILL_ILLOPC?
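
To make the question concrete, here is roughly all a process could
observe under this proposal: a SIGILL handler sees si_code, and if both
cases report ILL_ILLOPC it cannot tell FSCR[SCV]=0 apart from an
unimplemented vector. This is only a sketch with made-up names, and it
records the code rather than recovering from the fault.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t sigill_seen_sketch;
static volatile sig_atomic_t sigill_code_sketch;

/* Record how the SIGILL was generated. */
static void on_sigill_sketch(int sig, siginfo_t *si, void *uctx)
{
    (void)sig;
    (void)uctx;
    sigill_code_sketch = si->si_code; /* ILL_ILLOPC, ILL_ILLOPN, ... */
    sigill_seen_sketch = 1;
    /*
     * Caution: for a real faulting instruction, returning from here
     * re-executes it. A real handler would have to exit or step the
     * saved PC past the instruction.
     */
}

/* Install the recorder; returns 0 on success, like sigaction(). */
static int install_sigill_recorder_sketch(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigill_sketch;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGILL, &sa, NULL);
}
```

A SIGILL sent with raise() or kill() arrives with a different si_code,
so a handler like this can at least tell a genuine illegal-instruction
fault from a signal sent by software.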

Thanks,
Nick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-22  6:54                       ` [musl] " Nicholas Piggin
@ 2020-04-22  7:15                         ` Florian Weimer
  2020-04-22  7:31                           ` Nicholas Piggin
  0 siblings, 1 reply; 62+ messages in thread
From: Florian Weimer @ 2020-04-22  7:15 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Adhemerval Zanella, Rich Felker, David Laight, musl, libc-alpha,
	libc-dev, linuxppc-dev, Segher Boessenkool

* Nicholas Piggin:

> Another option would be to use a different signal. I don't see that any 
> are more suitable.

SIGSYS comes to my mind.  But I don't know how exclusively it is
associated with seccomp these days.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-22  7:15                         ` Florian Weimer
@ 2020-04-22  7:31                           ` Nicholas Piggin
  2020-04-22  8:11                             ` Florian Weimer
  0 siblings, 1 reply; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-22  7:31 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Adhemerval Zanella, Rich Felker, David Laight, libc-alpha,
	libc-dev, linuxppc-dev, musl, Segher Boessenkool

Excerpts from Florian Weimer's message of April 22, 2020 5:15 pm:
> * Nicholas Piggin:
> 
>> Another option would be to use a different signal. I don't see that any 
>> are more suitable.
> 
> SIGSYS comes to my mind.  But I don't know how exclusively it is
> associated with seccomp these days.

SIGSYS is entirely seccomp now. There looks to be a single obscure MIPS 
user of it in Linux that's not seccomp, but it would be entirely new for 
powerpc (or any of the common platforms, arm, x86 etc).

So I would be disinclined to use SIGSYS unless there are no other better 
signal types, and we don't want to use SIGILL for some good reason -- is 
there a good reason to add complexity for userspace by differentiating 
these two situations?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-22  7:31                           ` Nicholas Piggin
@ 2020-04-22  8:11                             ` Florian Weimer
  0 siblings, 0 replies; 62+ messages in thread
From: Florian Weimer @ 2020-04-22  8:11 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Adhemerval Zanella, Rich Felker, David Laight, libc-alpha,
	libc-dev, linuxppc-dev, musl, Segher Boessenkool

* Nicholas Piggin:

> So I would be disinclined to use SIGSYS unless there are no other better 
> signal types, and we don't want to use SIGILL for some good reason -- is 
> there a good reason to add complexity for userspace by differentiating 
> these two situations?

No, SIGILL seems fine to me.  scv 0 and scv 1 could well be considered
different instructions eventually (with different mnemonics).

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-22  6:18                           ` Nicholas Piggin
  2020-04-22  6:29                             ` Nicholas Piggin
@ 2020-04-23  2:36                             ` Rich Felker
  2020-04-23 12:13                               ` Adhemerval Zanella
  2020-04-25  3:30                               ` Nicholas Piggin
  1 sibling, 2 replies; 62+ messages in thread
From: Rich Felker @ 2020-04-23  2:36 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl

On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> Yeah I had a bit of a play around with musl (which is very nice code I
> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
> Fortunately adding it doesn't change code generation for me, but it 
> should be fixed. glibc had the same bug at one point I think (probably 
> due to syscall ABI documentation not existing -- something now lives in 
> linux/Documentation/powerpc/syscall64-abi.rst).

Do you know anywhere I can read about the ctr issue, possibly the
relevant glibc bug report? I'm not particularly familiar with ppc
register file (at least I have to refamiliarize myself every time I
work on this stuff) so it'd be nice to understand what's
potentially-wrong now.

Rich

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-23  2:36                             ` Rich Felker
@ 2020-04-23 12:13                               ` Adhemerval Zanella
  2020-04-23 16:18                                 ` Rich Felker
  2020-04-25  3:30                               ` Nicholas Piggin
  1 sibling, 1 reply; 62+ messages in thread
From: Adhemerval Zanella @ 2020-04-23 12:13 UTC (permalink / raw)
  To: Rich Felker, Nicholas Piggin; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl



On 22/04/2020 23:36, Rich Felker wrote:
> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>> Yeah I had a bit of a play around with musl (which is very nice code I
>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
>> Fortunately adding it doesn't change code generation for me, but it 
>> should be fixed. glibc had the same bug at one point I think (probably 
>> due to syscall ABI documentation not existing -- something now lives in 
>> linux/Documentation/powerpc/syscall64-abi.rst).
> 
> Do you know anywhere I can read about the ctr issue, possibly the
> relevant glibc bug report? I'm not particularly familiar with ppc
> register file (at least I have to refamiliarize myself every time I
> work on this stuff) so it'd be nice to understand what's
> potentially-wrong now.

My understanding is the ctr issue only happens for vDSO calls that
fall back to a syscall in case of an error (invalid argument, etc.,
assuming that if the vDSO does not fall back to a syscall it always
succeeds). This gives the vDSO call on powerpc the same ABI constraint
as a syscall, where it clobbers CR0.

On glibc we handle by simulating a function call and analysing the CR0
result:

      __asm__ __volatile__ 
      ("mtctr %0\n\t"
       "bctrl\n\t"
       "mfcr  %0\n\t"
       "0:"
       : "+r" (r0), "+r" (r3), "+r" (r4), "+r" (r5),  "+r" (r6),
         "+r" (r7), "+r" (r8)
       : : "r9", "r10", "r11", "r12", "cr0", "ctr", "lr", "memory");
      __asm__ __volatile__ ("" : "=r" (rval) : "r" (r3));

On musl you don't have this issue because it does not enable vDSO
support on powerpc.  And if it eventually does so with the VDSO_*
macros, the only issue I see is when the vDSO falls back to the syscall
and that also fails (the return code won't be negated, since musl
uses a plain C function pointer call, which does not model the CR0
kernel ABI).

So I think the extra ctr constraint on glibc powerpc syscall code is
not really required.  I think I have some patches to optimize this a
bit based on previous discussions.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-23 12:13                               ` Adhemerval Zanella
@ 2020-04-23 16:18                                 ` Rich Felker
  2020-04-23 16:35                                   ` Adhemerval Zanella
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-23 16:18 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Nicholas Piggin, libc-alpha, libc-dev, linuxppc-dev, musl

On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
> 
> 
> On 22/04/2020 23:36, Rich Felker wrote:
> > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> >> Yeah I had a bit of a play around with musl (which is very nice code I
> >> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
> >> Fortunately adding it doesn't change code generation for me, but it 
> >> should be fixed. glibc had the same bug at one point I think (probably 
> >> due to syscall ABI documentation not existing -- something now lives in 
> >> linux/Documentation/powerpc/syscall64-abi.rst).
> > 
> > Do you know anywhere I can read about the ctr issue, possibly the
> > relevant glibc bug report? I'm not particularly familiar with ppc
> > register file (at least I have to refamiliarize myself every time I
> > work on this stuff) so it'd be nice to understand what's
> > potentially-wrong now.
> 
> My understanding is the ctr issue only happens for vDSO calls where it
> fallback to a syscall in case an error (invalid argument, etc. and
> assuming if vDSO does not fallback to a syscall it always succeed).
> This makes the vDSO call on powerpc to have the same ABI constraint
> as a syscall, where it clobbers CR0.

I think you mean "vsyscall", the old thing glibc used where there are
in-userspace implementations of some syscalls with call interfaces
roughly equivalent to a syscall. musl has never used this. It only
uses the actual exported functions from the vdso which have normal
external function call ABI.

Rich

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-23 16:18                                 ` Rich Felker
@ 2020-04-23 16:35                                   ` Adhemerval Zanella
  2020-04-23 16:43                                     ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Adhemerval Zanella @ 2020-04-23 16:35 UTC (permalink / raw)
  To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, libc-dev, linuxppc-dev, musl



On 23/04/2020 13:18, Rich Felker wrote:
> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 22/04/2020 23:36, Rich Felker wrote:
>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>>>> Yeah I had a bit of a play around with musl (which is very nice code I
>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
>>>> Fortunately adding it doesn't change code generation for me, but it 
>>>> should be fixed. glibc had the same bug at one point I think (probably 
>>>> due to syscall ABI documentation not existing -- something now lives in 
>>>> linux/Documentation/powerpc/syscall64-abi.rst).
>>>
>>> Do you know anywhere I can read about the ctr issue, possibly the
>>> relevant glibc bug report? I'm not particularly familiar with ppc
>>> register file (at least I have to refamiliarize myself every time I
>>> work on this stuff) so it'd be nice to understand what's
>>> potentially-wrong now.
>>
>> My understanding is the ctr issue only happens for vDSO calls where it
>> fallback to a syscall in case an error (invalid argument, etc. and
>> assuming if vDSO does not fallback to a syscall it always succeed).
>> This makes the vDSO call on powerpc to have the same ABI constraint
>> as a syscall, where it clobbers CR0.
> 
> I think you mean "vsyscall", the old thing glibc used where there are
> in-userspace implementations of some syscalls with call interfaces
> roughly equivalent to a syscall. musl has never used this. It only
> uses the actual exported functions from the vdso which have normal
> external function call ABI.

I wasn't thinking of vsyscall in fact, which afaik is an x86 thing.
The issue is indeed when calling the powerpc provided functions in
the vDSO, which musl might want to do eventually.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-23 16:35                                   ` Adhemerval Zanella
@ 2020-04-23 16:43                                     ` Rich Felker
  2020-04-23 17:15                                       ` Adhemerval Zanella
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-23 16:43 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Nicholas Piggin, libc-alpha, libc-dev, linuxppc-dev, musl

On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 23/04/2020 13:18, Rich Felker wrote:
> > On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 22/04/2020 23:36, Rich Felker wrote:
> >>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> >>>> Yeah I had a bit of a play around with musl (which is very nice code I
> >>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
> >>>> Fortunately adding it doesn't change code generation for me, but it 
> >>>> should be fixed. glibc had the same bug at one point I think (probably 
> >>>> due to syscall ABI documentation not existing -- something now lives in 
> >>>> linux/Documentation/powerpc/syscall64-abi.rst).
> >>>
> >>> Do you know anywhere I can read about the ctr issue, possibly the
> >>> relevant glibc bug report? I'm not particularly familiar with ppc
> >>> register file (at least I have to refamiliarize myself every time I
> >>> work on this stuff) so it'd be nice to understand what's
> >>> potentially-wrong now.
> >>
> >> My understanding is the ctr issue only happens for vDSO calls where it
> >> fallback to a syscall in case an error (invalid argument, etc. and
> >> assuming if vDSO does not fallback to a syscall it always succeed).
> >> This makes the vDSO call on powerpc to have the same ABI constraint
> >> as a syscall, where it clobbers CR0.
> > 
> > I think you mean "vsyscall", the old thing glibc used where there are
> > in-userspace implementations of some syscalls with call interfaces
> > roughly equivalent to a syscall. musl has never used this. It only
> > uses the actual exported functions from the vdso which have normal
> > external function call ABI.
> 
> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
> The issue is indeed when calling the powerpc provided functions in 
> vDSO, which musl might want to do eventually.

AIUI (at least this is true for all other archs) the functions have
normal external function call ABI and calling them has nothing to do
with syscall mechanisms.

It looks like we're not using them right now and I'm not sure why. It
could be that there are ABI mismatch issues (are 32-bit ones
compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or
just that nobody proposed adding them. Also as of 5.4 32-bit ppc
lacked time64 versions of them; not sure if this is fixed yet.

Rich

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-23 16:43                                     ` Rich Felker
@ 2020-04-23 17:15                                       ` Adhemerval Zanella
  2020-04-23 17:42                                         ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Adhemerval Zanella @ 2020-04-23 17:15 UTC (permalink / raw)
  To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, libc-dev, linuxppc-dev, musl



On 23/04/2020 13:43, Rich Felker wrote:
> On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 23/04/2020 13:18, Rich Felker wrote:
>>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 22/04/2020 23:36, Rich Felker wrote:
>>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>>>>>> Yeah I had a bit of a play around with musl (which is very nice code I
>>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
>>>>>> Fortunately adding it doesn't change code generation for me, but it 
>>>>>> should be fixed. glibc had the same bug at one point I think (probably 
>>>>>> due to syscall ABI documentation not existing -- something now lives in 
>>>>>> linux/Documentation/powerpc/syscall64-abi.rst).
>>>>>
>>>>> Do you know anywhere I can read about the ctr issue, possibly the
>>>>> relevant glibc bug report? I'm not particularly familiar with ppc
>>>>> register file (at least I have to refamiliarize myself every time I
>>>>> work on this stuff) so it'd be nice to understand what's
>>>>> potentially-wrong now.
>>>>
>>>> My understanding is the ctr issue only happens for vDSO calls where it
>>>> fallback to a syscall in case an error (invalid argument, etc. and
>>>> assuming if vDSO does not fallback to a syscall it always succeed).
>>>> This makes the vDSO call on powerpc to have the same ABI constraint
>>>> as a syscall, where it clobbers CR0.
>>>
>>> I think you mean "vsyscall", the old thing glibc used where there are
>>> in-userspace implementations of some syscalls with call interfaces
>>> roughly equivalent to a syscall. musl has never used this. It only
>>> uses the actual exported functions from the vdso which have normal
>>> external function call ABI.
>>
>> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
>> The issue is indeed when calling the powerpc provided functions in 
>> vDSO, which musl might want to do eventually.
> 
> AIUI (at least this is true for all other archs) the functions have
> normal external function call ABI and calling them has nothing to do
> with syscall mechanisms.

My point is that powerpc specifically does not follow this, since it
issues a syscall as a fallback and its semantics follow kernel syscalls
(error signalled in cr0, r3 always being a positive value):

--
V_FUNCTION_BEGIN(__kernel_clock_gettime)
  .cfi_startproc
        [...]
        /*
         * syscall fallback
         */
99:
        li      r0,__NR_clock_gettime
  .cfi_restore lr
        sc
        blr
  .cfi_endproc
V_FUNCTION_END(__kernel_clock_gettime)


> 
> It looks like we're not using them right now and I'm not sure why. It
> could be that there are ABI mismatch issues (are 32-bit ones
> compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or
> just that nobody proposed adding them. Also as of 5.4 32-bit ppc
> lacked time64 versions of them; not sure if this is fixed yet.

For 64-bit there is also an issue where the vDSO does not provide an OPD
for ELFv1, which bit glibc while trying to implement an ifunc
optimization. I don't recall any issue for ELFv2.

For 32-bit I am not sure secure-plt will change anything, at least not
on powerpc where we use the same strategy as for 64-bit and use
mtctr/bctr directly.


* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-23 17:15                                       ` Adhemerval Zanella
@ 2020-04-23 17:42                                         ` Rich Felker
  2020-04-25  3:40                                           ` Nicholas Piggin
  0 siblings, 1 reply; 62+ messages in thread
From: Rich Felker @ 2020-04-23 17:42 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Nicholas Piggin, libc-alpha, libc-dev, linuxppc-dev, musl

On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote:
> My point is powerpc specifically does not follow it, since it issues a
> syscall in fallback and its semantic follow kernel syscalls (error
> signalled in cr0, r3 being always a positive value):

Oh, then I think we'll just ignore these unless the kernel can make
ones with a reasonable ABI. It's not worth having ppc-specific code
for this... It would be really nice if ones that actually behave like
functions could be added though.

> --
> V_FUNCTION_BEGIN(__kernel_clock_gettime)
>   .cfi_startproc
>         [...]
>         /*
>          * syscall fallback
>          */
> 99:
>         li      r0,__NR_clock_gettime
>   .cfi_restore lr
>         sc
>         blr
>   .cfi_endproc
> V_FUNCTION_END(__kernel_clock_gettime)
> 
> 
> > 
> > It looks like we're not using them right now and I'm not sure why. It
> > could be that there are ABI mismatch issues (are 32-bit ones
> > compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or
> > just that nobody proposed adding them. Also as of 5.4 32-bit ppc
> > lacked time64 versions of them; not sure if this is fixed yet.
> 
> For 64-bit it also have an issue where vDSO does not provide an OPD
> for ELFv1, which has bitten glibc while trying to implement an ifunc
> optimization. I don't recall any issue for ELFv2.
> 
> For 32-bit I am not sure secure-plt will change anything, at least not
> on powerpc where we use the same strategy for 64-bit and use a
> mtctr/bctr directly.

Indeed, I don't think there's a secure-plt distinction unless you're
making outgoing calls to possibly-cross-DSO functions.

Rich


* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-23  2:36                             ` Rich Felker
  2020-04-23 12:13                               ` Adhemerval Zanella
@ 2020-04-25  3:30                               ` Nicholas Piggin
  1 sibling, 0 replies; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-25  3:30 UTC (permalink / raw)
  To: Rich Felker; +Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl

Excerpts from Rich Felker's message of April 23, 2020 12:36 pm:
> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>> Yeah I had a bit of a play around with musl (which is very nice code I
>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
>> Fortunately adding it doesn't change code generation for me, but it 
>> should be fixed. glibc had the same bug at one point I think (probably 
>> due to syscall ABI documentation not existing -- something now lives in 
>> linux/Documentation/powerpc/syscall64-abi.rst).
> 
> Do you know anywhere I can read about the ctr issue, possibly the
> relevant glibc bug report? I'm not particularly familiar with ppc
> register file (at least I have to refamiliarize myself every time I
> work on this stuff) so it'd be nice to understand what's
> potentially-wrong now.

Ah, I was misremembering: glibc was (and still is) missing cr clobbers
from its "vsyscall", probably because it was copied from the syscall
code, which only clobbers cr0, whereas the vsyscall clobbers cr0-1 and
cr5-7 like a normal function call.

musl is missing the ctr register clobber from syscalls.

powerpc has GPRs r0-31, condition register fields cr0-7, and the lr and
ctr branch registers (lr is generally used for function returns, ctr for
other indirect branches). ctr is volatile (caller-saved) across C
function calls and across sc system calls on Linux.
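
A minimal sketch of what listing ctr as clobbered might look like in
GCC-style inline asm for a zero-argument sc system call. This is not musl's
actual source: the helper name is hypothetical, and the non-powerpc64
branch exists only so the sketch builds on other hosts.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns the result, or a negative errno value. */
static long sketch_syscall0(long n)
{
#if defined(__powerpc64__)
    register long r0 __asm__("r0") = n;
    register long r3 __asm__("r3");
    __asm__ __volatile__(
        "sc\n\t"        /* enter the kernel */
        "bns+ 1f\n\t"   /* cr0.SO clear: success, keep r3 */
        "neg %1, %1\n"  /* cr0.SO set: r3 holds errno, negate it */
        "1:"
        : "+r"(r0), "=r"(r3)
        :
        : "memory", "cr0", "r4", "r5", "r6", "r7", "r8",
          "r9", "r10", "r11", "r12",
          "ctr"         /* tell the compiler ctr does not survive sc */
    );
    return r3;
#else
    /* Portable fallback with the same return convention. */
    long ret = syscall(n);
    return ret == -1 ? -(long)errno : ret;
#endif
}
```

Without the "ctr" entry the compiler is free to keep a live value in ctr
across the asm statement, for example a loop counter.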

Thanks,
Nick


* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-23 17:42                                         ` Rich Felker
@ 2020-04-25  3:40                                           ` Nicholas Piggin
  2020-04-25  4:52                                             ` Rich Felker
  0 siblings, 1 reply; 62+ messages in thread
From: Nicholas Piggin @ 2020-04-25  3:40 UTC (permalink / raw)
  To: Adhemerval Zanella, Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl

Excerpts from Rich Felker's message of April 24, 2020 3:42 am:
> Oh, then I think we'll just ignore these unless the kernel can make
> ones with a reasonable ABI. It's not worth having ppc-specific code
> for this... It would be really nice if ones that actually behave like
> functions could be added though.

Yeah, this is an annoyance for me: after making the scv ABI return a
negative value in r3 on error, along with other changes that more
closely follow function calls, we still have the vdso functions using
the old style.

Maybe we should add function-call-style vdso functions too.
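
For comparison, a small sketch of how a caller might test a return value
under the negative-in-r3 style. The helper names are hypothetical, and the
-4095..-1 error range is an assumption borrowed from the usual Linux
convention rather than something stated in this message:

```c
/* Assumed error range: -4095..-1, as in the usual Linux convention. */
#define SKETCH_MAX_ERRNO 4095L

/* Nonzero if r3 encodes an error under the negative-return style. */
static int scv_ret_is_error(long r3)
{
    /* An unsigned compare covers the whole -4095..-1 range in one test. */
    return (unsigned long)r3 >= (unsigned long)-SKETCH_MAX_ERRNO;
}

/* Positive errno value for an error return, 0 otherwise. */
static long scv_ret_to_errno(long r3)
{
    return scv_ret_is_error(r3) ? -r3 : 0;
}
```

No condition register bit is involved, so the check looks the same as for
an ordinary C function that returns a negative errno value.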

Thanks,
Nick


* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-25  3:40                                           ` Nicholas Piggin
@ 2020-04-25  4:52                                             ` Rich Felker
  0 siblings, 0 replies; 62+ messages in thread
From: Rich Felker @ 2020-04-25  4:52 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl

On Sat, Apr 25, 2020 at 01:40:24PM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 24, 2020 3:42 am:
> > Oh, then I think we'll just ignore these unless the kernel can make
> > ones with a reasonable ABI. It's not worth having ppc-specific code
> > for this... It would be really nice if ones that actually behave like
> > functions could be added though.
> 
> Yeah this is an annoyance for me after making the scv ABI return -ve in 
> r3 for error and other things that more closely follow function calls, 
> we still have the vdso functions using the old style.
> 
> Maybe we should add function call style vdso too.

Please do.

Rich


Thread overview: 62+ messages
2020-04-15 21:45 [musl] Powerpc Linux 'scv' system call ABI proposal take 2 Nicholas Piggin
2020-04-15 22:55 ` Rich Felker
2020-04-16  0:16   ` Nicholas Piggin
2020-04-16  0:48     ` Rich Felker
2020-04-16  2:24       ` Nicholas Piggin
2020-04-16  2:35         ` Rich Felker
2020-04-16  2:53           ` Nicholas Piggin
2020-04-16  3:03             ` Rich Felker
2020-04-16  3:41               ` Nicholas Piggin
2020-04-16 20:18             ` Florian Weimer
2020-04-16  9:58     ` Szabolcs Nagy
2020-04-20  0:27       ` Nicholas Piggin
2020-04-20  1:29         ` Rich Felker
2020-04-20  2:08           ` Nicholas Piggin
2020-04-20 21:17             ` Szabolcs Nagy
2020-04-21  9:57               ` Florian Weimer
2020-04-16 15:21     ` Jeffrey Walton
2020-04-16 15:40       ` Rich Felker
2020-04-16  4:48   ` Florian Weimer
2020-04-16 15:35     ` Rich Felker
2020-04-16 16:42       ` Florian Weimer
2020-04-16 16:52         ` Rich Felker
2020-04-16 18:12           ` Florian Weimer
2020-04-16 23:02             ` Segher Boessenkool
2020-04-17  0:34               ` Rich Felker
2020-04-17  1:48                 ` Segher Boessenkool
2020-04-17  8:34                   ` Florian Weimer
2020-04-16 14:16   ` Adhemerval Zanella
2020-04-16 15:37     ` Rich Felker
2020-04-16 17:50       ` Adhemerval Zanella
2020-04-16 17:59         ` Rich Felker
2020-04-16 18:18           ` Adhemerval Zanella
2020-04-16 18:31             ` Rich Felker
2020-04-16 18:44               ` Rich Felker
2020-04-16 18:52               ` Adhemerval Zanella
2020-04-20  0:46                 ` Nicholas Piggin
2020-04-20  1:10               ` Nicholas Piggin
2020-04-20  1:34                 ` Rich Felker
2020-04-20  2:32                   ` Nicholas Piggin
2020-04-20  4:09                     ` Rich Felker
2020-04-20  4:31                       ` Nicholas Piggin
2020-04-20 17:27                         ` Rich Felker
2020-04-22  6:18                           ` Nicholas Piggin
2020-04-22  6:29                             ` Nicholas Piggin
2020-04-23  2:36                             ` Rich Felker
2020-04-23 12:13                               ` Adhemerval Zanella
2020-04-23 16:18                                 ` Rich Felker
2020-04-23 16:35                                   ` Adhemerval Zanella
2020-04-23 16:43                                     ` Rich Felker
2020-04-23 17:15                                       ` Adhemerval Zanella
2020-04-23 17:42                                         ` Rich Felker
2020-04-25  3:40                                           ` Nicholas Piggin
2020-04-25  4:52                                             ` Rich Felker
2020-04-25  3:30                               ` Nicholas Piggin
2020-04-21 12:28                 ` David Laight
2020-04-21 14:39                   ` Rich Felker
2020-04-21 15:00                     ` Adhemerval Zanella
2020-04-21 15:31                       ` David Laight
2020-04-22  6:54                       ` [musl] " Nicholas Piggin
2020-04-22  7:15                         ` Florian Weimer
2020-04-22  7:31                           ` Nicholas Piggin
2020-04-22  8:11                             ` Florian Weimer
