From mboxrd@z Thu Jan  1 00:00:00 1970
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Date: Mon, 20 Apr 2020 11:10:25 +1000
From: Nicholas Piggin <npiggin@gmail.com>
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>, Rich Felker
	<dalias@libc.org>
Cc: libc-alpha@sourceware.org, libc-dev@lists.llvm.org,
	linuxppc-dev@lists.ozlabs.org, musl@lists.openwall.com
References: <1586931450.ub4c8cq8dj.astroid@bobo.none>
	<20200415225539.GL11469@brightrain.aerifal.cx>
	<c2612908-67f7-cceb-d121-700dea096016@linaro.org>
	<20200416153756.GU11469@brightrain.aerifal.cx>
	<4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org>
	<20200416175932.GZ11469@brightrain.aerifal.cx>
	<4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org>
	<20200416183151.GA11469@brightrain.aerifal.cx>
In-Reply-To: <20200416183151.GA11469@brightrain.aerifal.cx>
MIME-Version: 1.0
Message-Id: <1587344003.daumxvs1kh.astroid@bobo.none>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 16/04/2020 14:59, Rich Felker wrote:
>> > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>> >>
>> >>
>> >> On 16/04/2020 12:37, Rich Felker wrote:
>> >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>> >>>>> My preference would be that it work just like the i386 AT_SYSINFO
>> >>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
>> >>>>> provides a stub in the vdso that performs either scv or the old
>> >>>>> mechanism with the same calling convention. Then if the kernel doesn't
>> >>>>> provide it (because the kernel is too old) libc would have to provide
>> >>>>> its own stub that uses the legacy method and matches the calling
>> >>>>> convention of the one the kernel is expected to provide.
>> >>>>
>> >>>> What about pthread cancellation and the requirement of checking the
>> >>>> cancellable syscall anchors in asynchronous cancellation? My plan is
>> >>>> still to use musl strategy on glibc (BZ#12683) and for i686 it
>> >>>> requires to always use old int$128 for program that uses cancellation
>> >>>> (static case) or just threads (dynamic mode, which should be more
>> >>>> common on glibc).
>> >>>>
>> >>>> Using the i686 strategy of a vDSO bridge symbol would require to always
>> >>>> fallback to 'sc' to still use the same cancellation strategy (and
>> >>>> thus defeating this optimization in such cases).
>> >>>
>> >>> Yes, I assumed it would be the same, ignoring the new syscall
>> >>> mechanism for cancellable syscalls. While there are some exceptions,
>> >>> cancellable syscalls are generally not hot paths but things that are
>> >>> expected to block and to have significant amounts of work to do in
>> >>> kernelspace, so saving a few tens of cycles is rather pointless.
>> >>>
>> >>> It's possible to do a branch/multiple versions of the syscall asm for
>> >>> cancellation but would require extending the cancellation handler to
>> >>> support checking against multiple independent address ranges or using
>> >>> some alternate markup of them.
>> >>
>> >> The main issue is at least for glibc dynamic linking is way more common
>> >> than static linking and once the program become multithread the fallback
>> >> will be always used.
>> >
>> > I'm not relying on static linking optimizing out the cancellable
>> > version. I'm talking about how cancellable syscalls are pretty much
>> > all "heavy" operations to begin with where a few tens of cycles are in
>> > the realm of "measurement noise" relative to the dominating time
>> > costs.
>>
>> Yes I am aware, but at same time I am not sure how it plays on real world.
>> For instance, some workloads might issue kernel query syscalls, such as
>> recv, where buffer copying might not be dominant factor. So I see that if
>> the idea is optimizing syscall mechanism, we should try to leverage it
>> as whole in libc.
>
> Have you timed a minimal recv? I'm not assuming buffer copying is the
> dominant factor. I'm assuming the overhead of all the kernel layers
> involved is dominant.
>
>> >> And besides the cancellation performance issue, a new bridge vDSO mechanism
>> >> will still require to setup some extra bridge for the case of the older
>> >> kernel.  In the scheme you suggested:
>> >>
>> >>   __asm__("indirect call" ... with common clobbers);
>> >>
>> >> The indirect call will be either the vDSO bridge or a libc-provided one that
>> >> falls back to 'sc' for !PPC_FEATURE2_SCV. I am not sure this is really a gain
>> >> against:
>> >>
>> >>    if (hwcap & PPC_FEATURE2_SCV) {
>> >>      __asm__(... with some clobbers);
>> >>    } else {
>> >>      __asm__(... with different clobbers);
>> >>    }
>> >
>> > If the indirect call can be made roughly as efficiently as the sc
>> > sequence now (which already have some cost due to handling the nasty
>> > error return convention, making the indirect call likely just as small
>> > or smaller), it's O(1) additional code size (and thus icache usage)
>> > rather than O(n) where n is number of syscall points.
>> >
>> > Of course it would work just as well (for avoiding O(n) growth) to
>> > have a direct call to out-of-line branch like you suggested.
>>
>> Yes, but does it really matter to optimize this specific usage case
>> for size? glibc, for instance, tries to leverage the syscall mechanism
>> by adding some complex pre-processor asm directives.  It optimizes
>> the syscall code size in most cases.  For instance, kill in static case
>> generates on x86_64:
>>
>> 0000000000000000 <__kill>:
>>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
>>    5:   0f 05                   syscall
>>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
>>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
>>   13:   c3                      retq
>>
>> While on musl:
>>=20
>> 0000000000000000 <kill>:
>>    0:	48 83 ec 08          	sub    $0x8,%rsp
>>    4:	48 63 ff             	movslq %edi,%rdi
>>    7:	48 63 f6             	movslq %esi,%rsi
>>    a:	b8 3e 00 00 00       	mov    $0x3e,%eax
>>    f:	0f 05                	syscall
>>   11:	48 89 c7             	mov    %rax,%rdi
>>   14:	e8 00 00 00 00       	callq  19 <kill+0x19>
>>   19:	5a                   	pop    %rdx
>>   1a:	c3                   	retq
>
> Wow that's some extraordinarily bad codegen going on by gcc... The
> sign-extension is semantically needed and I don't see a good way
> around it (glibc's asm is kinda a hack taking advantage of kernel not
> looking at high bits, I think), but the gratuitous stack adjustment
> and refusal to generate a tail call isn't. I'll see if we can track
> down what's going on and get it fixed.
>
>> But I hardly think it pays off the required code complexity.  Same
>> for providing a O(1) bridge: this will require additional complexity
>> to write it and setup correctly.
>
> In some sense I agree, but inline instructions are a lot more
> expensive on ppc (being 32-bit each), and it might take out-of-lining
> anyway to get rid of stack frame setups if that ends up being a
> problem.
>
>> >> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a
>> >> TCB member (as we do on glibc) and if we could make the asm clever
>> >> enough to not require different clobbers (although not sure if
>> >> it would be possible).
>> >
>> > The easy way not to require different clobbers is just using the union
>> > of the clobbers, no? Does the proposed new method clobber any
>> > call-saved registers that would make it painful (requiring new call
>> > frames to save them in)?
>>
>> As far I can tell, it should be ok.
>
> Note that because lr is clobbered we need at least one normally
> call-clobbered register that's not syscall clobbered to save lr in.
> Otherwise stack frame setup is required to spill it.

The kernel would like to use r9-r12 for itself. We could do with fewer
registers, but we have some delay establishing the stack (it depends on a
load, which in turn depends on an mfspr), and entry code tends to be quite
store heavy, whereas on the caller side you have r1 set up (modulo stack
updates), and the system call is a long delay during which the store
queue has significant time to drain.

My feeling is it would be better for the kernel to have these scratch
registers.
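
For reference, the "pick a stub once, then always call through it"
scheme quoted above (as opposed to testing hwcap at every syscall site)
can be sketched roughly as below. This is only an illustration in
portable C, not code from the kernel, glibc or musl: stub_sc, stub_scv,
select_stub, init_syscall_stub and do_syscall are hypothetical names,
both stubs simply forward to the libc syscall() wrapper as stand-ins for
the real 'sc' and 'scv' sequences, and the PPC_FEATURE2_SCV bit value is
an assumption defined locally so the sketch builds anywhere.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Assumption: bit value for the scv hwcap2 flag, defined locally so the
 * sketch compiles on non-powerpc hosts as well. */
#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000UL
#endif

/* One calling convention shared by every stub, so call sites need a
 * single clobber set no matter which mechanism sits behind the pointer. */
typedef long (*syscall_stub_fn)(long nr, long a, long b, long c);

/* Hypothetical stand-in for the legacy 'sc' entry sequence. */
static long stub_sc(long nr, long a, long b, long c)
{
    return syscall(nr, a, b, c);
}

/* Hypothetical stand-in for the new 'scv' entry sequence. */
static long stub_scv(long nr, long a, long b, long c)
{
    return syscall(nr, a, b, c);
}

/* Pick the stub from the hwcap2 bits: O(1) code, paid once. */
static syscall_stub_fn select_stub(unsigned long hwcap2)
{
    return (hwcap2 & PPC_FEATURE2_SCV) ? stub_scv : stub_sc;
}

/* Defaults to the legacy path until initialised. */
static syscall_stub_fn syscall_stub = stub_sc;

static void init_syscall_stub(unsigned long hwcap2)
{
    syscall_stub = select_stub(hwcap2);
    assert(syscall_stub != NULL);
}

/* Every syscall site becomes one indirect call, with no per-site branch. */
static long do_syscall(long nr, long a, long b, long c)
{
    return syscall_stub(nr, a, b, c);
}
```

A libc would presumably call init_syscall_stub(getauxval(AT_HWCAP2))
once at startup; the per-site alternative would repeat the hwcap test
and both asm sequences at each of the n syscall points.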

Thanks,
Nick