From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.3 required=5.0 tests=MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_MED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL autolearn=ham
	autolearn_force=no version=3.4.4
Received: (qmail 9123 invoked from network); 9 Mar 2021 18:04:51 -0000
Received: from mother.openwall.net (195.42.179.200)
  by inbox.vuxu.org with ESMTPUTF8; 9 Mar 2021 18:04:51 -0000
Received: (qmail 1824 invoked by uid 550); 9 Mar 2021 18:04:47 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 1806 invoked from network); 9 Mar 2021 18:04:46 -0000
Date: Tue, 9 Mar 2021 13:04:34 -0500
From: Rich Felker <dalias@libc.org>
To: Markus Wichmann <nullplan@gmx.net>
Cc: musl@lists.openwall.com
Message-ID: <20210309180434.GV32655@brightrain.aerifal.cx>
References: <20210309035652.32453-1-ericonr@disroot.org>
 <alpine.LNX.2.20.13.2103091200160.16269@monopod.intra.ispras.ru>
 <20210309134242.GS32655@brightrain.aerifal.cx>
 <alpine.LNX.2.20.13.2103091649540.16269@monopod.intra.ispras.ru>
 <20210309150320.GU32655@brightrain.aerifal.cx>
 <20210309165404.GB2766@voyager>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20210309165404.GB2766@voyager>
User-Agent: Mutt/1.5.21 (2010-09-15)
Subject: Re: [musl] [PATCH v2] add qsort_r.

On Tue, Mar 09, 2021 at 05:54:04PM +0100, Markus Wichmann wrote:
> On Tue, Mar 09, 2021 at 10:03:22AM -0500, Rich Felker wrote:
> > On Tue, Mar 09, 2021 at 05:13:39PM +0300, Alexander Monakov wrote:
> > > Second, if you make a "conventional" wrapper, then on popular architectures
> > > it is a single instruction (powerpc64 ABI demonstrates its insanity here):
> > >
> > > static int wrapper_cmp(void *v1, void *v2, void *ctx)
> > > {
> > > 	return ((cmpfun)ctx)(v1, v2);
> > > }
> > >
> > > Some examples:
> > >
> > > amd64:	jmp %rdx
> > > i386:	jmp *12(%esp)
> > > arm:	bx r2
> > > aarch64:br x2
> > >
> > > How is this not obvious?
> >
> > [...]
> >
> > For some reason though it's gigantic on powerpc64. It fails to do a
> > tail call at all...
> >
> 
> So, I experimented a bit with clang (for simplicity, since clang can
> switch targets with a compiler switch). And indeed the above is
> reproducible. Had to search around a bit for the ELFv2 ABI switch for
> clang (the ELFv1 version is even worse, since it uses function
> descriptors, so calling through a function pointer requires reading out
> the function descriptors before being able to use them).
> 
> So with ELFv2, the function consists of buildup and teardown of a stack
> frame, save and restore of R2, and the actual indirect call. The stack
> frame is necessary because of R2 being spilled, and R2 being spilled is
> necessary since the wrapper function might be called locally (so the
> contract is that R2 is preserved), but the function pointer might point
> to a function in another module, so R2 would be overwritten by the call.

Yes, I found the comment in GCC source to that effect (rs6000.c,
rs6000_function_ok_for_sibcall):

  /* Under the AIX or ELFv2 ABIs we can't allow calls to non-local
     functions, because the callee may have a different TOC pointer to
     the caller and there's no way to ensure we restore the TOC when
     we return.  With the secure-plt SYSV ABI we can't make non-local
     calls when -fpic/PIC because the plt call stubs use r30.  */
                       
However, this problem is solvable: just don't have a local entry. If
the function is defined without a local entry, the caller is forced to
save the TOC pointer. So GCC should be enhanced not to suppress tail
calls, but to suppress emitting a local entry point whenever there are
tail calls in the function. (A tail call saves A LOT more than eliding
the TOC spill saves!)
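
For reference, here is a minimal self-contained sketch of the wrapper
pattern whose tail call is at stake. It is not the musl patch:
demo_sort_r and demo_sort are hypothetical stand-ins (a plain insertion
sort limited to elements of at most 64 bytes) so that the example
compiles on its own, and the comparator typedefs are assumptions. Only
wrapper_cmp corresponds to the function quoted above, with const added
to match the usual comparator signature.

```c
#include <stddef.h>
#include <string.h>

/* Two-argument (qsort-style) and three-argument (qsort_r-style)
   comparator types. Names are assumptions for this sketch. */
typedef int (*cmpfun)(const void *, const void *);
typedef int (*cmpfun_r)(const void *, const void *, void *);

/* Hypothetical stand-in for a real qsort_r: insertion sort, used only
   to keep the example self-contained. Assumes width <= 64. */
static void demo_sort_r(void *base, size_t n, size_t width,
                        cmpfun_r cmp, void *ctx)
{
	unsigned char *b = base;
	unsigned char tmp[64];
	if (width > sizeof tmp) return;
	for (size_t i = 1; i < n; i++) {
		memcpy(tmp, b + i*width, width);
		size_t j = i;
		while (j > 0 && cmp(b + (j-1)*width, tmp, ctx) > 0) {
			memcpy(b + j*width, b + (j-1)*width, width);
			j--;
		}
		memcpy(b + j*width, tmp, width);
	}
}

/* The wrapper: ctx carries the two-argument comparator, and the call is
   in tail position, so it can compile to a single indirect jump where
   the target permits sibcalls. Converting void * to a function pointer
   is not guaranteed by ISO C, but POSIX requires it to work. */
static int wrapper_cmp(const void *v1, const void *v2, void *ctx)
{
	return ((cmpfun)ctx)(v1, v2);
}

/* Two-argument front end expressed in terms of the _r variant. */
static void demo_sort(void *base, size_t n, size_t width, cmpfun cmp)
{
	demo_sort_r(base, n, width, wrapper_cmp, (void *)cmp);
}

/* Example comparator for int elements. */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;
	return (x > y) - (x < y);
}
```

Compiling just wrapper_cmp with optimization and inspecting the
assembly is enough to see whether a given target emits the tail call.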

> That makes sense. What doesn't make sense is that the stack frame is
> still used in 32-bit powerpc. Nothing is saved into that stack frame;
> "mtctr 5; bctr" would be a valid implementation. But no matter what
> switches I threw at it, the stack frame remained.

It works for me, but my 32-bit ppc gcc is still very old (5.3). Maybe
this is a regression? Or if you're using clang, a clang-only
limitation?

It's known that 32-bit ppc can't tail call to the PLT with secure-plt
mode, but this is an indirect call not thru the PLT so it shouldn't
matter.

> For other architectures: I could not test microblaze, mipsn32, m68k,
> or1k, riscv64, and sh, since clang did not recognize those
> architectures. Probably not included by default. MIPS and MIPS64 both
> establish a stack frame, and s390x does not.

I tested sh4, sh2/fdpic, rv64, s390x, or1k, m68k, and mips (32-bit)
and they all do the tail call properly. But mips64 (n64 and n32) both
fail to. According to the GCC source, it's something to allow lazy
binding. MIPS64 does not use a real PLT, but actually has GOT entries
that might go through a lazy resolver and that expect %gp (call-saved)
to be valid on entry.

musl does not, and will never, do lazy binding, so this is purely
counterproductive for musl and we should probably teach GCC not to do
it. The current logic is:

  /* Sibling calls should not prevent lazy binding.  Lazy-binding stubs
     require $gp to be valid on entry, so sibcalls can only use stubs
     if $gp is call-clobbered.  */
  if (decl
      && TARGET_CALL_SAVED_GP
      && !TARGET_ABICALLS_PIC0
      && !targetm.binds_local_p (decl))
    return false;

TARGET_CALL_SAVED_GP is rightly true (it's the ABI).

TARGET_ABICALLS_PIC0 is rightly false (I'm pretty sure that's a bogus
alt ABI, and defined as TARGET_ABSOLUTE_ABICALLS && TARGET_PLT).

It probably needs an additional condition && TARGET_LAZY_BINDING that
we can define as false. Alternatively the issue could just be fixed not
to go through a lazy resolver anywhere.
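
As a rough illustration of the first option, the condition can be
modeled outside of GCC with plain booleans. This is only a sketch:
struct sibcall_model and sibcall_allowed are hypothetical names, the
lazy_binding field stands for the proposed TARGET_LAZY_BINDING (which
is not an existing GCC macro), and the model covers this one check
only, not the rest of the target's sibcall logic.

```c
#include <stdbool.h>

/* Hypothetical model of the inputs to the quoted check. The fields
   mirror the macros named above but are plain booleans, not GCC
   internals. */
struct sibcall_model {
	bool have_decl;     /* callee declaration is known (decl != NULL) */
	bool call_saved_gp; /* TARGET_CALL_SAVED_GP */
	bool abicalls_pic0; /* TARGET_ABICALLS_PIC0 */
	bool binds_local;   /* targetm.binds_local_p (decl) */
	bool lazy_binding;  /* proposed TARGET_LAZY_BINDING (assumption) */
};

/* Returns false only when the quoted condition, extended with the
   proposed lazy-binding term, would reject the sibcall. */
static bool sibcall_allowed(const struct sibcall_model *m)
{
	if (m->have_decl
	    && m->lazy_binding
	    && m->call_saved_gp
	    && !m->abicalls_pic0
	    && !m->binds_local)
		return false;
	return true;
}
```

With lazy_binding false, as it would be for a musl target, this check
no longer rejects a sibcall to a non-local function; with it true the
behavior matches the current logic.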

I opened a bug for it here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99491


Rich