mailing list of musl libc
* [musl] Release prep for 1.2.1, and afterwards
@ 2020-06-24 20:42 Rich Felker
  2020-06-24 22:39 ` Jeffrey Walton
  2020-06-25  8:15 ` Szabolcs Nagy
  0 siblings, 2 replies; 16+ messages in thread
From: Rich Felker @ 2020-06-24 20:42 UTC (permalink / raw)
  To: musl

I'm about to do the last work of merging mallocng, followed soon by
release. Is there anything in the way of overlooked bug reports or
patches that should still be addressed in this release cycle?

Things I'm aware of:

- "Proposal to match behaviour of gethostbyname to glibc". Latest
  patch is probably ok, but could be deferred to after release.

- nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
  without time to test, but replacing sqrtl.c could be appropriate
  since the current one is badly broken on archs with ld wider than
  double. However it would need to accept ld80 in order not to be
  build-breaking on m68k, or m68k would need an alternative.

and some more with open questions or work to be done that can't be
finished now but should be revisited after release:

- fenv overhaul (sorry for dropping this, Damian)
- PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP
- _SC_NPROCESSORS_{CONF,ONLN} behavior
- hexagon port
- rv32 port
- arm fdpic (newly revived interest from users on list)
- dni (dynamic linking without PT_INTERP absolute path) & related ldso
  work by rcombs
- "lutimes: Add checks for input parameters"



* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-24 20:42 [musl] Release prep for 1.2.1, and afterwards Rich Felker
@ 2020-06-24 22:39 ` Jeffrey Walton
  2020-06-25  8:15 ` Szabolcs Nagy
  1 sibling, 0 replies; 16+ messages in thread
From: Jeffrey Walton @ 2020-06-24 22:39 UTC (permalink / raw)
  To: musl

On Wed, Jun 24, 2020 at 4:58 PM Rich Felker <dalias@libc.org> wrote:
>
> I'm about to do the last work of merging mallocng, followed soon by
> release. Is there anything in the way of overlooked bug reports or
> patches that should still be addressed in this release cycle?
>
> Things I'm aware of:
>
> - "Proposal to match behaviour of gethostbyname to glibc". Latest
>   patch is probably ok, but could be deferred to after release.
>
> - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
>   without time to test, but replacing sqrtl.c could be appropriate
>   since the current one is badly broken on archs with ld wider than
>   double. However it would need to accept ld80 in order not to be
>   build-breaking on m68k, or m68k would need an alternative.
>
> and some more with open questions or work to be done that can't be
> finished now but should be revisited after release:
>
> - fenv overhaul (sorry for dropping this, Damian)
> - PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP
> - _SC_NPROCESSORS_{CONF,ONLN} behavior
> - hexagon port
> - rv32 port
> - arm fdpic (newly revived interest from users on list)
> - dni (dynamic linking without PT_INTERP absolute path) & related ldso
>   work by rcombs
> - "lutimes: Add checks for input parameters"

It would be nice to see runpath logic loosened up a bit. That is,
don't reject multiple runpaths if one is bad.

This is needed for packages like Perl. Perl screws up rpaths and
runpaths badly. Perl does not escape origin-based paths properly when
setting them in a makefile. Worse, Perl builds makefiles on the fly,
so we cannot manually fix the makefiles after configure.
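
To make it concrete, something along these lines is what I have in
mind (rough sketch only, not ldso's actual code; try_dir() and
search_runpath() are made-up names):

#include <string.h>

/* Walk a colon-separated runpath and skip entries that can't be
   used, instead of rejecting the whole string because one entry
   is malformed or over-long. */
static int try_dir(const char *dir, const char *name)
{
	(void)dir; (void)name;
	return 0; /* stand-in for the real directory lookup */
}

static int search_runpath(const char *runpath, const char *name)
{
	char buf[256];
	while (*runpath) {
		size_t len = strcspn(runpath, ":");
		if (len && len < sizeof buf) {
			memcpy(buf, runpath, len);
			buf[len] = 0;
			if (try_dir(buf, name)) return 1;
		}
		/* bad or empty entry: just move on to the next one */
		runpath += len + (runpath[len] == ':');
	}
	return 0;
}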

Jeff


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-24 20:42 [musl] Release prep for 1.2.1, and afterwards Rich Felker
  2020-06-24 22:39 ` Jeffrey Walton
@ 2020-06-25  8:15 ` Szabolcs Nagy
  2020-06-25 15:39   ` Rich Felker
  1 sibling, 1 reply; 16+ messages in thread
From: Szabolcs Nagy @ 2020-06-25  8:15 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

* Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:

> I'm about to do the last work of merging mallocng, followed soon by
> release. Is there anything in the way of overlooked bug reports or
> patches that should still be addressed in this release cycle?
> 
> Things I'm aware of:
> 
> - "Proposal to match behaviour of gethostbyname to glibc". Latest
>   patch is probably ok, but could be deferred to after release.
> 
> - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
>   without time to test, but replacing sqrtl.c could be appropriate
>   since the current one is badly broken on archs with ld wider than
>   double. However it would need to accept ld80 in order not to be
>   build-breaking on m68k, or m68k would need an alternative.

that's still under work

but it would be nice if we could get the aarch64
memcpy patch in (the c implementation is really
slow and i've seen ppl compare aarch64 vs x86
server performance with some benchmark on alpine..)

> 
> and some more with open questions or work to be done that can't be
> finished now but should be revisited after release:
> 
> - fenv overhaul (sorry for dropping this, Damian)
> - PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP
> - _SC_NPROCESSORS_{CONF,ONLN} behavior
> - hexagon port
> - rv32 port
> - arm fdpic (newly revived interest from users on list)
> - dni (dynamic linking without PT_INTERP absolute path) & related ldso
>   work by rcombs
> - "lutimes: Add checks for input parameters"


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25  8:15 ` Szabolcs Nagy
@ 2020-06-25 15:39   ` Rich Felker
  2020-06-25 17:31     ` Szabolcs Nagy
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-06-25 15:39 UTC (permalink / raw)
  To: musl

On Thu, Jun 25, 2020 at 10:15:04AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:
> 
> > I'm about to do the last work of merging mallocng, followed soon by
> > release. Is there anything in the way of overlooked bug reports or
> > patches that should still be addressed in this release cycle?
> > 
> > Things I'm aware of:
> > 
> > - "Proposal to match behaviour of gethostbyname to glibc". Latest
> >   patch is probably ok, but could be deferred to after release.
> > 
> > - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> >   without time to test, but replacing sqrtl.c could be appropriate
> >   since the current one is badly broken on archs with ld wider than
> >   double. However it would need to accept ld80 in order not to be
> >   build-breaking on m68k, or m68k would need an alternative.
> 
> that's still under work

Won't it work just to make it decode/encode the ldshape, and otherwise
use exactly the same code? Or are there double-rounding issues if the
quad code is used with ld80?
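
For ld80 I'd expect something like the following to be enough
(untested sketch; ld80_decode/ld80_encode and the field names are
illustrative, and the real union layout differs by arch and
endianness):

#include <stdint.h>

/* ld80: 64-bit mantissa with explicit leading bit, then 15-bit
   exponent and sign packed in the next 16 bits (x86-style,
   little-endian view) */
union ld80 {
	long double f;
	struct {
		uint64_t m;
		uint16_t se;
	} i;
};

static void ld80_decode(long double x, uint64_t *m, int *e, int *sign)
{
	union ld80 u = { .f = x };
	*m = u.i.m;
	*e = u.i.se & 0x7fff;
	*sign = u.i.se >> 15;
}

static long double ld80_encode(uint64_t m, int e, int sign)
{
	union ld80 u;
	u.i.m = m;
	u.i.se = (sign << 15) | (e & 0x7fff);
	return u.f;
}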

> but it would be nice if we could get the aarch64
> memcpy patch in (the c implementation is really
> slow and i've seen ppl compare aarch64 vs x86
> server performance with some benchmark on alpine..)

OK, I'll look again.

Rich


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 15:39   ` Rich Felker
@ 2020-06-25 17:31     ` Szabolcs Nagy
  2020-06-25 20:50       ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread
From: Szabolcs Nagy @ 2020-06-25 17:31 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

* Rich Felker <dalias@libc.org> [2020-06-25 11:39:36 -0400]:

> On Thu, Jun 25, 2020 at 10:15:04AM +0200, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:
> > 
> > > I'm about to do the last work of merging mallocng, followed soon by
> > > release. Is there anything in the way of overlooked bug reports or
> > > patches that should still be addressed in this release cycle?
> > > 
> > > Things I'm aware of:
> > > 
> > > - "Proposal to match behaviour of gethostbyname to glibc". Latest
> > >   patch is probably ok, but could be deferred to after release.
> > > 
> > > - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> > >   without time to test, but replacing sqrtl.c could be appropriate
> > >   since the current one is badly broken on archs with ld wider than
> > >   double. However it would need to accept ld80 in order not to be
> > >   build-breaking on m68k, or m68k would need an alternative.
> > 
> > that's still under work
> 
> Won't it work just to make it decode/encode the ldshape, and otherwise
> use exactly the same code? Or are there double-rounding issues if the
> quad code is used with ld80?

i think the same code may work for ld80 too,
but i'm still testing the single/double/quad
code, it's not ready for inclusion.

> > but it would be nice if we could get the aarch64
> > memcpy patch in (the c implementation is really
> > slow and i've seen ppl compare aarch64 vs x86
> > server performance with some benchmark on alpine..)
> 
> OK, I'll look again.

thanks.

(there are more aarch64 string functions in the
optimized-routines github repo but i think they
are not as important as memcpy/memmove/memset)


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 17:31     ` Szabolcs Nagy
@ 2020-06-25 20:50       ` Rich Felker
  2020-06-25 21:15         ` Rich Felker
  2020-06-25 21:43         ` Andre McCurdy
  0 siblings, 2 replies; 16+ messages in thread
From: Rich Felker @ 2020-06-25 20:50 UTC (permalink / raw)
  To: musl

On Thu, Jun 25, 2020 at 07:31:25PM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-06-25 11:39:36 -0400]:
> 
> > On Thu, Jun 25, 2020 at 10:15:04AM +0200, Szabolcs Nagy wrote:
> > > * Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:
> > > 
> > > > I'm about to do the last work of merging mallocng, followed soon by
> > > > release. Is there anything in the way of overlooked bug reports or
> > > > patches that should still be addressed in this release cycle?
> > > > 
> > > > Things I'm aware of:
> > > > 
> > > > - "Proposal to match behaviour of gethostbyname to glibc". Latest
> > > >   patch is probably ok, but could be deferred to after release.
> > > > 
> > > > - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> > > >   without time to test, but replacing sqrtl.c could be appropriate
> > > >   since the current one is badly broken on archs with ld wider than
> > > >   double. However it would need to accept ld80 in order not to be
> > > >   build-breaking on m68k, or m68k would need an alternative.
> > > 
> > > that's still under work
> > 
> > Won't it work just to make it decode/encode the ldshape, and otherwise
> > use exactly the same code? Or are there double-rounding issues if the
> > quad code is used with ld80?
> 
> i think the same code may work for ld80 too,
> but i'm still testing the single/double/quad
> code, it's not ready for inclusion.

OK. I had in mind possibly adding just sqrtl.c since it can't really
be worse than what we have now. But I'm ok with waiting too.

One alternative to getting it working for ld80 right away would be
just adding an asm version of sqrtl for m68k. However we have users
who've indicated an interest in disabling asm optimizations (see
thread "build: allow forcing generic implementations of library
functions") so in the long term I think we should aim for all generic
math functions to work on all ld formats and FLT_EVAL_METHOD rather
than just assuming they get replaced on i386/x86_64 and m68k.
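
(For reference, the generic code can tell the formats apart with the
usual float.h checks, roughly like this; the arch lists are just
examples, not an exhaustive statement:)

#include <float.h>

#if LDBL_MANT_DIG == 53 && LDBL_MAX_EXP == 1024
/* long double is plain double (arm and most other 32-bit archs) */
#elif LDBL_MANT_DIG == 64 && LDBL_MAX_EXP == 16384
/* 80-bit extended: i386, x86_64, m68k */
#elif LDBL_MANT_DIG == 113 && LDBL_MAX_EXP == 16384
/* IEEE binary128: aarch64, riscv64, s390x */
#else
#error unsupported long double format
#endif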

> > > but it would be nice if we could get the aarch64
> > > memcpy patch in (the c implementation is really
> > > slow and i've seen ppl compare aarch64 vs x86
> > > server performance with some benchmark on alpine..)
> > 
> > OK, I'll look again.
> 
> thanks.
> 
> (there are more aarch64 string functions in the
> optimized-routines github repo but i think they
> are not as important as memcpy/memmove/memset)

I found the code. Can you comment on performance and whether memset is
needed? (The C memset should be rather good already, more so than
memcpy.)

As noted in the past I'd like to get rid of having high level flow
logic in the arch asm and instead have the arch provide string asm
fragments, if desired, to copy blocks, which could then be used in a
shared C skeleton. However as you noted this has been a point of
practical performance problem for a long time and I don't think it's
fair to just keep putting it off for a better solution.
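
Very roughly the shape I have in mind, as a sketch only
(ARCH_HAS_COPY_FWD64 and __arch_copy_fwd64 are made-up names):

#include <stddef.h>

#ifdef ARCH_HAS_COPY_FWD64
/* arch-provided fragment: forward-copy n bytes, n a multiple of 64 */
void __arch_copy_fwd64(unsigned char *d, const unsigned char *s, size_t n);
#endif

void *memcpy(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;
#ifdef ARCH_HAS_COPY_FWD64
	if (n >= 64) {
		size_t blk = n & -(size_t)64; /* whole 64-byte blocks */
		__arch_copy_fwd64(d, s, blk);
		d += blk; s += blk; n -= blk;
	}
#endif
	/* generic tail / pure-C fallback */
	for (; n; n--) *d++ = *s++;
	return dest;
}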

Rich


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 20:50       ` Rich Felker
@ 2020-06-25 21:15         ` Rich Felker
  2020-06-26  1:20           ` Rich Felker
  2020-06-25 21:43         ` Andre McCurdy
  1 sibling, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-06-25 21:15 UTC (permalink / raw)
  To: musl

On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > but it would be nice if we could get the aarch64
> > > > memcpy patch in (the c implementation is really
> > > > slow and i've seen ppl compare aarch64 vs x86
> > > > server performance with some benchmark on alpine..)
> > > 
> > > OK, I'll look again.
> > 
> > thanks.
> > 
> > (there are more aarch64 string functions in the
> > optimized-routines github repo but i think they
> > are not as important as memcpy/memmove/memset)
> 
> I found the code. Can you comment on performance and whether memset is
> needed? (The C memset should be rather good already, more so than
> memcpy.)

Are the assumptions (v8-a, unaligned access) documented in memcpy.S
valid for all presently supportable aarch64?

A couple comments for merging if we do, that aren't hard requirements
but preferences:

- I'd like to expand out the macros from ../asmdefs.h since that won't
  be available and they just hide things (I guess they're attractive
  for Apple/macho users or something but not relevant to musl) and
  since the symbol name lines need to be changed anyway to public
  name. "Local var name" macros are ok to leave; changing them would
  be too error-prone and they make the code more readable anyway.

- I'd prefer not to have memmove logic in memcpy since it makes it
  larger and implies that misuse of memcpy when you mean memmove is
  supported usage. I'd be happy with an approach like x86 though,
  defining an __memcpy_fwd alias and having memmove tail call to that
  unless len>128 and reverse is needed, or just leaving memmove.c.
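
In C terms the split would look roughly like this (sketch only, with
__memcpy_fwd standing for the forward-copying asm entry point):

#include <stddef.h>
#include <stdint.h>

void *__memcpy_fwd(void *, const void *, size_t);

void *memmove(void *dest, const void *src, size_t n)
{
	/* forward copy is safe unless dest starts inside [src, src+n) */
	if ((uintptr_t)dest - (uintptr_t)src >= n)
		return __memcpy_fwd(dest, src, n);
	/* overlapping with dest after src: copy backwards */
	unsigned char *d = (unsigned char *)dest + n;
	const unsigned char *s = (const unsigned char *)src + n;
	while (n--) *--d = *--s;
	return dest;
}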

Rich


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 20:50       ` Rich Felker
  2020-06-25 21:15         ` Rich Felker
@ 2020-06-25 21:43         ` Andre McCurdy
  2020-06-25 21:51           ` Rich Felker
  1 sibling, 1 reply; 16+ messages in thread
From: Andre McCurdy @ 2020-06-25 21:43 UTC (permalink / raw)
  To: musl

On Thu, Jun 25, 2020 at 1:50 PM Rich Felker <dalias@libc.org> wrote:
>
> As noted in the past I'd like to get rid of having high level flow
> logic in the arch asm and instead have the arch provide string asm
> fragments, if desired, to copy blocks, which could then be used in a
> shared C skeleton. However as you noted this has been a point of
> practical performance problem for a long time and I don't think it's
> fair to just keep putting it off for a better solution.

I'd like to see the patches to enable asm memcpy for big endian ARM
merged. I may be the only user of musl on big endian ARM though (?) so
not sure how much wider interest there is.


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 21:43         ` Andre McCurdy
@ 2020-06-25 21:51           ` Rich Felker
  2020-06-25 22:03             ` Andre McCurdy
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-06-25 21:51 UTC (permalink / raw)
  To: musl

On Thu, Jun 25, 2020 at 02:43:42PM -0700, Andre McCurdy wrote:
> On Thu, Jun 25, 2020 at 1:50 PM Rich Felker <dalias@libc.org> wrote:
> >
> > As noted in the past I'd like to get rid of having high level flow
> > logic in the arch asm and instead have the arch provide string asm
> > fragments, if desired, to copy blocks, which could then be used in a
> > shared C skeleton. However as you noted this has been a point of
> > practical performance problem for a long time and I don't think it's
> > fair to just keep putting it off for a better solution.
> 
> I'd like to see the patches to enable asm memcpy for big endian ARM
> merged. I may be the only user of musl on big endian ARM though (?) so
> not sure how much wider interest there is.

I'd forgotten I hadn't already merged it. However I was just rereading
it and something looks amiss. Can you take a look again?

Rich


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 21:51           ` Rich Felker
@ 2020-06-25 22:03             ` Andre McCurdy
  0 siblings, 0 replies; 16+ messages in thread
From: Andre McCurdy @ 2020-06-25 22:03 UTC (permalink / raw)
  To: musl

On Thu, Jun 25, 2020 at 2:51 PM Rich Felker <dalias@libc.org> wrote:
>
> On Thu, Jun 25, 2020 at 02:43:42PM -0700, Andre McCurdy wrote:
> > On Thu, Jun 25, 2020 at 1:50 PM Rich Felker <dalias@libc.org> wrote:
> > >
> > > As noted in the past I'd like to get rid of having high level flow
> > > logic in the arch asm and instead have the arch provide string asm
> > > fragments, if desired, to copy blocks, which could then be used in a
> > > shared C skeleton. However as you noted this has been a point of
> > > practical performance problem for a long time and I don't think it's
> > > fair to just keep putting it off for a better solution.
> >
> > I'd like to see the patches to enable asm memcpy for big endian ARM
> > merged. I may be the only user of musl on big endian ARM though (?) so
> > not sure how much wider interest there is.
>
> I'd forgotten I hadn't already merged it. However I was just rereading
> it and something looks amiss. Can you take a look again?

Is there anything in particular that looks wrong?

The most recent version of the patch still applies cleanly to master.


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 21:15         ` Rich Felker
@ 2020-06-26  1:20           ` Rich Felker
  2020-06-26  8:40             ` Szabolcs Nagy
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-06-26  1:20 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1772 bytes --]

On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > but it would be nice if we could get the aarch64
> > > > > memcpy patch in (the c implementation is really
> > > > > slow and i've seen ppl compare aarch64 vs x86
> > > > > server performance with some benchmark on alpine..)
> > > > 
> > > > OK, I'll look again.
> > > 
> > > thanks.
> > > 
> > > (there are more aarch64 string functions in the
> > > optimized-routines github repo but i think they
> > > are not as important as memcpy/memmove/memset)
> > 
> > I found the code. Can you comment on performance and whether memset is
> > needed? (The C memset should be rather good already, more so than
> > memcpy.)
> 
> Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> valid for all presently supportable aarch64?
> 
> A couple comments for merging if we do, that aren't hard requirements
> but preferences:
> 
> - I'd like to expand out the macros from ../asmdefs.h since that won't
>   be available and they just hide things (I guess they're attractive
>   for Apple/macho users or something but not relevant to musl) and
>   since the symbol name lines need to be changed anyway to public
>   name. "Local var name" macros are ok to leave; changing them would
>   be too error-prone and they make the code more readable anyway.
> 
> - I'd prefer not to have memmove logic in memcpy since it makes it
>   larger and implies that misuse of memcpy when you mean memmove is
>   supported usage. I'd be happy with an approach like x86 though,
>   defining an __memcpy_fwd alias and having memmove tail call to that
>   unless len>128 and reverse is needed, or just leaving memmove.c.

Something like the attached.

Rich

[-- Attachment #2: memcpy.S --]
[-- Type: text/plain, Size: 4082 bytes --]

/*
 * memcpy - copy memory area
 *
 * Copyright (c) 2012-2020, Arm Limited.
 * SPDX-License-Identifier: MIT
 */

/* Assumptions:
 *
 * ARMv8-a, AArch64, unaligned accesses.
 *
 */

#define dstin	x0
#define src	x1
#define count	x2
#define dst	x3
#define srcend	x4
#define dstend	x5
#define A_l	x6
#define A_lw	w6
#define A_h	x7
#define B_l	x8
#define B_lw	w8
#define B_h	x9
#define C_l	x10
#define C_lw	w10
#define C_h	x11
#define D_l	x12
#define D_h	x13
#define E_l	x14
#define E_h	x15
#define F_l	x16
#define F_h	x17
#define G_l	count
#define G_h	dst
#define H_l	src
#define H_h	srcend
#define tmp1	x14

/* This implementation handles overlaps and supports both memcpy and memmove
   from a single entry point.  It uses unaligned accesses and branchless
   sequences to keep the code small, simple and improve performance.

   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
   copies of up to 128 bytes, and large copies.  The overhead of the overlap
   check is negligible since it is only required for large copies.

   Large copies use a software pipelined loop processing 64 bytes per iteration.
   The destination pointer is 16-byte aligned to minimize unaligned accesses.
   The loop tail is handled by always copying 64 bytes from the end.
*/

.global memcpy
.type memcpy,%function
memcpy:
	add	srcend, src, count
	add	dstend, dstin, count
	cmp	count, 128
	b.hi	.Lcopy_long
	cmp	count, 32
	b.hi	.Lcopy32_128

	/* Small copies: 0..32 bytes.  */
	cmp	count, 16
	b.lo	.Lcopy16
	ldp	A_l, A_h, [src]
	ldp	D_l, D_h, [srcend, -16]
	stp	A_l, A_h, [dstin]
	stp	D_l, D_h, [dstend, -16]
	ret

	/* Copy 8-15 bytes.  */
.Lcopy16:
	tbz	count, 3, .Lcopy8
	ldr	A_l, [src]
	ldr	A_h, [srcend, -8]
	str	A_l, [dstin]
	str	A_h, [dstend, -8]
	ret

	.p2align 3
	/* Copy 4-7 bytes.  */
.Lcopy8:
	tbz	count, 2, .Lcopy4
	ldr	A_lw, [src]
	ldr	B_lw, [srcend, -4]
	str	A_lw, [dstin]
	str	B_lw, [dstend, -4]
	ret

	/* Copy 0..3 bytes using a branchless sequence.  */
.Lcopy4:
	cbz	count, .Lcopy0
	lsr	tmp1, count, 1
	ldrb	A_lw, [src]
	ldrb	C_lw, [srcend, -1]
	ldrb	B_lw, [src, tmp1]
	strb	A_lw, [dstin]
	strb	B_lw, [dstin, tmp1]
	strb	C_lw, [dstend, -1]
.Lcopy0:
	ret

	.p2align 4
	/* Medium copies: 33..128 bytes.  */
.Lcopy32_128:
	ldp	A_l, A_h, [src]
	ldp	B_l, B_h, [src, 16]
	ldp	C_l, C_h, [srcend, -32]
	ldp	D_l, D_h, [srcend, -16]
	cmp	count, 64
	b.hi	.Lcopy128
	stp	A_l, A_h, [dstin]
	stp	B_l, B_h, [dstin, 16]
	stp	C_l, C_h, [dstend, -32]
	stp	D_l, D_h, [dstend, -16]
	ret

	.p2align 4
	/* Copy 65..128 bytes.  */
.Lcopy128:
	ldp	E_l, E_h, [src, 32]
	ldp	F_l, F_h, [src, 48]
	cmp	count, 96
	b.ls	.Lcopy96
	ldp	G_l, G_h, [srcend, -64]
	ldp	H_l, H_h, [srcend, -48]
	stp	G_l, G_h, [dstend, -64]
	stp	H_l, H_h, [dstend, -48]
.Lcopy96:
	stp	A_l, A_h, [dstin]
	stp	B_l, B_h, [dstin, 16]
	stp	E_l, E_h, [dstin, 32]
	stp	F_l, F_h, [dstin, 48]
	stp	C_l, C_h, [dstend, -32]
	stp	D_l, D_h, [dstend, -16]
	ret

	.p2align 4
	/* Copy more than 128 bytes.  */
.Lcopy_long:

	/* Copy 16 bytes and then align dst to 16-byte alignment.  */

	ldp	D_l, D_h, [src]
	and	tmp1, dstin, 15
	bic	dst, dstin, 15
	sub	src, src, tmp1
	add	count, count, tmp1	/* Count is now 16 too large.  */
	ldp	A_l, A_h, [src, 16]
	stp	D_l, D_h, [dstin]
	ldp	B_l, B_h, [src, 32]
	ldp	C_l, C_h, [src, 48]
	ldp	D_l, D_h, [src, 64]!
	subs	count, count, 128 + 16	/* Test and readjust count.  */
	b.ls	.Lcopy64_from_end

.Lloop64:
	stp	A_l, A_h, [dst, 16]
	ldp	A_l, A_h, [src, 16]
	stp	B_l, B_h, [dst, 32]
	ldp	B_l, B_h, [src, 32]
	stp	C_l, C_h, [dst, 48]
	ldp	C_l, C_h, [src, 48]
	stp	D_l, D_h, [dst, 64]!
	ldp	D_l, D_h, [src, 64]!
	subs	count, count, 64
	b.hi	.Lloop64

	/* Write the last iteration and copy 64 bytes from the end.  */
.Lcopy64_from_end:
	ldp	E_l, E_h, [srcend, -64]
	stp	A_l, A_h, [dst, 16]
	ldp	A_l, A_h, [srcend, -48]
	stp	B_l, B_h, [dst, 32]
	ldp	B_l, B_h, [srcend, -32]
	stp	C_l, C_h, [dst, 48]
	ldp	C_l, C_h, [srcend, -16]
	stp	D_l, D_h, [dst, 64]
	stp	E_l, E_h, [dstend, -64]
	stp	A_l, A_h, [dstend, -48]
	stp	B_l, B_h, [dstend, -32]
	stp	C_l, C_h, [dstend, -16]
	ret

.size memcpy,.-memcpy


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-26  1:20           ` Rich Felker
@ 2020-06-26  8:40             ` Szabolcs Nagy
  2020-07-06 22:12               ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread
From: Szabolcs Nagy @ 2020-06-26  8:40 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

* Rich Felker <dalias@libc.org> [2020-06-25 21:20:06 -0400]:
> On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> > On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > > but it would be nice if we could get the aarch64
> > > > > > memcpy patch in (the c implementation is really
> > > > > > slow and i've seen ppl compare aarch64 vs x86
> > > > > > server performance with some benchmark on alpine..)
> > > > > 
> > > > > OK, I'll look again.
> > > > 
> > > > thanks.
> > > > 
> > > > (there are more aarch64 string functions in the
> > > > optimized-routines github repo but i think they
> > > > are not as important as memcpy/memmove/memset)
> > > 
> > > I found the code. Can you comment on performance and whether memset is
> > > needed? (The C memset should be rather good already, more so than
> > > memcpy.)

the asm seems faster in all measurements but there is
a lot of variance with different size/alignment cases.

the avg improvement on typical workload and the possible
improvements across various cases and cores i'd expect:

memcpy typical: 1.6x-1.7x
memcpy possible: 1.2x-3.1x

memset typical: 1.1x-1.4x
memset possible: 1.0x-2.6x

> > Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> > valid for all presently supportable aarch64?

yes, unaligned access on normal memory in userspace
is valid (part of the base abi on linux).

iirc a core can be configured to trap unaligned access
and it is not valid on device memory so e.g. such
memcpy would not work in the kernel. but avoiding
unaligned access in memcpy is not enough to fix that,
the compiler will generate unaligned load for

int f(char *p)
{
    int i;
    __builtin_memcpy(&i,p,sizeof i);
    return i;
}

> > 
> > A couple comments for merging if we do, that aren't hard requirements
> > but preferences:
> > 
> > - I'd like to expand out the macros from ../asmdefs.h since that won't
> >   be available and they just hide things (I guess they're attractive
> >   for Apple/macho users or something but not relevant to musl) and
> >   since the symbol name lines need to be changed anyway to public
> >   name. "Local var name" macros are ok to leave; changing them would
> >   be too error-prone and they make the code more readable anyway.

the weird macros are there so the code is similar to glibc
asm code (which adds cfi annotation and optionally adds
profile hooks to entry etc)

> > 
> > - I'd prefer not to have memmove logic in memcpy since it makes it
> >   larger and implies that misuse of memcpy when you mean memmove is
> >   supported usage. I'd be happy with an approach like x86 though,
> >   defining an __memcpy_fwd alias and having memmove tail call to that
> >   unless len>128 and reverse is needed, or just leaving memmove.c.

in principle the code should be called memmove, not memcpy,
since it satisfies the memmove contract, which of course
works for memcpy too. so tail calling memmove from memcpy
makes more sense but memcpy is more performance critical
than memmove, so we probably should not add extra branches
there..

> 
> Something like the attached.

looks good to me.


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-26  8:40             ` Szabolcs Nagy
@ 2020-07-06 22:12               ` Rich Felker
  2020-07-07 15:00                 ` Szabolcs Nagy
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-07-06 22:12 UTC (permalink / raw)
  To: musl

On Fri, Jun 26, 2020 at 10:40:49AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-06-25 21:20:06 -0400]:
> > On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> > > On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > > > but it would be nice if we could get the aarch64
> > > > > > > memcpy patch in (the c implementation is really
> > > > > > > slow and i've seen ppl compare aarch64 vs x86
> > > > > > > server performance with some benchmark on alpine..)
> > > > > > 
> > > > > > OK, I'll look again.
> > > > > 
> > > > > thanks.
> > > > > 
> > > > > (there are more aarch64 string functions in the
> > > > > optimized-routines github repo but i think they
> > > > > are not as important as memcpy/memmove/memset)
> > > > 
> > > > I found the code. Can you comment on performance and whether memset is
> > > > needed? (The C memset should be rather good already, more so than
> > > > memcpy.)
> 
> the asm seems faster in all measurements but there is
> a lot of variance with different size/alignment cases.
> 
> the avg improvement on typical workload and the possible
> improvements across various cases and cores i'd expect:
> 
> memcpy typical: 1.6x-1.7x
> memcpy possible: 1.2x-3.1x
> 
> memset typical: 1.1x-1.4x
> memset possible: 1.0x-2.6x
> 
> > > Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> > > valid for all presently supportable aarch64?
> 
> yes, unaligned access on normal memory in userspace
> is valid (part of the base abi on linux).
> 
> iirc a core can be configured to trap unaligned access
> and it is not valid on device memory so e.g. such
> memcpy would not work in the kernel. but avoiding
> unaligned access in memcpy is not enough to fix that,
> the compiler will generate unaligned load for
> 
> int f(char *p)
> {
>     int i;
>     __builtin_memcpy(&i,p,sizeof i);
>     return i;
> }
> 
> > > 
> > > A couple comments for merging if we do, that aren't hard requirements
> > > but preferences:
> > > 
> > > - I'd like to expand out the macros from ../asmdefs.h since that won't
> > >   be available and they just hide things (I guess they're attractive
> > >   for Apple/macho users or something but not relevant to musl) and
> > >   since the symbol name lines need to be changed anyway to public
> > >   name. "Local var name" macros are ok to leave; changing them would
> > >   be too error-prone and they make the code more readable anyway.
> 
> the weird macros are there so the code is similar to glibc
> asm code (which adds cfi annotation and optionally adds
> profile hooks to entry etc)
> 
> > > 
> > > - I'd prefer not to have memmove logic in memcpy since it makes it
> > >   larger and implies that misuse of memcpy when you mean memmove is
> > >   supported usage. I'd be happy with an approach like x86 though,
> > >   defining an __memcpy_fwd alias and having memmove tail call to that
> > >   unless len>128 and reverse is needed, or just leaving memmove.c.
> 
> in principle the code should be called memmove, not memcpy,
> since it satisfies the memmove contract, which of course
> works for memcpy too. so tail calling memmove from memcpy
> makes more sense but memcpy is more performance critical
> than memmove, so we probably should not add extra branches
> there..
> 
> > 
> > Something like the attached.
> 
> looks good to me.

I think you saw already, but just to make it clear on the list too,
it's upstream now. I'm open to further improvements like doing
memmove (either as a separate copy of the full implementation or some
minimal branch-to-__memcpy_fwd approach) but I think what's already
there is sufficient to solve the main practical performance issues
users were hitting that made aarch64 look bad in relation to x86_64.

I'd still like to revisit the topic of minimizing the per-arch code
needed for this so that all archs can benefit from the basic logic,
too.

Rich


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-07-06 22:12               ` Rich Felker
@ 2020-07-07 15:00                 ` Szabolcs Nagy
  2020-07-07 17:22                   ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread
From: Szabolcs Nagy @ 2020-07-07 15:00 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

* Rich Felker <dalias@libc.org> [2020-07-06 18:12:43 -0400]:
> I think you saw already, but just to make it clear on the list too,
> it's upstream now. I'm open to further improvements like doing
> memmove (either as a separate copy of the full implementation or some
> minimal branch-to-__memcpy_fwd approach) but I think what's already
> there is sufficient to solve the main practical performance issues
> users were hitting that made aarch64 look bad in relation to x86_64.
> 
> I'd still like to revisit the topic of minimizing the per-arch code
> needed for this so that all archs can benefit from the basic logic,
> too.

thanks.

note that the code has some internal .p2align
directives that assume the entry is aligned to
some large alignment (.p2align 6 in orig code)

i think it would be better to keep the entry
aligned (but i don't know if it makes a big
difference on some existing core, it's more
for consistency with upstream).

musl normally does not align function entries
but for a few select functions it is probably
not too much overhead?


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-07-07 15:00                 ` Szabolcs Nagy
@ 2020-07-07 17:22                   ` Rich Felker
  2020-07-07 18:20                     ` Szabolcs Nagy
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-07-07 17:22 UTC (permalink / raw)
  To: musl

On Tue, Jul 07, 2020 at 05:00:20PM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-07-06 18:12:43 -0400]:
> > I think you saw already, but just to make it clear on the list too,
> > it's upstream now. I'm open to further improvements like doing
> > memmove (either as a separate copy of the full implementation or some
> > minimal branch-to-__memcpy_fwd approach) but I think what's already
> > there is sufficient to solve the main practical performance issues
> > users were hitting that made aarch64 look bad in relation to x86_64.
> > 
> > I'd still like to revisit the topic of minimizing the per-arch code
> > needed for this so that all archs can benefit from the basic logic,
> > too.
> 
> thanks.
> 
> note that the code has some internal .p2align
> directives that assume the entry is aligned to
> some large alignment (.p2align 6 in orig code)
> 
> i think it would be better to keep the entry
> aligned (but i don't know if it makes a big
> difference on some existing core, it's more
> for consistency with upstream).
> 
> musl normally does not align function entries
> but for a few select functions it is probably
> not too much overhead?

I was under the impression that any .p2align N in the section
inherently aligns the whole section as if it started with .p2align N,
in which case not writing it explicitly just avoids redundancy and
makes sure you don't actually have an initial alignment that's larger
than any alignment actually wanted later. Is this incorrect?

(To be incorrect I think it would have to do some fancy
elastic-section-contents hack, but maybe aarch64 ELF object ABI has
that..?)

Rich


* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-07-07 17:22                   ` Rich Felker
@ 2020-07-07 18:20                     ` Szabolcs Nagy
  0 siblings, 0 replies; 16+ messages in thread
From: Szabolcs Nagy @ 2020-07-07 18:20 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

* Rich Felker <dalias@libc.org> [2020-07-07 13:22:57 -0400]:

> On Tue, Jul 07, 2020 at 05:00:20PM +0200, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2020-07-06 18:12:43 -0400]:
> > > I think you saw already, but just to make it clear on the list too,
> > > it's upstream now. I'm open to further improvements like doing
> > > memmove (either as a separate copy of the full implementation or some
> > > minimal branch-to-__memcpy_fwd approach) but I think what's already
> > > there is sufficient to solve the main practical performance issues
> > > users were hitting that made aarch64 look bad in relation to x86_64.
> > > 
> > > I'd still like to revisit the topic of minimizing the per-arch code
> > > needed for this so that all archs can benefit from the basic logic,
> > > too.
> > 
> > thanks.
> > 
> > note that the code has some internal .p2align
> > directives that assume the entry is aligned to
> > some large alignment (.p2align 6 in orig code)
> > 
> > i think it would be better to keep the entry
> > aligned (but i don't know if it makes a big
> > difference on some existing core, it's more
> > for consistency with upstream).
> > 
> > musl normally does not align function entries
> > but for a few select functions it is probably
> > not too much overhead?
> 
> I was under the impression that any .p2align N in the section
> inherently aligns the whole section as if it started with .p2align N,
> in which case not writing it explicitly just avoids redundancy and
> makes sure you don't actually have an initial alignment that's larger
> than any alignment actually wanted later. Is this incorrect?
> 
> (To be incorrect I think it would have to do some fancy
> elastic-section-contents hack, but maybe aarch64 ELF object ABI has
> that..?)

ah you are right, then everything is fine i guess.


Thread overview: 16+ messages
2020-06-24 20:42 [musl] Release prep for 1.2.1, and afterwards Rich Felker
2020-06-24 22:39 ` Jeffrey Walton
2020-06-25  8:15 ` Szabolcs Nagy
2020-06-25 15:39   ` Rich Felker
2020-06-25 17:31     ` Szabolcs Nagy
2020-06-25 20:50       ` Rich Felker
2020-06-25 21:15         ` Rich Felker
2020-06-26  1:20           ` Rich Felker
2020-06-26  8:40             ` Szabolcs Nagy
2020-07-06 22:12               ` Rich Felker
2020-07-07 15:00                 ` Szabolcs Nagy
2020-07-07 17:22                   ` Rich Felker
2020-07-07 18:20                     ` Szabolcs Nagy
2020-06-25 21:43         ` Andre McCurdy
2020-06-25 21:51           ` Rich Felker
2020-06-25 22:03             ` Andre McCurdy

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
