* [musl] Release prep for 1.2.1, and afterwards
@ 2020-06-24 20:42 Rich Felker
2020-06-24 22:39 ` Jeffrey Walton
2020-06-25 8:15 ` Szabolcs Nagy
0 siblings, 2 replies; 16+ messages in thread
From: Rich Felker @ 2020-06-24 20:42 UTC (permalink / raw)
To: musl
I'm about to do last work of merging mallocng, followed soon by
release. Is there anything in the way of overlooked bug reports or
patches that should still be addressed in this release cycle?
Things I'm aware of:
- "Proposal to match behaviour of gethostbyname to glibc". Latest
patch is probably ok, but could be deferred to after release.
- nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
without time to test, but replacing sqrtl.c could be appropriate
since the current one is badly broken on archs with ld wider than
double. However it would need to accept ld80 in order not to be
build-breaking on m68k, or m68k would need an alternative.
and some more with open questions or work to be done that can't be
finished now but should be revisited after release:
- fenv overhaul (sorry for dropping this, Damian)
- PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP
- _SC_NPROCESSORS_{CONF,ONLN} behavior
- hexagon port
- rv32 port
- arm fdpic (newly revived interest from users on list)
- dni (dynamic linking without PT_INTERP absolute path) & related ldso
work by rcombs
- "lutimes: Add checks for input parameters"
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-24 20:42 [musl] Release prep for 1.2.1, and afterwards Rich Felker
@ 2020-06-24 22:39 ` Jeffrey Walton
2020-06-25 8:15 ` Szabolcs Nagy
1 sibling, 0 replies; 16+ messages in thread
From: Jeffrey Walton @ 2020-06-24 22:39 UTC (permalink / raw)
To: musl
On Wed, Jun 24, 2020 at 4:58 PM Rich Felker <dalias@libc.org> wrote:
>
> I'm about to do last work of merging mallocng, followed soon by
> release. Is there anything in the way of overlooked bug reports or
> patches that should still be addressed in this release cycle?
>
> Things I'm aware of:
>
> - "Proposal to match behaviour of gethostbyname to glibc". Latest
> patch is probably ok, but could be deferred to after release.
>
> - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> without time to test, but replacing sqrtl.c could be appropriate
> since the current one is badly broken on archs with ld wider than
> double. However it would need to accept ld80 in order not to be
> build-breaking on m68k, or m68k would need an alternative.
>
> and some more with open questions or work to be done that can't be
> finished now but should be revisited after release:
>
> - fenv overhaul (sorry for dropping this, Damian)
> - PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP
> - _SC_NPROCESSORS_{CONF,ONLN} behavior
> - hexagon port
> - rv32 port
> - arm fdpic (newly revived interest from users on list)
> - dni (dynamic linking without PT_INTERP absolute path) & related ldso
> work by rcombs
> - "lutimes: Add checks for input parameters"
It would be nice to see runpath logic loosened up a bit. That is,
don't reject multiple runpaths if one is bad.
This is needed for packages like Perl. Perl screws up rpaths and
runpaths badly. Perl does not escape origin-based paths properly when
setting them in a makefile. Worse, Perl builds makefiles on the fly,
so we cannot manually fix the makefiles after configure.
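Concretely, the escaping problem looks like this in a makefile (a hypothetical fragment for illustration, not Perl's actual generated makefile):

```make
# "$" starts a make variable reference, so a literal $ORIGIN must be
# written "$$ORIGIN"; make collapses "$$" to a single "$" when it runs
# the link command.

# Wrong: make expands the undefined variable "$O" to nothing, so the
# linker records the broken runpath "RIGIN/../lib".
LDFLAGS_BAD  = -Wl,-rpath,'$ORIGIN/../lib'

# Right: the linker sees -Wl,-rpath,'$ORIGIN/../lib' as intended.
LDFLAGS_GOOD = -Wl,-rpath,'$$ORIGIN/../lib'
```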
Jeff
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-24 20:42 [musl] Release prep for 1.2.1, and afterwards Rich Felker
2020-06-24 22:39 ` Jeffrey Walton
@ 2020-06-25 8:15 ` Szabolcs Nagy
2020-06-25 15:39 ` Rich Felker
1 sibling, 1 reply; 16+ messages in thread
From: Szabolcs Nagy @ 2020-06-25 8:15 UTC (permalink / raw)
To: Rich Felker; +Cc: musl
* Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:
> I'm about to do last work of merging mallocng, followed soon by
> release. Is there anything in the way of overlooked bug reports or
> patches that should still be addressed in this release cycle?
>
> Things I'm aware of:
>
> - "Proposal to match behaviour of gethostbyname to glibc". Latest
> patch is probably ok, but could be deferred to after release.
>
> - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> without time to test, but replacing sqrtl.c could be appropriate
> since the current one is badly broken on archs with ld wider than
> double. However it would need to accept ld80 in order not to be
> build-breaking on m68k, or m68k would need an alternative.
that's still under work
but it would be nice if we could get the aarch64
memcpy patch in (the c implementation is really
slow and i've seen ppl compare aarch64 vs x86
server performance with some benchmark on alpine..)
>
> and some more with open questions or work to be done that can't be
> finished now but should be revisited after release:
>
> - fenv overhaul (sorry for dropping this, Damian)
> - PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP
> - _SC_NPROCESSORS_{CONF,ONLN} behavior
> - hexagon port
> - rv32 port
> - arm fdpic (newly revived interest from users on list)
> - dni (dynamic linking without PT_INTERP absolute path) & related ldso
> work by rcombs
> - "lutimes: Add checks for input parameters"
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-25 8:15 ` Szabolcs Nagy
@ 2020-06-25 15:39 ` Rich Felker
2020-06-25 17:31 ` Szabolcs Nagy
0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-06-25 15:39 UTC (permalink / raw)
To: musl
On Thu, Jun 25, 2020 at 10:15:04AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:
>
> > I'm about to do last work of merging mallocng, followed soon by
> > release. Is there anything in the way of overlooked bug reports or
> > patches that should still be addressed in this release cycle?
> >
> > Things I'm aware of:
> >
> > - "Proposal to match behaviour of gethostbyname to glibc". Latest
> > patch is probably ok, but could be deferred to after release.
> >
> > - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> > without time to test, but replacing sqrtl.c could be appropriate
> > since the current one is badly broken on archs with ld wider than
> > double. However it would need to accept ld80 in order not to be
> > build-breaking on m68k, or m68k would need an alternative.
>
> that's still under work
Won't it work just to make it decode/encode the ldshape, and otherwise
use exactly the same code? Or are there double-rounding issues if the
quad code is used with ld80?
> but it would be nice if we could get the aarch64
> memcpy patch in (the c implementation is really
> slow and i've seen ppl compare aarch64 vs x86
> server performance with some benchmark on alpine..)
OK, I'll look again.
Rich
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-25 15:39 ` Rich Felker
@ 2020-06-25 17:31 ` Szabolcs Nagy
2020-06-25 20:50 ` Rich Felker
0 siblings, 1 reply; 16+ messages in thread
From: Szabolcs Nagy @ 2020-06-25 17:31 UTC (permalink / raw)
To: Rich Felker; +Cc: musl
* Rich Felker <dalias@libc.org> [2020-06-25 11:39:36 -0400]:
> On Thu, Jun 25, 2020 at 10:15:04AM +0200, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:
> >
> > > I'm about to do last work of merging mallocng, followed soon by
> > > release. Is there anything in the way of overlooked bug reports or
> > > patches that should still be addressed in this release cycle?
> > >
> > > Things I'm aware of:
> > >
> > > - "Proposal to match behaviour of gethostbyname to glibc". Latest
> > > patch is probably ok, but could be deferred to after release.
> > >
> > > - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> > > without time to test, but replacing sqrtl.c could be appropriate
> > > since the current one is badly broken on archs with ld wider than
> > > double. However it would need to accept ld80 in order not to be
> > > build-breaking on m68k, or m68k would need an alternative.
> >
> > that's still under work
>
> Won't it work just to make it decode/encode the ldshape, and otherwise
> use exactly the same code? Or are there double-rounding issues if the
> quad code is used with ld80?
i think the same code may work for ld80 too,
but i'm still testing the single/double/quad
code, it's not ready for inclusion.
> > but it would be nice if we could get the aarch64
> > memcpy patch in (the c implementation is really
> > slow and i've seen ppl compare aarch64 vs x86
> > server performance with some benchmark on alpine..)
>
> OK, I'll look again.
thanks.
(there are more aarch64 string functions in the
optimized-routines github repo but i think they
are not as important as memcpy/memmove/memset)
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-25 17:31 ` Szabolcs Nagy
@ 2020-06-25 20:50 ` Rich Felker
2020-06-25 21:15 ` Rich Felker
2020-06-25 21:43 ` Andre McCurdy
0 siblings, 2 replies; 16+ messages in thread
From: Rich Felker @ 2020-06-25 20:50 UTC (permalink / raw)
To: musl
On Thu, Jun 25, 2020 at 07:31:25PM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-06-25 11:39:36 -0400]:
>
> > On Thu, Jun 25, 2020 at 10:15:04AM +0200, Szabolcs Nagy wrote:
> > > * Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:
> > >
> > > > I'm about to do last work of merging mallocng, followed soon by
> > > > release. Is there anything in the way of overlooked bug reports or
> > > > patches that should still be addressed in this release cycle?
> > > >
> > > > Things I'm aware of:
> > > >
> > > > - "Proposal to match behaviour of gethostbyname to glibc". Latest
> > > > patch is probably ok, but could be deferred to after release.
> > > >
> > > > - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> > > > without time to test, but replacing sqrtl.c could be appropriate
> > > > since the current one is badly broken on archs with ld wider than
> > > > double. However it would need to accept ld80 in order not to be
> > > > build-breaking on m68k, or m68k would need an alternative.
> > >
> > > that's still under work
> >
> > Won't it work just to make it decode/encode the ldshape, and otherwise
> > use exactly the same code? Or are there double-rounding issues if the
> > quad code is used with ld80?
>
> i think the same code may work for ld80 too,
> but i'm still testing the single/double/quad
> code, it's not ready for inclusion.
OK. I had in mind possibly adding just sqrtl.c since it can't really
be worse than what we have now. But I'm ok with waiting too.
One alternative to getting it working for ld80 right away would be
just adding an asm version of sqrtl for m68k. However we have users
who've indicated an interest in disabling asm optimizations (see
thread "build: allow forcing generic implementations of library
functions") so in the long term I think we should aim for all generic
math functions to work on all ld formats and FLT_EVAL_METHOD rather
than just assuming they get replaced on i386/x86_64 and m68k.
> > > but it would be nice if we could get the aarch64
> > > memcpy patch in (the c implementation is really
> > > slow and i've seen ppl compare aarch64 vs x86
> > > server performance with some benchmark on alpine..)
> >
> > OK, I'll look again.
>
> thanks.
>
> (there are more aarch64 string functions in the
> optimized-routines github repo but i think they
> are not as important as memcpy/memmove/memset)
I found the code. Can you comment on performance and whether memset is
needed? (The C memset should be rather good already, more so than
memcpy.)
As noted in the past I'd like to get rid of having high level flow
logic in the arch asm and instead have the arch provide string asm
fragments, if desired, to copy blocks, which could then be used in a
shared C skeleton. However as you noted this has been a point of
practical performance problem for a long time and I don't think it's
fair to just keep putting it off for a better solution.
Rich
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-25 20:50 ` Rich Felker
@ 2020-06-25 21:15 ` Rich Felker
2020-06-26 1:20 ` Rich Felker
2020-06-25 21:43 ` Andre McCurdy
1 sibling, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-06-25 21:15 UTC (permalink / raw)
To: musl
On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > but it would be nice if we could get the aarch64
> > > > memcpy patch in (the c implementation is really
> > > > slow and i've seen ppl compare aarch64 vs x86
> > > > server performance with some benchmark on alpine..)
> > >
> > > OK, I'll look again.
> >
> > thanks.
> >
> > (there are more aarch64 string functions in the
> > optimized-routines github repo but i think they
> > are not as important as memcpy/memmove/memset)
>
> I found the code. Can you comment on performance and whether memset is
> needed? (The C memset should be rather good already, more so than
> memcpy.)
Are the assumptions (v8-a, unaligned access) documented in memcpy.S
valid for all presently supportable aarch64?
A couple comments for merging if we do, that aren't hard requirements
but preferences:
- I'd like to expand out the macros from ../asmdefs.h since that won't
be available and they just hide things (I guess they're attractive
for Apple/macho users or something but not relevant to musl) and
since the symbol name lines need to be changed anyway to public
name. "Local var name" macros are ok to leave; changing them would
be too error-prone and they make the code more readable anyway.
- I'd prefer not to have memmove logic in memcpy since it makes it
larger and implies that misuse of memcpy when you mean memmove is
supported usage. I'd be happy with an approach like x86 though,
defining an __memcpy_fwd alias and having memmove tail call to that
unless len>128 and reverse is needed, or just leaving memmove.c.
Rich
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-25 20:50 ` Rich Felker
2020-06-25 21:15 ` Rich Felker
@ 2020-06-25 21:43 ` Andre McCurdy
2020-06-25 21:51 ` Rich Felker
1 sibling, 1 reply; 16+ messages in thread
From: Andre McCurdy @ 2020-06-25 21:43 UTC (permalink / raw)
To: musl
On Thu, Jun 25, 2020 at 1:50 PM Rich Felker <dalias@libc.org> wrote:
>
> As noted in the past I'd like to get rid of having high level flow
> logic in the arch asm and instead have the arch provide string asm
> fragments, if desired, to copy blocks, which could then be used in a
> shared C skeleton. However as you noted this has been a point of
> practical performance problem for a long time and I don't think it's
> fair to just keep putting it off for a better solution.
I'd like to see the patches to enable asm memcpy for big endian ARM
merged. I may be the only user of musl on big endian ARM though (?) so
not sure how much wider interest there is.
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-25 21:43 ` Andre McCurdy
@ 2020-06-25 21:51 ` Rich Felker
2020-06-25 22:03 ` Andre McCurdy
0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-06-25 21:51 UTC (permalink / raw)
To: musl
On Thu, Jun 25, 2020 at 02:43:42PM -0700, Andre McCurdy wrote:
> On Thu, Jun 25, 2020 at 1:50 PM Rich Felker <dalias@libc.org> wrote:
> >
> > As noted in the past I'd like to get rid of having high level flow
> > logic in the arch asm and instead have the arch provide string asm
> > fragments, if desired, to copy blocks, which could then be used in a
> > shared C skeleton. However as you noted this has been a point of
> > practical performance problem for a long time and I don't think it's
> > fair to just keep putting it off for a better solution.
>
> I'd like to see the patches to enable asm memcpy for big endian ARM
> merged. I may be the only user of musl on big endian ARM though (?) so
> not sure how much wider interest there is.
I'd forgotten I hadn't already merged it. However I was just rereading
it and something looks amiss. Can you take a look again?
Rich
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-25 21:51 ` Rich Felker
@ 2020-06-25 22:03 ` Andre McCurdy
0 siblings, 0 replies; 16+ messages in thread
From: Andre McCurdy @ 2020-06-25 22:03 UTC (permalink / raw)
To: musl
On Thu, Jun 25, 2020 at 2:51 PM Rich Felker <dalias@libc.org> wrote:
>
> On Thu, Jun 25, 2020 at 02:43:42PM -0700, Andre McCurdy wrote:
> > On Thu, Jun 25, 2020 at 1:50 PM Rich Felker <dalias@libc.org> wrote:
> > >
> > > As noted in the past I'd like to get rid of having high level flow
> > > logic in the arch asm and instead have the arch provide string asm
> > > fragments, if desired, to copy blocks, which could then be used in a
> > > shared C skeleton. However as you noted this has been a point of
> > > practical performance problem for a long time and I don't think it's
> > > fair to just keep putting it off for a better solution.
> >
> > I'd like to see the patches to enable asm memcpy for big endian ARM
> > merged. I may be the only user of musl on big endian ARM though (?) so
> > not sure how much wider interest there is.
>
> I'd forgotten I hadn't already merged it. However I was just rereading
> it and something looks amiss. Can you take a look again?
Is there anything in particular that looks wrong?
The most recent version of the patch still applies cleanly to master.
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-25 21:15 ` Rich Felker
@ 2020-06-26 1:20 ` Rich Felker
2020-06-26 8:40 ` Szabolcs Nagy
0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-06-26 1:20 UTC (permalink / raw)
To: musl
[-- Attachment #1: Type: text/plain, Size: 1772 bytes --]
On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > but it would be nice if we could get the aarch64
> > > > > memcpy patch in (the c implementation is really
> > > > > slow and i've seen ppl compare aarch64 vs x86
> > > > > server performance with some benchmark on alpine..)
> > > >
> > > > OK, I'll look again.
> > >
> > > thanks.
> > >
> > > (there are more aarch64 string functions in the
> > > optimized-routines github repo but i think they
> > > are not as important as memcpy/memmove/memset)
> >
> > I found the code. Can you comment on performance and whether memset is
> > needed? (The C memset should be rather good already, more so than
> > memcpy.)
>
> Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> valid for all presently supportable aarch64?
>
> A couple comments for merging if we do, that aren't hard requirements
> but preferences:
>
> - I'd like to expand out the macros from ../asmdefs.h since that won't
> be available and they just hide things (I guess they're attractive
> for Apple/macho users or something but not relevant to musl) and
> since the symbol name lines need to be changed anyway to public
> name. "Local var name" macros are ok to leave; changing them would
> be too error-prone and they make the code more readable anyway.
>
> - I'd prefer not to have memmove logic in memcpy since it makes it
> larger and implies that misuse of memcpy when you mean memmove is
> supported usage. I'd be happy with an approach like x86 though,
> defining an __memcpy_fwd alias and having memmove tail call to that
> unless len>128 and reverse is needed, or just leaving memmove.c.
Something like the attached.
Rich
[-- Attachment #2: memcpy.S --]
[-- Type: text/plain, Size: 4082 bytes --]
/*
* memcpy - copy memory area
*
* Copyright (c) 2012-2020, Arm Limited.
* SPDX-License-Identifier: MIT
*/
/* Assumptions:
*
* ARMv8-a, AArch64, unaligned accesses.
*
*/
#define dstin x0
#define src x1
#define count x2
#define dst x3
#define srcend x4
#define dstend x5
#define A_l x6
#define A_lw w6
#define A_h x7
#define B_l x8
#define B_lw w8
#define B_h x9
#define C_l x10
#define C_lw w10
#define C_h x11
#define D_l x12
#define D_h x13
#define E_l x14
#define E_h x15
#define F_l x16
#define F_h x17
#define G_l count
#define G_h dst
#define H_l src
#define H_h srcend
#define tmp1 x14
/* This implementation handles overlaps and supports both memcpy and memmove
from a single entry point. It uses unaligned accesses and branchless
sequences to keep the code small, simple and improve performance.
Copies are split into 3 main cases: small copies of up to 32 bytes, medium
copies of up to 128 bytes, and large copies. The overhead of the overlap
check is negligible since it is only required for large copies.
Large copies use a software pipelined loop processing 64 bytes per iteration.
The destination pointer is 16-byte aligned to minimize unaligned accesses.
The loop tail is handled by always copying 64 bytes from the end.
*/
.global memcpy
.type memcpy,%function
memcpy:
	add	srcend, src, count
	add	dstend, dstin, count
	cmp	count, 128
	b.hi	.Lcopy_long
	cmp	count, 32
	b.hi	.Lcopy32_128

	/* Small copies: 0..32 bytes. */
	cmp	count, 16
	b.lo	.Lcopy16
	ldp	A_l, A_h, [src]
	ldp	D_l, D_h, [srcend, -16]
	stp	A_l, A_h, [dstin]
	stp	D_l, D_h, [dstend, -16]
	ret

	/* Copy 8-15 bytes. */
.Lcopy16:
	tbz	count, 3, .Lcopy8
	ldr	A_l, [src]
	ldr	A_h, [srcend, -8]
	str	A_l, [dstin]
	str	A_h, [dstend, -8]
	ret

	.p2align 3
	/* Copy 4-7 bytes. */
.Lcopy8:
	tbz	count, 2, .Lcopy4
	ldr	A_lw, [src]
	ldr	B_lw, [srcend, -4]
	str	A_lw, [dstin]
	str	B_lw, [dstend, -4]
	ret

	/* Copy 0..3 bytes using a branchless sequence. */
.Lcopy4:
	cbz	count, .Lcopy0
	lsr	tmp1, count, 1
	ldrb	A_lw, [src]
	ldrb	C_lw, [srcend, -1]
	ldrb	B_lw, [src, tmp1]
	strb	A_lw, [dstin]
	strb	B_lw, [dstin, tmp1]
	strb	C_lw, [dstend, -1]
.Lcopy0:
	ret

	.p2align 4
	/* Medium copies: 33..128 bytes. */
.Lcopy32_128:
	ldp	A_l, A_h, [src]
	ldp	B_l, B_h, [src, 16]
	ldp	C_l, C_h, [srcend, -32]
	ldp	D_l, D_h, [srcend, -16]
	cmp	count, 64
	b.hi	.Lcopy128
	stp	A_l, A_h, [dstin]
	stp	B_l, B_h, [dstin, 16]
	stp	C_l, C_h, [dstend, -32]
	stp	D_l, D_h, [dstend, -16]
	ret

	.p2align 4
	/* Copy 65..128 bytes. */
.Lcopy128:
	ldp	E_l, E_h, [src, 32]
	ldp	F_l, F_h, [src, 48]
	cmp	count, 96
	b.ls	.Lcopy96
	ldp	G_l, G_h, [srcend, -64]
	ldp	H_l, H_h, [srcend, -48]
	stp	G_l, G_h, [dstend, -64]
	stp	H_l, H_h, [dstend, -48]
.Lcopy96:
	stp	A_l, A_h, [dstin]
	stp	B_l, B_h, [dstin, 16]
	stp	E_l, E_h, [dstin, 32]
	stp	F_l, F_h, [dstin, 48]
	stp	C_l, C_h, [dstend, -32]
	stp	D_l, D_h, [dstend, -16]
	ret

	.p2align 4
	/* Copy more than 128 bytes. */
.Lcopy_long:
	/* Copy 16 bytes and then align dst to 16-byte alignment. */
	ldp	D_l, D_h, [src]
	and	tmp1, dstin, 15
	bic	dst, dstin, 15
	sub	src, src, tmp1
	add	count, count, tmp1	/* Count is now 16 too large. */
	ldp	A_l, A_h, [src, 16]
	stp	D_l, D_h, [dstin]
	ldp	B_l, B_h, [src, 32]
	ldp	C_l, C_h, [src, 48]
	ldp	D_l, D_h, [src, 64]!
	subs	count, count, 128 + 16	/* Test and readjust count. */
	b.ls	.Lcopy64_from_end
.Lloop64:
	stp	A_l, A_h, [dst, 16]
	ldp	A_l, A_h, [src, 16]
	stp	B_l, B_h, [dst, 32]
	ldp	B_l, B_h, [src, 32]
	stp	C_l, C_h, [dst, 48]
	ldp	C_l, C_h, [src, 48]
	stp	D_l, D_h, [dst, 64]!
	ldp	D_l, D_h, [src, 64]!
	subs	count, count, 64
	b.hi	.Lloop64

	/* Write the last iteration and copy 64 bytes from the end. */
.Lcopy64_from_end:
	ldp	E_l, E_h, [srcend, -64]
	stp	A_l, A_h, [dst, 16]
	ldp	A_l, A_h, [srcend, -48]
	stp	B_l, B_h, [dst, 32]
	ldp	B_l, B_h, [srcend, -32]
	stp	C_l, C_h, [dst, 48]
	ldp	C_l, C_h, [srcend, -16]
	stp	D_l, D_h, [dst, 64]
	stp	E_l, E_h, [dstend, -64]
	stp	A_l, A_h, [dstend, -48]
	stp	B_l, B_h, [dstend, -32]
	stp	C_l, C_h, [dstend, -16]
	ret

.size memcpy,.-memcpy
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-26 1:20 ` Rich Felker
@ 2020-06-26 8:40 ` Szabolcs Nagy
2020-07-06 22:12 ` Rich Felker
0 siblings, 1 reply; 16+ messages in thread
From: Szabolcs Nagy @ 2020-06-26 8:40 UTC (permalink / raw)
To: Rich Felker; +Cc: musl
* Rich Felker <dalias@libc.org> [2020-06-25 21:20:06 -0400]:
> On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> > On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > > but it would be nice if we could get the aarch64
> > > > > > memcpy patch in (the c implementation is really
> > > > > > slow and i've seen ppl compare aarch64 vs x86
> > > > > > server performance with some benchmark on alpine..)
> > > > >
> > > > > OK, I'll look again.
> > > >
> > > > thanks.
> > > >
> > > > (there are more aarch64 string functions in the
> > > > optimized-routines github repo but i think they
> > > > are not as important as memcpy/memmove/memset)
> > >
> > > I found the code. Can you comment on performance and whether memset is
> > > needed? (The C memset should be rather good already, more so than
> > > memcpy.)
the asm seems faster in all measurements but there is
a lot of variance with different size/alignment cases.
the avg improvement on typical workload and the possible
improvements across various cases and cores i'd expect:
memcpy typical: 1.6x-1.7x
memcpy possible: 1.2x-3.1x
memset typical: 1.1x-1.4x
memset possible: 1.0x-2.6x
> > Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> > valid for all presently supportable aarch64?
yes, unaligned access on normal memory in userspace
is valid (part of the base abi on linux).
iirc a core can be configured to trap unaligned access
and it is not valid on device memory so e.g. such
memcpy would not work in the kernel. but avoiding
unaligned access in memcpy is not enough to fix that,
the compiler will generate unaligned load for
int f(char *p)
{
	int i;
	__builtin_memcpy(&i,p,sizeof i);
	return i;
}
> >
> > A couple comments for merging if we do, that aren't hard requirements
> > but preferences:
> >
> > - I'd like to expand out the macros from ../asmdefs.h since that won't
> > be available and they just hide things (I guess they're attractive
> > for Apple/macho users or something but not relevant to musl) and
> > since the symbol name lines need to be changed anyway to public
> > name. "Local var name" macros are ok to leave; changing them would
> > be too error-prone and they make the code more readable anyway.
the weird macros are there so the code is similar to glibc
asm code (which adds cfi annotation and optionally adds
profile hooks to entry etc)
> >
> > - I'd prefer not to have memmove logic in memcpy since it makes it
> > larger and implies that misuse of memcpy when you mean memmove is
> > supported usage. I'd be happy with an approach like x86 though,
> > defining an __memcpy_fwd alias and having memmove tail call to that
> > unless len>128 and reverse is needed, or just leaving memmove.c.
in principle the code should be called memmove, not memcpy,
since it satisfies the memmove contract, which of course
works for memcpy too. so tail calling memmove from memcpy
makes more sense but memcpy is more performance critical
than memmove, so we probably should not add extra branches
there..
>
> Something like the attached.
looks good to me.
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-06-26 8:40 ` Szabolcs Nagy
@ 2020-07-06 22:12 ` Rich Felker
2020-07-07 15:00 ` Szabolcs Nagy
0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-07-06 22:12 UTC (permalink / raw)
To: musl
On Fri, Jun 26, 2020 at 10:40:49AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-06-25 21:20:06 -0400]:
> > On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> > > On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > > > but it would be nice if we could get the aarch64
> > > > > > > memcpy patch in (the c implementation is really
> > > > > > > slow and i've seen ppl compare aarch64 vs x86
> > > > > > > server performance with some benchmark on alpine..)
> > > > > >
> > > > > > OK, I'll look again.
> > > > >
> > > > > thanks.
> > > > >
> > > > > (there are more aarch64 string functions in the
> > > > > optimized-routines github repo but i think they
> > > > > are not as important as memcpy/memmove/memset)
> > > >
> > > > I found the code. Can you comment on performance and whether memset is
> > > > needed? (The C memset should be rather good already, more so than
> > > > memcpy.)
>
> the asm seems faster in all measurements but there is
> a lot of variance with different size/alignment cases.
>
> the avg improvement on typical workload and the possible
> improvements across various cases and cores i'd expect:
>
> memcpy typical: 1.6x-1.7x
> memcpy possible: 1.2x-3.1x
>
> memset typical: 1.1x-1.4x
> memset possible: 1.0x-2.6x
>
> > > Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> > > valid for all presently supportable aarch64?
>
> yes, unaligned access on normal memory in userspace
> is valid (part of the base abi on linux).
>
> iirc a core can be configured to trap unaligned access
> and it is not valid on device memory so e.g. such
> memcpy would not work in the kernel. but avoiding
> unaligned access in memcpy is not enough to fix that,
> the compiler will generate unaligned load for
>
> int f(char *p)
> {
> 	int i;
> 	__builtin_memcpy(&i,p,sizeof i);
> 	return i;
> }
>
> > >
> > > A couple comments for merging if we do, that aren't hard requirements
> > > but preferences:
> > >
> > > - I'd like to expand out the macros from ../asmdefs.h since that won't
> > > be available and they just hide things (I guess they're attractive
> > > for Apple/macho users or something but not relevant to musl) and
> > > since the symbol name lines need to be changed anyway to public
> > > name. "Local var name" macros are ok to leave; changing them would
> > > be too error-prone and they make the code more readable anyway.
>
> the weird macros are there so the code is similar to glibc
> asm code (which adds cfi annotation and optionally adds
> profile hooks to entry etc)
>
> > >
> > > - I'd prefer not to have memmove logic in memcpy since it makes it
> > > larger and implies that misuse of memcpy when you mean memmove is
> > > supported usage. I'd be happy with an approach like x86 though,
> > > defining an __memcpy_fwd alias and having memmove tail call to that
> > > unless len>128 and reverse is needed, or just leaving memmove.c.
>
> in principle the code should be called memmove, not memcpy,
> since it satisfies the memmove contract, which of course
> works for memcpy too. so tail calling memmove from memcpy
> makes more sense but memcpy is more performance critical
> than memmove, so we probably should not add extra branches
> there..
>
> >
> > Something like the attached.
>
> looks good to me.
I think you saw already, but just to make it clear on the list too,
it's upstream now. I'm open to further improvements like doing
memmove (either as a separate copy of the full implementation or some
minimal branch-to-__memcpy_fwd approach) but I think what's already
there is sufficient to solve the main practical performance issues
users were hitting that made aarch64 look bad in relation to x86_64.
I'd still like to revisit the topic of minimizing the per-arch code
needed for this so that all archs can benefit from the basic logic,
too.
Rich
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-07-06 22:12 ` Rich Felker
@ 2020-07-07 15:00 ` Szabolcs Nagy
2020-07-07 17:22 ` Rich Felker
0 siblings, 1 reply; 16+ messages in thread
From: Szabolcs Nagy @ 2020-07-07 15:00 UTC (permalink / raw)
To: Rich Felker; +Cc: musl
* Rich Felker <dalias@libc.org> [2020-07-06 18:12:43 -0400]:
> I think you saw already, but just to make it clear on the list too,
> it's upstream now. I'm open to further improvements like doing
> memmove (either as a separate copy of the full implementation or some
> minimal branch-to-__memcpy_fwd approach) but I think what's already
> there is sufficient to solve the main practical performance issues
> users were hitting that made aarch64 look bad in relation to x86_64.
>
> I'd still like to revisit the topic of minimizing the per-arch code
> needed for this so that all archs can benefit from the basic logic,
> too.
thanks.

note that the code has some internal .p2align
directives that assume the entry is aligned to
some large alignment (.p2align 6 in the original
code).

i think it would be better to keep the entry
aligned (though i don't know if it makes a big
difference on any existing core; it's more
for consistency with upstream).

musl normally does not align function entries,
but for a few select functions it is probably
not too much overhead?
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-07-07 15:00 ` Szabolcs Nagy
@ 2020-07-07 17:22 ` Rich Felker
2020-07-07 18:20 ` Szabolcs Nagy
0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2020-07-07 17:22 UTC (permalink / raw)
To: musl
On Tue, Jul 07, 2020 at 05:00:20PM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-07-06 18:12:43 -0400]:
> > I think you saw already, but just to make it clear on the list too,
> > it's upstream now. I'm open to further improvements like doing
> > memmove (either as a separate copy of the full implementation or some
> > minimal branch-to-__memcpy_fwd approach) but I think what's already
> > there is sufficient to solve the main practical performance issues
> > users were hitting that made aarch64 look bad in relation to x86_64.
> >
> > I'd still like to revisit the topic of minimizing the per-arch code
> > needed for this so that all archs can benefit from the basic logic,
> > too.
>
> thanks.
>
> note that the code has some internal .p2align
> directives that assume the entry is aligned to
> some large alignment (.p2align 6 in orig code)
>
> i think it would be better to keep the entry
> aligned (but i don't know if it makes a big
> difference on some existing core, it's more
> for consistency with upstream).
>
> musl normally does not align function entries
> but for a few select functions it is probably
> not too much overhead?
I was under the impression that any .p2align N in the section
inherently aligns the whole section as if it started with .p2align N,
in which case not writing it explicitly just avoids redundancy and
makes sure you don't actually have an initial alignment that's larger
than any alignment actually wanted later. Is this incorrect?
(To be incorrect I think it would have to do some fancy
elastic-section-contents hack, but maybe aarch64 ELF object ABI has
that..?)
Rich
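The gas behavior in question can be sketched as follows (directive placement is illustrative, not musl's actual memcpy source):

```asm
	.section .text.memcpy,"ax",%progbits
	.global memcpy
	.type memcpy,%function
memcpy:			// no .p2align written at the entry
	// ... entry sequence ...
	.p2align 6	// pads here to a 64-byte boundary, and also
			// raises the section's sh_addralign to 64;
			// the linker then places the whole section on
			// a 64-byte boundary, so the entry at offset 0
			// ends up 64-byte aligned as well
	// ... hot loop ...
```

That is, the assembler records the largest alignment requested anywhere in the section as the section's alignment, so internal .p2align offsets survive linking and an explicit alignment at the entry would be redundant.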
* Re: [musl] Release prep for 1.2.1, and afterwards
2020-07-07 17:22 ` Rich Felker
@ 2020-07-07 18:20 ` Szabolcs Nagy
0 siblings, 0 replies; 16+ messages in thread
From: Szabolcs Nagy @ 2020-07-07 18:20 UTC (permalink / raw)
To: Rich Felker; +Cc: musl
* Rich Felker <dalias@libc.org> [2020-07-07 13:22:57 -0400]:
> On Tue, Jul 07, 2020 at 05:00:20PM +0200, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2020-07-06 18:12:43 -0400]:
> > > I think you saw already, but just to make it clear on the list too,
> > > it's upstream now. I'm open to further improvements like doing
> > > memmove (either as a separate copy of the full implementation or some
> > > minimal branch-to-__memcpy_fwd approach) but I think what's already
> > > there is sufficient to solve the main practical performance issues
> > > users were hitting that made aarch64 look bad in relation to x86_64.
> > >
> > > I'd still like to revisit the topic of minimizing the per-arch code
> > > needed for this so that all archs can benefit from the basic logic,
> > > too.
> >
> > thanks.
> >
> > note that the code has some internal .p2align
> > directives that assume the entry is aligned to
> > some large alignment (.p2align 6 in orig code)
> >
> > i think it would be better to keep the entry
> > aligned (but i don't know if it makes a big
> > difference on some existing core, it's more
> > for consistency with upstream).
> >
> > musl normally does not align function entries
> > but for a few select functions it is probably
> > not too much overhead?
>
> I was under the impression that any .p2align N in the section
> inherently aligns the whole section as if it started with .p2align N,
> in which case not writing it explicitly just avoids redundancy and
> makes sure you don't actually have an initial alignment that's larger
> than any alignment actually wanted later. Is this incorrect?
>
> (To be incorrect I think it would have to do some fancy
> elastic-section-contents hack, but maybe aarch64 ELF object ABI has
> that..?)
ah you are right, then everything is fine i guess.
Thread overview: 16+ messages
2020-06-24 20:42 [musl] Release prep for 1.2.1, and afterwards Rich Felker
2020-06-24 22:39 ` Jeffrey Walton
2020-06-25 8:15 ` Szabolcs Nagy
2020-06-25 15:39 ` Rich Felker
2020-06-25 17:31 ` Szabolcs Nagy
2020-06-25 20:50 ` Rich Felker
2020-06-25 21:15 ` Rich Felker
2020-06-26 1:20 ` Rich Felker
2020-06-26 8:40 ` Szabolcs Nagy
2020-07-06 22:12 ` Rich Felker
2020-07-07 15:00 ` Szabolcs Nagy
2020-07-07 17:22 ` Rich Felker
2020-07-07 18:20 ` Szabolcs Nagy
2020-06-25 21:43 ` Andre McCurdy
2020-06-25 21:51 ` Rich Felker
2020-06-25 22:03 ` Andre McCurdy
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/