* [musl] Release prep for 1.2.1, and afterwards
@ 2020-06-24 20:42 Rich Felker
  2020-06-24 22:39 ` Jeffrey Walton
  2020-06-25  8:15 ` Szabolcs Nagy
  0 siblings, 2 replies; 16+ messages in thread

From: Rich Felker @ 2020-06-24 20:42 UTC (permalink / raw)
To: musl

I'm about to do last work of merging mallocng, followed soon by
release. Is there anything in the way of overlooked bug reports or
patches that should still be addressed in this release cycle?

Things I'm aware of:

- "Proposal to match behaviour of gethostbyname to glibc". Latest
  patch is probably ok, but could be deferred to after release.

- nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
  without time to test, but replacing sqrtl.c could be appropriate
  since the current one is badly broken on archs with ld wider than
  double. However it would need to accept ld80 in order not to be
  build-breaking on m68k, or m68k would need an alternative.

and some more with open questions or work to be done that can't be
finished now but should be revisited after release:

- fenv overhaul (sorry for dropping this, Damian)
- PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP
- _SC_NPROCESSORS_{CONF,ONLN} behavior
- hexagon port
- rv32 port
- arm fdpic (newly revived interest from users on list)
- dni (dynamic linking without PT_INTERP absolute path) & related ldso
  work by rcombs
- "lutimes: Add checks for input parameters"

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-24 20:42 [musl] Release prep for 1.2.1, and afterwards Rich Felker
@ 2020-06-24 22:39 ` Jeffrey Walton
  2020-06-25  8:15 ` Szabolcs Nagy
  1 sibling, 0 replies; 16+ messages in thread

From: Jeffrey Walton @ 2020-06-24 22:39 UTC (permalink / raw)
To: musl

On Wed, Jun 24, 2020 at 4:58 PM Rich Felker <dalias@libc.org> wrote:
>
> I'm about to do last work of merging mallocng, followed soon by
> release. Is there anything in the way of overlooked bug reports or
> patches that should still be addressed in this release cycle?
>
> Things I'm aware of:
>
> - "Proposal to match behaviour of gethostbyname to glibc". Latest
>   patch is probably ok, but could be deferred to after release.
>
> - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
>   without time to test, but replacing sqrtl.c could be appropriate
>   since the current one is badly broken on archs with ld wider than
>   double. However it would need to accept ld80 in order not to be
>   build-breaking on m68k, or m68k would need an alternative.
>
> and some more with open questions or work to be done that can't be
> finished now but should be revisited after release:
>
> - fenv overhaul (sorry for dropping this, Damian)
> - PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP
> - _SC_NPROCESSORS_{CONF,ONLN} behavior
> - hexagon port
> - rv32 port
> - arm fdpic (newly revived interest from users on list)
> - dni (dynamic linking without PT_INTERP absolute path) & related ldso
>   work by rcombs
> - "lutimes: Add checks for input parameters"

It would be nice to see runpath logic loosened up a bit. That is,
don't reject multiple runpaths if one is bad.

This is needed for packages like Perl. Perl screws up rpaths and
runpaths badly. Perl does not escape origin-based paths properly when
setting them in a makefile. Worse, Perl builds makefiles on the fly,
so we cannot manually fix the makefiles after configure.

Jeff
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-24 20:42 [musl] Release prep for 1.2.1, and afterwards Rich Felker
  2020-06-24 22:39 ` Jeffrey Walton
@ 2020-06-25  8:15 ` Szabolcs Nagy
  2020-06-25 15:39 ` Rich Felker
  1 sibling, 1 reply; 16+ messages in thread

From: Szabolcs Nagy @ 2020-06-25 8:15 UTC (permalink / raw)
To: Rich Felker; +Cc: musl

* Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:

> I'm about to do last work of merging mallocng, followed soon by
> release. Is there anything in the way of overlooked bug reports or
> patches that should still be addressed in this release cycle?
>
> Things I'm aware of:
>
> - "Proposal to match behaviour of gethostbyname to glibc". Latest
>   patch is probably ok, but could be deferred to after release.
>
> - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
>   without time to test, but replacing sqrtl.c could be appropriate
>   since the current one is badly broken on archs with ld wider than
>   double. However it would need to accept ld80 in order not to be
>   build-breaking on m68k, or m68k would need an alternative.

that's still under work

but it would be nice if we could get the aarch64
memcpy patch in (the c implementation is really
slow and i've seen ppl compare aarch64 vs x86
server performance with some benchmark on alpine..)

>
> and some more with open questions or work to be done that can't be
> finished now but should be revisited after release:
>
> - fenv overhaul (sorry for dropping this, Damian)
> - PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP
> - _SC_NPROCESSORS_{CONF,ONLN} behavior
> - hexagon port
> - rv32 port
> - arm fdpic (newly revived interest from users on list)
> - dni (dynamic linking without PT_INTERP absolute path) & related ldso
>   work by rcombs
> - "lutimes: Add checks for input parameters"
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25  8:15 ` Szabolcs Nagy
@ 2020-06-25 15:39 ` Rich Felker
  2020-06-25 17:31 ` Szabolcs Nagy
  0 siblings, 1 reply; 16+ messages in thread

From: Rich Felker @ 2020-06-25 15:39 UTC (permalink / raw)
To: musl

On Thu, Jun 25, 2020 at 10:15:04AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:
>
> > I'm about to do last work of merging mallocng, followed soon by
> > release. Is there anything in the way of overlooked bug reports or
> > patches that should still be addressed in this release cycle?
> >
> > Things I'm aware of:
> >
> > - "Proposal to match behaviour of gethostbyname to glibc". Latest
> >   patch is probably ok, but could be deferred to after release.
> >
> > - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> >   without time to test, but replacing sqrtl.c could be appropriate
> >   since the current one is badly broken on archs with ld wider than
> >   double. However it would need to accept ld80 in order not to be
> >   build-breaking on m68k, or m68k would need an alternative.
>
> that's still under work

Won't it work just to make it decode/encode the ldshape, and otherwise
use exactly the same code? Or are there double-rounding issues if the
quad code is used with ld80?

> but it would be nice if we could get the aarch64
> memcpy patch in (the c implementation is really
> slow and i've seen ppl compare aarch64 vs x86
> server performance with some benchmark on alpine..)

OK, I'll look again.

Rich
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 15:39 ` Rich Felker
@ 2020-06-25 17:31 ` Szabolcs Nagy
  2020-06-25 20:50 ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread

From: Szabolcs Nagy @ 2020-06-25 17:31 UTC (permalink / raw)
To: Rich Felker; +Cc: musl

* Rich Felker <dalias@libc.org> [2020-06-25 11:39:36 -0400]:

> On Thu, Jun 25, 2020 at 10:15:04AM +0200, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:
> >
> > > I'm about to do last work of merging mallocng, followed soon by
> > > release. Is there anything in the way of overlooked bug reports or
> > > patches that should still be addressed in this release cycle?
> > >
> > > Things I'm aware of:
> > >
> > > - "Proposal to match behaviour of gethostbyname to glibc". Latest
> > >   patch is probably ok, but could be deferred to after release.
> > >
> > > - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> > >   without time to test, but replacing sqrtl.c could be appropriate
> > >   since the current one is badly broken on archs with ld wider than
> > >   double. However it would need to accept ld80 in order not to be
> > >   build-breaking on m68k, or m68k would need an alternative.
> >
> > that's still under work
>
> Won't it work just to make it decode/encode the ldshape, and otherwise
> use exactly the same code? Or are there double-rounding issues if the
> quad code is used with ld80?

i think the same code may work for ld80 too,
but i'm still testing the single/double/quad
code, it's not ready for inclusion.

> > but it would be nice if we could get the aarch64
> > memcpy patch in (the c implementation is really
> > slow and i've seen ppl compare aarch64 vs x86
> > server performance with some benchmark on alpine..)
>
> OK, I'll look again.

thanks.

(there are more aarch64 string functions in the
optimized-routines github repo but i think they
are not as important as memcpy/memmove/memset)
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 17:31 ` Szabolcs Nagy
@ 2020-06-25 20:50 ` Rich Felker
  2020-06-25 21:15 ` Rich Felker
  2020-06-25 21:43 ` Andre McCurdy
  0 siblings, 2 replies; 16+ messages in thread

From: Rich Felker @ 2020-06-25 20:50 UTC (permalink / raw)
To: musl

On Thu, Jun 25, 2020 at 07:31:25PM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-06-25 11:39:36 -0400]:
>
> > On Thu, Jun 25, 2020 at 10:15:04AM +0200, Szabolcs Nagy wrote:
> > > * Rich Felker <dalias@libc.org> [2020-06-24 16:42:44 -0400]:
> > >
> > > > I'm about to do last work of merging mallocng, followed soon by
> > > > release. Is there anything in the way of overlooked bug reports or
> > > > patches that should still be addressed in this release cycle?
> > > >
> > > > Things I'm aware of:
> > > >
> > > > - "Proposal to match behaviour of gethostbyname to glibc". Latest
> > > >   patch is probably ok, but could be deferred to after release.
> > > >
> > > > - nsz's new sqrt{,f,l}. I'm hesitant to do all three right away
> > > >   without time to test, but replacing sqrtl.c could be appropriate
> > > >   since the current one is badly broken on archs with ld wider than
> > > >   double. However it would need to accept ld80 in order not to be
> > > >   build-breaking on m68k, or m68k would need an alternative.
> > >
> > > that's still under work
> >
> > Won't it work just to make it decode/encode the ldshape, and otherwise
> > use exactly the same code? Or are there double-rounding issues if the
> > quad code is used with ld80?
>
> i think the same code may work for ld80 too,
> but i'm still testing the single/double/quad
> code, it's not ready for inclusion.

OK. I had in mind possibly adding just sqrtl.c since it can't really
be worse than what we have now. But I'm ok with waiting too. One
alternative to getting it working for ld80 right away would be just
adding an asm version of sqrtl for m68k.

However we have users who've indicated an interest in disabling asm
optimizations (see thread "build: allow forcing generic
implementations of library functions") so in the long term I think we
should aim for all generic math functions to work on all ld formats
and FLT_EVAL_METHOD rather than just assuming they get replaced on
i386/x86_64 and m68k.

> > > but it would be nice if we could get the aarch64
> > > memcpy patch in (the c implementation is really
> > > slow and i've seen ppl compare aarch64 vs x86
> > > server performance with some benchmark on alpine..)
> >
> > OK, I'll look again.
>
> thanks.
>
> (there are more aarch64 string functions in the
> optimized-routines github repo but i think they
> are not as important as memcpy/memmove/memset)

I found the code. Can you comment on performance and whether memset is
needed? (The C memset should be rather good already, more so than
memcpy.)

As noted in the past I'd like to get rid of having high level flow
logic in the arch asm and instead have the arch provide string asm
fragments, if desired, to copy blocks, which could then be used in a
shared C skeleton. However as you noted this has been a point of
practical performance problem for a long time and I don't think it's
fair to just keep putting it off for a better solution.

Rich
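[Editorial illustration: the "C memset should be rather good already" remark above refers to a generic word-at-a-time implementation. The sketch below shows the general shape of such an implementation; it is not musl's actual src/string/memset.c, and `memset_sketch` is a hypothetical name used here to avoid clashing with the libc symbol.]

```c
#include <stddef.h>
#include <stdint.h>

/* Generic C memset sketch: fill unaligned head bytes, then aligned
 * word-sized stores with the byte replicated across the word, then
 * the tail bytes. */
void *memset_sketch(void *dest, int c, size_t n)
{
	unsigned char *s = dest;

	/* Byte stores until the pointer is word-aligned (or n runs out). */
	for (; n && ((uintptr_t)s & (sizeof(size_t) - 1)); n--)
		*s++ = c;

	/* (size_t)-1/0xff is 0x0101...01; multiplying replicates the byte. */
	size_t word = ((size_t)-1 / 0xff) * (unsigned char)c;
	size_t *w = (size_t *)s;
	for (size_t k = n / sizeof(size_t); k; k--)
		*w++ = word;

	/* Remaining tail bytes. */
	s = (unsigned char *)w;
	for (n %= sizeof(size_t); n; n--)
		*s++ = c;
	return dest;
}
```

A real implementation (including musl's) adds more tricks, e.g. storing head/tail in overlapping chunks to reduce branching; this sketch only shows the alignment-and-replication core.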
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 20:50 ` Rich Felker
@ 2020-06-25 21:15 ` Rich Felker
  2020-06-26  1:20 ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread

From: Rich Felker @ 2020-06-25 21:15 UTC (permalink / raw)
To: musl

On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > but it would be nice if we could get the aarch64
> > > > memcpy patch in (the c implementation is really
> > > > slow and i've seen ppl compare aarch64 vs x86
> > > > server performance with some benchmark on alpine..)
> > >
> > > OK, I'll look again.
> >
> > thanks.
> >
> > (there are more aarch64 string functions in the
> > optimized-routines github repo but i think they
> > are not as important as memcpy/memmove/memset)
>
> I found the code. Can you comment on performance and whether memset is
> needed? (The C memset should be rather good already, more so than
> memcpy.)

Are the assumptions (v8-a, unaligned access) documented in memcpy.S
valid for all presently supportable aarch64?

A couple comments for merging if we do, that aren't hard requirements
but preferences:

- I'd like to expand out the macros from ../asmdefs.h since that won't
  be available and they just hide things (I guess they're attractive
  for Apple/macho users or something but not relevant to musl) and
  since the symbol name lines need to be changed anyway to public
  name. "Local var name" macros are ok to leave; changing them would
  be too error-prone and they make the code more readable anyway.

- I'd prefer not to have memmove logic in memcpy since it makes it
  larger and implies that misuse of memcpy when you mean memmove is
  supported usage. I'd be happy with an approach like x86 though,
  defining an __memcpy_fwd alias and having memmove tail call to that
  unless len>128 and reverse is needed, or just leaving memmove.c.

Rich
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 21:15 ` Rich Felker
@ 2020-06-26  1:20 ` Rich Felker
  2020-06-26  8:40 ` Szabolcs Nagy
  0 siblings, 1 reply; 16+ messages in thread

From: Rich Felker @ 2020-06-26 1:20 UTC (permalink / raw)
To: musl

[-- Attachment #1: Type: text/plain, Size: 1772 bytes --]

On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > but it would be nice if we could get the aarch64
> > > > > memcpy patch in (the c implementation is really
> > > > > slow and i've seen ppl compare aarch64 vs x86
> > > > > server performance with some benchmark on alpine..)
> > > >
> > > > OK, I'll look again.
> > >
> > > thanks.
> > >
> > > (there are more aarch64 string functions in the
> > > optimized-routines github repo but i think they
> > > are not as important as memcpy/memmove/memset)
> >
> > I found the code. Can you comment on performance and whether memset is
> > needed? (The C memset should be rather good already, more so than
> > memcpy.)
>
> Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> valid for all presently supportable aarch64?
>
> A couple comments for merging if we do, that aren't hard requirements
> but preferences:
>
> - I'd like to expand out the macros from ../asmdefs.h since that won't
>   be available and they just hide things (I guess they're attractive
>   for Apple/macho users or something but not relevant to musl) and
>   since the symbol name lines need to be changed anyway to public
>   name. "Local var name" macros are ok to leave; changing them would
>   be too error-prone and they make the code more readable anyway.
>
> - I'd prefer not to have memmove logic in memcpy since it makes it
>   larger and implies that misuse of memcpy when you mean memmove is
>   supported usage. I'd be happy with an approach like x86 though,
>   defining an __memcpy_fwd alias and having memmove tail call to that
>   unless len>128 and reverse is needed, or just leaving memmove.c.

Something like the attached.

Rich

[-- Attachment #2: memcpy.S --]
[-- Type: text/plain, Size: 4082 bytes --]

/*
 * memcpy - copy memory area
 *
 * Copyright (c) 2012-2020, Arm Limited.
 * SPDX-License-Identifier: MIT
 */

/* Assumptions:
 *
 * ARMv8-a, AArch64, unaligned accesses.
 *
 */

#define dstin	x0
#define src	x1
#define count	x2
#define dst	x3
#define srcend	x4
#define dstend	x5
#define A_l	x6
#define A_lw	w6
#define A_h	x7
#define B_l	x8
#define B_lw	w8
#define B_h	x9
#define C_l	x10
#define C_lw	w10
#define C_h	x11
#define D_l	x12
#define D_h	x13
#define E_l	x14
#define E_h	x15
#define F_l	x16
#define F_h	x17
#define G_l	count
#define G_h	dst
#define H_l	src
#define H_h	srcend
#define tmp1	x14

/* This implementation handles overlaps and supports both memcpy and
   memmove from a single entry point. It uses unaligned accesses and
   branchless sequences to keep the code small, simple and improve
   performance.

   Copies are split into 3 main cases: small copies of up to 32 bytes,
   medium copies of up to 128 bytes, and large copies. The overhead of
   the overlap check is negligible since it is only required for large
   copies.

   Large copies use a software pipelined loop processing 64 bytes per
   iteration. The destination pointer is 16-byte aligned to minimize
   unaligned accesses. The loop tail is handled by always copying 64
   bytes from the end. */

.global memcpy
.type memcpy,%function
memcpy:
	add	srcend, src, count
	add	dstend, dstin, count
	cmp	count, 128
	b.hi	.Lcopy_long
	cmp	count, 32
	b.hi	.Lcopy32_128

	/* Small copies: 0..32 bytes. */
	cmp	count, 16
	b.lo	.Lcopy16
	ldp	A_l, A_h, [src]
	ldp	D_l, D_h, [srcend, -16]
	stp	A_l, A_h, [dstin]
	stp	D_l, D_h, [dstend, -16]
	ret

	/* Copy 8-15 bytes. */
.Lcopy16:
	tbz	count, 3, .Lcopy8
	ldr	A_l, [src]
	ldr	A_h, [srcend, -8]
	str	A_l, [dstin]
	str	A_h, [dstend, -8]
	ret

	.p2align 3
	/* Copy 4-7 bytes. */
.Lcopy8:
	tbz	count, 2, .Lcopy4
	ldr	A_lw, [src]
	ldr	B_lw, [srcend, -4]
	str	A_lw, [dstin]
	str	B_lw, [dstend, -4]
	ret

	/* Copy 0..3 bytes using a branchless sequence. */
.Lcopy4:
	cbz	count, .Lcopy0
	lsr	tmp1, count, 1
	ldrb	A_lw, [src]
	ldrb	C_lw, [srcend, -1]
	ldrb	B_lw, [src, tmp1]
	strb	A_lw, [dstin]
	strb	B_lw, [dstin, tmp1]
	strb	C_lw, [dstend, -1]
.Lcopy0:
	ret

	.p2align 4
	/* Medium copies: 33..128 bytes. */
.Lcopy32_128:
	ldp	A_l, A_h, [src]
	ldp	B_l, B_h, [src, 16]
	ldp	C_l, C_h, [srcend, -32]
	ldp	D_l, D_h, [srcend, -16]
	cmp	count, 64
	b.hi	.Lcopy128
	stp	A_l, A_h, [dstin]
	stp	B_l, B_h, [dstin, 16]
	stp	C_l, C_h, [dstend, -32]
	stp	D_l, D_h, [dstend, -16]
	ret

	.p2align 4
	/* Copy 65..128 bytes. */
.Lcopy128:
	ldp	E_l, E_h, [src, 32]
	ldp	F_l, F_h, [src, 48]
	cmp	count, 96
	b.ls	.Lcopy96
	ldp	G_l, G_h, [srcend, -64]
	ldp	H_l, H_h, [srcend, -48]
	stp	G_l, G_h, [dstend, -64]
	stp	H_l, H_h, [dstend, -48]
.Lcopy96:
	stp	A_l, A_h, [dstin]
	stp	B_l, B_h, [dstin, 16]
	stp	E_l, E_h, [dstin, 32]
	stp	F_l, F_h, [dstin, 48]
	stp	C_l, C_h, [dstend, -32]
	stp	D_l, D_h, [dstend, -16]
	ret

	.p2align 4
	/* Copy more than 128 bytes. */
.Lcopy_long:
	/* Copy 16 bytes and then align dst to 16-byte alignment. */
	ldp	D_l, D_h, [src]
	and	tmp1, dstin, 15
	bic	dst, dstin, 15
	sub	src, src, tmp1
	add	count, count, tmp1	/* Count is now 16 too large. */
	ldp	A_l, A_h, [src, 16]
	stp	D_l, D_h, [dstin]
	ldp	B_l, B_h, [src, 32]
	ldp	C_l, C_h, [src, 48]
	ldp	D_l, D_h, [src, 64]!
	subs	count, count, 128 + 16	/* Test and readjust count. */
	b.ls	.Lcopy64_from_end

.Lloop64:
	stp	A_l, A_h, [dst, 16]
	ldp	A_l, A_h, [src, 16]
	stp	B_l, B_h, [dst, 32]
	ldp	B_l, B_h, [src, 32]
	stp	C_l, C_h, [dst, 48]
	ldp	C_l, C_h, [src, 48]
	stp	D_l, D_h, [dst, 64]!
	ldp	D_l, D_h, [src, 64]!
	subs	count, count, 64
	b.hi	.Lloop64

	/* Write the last iteration and copy 64 bytes from the end. */
.Lcopy64_from_end:
	ldp	E_l, E_h, [srcend, -64]
	stp	A_l, A_h, [dst, 16]
	ldp	A_l, A_h, [srcend, -48]
	stp	B_l, B_h, [dst, 32]
	ldp	B_l, B_h, [srcend, -32]
	stp	C_l, C_h, [dst, 48]
	ldp	C_l, C_h, [srcend, -16]
	stp	D_l, D_h, [dst, 64]
	stp	E_l, E_h, [dstend, -64]
	stp	A_l, A_h, [dstend, -48]
	stp	B_l, B_h, [dstend, -32]
	stp	C_l, C_h, [dstend, -16]
	ret

.size memcpy,.-memcpy
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-26  1:20 ` Rich Felker
@ 2020-06-26  8:40 ` Szabolcs Nagy
  2020-07-06 22:12 ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread

From: Szabolcs Nagy @ 2020-06-26 8:40 UTC (permalink / raw)
To: Rich Felker; +Cc: musl

* Rich Felker <dalias@libc.org> [2020-06-25 21:20:06 -0400]:

> On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> > On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > > but it would be nice if we could get the aarch64
> > > > > > memcpy patch in (the c implementation is really
> > > > > > slow and i've seen ppl compare aarch64 vs x86
> > > > > > server performance with some benchmark on alpine..)
> > > > >
> > > > > OK, I'll look again.
> > > >
> > > > thanks.
> > > >
> > > > (there are more aarch64 string functions in the
> > > > optimized-routines github repo but i think they
> > > > are not as important as memcpy/memmove/memset)
> > >
> > > I found the code. Can you comment on performance and whether memset is
> > > needed? (The C memset should be rather good already, more so than
> > > memcpy.)

the asm seems faster in all measurements but there is
a lot of variance with different size/alignment cases.

the avg improvement on typical workload and the possible
improvements across various cases and cores i'd expect:

memcpy typical: 1.6x-1.7x
memcpy possible: 1.2x-3.1x

memset typical: 1.1x-1.4x
memset possible: 1.0x-2.6x

> > Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> > valid for all presently supportable aarch64?

yes, unaligned access on normal memory in userspace
is valid (part of the base abi on linux).

iirc a core can be configured to trap unaligned access
and it is not valid on device memory so e.g. such
memcpy would not work in the kernel. but avoiding
unaligned access in memcpy is not enough to fix that,
the compiler will generate unaligned load for

int f(char *p)
{
	int i;
	__builtin_memcpy(&i,p,sizeof i);
	return i;
}

> >
> > A couple comments for merging if we do, that aren't hard requirements
> > but preferences:
> >
> > - I'd like to expand out the macros from ../asmdefs.h since that won't
> >   be available and they just hide things (I guess they're attractive
> >   for Apple/macho users or something but not relevant to musl) and
> >   since the symbol name lines need to be changed anyway to public
> >   name. "Local var name" macros are ok to leave; changing them would
> >   be too error-prone and they make the code more readable anyway.

the weird macros are there so the code is similar to glibc
asm code (which adds cfi annotation and optionally adds
profile hooks to entry etc)

> >
> > - I'd prefer not to have memmove logic in memcpy since it makes it
> >   larger and implies that misuse of memcpy when you mean memmove is
> >   supported usage. I'd be happy with an approach like x86 though,
> >   defining an __memcpy_fwd alias and having memmove tail call to that
> >   unless len>128 and reverse is needed, or just leaving memmove.c.

in principle the code should be called memmove, not memcpy,
since it satisfies the memmove contract, which of course
works for memcpy too. so tail calling memmove from memcpy
makes more sense but memcpy is more performance critical
than memmove, so we probably should not add extra branches
there..

>
> Something like the attached.

looks good to me.
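[Editorial illustration: the fragment nsz quotes above is the standard portable idiom for loading from a possibly unaligned pointer. A self-contained version, with `load_int` as an illustrative name; on AArch64 and other targets where unaligned access is part of the ABI, compilers typically lower the memcpy to a single unaligned load instruction.]

```c
#include <string.h>

/* Portable possibly-unaligned load: the memcpy is folded away by the
 * compiler into whatever load the target permits. Directly
 * dereferencing (int *)p would be undefined behavior if p is not
 * suitably aligned. */
static int load_int(const char *p)
{
	int i;
	memcpy(&i, p, sizeof i);
	return i;
}
```

This is why merely avoiding unaligned accesses inside memcpy.S would not make a kernel-safe libc: compilers emit unaligned accesses for ordinary code like this anywhere the target ABI allows it.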
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-26  8:40 ` Szabolcs Nagy
@ 2020-07-06 22:12 ` Rich Felker
  2020-07-07 15:00 ` Szabolcs Nagy
  0 siblings, 1 reply; 16+ messages in thread

From: Rich Felker @ 2020-07-06 22:12 UTC (permalink / raw)
To: musl

On Fri, Jun 26, 2020 at 10:40:49AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-06-25 21:20:06 -0400]:
> > On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> > > On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > > > but it would be nice if we could get the aarch64
> > > > > > > memcpy patch in (the c implementation is really
> > > > > > > slow and i've seen ppl compare aarch64 vs x86
> > > > > > > server performance with some benchmark on alpine..)
> > > > > >
> > > > > > OK, I'll look again.
> > > > >
> > > > > thanks.
> > > > >
> > > > > (there are more aarch64 string functions in the
> > > > > optimized-routines github repo but i think they
> > > > > are not as important as memcpy/memmove/memset)
> > > >
> > > > I found the code. Can you comment on performance and whether memset is
> > > > needed? (The C memset should be rather good already, more so than
> > > > memcpy.)
>
> the asm seems faster in all measurements but there is
> a lot of variance with different size/alignment cases.
>
> the avg improvement on typical workload and the possible
> improvements across various cases and cores i'd expect:
>
> memcpy typical: 1.6x-1.7x
> memcpy possible: 1.2x-3.1x
>
> memset typical: 1.1x-1.4x
> memset possible: 1.0x-2.6x
>
> > > Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> > > valid for all presently supportable aarch64?
>
> yes, unaligned access on normal memory in userspace
> is valid (part of the base abi on linux).
>
> iirc a core can be configured to trap unaligned access
> and it is not valid on device memory so e.g. such
> memcpy would not work in the kernel. but avoiding
> unaligned access in memcpy is not enough to fix that,
> the compiler will generate unaligned load for
>
> int f(char *p)
> {
> 	int i;
> 	__builtin_memcpy(&i,p,sizeof i);
> 	return i;
> }
>
> > >
> > > A couple comments for merging if we do, that aren't hard requirements
> > > but preferences:
> > >
> > > - I'd like to expand out the macros from ../asmdefs.h since that won't
> > >   be available and they just hide things (I guess they're attractive
> > >   for Apple/macho users or something but not relevant to musl) and
> > >   since the symbol name lines need to be changed anyway to public
> > >   name. "Local var name" macros are ok to leave; changing them would
> > >   be too error-prone and they make the code more readable anyway.
>
> the weird macros are there so the code is similar to glibc
> asm code (which adds cfi annotation and optionally adds
> profile hooks to entry etc)
>
> > >
> > > - I'd prefer not to have memmove logic in memcpy since it makes it
> > >   larger and implies that misuse of memcpy when you mean memmove is
> > >   supported usage. I'd be happy with an approach like x86 though,
> > >   defining an __memcpy_fwd alias and having memmove tail call to that
> > >   unless len>128 and reverse is needed, or just leaving memmove.c.
>
> in principle the code should be called memmove, not memcpy,
> since it satisfies the memmove contract, which of course
> works for memcpy too. so tail calling memmove from memcpy
> makes more sense but memcpy is more performance critical
> than memmove, so we probably should not add extra branches
> there..
>
> >
> > Something like the attached.
>
> looks good to me.

I think you saw already, but just to make it clear on the list too,
it's upstream now. I'm open to further improvements like doing
memmove (either as a separate copy of the full implementation or some
minimal branch-to-__memcpy_fwd approach) but I think what's already
there is sufficient to solve the main practical performance issues
users were hitting that made aarch64 look bad in relation to x86_64.

I'd still like to revisit the topic of minimizing the per-arch code
needed for this so that all archs can benefit from the basic logic,
too.

Rich
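[Editorial illustration: the "branch-to-__memcpy_fwd" idea discussed above is a dispatch shape, sketched in C below. This is not musl's code: `memmove_sketch` and `copy_fwd` are hypothetical names, and a plain byte loop stands in for what would really be the asm forward-copy entry point.]

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the forward-only copy (the asm __memcpy_fwd in the
 * scheme discussed above). Safe whenever dst does not overlap ahead
 * of src. */
static void *copy_fwd(void *d, const void *s, size_t n)
{
	unsigned char *dp = d;
	const unsigned char *sp = s;
	while (n--) *dp++ = *sp++;
	return d;
}

void *memmove_sketch(void *d, const void *s, size_t n)
{
	/* If d is below s, or at least n bytes above it, a forward copy
	 * never reads a byte it has already overwritten. The unsigned
	 * subtraction covers both cases in one comparison. */
	if ((uintptr_t)d - (uintptr_t)s >= n)
		return copy_fwd(d, s, n);

	/* Overlapping with d ahead of s: copy backwards. */
	unsigned char *dp = (unsigned char *)d + n;
	const unsigned char *sp = (const unsigned char *)s + n;
	while (n--) *--dp = *--sp;
	return d;
}
```

In the real scheme the forward branch would be a tail call into the asm, so the non-overlapping common case pays only one compare-and-branch over plain memcpy.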
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-07-06 22:12 ` Rich Felker
@ 2020-07-07 15:00 ` Szabolcs Nagy
  2020-07-07 17:22 ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread

From: Szabolcs Nagy @ 2020-07-07 15:00 UTC (permalink / raw)
To: Rich Felker; +Cc: musl

* Rich Felker <dalias@libc.org> [2020-07-06 18:12:43 -0400]:

> I think you saw already, but just to make it clear on the list too,
> it's upstream now. I'm open to further improvements like doing
> memmove (either as a separate copy of the full implementation or some
> minimal branch-to-__memcpy_fwd approach) but I think what's already
> there is sufficient to solve the main practical performance issues
> users were hitting that made aarch64 look bad in relation to x86_64.
>
> I'd still like to revisit the topic of minimizing the per-arch code
> needed for this so that all archs can benefit from the basic logic,
> too.

thanks.

note that the code has some internal .p2align
directives that assume the entry is aligned to
some large alignment (.p2align 6 in orig code)

i think it would be better to keep the entry
aligned (but i don't know if it makes a big
difference on some existing core, it's more
for consistency with upstream).

musl normally does not align function entries
but for a few select functions it is probably
not too much overhead?
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-07-07 15:00 ` Szabolcs Nagy
@ 2020-07-07 17:22 ` Rich Felker
  2020-07-07 18:20 ` Szabolcs Nagy
  0 siblings, 1 reply; 16+ messages in thread

From: Rich Felker @ 2020-07-07 17:22 UTC (permalink / raw)
To: musl

On Tue, Jul 07, 2020 at 05:00:20PM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-07-06 18:12:43 -0400]:
> > I think you saw already, but just to make it clear on the list too,
> > it's upstream now. I'm open to further improvements like doing
> > memmove (either as a separate copy of the full implementation or some
> > minimal branch-to-__memcpy_fwd approach) but I think what's already
> > there is sufficient to solve the main practical performance issues
> > users were hitting that made aarch64 look bad in relation to x86_64.
> >
> > I'd still like to revisit the topic of minimizing the per-arch code
> > needed for this so that all archs can benefit from the basic logic,
> > too.
>
> thanks.
>
> note that the code has some internal .p2align
> directives that assume the entry is aligned to
> some large alignment (.p2align 6 in orig code)
>
> i think it would be better to keep the entry
> aligned (but i don't know if it makes a big
> difference on some existing core, it's more
> for consistency with upstream).
>
> musl normally does not align function entries
> but for a few select functions it is probably
> not too much overhead?

I was under the impression that any .p2align N in the section
inherently aligns the whole section as if it started with .p2align N,
in which case not writing it explicitly just avoids redundancy and
makes sure you don't actually have an initial alignment that's larger
than any alignment actually wanted later. Is this incorrect?

(To be incorrect I think it would have to do some fancy
elastic-section-contents hack, but maybe aarch64 ELF object ABI has
that..?)

Rich
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-07-07 17:22 ` Rich Felker
@ 2020-07-07 18:20 ` Szabolcs Nagy
  0 siblings, 0 replies; 16+ messages in thread

From: Szabolcs Nagy @ 2020-07-07 18:20 UTC (permalink / raw)
To: Rich Felker; +Cc: musl

* Rich Felker <dalias@libc.org> [2020-07-07 13:22:57 -0400]:

> On Tue, Jul 07, 2020 at 05:00:20PM +0200, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2020-07-06 18:12:43 -0400]:
> > > I think you saw already, but just to make it clear on the list too,
> > > it's upstream now. I'm open to further improvements like doing
> > > memmove (either as a separate copy of the full implementation or some
> > > minimal branch-to-__memcpy_fwd approach) but I think what's already
> > > there is sufficient to solve the main practical performance issues
> > > users were hitting that made aarch64 look bad in relation to x86_64.
> > >
> > > I'd still like to revisit the topic of minimizing the per-arch code
> > > needed for this so that all archs can benefit from the basic logic,
> > > too.
> >
> > thanks.
> >
> > note that the code has some internal .p2align
> > directives that assume the entry is aligned to
> > some large alignment (.p2align 6 in orig code)
> >
> > i think it would be better to keep the entry
> > aligned (but i don't know if it makes a big
> > difference on some existing core, it's more
> > for consistency with upstream).
> >
> > musl normally does not align function entries
> > but for a few select functions it is probably
> > not too much overhead?
>
> I was under the impression that any .p2align N in the section
> inherently aligns the whole section as if it started with .p2align N,
> in which case not writing it explicitly just avoids redundancy and
> makes sure you don't actually have an initial alignment that's larger
> than any alignment actually wanted later. Is this incorrect?
>
> (To be incorrect I think it would have to do some fancy
> elastic-section-contents hack, but maybe aarch64 ELF object ABI has
> that..?)

ah you are right, then everything is fine i guess.
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 20:50 ` Rich Felker
  2020-06-25 21:15 ` Rich Felker
@ 2020-06-25 21:43 ` Andre McCurdy
  2020-06-25 21:51 ` Rich Felker
  1 sibling, 1 reply; 16+ messages in thread

From: Andre McCurdy @ 2020-06-25 21:43 UTC (permalink / raw)
To: musl

On Thu, Jun 25, 2020 at 1:50 PM Rich Felker <dalias@libc.org> wrote:
>
> As noted in the past I'd like to get rid of having high level flow
> logic in the arch asm and instead have the arch provide string asm
> fragments, if desired, to copy blocks, which could then be used in a
> shared C skeleton. However as you noted this has been a point of
> practical performance problem for a long time and I don't think it's
> fair to just keep putting it off for a better solution.

I'd like to see the patches to enable asm memcpy for big endian ARM
merged. I may be the only user of musl on big endian ARM though (?) so
not sure how much wider interest there is.
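The "arch asm fragments plus shared C skeleton" split Rich describes might look roughly like the following. This is a sketch, not musl code: `__arch_copy_blocks` is a hypothetical hook name, and a real version would also deal with alignment, overlap, and tail sizes, but it shows where the shared C flow logic would sit relative to an optional arch-provided bulk copier:

```c
#include <stddef.h>

/* Hypothetical per-arch hook: copy some prefix of the n bytes using a
 * fast asm fragment (large aligned blocks, string instructions, ...)
 * and return how many bytes it handled. An arch with no asm support
 * simply returns 0 and the generic C loop below does everything. */
static size_t __arch_copy_blocks(unsigned char *d, const unsigned char *s,
                                 size_t n)
{
	(void)d; (void)s; (void)n;
	return 0; /* generic fallback: no asm fragment on this arch */
}

void *skeleton_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Shared C skeleton: hand the bulk to the arch fragment, then
	 * finish whatever remains a byte at a time. */
	size_t done = __arch_copy_blocks(d, s, n);
	d += done; s += done; n -= done;
	while (n--) *d++ = *s++;
	return dest;
}
```

The point of the split is that every arch gets the correct generic logic for free, and an arch only has to supply the one block-copy fragment it can actually accelerate.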
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 21:43 ` Andre McCurdy
@ 2020-06-25 21:51 ` Rich Felker
  2020-06-25 22:03 ` Andre McCurdy
  0 siblings, 1 reply; 16+ messages in thread

From: Rich Felker @ 2020-06-25 21:51 UTC (permalink / raw)
To: musl

On Thu, Jun 25, 2020 at 02:43:42PM -0700, Andre McCurdy wrote:
> On Thu, Jun 25, 2020 at 1:50 PM Rich Felker <dalias@libc.org> wrote:
> >
> > As noted in the past I'd like to get rid of having high level flow
> > logic in the arch asm and instead have the arch provide string asm
> > fragments, if desired, to copy blocks, which could then be used in a
> > shared C skeleton. However as you noted this has been a point of
> > practical performance problem for a long time and I don't think it's
> > fair to just keep putting it off for a better solution.
>
> I'd like to see the patches to enable asm memcpy for big endian ARM
> merged. I may be the only user of musl on big endian ARM though (?) so
> not sure how much wider interest there is.

I'd forgotten I hadn't already merged it. However I was just rereading
it and something looks amiss. Can you take a look again?

Rich
* Re: [musl] Release prep for 1.2.1, and afterwards
  2020-06-25 21:51 ` Rich Felker
@ 2020-06-25 22:03 ` Andre McCurdy
  0 siblings, 0 replies; 16+ messages in thread

From: Andre McCurdy @ 2020-06-25 22:03 UTC (permalink / raw)
To: musl

On Thu, Jun 25, 2020 at 2:51 PM Rich Felker <dalias@libc.org> wrote:
>
> On Thu, Jun 25, 2020 at 02:43:42PM -0700, Andre McCurdy wrote:
> > On Thu, Jun 25, 2020 at 1:50 PM Rich Felker <dalias@libc.org> wrote:
> > >
> > > As noted in the past I'd like to get rid of having high level flow
> > > logic in the arch asm and instead have the arch provide string asm
> > > fragments, if desired, to copy blocks, which could then be used in a
> > > shared C skeleton. However as you noted this has been a point of
> > > practical performance problem for a long time and I don't think it's
> > > fair to just keep putting it off for a better solution.
> >
> > I'd like to see the patches to enable asm memcpy for big endian ARM
> > merged. I may be the only user of musl on big endian ARM though (?) so
> > not sure how much wider interest there is.
>
> I'd forgotten I hadn't already merged it. However I was just rereading
> it and something looks amiss. Can you take a look again?

Is there anything in particular that looks wrong? The most recent
version of the patch still applies cleanly to master.
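For background on why big-endian ARM needs its own memcpy variant at all: the asm handles a word-misaligned source by reading aligned words and splicing adjacent pairs together with shifts, and the two shift directions are mirrored between little- and big-endian byte order. A rough C rendering of the idea follows (this is an illustration, not the actual patch, which does the equivalent with lsr/lsl in asm; `off` must be 1-3 and the source must have one extra word readable past the copied range):

```c
#include <stdint.h>
#include <stddef.h>

/* Copy nwords 32-bit words from a source that is misaligned by `off`
 * bytes, reading only aligned words and splicing each pair of
 * neighbours together. On little-endian the earlier bytes are the low
 * bytes of a word, so the current word shifts right and the next word
 * shifts left; a big-endian build must mirror both shifts, which is
 * why the little-endian asm is wrong as-is on armeb. */
static void splice_copy(uint32_t *dst, const uint32_t *src_aligned,
                        unsigned off, size_t nwords)
{
	uint32_t cur = src_aligned[0];
	for (size_t i = 0; i < nwords; i++) {
		uint32_t next = src_aligned[i + 1];
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
		dst[i] = (cur >> (8 * off)) | (next << (32 - 8 * off));
#else
		dst[i] = (cur << (8 * off)) | (next >> (32 - 8 * off));
#endif
		cur = next;
	}
}
```

Since the shift pair is baked into every misaligned-copy path of the asm, supporting big endian means either swapping the shifts under a preprocessor conditional or providing a separate source file, which is what the patch under discussion addresses.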
end of thread, other threads:[~2020-07-07 18:20 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-24 20:42 [musl] Release prep for 1.2.1, and afterwards Rich Felker
2020-06-24 22:39 ` Jeffrey Walton
2020-06-25  8:15 ` Szabolcs Nagy
2020-06-25 15:39   ` Rich Felker
2020-06-25 17:31     ` Szabolcs Nagy
2020-06-25 20:50       ` Rich Felker
2020-06-25 21:15         ` Rich Felker
2020-06-26  1:20           ` Rich Felker
2020-06-26  8:40             ` Szabolcs Nagy
2020-07-06 22:12               ` Rich Felker
2020-07-07 15:00                 ` Szabolcs Nagy
2020-07-07 17:22                   ` Rich Felker
2020-07-07 18:20                     ` Szabolcs Nagy
2020-06-25 21:43         ` Andre McCurdy
2020-06-25 21:51           ` Rich Felker
2020-06-25 22:03             ` Andre McCurdy
Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/