* Thinking about release @ 2013-06-13 1:25 Rich Felker 2013-06-13 1:33 ` Andre Renaud ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Rich Felker @ 2013-06-13 1:25 UTC (permalink / raw) To: musl Hi all, It's been mentioned that we're overdue for a release, and it looks like we have a fair amount of new stuff since the last release. Major changes so far: - Accepting \n in dynamic linker path file - Fixed broken x86_64 sigsetjmp - Removed wrong C++ __STDC_*_MACROS checks - Thread exit synchronization issues - Conforming overflow behavior in clock() - Support for large device numbers - Math optimizations on i386 - C11 float.h macro additions - Fixes for exceptions in fma (gcc bug workarounds) - Fix misaligned stack when running ctors/dtors - Support for %m modifier in scanf - Support for using gcc intrinsic headers with musl-gcc wrapper - PRNG fixes - Per-process and per-thread cputime clocks and new(ish) Linux clocks I think the ether.h functions could definitely still make it in for this release too. inet_makeaddr, etc. could probably also be merged. Most of the other major items left on the agenda since the last release are probably not going to happen right away unless there's a volunteer to do them (zoneinfo, cpuset/affinity, string functions cleanup, C++ ABI matching, ARM-optimized memcpy) and one, the ld.so symlink direction issue, still requires some serious discussion and decision-making. If anyone else has input for what should still go into the next release, please jump in and discuss. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-06-13 1:25 Thinking about release Rich Felker @ 2013-06-13 1:33 ` Andre Renaud 2013-06-13 1:43 ` Rich Felker 2013-06-13 15:46 ` Isaac 2013-06-26 1:44 ` Rich Felker 2 siblings, 1 reply; 38+ messages in thread From: Andre Renaud @ 2013-06-13 1:33 UTC (permalink / raw) To: musl Hi Rich, > Most of the other major items left on the agenda since the last > release are probably not going to happen right away unless there's a > volunteer to do them (zoneinfo, cpuset/affinity, string functions > cleanup, C++ ABI matching, ARM-optimized memcpy) and one, the ld.so > symlink direction issue, still requires some serious discussion and > decision-making. Regarding the ARM-optimisations - I am happy to have a go at providing a cleaned up implementation, although I can't recall what the final consensus was on how this should be implemented. A simple ARMv4 implementation would cover all the bases, providing near universal support, although it would obviously not support the more modern platforms. Is there any intention to move the base level support up to ARMv5? I would consider that reasonable, given the age of ARMv4. Alternatively, should we have multiple implementations (ARMv4/ARMv5/ARMv7), and choose between them either at compile or run-time? Obviously this stuff is probably not destined for the immediate release, but more likely for the one after that. Regards, Andre ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-06-13 1:33 ` Andre Renaud @ 2013-06-13 1:43 ` Rich Felker 2013-07-09 5:06 ` Andre Renaud 0 siblings, 1 reply; 38+ messages in thread From: Rich Felker @ 2013-06-13 1:43 UTC (permalink / raw) To: musl On Thu, Jun 13, 2013 at 01:33:16PM +1200, Andre Renaud wrote: > Hi Rich, > > > Most of the other major items left on the agenda since the last > > release are probably not going to happen right away unless there's a > > volunteer to do them (zoneinfo, cpuset/affinity, string functions > > cleanup, C++ ABI matching, ARM-optimized memcpy) and one, the ld.so > > symlink direction issue, still requires some serious discussion and > > decision-making. > > Regarding the ARM-optimisations - I am happy to have a go at providing > a cleaned up implementation, although I can't recall what the final > consensus was on how this should be implemented. A simple ARMv4 I think the first step should be benchmarking on real machines. Somebody tried the asm that was posted and claimed it was no faster than musl's C code; I don't know the specific hardware they were using and I don't even recall right off who made the claim or where it was reported, but I think before we start writing or importing code we need to have a good idea how the current C code compares in performance to other "optimized" implementations. > implementation would cover all the bases, providing near universal > support, although would obviously not support the more modern > platforms. Is there any intention to move the base level support up to > ARMv5? I would consider that reasonable, given the age of ARMv4. > Alternatively, should we have multiple implementations > (ARMv4/ARMv5/ARMv7), and choose between them either at compile or > run-time? It's possible to branch based on __hwcap at runtime, if this would really help. > Obviously this stuff is probably not destined for the immediate > release, but more likely for the one after that. 
Yes, this looks like it will be a process that takes some time to sort out the facts and then tune the code. For what it's worth, I just did my first runs of libc-bench on real ARM hardware (well, an FPGA-based ARM). memset is half the speed of glibc's, but strchr and strlen are about 40% faster than glibc's. I don't think libc-bench is really a good benchmark as of yet, so we should probably develop more detailed tests. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
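Rich's suggestion of branching on __hwcap at runtime could be sketched roughly as below. This is a hypothetical illustration, not musl code: it uses the public getauxval(AT_HWCAP) interface rather than musl's internal __hwcap symbol, the HWCAP_NEON fallback value is taken from the ARM Linux ABI, and memcpy_neon/memcpy_generic are placeholder names standing in for real implementations.

```c
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>

/* ARM's AT_HWCAP bit for NEON; defined here as a fallback so the
 * sketch also compiles on non-ARM hosts (value from the Linux ARM ABI). */
#ifndef HWCAP_NEON
#define HWCAP_NEON (1 << 12)
#endif

/* Placeholder implementations -- in a real libc these would be the
 * generic C version and a NEON-optimized asm version. */
static void *memcpy_generic(void *d, const void *s, size_t n)
{
	return memcpy(d, s, n);
}
static void *memcpy_neon(void *d, const void *s, size_t n)
{
	return memcpy(d, s, n);
}

/* Resolved lazily from the kernel-reported hwcaps. Not thread-safe as
 * written; a real libc would resolve this once at startup instead. */
static void *(*memcpy_impl)(void *, const void *, size_t);

void *my_memcpy(void *d, const void *s, size_t n)
{
	if (!memcpy_impl)
		memcpy_impl = (getauxval(AT_HWCAP) & HWCAP_NEON)
			? memcpy_neon : memcpy_generic;
	return memcpy_impl(d, s, n);
}
```

The cost of this scheme is one predictable branch and an indirect call per invocation, which is why the choice between this and compile-time selection matters mostly for small copies.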
* Re: Thinking about release 2013-06-13 1:43 ` Rich Felker @ 2013-07-09 5:06 ` Andre Renaud 2013-07-09 5:37 ` Rich Felker 0 siblings, 1 reply; 38+ messages in thread From: Andre Renaud @ 2013-07-09 5:06 UTC (permalink / raw) To: musl Hi Rich, > I think the first step should be benchmarking on real machines. > Somebody tried the asm that was posted and claimed it was no faster > than musl's C code; I don't know the specific hardware they were using > and I don't even recall right off who made the claim or where it was > reported, but I think before we start writing or importing code we > need to have a good idea how the current C code compares in > performance to other "optimized" implementations. In the interests of furthering this discussion (and because I'd like to start using musl as the basis for some of our projects, but the current speed degradation is noticeable), I've created some patches that enable memcmp, memcpy & memmove ARM optimisations. I've ignored the str* functions, as these are generally not used on the same bulk data as the mem* functions, and as such the performance issue is less noticeable. Using a fairly rudimentary test application, I've benchmarked it as having the following speed improvements (this is all on an actual ARM board - 400MHz arm926ejs):

memcpy: 160%
memmove: 162%
memcmp: 272%

These numbers bring musl in line with glibc (at least on ARMv5). memcmp in particular seems to be faster (90MB/s vs 75MB/s on my platform). I haven't looked at using the __hwcap feature at this stage to swap between these implementations and neon optimised versions. I assume this can come later. From a code size point of view (this is all with -O3), memcpy goes from 1996 to 1680 bytes, memmove goes from 2592 to 2088 bytes, and memcmp goes from 1040 to 1452, for a total increase of 224 bytes. The code is from NetBSD and Android (essentially unmodified), and it is all BSD 2-clause licensed. 
The git tree is available here: https://github.com/AndreRenaud/musl/commit/713023e7320cf45b116d1c29b6155ece28904e69 Does anyone have any comments on the suitability of this code, or what kind of more rigorous testing could be applied? Regards, Andre ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-09 5:06 ` Andre Renaud @ 2013-07-09 5:37 ` Rich Felker 2013-07-09 6:24 ` Harald Becker ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Rich Felker @ 2013-07-09 5:37 UTC (permalink / raw) To: musl On Tue, Jul 09, 2013 at 05:06:21PM +1200, Andre Renaud wrote: > Hi Rich, > > I think the first step should be benchmarking on real machines. > > Somebody tried the asm that was posted and claimed it was no faster > > than musl's C code; I don't know the specific hardware they were using > > and I don't even recall right off who made the claim or where it was > > reported, but I think before we start writing or importing code we > > need to have a good idea how the current C code compares in > > performance to other "optimized" implementations. > > In the interests of furthering this discussion (and because I'd like > to start using musl as the basis for some of our projects, but the > current speed degradation is noticeable , I've created some patches Then it needs to be fixed. :-) > that enable memcmp, memcpy & memmove ARM optimisations. I've ignored > the str* functions, as these are generally not used on the same bulk > data as the mem* functions, and as such the performance issue is less > noticeable. I think that's a reasonable place to begin. I do mildly question the relevance of memmove to performance, so if we end up having to do a lot of review or changes to get the asm committed, it might make sense to leave memmove for later. > Using a fairly rudimentary test application, I've benchmarked it as > having the following speed improvements (this is all on an actual ARM > board - 400MHz arm926ejs): > memcpy: 160% > memmove: 162% > memcmp: 272% > These numbers bring musl in line with glibc (at least on ARMv5). > memcmp in particular seems to be faster (90MB/s vs 75MB/s on my > platform). 
> I haven't looked at using the __hwcap feature at this stage to swap > between these implementation and neon optimised versions. I assume > this can come later. > > > From a code size point of view (this is all with -O3), memcpy goes > from 1996 to 1680 bytes, memmove goes from 2592 to 2088 bytes, and > memcmp goes from 1040 to 1452, for a total increase of 224 bytes. > > The code is from NetBSD and Android (essentially unmodified), and it > is all BSD 2-clause licensed. At first glance, this looks like a clear improvement, but have you compared it to much more naive optimizations? My _general_ experience with optimized memcpy asm that's complex like this and that goes out of its way to deal explicitly with cache lines and such is that it's no faster than just naively moving large blocks at a time. Of course this may or may not be the case for ARM, but I'd like to know if you've done any tests. The basic principle in my mind here is that a complex solution is not necessarily wrong if it's a big win in other ways, but that a complex solution which is at most 1-2% faster than a much simpler solution is probably not the best choice. I also have access to a good test system now, by the way, so I could do some tests too. > The git tree is available here: > https://github.com/AndreRenaud/musl/commit/713023e7320cf45b116d1c29b6155ece28904e69 It's an open question whether it's better to sync something like this with an 'upstream' or adapt it to musl coding conventions. Generally musl uses explicit instructions rather than pseudo-instructions/macros for prologue and epilogue, and does not use named labels. > Does anyone have any comments on the suitability of this code, or what If nothing else, it fails to be armv4 compatible. Fixing that should not be hard, but it would require a bit of an audit. The return sequences are the obvious issue, but there may be other instructions in use that are not available on armv4 or maybe not even on armv5...? 
> kind of more rigorous testing could be applied? See above. What also might be worth testing is whether GCC can compete if you just give it a naive loop (not the fancy pseudo-vectorized stuff currently in musl) and good CFLAGS. I know on x86 I was able to beat the fanciest asm strlen I could come up with simply by writing the naive loop in C and unrolling it a lot. The only reason musl isn't already using that version is that I suspect it hurts branch prediction in the caller.... Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
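The "naive loop, unrolled a lot" approach Rich mentions for strlen would look something like this. This is a sketch, not the code Rich actually wrote: the 8-way unroll factor is an arbitrary choice, and real results depend heavily on the compiler and CFLAGS.

```c
#include <stddef.h>

/* Byte-at-a-time strlen with manual 8-way unrolling: no word-at-a-time
 * tricks, just a simple loop the compiler can schedule well. Each
 * iteration tests 8 bytes, so the loop branch is taken 1/8 as often. */
size_t naive_strlen(const char *s)
{
	const char *p = s;
	for (;;) {
		if (!p[0]) return p - s;
		if (!p[1]) return p - s + 1;
		if (!p[2]) return p - s + 2;
		if (!p[3]) return p - s + 3;
		if (!p[4]) return p - s + 4;
		if (!p[5]) return p - s + 5;
		if (!p[6]) return p - s + 6;
		if (!p[7]) return p - s + 7;
		p += 8;
	}
}
```

The branch-prediction concern Rich raises is about the eight early-return branches: they can dilute the predictor state of the calling code even when the function itself benchmarks well in isolation.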
* Re: Thinking about release 2013-07-09 5:37 ` Rich Felker @ 2013-07-09 6:24 ` Harald Becker 2013-07-09 21:28 ` Andre Renaud 2013-07-10 19:38 ` Rob Landley 2 siblings, 0 replies; 38+ messages in thread From: Harald Becker @ 2013-07-09 6:24 UTC (permalink / raw) Cc: musl, dalias Hi Rich ! 09-07-2013 01:37 Rich Felker <dalias@aerifal.cx>: > I also have access to a good test system now, by the way, so I > could do some tests too. I own a Hercules eCAFE SlimHD Netbook with a Freescale i.MX515 CPU (ARMv7 / Cortex-A8 @ 800 MHz / 512 MB RAM). If this may be of use for you, I'm able to do some testing, etc., just let me know. -- Harald ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-09 5:37 ` Rich Felker 2013-07-09 6:24 ` Harald Becker @ 2013-07-09 21:28 ` Andre Renaud 2013-07-09 22:26 ` Andre Renaud ` (2 more replies) 2013-07-10 19:38 ` Rob Landley 2 siblings, 3 replies; 38+ messages in thread From: Andre Renaud @ 2013-07-09 21:28 UTC (permalink / raw) To: musl Hi Rich, > I think that's a reasonable place to begin. I do mildly question the > relevance of memmove to performance, so if we end up having to do a > lot of review or changes to get the asm committed, it might make sense > to leave memmove for later. I wasn't too sure on memmove, but I've seen a reasonable amount of code which just uses memmove as standard (rather than memcpy), to avoid the possibility of overlapping regions. Not a great policy, but still. I'm fine with dropping it at this stage. > At first glance, this looks like a clear improvement, but have you > compared it to much more naive optimizations? My _general_ experience > with optimized memcpy asm that's complex like this and that goes out > of its way to deal explicitly with cache lines and such is that it's > no faster than just naively moving large blocks at a time. Of course > this may or may not be the case for ARM, but I'd like to know if > you've done any tests. > > The basic principle in my mind here is that a complex solution is not > necessarily wrong if it's a big win in other ways, but that a complex > solution which is at most 1-2% faster than a much simpler solution is > probably not the best choice. Certainly if there was a more straightforward C implementation that achieved similar results that would be superior. However the existing musl C memcpy code is already optimised to some degree (doing 32-bit rather than 8-bit copies), and I've found it is difficult to convince gcc to use the load-multiple & store-multiple instructions from C code without resorting to pretty horrible C code. It may still be preferable to the assembler though. 
At this stage I haven't benchmarked this - I'll see if I can come up with something. > It's an open question whether it's better to sync something like this > with an 'upstream' or adapt it to musl coding conventions. Generally > musl uses explicit instructions rather than pseudo-instructions/macros > for prologue and epilogue, and does not use named labels. Given that most of the other systems do some form of compile time optimisations (which we're trying to avoid), and that these are not functions that see a lot of code churn, I don't think it's too bad to have it adapted to musl's style. I haven't really done that so far. >> Does anyone have any comments on the suitability of this code, or what > > If nothing else, it fails to be armv4 compatible. Fixing that should > not be hard, but it would require a bit of an audit. The return > sequences are the obvious issue, but there may be other instructions > in use that are not available on armv4 or maybe not even on armv5...? Rob Landley mentioned a while ago that armv4 has issues with the EABI stuff. Is armv4 a definite lower bound for musl support, as opposed to armv4t or armv5? Regards, Andre ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-09 21:28 ` Andre Renaud @ 2013-07-09 22:26 ` Andre Renaud 2013-07-10 6:42 ` Jens Gustedt 2013-07-10 22:44 ` Andre Renaud 2013-07-10 19:42 ` Rich Felker 2013-07-11 4:30 ` Strake 2 siblings, 2 replies; 38+ messages in thread From: Andre Renaud @ 2013-07-09 22:26 UTC (permalink / raw) To: musl Replying to myself > Certainly if there was a more straight forward C implementation that > achieved similar results that would be superior. However the existing > musl C memcpy code is already optimised to some degree (doing 32-bit > rather than 8-bit copies), and it is difficult to convince gcc to use > the load-multiple & store-multiple instructions via C code I've found, > without resorting to pretty horrible C code. It may still be > preferable to the assembler though. At this stage I haven't > benchmarked this - I'll see if I can come up with something. As a comparison, the existing memcpy.c implementation tries to copy sizeof(size_t) bytes at a time, which on ARM is 4. This ends up being a standard load/store. However GCC is smart enough to know that it can use ldm/stm instructions for copying structures > 4 bytes. 
So if we change memcpy.c to use a structure whose size is > 4 (ie: 16), instead of size_t for its basic copy unit, we do see some improvements:

typedef struct multiple_size_t {
	size_t d[4];
} multiple_size_t;
#define SS (sizeof(multiple_size_t))
#define ALIGN (sizeof(multiple_size_t)-1)

void *my_memcpy(void * restrict dest, const void * restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
		goto misaligned;

	for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;
	if (n) {
		multiple_size_t *wd = (void *)d;
		const struct multiple_size_t *ws = (const void *)s;
		for (; n>=SS; n-=SS) *wd++ = *ws++;
		d = (void *)wd;
		s = (const void *)ws;
misaligned:
		for (; n; n--) *d++ = *s++;
	}
	return dest;
}

This results in 95MB/s on my platform (up from 65MB/s for the existing memcpy.c, and down from 105MB/s with the asm optimised version). It is essentially identically readable to the existing memcpy.c. I'm not really familiar with any other cpu architectures, so I'm not sure if this would improve, or hurt, performance on other platforms. Any comments on using something like this for memcpy instead? Obviously this gives you a higher penalty if the size of the area to be copied is between sizeof(size_t) and sizeof(multiple_size_t). Regards, Andre ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-09 22:26 ` Andre Renaud @ 2013-07-10 6:42 ` Jens Gustedt 2013-07-10 7:50 ` Rich Felker 2013-07-10 22:44 ` Andre Renaud 1 sibling, 1 reply; 38+ messages in thread From: Jens Gustedt @ 2013-07-10 6:42 UTC (permalink / raw) To: musl On Wednesday, 2013-07-10 at 10:26 +1200, Andre Renaud wrote:

> typedef struct multiple_size_t {
> 	size_t d[4];
> } multiple_size_t;

why not have it

typedef size_t _multiple_size[4];

the wrapping into struct just doesn't serve much purpose, I think. Then for your implementation, the commonly used trick would be to have two "slow" phases for misalignment. One at the start for the first bytes up to the next valid alignment boundary, do the "fast" copy for the aligned part, and then handle the last bytes in another slow phase. For small things to copy this adds a bit of arithmetic and a conditional. Jens -- :: INRIA Nancy Grand Est :: http://www.loria.fr/~gustedt/ :: :: AlGorille ::::::::::::::: office Nancy : +33 383593090 :: :: ICube :::::::::::::: office Strasbourg : +33 368854536 :: :: ::::::::::::::::::::::::::: gsm France : +33 651400183 :: :: :::::::::::::::::::: gsm international : +49 15737185122 :: ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-10 6:42 ` Jens Gustedt @ 2013-07-10 7:50 ` Rich Felker 0 siblings, 0 replies; 38+ messages in thread From: Rich Felker @ 2013-07-10 7:50 UTC (permalink / raw) To: musl On Wed, Jul 10, 2013 at 08:42:58AM +0200, Jens Gustedt wrote: > Am Mittwoch, den 10.07.2013, 10:26 +1200 schrieb Andre Renaud: > > typedef struct multiple_size_t { > > size_t d[4]; > > } multiple_size_t; > > why not have it > > typedef size_t _multiple_size[4]; > > the wrapping into struct just doesn't serve much purpose, I think. Because arrays are not assignable. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
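The distinction Rich is pointing at: a bare array type cannot appear on the left of an assignment, while a struct containing an array can, and that struct assignment is what lets the copy loop move sizeof(struct) bytes per iteration. A minimal illustration (the names words4/words4_arr are mine, not from the thread):

```c
#include <stddef.h>

typedef size_t words4_arr[4];              /* array typedef: NOT assignable */
typedef struct { size_t d[4]; } words4;    /* struct wrapper: assignable */

/* Copy 4 machine words in one struct assignment. With the array
 * typedef this would not compile: arrays are not modifiable lvalues,
 * so C has no array assignment -- hence the struct wrapper. */
void copy_words4(void *dst, const void *src)
{
	words4 *wd = dst;
	const words4 *ws = src;
	*wd = *ws;
}
```

On ARM, GCC can compile the struct assignment to a single ldm/stm pair, which is the effect Andre's my_memcpy variant relies on.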
* Re: Thinking about release 2013-07-09 22:26 ` Andre Renaud 2013-07-10 6:42 ` Jens Gustedt @ 2013-07-10 22:44 ` Andre Renaud 2013-07-11 3:37 ` Rich Felker 1 sibling, 1 reply; 38+ messages in thread From: Andre Renaud @ 2013-07-10 22:44 UTC (permalink / raw) To: Andre Renaud; +Cc: musl > This results in 95MB/s on my platform (up from 65MB/s for the existing > memcpy.c, and down from 105MB/s with the asm optimised version). It is > essentially identically readable to the existing memcpy.c. I'm not > really famiilar with any other cpu architectures, so I'm not sure if > this would improve, or hurt, performance on other platforms. Reviewing the assembler that is produced, it appears that GCC will never generate an ldm/stm instruction (load/store multiple) that reads into more than 4 registers, whereas the optimised assembler generates ones that read 8 (ie: 8 * 32bit reads in a single instruction). I've tried various tricks/optimisations with the C code, and can't convince GCC to do more than 4. I assume that this is probably where the remaining 10MB/s is between these two variants. Rich - do you have any comments on whether either the C or assembler variants of memcpy might be suitable for inclusion in musl? Regards, Andre ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-10 22:44 ` Andre Renaud @ 2013-07-11 3:37 ` Rich Felker 2013-07-11 4:04 ` Andre Renaud ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Rich Felker @ 2013-07-11 3:37 UTC (permalink / raw) To: musl; +Cc: Andre Renaud On Thu, Jul 11, 2013 at 10:44:16AM +1200, Andre Renaud wrote: > > This results in 95MB/s on my platform (up from 65MB/s for the existing > > memcpy.c, and down from 105MB/s with the asm optimised version). It is > > essentially identically readable to the existing memcpy.c. I'm not > > really famiilar with any other cpu architectures, so I'm not sure if > > this would improve, or hurt, performance on other platforms. > > Reviewing the assembler that is produced, it appears that GCC will > never generate an ldm/stm instruction (load/store multiple) that reads > into more than 4 registers, where as the optimised assembler does them > that read 8 (ie: 8 * 32bit reads in a single instruction). I've tried For the asm, could we make it more than 8? 10 seems easy, 12 seems doubtful. I don't see a fundamental reason it needs to be a power of two, unless the cache line alignment really helps and isn't just cargo-culting. (This is something I'd still like to know about the asm: whether it's doing unnecessary stuff that does not help performance.) > various tricks/optimisations with the C code, and can't convince GCC > to do more than 4. I assume that this is probably where the remaining > 10MB/s is between these two variants. Yes, I suspect so. One slightly crazy idea I had was to write the function in C with just inline asm for the inner ldm/stm loop. The build system does not yet have support for .c files in the arch dirs instead of .s files, but it could be added. > Rich - do you have any comments on whether either the C or assembler > variants of memcpy might be suitable for inclusion in musl? 
I would say either might be, but it looks like if we want competitive performance, some asm will be needed (either inline or full). My leaning would be to go for something simpler than the asm you've been experimenting with, but with same or better performance, if this is possible. I realize the code is not that big as-is, in terms of binary size, but it's big from an "understanding it" perspective and I don't like big asm blobs that are hard for somebody to look at and say "oh yeah, this is clearly right". Anyway, the big questions I'd still like to get answered before moving forward is whether the cache line alignment has any benefit. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-11 3:37 ` Rich Felker @ 2013-07-11 4:04 ` Andre Renaud 2013-07-11 5:10 ` Andre Renaud 2013-07-11 5:27 ` Daniel Cegiełka 2013-07-15 4:25 ` Rob Landley 2 siblings, 1 reply; 38+ messages in thread From: Andre Renaud @ 2013-07-11 4:04 UTC (permalink / raw) To: Rich Felker; +Cc: musl Hi Rich, >> Rich - do you have any comments on whether either the C or assembler >> variants of memcpy might be suitable for inclusion in musl? > > I would say either might be, but it looks like if we want competitive > performance, some asm will be needed (either inline or full). My > leaning would be to go for something simpler than the asm you've been > experimenting with, but with same or better performance, if this is > possible. I realize the code is not that big as-is, in terms of binary > size, but it's big from an "understanding it" perspective and I don't > like big asm blobs that are hard for somebody to look at and say "oh > yeah, this is clearly right". > > Anyway, the big questions I'd still like to get answered before moving > forward is whether the cache line alignment has any benefit. I certainly appreciate the need for concise, well understood, easily readable code. I can't see any obvious reason why this shouldn't work, although the assembler as it stands makes pretty heavy use of all the registers, and I can't immediately see how to rework it to free up 2 more (I can free up 1 by dropping the attempted preload). Given my (lack of) skills with ARM assembler, I'm not sure I'll be able to look too deeply into either of these options, but I'll have a go at the inline ASM version to force 8*4byte loads to see if it improves things. Regards, Andre ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-11 4:04 ` Andre Renaud @ 2013-07-11 5:10 ` Andre Renaud 2013-07-11 12:46 ` Rich Felker 0 siblings, 1 reply; 38+ messages in thread From: Andre Renaud @ 2013-07-11 5:10 UTC (permalink / raw) To: musl; +Cc: Rich Felker > I can't see any obvious reason why this shouldn't work, although the > assembler as it stands makes pretty heavy use of all the registers, > and I can't immediately see how to rework it to free up 2 more (I can > free up 1 by dropping the attempted preload). Given my (lack of) > skills with ARM assembler, I'm not sure I'll be able to look too > deeply into either of these options, but I'll have a go at the inline > ASM version to force 8*4byte loads to see if it improves things. I've given it a bit of a go, and at first it appears to be working (although I don't exactly have a comprehensive test suite, so this is very preliminary). Anyone with some more ARM assembler experience is welcome to chip in with a comment. I also managed to mess up my last set of benchmarking - I'd indicated that I got 65 vs 95 vs 105, however I'd stuffed up the fact that the first call would have poor cache performance. Once I corrected that the results have become more like 65(naive) vs 105(typedef) vs 113(asm). Using the below code, it becomes 65(naive), 113(inline asm), 113(full asm). So the inline asm is able to perform as we'd expect. Assuming that it is technically correct (which is probably the biggest question). 
#define SS (8 * 4)
#define ALIGN (SS - 1)
void * noinline my_asm_memcpy(void * restrict dest, const void * restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
		goto misaligned;

	for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;
	if (n) {
		for (; n>=SS; n-= SS) {
			__asm__("ldmia %0, {r4-r11}"
				: "=r" (s)
				: "0" (s)
				: "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11");
			s+=SS;
			__asm__("stmia %0, {r4-r11}"
				: "=r" (d)
				: "0" (d));
			d+=SS;
		}
misaligned:
		for (; n; n--) *d++ = *s++;
	}
	return dest;
}

Regards, Andre ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-11 5:10 ` Andre Renaud @ 2013-07-11 12:46 ` Rich Felker 2013-07-11 22:34 ` Andre Renaud 0 siblings, 1 reply; 38+ messages in thread From: Rich Felker @ 2013-07-11 12:46 UTC (permalink / raw) To: musl On Thu, Jul 11, 2013 at 05:10:41PM +1200, Andre Renaud wrote: > > I can't see any obvious reason why this shouldn't work, although the > > assembler as it stands makes pretty heavy use of all the registers, > > and I can't immediately see how to rework it to free up 2 more (I can > > free up 1 by dropping the attempted preload). Given my (lack of) > > skills with ARM assembler, I'm not sure I'll be able to look too > > deeply into either of these options, but I'll have a go at the inline > > ASM version to force 8*4byte loads to see if it improves things. > > I've given it a bit of a go, and at first it appears to be working > (although I don't exactly have a comprehensive test suite, so this is > very preliminary). Anyone with some more ARM assembler experience is > welcome to chip in with a comment. > > I also managed to mess up my last set of benchmarking - I'd indicated > that I got 65 vs 95 vs 105, however I'd stuffed up the fact that the > first call would have poor cache performance. Once I corrected that > the results have become more like 65(naive) vs 105(typedef) vs > 113(asm). > > Using the below code, it becomes 65(naive), 113(inline asm), 113(full > asm). So the inline is able to do perform as we'd expect. Assuming > that it is technically correct (which is probably the biggest > question). It's not. 
> #define SS (8 * 4)
> #define ALIGN (SS - 1)
> void * noinline my_asm_memcpy(void * restrict dest, const void * restrict src, size_t n)
> {
> 	unsigned char *d = dest;
> 	const unsigned char *s = src;
>
> 	if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
> 		goto misaligned;
>
> 	for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;
> 	if (n) {
> 		for (; n>=SS; n-= SS) {
> 			__asm__("ldmia %0, {r4-r11}"
> 				: "=r" (s)
> 				: "0" (s)
> 				: "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11");
> 			s+=SS;
> 			__asm__("stmia %0, {r4-r11}"
> 				: "=r" (d)
> 				: "0" (d));
> 			d+=SS;

You need both instructions in the same asm block, and proper constraints. As it is, whether the registers keep their values between the two separate asm blocks is up to the compiler's whims. With the proper constraints ("+r" type), the s+=SS and d+=SS are unnecessary, as a bonus. Also there's no reason to force alignment to SS for this loop; that will simply prevent it from being used as much for smaller copies. I would use SS==sizeof(size_t) and then write 8*SS in the for loop. Last night I was in the process of writing something very similar, but I put the for loop in asm too and didn't finish it. If it performs just as well with the loop in C, I like your version better. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-11 12:46 ` Rich Felker @ 2013-07-11 22:34 ` Andre Renaud 2013-07-12 3:16 ` Rich Felker 0 siblings, 1 reply; 38+ messages in thread From: Andre Renaud @ 2013-07-11 22:34 UTC (permalink / raw) To: musl Hi Rich, > You need both instructions in the same asm block, and proper > constraints. As it is, whether the registers keep their values between > the two separate asm blocks is up to the compiler's whims. > > With the proper constraints ("+r" type), the s+=SS and d+=SS are > unnecessary, as a bonus. Also there's no reason to force alignment to > SS for this loop; that will simply prevent it from being used as much > for smaller copies. I would use SS==sizeof(size_t) and then write 8*SS > in the for loop. > > Last night I was in the process of writing something very similar, but > I put the for loop in asm too and didn't finish it. If it performs > just as well with the loop in C, I like your version better. I've rejiggled it a bit, and it appears to be working. I wasn't entirely sure what you meant about the proper constraints. There is an additional reason why 8*4 was used for the align - to force the whole loop to work in cache-line blocks. I've now done this explicitly on the lead-in by doing the first few copies as 32-bit, then going to the full cache-line asm. This has the same performance as the fully native assembler. However to get that I had to use the same trick that the native assembler uses - doing a load of the next block prior to storing this one. I'm a bit concerned that this would mean we'd be doing a read that was out of bounds, and I can't entirely see why this wouldn't be happening with the existing assembler (but I'm presuming it doesn't). Any comments on this side of it? 
#define SS sizeof(size_t)
#define ALIGN (SS - 1)
void * noinline my_asm_memcpy(void * restrict dest, const void * restrict src, size_t n)
{
        unsigned char *d = dest;
        const unsigned char *s = src;

        if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
                goto misaligned;

        /* ARM has 32-byte cache lines, so get us aligned to that */
        for (; ((uintptr_t)d & ((8 * SS) - 1)) && n; n-=SS) {
                *(size_t *)d = *(size_t *)s;
                d += SS;
                s += SS;
        }

        /* Do full cache line read/writes */
        if (n) {
                for (; n>=(8 * SS); n-= (8 * SS)) {
                        __asm__ (
                                "ldmia %0, {r4-r11}\n"
                                "add %0, %0, %4\n"
                                "bic r12, %0, %5\n"
                                "ldrhi r12, [%0]\n"
                                "stmia %1, {r4-r11}\n"
                                "add %1, %1, %4"
                                : "=r"(s), "=r"(d)
                                : "0"(s), "1"(d), "i"(8 * SS), "i"((8 * SS) - 1)
                                : "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12");
                }
misaligned:
                for (; n; n--) *d++ = *s++;
        }
        return dest;
}

Regards,
Andre

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-11 22:34 ` Andre Renaud @ 2013-07-12 3:16 ` Rich Felker 2013-07-12 3:36 ` Andre Renaud 0 siblings, 1 reply; 38+ messages in thread From: Rich Felker @ 2013-07-12 3:16 UTC (permalink / raw) To: musl On Fri, Jul 12, 2013 at 10:34:31AM +1200, Andre Renaud wrote:

> I've rejiggled it a bit, and it appears to be working. I wasn't
> entirely sure what you meant about the proper constraints. There is an
> additional reason why 8*4 was used for the align - to force the whole
> loop to work in cache-line blocks. I've now done this explicitly on
> the lead-in by doing the first few copies as 32-bit, then going to the
> full cache-line asm. This has the same performance as the fully native
> assembler. However to get that I had to use the same trick that the
> native assembler uses - doing a load of the next block prior to
> storing this one. I'm a bit concerned that this would mean we'd be
> doing a read that was out of bounds, and I can't entirely see why this
> wouldn't be happening with the existing assembler (but I'm presuming
> it doesn't). Any comments on this side of it?

I was unable to measure any difference in performance of your version with the prefetch hack versus simply:

        __asm__ __volatile__(
                "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                : "+r"(d), "+r"(s) :
                : "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");

in the inner loop.

Rich

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-12 3:16 ` Rich Felker @ 2013-07-12 3:36 ` Andre Renaud 2013-07-12 4:16 ` Rich Felker 0 siblings, 1 reply; 38+ messages in thread From: Andre Renaud @ 2013-07-12 3:36 UTC (permalink / raw) To: musl

> I was unable to measure any difference in performance of your version
> with the prefetch hack versus simply:
>
>         __asm__ __volatile__(
>                 "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
>                 "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
>                 : "+r"(d), "+r"(s) :
>                 : "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");

What kind of machine were you using? I see a change of 115MB/s -> 105MB/s when I drop the prefetch, even using the code that you suggested. This is on an Atmel AT91sam9g45 (ARM926ejs @ 400MHz). I'm assuming this is some subtlety about how the cache is operating?

Sticking the ldrhi back in brings the speed back, ie:

        __asm__ __volatile__(
                "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                "ldrhi r12, [%1]\n"
                "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                : "+r"(d), "+r"(s) :
                : "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "r12", "memory");

Regards,
Andre

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-12 3:36 ` Andre Renaud @ 2013-07-12 4:16 ` Rich Felker 2013-07-24 1:34 ` Andre Renaud 0 siblings, 1 reply; 38+ messages in thread From: Rich Felker @ 2013-07-12 4:16 UTC (permalink / raw) To: musl On Fri, Jul 12, 2013 at 03:36:42PM +1200, Andre Renaud wrote: > > I was unable to measure any difference in performance of your version > > with the prefetch hack versus simply: > > > > __asm__ __volatile__( > > "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t" > > "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t" > > : "+r"(d), "+r"(s) : > > : "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory"); > > What kind of machine were you using? I see a change of 115MB/s -> It's a combined ARM Cortex-A9 & FPGA chip from Xilinx. Supposedly the timings match the Cortex-A9 in other ARM chips. > 105MB/s when I drop the prefetch, even using the code that you > suggested. This is on an Atmel AT91sam9g45 (ARM926ejs @ 400MHz). I'm > assuming this is some subtlety about how the cache is operating? Perhaps so. By the way, I also did some tests with misaligning the src/dest with respect to cache lines. and the timing did change, but not in any way I could make sense of... It may turn out to be that the issues are sufficiently complex that we won't get ideal performance without either copying the BSD code you suggested or fully understanding what it's doing, and other ARM performance issues, and developing something new based on that understanding... In that case copying/adapting the BSD code might turn out to be the right solution for now. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-12 4:16 ` Rich Felker @ 2013-07-24 1:34 ` Andre Renaud 2013-07-24 3:48 ` Rich Felker 0 siblings, 1 reply; 38+ messages in thread From: Andre Renaud @ 2013-07-24 1:34 UTC (permalink / raw) To: musl Hi Rich, On 12 July 2013 16:16, Rich Felker <dalias@aerifal.cx> wrote: > By the way, I also did some tests with misaligning the src/dest with > respect to cache lines. and the timing did change, but not in any way > I could make sense of... > > It may turn out to be that the issues are sufficiently complex that we > won't get ideal performance without either copying the BSD code you > suggested or fully understanding what it's doing, and other ARM > performance issues, and developing something new based on that > understanding... In that case copying/adapting the BSD code might turn > out to be the right solution for now. What was the final decision on this? The last version (with mixed inline assembler/C) is (I believe) relatively readable, and appears to be correct. It also compiles on all the available platforms (ie: armv4+). Can this version be accepted? Regards, Andre ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-24 1:34 ` Andre Renaud @ 2013-07-24 3:48 ` Rich Felker 2013-07-24 4:40 ` Andre Renaud 0 siblings, 1 reply; 38+ messages in thread From: Rich Felker @ 2013-07-24 3:48 UTC (permalink / raw) To: musl On Wed, Jul 24, 2013 at 01:34:07PM +1200, Andre Renaud wrote: > Hi Rich, > > On 12 July 2013 16:16, Rich Felker <dalias@aerifal.cx> wrote: > > By the way, I also did some tests with misaligning the src/dest with > > respect to cache lines. and the timing did change, but not in any way > > I could make sense of... > > > > It may turn out to be that the issues are sufficiently complex that we > > won't get ideal performance without either copying the BSD code you > > suggested or fully understanding what it's doing, and other ARM > > performance issues, and developing something new based on that > > understanding... In that case copying/adapting the BSD code might turn > > out to be the right solution for now. > > What was the final decision on this? The last version (with mixed > inline assembler/C) is (I believe) relatively readable, and appears to > be correct. It also compiles on all the available platforms (ie: > armv4+). Can this version be accepted? It looks buggy as-is; as far as I can tell, it will crash if src/dest are aligned with respect to each other but not aligned mod 4, i.e. the code starts out copying word-at-a-time rather than byte-at-a-time. I think the C version would be acceptable if we get the bugs fixed and test it well, but I'd also like to still keep the asm under consideration. There are lots of cases not covered by the C version, like misaligned copies (important for strings, not for much else). Do you think these cases are important? Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-24 3:48 ` Rich Felker @ 2013-07-24 4:40 ` Andre Renaud 2013-07-28 8:09 ` Rich Felker 0 siblings, 1 reply; 38+ messages in thread From: Andre Renaud @ 2013-07-24 4:40 UTC (permalink / raw) To: musl Hi Rich,

> It looks buggy as-is; as far as I can tell, it will crash if src/dest
> are aligned with respect to each other but not aligned mod 4, i.e. the
> code starts out copying word-at-a-time rather than byte-at-a-time.

Yes, you are correct, I'd messed that up while looking at the cache alignment stuff (along with another small size-related bug). Fixing it is relatively straightforward though:

#define SS sizeof(size_t)
#define ALIGN (SS - 1)
void * noinline my_asm_memcpy(void * restrict dest, const void * restrict src, size_t n)
{
        unsigned char *d = dest;
        const unsigned char *s = src;

        if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
                goto misaligned;

        /* Get them word aligned */
        for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;

        /* ARM has 32-byte cache lines, so align to that for performance */
        for (; ((uintptr_t)d & ((8 * SS) - 1)) && n >= SS; n-=SS) {
                *(size_t *)d = *(size_t *)s;
                d += SS;
                s += SS;
        }

        /* Do full cache line read/writes */
        for (; n>=(8 * SS); n-= (8 * SS))
                __asm__ __volatile__(
                        "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                        "ldrhi r12, [%1]\n"
                        "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                        : "+r"(d), "+r"(s) :
                        : "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "r12", "memory");

misaligned:
        for (; n; n--) *d++ = *s++;
        return dest;
}

> I think the C version would be acceptable if we get the bugs fixed and
> test it well, but I'd also like to still keep the asm under
> consideration. There are lots of cases not covered by the C version,
> like misaligned copies (important for strings, not for much else). Do
> you think these cases are important?

At the moment the mis-aligned copies perform terribly (18MB/s vs glibc @ 100MB/s).
However the existing C implementation in musl is no different, so we're not degrading the current system. We're essentially missing the non-congruent copying stuff from the asm code. I'll have a look at this and see if I can write a similar C version. Regards, Andre ^ permalink raw reply [flat|nested] 38+ messages in thread
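The structure Andre's routine uses (byte copies up to word alignment, word copies up to a cache-line boundary, then eight words per iteration) can be sketched in portable C for readers without an ARM toolchain. This is an illustration, not the proposed patch: the `my_blocked_memcpy` name is hypothetical, and plain word copies stand in for the ldmia/stmia pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WS sizeof(size_t)
#define WALIGN (WS - 1)

/* Portable sketch of the blocked-copy structure discussed in the
 * thread. The inner loop moves 8 words per iteration, standing in
 * for the ldmia/stmia pair in the ARM inline-asm version. */
void *my_blocked_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (((uintptr_t)d & WALIGN) != ((uintptr_t)s & WALIGN))
        goto misaligned;

    /* Byte copies until the destination is word-aligned */
    for (; ((uintptr_t)d & WALIGN) && n; n--) *d++ = *s++;

    /* Word copies until d sits on an 8-word ("cache line") boundary */
    for (; ((uintptr_t)d & (8 * WS - 1)) && n >= WS; n -= WS) {
        *(size_t *)d = *(const size_t *)s;
        d += WS; s += WS;
    }

    /* 8 words per iteration: the ldm/stm block in the ARM version */
    for (; n >= 8 * WS; n -= 8 * WS) {
        size_t *dw = (size_t *)d;
        const size_t *sw = (const size_t *)s;
        dw[0]=sw[0]; dw[1]=sw[1]; dw[2]=sw[2]; dw[3]=sw[3];
        dw[4]=sw[4]; dw[5]=sw[5]; dw[6]=sw[6]; dw[7]=sw[7];
        d += 8 * WS; s += 8 * WS;
    }

misaligned:
    /* Non-congruent pointers (and any tail bytes) go byte-at-a-time */
    for (; n; n--) *d++ = *s++;
    return dest;
}
```

Like the musl code it mirrors, the word-sized accesses here technically sidestep strict aliasing; that is accepted in libc-internal code but worth flagging.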
* Re: Thinking about release 2013-07-24 4:40 ` Andre Renaud @ 2013-07-28 8:09 ` Rich Felker 0 siblings, 0 replies; 38+ messages in thread From: Rich Felker @ 2013-07-28 8:09 UTC (permalink / raw) To: musl On Wed, Jul 24, 2013 at 04:40:16PM +1200, Andre Renaud wrote: > > I think the C version would be acceptable if we get the bugs fixed and > > test it well, but I'd also like to still keep the asm under > > consideration. There are lots of cases not covered by the C version, > > like misaligned copies (important for strings, not for much else). Do > > you think these cases are important? > > At the moment the mis-aligned copies perform terribly (18MB/s vs glibc > @ 100MB/s). However the existing C implementation in musl is no > different, so we're not degrading the current system. > > We're essentially missing the non-congruent copying stuff from the asm > code. I'll have a look at this and see if I can write a similar C > version. Sorry this hasn't been moving forward more quickly. I've been experimenting with various memcpys on ARM, and I've found: 1. Pure C code that performs comparably (only in the aligned case, so far) to the bionic asm and your inline-asm C version. 2. The prefetch stuff in your inline asm and the bionic version is apparently making it slower. With that removed from your C (basically, my inline asm version) it's 10% faster on the machine I'm running it on. So I feel like we still have a ways to go figuring out the right solution. I know, from one standpoint, it would be nice to have _something_ right now, but I don't want to commit one version only to decide next week it's wrong and throw it all out. Hopefully in the mean time people who are trying to use musl seriously on arm and running into performance problems can drop in the bionic asm or another implementation. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-11 3:37 ` Rich Felker 2013-07-11 4:04 ` Andre Renaud @ 2013-07-11 5:27 ` Daniel Cegiełka 2013-07-11 12:49 ` Rich Felker 2013-07-15 4:25 ` Rob Landley 2 siblings, 1 reply; 38+ messages in thread From: Daniel Cegiełka @ 2013-07-11 5:27 UTC (permalink / raw) To: musl 2013/7/11 Rich Felker <dalias@aerifal.cx>: > On Thu, Jul 11, 2013 at 10:44:16AM +1200, Andre Renaud wrote: > Yes, I suspect so. One slightly crazy idea I had was to write the > function in C with just inline asm for the inner ldm/stm loop. A bit of useful code (x86): http://dpdk.org/browse/dpdk/tree/lib/librte_eal/common/include/rte_memcpy.h Daniel ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-11 5:27 ` Daniel Cegiełka @ 2013-07-11 12:49 ` Rich Felker 0 siblings, 0 replies; 38+ messages in thread From: Rich Felker @ 2013-07-11 12:49 UTC (permalink / raw) To: musl On Thu, Jul 11, 2013 at 07:27:11AM +0200, Daniel Cegiełka wrote: > 2013/7/11 Rich Felker <dalias@aerifal.cx>: > > On Thu, Jul 11, 2013 at 10:44:16AM +1200, Andre Renaud wrote: > > > Yes, I suspect so. One slightly crazy idea I had was to write the > > function in C with just inline asm for the inner ldm/stm loop. > > A bit of useful code (x86): > > http://dpdk.org/browse/dpdk/tree/lib/librte_eal/common/include/rte_memcpy.h On modern x86 (32-bit), this is slower than even the naive "rep movsb" version. Some x86 chips have problems with rep movsb, so the version in musl does a little bit more work (possibly more than it needs to) to use "rep movsd". On x86_64, there _may_ be faster approaches than the "rep movsq" we have right now, but so far my impression is that they don't work on baseline x86_64 (only later variants) and don't gain much. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
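As a concrete illustration of the "rep movs" approach Rich describes, here is a minimal sketch. The `rep_movs_memcpy` name is hypothetical and this is not the actual musl implementation; on non-x86-64 targets it degrades to a plain byte loop so the example stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Word-at-a-time "rep movsq" copy for x86-64, with a byte-wise
 * "rep movsb" for the tail. On other architectures this falls back
 * to a naive loop. Sketch only, not the musl code. */
static void *rep_movs_memcpy(void *dest, const void *src, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    void *d = dest;
    size_t words = n / 8, tail = n % 8;
    /* rcx = quadword count, rsi/rdi advance automatically */
    __asm__ __volatile__("rep movsq"
                         : "+D"(d), "+S"(src), "+c"(words)
                         : : "memory");
    __asm__ __volatile__("rep movsb"
                         : "+D"(d), "+S"(src), "+c"(tail)
                         : : "memory");
#else
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
#endif
    return dest;
}
```

The "+D"/"+S"/"+c" constraints pin the operands to rdi/rsi/rcx, which is what the rep-prefixed string instructions require.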
* Re: Thinking about release 2013-07-11 3:37 ` Rich Felker 2013-07-11 4:04 ` Andre Renaud 2013-07-11 5:27 ` Daniel Cegiełka @ 2013-07-15 4:25 ` Rob Landley 2 siblings, 0 replies; 38+ messages in thread From: Rob Landley @ 2013-07-15 4:25 UTC (permalink / raw) To: musl; +Cc: musl, Andre Renaud On 07/10/2013 10:37:55 PM, Rich Felker wrote: > On Thu, Jul 11, 2013 at 10:44:16AM +1200, Andre Renaud wrote: > > > This results in 95MB/s on my platform (up from 65MB/s for the > existing > > > memcpy.c, and down from 105MB/s with the asm optimised version). > It is > > > essentially identically readable to the existing memcpy.c. I'm not > > > really famiilar with any other cpu architectures, so I'm not sure > if > > > this would improve, or hurt, performance on other platforms. > > > > Reviewing the assembler that is produced, it appears that GCC will > > never generate an ldm/stm instruction (load/store multiple) that > reads > > into more than 4 registers, where as the optimised assembler does > them > > that read 8 (ie: 8 * 32bit reads in a single instruction). I've > tried > > For the asm, could we make it more than 8? 10 seems easy, 12 seems > doubtful. I don't see a fundamental reason it needs to be a power of > two, unless the cache line alignment really helps and isn't just > cargo-culting. (This is something I'd still like to know about the > asm: whether it's doing unnecessary stuff that does not help > performance.) You're going to hit bus bandwidth at some point, and that's likely to be a power of two. > > various tricks/optimisations with the C code, and can't convince GCC > > to do more than 4. I assume that this is probably where the > remaining > > 10MB/s is between these two variants. > > Yes, I suspect so. One slightly crazy idea I had was to write the > function in C with just inline asm for the inner ldm/stm loop. The > build system does not yet have support for .c files in the arch dirs > instead of .s files, but it could be added. 
Does it have support for a header defining a macro containing the assembly bit?

> > Rich - do you have any comments on whether either the C or assembler
> > variants of memcpy might be suitable for inclusion in musl?
>
> I would say either might be, but it looks like if we want competitive
> performance, some asm will be needed (either inline or full). My
> leaning would be to go for something simpler than the asm you've been
> experimenting with, but with same or better performance, if this is
> possible. I realize the code is not that big as-is, in terms of binary
> size, but it's big from an "understanding it" perspective and I don't
> like big asm blobs that are hard for somebody to look at and say "oh
> yeah, this is clearly right".
>
> Anyway, the big questions I'd still like to get answered before moving
> forward is whether the cache line alignment has any benefit.

I'd expect so. Fundamentally what the processor is doing is fetching and writing cachelines. What it does to the contents of the cachelines is just annotating that larger operation.

(Several days behind on email, as usual...)

> Rich

Rob

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-09 21:28 ` Andre Renaud 2013-07-09 22:26 ` Andre Renaud @ 2013-07-10 19:42 ` Rich Felker 2013-07-14 6:37 ` Rob Landley 2013-07-11 4:30 ` Strake 2 siblings, 1 reply; 38+ messages in thread From: Rich Felker @ 2013-07-10 19:42 UTC (permalink / raw) To: musl On Wed, Jul 10, 2013 at 09:28:21AM +1200, Andre Renaud wrote: > >> Does anyone have any comments on the suitability of this code, or what > > > > If nothing else, it fails to be armv4 compatible. Fixing that should > > not be hard, but it would require a bit of an audit. The return > > sequences are the obvious issue, but there may be other instructions > > in use that are not available on armv4 or maybe not even on armv5...? > > Rob Landley mentioned a while ago that armv4 has issues with the EABI > stuff. Is armv4 a definite lower bound for musl support, as opposed to > armv4t or armv5? EABI specifies thumb; however, it's possible to have code which conforms fully to EABI but does not rely on the presence of thumb. GCC is incapable of generating such code, but it could be enhanced to do so, and all of the existing assembly in musl is plain-v4-compatible, so I would prefer not to shut out the possibility of supporting older ARM. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-10 19:42 ` Rich Felker @ 2013-07-14 6:37 ` Rob Landley 0 siblings, 0 replies; 38+ messages in thread From: Rob Landley @ 2013-07-14 6:37 UTC (permalink / raw) To: musl; +Cc: musl On 07/10/2013 02:42:34 PM, Rich Felker wrote: > On Wed, Jul 10, 2013 at 09:28:21AM +1200, Andre Renaud wrote: > > >> Does anyone have any comments on the suitability of this code, > or what > > > > > > If nothing else, it fails to be armv4 compatible. Fixing that > should > > > not be hard, but it would require a bit of an audit. The return > > > sequences are the obvious issue, but there may be other > instructions > > > in use that are not available on armv4 or maybe not even on > armv5...? > > > > Rob Landley mentioned a while ago that armv4 has issues with the > EABI > > stuff. Is armv4 a definite lower bound for musl support, as opposed > to > > armv4t or armv5? > > EABI specifies thumb; however, it's possible to have code which > conforms fully to EABI but does not rely on the presence of thumb. GCC > is incapable of generating such code, but it could be enhanced to do > so, and all of the existing assembly in musl is plain-v4-compatible, > so I would prefer not to shut out the possibility of supporting older > ARM. One of my larger pending todo items for aboriginal is fishing the last gplv2 release out of gcc git the same way I did for binutils. (In theory, this should give me armv7l support. In practice, the mpfr and gmp split complicates matters...) If somebody wanted to come up with an armv4-eabi patch, I'd happily include it. Or just give me rather a lot of hints on what would be involved, since I'm not much of an arm assembly programmer... Rob ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-09 21:28 ` Andre Renaud 2013-07-09 22:26 ` Andre Renaud 2013-07-10 19:42 ` Rich Felker @ 2013-07-11 4:30 ` Strake 2013-07-11 4:33 ` Rich Felker 2 siblings, 1 reply; 38+ messages in thread From: Strake @ 2013-07-11 4:30 UTC (permalink / raw) To: musl On 09/07/2013, Andre Renaud <andre@bluewatersys.com> wrote:

> I wasn't too sure on memmove, but I've seen a reasonable amount of
> code which just uses memmove as standard (rather than memcpy), to
> avoid the possibility of overlapping regions. Not a great policy

Why? What loss with memmove? That it takes 1.0125 times as long as memcpy, other than when memcpy might just trash the array or summon nasal demons anyhow?

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-11 4:30 ` Strake @ 2013-07-11 4:33 ` Rich Felker 0 siblings, 0 replies; 38+ messages in thread From: Rich Felker @ 2013-07-11 4:33 UTC (permalink / raw) To: musl On Wed, Jul 10, 2013 at 11:30:47PM -0500, Strake wrote: > On 09/07/2013, Andre Renaud <andre@bluewatersys.com> wrote: > > I wasn't too sure on memmove, but I've seen a reasonable amount of > > code which just uses memmove as standard (rather than memcpy), to > > avoid the possibility of overlapping regions. Not a great policy > > Why? What loss with memmove? That it takes 1.0125 times as long as > memcpy, other than when memcpy might just trash the array or summon > nasal demons anyhow? If you're performing a copy between objects that overlap, or if you're not sure whether you might be, then it's very likely that you'd doing something wrong. Or at least that's my opinion. Anyway I have no objection to an optimized memmove, but I do think starting with just memcpy is easier for review and cleanup/optimization. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
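The overlap hazard being debated here can be made concrete: shifting a buffer in place makes source and destination overlap, which memmove is specified to handle and memcpy is not. A small sketch (the `overlap_demo` and `naive_forward_copy` names are hypothetical, and the forward byte loop only illustrates what a simple memcpy might do, since real memcpy behavior on overlap is undefined):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A naive forward byte loop, standing in for an overlap-oblivious copy. */
static void naive_forward_copy(char *d, const char *s, size_t n)
{
    while (n--) *d++ = *s++;
}

/* Shift "abcd" right by two within the same buffer. */
int overlap_demo(void)
{
    char a[8] = "abcdef";
    memmove(a + 2, a, 4);            /* defined: a becomes "ababcd" */

    char b[8] = "abcdef";
    naive_forward_copy(b + 2, b, 4); /* source bytes are overwritten
                                        before they are read: "ababab" */

    return memcmp(a, "ababcd", 6) == 0 && memcmp(b, "ababab", 6) == 0;
}
```

The forward loop smears the first two bytes through the region, which is exactly the failure mode memmove exists to prevent.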
* Re: Thinking about release 2013-07-09 5:37 ` Rich Felker 2013-07-09 6:24 ` Harald Becker 2013-07-09 21:28 ` Andre Renaud @ 2013-07-10 19:38 ` Rob Landley 2013-07-10 20:34 ` Andre Renaud 2 siblings, 1 reply; 38+ messages in thread From: Rob Landley @ 2013-07-10 19:38 UTC (permalink / raw) To: musl; +Cc: musl On 07/09/2013 12:37:12 AM, Rich Felker wrote: > On Tue, Jul 09, 2013 at 05:06:21PM +1200, Andre Renaud wrote: > > The git tree is available here: > > > https://github.com/AndreRenaud/musl/commit/713023e7320cf45b116d1c29b6155ece28904e69 > > It's an open question whether it's better to sync something like this > with an 'upstream' or adapt it to musl coding conventions. Generally > musl uses explicit instructions rather than pseudo-instructions/macros > for prologue and epilogue, and does not use named labels. Do your own local version. You can always copy ideas from other projects if "upstream" changes later. > > Does anyone have any comments on the suitability of this code, or > what > > If nothing else, it fails to be armv4 compatible. Query: did you ever implement a non-thumb version of armv4 eabi support? I remember some discussion about it being possible, I don't remember the outcome. > Fixing that should > not be hard, but it would require a bit of an audit. The return > sequences are the obvious issue, but there may be other instructions > in use that are not available on armv4 or maybe not even on armv5...? I've beaten armv4-only, armv4t-only, and armv5-only modes out of qemu. That's the reason for the first half of my versatile patch: http://landley.net/hg/aboriginal/file/1612/sources/patches/linux-arm.patch > > kind of more rigorous testing could be applied? > > See above. > > What also might be worth testing is whether GCC can compete if you > just give it a naive loop (not the fancy pseudo-vectorized stuff > currently in musl) and good CFLAGS. 
I know on x86 I was able to beat > the fanciest asm strlen I could come up with simply by writing the > naive loop in C and unrolling it a lot. Duff's device! Rob ^ permalink raw reply [flat|nested] 38+ messages in thread
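For readers unfamiliar with the reference, Duff's device interleaves a switch with an unrolled loop so the remainder and the full rounds share one body. A byte-oriented sketch (`duff_copy` is a hypothetical name, shown for illustration rather than as a proposal):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Classic Duff's device: the switch jumps into the middle of the
 * unrolled do/while to dispose of the n%8 remainder, then each
 * further pass copies a full 8 bytes. */
static void *duff_copy(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    if (n) {
        size_t rounds = (n + 7) / 8;
        switch (n % 8) {
        case 0: do { *d++ = *s++;
        case 7:      *d++ = *s++;
        case 6:      *d++ = *s++;
        case 5:      *d++ = *s++;
        case 4:      *d++ = *s++;
        case 3:      *d++ = *s++;
        case 2:      *d++ = *s++;
        case 1:      *d++ = *s++;
                } while (--rounds);
        }
    }
    return dest;
}
```

As Andre notes below, gcc at -O3 already unrolls the plain loop about this aggressively, which is why hand-rolling it bought nothing.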
* Re: Thinking about release 2013-07-10 19:38 ` Rob Landley @ 2013-07-10 20:34 ` Andre Renaud 2013-07-10 20:49 ` Nathan McSween 2013-07-10 21:01 ` Rich Felker 0 siblings, 2 replies; 38+ messages in thread From: Andre Renaud @ 2013-07-10 20:34 UTC (permalink / raw) To: musl >> What also might be worth testing is whether GCC can compete if you >> just give it a naive loop (not the fancy pseudo-vectorized stuff >> currently in musl) and good CFLAGS. I know on x86 I was able to beat >> the fanciest asm strlen I could come up with simply by writing the >> naive loop in C and unrolling it a lot. > > > Duff's device! That was exactly my first idea too, but interestingly it turns out not to have really added any performance improvement. Looking at the assembler, with -O3, gcc does a pretty good job of unrolling as it is. Regards, Andre ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-10 20:34 ` Andre Renaud @ 2013-07-10 20:49 ` Nathan McSween 2013-07-10 21:01 ` Rich Felker 1 sibling, 0 replies; 38+ messages in thread From: Nathan McSween @ 2013-07-10 20:49 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 892 bytes --] I would think the iterate-per-char-till-zero would take the most time, even if GCC vectorized without SIMD it would still need to iterate to find the zero in the word with the zero, current musl does this as well though. On Jul 10, 2013 1:34 PM, "Andre Renaud" <andre@bluewatersys.com> wrote: > >> What also might be worth testing is whether GCC can compete if you > >> just give it a naive loop (not the fancy pseudo-vectorized stuff > >> currently in musl) and good CFLAGS. I know on x86 I was able to beat > >> the fanciest asm strlen I could come up with simply by writing the > >> naive loop in C and unrolling it a lot. > > > > > > Duff's device! > > That was exactly my first idea too, but interestingly it turns out not > to have really added any performance improvement. Looking at the > assembler, with -O3, gcc does a pretty good job of unrolling as it is. > > Regards, > Andre > [-- Attachment #2: Type: text/html, Size: 1231 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-07-10 20:34 ` Andre Renaud 2013-07-10 20:49 ` Nathan McSween @ 2013-07-10 21:01 ` Rich Felker 1 sibling, 0 replies; 38+ messages in thread From: Rich Felker @ 2013-07-10 21:01 UTC (permalink / raw) To: musl On Thu, Jul 11, 2013 at 08:34:03AM +1200, Andre Renaud wrote: > >> What also might be worth testing is whether GCC can compete if you > >> just give it a naive loop (not the fancy pseudo-vectorized stuff > >> currently in musl) and good CFLAGS. I know on x86 I was able to beat > >> the fanciest asm strlen I could come up with simply by writing the > >> naive loop in C and unrolling it a lot. > > > > > > Duff's device! > > That was exactly my first idea too, but interestingly it turns out not > to have really added any performance improvement. Looking at the > assembler, with -O3, gcc does a pretty good job of unrolling as it is. For what it's worth, my testing showed the current memcpy code in musl and the naive "while (n--) *d++=*s++;" version performing near-identically at -O3, and both got about 20% faster with -funroll-all-loops. With -O2 or -Os, the naive version was about 5 times slower. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-06-13 1:25 Thinking about release Rich Felker 2013-06-13 1:33 ` Andre Renaud @ 2013-06-13 15:46 ` Isaac 2013-06-26 1:44 ` Rich Felker 2 siblings, 0 replies; 38+ messages in thread From: Isaac @ 2013-06-13 15:46 UTC (permalink / raw) To: musl On Wed, Jun 12, 2013 at 09:25:17PM -0400, Rich Felker wrote:

> Hi all,
>
> It's been mentioned that we're overdue for a release, and it looks
> like we have a fair amount of new stuff since the last release. Major
> changes so far:
>
> - Accepting \n in dynamic linker path file
> - Fixed broken x86_64 sigsetjmp
> - Removed wrong C++ __STDC_*_MACROS checks
> - Thread exit synchronization issues
> - Conforming overflow behavior in clock()
> - Support for large device numbers
> - Math optimizations on i386
> - C11 float.h macro additions
> - Fixes for exceptions in fma (gcc bug workarounds)
> - Fix misaligned stack when running ctors/dtors
> - Support for %m modifier in scanf
> - Support for using gcc intrinsic headers with musl-gcc wrapper
> - PRNG fixes
> - Per-process and per-thread cputime clocks and new(ish) Linux clocks
>
> I think the ether.h functions could definitely still make it in for
> this release too. inet_makeaddr, etc. could probably also be merged.

ether.h is needed for a full build of busybox:

$ grep -r ether_aton_r */
networking/udhcp/files.c: if (!mac_string || !ether_aton_r(mac_string, &mac_bytes))
networking/ether-wake.c: eap = ether_aton_r(hostid, eaddr);
networking/nameif.c: ch->mac = ether_aton_r(selector + (strncmp(selector, "mac=", 4) != 0 ? 0 : 4), lmac);

$ grep -r ether_ntoa */
networking/ether-wake.c: bb_debug_msg("The target station address is %s\n\n", ether_ntoa(eap));
networking/ether-wake.c: bb_debug_msg("Station address for hostname %s is %s\n\n", hostid, ether_ntoa(eaddr));
networking/zcip.c: ether_ntoa(sha),
networking/zcip.c: ether_ntoa(tha),
networking/arping.c: ether_ntoa((struct ether_addr *) p));
networking/arping.c: ether_ntoa((struct ether_addr *) p + ah->ar_hln + 4));

Most of these aren't universally needed, but I wanted to enable some of them. Right now, I'm using Strake's patch and compiling musl with -Os.

Of course, these aren't going to be enough to make busybox allyesconfig work right; I also ran into the following issues building busybox:

- <sys/personality.h> needs the macros from <linux/personality.h> (for setarch)
- A few cases of using extra headers that break the busybox build:
  <net/if_slip.h> can be changed to <linux/if_slip.h>
  <net/if_packet.h> is included (needlessly for musl) in networking/libiproute/iplink.c; this header is roughly the macros from <sys/socket.h> + struct sockaddr_pkt
  <netinet/ether.h> is used in arp and zcip
- CONFIG_COMPAT_EXTRA turns on use of glibc regex.h extensions in grep.
- Of course, vi still uses glibc regexes if you enable search and replace.

> Rich

HTH,
Isaac Dunham

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release 2013-06-13 1:25 Thinking about release Rich Felker 2013-06-13 1:33 ` Andre Renaud 2013-06-13 15:46 ` Isaac @ 2013-06-26 1:44 ` Rich Felker 2013-06-26 10:19 ` Szabolcs Nagy 2 siblings, 1 reply; 38+ messages in thread From: Rich Felker @ 2013-06-26 1:44 UTC (permalink / raw) To: musl On Wed, Jun 12, 2013 at 09:25:17PM -0400, Rich Felker wrote: > Hi all, > > It's been mentioned that we're overdue for a release, and it looks > like we have a fair amount of new stuff since the last release. Major > changes so far: > > - Accepting \n in dynamic linker path file This was buggy and apparently nobody (even the person who requested it?!) ever tested it. Just fixed it. Could possibly use some further review still to make sure it handles odd path files well. > I think the ether.h functions could definitely still make it in for > this release too. Just committed these, almost the exact versions Strake submitted. Apologies for taking so long! Thank you Strake. > inet_makeaddr, etc. could probably also be merged. I just wrote a new one based on the man page so as not to pull in more 3-clause BSD license mess. It's untested so hopefully it does the right thing. Just committed it. > Most of the other major items left on the agenda since the last > release are probably not going to happen right away unless there's a > volunteer to do them (zoneinfo, cpuset/affinity, string functions > cleanup, C++ ABI matching, ARM-optimized memcpy) and one, the ld.so > symlink direction issue, still requires some serious discussion and > decision-making. I think the status on all these issues remains the same. Rich ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Thinking about release
From: Szabolcs Nagy @ 2013-06-26 10:19 UTC (permalink / raw)
To: musl

* Rich Felker <dalias@aerifal.cx> [2013-06-25 21:44:07 -0400]:
> On Wed, Jun 12, 2013 at 09:25:17PM -0400, Rich Felker wrote:
> > - Accepting \n in dynamic linker path file
>
> This was buggy and apparently nobody (even the person who requested
> it?!) ever tested it. Just fixed it. Could possibly use some further
> review still to make sure it handles odd path files well.

If the path file is empty, the fixed (and original) code invokes UB
(it is not critical, since an empty path file does not make much
sense, but I think it should be fixed):

	if (!sys_path) {
		FILE *f = fopen(ETC_LDSO_PATH, "rbe");
		if (f) {
			// if f is empty then getdelim returns -1 and
			// allocates space for sys_path, but sys_path
			// is not null-terminated
			if (getdelim(&sys_path, (size_t[1]){0}, 0, f) > 0) {
				size_t l = strlen(sys_path);
				if (l && sys_path[l-1]=='\n') sys_path[l-1] = 0;
			}
			fclose(f);
		}
	}
	// sys_path is non-null but not a valid string here
	if (!sys_path) sys_path = "/lib:/usr/local/lib:/usr/lib";
	fd = path_open(name, sys_path, buf, sizeof buf);

I think either getdelim should be fixed so it makes sys_path
null-terminated even on eof/error, or sys_path should be freed and
set to 0 in case of failure so we fall back to the default value.
* Re: Thinking about release
From: Rich Felker @ 2013-06-26 14:21 UTC (permalink / raw)
To: musl

On Wed, Jun 26, 2013 at 12:19:34PM +0200, Szabolcs Nagy wrote:
> > > - Accepting \n in dynamic linker path file
> >
> > This was buggy and apparently nobody (even the person who requested
> > it?!) ever tested it. Just fixed it. Could possibly use some further
> > review still to make sure it handles odd path files well.
>
> if the path file is empty the fixed (and original) code
> invokes ub (it is not critical since an empty path file
> does not make much sense, but i think it should be fixed)

I just fixed this by falling back to an empty path on read error or
empty path file.

Rich
end of thread, other threads:[~2013-07-28  8:09 UTC | newest]

Thread overview: 38+ messages
2013-06-13  1:25 Thinking about release Rich Felker
2013-06-13  1:33 ` Andre Renaud
2013-06-13  1:43 ` Rich Felker
2013-07-09  5:06 ` Andre Renaud
2013-07-09  5:37 ` Rich Felker
2013-07-09  6:24 ` Harald Becker
2013-07-09 21:28 ` Andre Renaud
2013-07-09 22:26 ` Andre Renaud
2013-07-10  6:42 ` Jens Gustedt
2013-07-10  7:50 ` Rich Felker
2013-07-10 22:44 ` Andre Renaud
2013-07-11  3:37 ` Rich Felker
2013-07-11  4:04 ` Andre Renaud
2013-07-11  5:10 ` Andre Renaud
2013-07-11 12:46 ` Rich Felker
2013-07-11 22:34 ` Andre Renaud
2013-07-12  3:16 ` Rich Felker
2013-07-12  3:36 ` Andre Renaud
2013-07-12  4:16 ` Rich Felker
2013-07-24  1:34 ` Andre Renaud
2013-07-24  3:48 ` Rich Felker
2013-07-24  4:40 ` Andre Renaud
2013-07-28  8:09 ` Rich Felker
2013-07-11  5:27 ` Daniel Cegiełka
2013-07-11 12:49 ` Rich Felker
2013-07-15  4:25 ` Rob Landley
2013-07-10 19:42 ` Rich Felker
2013-07-14  6:37 ` Rob Landley
2013-07-11  4:30 ` Strake
2013-07-11  4:33 ` Rich Felker
2013-07-10 19:38 ` Rob Landley
2013-07-10 20:34 ` Andre Renaud
2013-07-10 20:49 ` Nathan McSween
2013-07-10 21:01 ` Rich Felker
2013-06-13 15:46 ` Isaac
2013-06-26  1:44 ` Rich Felker
2013-06-26 10:19 ` Szabolcs Nagy
2013-06-26 14:21 ` Rich Felker
Code repository for the project associated with this public inbox:
https://git.vuxu.org/mirror/musl/