* ARM optimisations
From: Andre Renaud @ 2013-02-28 23:15 UTC
To: musl

Hi,
Can anyone tell me what the policy for musl is regarding ARM optimised
assembly implementations of functions such as memcpy/memmove? I notice
that there are i386/x86_64 versions for some of these. Doing some
simple testing on an ARM platform I found that an ARM asm
implementation of memcpy is ~80% faster than the C one currently in
MUSL (this is on an ARMv5, so no NEON instructions or similar).

I don't think I'm capable of writing the optimised version entirely
myself, however there are various implementations floating around in
libraries such as bionic etc... Is it possible to have BSD licensed
code brought in to musl (which is MIT licensed)?

Regards,
Andre

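The test harness behind the ~80% figure was not posted. As a rough sketch
only, a comparison of that kind could be timed as below. The names
bytewise_copy and time_copy are hypothetical, the baseline is a plain byte
loop rather than musl's actual C memcpy, and the empty asm statement is a
GCC/Clang extension used as a compiler barrier.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by every memcpy candidate under test. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical baseline: a plain byte loop, NOT musl's C memcpy. */
void *bytewise_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Run `iters` copies of `n` bytes through `fn`; return elapsed nanoseconds. */
long long time_copy(copy_fn fn, void *dst, const void *src, size_t n, int iters)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        fn(dst, src, n);
        /* Compiler barrier (GCC/Clang extension) so the copies are kept. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling time_copy once with bytewise_copy and once with the candidate under
test (for example the libc memcpy), over the same buffers and several sizes,
gives two numbers to compare. Results from an emulator say little about real
hardware, so the runs belong on the target board.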
* Re: ARM optimisations
From: Rich Felker @ 2013-02-28 23:30 UTC
To: musl

On Fri, Mar 01, 2013 at 12:15:21PM +1300, Andre Renaud wrote:
> Hi,
> Can anyone tell me what the policy for musl is regarding ARM optimised
> assembly implementations of functions such as memcpy/memmove? I notice
> that there are i386/x86_64 versions for some of these. Doing some
> simple testing on an ARM platform I found that an ARM asm
> implementation of memcpy is ~80% faster than the C one currently in
> MUSL (this is on an ARMv5, so no NEON instructions or similar).
>
> I don't think I'm capable of writing the optimised version entirely
> myself, however there are various implementations floating around in
> libraries such as bionic etc... Is it possible to have BSD licensed
> code brought in to musl (which is MIT licensed)?

ARM optimizations are welcome as long as they're thoroughly tested,
not heavily bloated, and support all v4 (including no-thumb) and later
cpu models, either by using universally-available features or
conditioning use of features on the .hidden __hwcap provided in musl.

Modern BSD license without advert clause is fully compatible with MIT
license, so I don't have an objection to such code, but I'm also not a
fan of pure copy-and-paste coding. If nothing else, imported code
would probably need to be cleaned up to build as .s rather than .S,
removing #ifdefs and stuff like that.

If you'd like to introduce some possible implementations we could use
or just ideas for how these functions should work, myself and others
on the project would be happy to review them.

Rich

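As an illustration of what conditioning on hwcap bits could look like, here
is a minimal C sketch of run-time selection with a universal fallback. It is
not musl code: musl's __hwcap is an internal hidden symbol and the real
routines would be assembly. The names select_memcpy, memcpy_generic and
memcpy_v5 are hypothetical, the DEMO_HWCAP_EDSP value mirrors HWCAP_EDSP from
Linux's ARM asm/hwcap.h, and treating that bit as a proxy for ARMv5TE
features such as PLD is an assumption.

```c
#include <stddef.h>

/* Mirrors HWCAP_EDSP from Linux's ARM <asm/hwcap.h>; redefined locally with
 * a DEMO_ prefix so the sketch builds on any host. Using it as the test for
 * "ARMv5TE or later" is an assumption of this sketch. */
#define DEMO_HWCAP_EDSP (1UL << 7)

typedef void *(*memcpy_fn)(void *dst, const void *src, size_t n);

/* Baseline that uses only universally available features. */
void *memcpy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Hypothetical stand-in for a tuned ARMv5 routine. In a real libc this would
 * be assembly; here it forwards to the baseline so the sketch runs anywhere. */
void *memcpy_v5(void *dst, const void *src, size_t n)
{
    return memcpy_generic(dst, src, n);
}

/* Choose an implementation from the hwcap bits the kernel reported. */
memcpy_fn select_memcpy(unsigned long hwcap)
{
    if (hwcap & DEMO_HWCAP_EDSP)
        return memcpy_v5;
    return memcpy_generic;   /* safe on every supported CPU, v4 included */
}
```

The hwcap value is passed in as a parameter so the selection logic can be
exercised on any machine. An application on a libc that provides it could
obtain the real value with getauxval(AT_HWCAP) from <sys/auxv.h>; inside
musl the check would read __hwcap directly.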
* Re: ARM optimisations
From: Rob Landley @ 2013-03-02 4:33 UTC
To: musl; +Cc: musl

On 02/28/2013 05:30:51 PM, Rich Felker wrote:
> On Fri, Mar 01, 2013 at 12:15:21PM +1300, Andre Renaud wrote:
> > Hi,
> > Can anyone tell me what the policy for musl is regarding ARM optimised
> > assembly implementations of functions such as memcpy/memmove? I notice
> > that there are i386/x86_64 versions for some of these. Doing some
> > simple testing on an ARM platform I found that an ARM asm
> > implementation of memcpy is ~80% faster than the C one currently in
> > MUSL (this is on an ARMv5, so no NEON instructions or similar).
> >
> > I don't think I'm capable of writing the optimised version entirely
> > myself, however there are various implementations floating around in
> > libraries such as bionic etc... Is it possible to have BSD licensed
> > code brought in to musl (which is MIT licensed)?
>
> ARM optimizations are welcome as long as they're thoroughly tested,
> not heavily bloated, and support all v4 (including no-thumb) and later
> cpu models, either by using universally-available features or
> conditioning use of features on the .hidden __hwcap provided in musl.

Out of curiosity, why armv4 no thumb?

I'd actually say that armv5 is probably the one to optimize for,
because it's somewhere over 80% of the installed base of arm systems
and generally provides an additional 25% speedup from armv4 to armv5.
Anything lower than that can use C, anything newer than that can
benefit from an armv5 version vs C.

The reason armv4t _without_ thumb isn't interesting is you need at
least armv4t to use EABI, and I had to patch my compiler to make
even that work because telling it EABI hardwired output to <= armv5l
even though that wasn't technically required. (Presumably since fixed
but the point is nobody _noticed_ for several years.)

Newer compilers have dropped support for OABI entirely, and armv4t
systems aren't that common. (They existed, the tin can tools nail
board used one, but the generic C code works for them. Point is I'm
not sure they're worth _optimizing_ for if it costs the vast
majority of systems a 25% performance hit and we don't want to
maintain multiple versions. If you _have_ an armv5 version, the
armv4 one won't/shouldn't get much testing.)

I believe armv6 was mostly just SMP extensions, so not worth
optimizing memcpy for. armv7 is nice but not ubiquitous the way
armv5 is, and armv7 brings with it the "thumb2" instruction set
which means you'd need 2 versions depending on what target you
wanted to compile for...

Rob

* Re: ARM optimisations
From: Rich Felker @ 2013-03-02 6:21 UTC
To: musl

On Fri, Mar 01, 2013 at 10:33:19PM -0600, Rob Landley wrote:
> On 02/28/2013 05:30:51 PM, Rich Felker wrote:
> >On Fri, Mar 01, 2013 at 12:15:21PM +1300, Andre Renaud wrote:
> >> Hi,
> >> Can anyone tell me what the policy for musl is regarding ARM optimised
> >> assembly implementations of functions such as memcpy/memmove? I notice
> >> that there are i386/x86_64 versions for some of these. Doing some
> >> simple testing on an ARM platform I found that an ARM asm
> >> implementation of memcpy is ~80% faster than the C one currently in
> >> MUSL (this is on an ARMv5, so no NEON instructions or similar).
> >>
> >> I don't think I'm capable of writing the optimised version entirely
> >> myself, however there are various implementations floating around in
> >> libraries such as bionic etc... Is it possible to have BSD licensed
> >> code brought in to musl (which is MIT licensed)?
> >
> >ARM optimizations are welcome as long as they're thoroughly tested,
> >not heavily bloated, and support all v4 (including no-thumb) and later
> >cpu models, either by using universally-available features or
> >conditioning use of features on the .hidden __hwcap provided in musl.
>
> Out of curiosity, why armv4 no thumb?
>
> I'd actually say that armv5 is probably the one to optimize for,
> because it's somewhere over 80% of the installed base of arm systems
> and generally provides an additional 25% speedup from armv4 to armv5.
> Anything lower than that can use C, anything newer than that can
> benefit from an armv5 version vs C.
>
> The reason armv4t _without_ thumb isn't interesting is you need at
> least armv4t to use EABI, and I had to patch my compiler to make

This is a compiler bug. If the compiler can be made to generate proper
return code, EABI works with armv4 (non-thumb) too.

> Newer compilers have dropped support for OABI entirely, and armv4t

OABI is not supported by musl at all. The intent is simply not to
_preclude_ use of non-thumb, even though there are other obstacles to
its use now.

> systems aren't that common. (They existed, the tin can tools nail
> board used one, but the generic C code works for them. Point is I'm
> not sure they're worth _optimizing_ for if it costs the vast
> majority of systems a 25% performance hit and we don't want to
> maintain multiple versions. If you _have_ an armv5 version, the
> armv4 one won't/shouldn't get much testing.)

Can you explain why you think a version that's v4 compatible will be
that much slower? If so, v5 code can be used as long as it checks
__hwcap and falls back to a simple working version...

Rich

* Re: ARM optimisations
From: Rob Landley @ 2013-03-04 18:55 UTC
To: musl; +Cc: musl

On 03/02/2013 12:21:02 AM, Rich Felker wrote:
> > systems aren't that common. (They existed, the tin can tools nail
> > board used one, but the generic C code works for them. Point is I'm
> > not sure they're worth _optimizing_ for if it costs the vast
> > majority of systems a 25% performance hit and we don't want to
> > maintain multiple versions. If you _have_ an armv5 version, the
> > armv4 one won't/shouldn't get much testing.)
>
> Can you explain why you think a version that's v4 compatible will be
> that much slower? If so, v5 code can be used as long as it checks
> __hwcap and falls back to a simple working version...

Alas, I do not have recent benchmarks. The timesys guys benched various
stuff in 2006 and that's where I grabbed the 25% figure. I mostly test
under qemu, where benchmarks are meaningless for real hardware.

If I'm in error, ignore me.

Rob

* Re: ARM optimisations
From: Szabolcs Nagy @ 2013-03-02 11:34 UTC
To: musl

* Rob Landley <rob@landley.net> [2013-03-01 22:33:19 -0600]:
> I'd actually say that armv5 is probably the one to optimize for,
> because it's somewhere over 80% of the installed base of arm systems
> and generally provides an additional 25% speedup from armv4 to armv5.
> Anything lower than that can use C, anything newer than that can
> benefit from an armv5 version vs C.
...
> I believe armv6 was mostly just SMP extensions, so not worth
> optimizing memcpy for. armv7 is nice but not ubiquitous the way
> armv5 is, and armv7 brings with it the "thumb2" instruction set
> which means you'd need 2 versions depending on what target you
> wanted to compile for...

a quick research shows that

glibc has ifdefs for armv5te and armv4t optimizations
http://sourceware.org/git/?p=glibc.git;a=blob;f=ports/sysdeps/arm/memcpy.S

linaro has armv7 optimized version
http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/src/linaro-a9/memcpy.S

olibc (the bionic one not the openbsd one) has armv7+neon optimized memcpy
https://github.com/olibc/olibc/blob/master/libc/arch-arm/bionic/memcpy.S

* Re: ARM optimisations
From: Andre Renaud @ 2013-03-02 20:33 UTC
To: musl

On 3 March 2013 00:34, Szabolcs Nagy <nsz@port70.net> wrote:
> * Rob Landley <rob@landley.net> [2013-03-01 22:33:19 -0600]:
>> I'd actually say that armv5 is probably the one to optimize for,
>> because it's somewhere over 80% of the installed base of arm systems
>> and generally provides an additional 25% speedup from armv4 to armv5.
>> Anything lower than that can use C, anything newer than that can
>> benefit from an armv5 version vs C.
> ...
>> I believe armv6 was mostly just SMP extensions, so not worth
>> optimizing memcpy for. armv7 is nice but not ubiquitous the way
>> armv5 is, and armv7 brings with it the "thumb2" instruction set
>> which means you'd need 2 versions depending on what target you
>> wanted to compile for...
>
> a quick research shows that
>
> glibc has ifdefs for armv5te and armv4t optimizations
> http://sourceware.org/git/?p=glibc.git;a=blob;f=ports/sysdeps/arm/memcpy.S
>
> linaro has armv7 optimized version
> http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/src/linaro-a9/memcpy.S
>
> olibc (the bionic one not the openbsd one) has armv7+neon optimized memcpy
> https://github.com/olibc/olibc/blob/master/libc/arch-arm/bionic/memcpy.S

The bionic code uses a couple of pre-processor tricks to combine the
ARMv4 & ARMv5 code, specifically around the PLD and CALIGN
instructions. Since (I assume) bionic is built at compile time for a
specific CPU, it is relatively easy to do these, however I got the
impression (and may be mistaken) that we were trying to avoid compile
time CPU detection in favour of run-time CPU detection. If that is the
case, then you would need two separate implementations (possibly with
some code sharing), and I thought that the overall code-size bloat
that this would bring wouldn't be worth it. This is especially true
when you talk about ARM NEON/v7, as it is essentially completely
different, so you'd end up with somewhere between 300% & 500% code
size increase on ARM to support all three platforms (based on the
current implementation going from 1k to 1.5k when I used the ASM
optimised version).

Having said all that, I do tend to agree that the ARMv4 platforms are
relatively archaic, and simply not having an optimised version for
them could be an acceptable alternative. ARMv5t is probably still too
popular to ignore.

Regards,
Andre

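For readers unfamiliar with the compile-time approach described above, the
following is a loose C analogue of guarding a prefetch hint with the
preprocessor. It is not the bionic source, which does this in assembly. The
names PLD and copy_with_prefetch are hypothetical here, __builtin_prefetch
is a GCC/Clang builtin, and __ARM_ARCH is assumed to be predefined by the
compiler on ARM targets (older toolchains may only provide per-version
macros instead).

```c
#include <stddef.h>

/* Compile-time selection: the prefetch hint is only emitted when the build
 * targets ARMv5 or later. On any other target the macro expands to nothing,
 * so one source file serves both cases but the binary is fixed at build time. */
#if defined(__ARM_ARCH) && __ARM_ARCH >= 5
#define PLD(p) __builtin_prefetch(p)
#else
#define PLD(p) ((void)0)
#endif

/* Byte copy that issues a prefetch hint once every 32 bytes, looking 64
 * bytes ahead and never past the end of the source buffer. */
void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++) {
        if ((i & 31) == 0 && n - i > 64)
            PLD(s + i + 64);
        d[i] = s[i];
    }
    return dst;
}
```

The trade-off matches the one raised in the message: this keeps a single
implementation and no extra code size, but the choice is baked in when the
library is built, whereas run-time detection needs both variants present in
the binary.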