ARM optimisations

mailing list of musl libc
* ARM optimisations
@ 2013-02-28 23:15 Andre Renaud
  2013-02-28 23:30 ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Andre Renaud @ 2013-02-28 23:15 UTC
  To: musl

Hi,
Can anyone tell me what the policy for musl is regarding ARM optimised
assembly implementations of functions such as memcpy/memmove? I notice
that there are i386/x86_64 versions for some of these. Doing some
simple testing on an ARM platform I found that an ARM asm
implementation of memcpy is ~80% faster than the C one currently in
MUSL (this is on an ARMv5, so no NEON instructions or similar).

I don't think I'm capable of writing the optimised version entirely
myself, however there are various implementations floating around in
libraries such as bionic etc... Is it possible to have BSD licensed
code brought in to musl (which is MIT licensed)?

Regards,
Andre

* Re: ARM optimisations
  2013-02-28 23:15 ARM optimisations Andre Renaud
@ 2013-02-28 23:30 ` Rich Felker
  2013-03-02  4:33   ` Rob Landley
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2013-02-28 23:30 UTC
  To: musl

On Fri, Mar 01, 2013 at 12:15:21PM +1300, Andre Renaud wrote:
> Hi,
> Can anyone tell me what the policy for musl is regarding ARM optimised
> assembly implementations of functions such as memcpy/memmove? I notice
> that there are i386/x86_64 versions for some of these. Doing some
> simple testing on an ARM platform I found that an ARM asm
> implementation of memcpy is ~80% faster than the C one currently in
> MUSL (this is on an ARMv5, so no NEON instructions or similar).
> 
> I don't think I'm capable of writing the optimised version entirely
> myself, however there are various implementations floating around in
> libraries such as bionic etc... Is it possible to have BSD licensed
> code brought in to musl (which is MIT licensed)?

ARM optimizations are welcome as long as they're thoroughly tested,
not heavily bloated, and support all v4 (including no-thumb) and later
cpu models, either by using universally-available features or
conditioning use of features on the .hidden __hwcap provided in musl.

Modern BSD license without advert clause is fully compatible with MIT
license, so I don't have an objection to such code, but I'm also not a
fan of pure copy-and-paste coding. If nothing else, imported code
would probably need to be cleaned up to build as .s rather than .S,
removing #ifdefs and stuff like that.

If you'd like to propose some possible implementations we could use,
or just ideas for how these functions should work, others on the
project and I would be happy to review them.

Rich

* Re: ARM optimisations
  2013-02-28 23:30 ` Rich Felker
@ 2013-03-02  4:33   ` Rob Landley
  2013-03-02  6:21     ` Rich Felker
  2013-03-02 11:34     ` Szabolcs Nagy
  0 siblings, 2 replies; 7+ messages in thread
From: Rob Landley @ 2013-03-02  4:33 UTC
  To: musl; +Cc: musl

On 02/28/2013 05:30:51 PM, Rich Felker wrote:
> On Fri, Mar 01, 2013 at 12:15:21PM +1300, Andre Renaud wrote:
> > Hi,
> > Can anyone tell me what the policy for musl is regarding ARM optimised
> > assembly implementations of functions such as memcpy/memmove? I notice
> > that there are i386/x86_64 versions for some of these. Doing some
> > simple testing on an ARM platform I found that an ARM asm
> > implementation of memcpy is ~80% faster than the C one currently in
> > MUSL (this is on an ARMv5, so no NEON instructions or similar).
> >
> > I don't think I'm capable of writing the optimised version entirely
> > myself, however there are various implementations floating around in
> > libraries such as bionic etc... Is it possible to have BSD licensed
> > code brought in to musl (which is MIT licensed)?
> 
> ARM optimizations are welcome as long as they're thoroughly tested,
> not heavily bloated, and support all v4 (including no-thumb) and later
> cpu models, either by using universally-available features or
> conditioning use of features on the .hidden __hwcap provided in musl.

Out of curiosity, why armv4 no thumb?

I'd actually say that armv5 is probably the one to optimize for,
because it's somewhere over 80% of the installed base of arm systems
and generally provides an additional 25% speedup from armv4 to armv5.
Anything lower than that can use C, anything newer than that can
benefit from an armv5 version vs C.

The reason armv4t _without_ thumb isn't interesting is you need at  
least armv4t to use EABI, and I had to patch my compiler to make even  
that work because telling it EABI hardwired output to <= armv5l even  
though that wasn't technically required. (Presumably since fixed but  
the point is nobody _noticed_ for several years.)

Newer compilers have dropped support for OABI entirely, and armv4t  
systems aren't that common. (They existed, the tin can tools nail board  
used one, but the generic C code works for them. Point is I'm not sure  
they're worth _optimizing_ for if it costs the vast majority of systems  
a 25% performance hit and we don't want to maintain multiple versions.  
If you _have_ an armv5 version, the armv4 one won't/shouldn't get much  
testing.)

I believe armv6 was mostly just SMP extensions, so not worth optimizing
memcpy for. armv7 is nice but not ubiquitous the way armv5 is, and
armv7 brings with it the "thumb2" instruction set which means you'd
need 2 versions depending on what target you wanted to compile for...

Rob

* Re: ARM optimisations
  2013-03-02  4:33   ` Rob Landley
@ 2013-03-02  6:21     ` Rich Felker
  2013-03-04 18:55       ` Rob Landley
  2013-03-02 11:34     ` Szabolcs Nagy
  1 sibling, 1 reply; 7+ messages in thread
From: Rich Felker @ 2013-03-02  6:21 UTC
  To: musl

On Fri, Mar 01, 2013 at 10:33:19PM -0600, Rob Landley wrote:
> On 02/28/2013 05:30:51 PM, Rich Felker wrote:
> >On Fri, Mar 01, 2013 at 12:15:21PM +1300, Andre Renaud wrote:
> >> Hi,
> >> Can anyone tell me what the policy for musl is regarding ARM optimised
> >> assembly implementations of functions such as memcpy/memmove? I notice
> >> that there are i386/x86_64 versions for some of these. Doing some
> >> simple testing on an ARM platform I found that an ARM asm
> >> implementation of memcpy is ~80% faster than the C one currently in
> >> MUSL (this is on an ARMv5, so no NEON instructions or similar).
> >>
> >> I don't think I'm capable of writing the optimised version entirely
> >> myself, however there are various implementations floating around in
> >> libraries such as bionic etc... Is it possible to have BSD licensed
> >> code brought in to musl (which is MIT licensed)?
> >
> >ARM optimizations are welcome as long as they're thoroughly tested,
> >not heavily bloated, and support all v4 (including no-thumb) and later
> >cpu models, either by using universally-available features or
> >conditioning use of features on the .hidden __hwcap provided in musl.
> 
> Out of curiosity, why armv4 no thumb?
> 
> I'd actually say that armv5 is probably the one to optimize for,
> because it's somewhere over 80% of the installed base of arm systems
> and generally provides an additional 25% speedup from armv4 to armv5.
> Anything lower than that can use C, anything newer than that can
> benefit from an armv5 version vs C.
> 
> The reason armv4t _without_ thumb isn't interesting is you need at
> least armv4t to use EABI, and I had to patch my compiler to make

This is a compiler bug. If the compiler can be made to generate proper
return code, EABI works with armv4 (non-thumb) too.

> Newer compilers have dropped support for OABI entirely, and armv4t

OABI is not supported by musl at all. The intent is simply not to
_preclude_ use of non-thumb, even though there are other obstacles to
its use now.

> systems aren't that common. (They existed, the tin can tools nail
> board used one, but the generic C code works for them. Point is I'm
> not sure they're worth _optimizing_ for if it costs the vast
> majority of systems a 25% performance hit and we don't want to
> maintain multiple versions. If you _have_ an armv5 version, the
> armv4 one won't/shouldn't get much testing.)

Can you explain why you think a version that's v4 compatible will be
that much slower? If so, v5 code can be used as long as it checks
__hwcap and falls back to a simple working version...
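
A minimal sketch of the check-and-fall-back pattern described above,
written in C rather than assembly. Every name in it is hypothetical:
HYP_HWCAP_FAST_COPY stands in for whatever feature bit a real
implementation would test, the hwcap argument stands in for musl's
internal __hwcap, and copy_fast is only a word-at-a-time loop, not a
tuned ARMv5 routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature bit; a real build would test an actual hwcap flag. */
#define HYP_HWCAP_FAST_COPY (1UL << 0)

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Simple working version: byte loop that is safe on every CPU model. */
static void *copy_simple(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

/* Stand-in for an optimised routine: moves 4 bytes per step, then the tail. */
static void *copy_fast(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (; n >= sizeof(uint32_t); n -= sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, s, sizeof w);   /* no alignment assumptions in this sketch */
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
    }
    while (n--) *d++ = *s++;
    return dst;
}

/* Choose a routine from the capability word; fall back when the bit is clear. */
static copy_fn select_copy(unsigned long hwcap)
{
    return (hwcap & HYP_HWCAP_FAST_COPY) ? copy_fast : copy_simple;
}
```

A caller would run select_copy once and keep the returned pointer, so
the capability test is not repeated on every copy.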

Rich


* Re: ARM optimisations
  2013-03-02  4:33   ` Rob Landley
  2013-03-02  6:21     ` Rich Felker
@ 2013-03-02 11:34     ` Szabolcs Nagy
  2013-03-02 20:33       ` Andre Renaud
  1 sibling, 1 reply; 7+ messages in thread
From: Szabolcs Nagy @ 2013-03-02 11:34 UTC
  To: musl

* Rob Landley <rob@landley.net> [2013-03-01 22:33:19 -0600]:
> I'd actually say that armv5 is probably the one to optimize for,
> because it's somewhere over 80% of the installed base of arm systems
> and generally provides an additional 25% speedup from armv4 to armv5.
> Anything lower than that can use C, anything newer than that can
> benefit from an armv5 version vs C.
...
> I believe armv6 was mostly just SMP extensions, so not worth
> optimizing memcpy for. armv7 is nice but not ubiquitous the way
> armv5 is, and armv7 brings with it the "thumb2" instruction set
> which means you'd need 2 versions depending on what target you
> wanted to compile for...

a quick search shows that

glibc has ifdefs for armv5te and armv4t optimizations
http://sourceware.org/git/?p=glibc.git;a=blob;f=ports/sysdeps/arm/memcpy.S

linaro has armv7 optimized version
http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/src/linaro-a9/memcpy.S

olibc (the bionic one not the openbsd one) has armv7+neon optimized memcpy
https://github.com/olibc/olibc/blob/master/libc/arch-arm/bionic/memcpy.S


* Re: ARM optimisations
  2013-03-02 11:34     ` Szabolcs Nagy
@ 2013-03-02 20:33       ` Andre Renaud
  0 siblings, 0 replies; 7+ messages in thread
From: Andre Renaud @ 2013-03-02 20:33 UTC
  To: musl

On 3 March 2013 00:34, Szabolcs Nagy <nsz@port70.net> wrote:
> * Rob Landley <rob@landley.net> [2013-03-01 22:33:19 -0600]:
>> I'd actually say that armv5 is probably the one to optimize for,
>> because it's somewhere over 80% of the installed base of arm systems
>> and generally provides an additional 25% speedup from armv4 to armv5.
>> Anything lower than that can use C, anything newer than that can
>> benefit from an armv5 version vs C.
> ...
>> I believe armv6 was mostly just SMP extensions, so not worth
>> optimizing memcpy for. armv7 is nice but not ubiquitous the way
>> armv5 is, and armv7 brings with it the "thumb2" instruction set
>> which means you'd need 2 versions depending on what target you
>> wanted to compile for...
>
> a quick search shows that
>
> glibc has ifdefs for armv5te and armv4t optimizations
> http://sourceware.org/git/?p=glibc.git;a=blob;f=ports/sysdeps/arm/memcpy.S
>
> linaro has armv7 optimized version
> http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/src/linaro-a9/memcpy.S
>
> olibc (the bionic one not the openbsd one) has armv7+neon optimized memcpy
> https://github.com/olibc/olibc/blob/master/libc/arch-arm/bionic/memcpy.S

The bionic code uses a couple of pre-processor tricks to combine the
ARMv4 & ARMv5 code, specifically around the PLD and CALIGN
instructions. Since (I assume) bionic is built at compile time for a
specific CPU, this is relatively easy to do there. However, I got the
impression (and may be mistaken) that we were trying to avoid compile
time CPU detection in favour of run-time CPU detection. If that is the
case, then you would need two separate implementations (possibly with
some code sharing), and I thought that the overall code-size bloat
that this would bring wouldn't be worth it. This is especially true
when you talk about ARM NEON/v7, as it is essentially completely
different, so you'd end up with somewhere between a 300% & 500% code
size increase on ARM to support all three platforms (based on the
current implementation going from 1k to 1.5k when I used the ASM
optimised version).
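
To illustrate the compile-time approach mentioned above, here is a
rough C sketch of a prefetch macro that compiles to nothing on older
targets. PREFETCH, CHUNK and copy_with_prefetch are hypothetical names
and not bionic's actual macros; the sketch assumes the compiler
predefines __ARM_ARCH and provides the GCC/Clang builtin
__builtin_prefetch.

```c
#include <stddef.h>

/* Decided when the library is built: one binary, one CPU family. */
#if defined(__GNUC__) && defined(__ARM_ARCH) && __ARM_ARCH >= 5
#define PREFETCH(p) __builtin_prefetch(p)  /* compiler emits a hint if the target has one */
#else
#define PREFETCH(p) ((void)0)              /* older or non-ARM target: no-op */
#endif

#define CHUNK 32  /* assumed block size for this sketch */

static void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= CHUNK) {
        PREFETCH(s + CHUNK);  /* hint for the next block; never dereferenced here */
        for (size_t i = 0; i < CHUNK; i++)
            d[i] = s[i];
        d += CHUNK;
        s += CHUNK;
        n -= CHUNK;
    }
    while (n--) *d++ = *s++;  /* remaining tail bytes */
    return dst;
}
```

The source stays shared, but the choice is fixed per build. Run-time
detection would instead need both variants present in the same binary,
which is where the size concern comes from.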

Having said all that, I do tend to agree that the ARMv4 platforms are
relatively archaic, and simply not having an optimised version for
them could be an acceptable alternative. ARMv5t is probably still too
popular to ignore.

Regards,
Andre

* Re: ARM optimisations
  2013-03-02  6:21     ` Rich Felker
@ 2013-03-04 18:55       ` Rob Landley
  0 siblings, 0 replies; 7+ messages in thread
From: Rob Landley @ 2013-03-04 18:55 UTC
  To: musl; +Cc: musl

On 03/02/2013 12:21:02 AM, Rich Felker wrote:
> > systems aren't that common. (They existed, the tin can tools nail
> > board used one, but the generic C code works for them. Point is I'm
> > not sure they're worth _optimizing_ for if it costs the vast
> > majority of systems a 25% performance hit and we don't want to
> > maintain multiple versions. If you _have_ an armv5 version, the
> > armv4 one won't/shouldn't get much testing.)
> 
> Can you explain why you think a version that's v4 compatible will be
> that much slower? If so, v5 code can be used as long as it checks
> __hwcap and falls back to a simple working version...

Alas, I do not have recent benchmarks. The timesys guys benched various  
stuff in 2006 and that's where I grabbed the 25% figure. I mostly test  
under qemu, where benchmarks are meaningless for real hardware.

If I'm in error, ignore me.

Rob
