ARM memcpy post-0.9.12-release thread

mailing list of musl libc

* ARM memcpy post-0.9.12-release thread
@ 2013-07-31  2:26 Rich Felker
  2013-07-31  3:13 ` Harald Becker
  2013-08-02 20:41 ` Rich Felker
  0 siblings, 2 replies; 9+ messages in thread
From: Rich Felker @ 2013-07-31  2:26 UTC (permalink / raw)
  To: musl

Hi all (especially Andre),

I've been doing some experimenting with ARM memcpy, and I have not
found any way to beat the Bionic asm file for misaligned copies. The
best I could do with simple inline asm (reading multi-words and
writing byte-at-a-time or vice versa) improved the performance nearly
40% compared to musl's current code, but it was still worse than half
the speed of the Bionic asm.

For the aligned case, however, as I've said before, the Bionic code
runs 10% slower for me than the C-with-inline-asm I posted to the
list. Commenting out the prefetch code in the Bionic version brings
the performance up to the same as my version.

I also found that the Bionic code was mysteriously crashing on the
real system I test on (it worked on my toolchain with qemu). On
further investigation, the test system's toolchain had -mthumb (with
thumb2) as the default; adding -marm made it work. Both ways the asm
was being interpreted as arm; the problem was that the *calling* code
being thumb broke it. The solution was adding .type memcpy,%function
to the asm file. Without that, the linker cannot know that the symbol
it's resolving is a function name and thus that it has to adjust the
low bit of the relocated address as a flag for whether the code is arm
or thumb. I've now got the code working reliably it seems.

Sizes so far:
Current C code: 260 bytes
My best-attempt inline asm: 352 bytes
Bionic (with prefetch removed): 764 bytes

Obviously the Bionic code is a bit larger than the others and than I'd
like it to be, but it looks really hard to trim it down without
ruining performance for misaligned copies; roughly half of the asm
covers the misaligned case, which is expensive because you have three
different code paths for different ways it can be off mod 4.

One other issue we have to consider if we go with the Bionic code is
that we'd need to add sub-arch asm dirs to use it. As-is, the code is
hard-coded for little endian. It will shuffle the byte order badly
when copying on a big endian machine.

Some rough times (128k copy repeated 10000 times):

Aligned case:
Current C code: 1.2s
My best-attempt C code: 0.75s
My best-attempt inline asm: 0.57s
Bionic asm: 0.63s
Bionic asm without prefetch: 0.57s

Misaligned case:
Current C code: 4.7s
My best-attempt inline asm: 2.9s
Bionic asm: 1.1s
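
A timing loop of roughly this shape could be used to reproduce
measurements like these. It is a sketch under assumptions, not the
harness behind the numbers above: time_copies and its parameters are
hypothetical, and it measures CPU time with clock().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: time `reps` copies of `len` bytes through `copy`.
 * `src_off` and `dst_off` (0..3) shift the buffers off a 4-byte boundary
 * to exercise the misaligned paths. Returns CPU seconds, or -1.0 if the
 * buffers could not be allocated or the copied data does not match. */
double time_copies(void *(*copy)(void *, const void *, size_t),
                   size_t len, size_t src_off, size_t dst_off, int reps)
{
	assert(src_off < 4 && dst_off < 4);

	unsigned char *src = malloc(len + 4);
	unsigned char *dst = malloc(len + 4);
	if (!src || !dst) {
		free(src);
		free(dst);
		return -1.0;
	}
	memset(src, 0x5a, len + 4);

	clock_t start = clock();
	for (int i = 0; i < reps; i++)
		copy(dst + dst_off, src + src_off, len);
	clock_t end = clock();

	/* Check the last copy so a broken implementation is not timed silently. */
	int ok = memcmp(dst + dst_off, src + src_off, len) == 0;
	free(src);
	free(dst);
	return ok ? (double)(end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling time_copies(memcpy, 128*1024, 0, 0, 10000) would correspond to
the aligned case, and time_copies(memcpy, 128*1024, 1, 0, 10000) to one
of the misaligned cases. A careful benchmark would also need to keep the
compiler from eliding the repeated copies.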

Rich


* Re: ARM memcpy post-0.9.12-release thread
  2013-07-31  2:26 ARM memcpy post-0.9.12-release thread Rich Felker
@ 2013-07-31  3:13 ` Harald Becker
  2013-07-31  3:23   ` Rich Felker
  2013-08-02 20:41 ` Rich Felker
  1 sibling, 1 reply; 9+ messages in thread
From: Harald Becker @ 2013-07-31  3:13 UTC (permalink / raw)
  Cc: musl, dalias

Hi Rich !

30-07-2013 22:26 Rich Felker <dalias@aerifal.cx>:

> Some rough times (128k copy repeated 10000 times):
> 
> Aligned case:
> Current C code: 1.2s
> My best-attempt C code: 0.75s
> My best-attempt inline asm: 0.57s
> Bionic asm: 0.63s
> Bionic asm without prefetch: 0.57s
> 
> Misaligned case:
> Current C code: 4.7s
> My best-attempt inline asm: 2.9s
> Bionic asm: 1.1s

I'd like to throw in a question, as my two cents on this topic:

Don't modern C compilers try to align all data types? If so,
in most cases aligned data structures are used, and copying
them around usually hits the aligned case. The
misaligned case happens mostly when working with strings, and
those are usually short. Can't we consider other misaligned cases
a fault of the programmer or code generator? If so, I would
prefer the best-attempt inline asm version, or even the
best-attempt C code, over arch-specific asm versions ... and add
a warning about the performance loss on misaligned data to the
documentation, giving a rough percentage of this loss.

Those who really need to work with misaligned data can follow
the link and consider adding an optimized memcpy to their work.

Maybe the musl archive or web site could hold a contribution
directory with such optimized replacement functions, (nearly) ready
for inclusion in other projects, but as officially unmaintained code.

--
Harald


* Re: ARM memcpy post-0.9.12-release thread
  2013-07-31  3:13 ` Harald Becker
@ 2013-07-31  3:23   ` Rich Felker
  2013-07-31  4:18     ` Harald Becker
  0 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2013-07-31  3:23 UTC (permalink / raw)
  To: Harald Becker; +Cc: musl

On Wed, Jul 31, 2013 at 05:13:47AM +0200, Harald Becker wrote:
> Hi Rich !
> 
> 30-07-2013 22:26 Rich Felker <dalias@aerifal.cx>:
> 
> > Some rough times (128k copy repeated 10000 times):
> > 
> > Aligned case:
> > Current C code: 1.2s
> > My best-attempt C code: 0.75s
> > My best-attempt inline asm: 0.57s
> > Bionic asm: 0.63s
> > Bionic asm without prefetch: 0.57s
> > 
> > Misaligned case:
> > Current C code: 4.7s
> > My best-attempt inline asm: 2.9s
> > Bionic asm: 1.1s
> 
> I'd like to throw in a question, as my two cents on this topic:
> 
> Don't modern C compilers try to align all data types? If so,
> in most cases aligned data structures are used, and copying
> them around usually hits the aligned case. The

Yes but these are small anyway and the compiler will be generating
inline code to copy them with ldmia/stmia.

> misaligned case happens mostly when working with strings, and
> those are usually short. Can't we consider other misaligned cases
> a fault of the programmer or code generator? If so, I would
> prefer the best-attempt inline asm version, or even the
> best-attempt C code, over arch-specific asm versions ... and add

Part of the problem discussed on #musl was that I was having to be
really careful with "best attempt C" since GCC will _generate_ calls
to memcpy for some code, even when -ffreestanding is used. The folks
on #gcc claim this is not a bug. So, if compilers deem themselves at
liberty to make this kind of transformation, any C implementation of
memcpy that's not intentionally crippled (e.g. using volatile temps
and 20x slower than it should be) is a time-bomb that might blow up on
us with the next GCC version...

This makes asm (either inline or standalone) a lot more appealing for
memcpy than it otherwise would be.

> a warning about the performance loss on misaligned data to the
> documentation, giving a rough percentage of this loss.

You'd prefer video processing being 4 to 5 times slower? Video
typically consists of single-byte samples (planar YUV) and operations
like cropping to a non-multiple-of-4 size, motion compensation, etc.
all involve misaligned memcpy. Same goes for image transformations in
gimp, image blitting in web browsers (not necessarily aligned to
multiple-of-4 boundaries unless you're using 32bpp), etc...
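
To make the cropping example concrete, here is a minimal sketch in C.
The function name crop_plane and its parameters are hypothetical, not
taken from any real video code; it only illustrates why such row copies
rarely start on a 4-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy a w-by-h window whose top-left corner is
 * (x, y) out of an 8-bit plane with `src_stride` bytes per row into a
 * tightly packed destination (w bytes per row). */
void crop_plane(unsigned char *dst, const unsigned char *src,
                size_t src_stride, size_t x, size_t y, size_t w, size_t h)
{
	assert(x + w <= src_stride);

	for (size_t row = 0; row < h; row++) {
		/* The source address is word-aligned only if x is a multiple
		 * of 4; the destination drifts whenever w % 4 != 0. */
		memcpy(dst + row * w, src + (y + row) * src_stride + x, w);
	}
}
```

With x = 3 and w = 3, for instance, the source of every row copy starts
3 bytes past a word boundary (assuming an aligned plane and a stride
that is a multiple of 4), so each call takes memcpy's misaligned path.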

Rich


* Re: ARM memcpy post-0.9.12-release thread
  2013-07-31  3:23   ` Rich Felker
@ 2013-07-31  4:18     ` Harald Becker
  2013-07-31  6:13       ` Rich Felker
  0 siblings, 1 reply; 9+ messages in thread
From: Harald Becker @ 2013-07-31  4:18 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

Hi Rich !

30-07-2013 23:23 Rich Felker <dalias@aerifal.cx>:

> > misaligned case happens mostly when working with strings,
> > and those are usually short. Can't we consider other
> > misaligned cases a fault of the programmer or code
> > generator? If so, I would prefer the best-attempt inline asm
> > version, or even the best-attempt C code, over arch-specific
> > asm versions ... and add
> 
> Part of the problem discussed on #musl was that I was having to
> be really careful with "best attempt C" since GCC will
> _generate_ calls to memcpy for some code, even when
> -ffreestanding is used. The folks on #gcc claim this is not a
> bug. So, if compilers deem themselves at liberty to make this
> kind of transformation, any C implementation of memcpy that's
> not intentionally crippled (e.g. using volatile temps and 20x
> slower than it should be) is a time-bomb that might blow up on
> us with the next GCC version...

I have never dealt with the details of this type of gcc code
generation, but doesn't this only happen for small copies and
structure copies? Structure copies should usually be aligned. So
if they are aligned, the simpler version saves code space.

> This makes asm (either inline or standalone) a lot more
> appealing for memcpy than it otherwise would be.

Optimization is always a question of decision, which I consider
the hard part of the job ... :(

> > a warning about the performance loss on misaligned data to the
> > documentation, giving a rough percentage of this loss.
> 
> You'd prefer video processing being 4 to 5 times slower?

No, definitely not, but video processing is one of the cases I
consider a candidate for optimized processing. So such projects
should include an optimized version of the low-level processing
functions (including memcpy, but not only that; a candidate for
a library of optimized functions?).

> Video typically consists of single-byte samples (planar YUV) and
> operations like cropping to a non-multiple-of-4 size, motion
> compensation, etc. all involve misaligned memcpy. Same goes for
> image transformations in gimp, image blitting in web browsers
> (not necessarily aligned to multiple-of-4 boundaries unless
> you're using 32bpp), etc...

You are quite right, but the programmer should know about this
and consider using appropriate functions. You can write the code
for those parts which need the speed in a way that calls optimized
functions, a way which usually does not conflict with gcc's
self-inserted calls. So these self-inserted calls usually hit the
aligned case, or else the programmer did not behave well (not the
compiler).

--
Harald


* Re: ARM memcpy post-0.9.12-release thread
  2013-07-31  4:18     ` Harald Becker
@ 2013-07-31  6:13       ` Rich Felker
  0 siblings, 0 replies; 9+ messages in thread
From: Rich Felker @ 2013-07-31  6:13 UTC (permalink / raw)
  To: musl

On Wed, Jul 31, 2013 at 06:18:58AM +0200, Harald Becker wrote:
> Hi Rich !
> 
> 30-07-2013 23:23 Rich Felker <dalias@aerifal.cx>:
> 
> > > misaligned case happens mostly when working with strings,
> > > and those are usually short. Can't we consider other
> > > misaligned cases a fault of the programmer or code
> > > generator? If so, I would prefer the best-attempt inline asm
> > > version, or even the best-attempt C code, over arch-specific
> > > asm versions ... and add
> > 
> > Part of the problem discussed on #musl was that I was having to
> > be really careful with "best attempt C" since GCC will
> > _generate_ calls to memcpy for some code, even when
> > -ffreestanding is used. The folks on #gcc claim this is not a
> > bug. So, if compilers deem themselves at liberty to make this
> > kind of transformation, any C implementation of memcpy that's
> > not intentionally crippled (e.g. using volatile temps and 20x
> > slower than it should be) is a time-bomb that might blow up on
> > us with the next GCC version...
> 
> I have never dealt with the details of this type of gcc code
> generation, but doesn't this only happen for small copies and
> structure copies? Structure copies should usually be aligned. So
> if they are aligned, the simpler version saves code space.

I'm sorry, I don't think I was clear. The issue is that GCC recognizes
certain patterns and generates calls to memcpy rather than doing the
work inline. If it does this in memcpy.c, you end up with a version of
memcpy that invokes infinite recursion and is thereby unusable.

The issue I hit was that GCC was generating memcpy calls for copying
struct { char block[32]; }, which has no alignment requirement. This
technique was probably the best bet at getting the compiler to
generate an efficient memcpy (in fact, it works quite well on some
other archs), but on ARM it blew away the stack.
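
For readers unfamiliar with the technique, a minimal sketch of the
struct-copy approach follows. It is an illustration, not the code that
was tested; copy_blocks and struct block32 are names made up here, and
the function is deliberately not called memcpy, so that a
compiler-generated memcpy call inside it is harmless.

```c
#include <assert.h>
#include <stddef.h>

/* A struct made only of chars has no alignment requirement, so assigning
 * it lets the compiler pick whatever 32-byte copy sequence it likes
 * (including, on some targets, a call to memcpy). */
struct block32 { char block[32]; };
static_assert(sizeof(struct block32) == 32, "block32 must have no padding");

void *copy_blocks(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Bulk of the copy: one struct assignment per 32 bytes. */
	for (; n >= sizeof(struct block32); n -= sizeof(struct block32)) {
		*(struct block32 *)d = *(const struct block32 *)s;
		d += sizeof(struct block32);
		s += sizeof(struct block32);
	}

	/* Byte-at-a-time tail for the remaining 0..31 bytes. */
	while (n--) *d++ = *s++;
	return dest;
}
```

If this function were itself named memcpy and the compiler lowered the
struct assignment to a memcpy call, the result would be the unbounded
recursion described above.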

When looking for a solution, however, I came across this:

http://gcc.gnu.org/bugzilla//show_bug.cgi?id=56888

It looks to me like the situation is that, as compilers get smarter
and smarter, it's going to become increasingly difficult to ensure
that memcpy doesn't get compiled to a call to memcpy. So, my long term
plan (this is still open to discussion) is to do something like this:

Have one or more C memcpy implementations on-hand that empirically
generate good code. For important archs, have hand-optimized asm; this
is both smaller and better-performing than anything decent we can
achieve with C. For archs where we don't yet have asm, generate asm
from whichever C implementation works best. Then, instead of having
the performance-oriented C in the source tree, have a fail-safe C
version that the compiler can't possibly mess up; this ensures that
future ports can get started without having to worry about whether the
compiler breaks memcpy.
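
A fail-safe version in the sense described above might look like the
following sketch. It assumes the "volatile temps" idea mentioned
earlier in the thread: since every volatile access must actually be
performed, the compiler cannot collapse the loop into a memcpy call.
failsafe_copy is a hypothetical name, and this is not musl's actual
code.

```c
#include <assert.h>
#include <stddef.h>

/* Slow but robust byte copy. The volatile-qualified accesses force one
 * load and one store per byte, so the loop cannot be pattern-matched
 * into a library call. */
void *failsafe_copy(void *dest, const void *src, size_t n)
{
	/* Debug-only sanity check for this sketch. */
	assert(n == 0 || (dest != NULL && src != NULL));

	volatile unsigned char *d = dest;
	const volatile unsigned char *s = src;

	while (n--) *d++ = *s++;
	return dest;
}
```

The cost is exactly the one noted earlier: a copy like this gives up
all word-at-a-time optimization, so it only makes sense as a fallback
for archs that do not yet have asm.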

> > This makes asm (either inline or standalone) a lot more
> > appealing for memcpy than it otherwise would be.
> 
> Optimization is always a question of decision, which I consider
> the hard part of the job ... :(
>  
> > > a warning about the performance loss on misaligned data to the
> > > documentation, giving a rough percentage of this loss.
> > 
> > You'd prefer video processing being 4 to 5 times slower?
> 
> No, definitely not, but video processing is one of the cases I
> consider a candidate for optimized processing. So such projects
> should include an optimized version of the low-level processing
> functions (including memcpy, but not only that; a candidate for
> a library of optimized functions?).

Are you aware that redefining functions with the standard names
invokes undefined behavior? Yes you could write your own memcpy by
another name, but then it can't get used by things like stdio (where,
if it's slow, it's likely a large portion of time spent on file io),
TLS image copying (per-thread startup cost), etc.

Of all the functions in libc, memcpy is definitely the most
performance-critical to the most applications. The other things that
matter are malloc/free, math (sometimes), stdio, qsort, and
searching/matching functions like regex, strstr, etc.

> > Video typically consists of single-byte samples (planar YUV) and
> > operations like cropping to a non-multiple-of-4 size, motion
> > compensation, etc. all involve misaligned memcpy. Same goes for
> > image transformations in gimp, image blitting in web browsers
> > (not necessarily aligned to multiple-of-4 boundaries unless
> > you're using 32bpp), etc...
> 
> You are quite right, but the programmer should know about this
> and consider using appropriate functions. You can write the code for

The programmer should write asm for 20 different archs? Most people
have better things to do with their time..

Back to the point, musl is not dietlibc. If you want the
smallest, lowest-quality imaginable libc, there's dietlibc you can
use. musl's aim is to be a robust general-purpose libc. "Switch from
Bionic to musl and make your apps run five times slower" is not
appealing to anybody.

If the choice were between having fully general, clean C code that
runs 5-10% slower or giant gobs of asm with a heavy maintenance
burden that runs 5-10% faster, I would probably agree and just figure
the people who really need that last 5-10% can drop in some fancy asm.
But that's not the situation we're in. The current code is half the
speed of a decent (still probably not even the fastest) implementation
for aligned copies and nearly five times slower for misaligned copies.
That's well outside the range of "special interest" and into the range
of "our implementation sucks".

Moreover, the choice here is not between clean C and dirty asm. It's
between dirty C and, well, whatever you think of the asm. The only
"clean" C memcpy is:

    while (n--) *d++ = *s++;

Our C memcpy depends on implementation-defined behavior (casting
pointers to integers to inspect their alignment) as well as undefined
behavior (aliasing violations to copy as size_t units). The latter
cannot be detected by a compiler that's not performing LTO/whole
program optimization, so it's "safe" for the most part, but it's still
wrong. So from a standpoint of clean code, getting decent asm on all
the archs and then possibly replacing the C with something more naive
would probably be a step forward.

Rich


* Re: ARM memcpy post-0.9.12-release thread
  2013-07-31  2:26 ARM memcpy post-0.9.12-release thread Rich Felker
  2013-07-31  3:13 ` Harald Becker
@ 2013-08-02 20:41 ` Rich Felker
  2013-08-02 22:03   ` Andre Renaud
  1 sibling, 1 reply; 9+ messages in thread
From: Rich Felker @ 2013-08-02 20:41 UTC (permalink / raw)
  To: musl; +Cc: Andre Renaud

Andre, do you have any input on this? (Cc'ing)

Rich


On Tue, Jul 30, 2013 at 10:26:31PM -0400, Rich Felker wrote:
> Hi all (especially Andre),
> 
> I've been doing some experimenting with ARM memcpy, and I have not
> found any way to beat the Bionic asm file for misaligned copies. The
> best I could do with simple inline asm (reading multi-words and
> writing byte-at-a-time or vice versa) improved the performance nearly
> 40% compared to musl's current code, but it was still worse than half
> the speed of the Bionic asm.
> 
> For the aligned case, however, as I've said before, the Bionic code
> runs 10% slower for me than the C-with-inline-asm I posted to the
> list. Commenting out the prefetch code in the Bionic version brings
> the performance up to the same as my version.
> 
> I also found that the Bionic code was mysteriously crashing on the
> real system I test on (it worked on my toolchain with qemu). On
> further investigation, the test system's toolchain had -mthumb (with
> thumb2) as the default; adding -marm made it work. Both ways the asm
> was being interpreted as arm; the problem was that the *calling* code
> being thumb broke it. The solution was adding .type memcpy,%function
> to the asm file. Without that, the linker cannot know that the symbol
> it's resolving is a function name and thus that it has to adjust the
> low bit of the relocated address as a flag for whether the code is arm
> or thumb. I've now got the code working reliably it seems.
> 
> Sizes so far:
> Current C code: 260 bytes
> My best-attempt inline asm: 352 bytes
> Bionic (with prefetch removed): 764 bytes
> 
> Obviously the Bionic code is a bit larger than the others and than I'd
> like it to be, but it looks really hard to trim it down without
> ruining performance for misaligned copies; roughly half of the asm
> covers the misaligned case, which is expensive because you have three
> different code paths for different ways it can be off mod 4.
> 
> One other issue we have to consider if we go with the Bionic code is
> that we'd need to add sub-arch asm dirs to use it. As-is, the code is
> hard-coded for little endian. It will shuffle the byte order badly
> when copying on a big endian machine.
> 
> Some rough times (128k copy repeated 10000 times):
> 
> Aligned case:
> Current C code: 1.2s
> My best-attempt C code: 0.75s
> My best-attempt inline asm: 0.57s
> Bionic asm: 0.63s
> Bionic asm without prefetch: 0.57s
> 
> Misaligned case:
> Current C code: 4.7s
> My best-attempt inline asm: 2.9s
> Bionic asm: 1.1s
> 
> Rich



* Re: ARM memcpy post-0.9.12-release thread
  2013-08-02 20:41 ` Rich Felker
@ 2013-08-02 22:03   ` Andre Renaud
  2013-08-03  0:01     ` Rich Felker
  2013-08-05 21:24     ` Rich Felker
  0 siblings, 2 replies; 9+ messages in thread
From: Andre Renaud @ 2013-08-02 22:03 UTC (permalink / raw)
  To: musl; +Cc: Andre Renaud

Hi Rich,

On 3 August 2013 08:41, Rich Felker <dalias@aerifal.cx> wrote:
> Andre, do you have any input on this? (Cc'ing)
>
> Rich

Sorry, I've been reading the emails, but haven't had a chance to get
back to the code. I don't really have an opinion on the gcc memcpy
issue, however I was still hopeful that we could come up with a
relatively clean mixed C/asm solution for the misaligned/non-congruent
copy scenario. Having said that, I haven't done anything on it yet.

To be honest, although a solution probably exists, I doubt it's ever
going to be much better than the bionic code (with the exception of
possibly being less to read).

Regards,
Andre


* Re: ARM memcpy post-0.9.12-release thread
  2013-08-02 22:03   ` Andre Renaud
@ 2013-08-03  0:01     ` Rich Felker
  2013-08-05 21:24     ` Rich Felker
  1 sibling, 0 replies; 9+ messages in thread
From: Rich Felker @ 2013-08-03  0:01 UTC (permalink / raw)
  To: musl; +Cc: Andre Renaud

On Sat, Aug 03, 2013 at 10:03:14AM +1200, Andre Renaud wrote:
> Hi Rich,
> 
> On 3 August 2013 08:41, Rich Felker <dalias@aerifal.cx> wrote:
> > Andre, do you have any input on this? (Cc'ing)
> >
> > Rich
> 
> Sorry, I've been reading the emails, but haven't had a chance to get
> back to the code. I don't really have an opinion on the gcc memcpy
> issue, however I was still hopeful that we could come up with a
> relatively clean mixed C/asm solution for the misaligned/non-congruent
> copy scenario. Having said that, I haven't done anything on it yet.
> 
> To be honest, although a solution probably exists, I doubt it's ever
> going to be much better than the bionic code (with the exception of
> possibly being less to read).

I'm not sure about the "less to read" either. I would very much _like_
some generic C code for this, since the same basic strategy is
applicable to all RISC-y archs with lots of registers but no
misaligned memory access:

1. Read several aligned words.
2. Bitshift them with carry to adjust for the relative misalignment
   of the destination.
3. Write several aligned words.

Unfortunately, this amounts to N-1 versions of the misaligned copy
code, where N is the alignment (4 for ARM, 8 for 64-bit-register
archs): one for each nonzero value of (dest-src)%N.

Oh, and you need separate cases for little and big endian too, since
the bitshifts work with values rather than just representations.
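
The shift-and-merge step and its endian dependence can be sketched as
follows. The helper names (load_word, is_little_endian, merge_words)
are made up for illustration; a real implementation would fix `off` per
code path so that the shifts are compile-time constants, which is where
the N-1 copies of the loop come from.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit word in native byte order. memcpy is used here only to
 * keep the sketch free of alignment and aliasing concerns. */
uint32_t load_word(const unsigned char *p)
{
	uint32_t w;
	memcpy(&w, p, sizeof w);
	return w;
}

/* Runtime check of the host byte order. */
int is_little_endian(void)
{
	const uint32_t one = 1;
	unsigned char first;
	memcpy(&first, &one, 1);
	return first == 1;
}

/* Given two consecutive aligned words `lo` (lower address) and `hi`,
 * return the word that starts `off` bytes (1..3) into `lo`. */
uint32_t merge_words(uint32_t lo, uint32_t hi, unsigned off)
{
	assert(off >= 1 && off <= 3);
	unsigned shift = 8 * off;

	if (is_little_endian())
		return (lo >> shift) | (hi << (32 - shift));
	/* Big endian: the same bytes sit at the opposite end of the value. */
	return (lo << shift) | (hi >> (32 - shift));
}
```

For any 8-byte buffer buf, merge_words(load_word(buf), load_word(buf+4), 1)
should equal the word stored at buf+1, using only aligned loads plus
two shifts and an OR.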

My guess is that at best we'll only get about 80% of the performance
of the bionic asm, but I could be pleasantly surprised. What makes it
nice is that this could get us acceptable memcpy performance on mips,
powerpc, microblaze, etc. without having to write assembly for them
all.

I'll probably add the bionic asm for now, but I can't do it without
first adding a way to disable it for "armeb".

Rich


* Re: ARM memcpy post-0.9.12-release thread
  2013-08-02 22:03   ` Andre Renaud
  2013-08-03  0:01     ` Rich Felker
@ 2013-08-05 21:24     ` Rich Felker
  1 sibling, 0 replies; 9+ messages in thread
From: Rich Felker @ 2013-08-05 21:24 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1389 bytes --]

On Sat, Aug 03, 2013 at 10:03:14AM +1200, Andre Renaud wrote:
> Hi Rich,
> 
> On 3 August 2013 08:41, Rich Felker <dalias@aerifal.cx> wrote:
> > Andre, do you have any input on this? (Cc'ing)
> >
> > Rich
> 
> Sorry, I've been reading the emails, but haven't had a chance to get
> back to the code. I don't really have an opinion on the gcc memcpy
> issue, however I was still hopeful that we could come up with a
> relatively clean mixed C/asm solution for the misaligned/non-congruent
> copy scenario. Having said that, I haven't done anything on it yet.
> 
> To be honest, although a solution probably exists, I doubt it's ever
> going to be much better than the bionic code (with the exception of
> possibly being less to read).

Attached is a "C version of the concept in the Bionic asm". Without
spending any effort getting the compiler to optimize it better, it's
only taking about 80% longer than the asm to run on misaligned input.
I would say something like this is _potentially_ a candidate for
replacing memcpy.c in musl, since it should do well on most RISC
architectures (and does decently even on x86).

Of course this is no replacement for the asm, as it's much slower, but
it would allow us to get by longer without adding asm for new archs.

The aligned case should probably be changed to use structure copies if
we can be sure they won't generate calls to memcpy.

Rich

[-- Attachment #2: memcpy_risc.c --]
[-- Type: text/plain, Size: 1901 bytes --]

#include <string.h>
#include <stdlib.h>
#include <stdint.h>

void *memcpy(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;
	uint32_t w, x;

	for (; (uintptr_t)s % 8 && n; n--) *d++ = *s++;
	if (!n) return dest;

	if (n>=4) switch ((uintptr_t)d % 4) {
	case 0:
		if (!((uintptr_t)d % 8)) for (; n>=8; s+=8, d+=8, n-=8)
			*(uint64_t *)d = *(uint64_t *)s;
		else for (; n>=4; s+=4, d+=4, n-=4)
			*(uint32_t *)d = *(uint32_t *)s;
		break;
	case 1:
		if (!(union { int i; char c; }){1}.c) break;
		w = *(uint32_t *)s;
		*d++ = *s++;
		*d++ = *s++;
		*d++ = *s++;
		n -= 3;
		for (; n>=17; s+=16, d+=16, n-=16) {
			x = *(uint32_t *)(s+1);
			*(uint32_t *)(d+0) = (w>>24) | (x<<8);
			w = *(uint32_t *)(s+5);
			*(uint32_t *)(d+4) = (x>>24) | (w<<8);
			x = *(uint32_t *)(s+9);
			*(uint32_t *)(d+8) = (w>>24) | (x<<8);
			w = *(uint32_t *)(s+13);
			*(uint32_t *)(d+12) = (x>>24) | (w<<8);
		}
		break;
	case 2:
		if (!(union { int i; char c; }){1}.c) break;
		w = *(uint32_t *)s;
		*d++ = *s++;
		*d++ = *s++;
		n -= 2;
		for (; n>=18; s+=16, d+=16, n-=16) {
			x = *(uint32_t *)(s+2);
			*(uint32_t *)(d+0) = (w>>16) | (x<<16);
			w = *(uint32_t *)(s+6);
			*(uint32_t *)(d+4) = (x>>16) | (w<<16);
			x = *(uint32_t *)(s+10);
			*(uint32_t *)(d+8) = (w>>16) | (x<<16);
			w = *(uint32_t *)(s+14);
			*(uint32_t *)(d+12) = (x>>16) | (w<<16);
		}
		break;
	case 3:
		if (!(union { int i; char c; }){1}.c) break;
		w = *(uint32_t *)s;
		*d++ = *s++;
		n -= 1;
		for (; n>=19; s+=16, d+=16, n-=16) {
			x = *(uint32_t *)(s+3);
			*(uint32_t *)(d+0) = (w>>8) | (x<<24);
			w = *(uint32_t *)(s+7);
			*(uint32_t *)(d+4) = (x>>8) | (w<<24);
			x = *(uint32_t *)(s+11);
			*(uint32_t *)(d+8) = (w>>8) | (x<<24);
			w = *(uint32_t *)(s+15);
			*(uint32_t *)(d+12) = (x>>8) | (w<<24);
		}
		break;
	}

	for (; n; n--) *d++ = *s++;
	return dest;
}



Thread overview: 9+ messages
2013-07-31  2:26 ARM memcpy post-0.9.12-release thread Rich Felker
2013-07-31  3:13 ` Harald Becker
2013-07-31  3:23   ` Rich Felker
2013-07-31  4:18     ` Harald Becker
2013-07-31  6:13       ` Rich Felker
2013-08-02 20:41 ` Rich Felker
2013-08-02 22:03   ` Andre Renaud
2013-08-03  0:01     ` Rich Felker
2013-08-05 21:24     ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
