mailing list of musl libc
* Thinking about release
@ 2013-06-13  1:25 Rich Felker
  2013-06-13  1:33 ` Andre Renaud
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Rich Felker @ 2013-06-13  1:25 UTC (permalink / raw)
  To: musl

Hi all,

It's been mentioned that we're overdue for a release, and it looks
like we have a fair amount of new stuff since the last release. Major
changes so far:

- Accepting \n in dynamic linker path file
- Fixed broken x86_64 sigsetjmp
- Removed wrong C++ __STDC_*_MACROS checks
- Thread exit synchronization issues
- Conforming overflow behavior in clock()
- Support for large device numbers
- Math optimizations on i386
- C11 float.h macro additions
- Fixes for exceptions in fma (gcc bug workarounds)
- Fix misaligned stack when running ctors/dtors
- Support for %m modifier in scanf
- Support for using gcc intrinsic headers with musl-gcc wrapper
- PRNG fixes
- Per-process and per-thread cputime clocks and new(ish) Linux clocks

I think the ether.h functions could definitely still make it in for
this release too. inet_makeaddr, etc. could probably also be merged.
Most of the other major items left on the agenda since the last
release are probably not going to happen right away unless there's a
volunteer to do them (zoneinfo, cpuset/affinity, string functions
cleanup, C++ ABI matching, ARM-optimized memcpy) and one, the ld.so
symlink direction issue, still requires some serious discussion and
decision-making.

If anyone else has input for what should still go into the next
release, please jump in and discuss.

Rich



* Re: Thinking about release
  2013-06-13  1:25 Thinking about release Rich Felker
@ 2013-06-13  1:33 ` Andre Renaud
  2013-06-13  1:43   ` Rich Felker
  2013-06-13 15:46 ` Isaac
  2013-06-26  1:44 ` Rich Felker
  2 siblings, 1 reply; 38+ messages in thread
From: Andre Renaud @ 2013-06-13  1:33 UTC (permalink / raw)
  To: musl

Hi Rich,

> Most of the other major items left on the agenda since the last
> release are probably not going to happen right away unless there's a
> volunteer to do them (zoneinfo, cpuset/affinity, string functions
> cleanup, C++ ABI matching, ARM-optimized memcpy) and one, the ld.so
> symlink direction issue, still requires some serious discussion and
> decision-making.

Regarding the ARM-optimisations - I am happy to have a go at providing
a cleaned up implementation, although I can't recall what the final
consensus was on how this should be implemented. A simple ARMv4
implementation would cover all the bases, providing near-universal
support, although it would obviously not take advantage of more modern
platforms. Is there any intention to move the base level support up to
ARMv5? I would consider that reasonable, given the age of ARMv4.
Alternatively, should we have multiple implementations
(ARMv4/ARMv5/ARMv7), and choose between them either at compile or
run-time?

Obviously this stuff is probably not destined for the immediate
release, but more likely for the one after that.

Regards,
Andre



* Re: Thinking about release
  2013-06-13  1:33 ` Andre Renaud
@ 2013-06-13  1:43   ` Rich Felker
  2013-07-09  5:06     ` Andre Renaud
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2013-06-13  1:43 UTC (permalink / raw)
  To: musl

On Thu, Jun 13, 2013 at 01:33:16PM +1200, Andre Renaud wrote:
> Hi Rich,
> 
> > Most of the other major items left on the agenda since the last
> > release are probably not going to happen right away unless there's a
> > volunteer to do them (zoneinfo, cpuset/affinity, string functions
> > cleanup, C++ ABI matching, ARM-optimized memcpy) and one, the ld.so
> > symlink direction issue, still requires some serious discussion and
> > decision-making.
> 
> Regarding the ARM-optimisations - I am happy to have a go at providing
> a cleaned up implementation, although I can't recall what the final
> consensus was on how this should be implemented. A simple ARMv4

I think the first step should be benchmarking on real machines.
Somebody tried the asm that was posted and claimed it was no faster
than musl's C code; I don't know the specific hardware they were using
and I don't even recall right off who made the claim or where it was
reported, but I think before we start writing or importing code we
need to have a good idea how the current C code compares in
performance to other "optimized" implementations.

> implementation would cover all the bases, providing near-universal
> support, although it would obviously not take advantage of more modern
> platforms. Is there any intention to move the base level support up to
> ARMv5? I would consider that reasonable, given the age of ARMv4.
> Alternatively, should we have multiple implementations
> (ARMv4/ARMv5/ARMv7), and choose between them either at compile or
> run-time?

It's possible to branch based on __hwcap at runtime, if this would
really help.
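
(A sketch of what such a runtime branch could look like; the
getauxval()/AT_HWCAP route and the HWCAP_NEON bit are assumptions for
illustration, not musl's actual internal mechanism:)

#include <stddef.h>
#include <sys/auxv.h>  /* getauxval, AT_HWCAP -- assumed available */

#ifndef HWCAP_NEON
#define HWCAP_NEON (1 << 12)
#endif

void *memcpy_generic(void *, const void *, size_t);
void *memcpy_neon(void *, const void *, size_t);

void *memcpy(void *d, const void *s, size_t n)
{
	/* Pick an implementation once, based on the capability bits
	   the kernel passes in the aux vector; both branches write
	   the same value, so the race on impl is benign. */
	static void *(*impl)(void *, const void *, size_t);
	if (!impl)
		impl = (getauxval(AT_HWCAP) & HWCAP_NEON)
			? memcpy_neon : memcpy_generic;
	return impl(d, s, n);
}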

> Obviously this stuff is probably not destined for the immediate
> release, but more likely for the one after that.

Yes, this looks like it will be a process that takes some time to sort
out the facts and then tune the code.

For what it's worth, I just did my first runs of libc-bench on real
ARM hardware (well, an FPGA-based ARM). memset is half the speed of
glibc's, but strchr and strlen are about 40% faster than glibc's. I
don't think libc-bench is really a good benchmark as of yet, so we
should probably develop more detailed tests.
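
(For concreteness, a more detailed test could start as small as this
sketch; the 1 MiB buffer, repetition count and CLOCK_MONOTONIC are
arbitrary choices, not an agreed-on harness:)

#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF (1 << 20)  /* 1 MiB working set */
#define REPS 200

int main(void)
{
	static unsigned char src[BUF], dst[BUF];
	struct timespec t0, t1;
	double secs;
	int i;

	memcpy(dst, src, BUF);  /* warm the caches before timing */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < REPS; i++)
		memcpy(dst, src, BUF);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("memcpy: %.1f MB/s\n", (double)BUF * REPS / secs / 1e6);
	return 0;
}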

Rich



* Re: Thinking about release
  2013-06-13  1:25 Thinking about release Rich Felker
  2013-06-13  1:33 ` Andre Renaud
@ 2013-06-13 15:46 ` Isaac
  2013-06-26  1:44 ` Rich Felker
  2 siblings, 0 replies; 38+ messages in thread
From: Isaac @ 2013-06-13 15:46 UTC (permalink / raw)
  To: musl

On Wed, Jun 12, 2013 at 09:25:17PM -0400, Rich Felker wrote:
> Hi all,
> 
> It's been mentioned that we're overdue for a release, and it looks
> like we have a fair amount of new stuff since the last release. Major
> changes so far:
> 
> - Accepting \n in dynamic linker path file
> - Fixed broken x86_64 sigsetjmp
> - Removed wrong C++ __STDC_*_MACROS checks
> - Thread exit synchronization issues
> - Conforming overflow behavior in clock()
> - Support for large device numbers
> - Math optimizations on i386
> - C11 float.h macro additions
> - Fixes for exceptions in fma (gcc bug workarounds)
> - Fix misaligned stack when running ctors/dtors
> - Support for %m modifier in scanf
> - Support for using gcc intrinsic headers with musl-gcc wrapper
> - PRNG fixes
> - Per-process and per-thread cputime clocks and new(ish) Linux clocks
> 
> I think the ether.h functions could definitely still make it in for
> this release too. inet_makeaddr, etc. could probably also be merged.

ether.h is needed for a full build of busybox:
$ grep -r ether_aton_r */                                  
networking/udhcp/files.c:       if (!mac_string || !ether_aton_r(mac_string, &mac_bytes))
networking/ether-wake.c:        eap = ether_aton_r(hostid, eaddr);
networking/nameif.c:                    ch->mac = ether_aton_r(selector + (strncmp(selector, "mac=", 4) != 0 ? 0 : 4), lmac);
$ grep -r ether_ntoa */                                    
networking/ether-wake.c:                bb_debug_msg("The target station address is %s\n\n", ether_ntoa(eap));
networking/ether-wake.c:                bb_debug_msg("Station address for hostname %s is %s\n\n", hostid, ether_ntoa(eaddr));
networking/zcip.c:                                      ether_ntoa(sha),
networking/zcip.c:                                      ether_ntoa(tha),
networking/arping.c:                    ether_ntoa((struct ether_addr *) p));
networking/arping.c:                            ether_ntoa((struct ether_addr *) p + ah->ar_hln + 4));
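
(For reference, the two interfaces used above have these prototypes in
<netinet/ether.h>, per the man pages:)

struct ether_addr *ether_aton_r(const char *asc, struct ether_addr *addr);
char *ether_ntoa(const struct ether_addr *addr);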

Most of these aren't universally needed, but I wanted to enable some of them.
Right now, I'm using Strake's patch and compiling musl with -Os.

Of course, these aren't going to be enough to make busybox allyesconfig 
work right; I also ran into the following issues building busybox:
- <sys/personality.h> needs the macros from <linux/personality.h>
  (for setarch)
- A few cases of using extra headers that break the busybox build:
  <net/if_slip.h> can be changed to <linux/if_slip.h>;
  <net/if_packet.h> is included (needlessly for musl) in
  networking/libiproute/iplink.c; this header is roughly the macros
  from <sys/socket.h> + struct sockaddr_pkt;
  <netinet/ether.h> is used in arp and zcip

- CONFIG_COMPAT_EXTRA turns on use of glibc regex.h extensions in grep.
- Of course, vi still uses glibc regexes if you enable search and replace.

> Rich

HTH,
Isaac Dunham




* Re: Thinking about release
  2013-06-13  1:25 Thinking about release Rich Felker
  2013-06-13  1:33 ` Andre Renaud
  2013-06-13 15:46 ` Isaac
@ 2013-06-26  1:44 ` Rich Felker
  2013-06-26 10:19   ` Szabolcs Nagy
  2 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2013-06-26  1:44 UTC (permalink / raw)
  To: musl

On Wed, Jun 12, 2013 at 09:25:17PM -0400, Rich Felker wrote:
> Hi all,
> 
> It's been mentioned that we're overdue for a release, and it looks
> like we have a fair amount of new stuff since the last release. Major
> changes so far:
> 
> - Accepting \n in dynamic linker path file

This was buggy and apparently nobody (even the person who requested
it?!) ever tested it. Just fixed it. Could possibly use some further
review still to make sure it handles odd path files well.

> I think the ether.h functions could definitely still make it in for
> this release too.

Just committed these, almost the exact versions Strake submitted.
Apologies for taking so long! Thank you Strake.

> inet_makeaddr, etc. could probably also be merged.

I just wrote a new one based on the man page so as not to pull in more
3-clause BSD license mess. It's untested so hopefully it does the
right thing. Just committed it.
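
(For reference, the man page semantics amount to roughly the following
sketch, using the historical BSD classful thresholds; an illustration,
not necessarily the code as committed:)

#include <netinet/in.h>
#include <arpa/inet.h>

struct in_addr inet_makeaddr(in_addr_t net, in_addr_t host)
{
	in_addr_t a;
	if (net < 128) a = net<<24 | (host & 0xffffff);       /* class A */
	else if (net < 65536) a = net<<16 | (host & 0xffff);  /* class B */
	else if (net < 16777216) a = net<<8 | (host & 0xff);  /* class C */
	else a = net | host;
	return (struct in_addr){ htonl(a) };
}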

> Most of the other major items left on the agenda since the last
> release are probably not going to happen right away unless there's a
> volunteer to do them (zoneinfo, cpuset/affinity, string functions
> cleanup, C++ ABI matching, ARM-optimized memcpy) and one, the ld.so
> symlink direction issue, still requires some serious discussion and
> decision-making.

I think the status on all these issues remains the same.

Rich



* Re: Thinking about release
  2013-06-26  1:44 ` Rich Felker
@ 2013-06-26 10:19   ` Szabolcs Nagy
  2013-06-26 14:21     ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Szabolcs Nagy @ 2013-06-26 10:19 UTC (permalink / raw)
  To: musl

* Rich Felker <dalias@aerifal.cx> [2013-06-25 21:44:07 -0400]:
> On Wed, Jun 12, 2013 at 09:25:17PM -0400, Rich Felker wrote:
> > Hi all,
> > 
> > It's been mentioned that we're overdue for a release, and it looks
> > like we have a fair amount of new stuff since the last release. Major
> > changes so far:
> > 
> > - Accepting \n in dynamic linker path file
> 
> This was buggy and apparently nobody (even the person who requested
> it?!) ever tested it. Just fixed it. Could possibly use some further
> review still to make sure it handles odd path files well.
> 

if the path file is empty the fixed (and original) code
invokes ub (it is not critical since an empty path file
does not make much sense, but i think it should be fixed)

	if (!sys_path) {
		FILE *f = fopen(ETC_LDSO_PATH, "rbe");
		if (f) {
// if f is empty then getdelim returns -1, allocates space for sys_path
// but sys_path is not null terminated
			if (getdelim(&sys_path, (size_t[1]){0}, 0, f) > 0) {
				size_t l = strlen(sys_path);
				if (l && sys_path[l-1]=='\n')
					sys_path[l-1] = 0;
			}
			fclose(f);
		}
	}
// sys_path is non-null and not a valid string here
	if (!sys_path) sys_path = "/lib:/usr/local/lib:/usr/lib";
	fd = path_open(name, sys_path, buf, sizeof buf);

i think either getdelim should be fixed so it
makes sys_path null terminated even on eof/error
or sys_path should be freed and set to 0 in case
of failure so we fall back to the default value
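
a sketch of the second option, reusing the code above: discard the
buffer whenever getdelim fails, so the default path below kicks in

			if (getdelim(&sys_path, (size_t[1]){0}, 0, f) > 0) {
				size_t l = strlen(sys_path);
				if (l && sys_path[l-1]=='\n')
					sys_path[l-1] = 0;
			} else {
				free(sys_path);
				sys_path = 0;
			}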



* Re: Thinking about release
  2013-06-26 10:19   ` Szabolcs Nagy
@ 2013-06-26 14:21     ` Rich Felker
  0 siblings, 0 replies; 38+ messages in thread
From: Rich Felker @ 2013-06-26 14:21 UTC (permalink / raw)
  To: musl

On Wed, Jun 26, 2013 at 12:19:34PM +0200, Szabolcs Nagy wrote:
> > > - Accepting \n in dynamic linker path file
> > 
> > This was buggy and apparently nobody (even the person who requested
> > it?!) ever tested it. Just fixed it. Could possibly use some further
> > review still to make sure it handles odd path files well.
> > 
> 
> if the path file is empty the fixed (and original) code
> invokes ub (it is not critical since an empty path file
> does not make much sense, but i think it should be fixed)

I just fixed this by falling back to an empty path on read error or
empty path file.

Rich



* Re: Thinking about release
  2013-06-13  1:43   ` Rich Felker
@ 2013-07-09  5:06     ` Andre Renaud
  2013-07-09  5:37       ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Andre Renaud @ 2013-07-09  5:06 UTC (permalink / raw)
  To: musl

Hi Rich,
> I think the first step should be benchmarking on real machines.
> Somebody tried the asm that was posted and claimed it was no faster
> than musl's C code; I don't know the specific hardware they were using
> and I don't even recall right off who made the claim or where it was
> reported, but I think before we start writing or importing code we
> need to have a good idea how the current C code compares in
> performance to other "optimized" implementations.

In the interests of furthering this discussion (and because I'd like
to start using musl as the basis for some of our projects, but the
current speed degradation is noticeable), I've created some patches
that enable memcmp, memcpy & memmove ARM optimisations. I've ignored
the str* functions, as these are generally not used on the same bulk
data as the mem* functions, and as such the performance issue is less
noticeable.

Using a fairly rudimentary test application, I've benchmarked it as
having the following speed improvements (this is all on an actual ARM
board - 400MHz arm926ejs):
memcpy: 160%
memmove: 162%
memcmp: 272%
These numbers bring musl in line with glibc (at least on ARMv5).
memcmp in particular seems to be faster (90MB/s vs 75MB/s on my
platform).
I haven't looked at using the __hwcap feature at this stage to swap
between these implementations and NEON-optimised versions. I assume
this can come later.

From a code size point of view (this is all with -O3), memcpy goes
from 1680 to 1996 bytes, memmove goes from 2592 to 2088 bytes, and
memcmp goes from 1040 to 1452, for a total increase of 224 bytes.

The code is from NetBSD and Android (essentially unmodified), and it
is all BSD 2-clause licensed.

The git tree is available here:
https://github.com/AndreRenaud/musl/commit/713023e7320cf45b116d1c29b6155ece28904e69

Does anyone have any comments on the suitability of this code, or what
kind of more rigorous testing could be applied?

Regards,
Andre



* Re: Thinking about release
  2013-07-09  5:06     ` Andre Renaud
@ 2013-07-09  5:37       ` Rich Felker
  2013-07-09  6:24         ` Harald Becker
                           ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Rich Felker @ 2013-07-09  5:37 UTC (permalink / raw)
  To: musl

On Tue, Jul 09, 2013 at 05:06:21PM +1200, Andre Renaud wrote:
> Hi Rich,
> > I think the first step should be benchmarking on real machines.
> > Somebody tried the asm that was posted and claimed it was no faster
> > than musl's C code; I don't know the specific hardware they were using
> > and I don't even recall right off who made the claim or where it was
> > reported, but I think before we start writing or importing code we
> > need to have a good idea how the current C code compares in
> > performance to other "optimized" implementations.
> 
> In the interests of furthering this discussion (and because I'd like
> to start using musl as the basis for some of our projects, but the
> current speed degradation is noticeable), I've created some patches

Then it needs to be fixed. :-)

> that enable memcmp, memcpy & memmove ARM optimisations. I've ignored
> the str* functions, as these are generally not used on the same bulk
> data as the mem* functions, and as such the performance issue is less
> noticeable.

I think that's a reasonable place to begin. I do mildly question the
relevance of memmove to performance, so if we end up having to do a
lot of review or changes to get the asm committed, it might make sense
to leave memmove for later.

> Using a fairly rudimentary test application, I've benchmarked it as
> having the following speed improvements (this is all on an actual ARM
> board - 400MHz arm926ejs):
> memcpy: 160%
> memmove: 162%
> memcmp: 272%
> These numbers bring musl in line with glibc (at least on ARMv5).
> memcmp in particular seems to be faster (90MB/s vs 75MB/s on my
> platform).
> I haven't looked at using the __hwcap feature at this stage to swap
> between these implementations and NEON-optimised versions. I assume
> this can come later.
> 
> From a code size point of view (this is all with -O3), memcpy goes
> from 1680 to 1996 bytes, memmove goes from 2592 to 2088 bytes, and
> memcmp goes from 1040 to 1452, for a total increase of 224 bytes.
> 
> The code is from NetBSD and Android (essentially unmodified), and it
> is all BSD 2-clause licensed.

At first glance, this looks like a clear improvement, but have you
compared it to much more naive optimizations? My _general_ experience
with optimized memcpy asm that's complex like this and that goes out
of its way to deal explicitly with cache lines and such is that it's
no faster than just naively moving large blocks at a time. Of course
this may or may not be the case for ARM, but I'd like to know if
you've done any tests.

The basic principle in my mind here is that a complex solution is not
necessarily wrong if it's a big win in other ways, but that a complex
solution which is at most 1-2% faster than a much simpler solution is
probably not the best choice.

I also have access to a good test system now, by the way, so I could
do some tests too.

> The git tree is available here:
> https://github.com/AndreRenaud/musl/commit/713023e7320cf45b116d1c29b6155ece28904e69

It's an open question whether it's better to sync something like this
with an 'upstream' or adapt it to musl coding conventions. Generally
musl uses explicit instructions rather than pseudo-instructions/macros
for prologue and epilogue, and does not use named labels.

> Does anyone have any comments on the suitability of this code, or what

If nothing else, it fails to be armv4 compatible. Fixing that should
not be hard, but it would require a bit of an audit. The return
sequences are the obvious issue, but there may be other instructions
in use that are not available on armv4 or maybe not even on armv5...?

> kind of more rigorous testing could be applied?

See above.

What also might be worth testing is whether GCC can compete if you
just give it a naive loop (not the fancy pseudo-vectorized stuff
currently in musl) and good CFLAGS. I know on x86 I was able to beat
the fanciest asm strlen I could come up with simply by writing the
naive loop in C and unrolling it a lot. The only reason musl isn't
already using that version is that I suspect it hurts branch
prediction in the caller....
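
(The "naive loop" meant here is literally just the obvious C, e.g. for
strlen, with the unrolling left to flags like -O3 -funroll-all-loops:)

size_t naive_strlen(const char *s)
{
	const char *p = s;
	while (*p) p++;
	return p - s;
}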

Rich



* Re: Thinking about release
  2013-07-09  5:37       ` Rich Felker
@ 2013-07-09  6:24         ` Harald Becker
  2013-07-09 21:28         ` Andre Renaud
  2013-07-10 19:38         ` Rob Landley
  2 siblings, 0 replies; 38+ messages in thread
From: Harald Becker @ 2013-07-09  6:24 UTC (permalink / raw)
  Cc: musl, dalias

Hi Rich !

09-07-2013 01:37 Rich Felker <dalias@aerifal.cx>:

> I also have access to a good test system now, by the way, so I
> could do some tests too.

I own a Hercules eCAFE SlimHD Netbook with a Freescale i.MX515 CPU
(ARMv7 / Cortex-A8 @ 800 MHz / 512 MB RAM). If this would be of use
to you, I'm able to do some testing, etc.; just let me know.

--
Harald



* Re: Thinking about release
  2013-07-09  5:37       ` Rich Felker
  2013-07-09  6:24         ` Harald Becker
@ 2013-07-09 21:28         ` Andre Renaud
  2013-07-09 22:26           ` Andre Renaud
                             ` (2 more replies)
  2013-07-10 19:38         ` Rob Landley
  2 siblings, 3 replies; 38+ messages in thread
From: Andre Renaud @ 2013-07-09 21:28 UTC (permalink / raw)
  To: musl

Hi Rich,

> I think that's a reasonable place to begin. I do mildly question the
> relevance of memmove to performance, so if we end up having to do a
> lot of review or changes to get the asm committed, it might make sense
> to leave memmove for later.

I wasn't too sure on memmove, but I've seen a reasonable amount of
code which just uses memmove as standard (rather than memcpy), to
guard against possibly overlapping regions. Not a great policy, but
still. I'm fine with dropping it at this stage.

> At first glance, this looks like a clear improvement, but have you
> compared it to much more naive optimizations? My _general_ experience
> with optimized memcpy asm that's complex like this and that goes out
> of its way to deal explicitly with cache lines and such is that it's
> no faster than just naively moving large blocks at a time. Of course
> this may or may not be the case for ARM, but I'd like to know if
> you've done any tests.
>
> The basic principle in my mind here is that a complex solution is not
> necessarily wrong if it's a big win in other ways, but that a complex
> solution which is at most 1-2% faster than a much simpler solution is
> probably not the best choice.

Certainly if there were a more straightforward C implementation that
achieved similar results, that would be superior. However, the existing
musl C memcpy code is already optimised to some degree (doing 32-bit
rather than 8-bit copies), and I've found it is difficult to convince
gcc to use the load-multiple & store-multiple instructions from C
without resorting to pretty horrible C code. It may still be
preferable to the assembler though. At this stage I haven't
benchmarked this - I'll see if I can come up with something.

> It's an open question whether it's better to sync something like this
> with an 'upstream' or adapt it to musl coding conventions. Generally
> musl uses explicit instructions rather than pseudo-instructions/macros
> for prologue and epilogue, and does not use named labels.

Given that most of the other systems do some form of compile time
optimisations (which we're trying to avoid), and that these are not
functions that see a lot of code churn, I don't think it's too bad to
have it adapted to musl's style. I haven't really done that so far.

>> Does anyone have any comments on the suitability of this code, or what
>
> If nothing else, it fails to be armv4 compatible. Fixing that should
> not be hard, but it would require a bit of an audit. The return
> sequences are the obvious issue, but there may be other instructions
> in use that are not available on armv4 or maybe not even on armv5...?

Rob Landley mentioned a while ago that armv4 has issues with the EABI
stuff. Is armv4 a definite lower bound for musl support, as opposed to
armv4t or armv5?

Regards,
Andre



* Re: Thinking about release
  2013-07-09 21:28         ` Andre Renaud
@ 2013-07-09 22:26           ` Andre Renaud
  2013-07-10  6:42             ` Jens Gustedt
  2013-07-10 22:44             ` Andre Renaud
  2013-07-10 19:42           ` Rich Felker
  2013-07-11  4:30           ` Strake
  2 siblings, 2 replies; 38+ messages in thread
From: Andre Renaud @ 2013-07-09 22:26 UTC (permalink / raw)
  To: musl

Replying to myself

> Certainly if there were a more straightforward C implementation that
> achieved similar results, that would be superior. However, the existing
> musl C memcpy code is already optimised to some degree (doing 32-bit
> rather than 8-bit copies), and I've found it is difficult to convince
> gcc to use the load-multiple & store-multiple instructions from C
> without resorting to pretty horrible C code. It may still be
> preferable to the assembler though. At this stage I haven't
> benchmarked this - I'll see if I can come up with something.

As a comparison, the existing memcpy.c implementation tries to copy
sizeof(size_t) bytes at a time, which on ARM is 4. This ends up being
a standard load/store. However GCC is smart enough to know that it can
use ldm/stm instructions for copying structures > 4 bytes. So if we
change memcpy.c to use a structure whose size is > 4 (ie: 16), instead
of size_t for its basic copy unit, we do see some improvements:

typedef struct multiple_size_t {
    size_t d[4];
} multiple_size_t;

#define SS (sizeof(multiple_size_t))
#define ALIGN (sizeof(multiple_size_t)-1)

void *my_memcpy(void * restrict dest, const void * restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
        goto misaligned;

    for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;
    if (n) {
        multiple_size_t *wd = (void *)d;
        const struct multiple_size_t *ws = (const void *)s;

        for (; n>=SS; n-=SS) *wd++ = *ws++;

        d = (void *)wd;
        s = (const void *)ws;
misaligned:
        for (; n; n--) *d++ = *s++;
    }
    return dest;

}

This results in 95MB/s on my platform (up from 65MB/s for the existing
memcpy.c, and down from 105MB/s with the asm optimised version). It is
essentially as readable as the existing memcpy.c. I'm not
really familiar with any other cpu architectures, so I'm not sure if
this would improve, or hurt, performance on other platforms.

Any comments on using something like this for memcpy instead?
Obviously this gives you a higher penalty if the size of the area to
be copied is between sizeof(size_t) and sizeof(multiple_size_t).

Regards,
Andre



* Re: Thinking about release
  2013-07-09 22:26           ` Andre Renaud
@ 2013-07-10  6:42             ` Jens Gustedt
  2013-07-10  7:50               ` Rich Felker
  2013-07-10 22:44             ` Andre Renaud
  1 sibling, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2013-07-10  6:42 UTC (permalink / raw)
  To: musl


On Wednesday, 10.07.2013, at 10:26 +1200, Andre Renaud wrote:
> typedef struct multiple_size_t {
>     size_t d[4];
> } multiple_size_t;

why not have it

typedef size_t _multiple_size[4];

the wrapping into struct just doesn't serve much purpose, I think.

Then for your implementation, the commonly used trick would be to have
two "slow" phases for misalignment: one at the start for the first
bytes up to the next valid alignment boundary, then the "fast" copy for
the aligned part, and then another slow phase for the last bytes. For
small things to copy this adds a bit of arithmetic and a conditional.

Jens

-- 
:: INRIA Nancy Grand Est :: http://www.loria.fr/~gustedt/   ::
:: AlGorille ::::::::::::::: office Nancy : +33 383593090   ::
:: ICube :::::::::::::: office Strasbourg : +33 368854536   ::
:: ::::::::::::::::::::::::::: gsm France : +33 651400183   ::
:: :::::::::::::::::::: gsm international : +49 15737185122 ::





* Re: Thinking about release
  2013-07-10  6:42             ` Jens Gustedt
@ 2013-07-10  7:50               ` Rich Felker
  0 siblings, 0 replies; 38+ messages in thread
From: Rich Felker @ 2013-07-10  7:50 UTC (permalink / raw)
  To: musl

On Wed, Jul 10, 2013 at 08:42:58AM +0200, Jens Gustedt wrote:
> On Wednesday, 10.07.2013, at 10:26 +1200, Andre Renaud wrote:
> > typedef struct multiple_size_t {
> >     size_t d[4];
> > } multiple_size_t;
> 
> why not have it
> 
> typedef size_t _multiple_size[4];
> 
> the wrapping into struct just doesn't serve much purpose, I think.

Because arrays are not assignable.
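
(Illustration: the struct wrapper exists purely so the inner loop can
copy four words with a single assignment.)

typedef size_t bare_t[4];
typedef struct { size_t d[4]; } wrap_t;

void demo(bare_t *a, const bare_t *b, wrap_t *w, const wrap_t *v)
{
	/* *a = *b;  -- constraint violation: arrays are not assignable */
	*w = *v;     /* fine: one struct assignment copies all four words */
}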

Rich



* Re: Thinking about release
  2013-07-09  5:37       ` Rich Felker
  2013-07-09  6:24         ` Harald Becker
  2013-07-09 21:28         ` Andre Renaud
@ 2013-07-10 19:38         ` Rob Landley
  2013-07-10 20:34           ` Andre Renaud
  2 siblings, 1 reply; 38+ messages in thread
From: Rob Landley @ 2013-07-10 19:38 UTC (permalink / raw)
  To: musl; +Cc: musl

On 07/09/2013 12:37:12 AM, Rich Felker wrote:
> On Tue, Jul 09, 2013 at 05:06:21PM +1200, Andre Renaud wrote:
> > The git tree is available here:
> > https://github.com/AndreRenaud/musl/commit/713023e7320cf45b116d1c29b6155ece28904e69
> 
> It's an open question whether it's better to sync something like this
> with an 'upstream' or adapt it to musl coding conventions. Generally
> musl uses explicit instructions rather than pseudo-instructions/macros
> for prologue and epilogue, and does not use named labels.

Do your own local version. You can always copy ideas from other  
projects if "upstream" changes later.

> > Does anyone have any comments on the suitability of this code, or what
> 
> If nothing else, it fails to be armv4 compatible.

Query: did you ever implement a non-thumb version of armv4 eabi
support? I remember some discussion about it being possible; I don't
remember the outcome.

> Fixing that should
> not be hard, but it would require a bit of an audit. The return
> sequences are the obvious issue, but there may be other instructions
> in use that are not available on armv4 or maybe not even on armv5...?

I've beaten armv4-only, armv4t-only, and armv5-only modes out of qemu.  
That's the reason for the first half of my versatile patch:

http://landley.net/hg/aboriginal/file/1612/sources/patches/linux-arm.patch

> > kind of more rigorous testing could be applied?
> 
> See above.
> 
> What also might be worth testing is whether GCC can compete if you
> just give it a naive loop (not the fancy pseudo-vectorized stuff
> currently in musl) and good CFLAGS. I know on x86 I was able to beat
> the fanciest asm strlen I could come up with simply by writing the
> naive loop in C and unrolling it a lot.

Duff's device!

Rob



* Re: Thinking about release
  2013-07-09 21:28         ` Andre Renaud
  2013-07-09 22:26           ` Andre Renaud
@ 2013-07-10 19:42           ` Rich Felker
  2013-07-14  6:37             ` Rob Landley
  2013-07-11  4:30           ` Strake
  2 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2013-07-10 19:42 UTC (permalink / raw)
  To: musl

On Wed, Jul 10, 2013 at 09:28:21AM +1200, Andre Renaud wrote:
> >> Does anyone have any comments on the suitability of this code, or what
> >
> > If nothing else, it fails to be armv4 compatible. Fixing that should
> > not be hard, but it would require a bit of an audit. The return
> > sequences are the obvious issue, but there may be other instructions
> > in use that are not available on armv4 or maybe not even on armv5...?
> 
> Rob Landley mentioned a while ago that armv4 has issues with the EABI
> stuff. Is armv4 a definite lower bound for musl support, as opposed to
> armv4t or armv5?

EABI specifies thumb; however, it's possible to have code which
conforms fully to EABI but does not rely on the presence of thumb. GCC
is incapable of generating such code, but it could be enhanced to do
so, and all of the existing assembly in musl is plain-v4-compatible,
so I would prefer not to shut out the possibility of supporting older
ARM.

Rich



* Re: Thinking about release
  2013-07-10 19:38         ` Rob Landley
@ 2013-07-10 20:34           ` Andre Renaud
  2013-07-10 20:49             ` Nathan McSween
  2013-07-10 21:01             ` Rich Felker
  0 siblings, 2 replies; 38+ messages in thread
From: Andre Renaud @ 2013-07-10 20:34 UTC (permalink / raw)
  To: musl

>> What also might be worth testing is whether GCC can compete if you
>> just give it a naive loop (not the fancy pseudo-vectorized stuff
>> currently in musl) and good CFLAGS. I know on x86 I was able to beat
>> the fanciest asm strlen I could come up with simply by writing the
>> naive loop in C and unrolling it a lot.
>
>
> Duff's device!

That was exactly my first idea too, but interestingly it turns out not
to have really added any performance improvement. Looking at the
assembler, with -O3, gcc does a pretty good job of unrolling as it is.

Regards,
Andre



* Re: Thinking about release
  2013-07-10 20:34           ` Andre Renaud
@ 2013-07-10 20:49             ` Nathan McSween
  2013-07-10 21:01             ` Rich Felker
  1 sibling, 0 replies; 38+ messages in thread
From: Nathan McSween @ 2013-07-10 20:49 UTC (permalink / raw)
  To: musl

I would think the iterate-per-char-till-zero loop would take the most
time; even if GCC vectorized without SIMD, it would still need to
iterate to find the zero byte within the word containing it. Current
musl does this as well, though.
On Jul 10, 2013 1:34 PM, "Andre Renaud" <andre@bluewatersys.com> wrote:

> >> What also might be worth testing is whether GCC can compete if you
> >> just give it a naive loop (not the fancy pseudo-vectorized stuff
> >> currently in musl) and good CFLAGS. I know on x86 I was able to beat
> >> the fanciest asm strlen I could come up with simply by writing the
> >> naive loop in C and unrolling it a lot.
> >
> >
> > Duff's device!
>
> That was exactly my first idea too, but interestingly it turns out not
> to have really added any performance improvement. Looking at the
> assembler, with -O3, gcc does a pretty good job of unrolling as it is.
>
> Regards,
> Andre
>


* Re: Thinking about release
  2013-07-10 20:34           ` Andre Renaud
  2013-07-10 20:49             ` Nathan McSween
@ 2013-07-10 21:01             ` Rich Felker
  1 sibling, 0 replies; 38+ messages in thread
From: Rich Felker @ 2013-07-10 21:01 UTC (permalink / raw)
  To: musl

On Thu, Jul 11, 2013 at 08:34:03AM +1200, Andre Renaud wrote:
> >> What also might be worth testing is whether GCC can compete if you
> >> just give it a naive loop (not the fancy pseudo-vectorized stuff
> >> currently in musl) and good CFLAGS. I know on x86 I was able to beat
> >> the fanciest asm strlen I could come up with simply by writing the
> >> naive loop in C and unrolling it a lot.
> >
> >
> > Duff's device!
> 
> That was exactly my first idea too, but interestingly it turns out not
> to have really added any performance improvement. Looking at the
> assembler, with -O3, gcc does a pretty good job of unrolling as it is.

For what it's worth, my testing showed the current memcpy code in musl
and the naive "while (n--) *d++=*s++;" version performing
near-identically at -O3, and both got about 20% faster with
-funroll-all-loops. With -O2 or -Os, the naive version was about 5
times slower.

Rich



* Re: Thinking about release
  2013-07-09 22:26           ` Andre Renaud
  2013-07-10  6:42             ` Jens Gustedt
@ 2013-07-10 22:44             ` Andre Renaud
  2013-07-11  3:37               ` Rich Felker
  1 sibling, 1 reply; 38+ messages in thread
From: Andre Renaud @ 2013-07-10 22:44 UTC (permalink / raw)
  To: Andre Renaud; +Cc: musl

> This results in 95MB/s on my platform (up from 65MB/s for the existing
> memcpy.c, and down from 105MB/s with the asm optimised version). It is
> essentially as readable as the existing memcpy.c. I'm not
> really familiar with any other cpu architectures, so I'm not sure if
> this would improve, or hurt, performance on other platforms.

Reviewing the assembler that is produced, it appears that GCC will
never generate an ldm/stm instruction (load/store multiple) that reads
into more than 4 registers, whereas the optimised assembler uses ones
that read 8 (ie: 8 * 32-bit reads in a single instruction). I've tried
various tricks/optimisations with the C code, and can't convince GCC
to do more than 4. I assume that this is probably where the remaining
10MB/s difference between these two variants comes from.

Rich - do you have any comments on whether either the C or assembler
variants of memcpy might be suitable for inclusion in musl?

Regards,
Andre



* Re: Thinking about release
  2013-07-10 22:44             ` Andre Renaud
@ 2013-07-11  3:37               ` Rich Felker
  2013-07-11  4:04                 ` Andre Renaud
                                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Rich Felker @ 2013-07-11  3:37 UTC (permalink / raw)
  To: musl; +Cc: Andre Renaud

On Thu, Jul 11, 2013 at 10:44:16AM +1200, Andre Renaud wrote:
> > This results in 95MB/s on my platform (up from 65MB/s for the existing
> > memcpy.c, and down from 105MB/s with the asm optimised version). It is
> > essentially as readable as the existing memcpy.c. I'm not
> > really familiar with any other cpu architectures, so I'm not sure if
> > this would improve, or hurt, performance on other platforms.
> 
> Reviewing the assembler that is produced, it appears that GCC will
> never generate an ldm/stm instruction (load/store multiple) that reads
> into more than 4 registers, whereas the optimised assembler uses ones
> that read 8 (ie: 8 * 32-bit reads in a single instruction). I've tried

For the asm, could we make it more than 8? 10 seems easy, 12 seems
doubtful. I don't see a fundamental reason it needs to be a power of
two, unless the cache line alignment really helps and isn't just
cargo-culting. (This is something I'd still like to know about the
asm: whether it's doing unnecessary stuff that does not help
performance.)

> various tricks/optimisations with the C code, and can't convince GCC
> to do more than 4. I assume that this is probably where the remaining
> 10MB/s difference between these two variants comes from.

Yes, I suspect so. One slightly crazy idea I had was to write the
function in C with just inline asm for the inner ldm/stm loop. The
build system does not yet have support for .c files in the arch dirs
instead of .s files, but it could be added.

> Rich - do you have any comments on whether either the C or assembler
> variants of memcpy might be suitable for inclusion in musl?

I would say either might be, but it looks like if we want competitive
performance, some asm will be needed (either inline or full). My
leaning would be to go for something simpler than the asm you've been
experimenting with, but with same or better performance, if this is
possible. I realize the code is not that big as-is, in terms of binary
size, but it's big from an "understanding it" perspective and I don't
like big asm blobs that are hard for somebody to look at and say "oh
yeah, this is clearly right".

Anyway, the big question I'd still like to get answered before moving
forward is whether the cache line alignment has any benefit.

Rich



* Re: Thinking about release
  2013-07-11  3:37               ` Rich Felker
@ 2013-07-11  4:04                 ` Andre Renaud
  2013-07-11  5:10                   ` Andre Renaud
  2013-07-11  5:27                 ` Daniel Cegiełka
  2013-07-15  4:25                 ` Rob Landley
  2 siblings, 1 reply; 38+ messages in thread
From: Andre Renaud @ 2013-07-11  4:04 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

Hi Rich,
>> Rich - do you have any comments on whether either the C or assembler
>> variants of memcpy might be suitable for inclusion in musl?
>
> I would say either might be, but it looks like if we want competitive
> performance, some asm will be needed (either inline or full). My
> leaning would be to go for something simpler than the asm you've been
> experimenting with, but with same or better performance, if this is
> possible. I realize the code is not that big as-is, in terms of binary
> size, but it's big from an "understanding it" perspective and I don't
> like big asm blobs that are hard for somebody to look at and say "oh
> yeah, this is clearly right".
>
> Anyway, the big question I'd still like to get answered before moving
> forward is whether the cache line alignment has any benefit.

I certainly appreciate the need for concise, well understood, easily
readable code.

I can't see any obvious reason why this shouldn't work, although the
assembler as it stands makes pretty heavy use of all the registers,
and I can't immediately see how to rework it to free up 2 more (I can
free up 1 by dropping the attempted preload). Given my (lack of)
skills with ARM assembler, I'm not sure I'll be able to look too
deeply into either of these options, but I'll have a go at the inline
ASM version to force 8*4byte loads to see if it improves things.

Regards,
Andre



* Re: Thinking about release
  2013-07-09 21:28         ` Andre Renaud
  2013-07-09 22:26           ` Andre Renaud
  2013-07-10 19:42           ` Rich Felker
@ 2013-07-11  4:30           ` Strake
  2013-07-11  4:33             ` Rich Felker
  2 siblings, 1 reply; 38+ messages in thread
From: Strake @ 2013-07-11  4:30 UTC (permalink / raw)
  To: musl

On 09/07/2013, Andre Renaud <andre@bluewatersys.com> wrote:
> I wasn't too sure on memmove, but I've seen a reasonable amount of
> code which just uses memmove as standard (rather than memcpy), to
> guard against possibly overlapping regions. Not a great policy

Why? What loss with memmove? That it takes 1.0125 times as long as
memcpy, other than when memcpy might just trash the array or summon
nasal demons anyhow?



* Re: Thinking about release
  2013-07-11  4:30           ` Strake
@ 2013-07-11  4:33             ` Rich Felker
  0 siblings, 0 replies; 38+ messages in thread
From: Rich Felker @ 2013-07-11  4:33 UTC (permalink / raw)
  To: musl

On Wed, Jul 10, 2013 at 11:30:47PM -0500, Strake wrote:
> On 09/07/2013, Andre Renaud <andre@bluewatersys.com> wrote:
> > I wasn't too sure on memmove, but I've seen a reasonable amount of
> > code which just uses memmove as standard (rather than memcpy), to
> > guard against possibly overlapping regions. Not a great policy
> 
> Why? What loss with memmove? That it takes 1.0125 times as long as
> memcpy, other than when memcpy might just trash the array or summon
> nasal demons anyhow?

If you're performing a copy between objects that overlap, or if you're
not sure whether you might be, then it's very likely that you're doing
something wrong. Or at least that's my opinion.

Anyway I have no objection to an optimized memmove, but I do think
starting with just memcpy is easier for review and
cleanup/optimization.

Rich



* Re: Thinking about release
  2013-07-11  4:04                 ` Andre Renaud
@ 2013-07-11  5:10                   ` Andre Renaud
  2013-07-11 12:46                     ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Andre Renaud @ 2013-07-11  5:10 UTC (permalink / raw)
  To: musl; +Cc: Rich Felker

> I can't see any obvious reason why this shouldn't work, although the
> assembler as it stands makes pretty heavy use of all the registers,
> and I can't immediately see how to rework it to free up 2 more (I can
> free up 1 by dropping the attempted preload). Given my (lack of)
> skills with ARM assembler, I'm not sure I'll be able to look too
> deeply into either of these options, but I'll have a go at the inline
> ASM version to force 8*4byte loads to see if it improves things.

I've given it a bit of a go, and at first it appears to be working
(although I don't exactly have a comprehensive test suite, so this is
very preliminary). Anyone with some more ARM assembler experience is
welcome to chip in with a comment.

I also managed to mess up my last set of benchmarking - I'd indicated
that I got 65 vs 95 vs 105, however I'd stuffed up the fact that the
first call would have poor cache performance. Once I corrected that
the results have become more like 65(naive) vs 105(typedef) vs
113(asm).

Using the below code, it becomes 65(naive), 113(inline asm), 113(full
asm). So the inline is able to do perform as we'd expect. Assuming
that it is technically correct (which is probably the biggest
question).

#define SS (8 * 4)
#define ALIGN (SS - 1)
void * noinline my_asm_memcpy(void * restrict dest, const void * restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
        goto misaligned;

    for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;
    if (n) {
        for (; n>=SS; n-= SS) {
                __asm__("ldmia %0, {r4-r11}"
                                : "=r" (s)
                                : "0" (s)
                                : "r4", "r5", "r6", "r7", "r8", "r9",
"r10", "r11");
                s+=SS;
                __asm__("stmia %0, {r4-r11}"
                                : "=r" (d)
                                :"0" (d));
                d+=SS;
        }

misaligned:
        for (; n; n--) *d++ = *s++;
    }
    return dest;
}

Regards,
Andre



* Re: Thinking about release
  2013-07-11  3:37               ` Rich Felker
  2013-07-11  4:04                 ` Andre Renaud
@ 2013-07-11  5:27                 ` Daniel Cegiełka
  2013-07-11 12:49                   ` Rich Felker
  2013-07-15  4:25                 ` Rob Landley
  2 siblings, 1 reply; 38+ messages in thread
From: Daniel Cegiełka @ 2013-07-11  5:27 UTC (permalink / raw)
  To: musl

2013/7/11 Rich Felker <dalias@aerifal.cx>:
> On Thu, Jul 11, 2013 at 10:44:16AM +1200, Andre Renaud wrote:

> Yes, I suspect so. One slightly crazy idea I had was to write the
> function in C with just inline asm for the inner ldm/stm loop.

A bit of useful code (x86):

http://dpdk.org/browse/dpdk/tree/lib/librte_eal/common/include/rte_memcpy.h

Daniel



* Re: Thinking about release
  2013-07-11  5:10                   ` Andre Renaud
@ 2013-07-11 12:46                     ` Rich Felker
  2013-07-11 22:34                       ` Andre Renaud
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2013-07-11 12:46 UTC (permalink / raw)
  To: musl

On Thu, Jul 11, 2013 at 05:10:41PM +1200, Andre Renaud wrote:
> > I can't see any obvious reason why this shouldn't work, although the
> > assembler as it stands makes pretty heavy use of all the registers,
> > and I can't immediately see how to rework it to free up 2 more (I can
> > free up 1 by dropping the attempted preload). Given my (lack of)
> > skills with ARM assembler, I'm not sure I'll be able to look too
> > deeply into either of these options, but I'll have a go at the inline
> > ASM version to force 8*4byte loads to see if it improves things.
> 
> I've given it a bit of a go, and at first it appears to be working
> (although I don't exactly have a comprehensive test suite, so this is
> very preliminary). Anyone with some more ARM assembler experience is
> welcome to chip in with a comment.
> 
> I also managed to mess up my last set of benchmarking - I'd indicated
> that I got 65 vs 95 vs 105, however I'd stuffed up the fact that the
> first call would have poor cache performance. Once I corrected that
> the results have become more like 65(naive) vs 105(typedef) vs
> 113(asm).
> 
> Using the below code, it becomes 65(naive), 113(inline asm), 113(full
> asm). So the inline asm is able to perform as we'd expect. Assuming
> that it is technically correct (which is probably the biggest
> question).

It's not.

> #define SS (8 * 4)
> #define ALIGN (SS - 1)
> void * noinline my_asm_memcpy(void * restrict dest, const void * restrict src, size_t n)
> {
>     unsigned char *d = dest;
>     const unsigned char *s = src;
> 
>     if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
>         goto misaligned;
> 
>     for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;
>     if (n) {
>         for (; n>=SS; n-= SS) {
>                 __asm__("ldmia %0, {r4-r11}"
>                                 : "=r" (s)
>                                 : "0" (s)
>                                 : "r4", "r5", "r6", "r7", "r8", "r9",
> "r10", "r11");
>                 s+=SS;
>                 __asm__("stmia %0, {r4-r11}"
>                                 : "=r" (d)
>                                 :"0" (d));
>                 d+=SS;

You need both instructions in the same asm block, and proper
constraints. As it is, whether the registers keep their values between
the two separate asm blocks is up to the compiler's whims.

With the proper constraints ("+r" type), the s+=SS and d+=SS are
unnecessary, as a bonus. Also there's no reason to force alignment to
SS for this loop; that will simply prevent it from being used as much
for smaller copies. I would use SS==sizeof(size_t) and then write 8*SS
in the for loop.

Last night I was in the process of writing something very similar, but
I put the for loop in asm too and didn't finish it. If it performs
just as well with the loop in C, I like your version better.
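
(Concretely, the combined block described above might look like this
sketch, reusing the names from your code; the "!" writeback forms
update s and d, so no separate pointer arithmetic is needed:)

	for (; n >= 8*SS; n -= 8*SS)
		__asm__ __volatile__(
			"ldmia %1!, {r4-r11}\n\t"
			"stmia %0!, {r4-r11}"
			: "+r"(d), "+r"(s)
			: : "r4", "r5", "r6", "r7", "r8", "r9",
			    "r10", "r11", "memory");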

Rich



* Re: Thinking about release
  2013-07-11  5:27                 ` Daniel Cegiełka
@ 2013-07-11 12:49                   ` Rich Felker
  0 siblings, 0 replies; 38+ messages in thread
From: Rich Felker @ 2013-07-11 12:49 UTC (permalink / raw)
  To: musl

On Thu, Jul 11, 2013 at 07:27:11AM +0200, Daniel Cegiełka wrote:
> 2013/7/11 Rich Felker <dalias@aerifal.cx>:
> > On Thu, Jul 11, 2013 at 10:44:16AM +1200, Andre Renaud wrote:
> 
> > Yes, I suspect so. One slightly crazy idea I had was to write the
> > function in C with just inline asm for the inner ldm/stm loop.
> 
> A bit of useful code (x86):
> 
> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/common/include/rte_memcpy.h

On modern x86 (32-bit), this is slower than even the naive "rep movsb"
version. Some x86 chips have problems with rep movsb, so the version
in musl does a little bit more work (possibly more than it needs to)
to use "rep movsd".

On x86_64, there _may_ be faster approaches than the "rep movsq" we
have right now, but so far my impression is that they don't work on
baseline x86_64 (only later variants) and don't gain much.

Rich



* Re: Thinking about release
  2013-07-11 12:46                     ` Rich Felker
@ 2013-07-11 22:34                       ` Andre Renaud
  2013-07-12  3:16                         ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Andre Renaud @ 2013-07-11 22:34 UTC (permalink / raw)
  To: musl

Hi Rich,

> You need both instructions in the same asm block, and proper
> constraints. As it is, whether the registers keep their values between
> the two separate asm blocks is up to the compiler's whims.
>
> With the proper constraints ("+r" type), the s+=SS and d+=SS are
> unnecessary, as a bonus. Also there's no reason to force alignment to
> SS for this loop; that will simply prevent it from being used as much
> for smaller copies. I would use SS==sizeof(size_t) and then write 8*SS
> in the for loop.
>
> Last night I was in the process of writing something very similar, but
> I put the for loop in asm too and didn't finish it. If it performs
> just as well with the loop in C, I like your version better.

I've rejiggled it a bit, and it appears to be working. I wasn't
entirely sure what you meant about the proper constraints. There is an
additional reason why 8*4 was used for the align - to force the whole
loop to work in cache-line blocks. I've now done this explicitly on
the lead-in by doing the first few copies as 32-bit, then going to the
full cache-line asm. This has the same performance as the fully native
assembler. However to get that I had to use the same trick that the
native assembler uses - doing a load of the next block prior to
storing this one. I'm a bit concerned that this would mean we'd be
doing a read that was out of bounds, and I can't entirely see why this
wouldn't be happening with the existing assembler (but I'm presuming
it doesn't). Any comments on this side of it?

#define SS sizeof(size_t)
#define ALIGN (SS - 1)
void * noinline my_asm_memcpy(void * restrict dest, const void * restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
        goto misaligned;

    /* ARM has 32-byte cache lines, so get us aligned to that */
    for (; ((uintptr_t)d & ((8 * SS) - 1)) && n; n-=SS) {
            *(size_t *)d = *(size_t *)s;
            d += SS;
            s+= SS;
    }
    /* Do full cache line read/writes */
    if (n) {
        for (; n>=(8 * SS); n-= (8 * SS)) {
                __asm__ (
                        "ldmia %0, {r4-r11}\n"
                        "add %0, %0, %4\n"
                        "bic r12, %0, %5\n"
                        "ldrhi r12, [%0]\n"
                        "stmia %1, {r4-r11}\n"
                        "add %1, %1, %4"
                        : "=r"(s), "=r"(d)
                        : "0"(s), "1"(d), "i"(8 * SS), "i"((8 * SS) - 1)
                        : "r4", "r5", "r6", "r7", "r8",
                          "r9", "r10", "r11", "r12");
        }

misaligned:
        for (; n; n--) *d++ = *s++;
    }
    return dest;

}

Regards,
Andre



* Re: Thinking about release
  2013-07-11 22:34                       ` Andre Renaud
@ 2013-07-12  3:16                         ` Rich Felker
  2013-07-12  3:36                           ` Andre Renaud
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2013-07-12  3:16 UTC (permalink / raw)
  To: musl

On Fri, Jul 12, 2013 at 10:34:31AM +1200, Andre Renaud wrote:
> I've rejiggled it a bit, and it appears to be working. I wasn't
> entirely sure what you meant about the proper constraints. There is an
> additional reason why 8*4 was used for the align - to force the whole
> loop to work in cache-line blocks. I've now done this explicitly on
> the lead-in by doing the first few copies as 32-bit, then going to the
> full cache-line asm. This has the same performance as the fully native
> assembler. However to get that I had to use the same trick that the
> native assembler uses - doing a load of the next block prior to
> storing this one. I'm a bit concerned that this would mean we'd be
> doing a read that was out of bounds, and I can't entirely see why this
> wouldn't be happening with the existing assembler (but I'm presuming
> it doesn't). Any comments on this side of it?

I was unable to measure any difference in performance of your version
with the prefetch hack versus simply:

	__asm__ __volatile__(
	"ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
	"stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
	: "+r"(d), "+r"(s) :
	: "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");

in the inner loop.

Rich



* Re: Thinking about release
  2013-07-12  3:16                         ` Rich Felker
@ 2013-07-12  3:36                           ` Andre Renaud
  2013-07-12  4:16                             ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Andre Renaud @ 2013-07-12  3:36 UTC (permalink / raw)
  To: musl

> I was unable to measure any difference in performance of your version
> with the prefetch hack versus simply:
>
>         __asm__ __volatile__(
>         "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
>         "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
>         : "+r"(d), "+r"(s) :
>         : "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");

What kind of machine were you using? I see a change of 115MB/s ->
105MB/s when I drop the prefetch, even using the code that you
suggested. This is on an Atmel AT91sam9g45 (ARM926ejs @ 400MHz). I'm
assuming this is some subtlety about how the cache is operating?
Sticking the ldrhi back in brings the speed back, ie:
          __asm__ __volatile__(
                  "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                  "ldrhi r12, [%1]\n"
                  "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                  : "+r"(d), "+r"(s) :
                  : "a4", "v1", "v2", "v3", "v4", "v5",
                    "v6", "v7", "r12", "memory");

Regards,
Andre


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Thinking about release
  2013-07-12  3:36                           ` Andre Renaud
@ 2013-07-12  4:16                             ` Rich Felker
  2013-07-24  1:34                               ` Andre Renaud
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2013-07-12  4:16 UTC (permalink / raw)
  To: musl

On Fri, Jul 12, 2013 at 03:36:42PM +1200, Andre Renaud wrote:
> > I was unable to measure any difference in performance of your version
> > with the prefetch hack versus simply:
> >
> >         __asm__ __volatile__(
> >         "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
> >         "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
> >         : "+r"(d), "+r"(s) :
> >         : "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");
> 
> What kind of machine were you using? I see a change of 115MB/s ->

It's a combined ARM Cortex-A9 & FPGA chip from Xilinx. Supposedly the
timings match the Cortex-A9 in other ARM chips.

> 105MB/s when I drop the prefetch, even using the code that you
> suggested. This is on an Atmel AT91sam9g45 (ARM926ejs @ 400MHz). I'm
> assuming this is some subtlety about how the cache is operating?

Perhaps so.

By the way, I also did some tests with misaligning the src/dest with
respect to cache lines, and the timing did change, but not in any way
I could make sense of...

It may turn out that the issues are sufficiently complex that we won't
get ideal performance without either copying the BSD code you
suggested, or fully understanding what it's doing (along with the
other ARM performance issues) and developing something new based on
that understanding... In that case, copying/adapting the BSD code
might turn out to be the right solution for now.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Thinking about release
  2013-07-10 19:42           ` Rich Felker
@ 2013-07-14  6:37             ` Rob Landley
  0 siblings, 0 replies; 38+ messages in thread
From: Rob Landley @ 2013-07-14  6:37 UTC (permalink / raw)
  To: musl; +Cc: musl

On 07/10/2013 02:42:34 PM, Rich Felker wrote:
> On Wed, Jul 10, 2013 at 09:28:21AM +1200, Andre Renaud wrote:
> > >> Does anyone have any comments on the suitability of this code, or what
> > >
> > > If nothing else, it fails to be armv4 compatible. Fixing that should
> > > not be hard, but it would require a bit of an audit. The return
> > > sequences are the obvious issue, but there may be other instructions
> > > in use that are not available on armv4 or maybe not even on armv5...?
> >
> > Rob Landley mentioned a while ago that armv4 has issues with the EABI
> > stuff. Is armv4 a definite lower bound for musl support, as opposed to
> > armv4t or armv5?
> 
> EABI specifies thumb; however, it's possible to have code which
> conforms fully to EABI but does not rely on the presence of thumb. GCC
> is incapable of generating such code, but it could be enhanced to do
> so, and all of the existing assembly in musl is plain-v4-compatible,
> so I would prefer not to shut out the possibility of supporting older
> ARM.

One of my larger pending todo items for aboriginal is fishing the last  
gplv2 release out of gcc git the same way I did for binutils. (In  
theory, this should give me armv7l support. In practice, the mpfr and  
gmp split complicates matters...)

If somebody wanted to come up with an armv4-eabi patch, I'd happily  
include it. Or just give me rather a lot of hints on what would be  
involved, since I'm not much of an arm assembly programmer...

Rob

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Thinking about release
  2013-07-11  3:37               ` Rich Felker
  2013-07-11  4:04                 ` Andre Renaud
  2013-07-11  5:27                 ` Daniel Cegiełka
@ 2013-07-15  4:25                 ` Rob Landley
  2 siblings, 0 replies; 38+ messages in thread
From: Rob Landley @ 2013-07-15  4:25 UTC (permalink / raw)
  To: musl; +Cc: musl, Andre Renaud

On 07/10/2013 10:37:55 PM, Rich Felker wrote:
> On Thu, Jul 11, 2013 at 10:44:16AM +1200, Andre Renaud wrote:
> > > This results in 95MB/s on my platform (up from 65MB/s for the
> > > existing memcpy.c, and down from 105MB/s with the asm optimised
> > > version). It is essentially identically readable to the existing
> > > memcpy.c. I'm not really familiar with any other cpu architectures,
> > > so I'm not sure if this would improve, or hurt, performance on
> > > other platforms.
> >
> > Reviewing the assembler that is produced, it appears that GCC will
> > never generate an ldm/stm instruction (load/store multiple) that
> > reads into more than 4 registers, whereas the optimised assembler
> > uses ones that read 8 (ie: 8 * 32bit reads in a single instruction).
> > I've tried
> 
> For the asm, could we make it more than 8? 10 seems easy, 12 seems
> doubtful. I don't see a fundamental reason it needs to be a power of
> two, unless the cache line alignment really helps and isn't just
> cargo-culting. (This is something I'd still like to know about the
> asm: whether it's doing unnecessary stuff that does not help
> performance.)

You're going to hit bus bandwidth at some point, and that's likely to  
be a power of two.

> > various tricks/optimisations with the C code, and can't convince GCC
> > to do more than 4. I assume that this is probably where the remaining
> > 10MB/s difference between these two variants comes from.
> 
> Yes, I suspect so. One slightly crazy idea I had was to write the
> function in C with just inline asm for the inner ldm/stm loop. The
> build system does not yet have support for .c files in the arch dirs
> instead of .s files, but it could be added.

Does it have support for a header defining a macro containing the
assembly bit?
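
Something like this is what I'm picturing (the header name and
COPY_BLOCK macro are invented here purely for illustration; musl has
no such header today):

/* hypothetical arch/arm/copy_block.h: one 8-word copy step */
#define COPY_BLOCK(d, s) \
	__asm__ __volatile__( \
		"ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t" \
		"stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t" \
		: "+r"(d), "+r"(s) : \
		: "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory")

A generic src/string/memcpy.c could then use COPY_BLOCK for its inner
loop, with a plain C definition as the fallback on archs that don't
provide the header.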

> > Rich - do you have any comments on whether either the C or assembler
> > variants of memcpy might be suitable for inclusion in musl?
> 
> I would say either might be, but it looks like if we want competitive
> performance, some asm will be needed (either inline or full). My
> leaning would be to go for something simpler than the asm you've been
> experimenting with, but with same or better performance, if this is
> possible. I realize the code is not that big as-is, in terms of binary
> size, but it's big from an "understanding it" perspective and I don't
> like big asm blobs that are hard for somebody to look at and say "oh
> yeah, this is clearly right".
> 
> Anyway, the big question I'd still like to get answered before moving
> forward is whether the cache line alignment has any benefit.

I'd expect so. Fundamentally what the processor is doing is fetching  
and writing cachelines. What it does to the contents of the cachelines  
is just annotating that larger operation.

(Several days behind on email, as usual...)

> Rich

Rob

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Thinking about release
  2013-07-12  4:16                             ` Rich Felker
@ 2013-07-24  1:34                               ` Andre Renaud
  2013-07-24  3:48                                 ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Andre Renaud @ 2013-07-24  1:34 UTC (permalink / raw)
  To: musl

Hi Rich,

On 12 July 2013 16:16, Rich Felker <dalias@aerifal.cx> wrote:
> By the way, I also did some tests with misaligning the src/dest with
> respect to cache lines, and the timing did change, but not in any way
> I could make sense of...
>
> It may turn out that the issues are sufficiently complex that we
> won't get ideal performance without either copying the BSD code you
> suggested, or fully understanding what it's doing (along with the
> other ARM performance issues) and developing something new based on
> that understanding... In that case, copying/adapting the BSD code
> might turn out to be the right solution for now.

What was the final decision on this? The last version (with mixed
inline assembler/C) is (I believe) relatively readable, and appears to
be correct. It also compiles on all the available platforms (ie:
armv4+). Can this version be accepted?

Regards,
Andre


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Thinking about release
  2013-07-24  1:34                               ` Andre Renaud
@ 2013-07-24  3:48                                 ` Rich Felker
  2013-07-24  4:40                                   ` Andre Renaud
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2013-07-24  3:48 UTC (permalink / raw)
  To: musl

On Wed, Jul 24, 2013 at 01:34:07PM +1200, Andre Renaud wrote:
> Hi Rich,
> 
> On 12 July 2013 16:16, Rich Felker <dalias@aerifal.cx> wrote:
> > By the way, I also did some tests with misaligning the src/dest with
> > respect to cache lines, and the timing did change, but not in any way
> > I could make sense of...
> >
> > It may turn out that the issues are sufficiently complex that we
> > won't get ideal performance without either copying the BSD code you
> > suggested, or fully understanding what it's doing (along with the
> > other ARM performance issues) and developing something new based on
> > that understanding... In that case, copying/adapting the BSD code
> > might turn out to be the right solution for now.
> 
> What was the final decision on this? The last version (with mixed
> inline assembler/C) is (I believe) relatively readable, and appears to
> be correct. It also compiles on all the available platforms (ie:
> armv4+). Can this version be accepted?

It looks buggy as-is; as far as I can tell, it will crash if src/dest
are aligned with respect to each other but not aligned mod 4, i.e. the
code starts out copying word-at-a-time rather than byte-at-a-time.
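
For concreteness, a call that hits it (a hypothetical test against
the my_asm_memcpy posted upthread):

	char srcbuf[64], dstbuf[64];
	/* both pointers are congruent mod 4 (offset 1), so the
	   misaligned path is skipped and the word loop dereferences
	   (size_t *) pointers that are not word-aligned */
	my_asm_memcpy(dstbuf + 1, srcbuf + 1, 40);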

I think the C version would be acceptable if we get the bugs fixed and
test it well, but I'd also like to still keep the asm under
consideration. There are lots of cases not covered by the C version,
like misaligned copies (important for strings, not for much else). Do
you think these cases are important?

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Thinking about release
  2013-07-24  3:48                                 ` Rich Felker
@ 2013-07-24  4:40                                   ` Andre Renaud
  2013-07-28  8:09                                     ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Andre Renaud @ 2013-07-24  4:40 UTC (permalink / raw)
  To: musl

Hi Rich,
> It looks buggy as-is; as far as I can tell, it will crash if src/dest
> are aligned with respect to each other but not aligned mod 4, i.e. the
> code starts out copying word-at-a-time rather than byte-at-a-time.

Yes, you are correct, I'd messed that up while looking at the cache
alignment stuff (along with another small size-related bug). Fixing it
is relatively straightforward though:
#define SS sizeof(size_t)
#define ALIGN (SS - 1)
void *noinline my_asm_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
        goto misaligned;

    /* Get them word aligned */
    for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;

    /* ARM has 32-byte cache lines, so align to that for performance */
    for (; ((uintptr_t)d & ((8 * SS) - 1)) && n >= SS; n -= SS) {
        *(size_t *)d = *(size_t *)s;
        d += SS;
        s += SS;
    }
    /* Do full cache line read/writes; the ldrhi prefetches the first
       word of the next source block, so it can read up to one word
       past the end of the copy on the last iteration */
    for (; n >= (8 * SS); n -= (8 * SS))
        __asm__ __volatile__(
                "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                "ldrhi r12, [%1]\n"
                "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                : "+r"(d), "+r"(s) :
                : "a4", "v1", "v2", "v3", "v4", "v5",
                  "v6", "v7", "r12", "memory");

misaligned:
    for (; n; n--) *d++ = *s++;
    return dest;
}

> I think the C version would be acceptable if we get the bugs fixed and
> test it well, but I'd also like to still keep the asm under
> consideration. There are lots of cases not covered by the C version,
> like misaligned copies (important for strings, not for much else). Do
> you think these cases are important?

At the moment the misaligned copies perform terribly (18MB/s vs glibc
@ 100MB/s). However, the existing C implementation in musl is no
different, so we're not degrading the current system.

We're essentially missing the non-congruent copying stuff from the asm
code. I'll have a look at this and see if I can write a similar C
version.
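
Roughly what I have in mind is the classic shift-and-merge loop --
just a sketch for now (untested, little-endian only, the function
name is made up):

#include <stddef.h>
#include <stdint.h>

#define SS sizeof(size_t)
#define ALIGN (SS - 1)

static void *memcpy_noncongruent(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* byte-copy until dest is word-aligned; since d and s are not
       congruent mod SS, s now sits 1..SS-1 bytes past a word boundary */
    for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;

    if (n >= 2 * SS && ((uintptr_t)s & ALIGN)) {
        size_t shift = (uintptr_t)s & ALIGN;
        /* aligned word containing s[0]; this load reaches up to shift
           bytes before s -- it can't cross a page boundary, but it is
           formally out of bounds, the same concern as with the asm's
           prefetch */
        const size_t *ws = (const size_t *)(s - shift);
        size_t lo = *ws++;
        for (; n >= 2 * SS; n -= SS) {
            size_t hi = *ws++;
            /* merge the tail of lo with the head of hi to form one
               dest-aligned word */
            *(size_t *)d = (lo >> (8 * shift)) | (hi << (8 * (SS - shift)));
            d += SS;
            lo = hi;
        }
        /* recover the byte position in src for the tail copy */
        s = (const unsigned char *)ws - SS + shift;
    }

    for (; n; n--) *d++ = *s++;
    return dest;
}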

Regards,
Andre


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Thinking about release
  2013-07-24  4:40                                   ` Andre Renaud
@ 2013-07-28  8:09                                     ` Rich Felker
  0 siblings, 0 replies; 38+ messages in thread
From: Rich Felker @ 2013-07-28  8:09 UTC (permalink / raw)
  To: musl

On Wed, Jul 24, 2013 at 04:40:16PM +1200, Andre Renaud wrote:
> > I think the C version would be acceptable if we get the bugs fixed and
> > test it well, but I'd also like to still keep the asm under
> > consideration. There are lots of cases not covered by the C version,
> > like misaligned copies (important for strings, not for much else). Do
> > you think these cases are important?
> 
> At the moment the mis-aligned copies perform terribly (18MB/s vs glibc
> @ 100MB/s). However the existing C implementation in musl is no
> different, so we're not degrading the current system.
> 
> We're essentially missing the non-congruent copying stuff from the asm
> code. I'll have a look at this and see if I can write a similar C
> version.

Sorry this hasn't been moving forward more quickly. I've been
experimenting with various memcpys on ARM, and I've found:

1. Pure C code that performs comparably (only in the aligned case, so
   far) to the bionic asm and your inline-asm C version.

2. The prefetch stuff in your inline asm and the bionic version is
   apparently making it slower. With that removed from your C
   (basically, my inline asm version; see the loop below) it's 10%
   faster on the machine I'm running it on.
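
To be concrete, "with that removed" is just the inner loop of your
last version minus the ldrhi and the r12 clobber (d, s and SS as in
your function):

	for (; n >= (8 * SS); n -= (8 * SS))
		__asm__ __volatile__(
			"ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
			"stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
			: "+r"(d), "+r"(s) :
			: "a4", "v1", "v2", "v3", "v4", "v5",
			  "v6", "v7", "memory");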

So I feel like we still have a ways to go figuring out the right
solution. I know, from one standpoint, it would be nice to have
_something_ right now, but I don't want to commit one version only to
decide next week it's wrong and throw it all out. Hopefully in the
mean time people who are trying to use musl seriously on arm and
running into performance problems can drop in the bionic asm or
another implementation.
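
For anyone who wants to compare candidates in the meantime, a harness
along these lines is enough to get MB/s figures -- a sketch only, not
the exact code either of us ran (link with -lrt where clock_gettime
needs it):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void *my_asm_memcpy(void *restrict, const void *restrict, size_t);

int main(void)
{
	enum { LEN = 1 << 20, ITERS = 200 };
	unsigned char *src = malloc(LEN);
	unsigned char *dst = malloc(LEN);
	struct timespec t0, t1;
	double secs;
	int i;

	if (!src || !dst) return 1;
	memset(src, 0xa5, LEN);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++)
		my_asm_memcpy(dst, src, LEN);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.1f MB/s\n", LEN / 1e6 * ITERS / secs);
	return 0;
}

Offsetting src/dst by a byte or three exercises the misaligned paths.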

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2013-07-28  8:09 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-13  1:25 Thinking about release Rich Felker
2013-06-13  1:33 ` Andre Renaud
2013-06-13  1:43   ` Rich Felker
2013-07-09  5:06     ` Andre Renaud
2013-07-09  5:37       ` Rich Felker
2013-07-09  6:24         ` Harald Becker
2013-07-09 21:28         ` Andre Renaud
2013-07-09 22:26           ` Andre Renaud
2013-07-10  6:42             ` Jens Gustedt
2013-07-10  7:50               ` Rich Felker
2013-07-10 22:44             ` Andre Renaud
2013-07-11  3:37               ` Rich Felker
2013-07-11  4:04                 ` Andre Renaud
2013-07-11  5:10                   ` Andre Renaud
2013-07-11 12:46                     ` Rich Felker
2013-07-11 22:34                       ` Andre Renaud
2013-07-12  3:16                         ` Rich Felker
2013-07-12  3:36                           ` Andre Renaud
2013-07-12  4:16                             ` Rich Felker
2013-07-24  1:34                               ` Andre Renaud
2013-07-24  3:48                                 ` Rich Felker
2013-07-24  4:40                                   ` Andre Renaud
2013-07-28  8:09                                     ` Rich Felker
2013-07-11  5:27                 ` Daniel Cegiełka
2013-07-11 12:49                   ` Rich Felker
2013-07-15  4:25                 ` Rob Landley
2013-07-10 19:42           ` Rich Felker
2013-07-14  6:37             ` Rob Landley
2013-07-11  4:30           ` Strake
2013-07-11  4:33             ` Rich Felker
2013-07-10 19:38         ` Rob Landley
2013-07-10 20:34           ` Andre Renaud
2013-07-10 20:49             ` Nathan McSween
2013-07-10 21:01             ` Rich Felker
2013-06-13 15:46 ` Isaac
2013-06-26  1:44 ` Rich Felker
2013-06-26 10:19   ` Szabolcs Nagy
2013-06-26 14:21     ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).