mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch
@ 2024-02-27 14:07 James Tirta Halim
  2024-02-27 14:49 ` Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: James Tirta Halim @ 2024-02-27 14:07 UTC (permalink / raw)
  To: musl; +Cc: James Tirta Halim

(unaligned access must be supported for mempcmpeq if using word access)

mempcmpeq returns the length of the first mismatched byte. If S1 and S2
are equal, n is returned.

The function is meant to be used internally in strstr and memmem:
https://inbox.vuxu.org/musl/20181108193451.GD5150@brightrain.aerifal.cx/

glibc bench-memcmpeq timings (Core i3-1115G4):
mempcmpeq memcmp_musl memcmp __memcmpeq_evex __memcmpeq_avx2 __memcmpeq_sse2
Average:
62.3978 269.813 16.0426 14.9072 14.1125 20.6527
Total:
30637.3 132478 7876.92 7319.43 6929.23 10140.5

Passes glibc test-memcmpeq.

Return value in testing and benchmarking is changed to suit the
behavior of memcmpeq.

---
 src/string/mempcmpeq.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
 create mode 100644 src/string/mempcmpeq.c

diff --git a/src/string/mempcmpeq.c b/src/string/mempcmpeq.c
new file mode 100644
index 00000000..bc17bc58
--- /dev/null
+++ b/src/string/mempcmpeq.c
@@ -0,0 +1,19 @@
+#include <stddef.h>
+
+size_t
+mempcmpeq(const void *s1,
+          const void *s2,
+          size_t n)
+{
+	const size_t length = n;
+#ifdef __GNUC__
+	typedef size_t __attribute__((__may_alias__)) word;
+	const unsigned char *p1 = (const unsigned char *)s1;
+	const unsigned char *p2 = (const unsigned char *)s2;
+	for (; n >= sizeof(word) && *(word *)p1 == *(word *)p2; p1+=sizeof(word), p2+=sizeof(word), n-=sizeof(word));
+#endif
+	for (; n; --n)
+		if (*p1++ != *p2++)
+			return (size_t)((p1 - 1 - (const unsigned char *)s1));
+	return length;
+}
-- 
2.44.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch
  2024-02-27 14:07 [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch James Tirta Halim
@ 2024-02-27 14:49 ` Rich Felker
  2024-02-27 14:51   ` Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: Rich Felker @ 2024-02-27 14:49 UTC (permalink / raw)
  To: James Tirta Halim; +Cc: musl

On Tue, Feb 27, 2024 at 09:07:56PM +0700, James Tirta Halim wrote:
> (unaligned access must be supported for mempcmpeq if using word access)
> 
> mempcmpeq returns the length of the first mismatched byte. If S1 and S2
> are equal, n is returned.
> 
> The function is meant to be used internally in strstr and memmem:
> https://inbox.vuxu.org/musl/20181108193451.GD5150@brightrain.aerifal.cx/
> 
> glibc bench-memcmpeq timings (Core i3-1115G4):
> mempcmpeq memcmp_musl memcmp __memcmpeq_evex __memcmpeq_avx2 __memcmpeq_sse2
> Average:
> 62.3978 269.813 16.0426 14.9072 14.1125 20.6527
> Total:
> 30637.3 132478 7876.92 7319.43 6929.23 10140.5
> 
> Passes glibc test-memcmpeq.
> 
> Return value in testing and benchmarking is changed to suit the
> behavior of memcmpeq.

I don't understand this proposal. The only "memcmpeq" I can find
anywhere else is __memcmpeq in glibc, whose contract is very
different: it's a relaxed version of memcmp that's only usable for
testing equality and that does not give an order relationship.

As noted in the above-linked message, a portable C version of this
function is not going to help strstr or memmem because there's no
reason to expect the needle to be aligned modulo any wordsize in the
haystack. For it to help, you would need a version that's capable of
doing misaligned accesses, contingent on arch support for that.
Without that, you're just adding the cost of function call overhead
and making it slower.

> ---
>  src/string/mempcmpeq.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>  create mode 100644 src/string/mempcmpeq.c
> 
> diff --git a/src/string/mempcmpeq.c b/src/string/mempcmpeq.c
> new file mode 100644
> index 00000000..bc17bc58
> --- /dev/null
> +++ b/src/string/mempcmpeq.c
> @@ -0,0 +1,19 @@
> +#include <stddef.h>
> +
> +size_t
> +mempcmpeq(const void *s1,
> +          const void *s2,
> +          size_t n)
> +{
> +	const size_t length = n;
> +#ifdef __GNUC__
> +	typedef size_t __attribute__((__may_alias__)) word;
> +	const unsigned char *p1 = (const unsigned char *)s1;
> +	const unsigned char *p2 = (const unsigned char *)s2;
> +	for (; n >= sizeof(word) && *(word *)p1 == *(word *)p2; p1+=sizeof(word), p2+=sizeof(word), n-=sizeof(word));
> +#endif
> +	for (; n; --n)
> +		if (*p1++ != *p2++)
> +			return (size_t)((p1 - 1 - (const unsigned char *)s1));
> +	return length;
> +}
> -- 
> 2.44.0

This code does not do what you claim it does and does not seem to
match your test results. As written, it should return immediately,
because the entire body except

> +	const size_t length = n;
> +	return length;

is dead code.

Rich

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch
  2024-02-27 14:49 ` Rich Felker
@ 2024-02-27 14:51   ` Rich Felker
  2024-02-28  4:40     ` Markus Wichmann
  0 siblings, 1 reply; 10+ messages in thread
From: Rich Felker @ 2024-02-27 14:51 UTC (permalink / raw)
  To: James Tirta Halim; +Cc: musl

On Tue, Feb 27, 2024 at 09:49:27AM -0500, Rich Felker wrote:
> On Tue, Feb 27, 2024 at 09:07:56PM +0700, James Tirta Halim wrote:
> > ---
> >  src/string/mempcmpeq.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> >  create mode 100644 src/string/mempcmpeq.c
> > 
> > diff --git a/src/string/mempcmpeq.c b/src/string/mempcmpeq.c
> > new file mode 100644
> > index 00000000..bc17bc58
> > --- /dev/null
> > +++ b/src/string/mempcmpeq.c
> > @@ -0,0 +1,19 @@
> > +#include <stddef.h>
> > +
> > +size_t
> > +mempcmpeq(const void *s1,
> > +          const void *s2,
> > +          size_t n)
> > +{
> > +	const size_t length = n;
> > +#ifdef __GNUC__
> > +	typedef size_t __attribute__((__may_alias__)) word;
> > +	const unsigned char *p1 = (const unsigned char *)s1;
> > +	const unsigned char *p2 = (const unsigned char *)s2;
> > +	for (; n >= sizeof(word) && *(word *)p1 == *(word *)p2; p1+=sizeof(word), p2+=sizeof(word), n-=sizeof(word));
> > +#endif
> > +	for (; n; --n)
> > +		if (*p1++ != *p2++)
> > +			return (size_t)((p1 - 1 - (const unsigned char *)s1));
> > +	return length;
> > +}
> > -- 
> > 2.44.0
> 
> This code does not do what you claim it does and does not seem to
> match your test results. As written, it should return immediately,
> because the entire body except
> 
> > +	const size_t length = n;
> > +	return length;
> 
> is dead code.

Sorry, this part is wrong -- I missed the return above, so it appears
it actually does behave as described. Dunno how I overlooked that.
EMORECOFFEE.

Rich

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch
  2024-02-27 14:51   ` Rich Felker
@ 2024-02-28  4:40     ` Markus Wichmann
  2024-02-28 23:14       ` Thorsten Glaser
  0 siblings, 1 reply; 10+ messages in thread
From: Markus Wichmann @ 2024-02-28  4:40 UTC (permalink / raw)
  To: musl; +Cc: James Tirta Halim

Am Tue, Feb 27, 2024 at 09:51:24AM -0500 schrieb Rich Felker:
> On Tue, Feb 27, 2024 at 09:49:27AM -0500, Rich Felker wrote:
> > On Tue, Feb 27, 2024 at 09:07:56PM +0700, James Tirta Halim wrote:
> > > ---
> > >  src/string/mempcmpeq.c | 19 +++++++++++++++++++
> > >  1 file changed, 19 insertions(+)
> > >  create mode 100644 src/string/mempcmpeq.c
> > >
> > > diff --git a/src/string/mempcmpeq.c b/src/string/mempcmpeq.c
> > > new file mode 100644
> > > index 00000000..bc17bc58
> > > --- /dev/null
> > > +++ b/src/string/mempcmpeq.c
> > > @@ -0,0 +1,19 @@
> > > +#include <stddef.h>
> > > +
> > > +size_t
> > > +mempcmpeq(const void *s1,
> > > +          const void *s2,
> > > +          size_t n)
> > > +{
> > > +	const size_t length = n;
> > > +#ifdef __GNUC__
> > > +	typedef size_t __attribute__((__may_alias__)) word;
> > > +	const unsigned char *p1 = (const unsigned char *)s1;
> > > +	const unsigned char *p2 = (const unsigned char *)s2;
> > > +	for (; n >= sizeof(word) && *(word *)p1 == *(word *)p2; p1+=sizeof(word), p2+=sizeof(word), n-=sizeof(word));
> > > +#endif
> > > +	for (; n; --n)
> > > +		if (*p1++ != *p2++)
> > > +			return (size_t)((p1 - 1 - (const unsigned char *)s1));
> > > +	return length;
> > > +}
> > > --
> > > 2.44.0
> >
> > This code does not do what you claim it does and does not seem to
> > match your test results. As written, it should return immediately,
> > because the entire body except
> >
> > > +	const size_t length = n;
> > > +	return length;
> >
> > is dead code.
>
> Sorry, this part is wrong -- I missed the return above, so it appears
> it actually does behave as described. Dunno how I overlooked that.
> EMORECOFFEE.
>
> Rich

And I puzzled over it a solid five minutes wondering how you got to your
conclusion. But there are a few more issues here:
1. The declarations of p1 and p2 need to be moved outside the #ifdef,
since they are used outside the #ifdef.
2. The explicit conversions in the initializations are unnecessary and
should be elided.
3. If this function is to be used by strstr, it needs to get a double
underscore name, since strstr is an ISO-C function and must only call
ISO-C functions or double underscore functions (wonky counterexamples
like system() notwithstanding. And even that one doesn't call extension
functions). If the function should also be an externally accessible
extension, it can get a weak alias.
4. All the other musl C code avoids misaligned word access. I don't know
which architecture/ABI doesn't allow it, but it is nevertheless the
case. And I don't know if this function is still worth it if it has to
take all possible misalignments into account.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch
  2024-02-28  4:40     ` Markus Wichmann
@ 2024-02-28 23:14       ` Thorsten Glaser
  2024-02-29  0:02         ` Pedro Falcato
  2024-02-29  4:19         ` Markus Wichmann
  0 siblings, 2 replies; 10+ messages in thread
From: Thorsten Glaser @ 2024-02-28 23:14 UTC (permalink / raw)
  To: musl; +Cc: James Tirta Halim

Markus Wichmann dixit:

>> > > +	for (; n >= sizeof(word) && *(word *)p1 == *(word *)p2; p1+=sizeof(word), p2+=sizeof(word), n-=sizeof(word));

Very much UB.

>4. All the other musl C code avoids misaligned word access. I don't know
>which architecture/ABI doesn't allow it, but it is nevertheless the

Almost all of them. i386/amd64 penalise it heavily and it can cause
trouble when crossing page sizes, cacheline sizes, etc. as well. On
alpha, many ARM, SPARC, and others, it’s an instant SIGBUS/SIGSEGV.

bye,
//mirabilos
-- 
I believe no one can invent an algorithm. One just happens to hit upon it
when God enlightens him. Or only God invents algorithms, we merely copy them.
If you don't believe in God, just consider God as Nature if you won't deny
existence.		-- Coywolf Qi Hunt

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch
  2024-02-28 23:14       ` Thorsten Glaser
@ 2024-02-29  0:02         ` Pedro Falcato
  2024-02-29  0:10           ` Thorsten Glaser
  2024-02-29  4:19         ` Markus Wichmann
  1 sibling, 1 reply; 10+ messages in thread
From: Pedro Falcato @ 2024-02-29  0:02 UTC (permalink / raw)
  To: musl; +Cc: James Tirta Halim

On Wed, Feb 28, 2024 at 11:18 PM Thorsten Glaser <tg@mirbsd.de> wrote:
>
> Markus Wichmann dixit:
>
> >> > > +        for (; n >= sizeof(word) && *(word *)p1 == *(word *)p2; p1+=sizeof(word), p2+=sizeof(word), n-=sizeof(word));
>
> Very much UB.
>
> >4. All the other musl C code avoids misaligned word access. I don't know
> >which architecture/ABI doesn't allow it, but it is nevertheless the
>
> Almost all of them. i386/amd64 penalise it heavily and it can cause

Small note: This isn't quite true for remotely modern x86, unaligned
accesses are pretty cheap compared to extra branches, and this fact is
abused very frequently in optimized stringops implementations (see
every optimized memcpy - which use overlapping loads and stores,
effectively abusing unaligned mem ops).

-- 
Pedro

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch
  2024-02-29  0:02         ` Pedro Falcato
@ 2024-02-29  0:10           ` Thorsten Glaser
  2024-02-29  0:57             ` Robert Clausecker
  0 siblings, 1 reply; 10+ messages in thread
From: Thorsten Glaser @ 2024-02-29  0:10 UTC (permalink / raw)
  To: musl; +Cc: James Tirta Halim

Pedro Falcato dixit:

>Small note: This isn't quite true for remotely modern x86, unaligned

It’s very much true, e.g. it breaks atomicity (ok, not relevant
*here*, but in general).

AIUI, even modern amd64 chips of all vendors are reverting to
optimising rep movsb/lodsb instead again, for stringops.

Of course the status on other architectures should be sufficient to
not use unaligned accesses.

bye,
//mirabilos
-- 
08:05⎜<XTaran:#grml> mika: Does grml have an tool to read Apple
     ⎜    System Log (asl) files? :)
08:08⎜<ft:#grml> yeah. /bin/rm. ;)       08:09⎜<mrud:#grml> hexdump -C
08:31⎜<XTaran:#grml> ft, mrud: *g*

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch
  2024-02-29  0:10           ` Thorsten Glaser
@ 2024-02-29  0:57             ` Robert Clausecker
  2024-02-29  1:13               ` Pedro Falcato
  0 siblings, 1 reply; 10+ messages in thread
From: Robert Clausecker @ 2024-02-29  0:57 UTC (permalink / raw)
  To: musl

Greetings,

Am Thu, Feb 29, 2024 at 12:10:05AM +0000 schrieb Thorsten Glaser:
> Pedro Falcato dixit:
> 
> >Small note: This isn't quite true for remotely modern x86, unaligned
> 
> It’s very much true, e.g. it breaks atomicity (ok, not relevant
> *here*, but in general).
> 
> AIUI, even modern amd64 chips of all vendors are reverting to
> optimising rep movsb/lodsb instead again, for stringops.

That is not the case.  REP MOVSB and friends have a high startup latency,
so you only want to use them for large-ish blocks.  Too large and all of
the sudden AVX-512 is faster again.  For small blocks however, you do not
want to use this instruction.  It's indeed much better to do a pair of
overlapping stores.  They do not perform crazy well, but it's still better
than all alternatives.

Also note that REP LODSB is pretty useless; did you perhaps mean REP STOSB?

Source: have spent a good part of last year implementing <string.h> in
x86 assembly for FreeBSD's libc.

> Of course the status on other architectures should be sufficient to
> not use unaligned accesses.

Yours,
Robert Clausecker

-- 
()  ascii ribbon campaign - for an encoding-agnostic world
/\  - against html email  - against proprietary attachments

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch
  2024-02-29  0:57             ` Robert Clausecker
@ 2024-02-29  1:13               ` Pedro Falcato
  0 siblings, 0 replies; 10+ messages in thread
From: Pedro Falcato @ 2024-02-29  1:13 UTC (permalink / raw)
  To: musl

On Thu, Feb 29, 2024 at 12:58 AM Robert Clausecker <fuz@fuz.su> wrote:
>
> Greetings,
>
> Am Thu, Feb 29, 2024 at 12:10:05AM +0000 schrieb Thorsten Glaser:
> > Pedro Falcato dixit:
> >
> > >Small note: This isn't quite true for remotely modern x86, unaligned
> >
> > It’s very much true, e.g. it breaks atomicity (ok, not relevant
> > *here*, but in general).
> >
> > AIUI, even modern amd64 chips of all vendors are reverting to
> > optimising rep movsb/lodsb instead again, for stringops.
>
> That is not the case.  REP MOVSB and friends have a high startup latency,
> so you only want to use them for large-ish blocks.  Too large and all of
> the sudden AVX-512 is faster again.  For small blocks however, you do not
> want to use this instruction.  It's indeed much better to do a pair of
> overlapping stores.  They do not perform crazy well, but it's still better
> than all alternatives.

This. Plus the "FSRM" (fast short rep movsb) stuff is all fugazi - I'm
yet to see a microarchitecture where my GPR-only memcpy (which
resembles/resembled mjg@'s FreeBSD kernel memcpy, which you might've
read before) doesn't completely beat out rep movsb under ~256 bytes.
Oh n, and they're sometimes incredibly naive - the zen rep movsb's get
into an awful fallback mode (which I suspect is either byte-by-byte or
word-by-word) if you give it an unaligned buffer, it doesn't seem to
ever attempt to use wider stores.

Anyway, I digress, this is somewhat offtopic :)

-- 
Pedro

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch
  2024-02-28 23:14       ` Thorsten Glaser
  2024-02-29  0:02         ` Pedro Falcato
@ 2024-02-29  4:19         ` Markus Wichmann
  1 sibling, 0 replies; 10+ messages in thread
From: Markus Wichmann @ 2024-02-29  4:19 UTC (permalink / raw)
  To: musl; +Cc: James Tirta Halim

Am Wed, Feb 28, 2024 at 11:14:27PM +0000 schrieb Thorsten Glaser:
> Markus Wichmann dixit:
> >4. All the other musl C code avoids misaligned word access. I don't know
> >which architecture/ABI doesn't allow it, but it is nevertheless the
>
> Almost all of them. i386/amd64 penalise it heavily and it can cause
> trouble when crossing page sizes, cacheline sizes, etc. as well. On
> alpha, many ARM, SPARC, and others, it’s an instant SIGBUS/SIGSEGV.
>

For i386 and AMD64, musl already performs misaligned access at least in
the memset. I can't find a source for it right now, but I believe it was
tested to basically make no timing difference inside of a cache line, a
very minor difference across cache lines, and a bit of a spike across
pages. Of course, only one in 1024 word accesses crosses a page boundary
even if misaligned. The idea that misaligned memory access is slow on
x86 is a pervasive myth that people really need to get over.

Musl doesn't support either alpha or sparc. But according to [1], armv4
was actually worse than the description above: It would actually perform
some memory access, but at the address rounded down to a word boundary,
and then rotate the resulting value according to the low bits.

So that's a new thing to add to my list of horrors. Misaligned accesses
crashing I can at least deal with. Misaligned accesses returning the
wrong data is a thing that was not on my radar before. They fixed it in
armv7, though.

The reason I asked, BTW, was because I had always thought PowerPC to be
on the list of architectures that don't support misaligned access in the
baseline ISA, but actually, most PowerPC implementations do, and also
even the most general 32-bit PowerPC architecture description says that
an alignment interrupt may be triggered by an integer load/store
instruction only when crossing a page boundary. And otherwise there is
no issue.

Ciao,
Markus

[1] https://medium.com/@iLevex/the-curious-case-of-unaligned-access-on-arm-5dd0ebe24965

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-02-29  4:20 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-27 14:07 [musl] [PATCH] add memcmpeq: memcmp that returns length of first mismatch James Tirta Halim
2024-02-27 14:49 ` Rich Felker
2024-02-27 14:51   ` Rich Felker
2024-02-28  4:40     ` Markus Wichmann
2024-02-28 23:14       ` Thorsten Glaser
2024-02-29  0:02         ` Pedro Falcato
2024-02-29  0:10           ` Thorsten Glaser
2024-02-29  0:57             ` Robert Clausecker
2024-02-29  1:13               ` Pedro Falcato
2024-02-29  4:19         ` Markus Wichmann

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).