mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] strcmp() guarantees and assumptions
@ 2023-07-16 17:22 Robert Clausecker
  2023-07-16 17:49 ` NRK
  0 siblings, 1 reply; 8+ messages in thread
From: Robert Clausecker @ 2023-07-16 17:22 UTC (permalink / raw)
  To: musl; +Cc: mjg

Greetings,

I am currently developing SIMD-enhanced implementations of libc
functions for the FreeBSD libc.  One of the next functions I want to
tackle is strcmp().  There, the following question obtains:

    Is strcmp() permitted to assume that its arguments are NUL
    terminated strings?

Or to phrase it differently, is the following a legal implementation of
strcmp()?

    int strcmp(char *a, char *b) {
    	size_t la = strlen(a), lb = strlen(b);

    	if (la != lb)
    		return ((la > lb) - (lb > la));

    	return memcmp(a, b, la);
    }

A situation I dimly recall where this assumption did not hold was in a
program that used strcmp() to compare two buffers known to have a
mismatch somewhere, but without guaranteed NUL termination.  A naïve
strcmp() implementation processed this just fine, but this one might
crash.

I have previously asked the ISO/IEC 9899:2023 editor [1] who indicated
that he believes my interpretation to be correct, but asked me to look
for a second opinion.

Assuming that my assumption on strcmp() is correct, is this an
assumption common libc implementations make?  Or is it generally
agreed upon that libc implementations support strcmp() calls on
unterminated strings?

Thank you for your help.

Yours,
Robert Clausecker

[1]: https://twitter.com/__phantomderp/status/1680614038567354370

-- 
()  ascii ribbon campaign - for an 8-bit clean world 
/\  - against html email  - against proprietary attachments

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] strcmp() guarantees and assumptions
  2023-07-16 17:22 [musl] strcmp() guarantees and assumptions Robert Clausecker
@ 2023-07-16 17:49 ` NRK
  2023-07-16 17:59   ` Robert Clausecker
  0 siblings, 1 reply; 8+ messages in thread
From: NRK @ 2023-07-16 17:49 UTC (permalink / raw)
  To: musl

Hi Robert,

> Or to phrase it differently, is the following a legal implementation of
> strcmp()?
> 
>     int strcmp(char *a, char *b) {
>     	size_t la = strlen(a), lb = strlen(b);
> 
>     	if (la != lb)
>     		return ((la > lb) - (lb > la));
      		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I don't see how this can ever be a valid strcmp implementation. The
return value of the comparison functions must be about the first
mismatching byte, not about the string lengths.

| The sign of a nonzero value returned by the comparison functions is
| determined by the sign of the difference between the values of the
| first pair of characters that differ in the objects being compared.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ref: https://port70.net/~nsz/c/c11/n1570.html#7.24.4p1

> Or is it generally agreed upon that libc implementations support
> strcmp() calls on unterminated strings?

memchr (since C11) has the following requirement:

| The implementation shall behave as if it reads the characters
| sequentially and stops as soon as a matching character is found.

I don't believe any such requirement exists for strcmp, so unless
someone proves otherwise, I'd say it's fair game for libc to assume that
the strings are nul-terminated.

Moreover strcmp's description states the following:

| The strcmp function compares the string pointed to by s1 to the string pointed to by s2.
                                   ^^^^^^                         ^^^^^^

And "string" according to the C standard is always nul-terminated.

- NRK

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] strcmp() guarantees and assumptions
  2023-07-16 17:49 ` NRK
@ 2023-07-16 17:59   ` Robert Clausecker
  2023-07-16 19:24     ` Pedro Falcato
  2023-07-16 19:33     ` Markus Wichmann
  0 siblings, 2 replies; 8+ messages in thread
From: Robert Clausecker @ 2023-07-16 17:59 UTC (permalink / raw)
  To: musl

Hi NRK,

Thank you for your response.

Am Sun, Jul 16, 2023 at 11:49:45PM +0600 schrieb NRK:
> Hi Robert,
> 
> > Or to phrase it differently, is the following a legal implementation of
> > strcmp()?
> > 
> >     int strcmp(char *a, char *b) {
> >     	size_t la = strlen(a), lb = strlen(b);
> > 
> >     	if (la != lb)
> >     		return ((la > lb) - (lb > la));
>       		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> I don't see how this can ever be a valid strcmp implementation. The
> return value of the comparison functions must be about the first
> mismatching byte, not about the string lengths.
> 
> | The sign of a nonzero value returned by the comparison functions is
> | determined by the sign of the difference between the values of the
> | first pair of characters that differ in the objects being compared.
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes, sorry.  The code would have to be extended to call memcmp() on the
common prefix in case there is a mismatch in length.  E.g.

    if (la != lb)
        return (memcmp(la, lb, la > lb ? lb + 1 : la + 1));

> ref: https://port70.net/~nsz/c/c11/n1570.html#7.24.4p1
> 
> > Or is it generally agreed upon that libc implementations support
> > strcmp() calls on unterminated strings?
> 
> memchr (since C11) has the following requirement:
> 
> | The implementation shall behave as if it reads the characters
> | sequentially and stops as soon as a matching character is found.
> 
> I don't believe any such requirement exists for strcmp, so unless
> someone proves otherwise, I'd say it's fair game for libc to assume that
> the strings are nul-terminated.

That's good to hear.  Any idea on the “what do existing libc
implementations permit” bit?

Yours,
Robert Clausecker

-- 
()  ascii ribbon campaign - for an 8-bit clean world 
/\  - against html email  - against proprietary attachments

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] strcmp() guarantees and assumptions
  2023-07-16 17:59   ` Robert Clausecker
@ 2023-07-16 19:24     ` Pedro Falcato
  2023-07-16 19:29       ` Pedro Falcato
  2023-07-16 19:33     ` Markus Wichmann
  1 sibling, 1 reply; 8+ messages in thread
From: Pedro Falcato @ 2023-07-16 19:24 UTC (permalink / raw)
  To: musl

On Sun, Jul 16, 2023 at 7:00 PM Robert Clausecker <fuz@fuz.su> wrote:
>
> Hi NRK,
>
> Thank you for your response.
>
> Am Sun, Jul 16, 2023 at 11:49:45PM +0600 schrieb NRK:
> > Hi Robert,
> >
> > > Or to phrase it differently, is the following a legal implementation of
> > > strcmp()?
> > >
> > >     int strcmp(char *a, char *b) {
> > >             size_t la = strlen(a), lb = strlen(b);
> > >
> > >             if (la != lb)
> > >                     return ((la > lb) - (lb > la));
> >                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > I don't see how this can ever be a valid strcmp implementation. The
> > return value of the comparison functions must be about the first
> > mismatching byte, not about the string lengths.
> >
> > | The sign of a nonzero value returned by the comparison functions is
> > | determined by the sign of the difference between the values of the
> > | first pair of characters that differ in the objects being compared.
> >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Yes, sorry.  The code would have to be extended to call memcmp() on the
> common prefix in case there is a mismatch in length.  E.g.
>
>     if (la != lb)
>         return (memcmp(la, lb, la > lb ? lb + 1 : la + 1));
>
> > ref: https://port70.net/~nsz/c/c11/n1570.html#7.24.4p1
> >
> > > Or is it generally agreed upon that libc implementations support
> > > strcmp() calls on unterminated strings?
> >
> > memchr (since C11) has the following requirement:
> >
> > | The implementation shall behave as if it reads the characters
> > | sequentially and stops as soon as a matching character is found.
> >
> > I don't believe any such requirement exists for strcmp, so unless
> > someone proves otherwise, I'd say it's fair game for libc to assume that
> > the strings are nul-terminated.
>
> That's good to hear.  Any idea on the “what do existing libc
> implementations permit” bit?

Looks like it's permissive.
At the moment, musl does (non-SIMD, obviously) unsigned long loads *as
long as they're aligned* (you don't want to page fault! and reads
don't have side effects unless it's MMIO or something, and that's
non-standard) and does standard(tm) bit tricks to find null bytes in
that same word.

-- 
Pedro

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] strcmp() guarantees and assumptions
  2023-07-16 19:24     ` Pedro Falcato
@ 2023-07-16 19:29       ` Pedro Falcato
  0 siblings, 0 replies; 8+ messages in thread
From: Pedro Falcato @ 2023-07-16 19:29 UTC (permalink / raw)
  To: musl

On Sun, Jul 16, 2023 at 8:24 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Sun, Jul 16, 2023 at 7:00 PM Robert Clausecker <fuz@fuz.su> wrote:
> >
> > Hi NRK,
> >
> > Thank you for your response.
> >
> > Am Sun, Jul 16, 2023 at 11:49:45PM +0600 schrieb NRK:
> > > Hi Robert,
> > >
> > > > Or to phrase it differently, is the following a legal implementation of
> > > > strcmp()?
> > > >
> > > >     int strcmp(char *a, char *b) {
> > > >             size_t la = strlen(a), lb = strlen(b);
> > > >
> > > >             if (la != lb)
> > > >                     return ((la > lb) - (lb > la));
> > >                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > I don't see how this can ever be a valid strcmp implementation. The
> > > return value of the comparison functions must be about the first
> > > mismatching byte, not about the string lengths.
> > >
> > > | The sign of a nonzero value returned by the comparison functions is
> > > | determined by the sign of the difference between the values of the
> > > | first pair of characters that differ in the objects being compared.
> > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Yes, sorry.  The code would have to be extended to call memcmp() on the
> > common prefix in case there is a mismatch in length.  E.g.
> >
> >     if (la != lb)
> >         return (memcmp(la, lb, la > lb ? lb + 1 : la + 1));
> >
> > > ref: https://port70.net/~nsz/c/c11/n1570.html#7.24.4p1
> > >
> > > > Or is it generally agreed upon that libc implementations support
> > > > strcmp() calls on unterminated strings?
> > >
> > > memchr (since C11) has the following requirement:
> > >
> > > | The implementation shall behave as if it reads the characters
> > > | sequentially and stops as soon as a matching character is found.
> > >
> > > I don't believe any such requirement exists for strcmp, so unless
> > > someone proves otherwise, I'd say it's fair game for libc to assume that
> > > the strings are nul-terminated.
> >
> > That's good to hear.  Any idea on the “what do existing libc
> > implementations permit” bit?
>
> Looks like it's permissive.
> At the moment, musl does (non-SIMD, obviously) unsigned long loads *as
> long as they're aligned* (you don't want to page fault! and reads
> don't have side effects unless it's MMIO or something, and that's
> non-standard) and does standard(tm) bit tricks to find null bytes in
> that same word.

Oops, sorry, had a brainfart there and misread your strcmp as strlen.
In any case, it is AFAIK permissive as you could tell from
implementations such as bionic's ssse3-strcmp-atom.S.

-- 
Pedro

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] strcmp() guarantees and assumptions
  2023-07-16 17:59   ` Robert Clausecker
  2023-07-16 19:24     ` Pedro Falcato
@ 2023-07-16 19:33     ` Markus Wichmann
  2023-07-16 21:13       ` Robert Clausecker
  1 sibling, 1 reply; 8+ messages in thread
From: Markus Wichmann @ 2023-07-16 19:33 UTC (permalink / raw)
  To: musl

Am Sun, Jul 16, 2023 at 07:59:57PM +0200 schrieb Robert Clausecker:
> That's good to hear.  Any idea on the “what do existing libc
> implementations permit” bit?
>

So I quickly checked musl, dietlibc, bionic, and glibc, and
unsurprisingly, all of the implementations I looked at allow the strings
to be unterminated if they mismatch before access becomes restricted.
This is, of course, an implementation detail that applications must not
rely on, but it nevertheless is the case.

The problem in your implementation is that the calls to strlen() will
iterate over both input strings to the end, causing basically a cache
flush for large inputs, only to then iterate over both inputs a second
time. Iterating only once is a major benefit, since it avoids half of
the cache misses.

Also, glibc already has an SSE strcmp implementation you may want to
look at.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] strcmp() guarantees and assumptions
  2023-07-16 19:33     ` Markus Wichmann
@ 2023-07-16 21:13       ` Robert Clausecker
  2023-07-17 16:22         ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 8+ messages in thread
From: Robert Clausecker @ 2023-07-16 21:13 UTC (permalink / raw)
  To: musl

Hi Markus,

Am Sun, Jul 16, 2023 at 09:33:16PM +0200 schrieb Markus Wichmann:
> Am Sun, Jul 16, 2023 at 07:59:57PM +0200 schrieb Robert Clausecker:
> > That's good to hear.  Any idea on the “what do existing libc
> > implementations permit” bit?
> >
> 
> So I quickly checked musl, dietlibc, bionic, and glibc, and
> unsurprisingly, all of the implementations I looked at allow the strings
> to be unterminated if they mismatch before access becomes restricted.
> This is, of course, an implementation detail that applications must not
> rely on, but it nevertheless is the case.
> 
> The problem in your implementation is that the calls to strlen() will
> iterate over both input strings to the end, causing basically a cache
> flush for large inputs, only to then iterate over both inputs a second
> time. Iterating only once is a major benefit, since it avoids half of
> the cache misses.

Of course.  This was merely a simple example to demonstrate the general
point.  I of course do not plan to do anything like that.

> Also, glibc already has an SSE strcmp implementation you may want to
> look at.

I'm not going to look at glibc as it's LGPL licensed.  I am aware of the
Intel implementation, but I don't like that it has to duplicate the code
16 times for each possible misalignment pattern.  Without having to
ensure that a cacheline of data is only touched once we confirm there
is no previous mismatch, it might be possible to write simpler code, but
I'm currently not entirely sure how.

> Ciao,
> Markus

Yours,
Robert Clausecker

-- 
()  ascii ribbon campaign - for an 8-bit clean world 
/\  - against html email  - against proprietary attachments

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] strcmp() guarantees and assumptions
  2023-07-16 21:13       ` Robert Clausecker
@ 2023-07-17 16:22         ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 8+ messages in thread
From: Adhemerval Zanella Netto @ 2023-07-17 16:22 UTC (permalink / raw)
  To: musl, Robert Clausecker



On 16/07/23 18:13, Robert Clausecker wrote:
> Hi Markus,
> 
>> Also, glibc already has an SSE strcmp implementation you may want to
>> look at.
> 
> I'm not going to look at glibc as it's LGPL licensed.  I am aware of the
> Intel implementation, but I don't like that it has to duplicate the code
> 16 times for each possible misalignment pattern.  Without having to
> ensure that a cacheline of data is only touched once we confirm there
> is no previous mismatch, it might be possible to write simpler code, but
> I'm currently not entirely sure how.

On glibc side we have discussed the requirement of null-terminated for
string functions on some occasions [1][2][3] and we leaned toward allows
non-null terminated buffers for strncmp (where implementations must stop
are either first null or the at end of the buffer) and require 
null-terminated buffers for strcmp.

[1] https://sourceware.org/pipermail/libc-alpha/2023-February/145840.html
[2] https://sourceware.org/pipermail/libc-alpha/2023-February/145860.html
[3] https://sourceware.org/pipermail/libc-alpha/2023-February/145907.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-07-17 16:22 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-16 17:22 [musl] strcmp() guarantees and assumptions Robert Clausecker
2023-07-16 17:49 ` NRK
2023-07-16 17:59   ` Robert Clausecker
2023-07-16 19:24     ` Pedro Falcato
2023-07-16 19:29       ` Pedro Falcato
2023-07-16 19:33     ` Markus Wichmann
2023-07-16 21:13       ` Robert Clausecker
2023-07-17 16:22         ` Adhemerval Zanella Netto

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).