mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] strcasestr("", "") returns NULL
@ 2025-05-16  7:32 Nuno Cruces
  2025-05-16 12:50 ` Rich Felker
  0 siblings, 1 reply; 9+ messages in thread
From: Nuno Cruces @ 2025-05-16  7:32 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 360 bytes --]

Hi,

Currently strcasestr("", "") returns NULL which is inconsistent
with strstr("", "").

For strstr, the C standard specifies "If s2 points to a string with zero
length, the function returns s1."

strcasestr is a nonstandard extension, but to the best of my knowledge,
both glibc and the BSDs decide to be consistent with strstr in this case.

Regards,
Nuno

[-- Attachment #2: Type: text/html, Size: 540 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: strcasestr("", "") returns NULL
  2025-05-16  7:32 [musl] strcasestr("", "") returns NULL Nuno Cruces
@ 2025-05-16 12:50 ` Rich Felker
  2025-05-16 16:02   ` [musl] " Nuno Cruces
  0 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2025-05-16 12:50 UTC (permalink / raw)
  To: Nuno Cruces; +Cc: musl

On Fri, May 16, 2025 at 08:32:00AM +0100, Nuno Cruces wrote:
> Hi,
> 
> Currently strcasestr("", "") returns NULL which is inconsistent
> with strstr("", "").
> 
> For strstr, the C standard specifies "If s2 points to a string with zero
> length, the function returns s1."
> 
> strcasestr is a nonstandard extension, but to the best of my knowledge,
> both glibc and the BSDs decide to be consistent with strstr in this case.

Indeed, this should be fixed. Thanks for the report.

As an aside, strcasestr is an awful function we should probably never
have provided and that no one should use, that doesn't admit any
decent optimization. I believe it's vaguely possible to do the twoway
algorithm for the current ascii-only case equivalence strcasecmp does,
but I declined to investigate further or implement because if we ever
want to have strcasecmp do more, the effort would have been wasted (or
worse, would incentivize setting LC_ALL=C for performance purposes if
we left the code conditionally in place).

I'll write a fix and push it along with a big queue of stuff I didn't
realize had piled up.

Rich


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [musl] strcasestr("", "") returns NULL
  2025-05-16 12:50 ` Rich Felker
@ 2025-05-16 16:02   ` Nuno Cruces
  2025-05-16 16:09     ` Rich Felker
  0 siblings, 1 reply; 9+ messages in thread
From: Nuno Cruces @ 2025-05-16 16:02 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 1629 bytes --]

I don't know about it not allowing optimization.

The C/POSIX local does, as you wrote. And regular expression engines kinda
prove it's at least possible not to be quadratic on the length of the
haystack (although they spend a lot more effort preprocessing the needle).

But yeah, at a minimum any locale where case differences interfere with
random access invalidates all the classical algorithms.

Nuno

On Fri 16 May 2025, 13:50 Rich Felker, <dalias@libc.org> wrote:

> On Fri, May 16, 2025 at 08:32:00AM +0100, Nuno Cruces wrote:
> > Hi,
> >
> > Currently strcasestr("", "") returns NULL which is inconsistent
> > with strstr("", "").
> >
> > For strstr, the C standard specifies "If s2 points to a string with zero
> > length, the function returns s1."
> >
> > strcasestr is a nonstandard extension, but to the best of my knowledge,
> > both glibc and the BSDs decide to be consistent with strstr in this case.
>
> Indeed, this should be fixed. Thanks for the report.
>
> As an aside, strcasestr is an awful function we should probably never
> have provided and that no one should use, that doesn't admit any
> decent optimization. I believe it's vaguely possible to do the twoway
> algorithm for the current ascii-only case equivalence strcasecmp does,
> but I declined to investigate further or implement because if we ever
> want to have strcasecmp do more, the effort would have been wasted (or
> worse, would incentivize setting LC_ALL=C for performance purposes if
> we left the code conditionally in place).
>
> I'll write a fix and push it along with a big queue of stuff I didn't
> realize had piled up.
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 2302 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [musl] strcasestr("", "") returns NULL
  2025-05-16 16:02   ` [musl] " Nuno Cruces
@ 2025-05-16 16:09     ` Rich Felker
  2025-05-16 16:59       ` Nuno Cruces
  0 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2025-05-16 16:09 UTC (permalink / raw)
  To: Nuno Cruces; +Cc: musl

On Fri, May 16, 2025 at 05:02:25PM +0100, Nuno Cruces wrote:
> I don't know about it not allowing optimization.
> 
> The C/POSIX local does, as you wrote. And regular expression engines kinda
> prove it's at least possible not to be quadratic on the length of the
> haystack (although they spend a lot more effort preprocessing the needle).

Regular expression isn't O(1) in space so not a possible
implementation for strcasestr. And even if you do have a space budget,
the optimizations needed to make it not quadratic-in-time have very
large space tradeoffs. It's not the magic solution fans make it sound
like.

Rich

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [musl] strcasestr("", "") returns NULL
  2025-05-16 16:09     ` Rich Felker
@ 2025-05-16 16:59       ` Nuno Cruces
  2025-05-16 17:37         ` Rich Felker
  0 siblings, 1 reply; 9+ messages in thread
From: Nuno Cruces @ 2025-05-16 16:59 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 1034 bytes --]

I wasn't suggesting they'd be used, just that they kinda demonstrate the
problem is not absolutely intractable.
In any case, if you're interested, glibc seems to use two-way for
strcasestr, though I'm not sure if it supports every locale.
I haven't looked into the details.

Thanks a lot for musl!

Regards,
Nuno


On Fri, 16 May 2025 at 17:09, Rich Felker <dalias@libc.org> wrote:

> On Fri, May 16, 2025 at 05:02:25PM +0100, Nuno Cruces wrote:
> > I don't know about it not allowing optimization.
> >
> > The C/POSIX local does, as you wrote. And regular expression engines
> kinda
> > prove it's at least possible not to be quadratic on the length of the
> > haystack (although they spend a lot more effort preprocessing the
> needle).
>
> Regular expression isn't O(1) in space so not a possible
> implementation for strcasestr. And even if you do have a space budget,
> the optimizations needed to make it not quadratic-in-time have very
> large space tradeoffs. It's not the magic solution fans make it sound
> like.
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 1529 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: strcasestr("", "") returns NULL
  2025-05-16 16:59       ` Nuno Cruces
@ 2025-05-16 17:37         ` Rich Felker
  2025-05-17  6:49           ` Markus Wichmann
  0 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2025-05-16 17:37 UTC (permalink / raw)
  To: Nuno Cruces; +Cc: musl

On Fri, May 16, 2025 at 05:59:22PM +0100, Nuno Cruces wrote:
> I wasn't suggesting they'd be used, just that they kinda demonstrate the
> problem is not absolutely intractable.
> In any case, if you're interested, glibc seems to use two-way for
> strcasestr, though I'm not sure if it supports every locale.
> I haven't looked into the details.

Interesting -- I wonder how they do it. If it's just for singlebyte
locales that's entirely useless (they're long obsolete) but maybe they
have some way to make it work with variable character length? Naively
I would think lack of random access makes this hard if not impossible
to do efficiently.

Rich


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: strcasestr("", "") returns NULL
  2025-05-16 17:37         ` Rich Felker
@ 2025-05-17  6:49           ` Markus Wichmann
  2025-05-29 13:47             ` Gabriel Ravier
  0 siblings, 1 reply; 9+ messages in thread
From: Markus Wichmann @ 2025-05-17  6:49 UTC (permalink / raw)
  To: musl; +Cc: Nuno Cruces

Am Fri, May 16, 2025 at 01:37:41PM -0400 schrieb Rich Felker:
> On Fri, May 16, 2025 at 05:59:22PM +0100, Nuno Cruces wrote:
> > I wasn't suggesting they'd be used, just that they kinda demonstrate the
> > problem is not absolutely intractable.
> > In any case, if you're interested, glibc seems to use two-way for
> > strcasestr, though I'm not sure if it supports every locale.
> > I haven't looked into the details.
> 
> Interesting -- I wonder how they do it.

tolower(). That's how they do it. So yeah, it is just for single-byte
locales.

Ciao,
Markus


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: strcasestr("", "") returns NULL
  2025-05-17  6:49           ` Markus Wichmann
@ 2025-05-29 13:47             ` Gabriel Ravier
  2025-05-29 14:08               ` Rich Felker
  0 siblings, 1 reply; 9+ messages in thread
From: Gabriel Ravier @ 2025-05-29 13:47 UTC (permalink / raw)
  To: musl, Markus Wichmann; +Cc: Nuno Cruces, Rich Felker

On 5/17/25 8:49 AM, Markus Wichmann wrote:
> Am Fri, May 16, 2025 at 01:37:41PM -0400 schrieb Rich Felker:
>> On Fri, May 16, 2025 at 05:59:22PM +0100, Nuno Cruces wrote:
>>> I wasn't suggesting they'd be used, just that they kinda demonstrate the
>>> problem is not absolutely intractable.
>>> In any case, if you're interested, glibc seems to use two-way for
>>> strcasestr, though I'm not sure if it supports every locale.
>>> I haven't looked into the details.
>> Interesting -- I wonder how they do it.
> tolower(). That's how they do it. So yeah, it is just for single-byte
> locales.
>
> Ciao,
> Markus

Glibc's code for strcasestr has a comment that simply states:
 > This function gives unspecified results in multibyte locales.

So they don't even try to make it work in those conditions, yeah. I do 
wonder if it's any worse than e.g. strcasecmp though, given that in 
glibc, both strcasecmp and strcasestr simply directly use tolower on 
input characters to canonicalize them before usage in the function but 
otherwise don't do anything special to make the case-insensitive 
comparison work.
This is also the case in musl (for strcasecmp, and in practice for 
strcasestr, since it calls strncasecmp which behaves the same as 
strcasecmp w.r.t. how it does the actual comparison), so while I would 
be far more hesitant to directly think simple canonicalization with 
tolower would work with strcasestr than with strcasecmp (especially 
given glibc's comment in strcasestr on giving unspecified results with 
multibyte locales), I do wonder how strcasecmp itself works with 
multibyte locales in the first place... Does it ?

Then again, POSIX does state (about strcasecmp and strncasecmp):
 > When the LC_CTYPE category of the locale being used is from the POSIX 
locale, these functions shall behave as if the strings had been 
converted to lowercase and then a byte comparison performed. Otherwise, 
the results are unspecified.

So if strcasestr was added to POSIX at some point, I would expect POSIX 
to state the same thing about it (anything else would make basically no 
sense unless they also change strcasecmp and strncasecmp to work with 
multibyte locales) - perhaps that implies these functions should be 
implemented as if only single-byte locales matter ?




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: strcasestr("", "") returns NULL
  2025-05-29 13:47             ` Gabriel Ravier
@ 2025-05-29 14:08               ` Rich Felker
  0 siblings, 0 replies; 9+ messages in thread
From: Rich Felker @ 2025-05-29 14:08 UTC (permalink / raw)
  To: Gabriel Ravier; +Cc: musl, Markus Wichmann, Nuno Cruces

On Thu, May 29, 2025 at 03:47:06PM +0200, Gabriel Ravier wrote:
> On 5/17/25 8:49 AM, Markus Wichmann wrote:
> > Am Fri, May 16, 2025 at 01:37:41PM -0400 schrieb Rich Felker:
> > > On Fri, May 16, 2025 at 05:59:22PM +0100, Nuno Cruces wrote:
> > > > I wasn't suggesting they'd be used, just that they kinda demonstrate the
> > > > problem is not absolutely intractable.
> > > > In any case, if you're interested, glibc seems to use two-way for
> > > > strcasestr, though I'm not sure if it supports every locale.
> > > > I haven't looked into the details.
> > > Interesting -- I wonder how they do it.
> > tolower(). That's how they do it. So yeah, it is just for single-byte
> > locales.
> > 
> > Ciao,
> > Markus
> 
> Glibc's code for strcasestr has a comment that simply states:
> > This function gives unspecified results in multibyte locales.

Yes, I'm well aware that the underlying strcasecmp is unspecified in
other locales. My mistake in opening that can of worms was what
brought us the new POSIX requirement for the byte-based C locale...

> So they don't even try to make it work in those conditions, yeah. I do
> wonder if it's any worse than e.g. strcasecmp though, given that in glibc,
> both strcasecmp and strcasestr simply directly use tolower on input
> characters to canonicalize them before usage in the function but otherwise
> don't do anything special to make the case-insensitive comparison work.
> This is also the case in musl (for strcasecmp, and in practice for
> strcasestr, since it calls strncasecmp which behaves the same as strcasecmp
> w.r.t. how it does the actual comparison), so while I would be far more
> hesitant to directly think simple canonicalization with tolower would work
> with strcasestr than with strcasecmp (especially given glibc's comment in
> strcasestr on giving unspecified results with multibyte locales), I do
> wonder how strcasecmp itself works with multibyte locales in the first
> place... Does it ?

It doesn't.

> Then again, POSIX does state (about strcasecmp and strncasecmp):
> > When the LC_CTYPE category of the locale being used is from the POSIX
> locale, these functions shall behave as if the strings had been converted to
> lowercase and then a byte comparison performed. Otherwise, the results are
> unspecified.
> 
> So if strcasestr was added to POSIX at some point, I would expect POSIX to
> state the same thing about it (anything else would make basically no sense
> unless they also change strcasecmp and strncasecmp to work with multibyte
> locales) - perhaps that implies these functions should be implemented as if
> only single-byte locales matter ?

POSIX does not specify them for single-byte locales in general, *only*
for the C/POSIX locale. Basically, these functions just should not be
used. They are not compatible with multilingual text support, and
they're not even safe to use with ASCII-only input unless your
application rejects the locale system entirely, since just having
LC_CTYPE set to the user's locale may cause them not to work *even on
ASCII bytes*.

Rich


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-05-29 14:08 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-16  7:32 [musl] strcasestr("", "") returns NULL Nuno Cruces
2025-05-16 12:50 ` Rich Felker
2025-05-16 16:02   ` [musl] " Nuno Cruces
2025-05-16 16:09     ` Rich Felker
2025-05-16 16:59       ` Nuno Cruces
2025-05-16 17:37         ` Rich Felker
2025-05-17  6:49           ` Markus Wichmann
2025-05-29 13:47             ` Gabriel Ravier
2025-05-29 14:08               ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).