mailing list of musl libc
* On current (and future) use of LCTRANS
@ 2025-06-14 22:38 Pablo Correa Gomez
  2025-06-15  6:50 ` Markus Wichmann
  0 siblings, 1 reply; 3+ messages in thread
From: Pablo Correa Gomez @ 2025-06-14 22:38 UTC (permalink / raw)
  To: musl

Hi all,

As a follow-up from https://www.openwall.com/lists/musl/2025/06/02/2 I
have been looking at the different places where we currently do
translations. Those basically come out as the places where LCTRANS and
LCTRANS_CUR macros are used. There was one single place where the macro
was not used, addressed by
https://www.openwall.com/lists/musl/2025/06/02/1 so the discussion
assumes that patch will be merged.

We basically have two different kinds of things that use these macros:

* Static strings:
  * Hard-coded error messages in src/errno/__strerror.h
  * Hard-coded human-readable signal names in src/string/strsignal.c
  * Hard-coded error messages in src/regex/regerror.c
  * Hard-coded error messages in src/network/hstrerror.c
  * Hard-coded error messages in src/network/gai_strerror.c
* Actual translations. These basically happen in src/locale/langinfo.c,
  exposed as the public functions nl_langinfo{_l} but also used
  internally in several places:
  * strftime{_l}
  * wcsftime{_l}
  * asctime{_r} and in consequence ctime{_r}
  * strptime
  * catopen

Unfortunately, there are quite a few functions that currently
ignore the locale passed to them:

* is{w}alnum_l
* is{w}alpha_l
* is{w}blank_l
* is{w}cntrl_l
* is{w}digit_l
* is{w}graph_l
* is{w}lower_l
* is{w}print_l
* is{w}punct_l
* is{w}space_l
* is{w}upper_l
* is{w}xdigit_l
* iswctype_l
* strfmon_l
* to{w}lower_l
* to{w}upper_l
* towctrans_l

We might be able to go without using locales in some of them (like
isdigit), but we certainly cannot with others that currently use ASCII
codes where letters in other alphabets don't fit.

In addition, we have some functions related to collation where this is
also ignored:

* {wcs,wcsn,str,strn}casecmp_l
* {wcs,str}coll_l
* {wcs,str}xfrm_l
* wctrans_l
* wctype_l

In addition to this, we have the RADIXCHAR, which we hard-code in many
places while doing transformations. Finding the exact places where it
has to be implemented might be more tricky, but a non-exhaustive list:

* vstrfmon_l (internal, used by the strfmon family)
* fmt_fp (internal, used by the printf family)
* dec_float, hex_float (internal, used by the floatscan family)
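
For reference, this is roughly what honoring RADIXCHAR looks like from the
application side. It is only a sketch: sketch_format_fixed is a made-up
helper, not anything in musl, and it queries the radix through the public
nl_langinfo() rather than through whatever internal path the libc would use.

```c
#include <langinfo.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format "whole<radix>frac" with two fractional
 * digits, taking the radix string from the current locale instead of
 * hard-coding '.'. */
static int sketch_format_fixed(char *buf, size_t n, long whole, unsigned frac2)
{
	const char *radix = nl_langinfo(RADIXCHAR);
	return snprintf(buf, n, "%ld%s%02u", whole, radix, frac2);
}
```

Note that the radix comes back as a string, not a single char, which is one
reason callers cannot safely assume how many bytes it occupies.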

Hopefully, this should be a good start to a discussion on
implementation details, and things we shouldn't care about.

Best,
Pablo.



* Re: On current (and future) use of LCTRANS
  2025-06-14 22:38 On current (and future) use of LCTRANS Pablo Correa Gomez
@ 2025-06-15  6:50 ` Markus Wichmann
  2025-06-15 21:42   ` [musl] " Rich Felker
  0 siblings, 1 reply; 3+ messages in thread
From: Markus Wichmann @ 2025-06-15  6:50 UTC (permalink / raw)
  To: musl

On Sun, Jun 15, 2025 at 12:38:46AM +0200, Pablo Correa Gomez wrote:
> Unfortunately, there are quite some many functions that currently
> ignore locales when being passed to them:
> 

I am wondering why you care so much about the _l variants of these
functions. IMO, the only sensible implementation of these is:

1. if the base function is locale-independent, ignore the locale
   argument.
2. otherwise, wrap the base function in uselocale() calls.

The main functionality should stay in the base function. And that second
one then also shows how to avoid the need for any more _l functions in
future.
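
A minimal sketch of option 2, assuming a POSIX 2008 system. The name
sketch_strcoll_l is made up for illustration; strcoll() merely stands in
for any locale-dependent base function.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <string.h>

/* Hypothetical wrapper: run the base function with loc installed as the
 * calling thread's locale, then put the previous locale back. */
static int sketch_strcoll_l(const char *a, const char *b, locale_t loc)
{
	locale_t old = uselocale(loc); /* returns the previously active locale */
	int r = strcoll(a, b);
	uselocale(old);                /* restore it before returning */
	return r;
}
```

The same three-line pattern works for any base function, which is why no
further _l entry points would be needed.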

> * is{w}alnum_l
> * is{w}alpha_l
> * is{w}blank_l
> * is{w}cntrl_l
> * is{w}digit_l
> * is{w}graph_l
> * is{w}lower_l
> * is{w}print_l
> * is{w}punct_l
> * is{w}space_l
> * is{w}upper_l
> * is{w}xdigit_l
> * iswctype_l
> * strfmon_l
> * to{w}lower_l
> * to{w}upper_l
> * towctrans_l
> 

Isn't that by design? Except for strfmon, these are all the ctype and
wctype functions. Those should remain locale independent, shouldn't
they? musl only supports two runtime codesets, namely ASCII and UTF-8,
and only one wide character codeset, namely Unicode. (And ASCII only on
sufferance, because POSIX decided to require MB_CUR_MAX in the POSIX
locale to be 1).

So the ctype functions can only ever be their ASCII versions, because
the ASCII bytes are the only possible single-byte characters (in ASCII
mode, the high bytes aren't characters, and in UTF-8 mode, the high
bytes aren't complete characters), and the wctype functions can only
ever be their Unicode versions, and no locale can ever change that.
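
To make that concrete, here is a sketch of an ASCII-only classifier. The
name sketch_isalpha is hypothetical, and this is not claimed to be musl's
exact source, only an instance of the idea that nothing above 0x7f can
ever be classified as a letter by a byte-based function.

```c
/* ASCII-only alphabetic test. OR-ing in 0x20 folds 'A'..'Z' onto
 * 'a'..'z'; the unsigned subtraction then makes every value outside
 * that range (high bytes and EOF included) compare >= 26. */
static int sketch_isalpha(int c)
{
	return ((unsigned)c | 32) - 'a' < 26;
}
```

There is no locale input anywhere in it, so an _l variant has nothing to
consult.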

Support for other character sets is relegated to the iconv() API, which
can convert anything else into the only sensible choice, UTF-8.

Also note that
    - is{,w}digit can never be changed as per POSIX
    - is{,w}xdigit can only be changed for the alphabetic characters,
      but then the number parsing functions have to be changed to be
      consistent.

> We might be able to go without using locales in some of them (like
> isdigit), but we certainly cannot with others that currently use ASCII
> codes where letters in other alphabets don't fit.
> 
> In addition, we have some functions related to collation where this is
> also ignored:
> 

Well, this is because for now, musl has only been using codepoint
collation in all locales. But that is what Rich is currently working on,
isn't it? I don't know how locale-independent the result will be.

> * {wcs,wcsn,str,strn}casecmp_l
> * {wcs,str}coll_l
> * {wcs,str}xfrm_l
> * wctrans_l
> * wctype_l
> 

strcasecmp() is a bad API that is underspecified for multibyte codesets.
Its specification allows, but does not require, an implementation to
perform the case mapping in wide-character space, and to deal with
encoding errors by returning an error, without specifying how to do so.
For this reason, applications cannot expect any sensible behaviour out
of it as soon as the input strays from ASCII, and so any locale
dependency is just moot.

Also, the casecmp and wctrans functions deal with case mapping, which is
different from collation, isn't it? And very definitely locale
independent, except possibly for the multibyte↔widechar conversion.

> In addition to this, we have the RADIXCHAR, which we hard-code in many
> places while doing transformations. Finding the exact places where it
> has to be implemented might be more tricky, but a non-exhaustive list:
> 
> * vstrfmon_l (internal used by strfmon family)
> * fmt_ft (internal used by printf family)
> * dec_float,hex_float (internal used by floatscan family)
> 

Rich has in the past expressed his wish to at most allow one other radix
character, namely the comma, and I support that. Having RADIXCHAR be
freely definable makes things way overcomplicated for no real gain.
Imagine someone setting RADIXCHAR to '\r'.

And I think those are all the places where float can be converted to a
string or vice versa.

Ciao,
Markus



* Re: [musl] On current (and future) use of LCTRANS
  2025-06-15  6:50 ` Markus Wichmann
@ 2025-06-15 21:42   ` Rich Felker
  0 siblings, 0 replies; 3+ messages in thread
From: Rich Felker @ 2025-06-15 21:42 UTC (permalink / raw)
  To: Markus Wichmann; +Cc: musl, Pablo Correa Gomez

On Sun, Jun 15, 2025 at 08:50:21AM +0200, Markus Wichmann wrote:
> On Sun, Jun 15, 2025 at 12:38:46AM +0200, Pablo Correa Gomez wrote:
> > Unfortunately, there are quite some many functions that currently
> > ignore locales when being passed to them:
> > 
> 
> I am wondering why you care so much about the _l variants of these
> functions. IMO, the only sensible implementation of these is:
> 
> 1. if the base function is locale-independent, ignore the locale
>    argument.
> 2. otherwise, wrap the base function in uselocale() calls.

I'm not opposed to making the _l version of some functions primary if
it's cleaner to implement that way, especially if the non-_l version
is not sufficiently a hot-path that the extra call frame wrapping the
_l version is going to matter. For example we're already doing this
with strftime.

There are two main reasons it's the other way around (non-_l primary)
for most things now:

- When the locale_t argument is going to be ignored anyway (because
  the functionality doesn't vary by locale), the call from the _l
  version to the non-_l version can always be a tail call (simple
  jump), but the call from the non-_l version to the _l version would
  require a whole call frame setup on archs where args are passed in
  on the stack or in registers but with a requirement to reserve spill
  space on the stack.

- For functions in the C standard, the _l version is outside the
  reserved namespace, so we need to introduce a namespace-safe symbol
  to call the _l version by if the non-_l version is going to call the
  _l version.

But both of these are issues that can be dealt with if there's a good
reason to do one the other way.
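
A rough sketch of the two directions, with made-up names throughout. The
toy struct sketch_locale and the sketch_* functions are assumptions for
illustration only; in a real libc the internal entry point would carry a
reserved-namespace name rather than a sketch_ prefix.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <locale.h>

/* Direction 1 (non-_l primary): the locale is ignored, so the call to
 * the base function sits in tail position and can compile to a jump. */
int sketch_toupper_l(int c, locale_t loc)
{
	(void)loc;
	return toupper(c);
}

/* Direction 2 (_l primary), shown with a toy locale type. */
struct sketch_locale { char radix; };
static const struct sketch_locale sketch_current = { '.' };

/* Stand-in for the namespace-safe internal symbol doing the real work. */
static char sketch_impl_radix_l(const struct sketch_locale *loc)
{
	return loc->radix;
}

/* The non-_l function has to add an argument (the current locale)
 * before calling, which may cost a real call frame on some ABIs. */
char sketch_radix(void)
{
	return sketch_impl_radix_l(&sketch_current);
}
```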

> The main functionality should stay in the base function. And that second
> one then also shows how to avoid the need for any more _l functions in
> future.
> 
> > * is{w}alnum_l
> > * is{w}alpha_l
> > * is{w}blank_l
> > * is{w}cntrl_l
> > * is{w}digit_l
> > * is{w}graph_l
> > * is{w}lower_l
> > * is{w}print_l
> > * is{w}punct_l
> > * is{w}space_l
> > * is{w}upper_l
> > * is{w}xdigit_l
> > * iswctype_l
> > * strfmon_l
> > * to{w}lower_l
> > * to{w}upper_l
> > * towctrans_l
> > 
> 
> Isn't that by design? Except for strfmon, these are all the ctype and
> wctype functions. Those should remain locale independent, shouldn't
> they? musl only supports two runtime codesets, namely ASCII and UTF-8,
> and only one wide character codeset, namely Unicode. (And ASCII only on
> sufferance, because POSIX decided to require MB_CUR_MAX in the POSIX
> locale to be 1).
> 
> So the ctype functions can only ever be their ASCII versions, because
> the ASCII bytes are the only possible single-byte characters (in ASCII
> mode, the high bytes aren't characters, and in UTF-8 mode, the high
> bytes aren't complete characters), and the wctype functions can only
> ever be their Unicode versions, and no locale can ever change that.

Indeed, the reasons these don't do anything with the locale_t argument
and that they weren't included in the plan for the locale overhaul is
that they don't have anything locale-specific to do.

The mapping of the byte-based C locale, to map high bytes onto wchar_t
values that are not valid Unicode Scalar Values, was chosen
intentionally so that the isw*() functions do not classify them as
falling into any of the classes and the tow*() functions don't map
them. This was part of the design discussion when the byte-based C
locale was begrudgingly added.

Other than that, the encoding is always UTF-8, and character identity
or classification is not locale-specific.
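
As an illustration of that mapping, here is a sketch. The specific target
range 0xdf80..0xdfff is stated from memory of the musl sources and should
be treated as an assumption, as should the sketch_* names; the point is
only that the results land among the surrogates, which are not Unicode
Scalar Values.

```c
/* Assumed mapping: a high byte b (0x80..0xff) becomes 0xdf80 + (b - 0x80).
 * Sign-extending the byte and masking with 0xdfff yields exactly that. */
static unsigned sketch_codeunit(unsigned char b)
{
	return 0xdfffu & (unsigned)(signed char)b;
}

/* A Unicode Scalar Value is any code point up to 0x10ffff that is not
 * a surrogate (0xd800..0xdfff). */
static int sketch_is_scalar(unsigned long c)
{
	return c < 0xd800 || c - 0xe000 < 0x110000 - 0xe000;
}
```

Since every mapped value fails sketch_is_scalar, a Unicode-based isw*()
table has no entry for it and classifies it as nothing.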

> > We might be able to go without using locales in some of them (like
> > isdigit), but we certainly cannot with others that currently use ASCII
> > codes where letters in other alphabets don't fit.
> > 
> > In addition, we have some functions related to collation where this is
> > also ignored:
> > 
> 
> Well, this is because for now, musl has only been using codepoint
> collation in all locales. But that is what Rich is currently working on,
> isn't it? I don't know how locale-independent the result will be.
> 
> > * {wcs,wcsn,str,strn}casecmp_l
> > * {wcs,str}coll_l
> > * {wcs,str}xfrm_l
> > * wctrans_l
> > * wctype_l
> 
> strcasecmp() is a bad API that is underspecified for multibyte codesets.
> Its specification allows, but does not require, an implementation to
> perform the case mapping in wide-character space, and to deal with
> encoding errors by returning an error, without specifying how to do so.
> For this reason, applications cannot expect any sensible behaviour out
> of it as soon as the input strays from ASCII, and so any locale
> dependency is just moot.
> 
> Also, the casecmp and wctrans functions deal with case mapping, which is
> different from collation, isn't it? And very definitely locale
> independent, except possibly for the multibyte↔widechar conversion.

strcasecmp is indeed underspecified, and I'm not aware of any systems
that implement it in a way that would do something useful. At least glibc
certainly does not. Last I checked, they implement it with tolower or
toupper, which was only meaningful in the pre-UTF-8 era.

The wctype/wctrans functions, like the isw*/tow* functions above,
already work and are just not locale-specific for the same reason.

> > In addition to this, we have the RADIXCHAR, which we hard-code in many
> > places while doing transformations. Finding the exact places where it
> > has to be implemented might be more tricky, but a non-exhaustive list:
> > 
> > * vstrfmon_l (internal used by strfmon family)
> > * fmt_ft (internal used by printf family)
> > * dec_float,hex_float (internal used by floatscan family)
> > 
> 
> Rich has in the past expressed his wish to at most allow one other radix
> character, namely the comma, and I support that. Having RADIXCHAR be
> freely definable makes things way overcomplicated for no real gain.
> Imagine someone setting RADIXCHAR to '\r'.

Yes. The localedef system allowing arbitrary radixchar is dangerously
underspecified. '\r' isn't really special except for ugly and
misleading presentation, but things like setting it to '0' would be
much worse, breaking invariants about round-tripping and basically
breaking anything that would parse or format numbers entirely.

On top of that, it's rather meaningless to support arbitrary radix
characters unless you also allow that they be multibyte, and that
breaks all kinds of invariants callers might expect about how much
storage is needed for the output (in other words it potentially
exposes buffer overflow vulns in programs that are otherwise
mostly-correct except for a subtly wrong assumption about locales).

So I am largely against making the radixchar setting anything more
than a 1-bit field that's xor'd with bit 1 of the '.' character.
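
Spelled out as a sketch (sketch_radix_char is a made-up name, not an
existing or planned musl function): '.' is 0x2e and ',' is 0x2c, so the
two differ only in bit 1.

```c
/* radix_bit is the 1-bit locale setting: 0 selects '.', 1 selects ','. */
static char sketch_radix_char(unsigned radix_bit)
{
	return (char)('.' ^ ((radix_bit & 1u) << 1));
}
```

With this encoding the radix is always a single ASCII byte, so the output
size of number formatting cannot change with the locale.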

> And I think those are all the places where float can be converted to a
> string or vice versa.

LC_MONETARY, used by strfmon, has a separate radix char vs LC_NUMERIC.

The only two places LC_NUMERIC radix char should be needed are fmt_fp
in src/stdio/vfprintf.c, and src/internal/floatscan.c for use by
strto{f,d,ld} and *scanf.

Rich


