Revisiting byte-based C locale

mailing list of musl libc

* Revisiting byte-based C locale
@ 2015-05-22  2:22 Rich Felker
  2015-05-22  4:04 ` Josiah Worcester
  2015-06-04 20:53 ` Rich Felker
  0 siblings, 2 replies; 8+ messages in thread
From: Rich Felker @ 2015-05-22  2:22 UTC (permalink / raw)
  To: musl

The last time the byte-based C locale topic was visited ("Possible
bytelocale patch", http://www.openwall.com/lists/musl/2014/07/03/2),
it was a rather ugly patch introducing lots of code duplication. Now,
I believe the callers of multibyte/wide char functions which need to
always work in UTF-8 mode (iconv) or need to match a previously-saved
mode (stdio wide functions, which save the encoding in the FILE when
it becomes wide-oriented) can simply swap __pthread_self()->locale
back and forth. There is no longer a possibility that the thread
pointer may be uninitialized, nor a heavy synchronization cost of
switching thread-local locales from the atomics in uselocale -- commit
68630b55c0c7219fe9df70dc28ffbf9efc8021d8 removed all that.
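
The save/swap/restore pattern described above can be approximated with
the public POSIX API. This is only a sketch under assumptions: code
inside musl would assign __pthread_self()->locale directly, whereas
this version goes through uselocale, and the helper name
decode_in_locale is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode one character from s under locale loc,
 * then restore whatever locale the calling thread had before. */
size_t decode_in_locale(locale_t loc, wchar_t *wc, const char *s, size_t n)
{
    mbstate_t st;
    locale_t saved;
    size_t r;

    assert(loc != (locale_t)0);
    memset(&st, 0, sizeof st);
    saved = uselocale(loc);     /* swap in the wanted locale */
    r = mbrtowc(wc, s, n, &st);
    uselocale(saved);           /* swap the caller's locale back */
    return r;
}
```

A caller would obtain loc from newlocale(LC_CTYPE_MASK, ...) once and
reuse it; the thread's own locale is unchanged after each call.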

Thus, I think we're at a point where we can evaluate the choice to
support or not to support a byte-based C locale on the basis of things
like standards conformance and impact on users and on software
compatibility without having to weigh implementation costs (which
would have contributed to "impact on users").

Since last year, the issue of byte-based C locale has come up a few
more times as a stumbling point for users on the IRC channel and/or
mailing list (I forget which and haven't gone back to look it up yet).
In particular, broken configure tests passing binary data to grep
failed, and I believe one or more language interpreters loading source
files in the C locale errored out due to a Latin-1 encoded "©"
character in source comments. Personally I'm in favor of getting the
broken stuff fixed, but I can see both sides.
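
To make the failure mode concrete, here is a rough sketch of how a
program reading source text through the multibyte functions could trip
over a stray Latin-1 byte. The helper name count_undecodable is
hypothetical, and whether byte 0xA9 is rejected depends entirely on
the libc and the locale in effect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the positions in buf at which mbrtowc()
 * fails in the calling thread's current locale. */
size_t count_undecodable(const char *buf, size_t len)
{
    mbstate_t st;
    size_t bad = 0, i = 0;

    assert(buf != NULL);
    memset(&st, 0, sizeof st);
    while (i < len) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, buf + i, len - i, &st);
        if (r == (size_t)-1 || r == (size_t)-2) {
            bad++;                      /* invalid or truncated sequence */
            memset(&st, 0, sizeof st);  /* reset state and skip one byte */
            i++;
        } else {
            i += r ? r : 1;             /* r == 0 means a NUL byte was decoded */
        }
    }
    return bad;
}
```

With a byte-based C locale the single byte "\xa9" would be expected to
decode and give a count of 0; where the C locale decodes as UTF-8 the
same byte is an invalid sequence and would be counted.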

There are also minor conformance reasons to consider the byte-based C
locale even without accepting the resolution to Austin Group issue 663
(which is supposedly imposing the requirement, someday). In
particular, the C standard seems to allow the current behavior of
musl, where the C locale has extra characters for which isw*() return
true, as long as the non-wide is*() functions don't have such extra
characters. C doesn't even define abstract character classes that
these functions report, just loose requirements on their behavior. But
POSIX specifies LC_CTYPE in terms of character classes which have
members, and does not leave room for extra characters in the C locale
as far as I can tell. This could affect real-world use cases where
an application intentionally running in the C locale expects the
regex/fnmatch bracket [[:alpha:]] not to match anything but ASCII
letters. As mentioned several times in the past, this non-conformance
could be addressed by changes in the isw*() functions (making them
locale-aware) rather than by adding the byte-based C locale, but if
there are other motivations to support the byte-based C locale, it
may make sense to solve both issues with one change.
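
The [[:alpha:]] concern can be probed with a small test along these
lines. It is a sketch, not a conformance test; the helper name
has_alpha is hypothetical, and the result for non-ASCII input depends
on the libc in use.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s contains a character that the
 * current locale places in [[:alpha:]], 0 if it does not (or if
 * regexec reports an error), and -1 if regcomp fails. */
int has_alpha(const char *s)
{
    regex_t re;
    int rc;

    assert(s != NULL);
    if (regcomp(&re, "[[:alpha:]]", REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

In a program that stays in the C locale, has_alpha("\xc3\xa9") (a
UTF-8 encoded e-acute) is the interesting case: under the POSIX
reading above it should return 0, while a C locale with extra
alphabetic wide characters could return 1.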

Any new opinions on the topic? Or interest in re-emphasizing a
previously stated opinion? :)

Rich

* Re: Revisiting byte-based C locale
  2015-05-22  2:22 Revisiting byte-based C locale Rich Felker
@ 2015-05-22  4:04 ` Josiah Worcester
  2015-05-22  4:09   ` Rich Felker
  2015-06-04 20:53 ` Rich Felker
  1 sibling, 1 reply; 8+ messages in thread
From: Josiah Worcester @ 2015-05-22  4:04 UTC (permalink / raw)
  To: musl

On Thu, May 21, 2015 at 9:22 PM, Rich Felker <dalias@libc.org> wrote:
> The last time the byte-based C locale topic was visited ("Possible
> bytelocale patch", http://www.openwall.com/lists/musl/2014/07/03/2),
> it was a rather ugly patch introducing lots of code duplication. Now,
> I believe the callers of multibyte/wide char functions which need to
> always work in UTF-8 mode (iconv) or need to match a previously-saved
> mode (stdio wide functions, which save the encoding in the FILE when
> it becomes wide-oriented) can simply swap __pthread_self()->locale
> back and forth. There is no longer a possibility that the thread
> pointer may be uninitialized, nor a heavy synchronization cost of
> switching thread-local locales from the atomics in uselocale -- commit
> 68630b55c0c7219fe9df70dc28ffbf9efc8021d8 removed all that.
>
> Thus, I think we're at a point where we can evaluate the choice to
> support or not to support a byte-based C locale on the basis of things
> like standards conformance and impact on users and on software
> compatibility without having to weigh implementation costs (which
> would have contributed to "impact on users").
>
> Since last year, the issue of byte-based C locale has come up a few
> more times as a stumbling point for users on the IRC channel and/or
> mailing list (I forget which and haven't gone back to look it up yet).
> In particular, broken configure tests passing binary data to grep
> failed, and I believe one or more language interpreters loading source
> files in the C locale errored out due to a Latin-1 encoded "©"
> character in source comments. Personally I'm in favor of getting the
> broken stuff fixed, but I can see both sides.
>
> There are also minor conformance reasons to consider the byte-based C
> locale even without accepting the resolution to Austin Group issue 663
> (which is supposedly imposing the requirement, someday). In
> particular, the C standard seems to allow the current behavior of
> musl, where the C locale has extra characters for which isw*() return
> true, as long as the non-wide is*() functions don't have such extra
> characters. C doesn't even define abstract character classes that
> these functions report, just loose requirements on their behavior. But
> POSIX specifies LC_CTYPE in terms of character classes which have
> members, and does not leave room for extra characters in the C locale
> as far as I can tell. This could affect real-world use cases where
> an application intentionally running in the C locale expects the
> regex/fnmatch bracket [[:alpha:]] not to match anything but ASCII
> letters. As mentioned several times in the past, this non-conformance
> could be addressed by changes in the isw*() functions (making them
> locale-aware) rather than by adding the byte-based C locale, but if
> there are other motivations to support the byte-based C locale, it
> may make sense to solve both issues with one change.
>
> Any new opinions on the topic? Or interest in re-emphasizing a
> previously stated opinion? :)
>
> Rich

Given the POSIX rules on LC_CTYPE character classes affecting
[[:alpha:]], it seems to me now that the clear intent (if not
statement) is in fact for a byte-based C locale. Though maybe
unfortunate, it does seem as though that is in fact the most
conformant way of doing it, and conforming looks to have little cost
now.


* Re: Revisiting byte-based C locale
  2015-05-22  4:04 ` Josiah Worcester
@ 2015-05-22  4:09   ` Rich Felker
  0 siblings, 0 replies; 8+ messages in thread
From: Rich Felker @ 2015-05-22  4:09 UTC (permalink / raw)
  To: musl

On Thu, May 21, 2015 at 11:04:47PM -0500, Josiah Worcester wrote:
> Given the POSIX rules on LC_CTYPE character classes affecting
> [[:alpha:]], it seems to me now that the clear intent (if not
> statement) is in fact for a byte-based C locale. Though maybe
> unfortunate, it does seem as though that is in fact the most
> conformant way of doing it, and conforming looks to have little cost
> now.

Not necessarily. There's no rule against the existence of additional
characters in the C locale -- in fact, the proposal to make the C
locale "8-bit-clean" requires an additional 128 characters -- but the
additional ones can't be in classes like alpha.
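
As a rough illustration of that constraint on the byte side, the
following sketch counts how many of the 128 high byte values the
non-wide isalpha() accepts. The function name count_high_alpha is
hypothetical; in the C locale the expected count is zero even if those
bytes exist as characters.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical check: count byte values 128..255 that the current
 * locale classifies as alphabetic via the non-wide isalpha(). */
int count_high_alpha(void)
{
    int c, n = 0;

    for (c = 128; c <= 255; c++)
        if (isalpha(c))   /* c is within unsigned char range here */
            n++;
    assert(n >= 0 && n <= 128);   /* sanity check on the loop bounds */
    return n;
}
```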

Rich


* Re: Revisiting byte-based C locale
  2015-05-22  2:22 Revisiting byte-based C locale Rich Felker
  2015-05-22  4:04 ` Josiah Worcester
@ 2015-06-04 20:53 ` Rich Felker
  2015-06-04 21:00   ` Christian Neukirchen
  1 sibling, 1 reply; 8+ messages in thread
From: Rich Felker @ 2015-06-04 20:53 UTC (permalink / raw)
  To: musl

On Thu, May 21, 2015 at 10:22:03PM -0400, Rich Felker wrote:
> Any new opinions on the topic? Or interest in re-emphasizing a
> previously stated opinion? :)

No new opinions on this? I've tentatively added drafting a new
proposed byte-based C locale patch as a roadmap item for this release
cycle, not necessarily to commit it, but as a way to re-evaluate
whether it's still costly to implement.

Rich


* Re: Revisiting byte-based C locale
  2015-06-04 20:53 ` Rich Felker
@ 2015-06-04 21:00   ` Christian Neukirchen
  2015-06-05  1:39     ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Neukirchen @ 2015-06-04 21:00 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

Rich Felker <dalias@libc.org> writes:

> On Thu, May 21, 2015 at 10:22:03PM -0400, Rich Felker wrote:
>> Any new opinions on the topic? Or interest in re-emphasizing a
>> previously stated opinion? :)
>
> No new opinions on this? I've tentatively added drafting a new
> proposed byte-based C locale patch as a roadmap item for this release
> cycle, not necessarily to commit it, but as a way to re-evaluate
> whether it's still costly to implement.

Will it support regexec on 8-bit binary data?  We found out file(1)
needs this.

-- 
Christian Neukirchen  <chneukirchen@gmail.com>  http://chneukirchen.org


* Re: Revisiting byte-based C locale
  2015-06-04 21:00   ` Christian Neukirchen
@ 2015-06-05  1:39     ` Rich Felker
  2015-06-05  4:48       ` Isaac Dunham
  2015-06-05  8:58       ` Christian Neukirchen
  0 siblings, 2 replies; 8+ messages in thread
From: Rich Felker @ 2015-06-05  1:39 UTC (permalink / raw)
  To: musl

On Thu, Jun 04, 2015 at 11:00:10PM +0200, Christian Neukirchen wrote:
> Rich Felker <dalias@libc.org> writes:
> 
> > On Thu, May 21, 2015 at 10:22:03PM -0400, Rich Felker wrote:
> >> Any new opinions on the topic? Or interest in re-emphasizing a
> >> previously stated opinion? :)
> >
> > No new opinions on this? I've tentatively added drafting a new
> > proposed byte-based C locale patch as a roadmap item for this release
> > cycle, not necessarily to commit it, but as a way to re-evaluate
> > whether it's still costly to implement.
> 
> Will it support regexec on 8-bit binary data?

Yes, as long as the program has done one of the following:

- Not called setlocale at all.
- Called setlocale with an explicit "C" argument, or with "" while the environment selects "C".
- Called uselocale with a locale_t for "C".
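
A minimal sketch of the first case (a program that never calls
setlocale and so stays in the C locale) might look like this. The
helper name bytes_match is hypothetical, and whether high bytes are
accepted depends on the libc treating the C locale as byte-based.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the basic regex pat matches
 * somewhere in the NUL-terminated byte string data, 0 otherwise
 * (including when regcomp fails). */
int bytes_match(const char *pat, const char *data)
{
    regex_t re;
    int rc;

    assert(pat != NULL && data != NULL);
    if (regcomp(&re, pat, REG_NOSUB) != 0)
        return 0;
    rc = regexec(&re, data, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Note that POSIX regexec takes a NUL-terminated string, so data with
embedded NUL bytes would need to be scanned in pieces or handled with
a non-standard extension.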

> We found out file(1)
> needs this.

Indeed, aside from the Austin Group issue 663, having this topic come
up several times in real-world usage is the motivation for
reconsidering it. I believe file(1) _attempts_ to do this right,
making use of uselocale.

Rich


* Re: Revisiting byte-based C locale
  2015-06-05  1:39     ` Rich Felker
@ 2015-06-05  4:48       ` Isaac Dunham
  2015-06-05  8:58       ` Christian Neukirchen
  1 sibling, 0 replies; 8+ messages in thread
From: Isaac Dunham @ 2015-06-05  4:48 UTC (permalink / raw)
  To: musl

On Thu, Jun 04, 2015 at 09:39:11PM -0400, Rich Felker wrote:
> On Thu, Jun 04, 2015 at 11:00:10PM +0200, Christian Neukirchen wrote:
> > Rich Felker <dalias@libc.org> writes:
> > 
> > > On Thu, May 21, 2015 at 10:22:03PM -0400, Rich Felker wrote:
> > >> Any new opinions on the topic? Or interest in re-emphasizing a
> > >> previously stated opinion? :)
> > >
> > > No new opinions on this? I've tentatively added drafting a new
> > > proposed byte-based C locale patch as a roadmap item for this release
> > > cycle, not necessarily to commit it, but as a way to re-evaluate
> > > whether it's still costly to implement.
> > 
> > Will it support regexec on 8-bit binary data?
> 
> Yes, as long as the program has done one of the following:
> 
> - Not called setlocale at all.
> - Called setlocale with an explicit "C" argument or in environment.
> - Called uselocale with a locale_t for "C".

I'd like to see it happen, and those conditions are reasonable.

Thanks,
Isaac Dunham


* Re: Revisiting byte-based C locale
  2015-06-05  1:39     ` Rich Felker
  2015-06-05  4:48       ` Isaac Dunham
@ 2015-06-05  8:58       ` Christian Neukirchen
  1 sibling, 0 replies; 8+ messages in thread
From: Christian Neukirchen @ 2015-06-05  8:58 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

Rich Felker <dalias@libc.org> writes:

> On Thu, Jun 04, 2015 at 11:00:10PM +0200, Christian Neukirchen wrote:
>> Rich Felker <dalias@libc.org> writes:
>> 
>> > On Thu, May 21, 2015 at 10:22:03PM -0400, Rich Felker wrote:
>> >> Any new opinions on the topic? Or interest in re-emphasizing a
>> >> previously stated opinion? :)
>> >
>> > No new opinions on this? I've tentatively added drafting a new
>> > proposed byte-based C locale patch as a roadmap item for this release
>> > cycle, not necessarily to commit it, but as a way to re-evaluate
>> > whether it's still costly to implement.
>> 
>> Will it support regexec on 8-bit binary data?
>
> Yes, as long as the program has done one of the following:
>
> - Not called setlocale at all.
> - Called setlocale with an explicit "C" argument or in environment.
> - Called uselocale with a locale_t for "C".

AFAICS it does:

in main:
        (void)setlocale(LC_CTYPE, "");

protected int
file_regcomp(file_regex_t *rx, const char *pat, int flags)
{
#ifdef USE_C_LOCALE
        rx->c_lc_ctype = newlocale(LC_CTYPE_MASK, "C", 0);
        assert(rx->c_lc_ctype != NULL);
        rx->old_lc_ctype = uselocale(rx->c_lc_ctype);
        assert(rx->old_lc_ctype != NULL);
#endif
        rx->pat = pat;

        return rx->rc = regcomp(&rx->rx, pat, flags);
}


>> We found out file(1)
>> needs this.
>
> Indeed, aside from the Austin Group issue 663, having this topic come
> up several times in real-world usage is the motivation for
> reconsidering it. I believe file(1) _attempts_ to do this right,
> making use of uselocale.

A strong +1 from me then.  I'll be glad to help testing it on Void Linux.

-- 
Christian Neukirchen  <chneukirchen@gmail.com>  http://chneukirchen.org

