Locale framework RFC

mailing list of musl libc

* Locale framework RFC
@ 2014-06-27 19:04 Rich Felker
  2014-06-28  1:25 ` Locale framework RFC - more on proposed implementation Rich Felker
  2014-06-29  2:14 ` Locale framework RFC Rich Felker
  0 siblings, 2 replies; 3+ messages in thread
From: Rich Felker @ 2014-06-27 19:04 UTC (permalink / raw)
  To: musl

Background:

One of the agenda items for this release cycle is locale framework.
This is needed both to support a byte-based POSIX locale (the
unfortunate fallout of Austin Group issue #663) and for legitimate
locale purposes like collation, localized time formats, etc.

Note that, at present, musl's "everything is UTF-8" C locale is
already non-conforming to the requirements of ISO C, because it places
extra characters in the C locale's character classes like alpha, etc.
beyond what the standard allows. This could be fixed by making the
definitions of the character classes locale-dependent, but if we just
accept the (bad, wrong, backwards, etc.) new POSIX requirements for
the C/POSIX locale, we get a fix for free: it doesn't matter if
iswalpha(0xc0) returns true in the C locale if the wchar_t value 0xc0
can never be generated in the C locale.

My proposed solution is to provide a backwards C locale where bytes
0x80 through 0xff are interpreted as abstract bytes that decode to
wchar_t values which are either invalid Unicode or PUA codepoints. The
latter is probably preferable since generating invalid codepoints may,
strictly speaking, make it wrong to define __STDC_ISO_10646__.
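
A minimal sketch of the byte-to-PUA mapping described above. The base value BYTE_PUA_BASE (placing high bytes at U+F780 through U+F7FF) and the helper names byte_decode/byte_encode are assumptions chosen for illustration; the proposal does not fix a particular range.

```c
#include <wchar.h>

/* Hypothetical base: high byte b decodes to BYTE_PUA_BASE + b, which
 * lands in U+F780..U+F7FF, inside the BMP Private Use Area. */
#define BYTE_PUA_BASE 0xf700UL

/* Decode one byte in the byte-based C locale. ASCII maps to itself;
 * each high byte maps to its own PUA codepoint. */
static wchar_t byte_decode(unsigned char b)
{
	return b < 0x80 ? (wchar_t)b : (wchar_t)(BYTE_PUA_BASE + b);
}

/* Inverse mapping: returns the byte value, or -1 if wc cannot be
 * represented in the byte-based locale. */
static int byte_encode(wchar_t wc)
{
	unsigned long c = (unsigned long)wc;
	if (c < 0x80) return (int)c;
	if (c >= BYTE_PUA_BASE + 0x80 && c <= BYTE_PUA_BASE + 0xff)
		return (int)(c - BYTE_PUA_BASE);
	return -1;
}
```

Under this mapping the value 0xc0 is never produced by decoding and cannot be encoded, so what iswalpha(0xc0) returns is unobservable in the byte-based locale.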

How does this affect real programs? Not much at all. A program that
hasn't called setlocale() can't expect to be able to use the multibyte
interfaces reasonably anyway, so it doesn't matter that they default
to byte mode when the program starts up. And if the program does call
setlocale correctly (with an argument of "", which means to use the
'default locale', defined by POSIX with the LC_* env vars), it will
get a proper UTF-8-based locale anyway unless the user has explicitly
overridden that by setting LC_CTYPE=C or LC_ALL=C. So really all
that's seriously affected are scripts using LC_CTYPE=C or LC_ALL=C to
do byte-based processing using the standard utilities, and the
behavior of these is "improved".

Implementation:

Three new fields in the libc structure:

1. locale_t global_locale;

This is the locale presently selected by setlocale() and which affects
all threads which have not called uselocale() or which called
uselocale with LC_GLOBAL_LOCALE as the argument.

2. int uselocale_cnt;

uselocale_cnt is the current number of threads with a thread-local
locale. It's incremented/decremented (atomically) by the uselocale
function when transitioning from LC_GLOBAL_LOCALE to a thread-local
locale or vice versa, respectively, and also decremented (atomically)
in pthread_exit if the exiting thread has a thread-local locale. The
purpose of having uselocale_cnt is that, whenever uselocale_cnt is
zero, libc.global_locale can be used directly with no TLS access to
determine if the current thread has a thread-local locale.

3. int bytelocale_cnt_minus_1;

This is a second atomic counter which behaves similarly to
uselocale_cnt, except that it is only incremented/decremented when the
thread-local locale being activated/deactivated is non-UTF-8
(byte-based). The global locale set by setlocale is also tracked in
the count, and the result is offset by -1.

Initially at program startup (when setlocale has not been called), the
value of bytelocale_cnt_minus_1 is zero. Setting any locale but "C" or
"POSIX" for LC_CTYPE with setlocale will enable UTF-8 and thus
decrement the value to -1. Setting any thread-local locale to "C" or
"POSIX" for LC_CTYPE will increment the value to something
non-negative.
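
The counter transitions above can be modeled as follows. This is a toy sketch using C11 atomics, not musl's code; the model_* function names and the choice of <stdatomic.h> are assumptions made for illustration.

```c
#include <stdatomic.h>

/* Toy model of the two proposed counters. Static atomics start at zero,
 * matching the startup state described above. */
static atomic_int uselocale_cnt;          /* threads with a thread-local locale */
static atomic_int bytelocale_cnt_minus_1; /* byte-based locales in effect, minus 1 */

/* The global LC_CTYPE switches between byte-based and UTF-8. */
static void model_setlocale_ctype(int was_utf8, int now_utf8)
{
	if (!was_utf8 && now_utf8) atomic_fetch_sub(&bytelocale_cnt_minus_1, 1);
	if (was_utf8 && !now_utf8) atomic_fetch_add(&bytelocale_cnt_minus_1, 1);
}

/* A thread installs a thread-local locale via uselocale(). */
static void model_uselocale_enter(int utf8)
{
	atomic_fetch_add(&uselocale_cnt, 1);
	if (!utf8) atomic_fetch_add(&bytelocale_cnt_minus_1, 1);
}

/* The thread returns to LC_GLOBAL_LOCALE, or exits. */
static void model_uselocale_leave(int utf8)
{
	atomic_fetch_sub(&uselocale_cnt, 1);
	if (!utf8) atomic_fetch_sub(&bytelocale_cnt_minus_1, 1);
}

/* Negative means no byte-based locale is active anywhere. */
static int all_utf8(void)
{
	return atomic_load(&bytelocale_cnt_minus_1) < 0;
}
```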

All functions which are optimized for the sane case of all data being
UTF-8 therefore have a trivial fast-path: if
libc.bytelocale_cnt_minus_1 is negative, they can immediately assume
UTF-8 with no further tests. Otherwise checking libc.uselocale_cnt is
necessary to determine whether to inspect libc.global_locale or
__pthread_self()->locale to determine whether to decode UTF-8 or treat
bytes as abstract bytes.

Per earlier testing I did when Austin Group issue #663 was being
discussed, a single access and conditional jump based on data in the
libc structure does not yield measurable performance cost in UTF-8
decoding. For encoding (wc[r]tomb) there may be a small performance
cost added on archs that need a GOT pointer for GOT-relative accesses
(vs direct PC-relative), since the current code has no GOT pointer.
Fortunately decoding, not encoding, is the performance-critical
operation.

Code which uses locale:

The basic idiom for getting the locale will be:

	locale_t loc = libc.uselocale_cnt && __pthread_self()->locale
		? __pthread_self()->locale
		: libc.global_locale;

And if all that's needed is a UTF-8 flag:

	int is_utf8 = libc.bytelocale_cnt_minus_1<0 || loc->utf8;

where "loc" is the result of the previous expression above. This test
looks fairly expensive, but the only cases with any cost are when
there's at least one thread with a non-UTF-8 locale. Even in the case
where uselocale is in heavy use, as long as it's not being used to
turn off UTF-8, there's no performance penalty.

Components affected:

1. Multibyte functions: must use the above tests to see whether to
process UTF-8 or behave as dumb byte functions. Note that the
restartable multibyte functions (those which use mbstate_t) can skip
the check when the state is not the initial state, since use of the
state after changing locale is UB.

2. Character class functions: should not be affected at all, but we
need to make sure they're all returning false for the characters
decoded from high bytes in bytelocale mode.

3. Stdio wide mode: It's required to bind to the character encoding in
effect at the time the FILE goes into wide mode, rather than at the
time of the IO operation. So rather than using mbrtowc or wcrtomb, it
needs to store the state at the time of entering wide mode and use a
conversion that's conditional on this saved flag rather than on the
locale.

4. Code which uses mbtowc and/or wctomb assuming they always process
UTF-8: Aside from the above-mentioned use in stdio, this is probably
just iconv. To fix this, I propose adding new functions which don't
check the locale but always process UTF-8. These could also be used
for stdio wide mode, and they could use a different API than the
standard functions in order to be more efficient (e.g. returning the
decoded character, or negative for errors, rather than storing the
result via a pointer argument).

5. MB_CUR_MAX macro: It needs to expand to a function call rather than
an integer constant expression, since it has to be 1 for the new POSIX
locale. The function can in turn use the is_utf8 pattern above to
determine the right return value.

6. setlocale, uselocale, and related functions: These need to
implement locale switching and the atomic counter logic described
above.

7. pthread_exit: Needs to decrement relevant atomic counters.

8. nl_langinfo and nl_langinfo_l: At present, the only item they need
to support on a per-thread basis is CODESET. For the byte-based C
locale, this could be "8BIT", "BINARY", "ASCII+8BIT" or similar. Here
it needs to be decided whether nl_langinfo should be responsible for
determining the locale_t to pass to nl_langinfo_l, or whether
nl_langinfo_l should accept (locale_t)0 and do its own determination.
This issue will also affect other */*_l pairs that need non-trivial
implementations later.

9. iconv: In addition to the above issue in item 4, iconv should
support whatever value nl_langinfo(CODESET) returns for the C locale
as a from/to argument, even if it's largely useless.
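
As an illustration of item 5, here is a sketch of MB_CUR_MAX as a function call. The names model_is_utf8, model_ctype_get_mb_cur_max and MODEL_MB_CUR_MAX are hypothetical stand-ins; a real implementation would use the is_utf8 test shown earlier rather than a plain variable.

```c
#include <stddef.h>

/* Stand-in for the is_utf8 test; zero models the byte-based startup
 * state before setlocale has been called. */
static int model_is_utf8 = 0;

/* UTF-8 needs up to 4 bytes per character; the byte-based locale
 * needs exactly 1. */
static size_t model_ctype_get_mb_cur_max(void)
{
	return model_is_utf8 ? 4 : 1;
}

/* No longer an integer constant expression. Named MODEL_MB_CUR_MAX so
 * it does not clash with the real macro from <stdlib.h>. */
#define MODEL_MB_CUR_MAX (model_ctype_get_mb_cur_max())
```

One consequence is that code using the macro as a static array size would stop compiling; MB_LEN_MAX remains the constant to use for buffer sizing.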

Overall Impact:

Should be near-zero on programs that don't use locale-related
features: a few bytes in the global libc struct and a couple extra
lines in pthread_exit.


* Locale framework RFC - more on proposed implementation
  2014-06-27 19:04 Locale framework RFC Rich Felker
@ 2014-06-28  1:25 ` Rich Felker
  2014-06-29  2:14 ` Locale framework RFC Rich Felker
  1 sibling, 0 replies; 3+ messages in thread
From: Rich Felker @ 2014-06-28  1:25 UTC (permalink / raw)
  To: musl

I realize I left out some details about how setlocale/uselocale/etc.
will actually work, which should be included:

General behavior, implementation-defined behaviors:

Per C, when a locale of "" is requested, the "default locale" is used.
Per POSIX, default locale is determined by applying the LC_* and LANG
env vars in the usual order (LC_ALL overrides all, LC_* are for
individual categories, LANG is the fallback if LC_* are not set) with
an implementation-defined default in the case where none of the vars
are defined. musl's implementation-defined default will be the current
"C.UTF-8".
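
The lookup order can be sketched like this. The function name default_locale_name is hypothetical, and treating an empty variable as unset follows the POSIX rule for these variables.

```c
#include <stdlib.h>

/* Resolve the locale name for one category when "" is requested:
 * LC_ALL first, then the category's own variable, then LANG, then the
 * implementation-defined default. Empty values count as unset. */
static const char *default_locale_name(const char *category_var)
{
	const char *s;
	if ((s = getenv("LC_ALL")) && *s) return s;
	if ((s = getenv(category_var)) && *s) return s;
	if ((s = getenv("LANG")) && *s) return s;
	return "C.UTF-8";
}
```

For example, default_locale_name("LC_CTYPE") yields "C.UTF-8" when none of the three variables are set.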

Here are the values for different locale categories that will be
meaningful to musl:

All categories: C.UTF-8, suppresses any possible filesystem access
looking for locale files.

LC_CTYPE - C or POSIX have a special meaning, byte-based, non-UTF-8

LC_MESSAGES - Language name to be used in pathname components for
translated message files. Kept and used regardless of whether such
directories exist since applications may have their own translated
messages in languages libc is not aware of.

LC_TIME - Language/culture name to be used for loading a file that
will control time formatting. This _might_ simply be a .mo file with
message translations for the English names, or a catgets-format file
using the nl_langinfo item tags as keys. At first it won't even be
implemented so it doesn't matter.

LC_COLLATE - Language/culture name to be used for loading a collation
file, probably in some compiled version of Unicode collation algorithm
format rather than the POSIX localedef format. But for now this will
also remain unimplemented.

LC_MONETARY - Language/culture name to be used for loading a file with
custom strfmon parameters. Since strfmon does not even work properly
now, this is not a priority, and it will remain unimplemented for the
time being.

LC_NUMERIC - None.

Unrecognized locale names will also be accepted as aliases for
C.UTF-8; this both facilitates easy use of message translation files
of which libc is not aware (i.e. setting LANG to such a language
won't cause setlocale(LC_ALL, "") to fail) and avoids the unfortunate
possibility of an accidental bogus environment setting causing
programs to regress from UTF-8 to byte-based mode, which would happen
if setlocale were to fail on unknown arguments.

When setlocale(LC_ALL, 0) is called, it needs to return a string that
encodes all of the locale categories and which can be used to set the
locale back to the same state. I don't think the format for this
string needs to be documented/specified, but it will probably just be
a delimited list of the settings for each configurable category, in
numeric order.
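
From the caller's side, the round trip this string has to support looks roughly like the sketch below. It uses only standard setlocale; the helper names save_locale and restore_locale are hypothetical, and the string is treated as opaque.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return a heap copy of the opaque string describing all categories.
 * A copy is needed because later setlocale calls may overwrite the
 * buffer the original pointer refers to. */
static char *save_locale(void)
{
	const char *cur = setlocale(LC_ALL, NULL);
	if (!cur) return NULL;
	char *copy = malloc(strlen(cur) + 1);
	if (copy) strcpy(copy, cur);
	return copy;
}

/* Restore a state captured by save_locale() and free the copy.
 * Returns 0 on success, -1 on failure. */
static int restore_locale(char *saved)
{
	int ok = saved && setlocale(LC_ALL, saved) != NULL;
	free(saved);
	return ok ? 0 : -1;
}
```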

Implementation:

Atomicity:

The standards are less than ideal with regard to what happens when the
locale is changed out from under interfaces whose behavior depends on
locale. Rather than worry about this, I'd rather everything just be
safe. So aside from thread-local locale structures used by uselocale,
pretty much all of the data the locale system works with should be
immutable -- once allocated/mapped, it should never be freed.
Otherwise expensive locking is required on every use. To avoid
excessive costs/memory leaks, each locale resource loaded should only
be loaded once, and reused if requested again, much like the way
dlopen/dlclose work in musl.
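
A rough sketch of the load-once idea, with many assumptions: struct locale_res, its fields, and get_locale_res are hypothetical names, no file is actually mapped (data stays NULL), and the lock that would serialize concurrent loads is omitted.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one loaded locale resource. Entries are
 * never modified or freed once they are on the list. */
struct locale_res {
	struct locale_res *next;
	const void *data;
	char name[24];
};

static struct locale_res *loaded; /* head of the list of loaded resources */

/* Return the entry for name, creating it on first request. Callers are
 * assumed to hold a lock around this function; the sketch leaves that
 * out, and leaves data NULL instead of mapping a real file. */
static const struct locale_res *get_locale_res(const char *name)
{
	struct locale_res *p;
	if (strlen(name) >= sizeof p->name) return NULL;
	for (p = loaded; p; p = p->next)
		if (!strcmp(p->name, name)) return p;
	p = calloc(1, sizeof *p);
	if (!p) return NULL;
	strcpy(p->name, name);
	p->next = loaded;
	loaded = p;
	return p;
}
```

Requesting the same name twice returns the same pointer, so repeated setlocale calls do not grow memory use.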

Locale structure:

The locale structure, which represents either a global locale setting
or a locale allocated by newlocale for use with uselocale, needs to
reflect each of the individual locale categories. With setlocale, it's
possible to change any of the categories for the global locale
independently, and the confusingly-named newlocale actually changes
just one category at a time for the input locale it modifies.

Here is a possible structure:

struct __locale_struct {
	int ctype_utf8;
	char *name[4];
	void *cat[4];
};

Since the only property of the LC_CTYPE category which varies is
encoding (binary vs UTF-8), a single atomically-written int suffices
to implement LC_CTYPE.

For the remaining four categories which can vary, a name and a pointer
are stored.

The names are only used internally by the locale system (e.g. for
constructing the return string setlocale produces) so they do not need
any atomicity. They could be stored in dynamically allocated storage
(makes sense for newlocale/uselocale) or static storage (makes sense
for setlocale where we don't want to link free).

The category data pointers need to be replaced atomically such that
another thread accessing the category's data sees either the old value
or the new value; both point to valid data since all data is
immutable. The specifics of the data pointed to will be defined later.
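
One way to picture the atomic replacement, sketched with C11 atomics. That choice is an assumption (the proposal does not say which primitives would be used), and struct model_locale, set_cat and get_cat are illustrative names.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Model of the proposed structure. */
struct model_locale {
	atomic_int ctype_utf8;
	const char *name[4];
	_Atomic(const void *) cat[4];
};

/* Publish new immutable data for category i. Readers that already
 * loaded the old pointer keep using valid data because nothing is
 * ever freed. The name needs no atomicity; only the locale system
 * itself reads it. */
static void set_cat(struct model_locale *loc, int i,
                    const char *name, const void *data)
{
	loc->name[i] = name;
	atomic_store(&loc->cat[i], data);
}

/* Readers see either the old or the new pointer, never a torn value. */
static const void *get_cat(struct model_locale *loc, int i)
{
	return atomic_load(&loc->cat[i]);
}
```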

Storage:

I had previously suggested adding a pointer to the global locale to
the libc structure, but I think it may make more sense to actually
embed the global locale structure there. Either approach would work,
but if the locale structure isn't embedded, a few places where it's
used would need to check whether the pointer is null and have suitable
fallbacks.

For uselocale, __pthread_self()->locale should just point to whatever
locale is in use, possibly the global locale or one created by
newlocale. With this design there's no requirement (as in the previous
document) for checking whether __pthread_self()->locale is NULL and
falling back to the global locale. Instead, when libc.uselocale_cnt is
nonzero, __pthread_self()->locale can just be used directly.
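
With the global locale embedded and the per-thread pointer always valid, the lookup idiom simplifies. A sketch using stand-in types (struct model_libc, struct model_thread and the function names are illustrative, not musl internals):

```c
/* Stand-ins for musl internals. */
struct model_loc { int ctype_utf8; };

struct model_libc {
	int uselocale_cnt;
	int bytelocale_cnt_minus_1;
	struct model_loc global_locale; /* embedded, so never null */
};

struct model_thread { struct model_loc *locale; };

/* The thread's pointer is only consulted when some thread has called
 * uselocale, and it is always valid, so no NULL check is needed. */
static struct model_loc *current_locale(struct model_libc *l,
                                        struct model_thread *self)
{
	return l->uselocale_cnt ? self->locale : &l->global_locale;
}

/* UTF-8 flag: the counter gives the fast path; otherwise ask the
 * locale that applies to this thread. */
static int is_utf8(struct model_libc *l, struct model_thread *self)
{
	return l->bytelocale_cnt_minus_1 < 0
		|| current_locale(l, self)->ctype_utf8;
}
```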


* Re: Locale framework RFC
  2014-06-27 19:04 Locale framework RFC Rich Felker
  2014-06-28  1:25 ` Locale framework RFC - more on proposed implementation Rich Felker
@ 2014-06-29  2:14 ` Rich Felker
  1 sibling, 0 replies; 3+ messages in thread
From: Rich Felker @ 2014-06-29  2:14 UTC (permalink / raw)
  To: musl

On Fri, Jun 27, 2014 at 03:04:12PM -0400, Rich Felker wrote:
> Components affected:
> [...]
> 
> 3. Stdio wide mode: It's required to bind to the character encoding in
> effect at the time the FILE goes into wide mode, rather than at the
> time of the IO operation. So rather than using mbrtowc or wcrtomb, it
> needs to store the state at the time of entering wide mode and use a
> conversion that's conditional on this saved flag rather than on the
> locale.
> 
> 4. Code which uses mbtowc and/or wctomb assuming they always process
> UTF-8: Aside from the above-mentioned use in stdio, this is probably
> just iconv. To fix this, I propose adding new functions which don't
> check the locale but always process UTF-8. These could also be used
> for stdio wide mode, and they could use a different API than the
> standard functions in order to be more efficient (e.g. returning the
> decoded character, or negative for errors, rather than storing the
> result via a pointer argument).

These two items are turning out to be something of a pain: in
particular, the need for non-locale-sensitive UTF-8 encoding and
decoding functions. They can be solved by duplicating mbrtowc.c with
an identical file except for omitting the locale check that's being
added (and likewise wcrtomb.c), but that's rather ugly. Another
solution would be to somehow process the first byte in the caller so
that the mbstate_t would be non-initial by the time mbrtowc is called.
That would force mbrtowc to handle the sequence as UTF-8. But it also
spreads out the logic into places I'd rather it not be.
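
For reference, a locale-independent decoder with the "return the character, or negative for errors" style of API could look like the sketch below. It is a standalone illustration, not derived from musl's mbrtowc.c; the name utf8_decode and its signature are assumptions.

```c
#include <stddef.h>

/* Decode one UTF-8 character from s (at most n bytes) without
 * consulting the locale. Returns the codepoint and stores its encoded
 * length in *len, or returns -1 for invalid or truncated input. */
static long utf8_decode(const unsigned char *s, size_t n, size_t *len)
{
	/* Smallest codepoint that legitimately needs each length. */
	static const long min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	long c;
	size_t i, need;

	if (!n) return -1;
	if (s[0] < 0x80) { *len = 1; return s[0]; }
	if (s[0] >= 0xc2 && s[0] <= 0xdf) { need = 2; c = s[0] & 0x1f; }
	else if (s[0] >= 0xe0 && s[0] <= 0xef) { need = 3; c = s[0] & 0x0f; }
	else if (s[0] >= 0xf0 && s[0] <= 0xf4) { need = 4; c = s[0] & 0x07; }
	else return -1;
	if (n < need) return -1;
	for (i = 1; i < need; i++) {
		if ((s[i] & 0xc0) != 0x80) return -1;
		c = c << 6 | (s[i] & 0x3f);
	}
	/* Reject overlong forms, surrogates, and values past U+10FFFF. */
	if (c < min[need] || (c >= 0xd800 && c <= 0xdfff) || c > 0x10ffff)
		return -1;
	*len = need;
	return c;
}
```

Unlike mbrtowc, this keeps no mbstate_t, so a caller such as iconv or stdio would have to buffer incomplete sequences itself.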

Eventually when I do the iconv overhaul, I'd probably like to inline
UTF-8 processing anyway and make it a good deal faster, operating on a
larger intermediate buffer when possible rather than working
character-by-character. However I don't want the current locale work
to be dependent on future iconv work for correct behavior, so a decent
short-term solution is needed too. And of course the stdio wide
functions need a solution.

Rich


Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
