* [ Guidance ] Potential New Routines; Requesting Help @ 2019-12-24 23:06 JeanHeyd Meneide 2019-12-25 20:07 ` Florian Weimer 2019-12-30 17:31 ` Rich Felker 0 siblings, 2 replies; 11+ messages in thread From: JeanHeyd Meneide @ 2019-12-24 23:06 UTC (permalink / raw) To: musl Dear musl Maintainers and Contributors, I hope this e-mail finds you doing well this Holiday Season! I am interested in developing a few fast routines for text encoding for musl after the positive reception of a paper for the C Standard related to fast conversion routines: https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html While I have a basic implementation, I would like to use some processor and compiler intrinsics to make it faster and make sure my first contribution meets both quality and speed standards for a C library. Is there a place in the codebase I can look to for guidance on how to handle intrinsics properly within musl libc? If there is already infrastructure and common idioms in place, I would rather use that than spin up my own. Sincerely, JeanHeyd Meneide ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-24 23:06 [ Guidance ] Potential New Routines; Requesting Help JeanHeyd Meneide @ 2019-12-25 20:07 ` Florian Weimer 2019-12-26 2:13 ` Rich Felker 2019-12-30 17:31 ` Rich Felker 1 sibling, 1 reply; 11+ messages in thread From: Florian Weimer @ 2019-12-25 20:07 UTC (permalink / raw) To: JeanHeyd Meneide; +Cc: musl * JeanHeyd Meneide: > I hope this e-mail finds you doing well this Holiday Season! I am > interested in developing a few fast routines for text encoding for > musl after the positive reception of a paper for the C Standard > related to fast conversion routines: > > https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html I'm somewhat concerned that the C multibyte functions are too broken to be useful. There is at least one widely implemented character set (Big5 as specified for HTML5) which does not fit the model implied by the standard. Big5 does not have shift states, but a C implementation using UTF-32 for wchar_t has to pretend it has because correct conversion from Unicode to Big5 needs lookahead and cannot be performed one character at a time. This would at least affect the proposed c8rtomb function. I posted a brief review of the problematic charsets in glibc here: <https://sourceware.org/ml/libc-alpha/2019-05/msg00079.html> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-25 20:07 ` Florian Weimer @ 2019-12-26 2:13 ` Rich Felker 2019-12-26 5:43 ` JeanHeyd Meneide 2019-12-26 9:43 ` Florian Weimer 0 siblings, 2 replies; 11+ messages in thread From: Rich Felker @ 2019-12-26 2:13 UTC (permalink / raw) To: Florian Weimer; +Cc: JeanHeyd Meneide, musl On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote: > * JeanHeyd Meneide: > > > I hope this e-mail finds you doing well this Holiday Season! I am > > interested in developing a few fast routines for text encoding for > > musl after the positive reception of a paper for the C Standard > > related to fast conversion routines: > > > > https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html > > I'm somewhat concerned that the C multibyte functions are too broken > to be useful. There is a at least one widely implemented character > set (Big5 as specified for HTML5) which does not fit the model implied > by the standard. Big5 does not have shift states, but a C > implementation using UTF-32 for wchar_t has to pretend it has because > correct conversion from Unicode to Big5 needs lookahead and cannot be > performed one character at a time. I don't think this can be modeled with shift states. C explicitly forbids a stateful wchar_t encoding/multi-wchar_t-characters. Shift states would be meaningful for the other direction. In any case I don't think it really matters. There are no existing implementations with this version of Big5 (with the offending HKSCS characters included) as the locale charset, since it can't work, and there really is no good reason to be adding *new* locale encodings. The reason we (speaking of the larger community; musl doesn't) have non-UTF-8 locales is legacy compatibility for users who need or insist on keeping them. 
If there really is an insistence on using this version of Big5, the characters should be added to Unicode as <compat> characters so that there's an unambiguous one-to-one correspondence, and the people who care about it working should take responsibility for doing that. > This would at least affect the proposed c8rtomb function. > > I posted a brief review of the problematic charsets in glibc here: > > <https://sourceware.org/ml/libc-alpha/2019-05/msg00079.html> I've read it but seemingly not in enough detail to gather what parts are relevant to this conversation. Rich ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-26 2:13 ` Rich Felker @ 2019-12-26 5:43 ` JeanHeyd Meneide 2019-12-30 17:28 ` Rich Felker 2019-12-26 9:43 ` Florian Weimer 1 sibling, 1 reply; 11+ messages in thread From: JeanHeyd Meneide @ 2019-12-26 5:43 UTC (permalink / raw) To: Rich Felker; +Cc: Florian Weimer, musl Dear Rich Felker and Florian Weimer, Thank you for taking an interest in this! On Wed, Dec 25, 2019 at 9:14 PM Rich Felker <dalias@libc.org> wrote: > > On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote: > > > > I'm somewhat concerned that the C multibyte functions are too broken > > to be useful. There is at least one widely implemented character > > set (Big5 as specified for HTML5) which does not fit the model implied > > by the standard. Big5 does not have shift states, but a C > > implementation using UTF-32 for wchar_t has to pretend it has because > > correct conversion from Unicode to Big5 needs lookahead and cannot be > > performed one character at a time. > > I don't think this can be modeled with shift states. C explicitly > forbids a stateful wchar_t encoding/multi-wchar_t-characters. Shift > states would be meaningful for the other direction. > > In any case I don't think it really matters. There are no existing > implementations with this version of Big5 (with the offending HKSCS > characters included) as the locale charset, since it can't work, and > there really is no good reason to be adding *new* locale encodings. > The reason we (speaking of the larger community; musl doesn't) have > non-UTF-8 locales is legacy compatibility for users who need or insist > on keeping them. I have no intention of adding new locale-based charsets (and I absolutely agree that we should not be adding any more). That being said, I want to focus on the main part of this, which is the ability to model existing encodings which have both/either shift states and/or multi-character expansions. 
It is true wchar_t is invariably busted. There is no way a wide character string can be multi-unit: that ship sailed when wchar_t was specified as is, and when it was codified in various APIs such as mbtowc/wctomb and friends. I will, however, note that the paper specifically wants to add the Restartable versions of "single unit" wc and mb to/from functions. The reason I chose the restartable forms is that in C2x a defect report was accepted that clarified the original intent of the single-character functions with respect to their R versions: > "After discussion, the committee concluded that mbstate was already specified to handle this case, and as such the second interpretation is intended. The committee believes that there is an underspecification, and solicited a further paper from the author along the lines of the second option. Although not discussed a Suggested Technical Corrigendum can be found at N2040." -- WG14, April 2016, http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_488 This means that while wcto* and *towc functions are broken, the restartable versions need not be. By returning a value of "0" (e.g., for c8rtomb) or returning a value of "-3" (e.g., for mbrtoc8), we can write out multiple characters based on the input and any data stored in mbstate_t. This allows us to handle e.g. UTF-16 for c16, multi-conversions for c8, and more. My understanding is that for one of the referenced encodings in the linked mailing post (TSCII), this would cover that use case. My understanding is also that for some Big5 encodings it would -- as you stated -- treat ambiguous leading sequences as a shift state, accumulate data in the mbstate_t, and then write out data if the sequence is made unambiguous by having further data provided to the next call. Is my understanding incorrect? Is there an implementation limitation I am missing here? 
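The "-3" protocol described here already exists in C11's mbrtoc16, which returns (size_t)-3 when it writes a second char16_t unit of a previously seen character without consuming input. A minimal sketch of a loop driving it (the helper name mbs_to_c16 is mine, not a proposed interface), assuming a C11 library with <uchar.h>:

```c
#include <stddef.h>
#include <uchar.h>

/* Decode a multibyte string into char16_t units with C11 mbrtoc16.
 * A return of (size_t)-3 means "another unit of the previous input
 * character is pending in mbstate_t" -- no input is consumed. */
size_t mbs_to_c16(const char *src, size_t n, char16_t *dst, size_t dmax)
{
    mbstate_t st = {0};
    size_t out = 0;
    while (n && out < dmax) {
        size_t r = mbrtoc16(&dst[out], src, n, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;       /* invalid or truncated input */
        if (r == (size_t)-3) {       /* pending unit written, no input used */
            out++;
            continue;
        }
        if (r == 0)                  /* null terminator reached */
            break;
        src += r;
        n -= r;
        out++;
    }
    return out;
}
```

In a UTF-8 locale, a four-byte character such as U+1F600 exercises the (size_t)-3 branch: the first call consumes four bytes and writes the high surrogate, the second consumes nothing and writes the low surrogate.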
I would hate to do this the wrong way and make the encoding situation even worse: my goal is to absolutely ensure we can transition legacy encodings to statically-known encodings. Sincerely, JeanHeyd Meneide ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-26 5:43 ` JeanHeyd Meneide @ 2019-12-30 17:28 ` Rich Felker 2019-12-30 18:53 ` JeanHeyd Meneide 0 siblings, 1 reply; 11+ messages in thread From: Rich Felker @ 2019-12-30 17:28 UTC (permalink / raw) To: JeanHeyd Meneide; +Cc: Florian Weimer, musl On Thu, Dec 26, 2019 at 12:43:45AM -0500, JeanHeyd Meneide wrote: > Dear Rich Felker and Florien Weimer, > > Thank you for taking an interest in this! > > On Wed, Dec 25, 2019 at 9:14 PM Rich Felker <dalias@libc.org> wrote: > > > > On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote: > > > > > > I'm somewhat concerned that the C multibyte functions are too broken > > > to be useful. There is a at least one widely implemented character > > > set (Big5 as specified for HTML5) which does not fit the model implied > > > by the standard. Big5 does not have shift states, but a C > > > implementation using UTF-32 for wchar_t has to pretend it has because > > > correct conversion from Unicode to Big5 needs lookahead and cannot be > > > performed one character at a time. > > > > I don't think this can be modeled with shift states. C explicitly > > forbids a stateful wchar_t encoding/multi-wchar_t-characters. Shift > > states would be meaningful for the other direction. > > > > In any case I don't think it really matters. There are no existing > > implementations with this version of Big5 (with the offending HKSCS > > characters included) as the locale charset, since it can't work, and > > there really is no good reason to be adding *new* locale encodings. > > The reason we (speaking of the larger community; musl doesn't) have > > non-UTF-8 locales is legacy compatibility for users who need or insist > > on keeping them. > > I have no intentions on adding new locale-based charsets (and I > absolutely agree that we should not be adding anymore). 
That being > said, I want to focus on the main part of this, which is the ability > to model existing encodings which have both/either shift states and/or > multi-character expansions. I think you misunderstood my remarks here. I was not talking about invention of new charsets (which we seem to agree should not happen), but making it possible to use existing legacy charsets which were previously not usable as a locale's encoding due to limitations of the C APIs. I see making that possible as counter-productive. It does not serve to let users keep doing something they were already doing (compatibility), only to do something newly backwards. > It is true wchar_t is invariably busted. There is no way a wide > character string can be multi-unit: that ship sailed when wchar_t > was specified as is, and when it was codified in various APIs such as > mbtowc/wctomb and friends. I will, however, note that the paper > specifically wants to add the Restartable versions of "single unit" wc > and mb to/from functions. I don't follow. mbrtowc and wcrtomb already exist and have since at least C99. > > "After discussion, the committee concluded that mbstate was > > already specified to handle this case, and as such the second > > interpretation is intended. The committee believes that there is > > an underspecification, and solicited a further paper from the > > author along the lines of the second option. Although not > > discussed a Suggested Technical Corrigendum can be found at > > N2040." -- WG14, April 2016, > > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_488 > > This means that while wcto* and *towc functions are broken, the I don't see them as broken. They support every encoding that has ever worked in the past as the encoding for a locale (tautologically). The only way they're "broken" is if you want to add new locale encodings that weren't previously supportable. > Is my understanding incorrect? 
Is there an implementation > limitation I am missing here? I would hate to do this the wrong way > and make the encoding situation even worse: my goal is to absolutely > ensure we can transition legacy encodings to statically-known > encodings. The C locale API does not exist to convert arbitrary encodings, only the one in use as the locale's encoding. Its purpose is to abstract the concept of how text is encoded in the system's/user's environment such that applications can honor it while (with recent versions of C) being able to determine the identity of characters in terms of Unicode and emit specific characters provided that they're representable in the encoding. Conversion of arbitrary encodings other than the one in use by the locale requires a different API that takes encodings by name or some other identifier. The standard (POSIX) API for this is iconv, which has plenty of limitations of its own, some the same as what you've identified. Rich ^ permalink raw reply [flat|nested] 11+ messages in thread
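For reference, the iconv API named at the end of that message works roughly like the following; a minimal sketch with error handling trimmed, and the encoding names assuming a typical glibc/musl iconv:

```c
#include <iconv.h>
#include <stddef.h>

/* Convert a buffer between two named encodings with POSIX iconv.
 * Returns the number of bytes written, or (size_t)-1 on error. */
size_t convert(const char *to, const char *from,
               char *in, size_t inlen, char *out, size_t outlen)
{
    iconv_t cd = iconv_open(to, from);   /* target first, source second */
    if (cd == (iconv_t)-1)
        return (size_t)-1;
    char *inp = in, *outp = out;
    size_t inleft = inlen, outleft = outlen;
    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return (size_t)-1;
    return outlen - outleft;             /* bytes produced */
}
```

Unlike the C locale functions, the source and destination encodings are chosen by name, independently of the current locale, which is why it is the standard answer for arbitrary-encoding conversion despite its own limitations.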
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-30 17:28 ` Rich Felker @ 2019-12-30 18:53 ` JeanHeyd Meneide 0 siblings, 0 replies; 11+ messages in thread From: JeanHeyd Meneide @ 2019-12-30 18:53 UTC (permalink / raw) To: Rich Felker; +Cc: Florian Weimer, musl On Mon, Dec 30, 2019 at 12:28 PM Rich Felker <dalias@libc.org> wrote: > I think you misunderstood my remarks here. I was not talking about > invention of new charsets (which we seem to agree should not happen), > but making it possible to use existing legacy charsets which were > previously not usable as a locale's encoding due to limitations of the > C APIs. I see making that possible as counter-productive. It does not > serve to let users keep doing something they were already doing > (compatibility), only do to something newly backwards. My goal is to allow developers to go from an encoding they do not control fully (the multibyte encoding) to an encoding they know and can reason about in their program (c8, for example). This is why I am providing the mb -> cNN and wc -> cNN functions in both single-character and string forms. The hope is to make it easy to go from a statically known encoding (modulo difficulties from __STD_C_UTF16/32__ not being defined) to the platform encoding, and vice-versa, using the same style of functions like mb(s)(r)towc(s) and wc(s)(r)tomb(s). > > ... I will, however, note that the paper > > specifically wants to add the Restartable versions of "single unit" wc > > and mb to/from functions. > > I don't follow. mbrtowc and wcrtomb already exist and have since at > least C99. Apologies, I meant doing wc <-> cNN and mb <-> cNN! > > ... > > > > This means that while wcto* and *towc functions are broken, the > > I don't see them as broken. They support every encoding that has ever > worked in the past as the encoding for a locale (tautologically). The > only way they're "broken" is if you want to add new locale encodings > that weren't previously supportable. 
Apologies; this was in reference to wide characters given a non-UTF-32 interpretation on certain platforms like Windows and certain IBM flavors. They chose 16 bits, which can't accommodate Unicode without needing multiple wchar_t. Unfortunately, this means that they were really out of luck before DR488 was accepted: they had no means to return multiple wchar_t for characters outside the 16-bit maximum. With DR488, restartable functions have the potential to convert out properly (albeit the DR was only applied to char16_t functions, so while I have a hope and a wish we can fix it for their platforms it might not work out for the wcto* and *towc functions anyway). char16_t functions, though, should offer those platforms a better way out (though not a perfect one: they'll need to rely on platform knowledge and perform some casts). > ... > > Conversion of arbitrary encodings other than the one in use by the > locale requires a different API that takes encodings by name or some > other identifier. The standard (POSIX) API for this is iconv, which > has plenty of limitations of its own, some the same as what you've > identified. Absolutely agreed! I just want the ones that the platform controls (wide character and multibyte character encodings) to have correct, simple paths to static encodings that can be used for more rigorous text processing. Sincerely, JeanHeyd Meneide ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-26 2:13 ` Rich Felker 2019-12-26 5:43 ` JeanHeyd Meneide @ 2019-12-26 9:43 ` Florian Weimer 1 sibling, 0 replies; 11+ messages in thread From: Florian Weimer @ 2019-12-26 9:43 UTC (permalink / raw) To: Rich Felker; +Cc: JeanHeyd Meneide, musl * Rich Felker: > On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote: >> * JeanHeyd Meneide: >> >> > I hope this e-mail finds you doing well this Holiday Season! I am >> > interested in developing a few fast routines for text encoding for >> > musl after the positive reception of a paper for the C Standard >> > related to fast conversion routines: >> > >> > https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html >> >> I'm somewhat concerned that the C multibyte functions are too broken >> to be useful. There is a at least one widely implemented character >> set (Big5 as specified for HTML5) which does not fit the model implied >> by the standard. Big5 does not have shift states, but a C >> implementation using UTF-32 for wchar_t has to pretend it has because >> correct conversion from Unicode to Big5 needs lookahead and cannot be >> performed one character at a time. > > I don't think this can be modeled with shift states. C explicitly > forbids a stateful wchar_t encoding/multi-wchar_t-characters. Shift > states would be meaningful for the other direction. The intent of the standard appears to be to support this as an extension. It's hard to tell because actual users of the interfaces with legacy charsets do not seem to be represented on the standards committee anymore (see the mblen behavioral change in C11 as evidence supporting this theory). > In any case I don't think it really matters. 
There are no existing > implementations with this version of Big5 (with the offending HKSCS > characters included) as the locale charset, since it can't work, and > there really is no good reason to be adding *new* locale encodings. Do you mean in musl? > The reason we (speaking of the larger community; musl doesn't) have > non-UTF-8 locales is legacy compatibility for users who need or insist > on keeping them. That is true. > If there really is an insistence on using this version of Big5, the > characters should be added to Unicode as <compat> characters so that > there's an unambiguous one-to-one correspondence, and the people who > care about it working should take responsibility for doing that. Yes, I was very surprised this wasn't done for TSCII and HKSCS/Big5. I think even for Big5, it would solve the issue because the decoding process only needs to look at a single multibyte character at a time (I may have suggested otherwise in the past). A succinct description of what is going on for Big5 is here: <https://encoding.spec.whatwg.org/#big5-decoder>, under step 3.3. The conversion is actually fairly simple; it's just hard to fit it into the C interfaces. >> This would at least affect the proposed c8rtomb function. >> >> I posted a brief review of the problematic charsets in glibc here: >> >> <https://sourceware.org/ml/libc-alpha/2019-05/msg00079.html> > > I've read it but seemingly not in enough detail to gather what parts > are relevant to this conversation. It names a few character sets that have fake shift states because the C interfaces cannot otherwise be used with them. Some of the new interfaces are problematic in this context (whether or not UTF-32 is used for wchar_t). I think new interfaces should be compatible with existing implementation practice. The other thing I found surprising is that there are no ASCII-transparent charsets with traditional shift states in glibc. The ASCII-transparent charsets with shift states have these fake shift states. 
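The decoder referenced above can be modeled with nothing more than a held lead byte, which is exactly the "fake shift state" under discussion. A sketch, where big5_lookup is a stand-in for the full WHATWG Big5 index (here it knows only the single real mapping 0xA4 0x40 -> U+4E00), not a real library function:

```c
#include <stdint.h>

/* Byte-at-a-time Big5-style decoder sketch: the only state is a
 * pending lead byte, which is what C's mbstate_t would have to carry. */
struct big5_state { unsigned char lead; };

/* Stub for the full WHATWG Big5 index table. */
static uint32_t big5_lookup(unsigned char lead, unsigned char trail)
{
    if (lead == 0xA4 && trail == 0x40)
        return 0x4E00;          /* Big5 A440 is U+4E00 */
    return 0xFFFD;              /* replacement character in this sketch */
}

/* Push one byte; returns a decoded code point, or 0 if more input is
 * needed.  (A real interface would signal "incomplete" out of band,
 * since 0 collides with NUL; this is only a sketch.) */
uint32_t big5_push_byte(struct big5_state *st, unsigned char b)
{
    if (st->lead) {
        unsigned char lead = st->lead;
        st->lead = 0;
        return big5_lookup(lead, b);
    }
    if (b < 0x80)
        return b;               /* ASCII-transparent range passes through */
    st->lead = b;               /* hold the lead byte: the "shift state" */
    return 0;
}
```

The point being that the held lead byte is ordinary decoder state, simple in itself; the awkwardness is only in expressing it through mbstate_t and the one-character-at-a-time C interfaces.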
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-24 23:06 [ Guidance ] Potential New Routines; Requesting Help JeanHeyd Meneide 2019-12-25 20:07 ` Florian Weimer @ 2019-12-30 17:31 ` Rich Felker 2019-12-30 18:39 ` JeanHeyd Meneide 1 sibling, 1 reply; 11+ messages in thread From: Rich Felker @ 2019-12-30 17:31 UTC (permalink / raw) To: JeanHeyd Meneide; +Cc: musl On Tue, Dec 24, 2019 at 06:06:50PM -0500, JeanHeyd Meneide wrote: > Dear musl Maintainers and Contributors, > > I hope this e-mail finds you doing well this Holiday Season! I am > interested in developing a few fast routines for text encoding for > musl after the positive reception of a paper for the C Standard > related to fast conversion routines: > > https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html This is interesting, but I'm trying to understand the motivation. If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the proposed functions are just the identity (for the c32 ones) and UTF-16/32 conversion. If it's not defined, you have the same problem as the current mb/cNN functions: there's no reason to believe arbitrary Unicode characters can round-trip through wchar_t any better than they can through multibyte characters. In fact on such implementations it's likely that wchar_t meanings are locale-dependent and just a remapping of the byte/multibyte characters. What situation do you envision where the proposed functions let you reliably do something that's not already possible? > While I have a basic implementation, I would like to use some > processor and compiler intrinsics to make it faster and make sure my > first contribution meets both quality and speed standards for a C > library. > > Is there a place in the codebase I can look to for guidance on > how to handle intrinsics properly within musl libc? 
If there is > already infrastructure and common idioms in place, I would rather use > that then starting to spin up my own. I'm not sure what you mean by intrinsics or why you're looking for them but I guess you're thinking of something as a performance optimization? musl favors having code in straight simple C except when there's a strong reason (known bottleneck in existing real-world software -- things like memcpy, strlen, etc.) to do otherwise. The existing mb/wc code is slightly "vectorized" (see mbsrtowcs) but doing so was probably a mistake. The motivation came along with one of the early motivations for musl: not making UTF-8 a major performance regression like it was in glibc. But it turned out the bigger issue was the performance of character-at-a-time and byte-at-a-time conversions, not bulk conversion. If we do adopt these functions, the right way to do it would be using them to refactor the existing c16/c32 functions. Basically, for example, the bulk of c16rtomb would become c16rtowc, and c16rtomb would be replaced with a call to c16rtowc followed by wctomb. And the string ones can all be simple loop wrappers. Rich ^ permalink raw reply [flat|nested] 11+ messages in thread
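A sketch of that suggested refactor, under the assumption of a UTF-32 wchar_t as in musl; the name c16rtowc_sketch and its state struct are hypothetical illustrations, not existing interfaces:

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical c16rtowc: pair UTF-16 surrogates into one wchar_t.
 * Returns 1 when a wide character was produced, 0 when a high
 * surrogate is buffered, (size_t)-1 on an unpaired surrogate. */
struct c16state { unsigned short high; };

size_t c16rtowc_sketch(wchar_t *pwc, unsigned short c16, struct c16state *st)
{
    if (st->high) {
        if (c16 < 0xDC00 || c16 > 0xDFFF)
            return (size_t)-1;               /* lone high surrogate */
        *pwc = 0x10000 + ((wchar_t)(st->high - 0xD800) << 10)
                       + (c16 - 0xDC00);
        st->high = 0;
        return 1;
    }
    if (c16 >= 0xD800 && c16 <= 0xDBFF) {
        st->high = c16;                      /* wait for the low half */
        return 0;
    }
    if (c16 >= 0xDC00 && c16 <= 0xDFFF)
        return (size_t)-1;                   /* lone low surrogate */
    *pwc = c16;
    return 1;
}

/* c16rtomb then becomes: convert to wchar_t, let wctomb encode. */
int c16rtomb_sketch(char *s, unsigned short c16, struct c16state *st)
{
    wchar_t wc;
    size_t r = c16rtowc_sketch(&wc, c16, st);
    if (r == (size_t)-1) return -1;
    if (r == 0) return 0;                    /* high surrogate buffered */
    return wctomb(s, wc);
}
```

With the character-level pieces factored this way, the string forms reduce to loops over them, which matches the "simple loop wrappers" remark.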
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-30 17:31 ` Rich Felker @ 2019-12-30 18:39 ` JeanHeyd Meneide 2019-12-30 19:57 ` Rich Felker 0 siblings, 1 reply; 11+ messages in thread From: JeanHeyd Meneide @ 2019-12-30 18:39 UTC (permalink / raw) To: Rich Felker; +Cc: musl On Mon, Dec 30, 2019 at 12:31 PM Rich Felker <dalias@libc.org> wrote: > > ... > This is interesting, but I'm trying to understand the motivation. > > If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the > proposed functions are just the identity (for the c32 ones) and > UTF-16/32 conversion. > > If it's not defined, you have the same problem as the current mb/cNN > functions: there's no reason to believe arbitrary Unicode characters > can round-trip through wchar_t any better than they can through > multibyte characters. In fact on such implementations it's likely that > wchar_t meanings are locale-dependent and just a remapping of the > byte/multibyte characters. I'm sorry, I'll try to phrase it as best as I can. The issue I and others have with the lack of cNNtowc is that, if we are to write standards-compliant C, the only way to do that transformation from, for example, char16_t data to wchar_t portably is: c16rtomb -> multibyte data -> mbrtowc The problem with such a conversion sequence is that there are many legacy encodings and this causes bugs on many users' machines. Text representable in both char16_t and wchar_t is lost in the middle because the intermediate encoding cannot represent it, leaving us losing data in both directions between wchar_t and char16_t. This has been frustrating for a number of users who try to rely on the standard, only to have to write the above conversion sequence and fail. Thus, providing a direct function with no intermediates results in a better Standard C experience. A minor but still helpful secondary motivation is in giving people on certain long-standing platforms a way out. 
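The c16rtomb -> mbrtowc route just described can be spelled out with the standard C11 functions; the middle hop fails (c16rtomb returns (size_t)-1) whenever the locale's multibyte encoding cannot represent the character, which is exactly where the data loss occurs. The helper name is mine; it converts one character's worth of char16_t units (a single unit or a surrogate pair):

```c
#include <limits.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* Portable-C11-only route from char16_t to wchar_t: go through the
 * locale's multibyte encoding.  Returns bytes consumed by mbrtowc,
 * or (size_t)-1 if the locale cannot represent the character. */
size_t c16_to_wc_via_mb(wchar_t *pwc, const char16_t *src, size_t n)
{
    mbstate_t st1 = {0}, st2 = {0};
    char buf[MB_LEN_MAX];
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        size_t r = c16rtomb(buf + len, src[i], &st1);
        if (r == (size_t)-1)
            return (size_t)-1;   /* lost in the middle hop */
        len += r;                /* high surrogate contributes 0 bytes */
    }
    return mbrtowc(pwc, buf, len, &st2);
}
```

In a legacy non-Unicode locale, any char16_t character outside that locale's repertoire dies at the c16rtomb step even though wchar_t could have represented it, which is the failure mode being complained about.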
By definition, UTF-16 does not work with wchar_t, so I am explicitly told that treating wchar_t on a platform like e.g. Windows as UCS-2 (the single-unit predecessor of UTF-16, deprecated a while ago) is wrong when using the Standard Library if I want real Unicode support. Library developers tell me to rely on platform-specific APIs. The "use MultiByteToWideChar", "use ICU", or "use this AIX-specific function" advice makes it much less of a Standard way to handle text: hence the paper to the WG14 C Committee. The restartable versions of the single-character functions and the bulk conversion functions give implementations locked into behaving like the deprecated 16-bit single-unit UCS-2 a way out, and also allow lossless data conversion. This reasoning might be a little bit "overdone" for libraries like musl and glibc that got wchar_t right (thank you!), but part of standardizing these things means I have to account for implementations that have been around longer than I have been alive. :) Does that make sense? > What situation do you envision where the proposed functions let you > reliably do something that's not already possible? My understanding is that libraries such as musl are "blessed" as distributions of the Standard Library, and that they can access system information that makes it possible for them to utilize what the current "wchar_t encoding" is in a way normal, regular developers cannot. Specifically, in the generic external implementation I have been working on, I have a number of #ifdef to check for, say, IBM machines, then check if they are specifically under zh/tw or even jp locales, because they deploy a wchar_t in these scenarios that is neither UTF-16 nor UTF-32 (but instead a flavor of one of the GB encodings and Japanese encodings); otherwise, IBM uses UTF16/UCS-2 for wchar_t in i686 and UTF-32 for wchar_t in x86_64 for certain machines. I also check for what happens on Windows under various settings as well. 
Doing this as an external library is hard, because there is no way I can control those knobs reliably, but a Standard Library distribution would have access to that information (since they are providing such functions already). So, for example, musl -- being the C library -- controls how the wchar_t should behave (modulo compiler intervention) for its wide character functions. Similarly, glibc would know what to do for its platforms, and IBM would know what to do for its platforms, and so on and so forth. Each distribution would provide behavior in coordination with their platform. Is this incorrect? Am I assuming a level of standard library <-> vendor relation/cooperation that does not exist? > > While I have a basic implementation, I would like to use some > > processor and compiler intrinsics to make it faster and make sure my > > first contribution meets both quality and speed standards for a C > > library. > > > > Is there a place in the codebase I can look to for guidance on > > how to handle intrinsics properly within musl libc? If there is > > already infrastructure and common idioms in place, I would rather use > > that than spin up my own. > > I'm not sure what you mean by intrinsics or why you're looking for > them but I guess you're thinking of something as a performance > optimization? musl favors having code in straight simple C except when > there's a strong reason (known bottleneck in existing real-world > software -- things like memcpy, strlen, etc.) to do otherwise. The > existing mb/wc code is slightly "vectorized" (see mbsrtowcs) but doing > so was probably a mistake. The motivation came along with one of the > early motivations for musl: not making UTF-8 a major performance > regression like it was in glibc. But it turned out the bigger issue > was the performance of character-at-a-time and byte-at-a-time > conversions, not bulk conversion. 
My experience so far is that the character-at-a-time functions can cause severe performance penalties for external users, especially if the library is dynamically linked. If the C standard provides the bulk-conversion functions, performance would increase drastically for users desiring bulk conversion (because they do not have to write a loop around a dynamically-loaded function call to do conversions one-at-a-time). I am glad that musl has had similar experience, and would like to make the bulk functions available in musl too! My asking about intrinsics and such was that I have some optimizations using hand-vectorized instructions for some bulk cases. I will be more than happy to just contribute regular and readable plain C, though, and then revisit such functions if it turns out that vectorization with SIMD and other instructions for various platforms turns out to be worth it. My initial hunch is that it is, but I'm more than happy to focus on correctness first, extreme performance (maybe) later. > If we do adopt these functions, the right way to do it would be using > them to refactor the existing c16/c32 functions. Basically, for > example, the bulk of c16rtomb would become c16rtowc, and c16rtomb > would be replaced with a call to c16rtowc followed by wctomb. And the > string ones can all be simple loop wrappers. I would be more than happy to write the implementation as such! Most of the wchar_t functions will be very easy since musl and glibc chose the right wchar_t. (Talking to other vendors is going to be a much, much more difficult conversation...) Best Wishes, JeanHeyd Meneide ^ permalink raw reply [flat|nested] 11+ messages in thread
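The "simple loop wrapper" shape of the bulk string functions discussed above can be sketched as follows; mbsrtoc32s_sketch is a hypothetical name following the paper's pattern, built on the existing C11 mbrtoc32:

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical bulk form: convert a null-terminated multibyte string
 * to char32_t by looping the existing single-character mbrtoc32,
 * mirroring the mbsrtowcs calling convention.  Returns characters
 * written, or (size_t)-1 on an encoding error. */
size_t mbsrtoc32s_sketch(char32_t *dst, const char **src, size_t dmax,
                         mbstate_t *st)
{
    size_t out = 0;
    const char *s = *src;
    while (out < dmax) {
        size_t r = mbrtoc32(&dst[out], s, (size_t)-1, st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;
        if (r == (size_t)-3) {   /* extra unit, no input consumed */
            out++;
            continue;
        }
        if (r == 0) {            /* null terminator converted */
            *src = NULL;
            return out;
        }
        s += r;
        out++;
    }
    *src = s;                    /* stopped on a full buffer */
    return out;
}
```

Shipping this loop inside the library rather than asking applications to write it around a dynamically linked per-character call is precisely the performance argument made above: the call overhead is paid once per buffer, not once per character.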
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-30 18:39 ` JeanHeyd Meneide @ 2019-12-30 19:57 ` Rich Felker 2019-12-31 3:58 ` JeanHeyd Meneide 0 siblings, 1 reply; 11+ messages in thread From: Rich Felker @ 2019-12-30 19:57 UTC (permalink / raw) To: JeanHeyd Meneide; +Cc: musl On Mon, Dec 30, 2019 at 01:39:10PM -0500, JeanHeyd Meneide wrote: > On Mon, Dec 30, 2019 at 12:31 PM Rich Felker <dalias@libc.org> wrote: > > > ... > > This is interesting, but I'm trying to understand the motivation. > > > > If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the > > proposed functions are just the identity (for the c32 ones) and > > UTF-16/32 conversion. > > > > If it's not defined, you have the same problem as the current mb/cNN > > functions: there's no reason to believe arbitrary Unicode characters > > can round-trip through wchar_t any better than they can through > > multibyte characters. In fact on such implementations it's likely that > > wchar_t meanings are locale-dependent and just a remapping of the > > byte/multibyte characters. > > I'm sorry, I'll try to phrase it as best as I can. > > The issue I and others have with the lack of cNNtowc is that, if > we are to write standards-compliant C, the only way to do that > transformation from, for example, char16_t data to wchar_t portably > is: > > c16rtomb -> multibyte data -> mbrtowc > > The problem with such a conversion sequence is that there are > many legacy encodings and this causes bugs on many user's machines. > Text representable in both char16_t and wchar_t is lost in the middle: > due to the middle not handling it, putting us in a place where we lose > of data going to and from wchar_t to char16_t. This has been > frustrating for a number of users who try to rely on the standard, > only to have to write the above conversions sequence and fail. Thus, > providing a direct function with no intermediates results in a better > Standard C experience. 
> A minor but still helpful secondary motivation is in giving people
> on certain long-standing platforms a way out. By definition, UTF-16
> does not work with wchar_t, so on a platform like e.g. Windows,
> wchar_t is UCS-2 (the non-multi-unit version of UTF-16 that was
> deprecated a while ago), and I am explicitly told that using the
> Standard Library is wrong if I want real Unicode support. Library
> developers tell me to rely on platform-specific APIs. The "use
> MultiByteToWideChar" or "use ICU" or "use this AIX-specific function"
> answers make it much less of a Standard way to handle text: hence the
> paper to the WG14 C Committee. The restartable versions of the
> single-character functions and the bulk conversion functions give
> implementations locked to behaving like the deprecated UCS-2,
> 16-bit-single-unit encoding a way out, and also allow us to have
> lossless data conversion.

I don't think these interfaces give you an "out" in a way that's fully
conforming. The C model is that there's a set of characters supported
in the current locale, and each of them has one or more multibyte
representations (possibly involving shift states) and a single wide
character representation. Converting between UTF-16 or UTF-32 and
wchar_t outside the scope of characters that exist in the current
locale isn't presently a meaningful concept, and wouldn't enable you
to get meaningful results from wctype.h functions, etc. (Would you
propose having a second set of such functions for char32_t to handle
that? Really it sounds like what you want is an out to deprecate
wchar_t and use char32_t in its place, which wouldn't be a bad
idea...)

Solving these problems for implementations burdened by a legacy *wrong
choice* of definition of wchar_t is not possible by adding more
interfaces alone; it requires a lot of changes to the underlying
abstract model of what a character is in C. I'm not really in favor of
such changes.
They complicate and burden existing working implementations for the
sake of ones that made bad choices. Windows in particular *can* and
*should* fix wchar_t to be 32-bit. The Windows API uses WCHAR, not
wchar_t, anyway, so a change in wchar_t is really not a big deal for
interface compatibility, and Windows has conformance problems like
wprintf treating %s/%ls incorrectly that require breaking changes to
fix. Good stdlib implementations on Windows already fix these things.

> This reasoning might be a little bit "overdone" for libraries like
> musl and glibc, which got wchar_t right (thank you!), but part of
> standardizing these things is that I have to account for
> implementations that have been around longer than I have been alive.
> :) Does that make sense?
>
> > What situation do you envision where the proposed functions let
> > you reliably do something that's not already possible?
>
> My understanding is that libraries such as musl are "blessed" as
> distributions of the Standard Library, and that they can access
> system information that makes it possible for them to utilize the
> current "wchar_t encoding" in a way normal, regular developers
> cannot. Specifically, in the generic external implementation I have
> been working on, I have a number of #ifdef checks for, say, IBM
> machines, and then whether they are specifically under zh/tw or even
> jp locales, because they deploy a wchar_t in those scenarios that is
> neither UTF-16 nor UTF-32 (but instead a flavor of one of the GB and
> Japanese encodings); otherwise, IBM uses UTF-16/UCS-2 for wchar_t on
> i686 and UTF-32 for wchar_t on x86_64 for certain machines. I also
> check what happens on Windows under various settings. Doing this as
> an external library is hard, because there is no way I can control
> those knobs reliably, but a Standard Library distribution has access
> to that information (since it already provides such functions).
The __STDC_ISO_10646__ macro is the way to determine that the encoding
of wchar_t is Unicode (or some subset, if WCHAR_MAX doesn't admit the
full range). Otherwise it's not something you can meaningfully work
with except as an abstract number, but in that case you just want to
avoid it as much as possible and convert directly between multibyte
characters and char16_t/char32_t. I don't see how converting directly
between wchar_t and char16_t/char32_t is more useful, even if it is a
prettier factorization of the code.

A far more useful thing to know than the wchar_t encoding is the
multibyte encoding. POSIX gives you this in nl_langinfo(CODESET), but
plain C has no equivalent. I'd actually like to see WG14 adopt this
into plain C.

> > > While I have a basic implementation, I would like to use some
> > > processor and compiler intrinsics to make it faster and make
> > > sure my first contribution meets both quality and speed
> > > standards for a C library.
> > >
> > > Is there a place in the codebase I can look to for guidance on
> > > how to handle intrinsics properly within musl libc? If there is
> > > already infrastructure and common idioms in place, I would
> > > rather use that than spinning up my own.
> >
> > I'm not sure what you mean by intrinsics or why you're looking for
> > them, but I guess you're thinking of something as a performance
> > optimization? musl favors having code in straight simple C except
> > when there's a strong reason (known bottleneck in existing
> > real-world software -- things like memcpy, strlen, etc.) to do
> > otherwise. The existing mb/wc code is slightly "vectorized" (see
> > mbsrtowcs), but doing so was probably a mistake. The motivation
> > came along with one of the early motivations for musl: not making
> > UTF-8 a major performance regression like it was in glibc. But it
> > turned out the bigger issue was the performance of
> > character-at-a-time and byte-at-a-time conversions, not bulk
> > conversion.
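[Editor's note: the two probes Rich mentions can be sketched in a few
lines. __STDC_ISO_10646__ is a compile-time check from standard C;
nl_langinfo(CODESET) is POSIX-only, with no plain-C equivalent, which
is exactly the gap he'd like WG14 to close.]

```c
#include <langinfo.h>   /* nl_langinfo, CODESET (POSIX) */
#include <locale.h>

/* Compile-time: is wchar_t a Unicode (ISO 10646) code point? */
static int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;   /* wchar_t values are ISO 10646 code points */
#else
    return 0;   /* wchar_t is an abstract, possibly locale-dependent number */
#endif
}

/* Run-time: name of the current locale's multibyte encoding,
 * e.g. "UTF-8", or "ANSI_X3.4-1968" in the default C locale. */
static const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}
```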
> My experience so far is that the character-at-a-time functions can
> cause severe performance penalties for external users, especially if
> the library is dynamically linked.

On musl (where I'm familiar with performance properties),
byte-at-a-time conversion is roughly half the speed of bulk, which
looks big but is diminishingly so if you're actually doing something
with the result (just converting to wchar_t for its own sake is not
very useful). Character-at-a-time is probably somewhat less slow than
byte-at-a-time. When I wrote this I put in heavy effort to make
byte/character-at-a-time not horribly slow, because it's normally the
natural programming model. Wide character strings are not an idiomatic
type to work with in C.

Rich

^ permalink raw reply	[flat|nested] 11+ messages in thread
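[Editor's note: the two programming models being compared above --
bulk conversion in one library call versus a character-at-a-time loop
-- can be sketched as follows. Both produce the same wide string; the
roughly 2x throughput gap is Rich's figure for musl, not something
this sketch measures.]

```c
#include <string.h>
#include <wchar.h>

/* Bulk model: one call converts the whole string. */
static size_t convert_bulk(wchar_t *dst, const char *src, size_t n)
{
    mbstate_t st = {0};
    return mbsrtowcs(dst, &src, n, &st);
}

/* Per-character model: one mbrtowc call per character, the loop the
 * caller would otherwise write around a dynamically-linked function. */
static size_t convert_per_char(wchar_t *dst, const char *src, size_t n)
{
    mbstate_t st = {0};
    size_t len = strlen(src), out = 0;
    while (len && out < n) {
        size_t r = mbrtowc(&dst[out], src, len, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;      /* invalid or truncated sequence */
        src += r;
        len -= r;
        out++;
    }
    if (out < n)
        dst[out] = L'\0';
    return out;                     /* wide chars written, like mbsrtowcs */
}
```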
* Re: [ Guidance ] Potential New Routines; Requesting Help
  2019-12-30 19:57 ` Rich Felker
@ 2019-12-31  3:58 ` JeanHeyd Meneide
  0 siblings, 0 replies; 11+ messages in thread
From: JeanHeyd Meneide @ 2019-12-31 3:58 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On Mon, Dec 30, 2019 at 2:57 PM Rich Felker <dalias@libc.org> wrote:
> I don't think these interfaces give you an "out" in a way that's
> fully conforming. The C model is that there's a set of characters
> supported in the current locale, and each of them has one or more
> multibyte representations (possibly involving shift states) and a
> single wide character representation. Converting between UTF-16 or
> UTF-32 and wchar_t outside the scope of characters that exist in the
> current locale isn't presently a meaningful concept, and wouldn't
> enable you to get meaningful results from wctype.h functions, etc.
> (Would you propose having a second set of such functions for
> char32_t to handle that? Really it sounds like what you want is an
> out to deprecate wchar_t and use char32_t in its place, which
> wouldn't be a bad idea...)

This is actually something I am extremely interested in tackling. But
first I need to make sure everyone can get their data in current
applications from multibyte and wide characters to char32_t. Then a
potential <uctype.h> can be worked on that takes case mapping, case
folding, and all of the other useful things Unicode has brought to the
table and makes them work with Unicode code points. One of the things
I saw before is that a previous proposal to extend wctype.h with other
functions was very large, and despite being well motivated it did not
succeed in WG14.

Also on my list of things is the fact that char16_t and char32_t do
not necessarily have to be Unicode (__STDC_UTF_32__ and friends).
This means that if we settle on char32_t for these interfaces, we may
set a potential trap for users who migrate and then try to port to
platforms where char16_t does not mean UTF-16 and char32_t does not
mean UTF-32. In coordinating with a few static analysis vendors who
cover a very large range of compiler implementations, both C and C++,
they have reportedly not yet found a compiler that makes char16_t and
char32_t anything other than UTF-16 and UTF-32 (some platforms forget
to define the macros but still use those encodings). I hope that in
the future a paper can be brought to WG14 to make those encodings
required for char16_t/char32_t, rather than checking the macro and
leaving users out to dry. Right now everything de facto works, but I
worry...

Still, I want to introduce each logical piece of functionality in its
own paper, with its own scope and motivation. This, in my opinion,
seems to work much better: work on transition and replacement, then
deprecate the things we know from experience are bad. I don't know if
my plan is going to work, but having nobody vote against my first ever
WG14 proposal is a good start, and I want to be careful not to get
stuck in Committee on mega-proposals that scare people.

> Solving these problems for implementations burdened by a legacy
> *wrong choice* of definition of wchar_t is not possible by adding
> more interfaces alone; it requires a lot of changes to the
> underlying abstract model of what a character is in C. I'm not
> really in favor of such changes. They complicate and burden existing
> working implementations for the sake of ones that made bad choices.
> Windows in particular *can* and *should* fix wchar_t to be 32-bit.
> The Windows API uses WCHAR, not wchar_t, anyway, so that a change in
> wchar_t is really not a big deal for interface compatibility, and
> has conformance problems like wprintf treating %s/%ls incorrectly
> that require breaking changes to fix. Good stdlib implementations on
> Windows already fix these things.

They should, absolutely.
Still, I think that preventing lossy conversions for wchar_t usage on
platforms where the wide character is used to interface with the
system is a worthwhile endeavor. I don't think it is feasible (or
would ever fly in WG14) to change what wchar_t is and how it behaves:
I would rather invest time in implementing interfaces that can offer
better and more complete functionality. I'm trying to keep my changes
well-scoped, motivated, and small.

> The __STDC_ISO_10646__ macro is the way to determine that the
> encoding of wchar_t is Unicode (or some subset if WCHAR_MAX doesn't
> admit the full range). Otherwise it's not something you can
> meaningfully work with except as an abstract number, but in that
> case you just want to avoid it as much as possible and convert
> directly between multibyte characters and char16_t/char32_t. I
> don't see how converting directly between wchar_t and
> char16_t/char32_t is more useful, even if it is a prettier
> factorization of the code.

It is an abstract number with no meaning to the developer, but the
platform (e.g., IBM using various GB encodings for wchar_t on certain
platforms where __STDC_ISO_10646__ is not defined) knows that meaning.
My intention is that by letting the Standard Library and the platform
handle it, you can get from a blob of abstract numbers to meaningful
text in a Standard way. Not only for wchar_t, but for multibyte
strings too.

> A far more useful thing to know than wchar_t encoding is the
> multibyte encoding. POSIX gives you this in nl_langinfo(CODESET)
> but plain C has no equivalent. I'd actually like to see WG14 adopt
> this into plain C.

This is actually something I am considering! There are a few sister
papers related to this percolating through another Standards Committee
right now; I want to see how that goes before bringing it to WG14.
But I think that functionality should come in addition to - not
instead of - additional conversion functions.
Platforms own the wchar_t and multibyte char encodings: if users have
to write conversion routines themselves after checking the equivalent
of nl_langinfo, we may end up with incomplete or half-done support for
encodings in many programs!

> On musl (where I'm familiar with performance properties),
> byte-at-a-time conversion is roughly half the speed of bulk, which
> looks big but is diminishingly so if you're actually doing something
> with the result (just converting to wchar_t for its own sake is not
> very useful). Character-at-a-time is probably somewhat less slow
> than byte-at-a-time. When I wrote this I put in heavy effort to make
> byte/character-at-a-time not horribly slow, because it's normally
> the natural programming model. Wide character strings are not an
> idiomatic type to work with in C.

If it is still okay, I will put my best effort into making sure the
character-at-a-time and similar functions are something you and other
musl contributors can be happy with!

Sincerely,
JeanHeyd

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2019-12-31  3:58 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-24 23:06 [ Guidance ] Potential New Routines; Requesting Help JeanHeyd Meneide
2019-12-25 20:07 ` Florian Weimer
2019-12-26  2:13   ` Rich Felker
2019-12-26  5:43     ` JeanHeyd Meneide
2019-12-30 17:28       ` Rich Felker
2019-12-30 18:53         ` JeanHeyd Meneide
2019-12-26  9:43   ` Florian Weimer
2019-12-30 17:31 ` Rich Felker
2019-12-30 18:39   ` JeanHeyd Meneide
2019-12-30 19:57     ` Rich Felker
2019-12-31  3:58       ` JeanHeyd Meneide
Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox; see mirroring instructions for how to clone
and mirror all data and code used for this inbox, as well as URLs for
NNTP newsgroup(s).