* [ Guidance ] Potential New Routines; Requesting Help @ 2019-12-24 23:06 JeanHeyd Meneide 2019-12-25 20:07 ` Florian Weimer 2019-12-30 17:31 ` Rich Felker 0 siblings, 2 replies; 11+ messages in thread From: JeanHeyd Meneide @ 2019-12-24 23:06 UTC (permalink / raw) To: musl Dear musl Maintainers and Contributors, I hope this e-mail finds you doing well this Holiday Season! I am interested in developing a few fast routines for text encoding for musl after the positive reception of a paper for the C Standard related to fast conversion routines: https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html While I have a basic implementation, I would like to use some processor and compiler intrinsics to make it faster and make sure my first contribution meets both quality and speed standards for a C library. Is there a place in the codebase I can look to for guidance on how to handle intrinsics properly within musl libc? If there is already infrastructure and common idioms in place, I would rather use that than spin up my own. Sincerely, JeanHeyd Meneide ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-24 23:06 [ Guidance ] Potential New Routines; Requesting Help JeanHeyd Meneide @ 2019-12-25 20:07 ` Florian Weimer 2019-12-26 2:13 ` Rich Felker 2019-12-30 17:31 ` Rich Felker 1 sibling, 1 reply; 11+ messages in thread From: Florian Weimer @ 2019-12-25 20:07 UTC (permalink / raw) To: JeanHeyd Meneide; +Cc: musl * JeanHeyd Meneide: > I hope this e-mail finds you doing well this Holiday Season! I am > interested in developing a few fast routines for text encoding for > musl after the positive reception of a paper for the C Standard > related to fast conversion routines: > > https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html I'm somewhat concerned that the C multibyte functions are too broken to be useful. There is at least one widely implemented character set (Big5 as specified for HTML5) which does not fit the model implied by the standard. Big5 does not have shift states, but a C implementation using UTF-32 for wchar_t has to pretend it has because correct conversion from Unicode to Big5 needs lookahead and cannot be performed one character at a time. This would at least affect the proposed c8rtomb function. I posted a brief review of the problematic charsets in glibc here: <https://sourceware.org/ml/libc-alpha/2019-05/msg00079.html> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-25 20:07 ` Florian Weimer @ 2019-12-26 2:13 ` Rich Felker 2019-12-26 5:43 ` JeanHeyd Meneide 2019-12-26 9:43 ` Florian Weimer 0 siblings, 2 replies; 11+ messages in thread From: Rich Felker @ 2019-12-26 2:13 UTC (permalink / raw) To: Florian Weimer; +Cc: JeanHeyd Meneide, musl On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote: > * JeanHeyd Meneide: > > > I hope this e-mail finds you doing well this Holiday Season! I am > > interested in developing a few fast routines for text encoding for > > musl after the positive reception of a paper for the C Standard > > related to fast conversion routines: > > > > https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html > > I'm somewhat concerned that the C multibyte functions are too broken > to be useful. There is a at least one widely implemented character > set (Big5 as specified for HTML5) which does not fit the model implied > by the standard. Big5 does not have shift states, but a C > implementation using UTF-32 for wchar_t has to pretend it has because > correct conversion from Unicode to Big5 needs lookahead and cannot be > performed one character at a time. I don't think this can be modeled with shift states. C explicitly forbids a stateful wchar_t encoding/multi-wchar_t-characters. Shift states would be meaningful for the other direction. In any case I don't think it really matters. There are no existing implementations with this version of Big5 (with the offending HKSCS characters included) as the locale charset, since it can't work, and there really is no good reason to be adding *new* locale encodings. The reason we (speaking of the larger community; musl doesn't) have non-UTF-8 locales is legacy compatibility for users who need or insist on keeping them. 
If there really is an insistence on using this version of Big5, the characters should be added to Unicode as <compat> characters so that there's an unambiguous one-to-one correspondence, and the people who care about it working should take responsibility for doing that. > This would at least affect the proposed c8rtomb function. > > I posted a brief review of the problematic charsets in glibc here: > > <https://sourceware.org/ml/libc-alpha/2019-05/msg00079.html> I've read it but seemingly not in enough detail to gather what parts are relevant to this conversation. Rich ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-26 2:13 ` Rich Felker @ 2019-12-26 5:43 ` JeanHeyd Meneide 2019-12-30 17:28 ` Rich Felker 2019-12-26 9:43 ` Florian Weimer 1 sibling, 1 reply; 11+ messages in thread From: JeanHeyd Meneide @ 2019-12-26 5:43 UTC (permalink / raw) To: Rich Felker; +Cc: Florian Weimer, musl Dear Rich Felker and Florian Weimer, Thank you for taking an interest in this! On Wed, Dec 25, 2019 at 9:14 PM Rich Felker <dalias@libc.org> wrote: > > On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote: > > > > I'm somewhat concerned that the C multibyte functions are too broken > > to be useful. There is at least one widely implemented character > > set (Big5 as specified for HTML5) which does not fit the model implied > > by the standard. Big5 does not have shift states, but a C > > implementation using UTF-32 for wchar_t has to pretend it has because > > correct conversion from Unicode to Big5 needs lookahead and cannot be > > performed one character at a time. > > I don't think this can be modeled with shift states. C explicitly > forbids a stateful wchar_t encoding/multi-wchar_t-characters. Shift > states would be meaningful for the other direction. > > In any case I don't think it really matters. There are no existing > implementations with this version of Big5 (with the offending HKSCS > characters included) as the locale charset, since it can't work, and > there really is no good reason to be adding *new* locale encodings. > The reason we (speaking of the larger community; musl doesn't) have > non-UTF-8 locales is legacy compatibility for users who need or insist > on keeping them. I have no intention of adding new locale-based charsets (and I absolutely agree that we should not be adding any more). That being said, I want to focus on the main part of this, which is the ability to model existing encodings which have both/either shift states and/or multi-character expansions. 
It is true wchar_t is invariably busted. There is no way a wide character string can be multi-unit: that ship sailed when wchar_t was specified as is, and when it was codified in various APIs such as mbtowc/wctomb and friends. I will, however, note that the paper specifically wants to add the Restartable versions of "single unit" wc and mb to/from functions. The reason I chose the restartable forms is that in C2x a defect report was accepted that clarified the original intent of the single-character functions with respect to their R versions: > "After discussion, the committee concluded that mbstate was already specified to handle this case, and as such the second interpretation is intended. The committee believes that there is an underspecification, and solicited a further paper from the author along the lines of the second option. Although not discussed a Suggested Technical Corrigendum can be found at N2040." -- WG14, April 2016, http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_488 This means that while wcto* and *towc functions are broken, the restartable versions need not be. By returning a value of "0" (e.g., for c8rtomb) or returning a value of "-3" (e.g., for mbrtoc8), we can write out multiple characters based on the input and any data stored in mbstate_t. This allows us to handle e.g. UTF-16 for c16, multi-conversions for c8, and more. My understanding is that for one of the referenced encodings in the linked mailing post (TSCII), this would cover that use case. My understanding is also that for some Big5 encodings it would -- as you stated -- treat ambiguous leading sequences as a shift state, accumulate data in the mbstate_t, and then write out data if the sequence is made unambiguous by having further data provided to the next call. Is my understanding incorrect? Is there an implementation limitation I am missing here? 
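The "-3" protocol described here already exists in C11's mbrtoc16, which returns (size_t)-3 when it writes a second char16_t unit of a previously seen character without consuming input. A minimal sketch of a loop driving it (the helper name mbs_to_c16 is mine, not a proposed interface), assuming a C11 library with <uchar.h>:

```c
#include <stddef.h>
#include <uchar.h>

/* Decode a multibyte string into char16_t units with C11 mbrtoc16.
 * A return of (size_t)-3 means "another unit of the previous input
 * character is pending in mbstate_t" -- no input is consumed. */
size_t mbs_to_c16(const char *src, size_t n, char16_t *dst, size_t dmax)
{
    mbstate_t st = {0};
    size_t out = 0;
    while (n && out < dmax) {
        size_t r = mbrtoc16(&dst[out], src, n, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;       /* invalid or truncated input */
        if (r == (size_t)-3) {       /* pending unit written, no input used */
            out++;
            continue;
        }
        if (r == 0)                  /* null terminator reached */
            break;
        src += r;
        n -= r;
        out++;
    }
    return out;
}
```

In a UTF-8 locale, a four-byte character such as U+1F600 exercises the (size_t)-3 branch: the first call consumes four bytes and writes the high surrogate, the second consumes nothing and writes the low surrogate.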
I would hate to do this the wrong way and make the encoding situation even worse: my goal is to absolutely ensure we can transition legacy encodings to statically-known encodings. Sincerely, JeanHeyd Meneide ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-26 5:43 ` JeanHeyd Meneide @ 2019-12-30 17:28 ` Rich Felker 2019-12-30 18:53 ` JeanHeyd Meneide 0 siblings, 1 reply; 11+ messages in thread From: Rich Felker @ 2019-12-30 17:28 UTC (permalink / raw) To: JeanHeyd Meneide; +Cc: Florian Weimer, musl On Thu, Dec 26, 2019 at 12:43:45AM -0500, JeanHeyd Meneide wrote: > Dear Rich Felker and Florien Weimer, > > Thank you for taking an interest in this! > > On Wed, Dec 25, 2019 at 9:14 PM Rich Felker <dalias@libc.org> wrote: > > > > On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote: > > > > > > I'm somewhat concerned that the C multibyte functions are too broken > > > to be useful. There is a at least one widely implemented character > > > set (Big5 as specified for HTML5) which does not fit the model implied > > > by the standard. Big5 does not have shift states, but a C > > > implementation using UTF-32 for wchar_t has to pretend it has because > > > correct conversion from Unicode to Big5 needs lookahead and cannot be > > > performed one character at a time. > > > > I don't think this can be modeled with shift states. C explicitly > > forbids a stateful wchar_t encoding/multi-wchar_t-characters. Shift > > states would be meaningful for the other direction. > > > > In any case I don't think it really matters. There are no existing > > implementations with this version of Big5 (with the offending HKSCS > > characters included) as the locale charset, since it can't work, and > > there really is no good reason to be adding *new* locale encodings. > > The reason we (speaking of the larger community; musl doesn't) have > > non-UTF-8 locales is legacy compatibility for users who need or insist > > on keeping them. > > I have no intentions on adding new locale-based charsets (and I > absolutely agree that we should not be adding anymore). 
That being > said, I want to focus on the main part of this, which is the ability > to model existing encodings which have both/either shift states and/or > multi-character expansions. I think you misunderstood my remarks here. I was not talking about invention of new charsets (which we seem to agree should not happen), but making it possible to use existing legacy charsets which were previously not usable as a locale's encoding due to limitations of the C APIs. I see making that possible as counter-productive. It does not serve to let users keep doing something they were already doing (compatibility), only to do something newly backwards. > It is true wchar_t is invariably busted. There is no way a wide > character string can be multi-unit: that ship sailed when wchar_t > was specified as is, and when it was codified in various APIs such as > mbtowc/wctomb and friends. I will, however, note that the paper > specifically wants to add the Restartable versions of "single unit" wc > and mb to/from functions. I don't follow. mbrtowc and wcrtomb already exist and have since at least C99. > > "After discussion, the committee concluded that mbstate was > > already specified to handle this case, and as such the second > > interpretation is intended. The committee believes that there is > > an underspecification, and solicited a further paper from the > > author along the lines of the second option. Although not > > discussed a Suggested Technical Corrigendum can be found at > > N2040." -- WG14, April 2016, > > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_488 > > This means that while wcto* and *towc functions are broken, the I don't see them as broken. They support every encoding that has ever worked in the past as the encoding for a locale (tautologically). The only way they're "broken" is if you want to add new locale encodings that weren't previously supportable. > Is my understanding incorrect? 
Is there an implementation > limitation I am missing here? I would hate to do this the wrong way > and make the encoding situation even worse: my goal is to absolutely > ensure we can transition legacy encodings to statically-known > encodings. The C locale API does not exist to convert arbitrary encodings, only the one in use as the locale's encoding. Its purpose is to abstract the concept of how text is encoded in the system's/user's environment such that applications can honor it while (with recent versions of C) being able to determine the identity of characters in terms of Unicode and emit specific characters provided that they're representable in the encoding. Conversion of arbitrary encodings other than the one in use by the locale requires a different API that takes encodings by name or some other identifier. The standard (POSIX) API for this is iconv, which has plenty of limitations of its own, some the same as what you've identified. Rich ^ permalink raw reply [flat|nested] 11+ messages in thread
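For reference, the iconv API named at the end of that message works roughly like the following; a minimal sketch with error handling trimmed, and the encoding names assuming a typical glibc/musl iconv:

```c
#include <iconv.h>
#include <stddef.h>

/* Convert a buffer between two named encodings with POSIX iconv.
 * Returns the number of bytes written, or (size_t)-1 on error. */
size_t convert(const char *to, const char *from,
               char *in, size_t inlen, char *out, size_t outlen)
{
    iconv_t cd = iconv_open(to, from);   /* target first, source second */
    if (cd == (iconv_t)-1)
        return (size_t)-1;
    char *inp = in, *outp = out;
    size_t inleft = inlen, outleft = outlen;
    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return (size_t)-1;
    return outlen - outleft;             /* bytes produced */
}
```

Unlike the C locale functions, the source and destination encodings are chosen by name, independently of the current locale, which is why it is the standard answer for arbitrary-encoding conversion despite its own limitations.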
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-30 17:28 ` Rich Felker @ 2019-12-30 18:53 ` JeanHeyd Meneide 0 siblings, 0 replies; 11+ messages in thread From: JeanHeyd Meneide @ 2019-12-30 18:53 UTC (permalink / raw) To: Rich Felker; +Cc: Florian Weimer, musl On Mon, Dec 30, 2019 at 12:28 PM Rich Felker <dalias@libc.org> wrote: > I think you misunderstood my remarks here. I was not talking about > invention of new charsets (which we seem to agree should not happen), > but making it possible to use existing legacy charsets which were > previously not usable as a locale's encoding due to limitations of the > C APIs. I see making that possible as counter-productive. It does not > serve to let users keep doing something they were already doing > (compatibility), only do to something newly backwards. My goal is to allow developers to go from an encoding they do not control fully (the multibyte encoding) to an encoding they know and can reason about in their program (c8, for example). This is why I am providing the mb -> cNN and wc -> cNN functions in both single-character and string forms. The hope is to make it easy to go from a statically known encoding (modulo difficulties from __STD_C_UTF16/32__ not being defined) to the platform encoding, and vice-versa, using the same style of functions like mb(s)(r)towc(s) and wc(s)(r)tomb(s). > > ... I will, however, note that the paper > > specifically wants to add the Restartable versions of "single unit" wc > > and mb to/from functions. > > I don't follow. mbrtowc and wcrtomb already exist and have since at > least C99. Apologies, I meant doing wc <-> cNN and mb <-> cNN! > > ... > > > > This means that while wcto* and *towc functions are broken, the > > I don't see them as broken. They support every encoding that has ever > worked in the past as the encoding for a locale (tautologically). The > only way they're "broken" is if you want to add new locale encodings > that weren't previously supportable. 
Apologies; this was in reference to wide characters given a non-UTF-32 interpretation on certain platforms like Windows and certain IBM flavors. They chose 16 bits, which can't accommodate Unicode without needing multiple wchar_t. Unfortunately, this means that they were really out of luck before DR488 was accepted: they had no means to return multiple wchar_t for characters outside the 16-bit maximum. With DR488, restartable functions have the potential to convert out properly (albeit the DR was only applied to char16_t functions, so while I have a hope and a wish we can fix it for their platforms it might not work out for the wcto* and *towc functions anyway). char16_t functions, though, should offer those platforms a better way out (though not a perfect one: they'll need to rely on platform knowledge and perform some casts). > ... > > Conversion of arbitrary encodings other than the one in use by the > locale requires a different API that takes encodings by name or some > other identifier. The standard (POSIX) API for this is iconv, which > has plenty of limitations of its own, some the same as what you've > identified. Absolutely agreed! I just want the ones that the platform controls (wide character and multibyte character encodings) to have correct, simple paths to static encodings that can be used for more rigorous text processing. Sincerely, JeanHeyd Meneide ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-26 2:13 ` Rich Felker 2019-12-26 5:43 ` JeanHeyd Meneide @ 2019-12-26 9:43 ` Florian Weimer 1 sibling, 0 replies; 11+ messages in thread From: Florian Weimer @ 2019-12-26 9:43 UTC (permalink / raw) To: Rich Felker; +Cc: JeanHeyd Meneide, musl * Rich Felker: > On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote: >> * JeanHeyd Meneide: >> >> > I hope this e-mail finds you doing well this Holiday Season! I am >> > interested in developing a few fast routines for text encoding for >> > musl after the positive reception of a paper for the C Standard >> > related to fast conversion routines: >> > >> > https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html >> >> I'm somewhat concerned that the C multibyte functions are too broken >> to be useful. There is a at least one widely implemented character >> set (Big5 as specified for HTML5) which does not fit the model implied >> by the standard. Big5 does not have shift states, but a C >> implementation using UTF-32 for wchar_t has to pretend it has because >> correct conversion from Unicode to Big5 needs lookahead and cannot be >> performed one character at a time. > > I don't think this can be modeled with shift states. C explicitly > forbids a stateful wchar_t encoding/multi-wchar_t-characters. Shift > states would be meaningful for the other direction. The intent of the standard appears to be to support this as an extension. It's hard to tell because actual users of the interfaces with legacy charsets do not seem to be represented on the standards committee anymore (see the mblen behavioral change in C11 as evidence supporting this theory). > In any case I don't think it really matters. 
There are no existing > implementations with this version of Big5 (with the offending HKSCS > characters included) as the locale charset, since it can't work, and > there really is no good reason to be adding *new* locale encodings. Do you mean in musl? > The reason we (speaking of the larger community; musl doesn't) have > non-UTF-8 locales is legacy compatibility for users who need or insist > on keeping them. That is true. > If there really is an insistence on using this version of Big5, the > characters should be added to Unicode as <compat> characters so that > there's an unambiguous one-to-one correspondence, and the people who > care about it working should take responsibility for doing that. Yes, I was very surprised this wasn't done for TSCII and HKSCS/Big5. I think even for Big5, it would solve the issue because the decoding process only needs to look at a single multibyte character at a time (I may have suggested otherwise in the past). A succinct description of what is going on for Big5 is here: <https://encoding.spec.whatwg.org/#big5-decoder>, under step 3.3. The conversion is actually fairly simple; it's just hard to fit it into the C interfaces. >> This would at least affect the proposed c8rtomb function. >> >> I posted a brief review of the problematic charsets in glibc here: >> >> <https://sourceware.org/ml/libc-alpha/2019-05/msg00079.html> > > I've read it but seemingly not in enough detail to gather what parts > are relevant to this conversation. It names a few character sets that have fake shift states because the C interfaces cannot otherwise be used with them. Some of the new interfaces are problematic in this context (whether or not UTF-32 is used for wchar_t). I think new interfaces should be compatible with existing implementation practice. The other thing I found surprising is that there are no ASCII-transparent charsets with traditional shift states in glibc. The ASCII-transparent charsets with shift states have these fake shift states. 
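The decoder referenced above can be modeled with nothing more than a held lead byte, which is exactly the "fake shift state" under discussion. A sketch, where big5_lookup is a stand-in for the full WHATWG Big5 index (here it knows only the single real mapping 0xA4 0x40 -> U+4E00), not a real library function:

```c
#include <stdint.h>

/* Byte-at-a-time Big5-style decoder sketch: the only state is a
 * pending lead byte, which is what C's mbstate_t would have to carry. */
struct big5_state { unsigned char lead; };

/* Stub for the full WHATWG Big5 index table. */
static uint32_t big5_lookup(unsigned char lead, unsigned char trail)
{
    if (lead == 0xA4 && trail == 0x40)
        return 0x4E00;          /* Big5 A440 is U+4E00 */
    return 0xFFFD;              /* replacement character in this sketch */
}

/* Push one byte; returns a decoded code point, or 0 if more input is
 * needed.  (A real interface would signal "incomplete" out of band,
 * since 0 collides with NUL; this is only a sketch.) */
uint32_t big5_push_byte(struct big5_state *st, unsigned char b)
{
    if (st->lead) {
        unsigned char lead = st->lead;
        st->lead = 0;
        return big5_lookup(lead, b);
    }
    if (b < 0x80)
        return b;               /* ASCII-transparent range passes through */
    st->lead = b;               /* hold the lead byte: the "shift state" */
    return 0;
}
```

The point being that the held lead byte is ordinary decoder state, simple in itself; the awkwardness is only in expressing it through mbstate_t and the one-character-at-a-time C interfaces.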
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-24 23:06 [ Guidance ] Potential New Routines; Requesting Help JeanHeyd Meneide 2019-12-25 20:07 ` Florian Weimer @ 2019-12-30 17:31 ` Rich Felker 2019-12-30 18:39 ` JeanHeyd Meneide 1 sibling, 1 reply; 11+ messages in thread From: Rich Felker @ 2019-12-30 17:31 UTC (permalink / raw) To: JeanHeyd Meneide; +Cc: musl On Tue, Dec 24, 2019 at 06:06:50PM -0500, JeanHeyd Meneide wrote: > Dear musl Maintainers and Contributors, > > I hope this e-mail finds you doing well this Holiday Season! I am > interested in developing a few fast routines for text encoding for > musl after the positive reception of a paper for the C Standard > related to fast conversion routines: > > https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html This is interesting, but I'm trying to understand the motivation. If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the proposed functions are just the identity (for the c32 ones) and UTF-16/32 conversion. If it's not defined, you have the same problem as the current mb/cNN functions: there's no reason to believe arbitrary Unicode characters can round-trip through wchar_t any better than they can through multibyte characters. In fact on such implementations it's likely that wchar_t meanings are locale-dependent and just a remapping of the byte/multibyte characters. What situation do you envision where the proposed functions let you reliably do something that's not already possible? > While I have a basic implementation, I would like to use some > processor and compiler intrinsics to make it faster and make sure my > first contribution meets both quality and speed standards for a C > library. > > Is there a place in the codebase I can look to for guidance on > how to handle intrinsics properly within musl libc? 
If there is > already infrastructure and common idioms in place, I would rather use > that then starting to spin up my own. I'm not sure what you mean by intrinsics or why you're looking for them but I guess you're thinking of something as a performance optimization? musl favors having code in straight simple C except when there's a strong reason (known bottleneck in existing real-world software -- things like memcpy, strlen, etc.) to do otherwise. The existing mb/wc code is slightly "vectorized" (see mbsrtowcs) but doing so was probably a mistake. The motivation came along with one of the early motivations for musl: not making UTF-8 a major performance regression like it was in glibc. But it turned out the bigger issue was the performance of character-at-a-time and byte-at-a-time conversions, not bulk conversion. If we do adopt these functions, the right way to do it would be using them to refactor the existing c16/c32 functions. Basically, for example, the bulk of c16rtomb would become c16rtowc, and c16rtomb would be replaced with a call to c16rtowc followed by wctomb. And the string ones can all be simple loop wrappers. Rich ^ permalink raw reply [flat|nested] 11+ messages in thread
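A sketch of that suggested refactor, under the assumption of a UTF-32 wchar_t as in musl; the name c16rtowc_sketch and its state struct are hypothetical illustrations, not existing interfaces:

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical c16rtowc: pair UTF-16 surrogates into one wchar_t.
 * Returns 1 when a wide character was produced, 0 when a high
 * surrogate is buffered, (size_t)-1 on an unpaired surrogate. */
struct c16state { unsigned short high; };

size_t c16rtowc_sketch(wchar_t *pwc, unsigned short c16, struct c16state *st)
{
    if (st->high) {
        if (c16 < 0xDC00 || c16 > 0xDFFF)
            return (size_t)-1;               /* lone high surrogate */
        *pwc = 0x10000 + ((wchar_t)(st->high - 0xD800) << 10)
                       + (c16 - 0xDC00);
        st->high = 0;
        return 1;
    }
    if (c16 >= 0xD800 && c16 <= 0xDBFF) {
        st->high = c16;                      /* wait for the low half */
        return 0;
    }
    if (c16 >= 0xDC00 && c16 <= 0xDFFF)
        return (size_t)-1;                   /* lone low surrogate */
    *pwc = c16;
    return 1;
}

/* c16rtomb then becomes: convert to wchar_t, let wctomb encode. */
int c16rtomb_sketch(char *s, unsigned short c16, struct c16state *st)
{
    wchar_t wc;
    size_t r = c16rtowc_sketch(&wc, c16, st);
    if (r == (size_t)-1) return -1;
    if (r == 0) return 0;                    /* high surrogate buffered */
    return wctomb(s, wc);
}
```

With the character-level pieces factored this way, the string forms reduce to loops over them, which matches the "simple loop wrappers" remark.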
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-30 17:31 ` Rich Felker @ 2019-12-30 18:39 ` JeanHeyd Meneide 2019-12-30 19:57 ` Rich Felker 0 siblings, 1 reply; 11+ messages in thread From: JeanHeyd Meneide @ 2019-12-30 18:39 UTC (permalink / raw) To: Rich Felker; +Cc: musl On Mon, Dec 30, 2019 at 12:31 PM Rich Felker <dalias@libc.org> wrote: > > ... > This is interesting, but I'm trying to understand the motivation. > > If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the > proposed functions are just the identity (for the c32 ones) and > UTF-16/32 conversion. > > If it's not defined, you have the same problem as the current mb/cNN > functions: there's no reason to believe arbitrary Unicode characters > can round-trip through wchar_t any better than they can through > multibyte characters. In fact on such implementations it's likely that > wchar_t meanings are locale-dependent and just a remapping of the > byte/multibyte characters. I'm sorry, I'll try to phrase it as best as I can. The issue I and others have with the lack of cNNtowc is that, if we are to write standards-compliant C, the only way to do that transformation from, for example, char16_t data to wchar_t portably is: c16rtomb -> multibyte data -> mbrtowc The problem with such a conversion sequence is that there are many legacy encodings and this causes bugs on many users' machines. Text representable in both char16_t and wchar_t is lost in the middle because the intermediate encoding cannot represent it, leaving us losing data in both directions between wchar_t and char16_t. This has been frustrating for a number of users who try to rely on the standard, only to have to write the above conversion sequence and fail. Thus, providing a direct function with no intermediates results in a better Standard C experience. A minor but still helpful secondary motivation is in giving people on certain long-standing platforms a way out. 
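The c16rtomb -> mbrtowc route just described can be spelled out with the standard C11 functions; the middle hop fails (c16rtomb returns (size_t)-1) whenever the locale's multibyte encoding cannot represent the character, which is exactly where the data loss occurs. The helper name is mine; it converts one character's worth of char16_t units (a single unit or a surrogate pair):

```c
#include <limits.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* Portable-C11-only route from char16_t to wchar_t: go through the
 * locale's multibyte encoding.  Returns bytes consumed by mbrtowc,
 * or (size_t)-1 if the locale cannot represent the character. */
size_t c16_to_wc_via_mb(wchar_t *pwc, const char16_t *src, size_t n)
{
    mbstate_t st1 = {0}, st2 = {0};
    char buf[MB_LEN_MAX];
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        size_t r = c16rtomb(buf + len, src[i], &st1);
        if (r == (size_t)-1)
            return (size_t)-1;   /* lost in the middle hop */
        len += r;                /* high surrogate contributes 0 bytes */
    }
    return mbrtowc(pwc, buf, len, &st2);
}
```

In a legacy non-Unicode locale, any char16_t character outside that locale's repertoire dies at the c16rtomb step even though wchar_t could have represented it, which is the failure mode being complained about.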
By definition, UTF-16 does not work with wchar_t, so I am explicitly told that treating wchar_t on a platform like e.g. Windows as UCS-2 (the single-unit predecessor of UTF-16, deprecated a while ago) is wrong when using the Standard Library if I want real Unicode support. Library developers tell me to rely on platform-specific APIs. The "use MultiByteToWideChar", "use ICU", or "use this AIX-specific function" advice makes it much less of a Standard way to handle text: hence the paper to the WG14 C Committee. The restartable versions of the single-character functions and the bulk conversion functions give implementations locked into behaving like the deprecated 16-bit single-unit UCS-2 a way out, and also allow lossless data conversion. This reasoning might be a little bit "overdone" for libraries like musl and glibc that got wchar_t right (thank you!), but part of standardizing these things means I have to account for implementations that have been around longer than I have been alive. :) Does that make sense? > What situation do you envision where the proposed functions let you > reliably do something that's not already possible? My understanding is that libraries such as musl are "blessed" as distributions of the Standard Library, and that they can access system information that makes it possible for them to utilize what the current "wchar_t encoding" is in a way normal, regular developers cannot. Specifically, in the generic external implementation I have been working on, I have a number of #ifdef to check for, say, IBM machines, then check if they are specifically under zh/tw or even jp locales, because they deploy a wchar_t in these scenarios that is neither UTF-16 nor UTF-32 (but instead a flavor of one of the GB encodings and Japanese encodings); otherwise, IBM uses UTF16/UCS-2 for wchar_t in i686 and UTF-32 for wchar_t in x86_64 for certain machines. I also check for what happens on Windows under various settings as well. 
Doing this as an external library is hard, because there is no way I can control those knobs reliably, but a Standard Library distribution would have access to that information (since they are providing such functions already). So, for example, musl -- being the C library -- controls how the wchar_t should behave (modulo compiler intervention) for its wide character functions. Similarly, glibc would know what to do for its platforms, and IBM would know what to do for its platforms, and so on and so forth. Each distribution would provide behavior in coordination with their platform. Is this incorrect? Am I assuming a level of standard library <-> vendor relation/cooperation that does not exist? > > While I have a basic implementation, I would like to use some > > processor and compiler intrinsics to make it faster and make sure my > > first contribution meets both quality and speed standards for a C > > library. > > > > Is there a place in the codebase I can look to for guidance on > > how to handle intrinsics properly within musl libc? If there is > > already infrastructure and common idioms in place, I would rather use > > that than spin up my own. > > I'm not sure what you mean by intrinsics or why you're looking for > them but I guess you're thinking of something as a performance > optimization? musl favors having code in straight simple C except when > there's a strong reason (known bottleneck in existing real-world > software -- things like memcpy, strlen, etc.) to do otherwise. The > existing mb/wc code is slightly "vectorized" (see mbsrtowcs) but doing > so was probably a mistake. The motivation came along with one of the > early motivations for musl: not making UTF-8 a major performance > regression like it was in glibc. But it turned out the bigger issue > was the performance of character-at-a-time and byte-at-a-time > conversions, not bulk conversion. 
My experience so far is that the character-at-a-time functions can cause severe performance penalties for external users, especially if the library is dynamically linked. If the C standard provides the bulk-conversion functions, performance would increase drastically for users desiring bulk conversion (because they do not have to write a loop around a dynamically-loaded function call to do conversions one-at-a-time). I am glad that musl has had similar experience, and would like to make the bulk functions available in musl too! My asking about intrinsics and such was that I have some optimizations using hand-vectorized instructions for some bulk cases. I will be more than happy to just contribute regular and readable plain C, though, and then revisit such functions if it turns out that vectorization with SIMD and other instructions for various platforms turns out to be worth it. My initial hunch is that it is, but I'm more than happy to focus on correctness first, extreme performance (maybe) later. > If we do adopt these functions, the right way to do it would be using > them to refactor the existing c16/c32 functions. Basically, for > example, the bulk of c16rtomb would become c16rtowc, and c16rtomb > would be replaced with a call to c16rtowc followed by wctomb. And the > string ones can all be simple loop wrappers. I would be more than happy to write the implementation as such! Most of the wchar_t functions will be very easy since musl and glibc chose the right wchar_t. (Talking to other vendors is going to be a much, much more difficult conversation...) Best Wishes, JeanHeyd Meneide ^ permalink raw reply [flat|nested] 11+ messages in thread
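The "simple loop wrapper" shape of the bulk string functions discussed above can be sketched as follows; mbsrtoc32s_sketch is a hypothetical name following the paper's pattern, built on the existing C11 mbrtoc32:

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical bulk form: convert a null-terminated multibyte string
 * to char32_t by looping the existing single-character mbrtoc32,
 * mirroring the mbsrtowcs calling convention.  Returns characters
 * written, or (size_t)-1 on an encoding error. */
size_t mbsrtoc32s_sketch(char32_t *dst, const char **src, size_t dmax,
                         mbstate_t *st)
{
    size_t out = 0;
    const char *s = *src;
    while (out < dmax) {
        size_t r = mbrtoc32(&dst[out], s, (size_t)-1, st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;
        if (r == (size_t)-3) {   /* extra unit, no input consumed */
            out++;
            continue;
        }
        if (r == 0) {            /* null terminator converted */
            *src = NULL;
            return out;
        }
        s += r;
        out++;
    }
    *src = s;                    /* stopped on a full buffer */
    return out;
}
```

Shipping this loop inside the library rather than asking applications to write it around a dynamically linked per-character call is precisely the performance argument made above: the call overhead is paid once per buffer, not once per character.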
* Re: [ Guidance ] Potential New Routines; Requesting Help 2019-12-30 18:39 ` JeanHeyd Meneide @ 2019-12-30 19:57 ` Rich Felker 2019-12-31 3:58 ` JeanHeyd Meneide 0 siblings, 1 reply; 11+ messages in thread From: Rich Felker @ 2019-12-30 19:57 UTC (permalink / raw) To: JeanHeyd Meneide; +Cc: musl On Mon, Dec 30, 2019 at 01:39:10PM -0500, JeanHeyd Meneide wrote: > On Mon, Dec 30, 2019 at 12:31 PM Rich Felker <dalias@libc.org> wrote: > > > ... > > This is interesting, but I'm trying to understand the motivation. > > > > If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the > > proposed functions are just the identity (for the c32 ones) and > > UTF-16/32 conversion. > > > > If it's not defined, you have the same problem as the current mb/cNN > > functions: there's no reason to believe arbitrary Unicode characters > > can round-trip through wchar_t any better than they can through > > multibyte characters. In fact on such implementations it's likely that > > wchar_t meanings are locale-dependent and just a remapping of the > > byte/multibyte characters. > > I'm sorry, I'll try to phrase it as best as I can. > > The issue I and others have with the lack of cNNtowc is that, if > we are to write standards-compliant C, the only way to do that > transformation from, for example, char16_t data to wchar_t portably > is: > > c16rtomb -> multibyte data -> mbrtowc > > The problem with such a conversion sequence is that there are > many legacy encodings and this causes bugs on many user's machines. > Text representable in both char16_t and wchar_t is lost in the middle: > due to the middle not handling it, putting us in a place where we lose > of data going to and from wchar_t to char16_t. This has been > frustrating for a number of users who try to rely on the standard, > only to have to write the above conversions sequence and fail. Thus, > providing a direct function with no intermediates results in a better > Standard C experience. 
> A minor but still helpful secondary motivation is in giving people
> on certain long-standing platforms a way out. By definition, UTF-16
> does not work with wchar_t, so on a platform like e.g. Windows,
> wchar_t is UCS-2 (the non-multi-unit version of UTF-16 that was
> deprecated a while ago), and I am explicitly told that using the
> Standard Library is wrong if I want real Unicode support. Library
> developers tell me to rely on platform-specific APIs. The "use
> MultiByteToWideChar" or "use ICU" or "use this AIX-specific function"
> answers make it much less of a Standard way to handle text: hence the
> paper to the WG14 C Committee. The restartable versions of the
> single-character functions and the bulk conversion functions give
> implementations locked to behaving like the deprecated UCS-2,
> 16-bit-single-unit encoding a way out, and also allow us to have
> lossless data conversion.

I don't think these interfaces give you an "out" in a way that's fully
conforming. The C model is that there's a set of characters supported
in the current locale, and each of them has one or more multibyte
representations (possibly involving shift states) and a single wide
character representation. Converting between UTF-16 or UTF-32 and
wchar_t outside the scope of characters that exist in the current
locale isn't presently a meaningful concept, and wouldn't enable you
to get meaningful results from wctype.h functions, etc. (Would you
propose having a second set of such functions for char32_t to handle
that? Really it sounds like what you want is an out to deprecate
wchar_t and use char32_t in its place, which wouldn't be a bad
idea...)

Solving these problems for implementations burdened by a legacy *wrong
choice* of definition of wchar_t is not possible by adding more
interfaces alone; it requires a lot of changes to the underlying
abstract model of what a character is in C. I'm not really in favor of
such changes.
They complicate and burden existing working implementations for the
sake of ones that made bad choices. Windows in particular *can* and
*should* fix wchar_t to be 32-bit. The Windows API uses WCHAR, not
wchar_t, anyway, so a change in wchar_t is really not a big deal for
interface compatibility, and Windows has conformance problems like
wprintf treating %s/%ls incorrectly that require breaking changes to
fix. Good stdlib implementations on Windows already fix these things.

> This reasoning might be a little bit "overdone" for libraries like
> musl and glibc, which got wchar_t right (thank you!), but part of
> standardizing these things is that I have to account for
> implementations that have been around longer than I have been alive.
> :) Does that make sense?
>
> > What situation do you envision where the proposed functions let
> > you reliably do something that's not already possible?
>
> My understanding is that libraries such as musl are "blessed" as
> distributions of the Standard Library, and that they can access
> system information that makes it possible for them to utilize the
> current "wchar_t encoding" in a way normal, regular developers
> cannot. Specifically, in the generic external implementation I have
> been working on, I have a number of #ifdef checks for, say, IBM
> machines, and then whether they are specifically under zh/tw or even
> jp locales, because they deploy a wchar_t in those scenarios that is
> neither UTF-16 nor UTF-32 (but instead a flavor of one of the GB and
> Japanese encodings); otherwise, IBM uses UTF-16/UCS-2 for wchar_t on
> i686 and UTF-32 for wchar_t on x86_64 for certain machines. I also
> check what happens on Windows under various settings. Doing this as
> an external library is hard, because there is no way I can control
> those knobs reliably, but a Standard Library distribution has access
> to that information (since it already provides such functions).
The __STDC_ISO_10646__ macro is the way to determine that the encoding
of wchar_t is Unicode (or some subset, if WCHAR_MAX doesn't admit the
full range). Otherwise it's not something you can meaningfully work
with except as an abstract number, but in that case you just want to
avoid it as much as possible and convert directly between multibyte
characters and char16_t/char32_t. I don't see how converting directly
between wchar_t and char16_t/char32_t is more useful, even if it is a
prettier factorization of the code.

A far more useful thing to know than the wchar_t encoding is the
multibyte encoding. POSIX gives you this in nl_langinfo(CODESET), but
plain C has no equivalent. I'd actually like to see WG14 adopt this
into plain C.

> > > While I have a basic implementation, I would like to use some
> > > processor and compiler intrinsics to make it faster and make
> > > sure my first contribution meets both quality and speed
> > > standards for a C library.
> > >
> > > Is there a place in the codebase I can look to for guidance on
> > > how to handle intrinsics properly within musl libc? If there is
> > > already infrastructure and common idioms in place, I would
> > > rather use that than spinning up my own.
> >
> > I'm not sure what you mean by intrinsics or why you're looking for
> > them, but I guess you're thinking of something as a performance
> > optimization? musl favors having code in straight simple C except
> > when there's a strong reason (known bottleneck in existing
> > real-world software -- things like memcpy, strlen, etc.) to do
> > otherwise. The existing mb/wc code is slightly "vectorized" (see
> > mbsrtowcs), but doing so was probably a mistake. The motivation
> > came along with one of the early motivations for musl: not making
> > UTF-8 a major performance regression like it was in glibc. But it
> > turned out the bigger issue was the performance of
> > character-at-a-time and byte-at-a-time conversions, not bulk
> > conversion.
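[Editor's note: the two probes Rich mentions can be sketched in a few
lines. __STDC_ISO_10646__ is a compile-time check from standard C;
nl_langinfo(CODESET) is POSIX-only, with no plain-C equivalent, which
is exactly the gap he'd like WG14 to close.]

```c
#include <langinfo.h>   /* nl_langinfo, CODESET (POSIX) */
#include <locale.h>

/* Compile-time: is wchar_t a Unicode (ISO 10646) code point? */
static int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;   /* wchar_t values are ISO 10646 code points */
#else
    return 0;   /* wchar_t is an abstract, possibly locale-dependent number */
#endif
}

/* Run-time: name of the current locale's multibyte encoding,
 * e.g. "UTF-8", or "ANSI_X3.4-1968" in the default C locale. */
static const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}
```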
> My experience so far is that the character-at-a-time functions can
> cause severe performance penalties for external users, especially if
> the library is dynamically linked.

On musl (where I'm familiar with performance properties),
byte-at-a-time conversion is roughly half the speed of bulk, which
looks big but is diminishingly so if you're actually doing something
with the result (just converting to wchar_t for its own sake is not
very useful). Character-at-a-time is probably somewhat less slow than
byte-at-a-time. When I wrote this I put in heavy effort to make
byte/character-at-a-time not horribly slow, because it's normally the
natural programming model. Wide character strings are not an idiomatic
type to work with in C.

Rich

^ permalink raw reply	[flat|nested] 11+ messages in thread
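[Editor's note: the two programming models being compared above --
bulk conversion in one library call versus a character-at-a-time loop
-- can be sketched as follows. Both produce the same wide string; the
roughly 2x throughput gap is Rich's figure for musl, not something
this sketch measures.]

```c
#include <string.h>
#include <wchar.h>

/* Bulk model: one call converts the whole string. */
static size_t convert_bulk(wchar_t *dst, const char *src, size_t n)
{
    mbstate_t st = {0};
    return mbsrtowcs(dst, &src, n, &st);
}

/* Per-character model: one mbrtowc call per character, the loop the
 * caller would otherwise write around a dynamically-linked function. */
static size_t convert_per_char(wchar_t *dst, const char *src, size_t n)
{
    mbstate_t st = {0};
    size_t len = strlen(src), out = 0;
    while (len && out < n) {
        size_t r = mbrtowc(&dst[out], src, len, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;      /* invalid or truncated sequence */
        src += r;
        len -= r;
        out++;
    }
    if (out < n)
        dst[out] = L'\0';
    return out;                     /* wide chars written, like mbsrtowcs */
}
```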
* Re: [ Guidance ] Potential New Routines; Requesting Help
  2019-12-30 19:57 ` Rich Felker
@ 2019-12-31  3:58 ` JeanHeyd Meneide
  0 siblings, 0 replies; 11+ messages in thread
From: JeanHeyd Meneide @ 2019-12-31 3:58 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On Mon, Dec 30, 2019 at 2:57 PM Rich Felker <dalias@libc.org> wrote:
> I don't think these interfaces give you an "out" in a way that's
> fully conforming. The C model is that there's a set of characters
> supported in the current locale, and each of them has one or more
> multibyte representations (possibly involving shift states) and a
> single wide character representation. Converting between UTF-16 or
> UTF-32 and wchar_t outside the scope of characters that exist in the
> current locale isn't presently a meaningful concept, and wouldn't
> enable you to get meaningful results from wctype.h functions, etc.
> (Would you propose having a second set of such functions for
> char32_t to handle that? Really it sounds like what you want is an
> out to deprecate wchar_t and use char32_t in its place, which
> wouldn't be a bad idea...)

This is actually something I am extremely interested in tackling. But
first I need to make sure everyone can get their data in current
applications from multibyte and wide characters to char32_t. Then a
potential <uctype.h> can be worked on that takes case mapping, case
folding, and all of the other useful things Unicode has brought to the
table and makes them work with Unicode code points. One of the things
I saw before is that a previous proposal to extend wctype.h with other
functions was very large, and despite being well motivated it did not
succeed in WG14.

Also on my list of things is the fact that char16_t and char32_t do
not necessarily have to be Unicode (__STDC_UTF_32__ and friends).
This means that if we settle on char32_t for these interfaces, we may
set a potential trap for users who migrate and then try to port to
platforms where char16_t does not mean UTF-16 and char32_t does not
mean UTF-32. In coordinating with a few static analysis vendors who
cover a very large range of compiler implementations, both C and C++,
they have reportedly not yet found a compiler that makes char16_t and
char32_t anything other than UTF-16 and UTF-32 (some platforms forget
to define the macros but still use those encodings). I hope that in
the future a paper can be brought to WG14 to make those encodings
required for char16_t/char32_t, rather than checking the macro and
leaving users out to dry. Right now everything de facto works, but I
worry...

Still, I want to introduce each logical piece of functionality in its
own paper, with its own scope and motivation. This, in my opinion,
seems to work much better: work on transition and replacement, then
deprecate the things we know from experience are bad. I don't know if
my plan is going to work, but having nobody vote against my first ever
WG14 proposal is a good start, and I want to be careful not to get
stuck in Committee on mega-proposals that scare people.

> Solving these problems for implementations burdened by a legacy
> *wrong choice* of definition of wchar_t is not possible by adding
> more interfaces alone; it requires a lot of changes to the
> underlying abstract model of what a character is in C. I'm not
> really in favor of such changes. They complicate and burden existing
> working implementations for the sake of ones that made bad choices.
> Windows in particular *can* and *should* fix wchar_t to be 32-bit.
> The Windows API uses WCHAR, not wchar_t, anyway, so that a change in
> wchar_t is really not a big deal for interface compatibility, and
> has conformance problems like wprintf treating %s/%ls incorrectly
> that require breaking changes to fix. Good stdlib implementations on
> Windows already fix these things.

They should, absolutely.
Still, I think that preventing lossy conversions for wchar_t usage on
platforms where the wide character is used to interface with the
system is a worthwhile endeavor. I don't think it is feasible (or
would ever fly in WG14) to change what wchar_t is and how it behaves:
I would rather invest time in implementing interfaces that can offer
better and more complete functionality. I'm trying to keep my changes
well-scoped, motivated, and small.

> The __STDC_ISO_10646__ macro is the way to determine that the
> encoding of wchar_t is Unicode (or some subset if WCHAR_MAX doesn't
> admit the full range). Otherwise it's not something you can
> meaningfully work with except as an abstract number, but in that
> case you just want to avoid it as much as possible and convert
> directly between multibyte characters and char16_t/char32_t. I
> don't see how converting directly between wchar_t and
> char16_t/char32_t is more useful, even if it is a prettier
> factorization of the code.

It is an abstract number with no meaning to the developer, but the
platform (e.g., IBM using various GB encodings for wchar_t on certain
platforms where __STDC_ISO_10646__ is not defined) knows that meaning.
My intention is that by letting the Standard Library and the platform
handle it, you can get from a blob of abstract numbers to meaningful
text in a Standard way. Not only for wchar_t, but for multibyte
strings too.

> A far more useful thing to know than wchar_t encoding is the
> multibyte encoding. POSIX gives you this in nl_langinfo(CODESET)
> but plain C has no equivalent. I'd actually like to see WG14 adopt
> this into plain C.

This is actually something I am considering! There are a few sister
papers related to this percolating through another Standards Committee
right now; I want to see how that goes before bringing it to WG14.
But I think that functionality should come in addition to - not
instead of - additional conversion functions.
Platforms own the wchar_t and multibyte char encodings: if users have
to write conversion routines themselves after checking the equivalent
of nl_langinfo, we may end up with incomplete or half-done support for
encodings in many programs!

> On musl (where I'm familiar with performance properties),
> byte-at-a-time conversion is roughly half the speed of bulk, which
> looks big but is diminishingly so if you're actually doing something
> with the result (just converting to wchar_t for its own sake is not
> very useful). Character-at-a-time is probably somewhat less slow
> than byte-at-a-time. When I wrote this I put in heavy effort to make
> byte/character-at-a-time not horribly slow, because it's normally
> the natural programming model. Wide character strings are not an
> idiomatic type to work with in C.

If it is still okay, I will put my best effort into making sure the
character-at-a-time and similar functions are something you and other
musl contributors can be happy with!

Sincerely,
JeanHeyd

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2019-12-31  3:58 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-24 23:06 [ Guidance ] Potential New Routines; Requesting Help JeanHeyd Meneide
2019-12-25 20:07 ` Florian Weimer
2019-12-26  2:13   ` Rich Felker
2019-12-26  5:43     ` JeanHeyd Meneide
2019-12-30 17:28       ` Rich Felker
2019-12-30 18:53         ` JeanHeyd Meneide
2019-12-26  9:43   ` Florian Weimer
2019-12-30 17:31 ` Rich Felker
2019-12-30 18:39   ` JeanHeyd Meneide
2019-12-30 19:57     ` Rich Felker
2019-12-31  3:58       ` JeanHeyd Meneide
Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox; see mirroring instructions for how to clone
and mirror all data and code used for this inbox, as well as URLs for
NNTP newsgroup(s).