On Sat, Jun 06, 2015 at 10:50:25PM -0400, Rich Felker wrote:
> On Sat, Jun 06, 2015 at 05:40:07PM -0400, Rich Felker wrote:
> > Attached is the first draft of a proposed byte-based C locale. The
> > patch is about 400 lines but most of it is context, because it's
> > basically a lot of tiny changes spread out over lots of files.
> > [...]
>
> If we go forward with this, I think I can factor it into 3 parts:
>
> 1. Add checks for MB_CUR_MAX==1 and the bytelocale support they would
>    activate, and the CODEUNIT/IS_CODEUNIT macros needed for these code
>    paths. This patch would be a complete nop and would not even affect
>    codegen with a decent compiler since MB_CUR_MAX==4 is a constant
>    right now.
>
> 2. Introduce stdio saving of active LC_CTYPE at the time of stream
>    orientation (fwide) and save/restore of current locale around stdio
>    ops that need it (fputwc, fgetwc, ungetwc) and iconv usage of
>    multibyte functions. This patch would increase code size in a few
>    places but would not change behavior.
>
> 3. Replace the constant MB_CUR_MAX macro with a runtime-variable value
>    dependent on CURRENT_LOCALE->cat[LC_CTYPE]. This would actually
>    activate the byte-based C locale support. locale_impl.h is actually
>    already doing this, so I think I should remove that definition
>    before making any changes and only bring it back if/when stage 3
>    here is committed.
>
> In principle stages 1 and 2 could be committed in either order;
> they're independent. Stage 3 is also independent in what it touches,
> but if it's already committed before stage 1/2, then committing stage
> 1 without stage 2 is a functional regression (stdio functions no
> longer behave according to spec; iconv stops working in C locale).

Attached is the 3-part factorization described above, as patches
against commit 536c6d5a4205e2a3f161f2983ce1e0ac3082187d. As predicted,
part 1 does not change the generated code at all, at least for my
toolchain.

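
For anyone who has not opened the patches, here is a rough standalone
sketch of the kind of check part 1 adds. It is not code from the
patches. Every sketch_* name is hypothetical, the UTF-8 path is cut
down to 1- and 2-byte sequences to keep it short, and mapping high
bytes into 0xdf80..0xdfff is an assumption of this sketch; the real
CODEUNIT/IS_CODEUNIT definitions are the ones in the attachment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the runtime MB_CUR_MAX value:
 * 1 selects the byte-based path, 4 selects the UTF-8 path. */
static int sketch_mb_cur_max = 4;

/* Assumed mapping: ASCII bytes map to themselves, high bytes map
 * to placeholder values in 0xdf80..0xdfff. */
static unsigned sketch_codeunit(unsigned char b)
{
    return b < 0x80 ? b : 0xdf00u | b;
}

/* Nonzero if wc is one of the placeholder values produced above. */
static int sketch_is_codeunit(unsigned wc)
{
    return wc - 0xdf80u < 0x80u;
}

/* Decode one character from s[0..n-1] into *wc. Returns the number
 * of bytes consumed, or (size_t)-1 if the input is empty, invalid,
 * or longer than this sketch handles. */
static size_t sketch_decode(const unsigned char *s, size_t n, unsigned *wc)
{
    assert(s != NULL && wc != NULL);
    if (n == 0) return (size_t)-1;

    /* Byte-based path: every byte is its own character. */
    if (sketch_mb_cur_max == 1) {
        *wc = sketch_codeunit(s[0]);
        return 1;
    }

    /* UTF-8 path, 1- and 2-byte sequences only. */
    if (s[0] < 0x80) {
        *wc = s[0];
        return 1;
    }
    if (s[0] >= 0xc2 && s[0] <= 0xdf && n >= 2 && (s[1] & 0xc0) == 0x80) {
        *wc = (unsigned)(s[0] & 0x1f) << 6 | (unsigned)(s[1] & 0x3f);
        return 2;
    }
    return (size_t)-1;
}
```

With sketch_mb_cur_max left at 4, the bytes C3 A9 decode as one
character, U+00E9. With it set to 1, the same input yields one
placeholder value per byte. While the value being tested is a
compile-time constant 4, a compiler can drop the byte-based branch
entirely, which is why a check of this shape need not change codegen.
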
If nobody has further comments/discussion, I'll probably begin
committing this soon, starting with part 1, and the rest as I test it
more. While in the past there were ideological objections, including
by myself, all the feedback this time has been from people who want
this feature for compatibility (and future standards conformance), and
I think I've managed to do it in a way that's basically cost-free and
does not compromise musl's principle that UTF-8 is first-class, but
instead just gives you a way (only if/when you want it) to process
UTF-8 as code units instead of codepoints.

Rich