From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/7930
Path: news.gmane.org!not-for-mail
From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: [PATCH] Byte-based C locale, draft 2
Date: Sat, 13 Jun 2015 03:06:55 -0400
Message-ID: <20150613070655.GJ17573@brightrain.aerifal.cx>
References: <20150606214007.GA17398@brightrain.aerifal.cx> <20150607025025.GC17573@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="nzri8VXeXB/g5ayr"
X-Trace: ger.gmane.org 1434179243 13444 80.91.229.3 (13 Jun 2015 07:07:23 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sat, 13 Jun 2015 07:07:23 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-7943-gllmg-musl=m.gmane.org@lists.openwall.com Sat Jun 13 09:07:18 2015
Return-path:
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200]) by plane.gmane.org with smtp (Exim 4.69) (envelope-from ) id 1Z3fXB-0004Xo-LX for gllmg-musl@m.gmane.org; Sat, 13 Jun 2015 09:07:17 +0200
Original-Received: (qmail 1346 invoked by uid 550); 13 Jun 2015 07:07:15 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
Original-Received: (qmail 32723 invoked from network); 13 Jun 2015 07:07:09 -0000
Content-Disposition: inline
In-Reply-To: <20150607025025.GC17573@brightrain.aerifal.cx>
User-Agent: Mutt/1.5.21 (2010-09-15)
Original-Sender: Rich Felker
Xref: news.gmane.org gmane.linux.lib.musl.general:7930
Archived-At:

--nzri8VXeXB/g5ayr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Sat, Jun 06, 2015 at 10:50:25PM -0400, Rich Felker wrote:
> On Sat, Jun 06, 2015 at 05:40:07PM -0400, Rich Felker wrote:
> > Attached is the first draft of a proposed byte-based C locale. The
> > patch is about 400 lines but most of it is context, because it's
> > basically a lot of tiny changes spread out over lots of files.
> > [...]
>
> If we go forward with this, I think I can factor it into 3 parts:
>
> 1. Add checks for MB_CUR_MAX==1 and the bytelocale support they would
>    activate, and the CODEUNIT/IS_CODEUNIT macros needed for these code
>    paths. This patch would be a complete nop and would not even affect
>    codegen with a decent compiler since MB_CUR_MAX==4 is a constant
>    right now.
>
> 2. Introduce stdio saving of active LC_CTYPE at the time of stream
>    orientation (fwide) and save/restore of current locale around stdio
>    ops that need it (fputwc, fgetwc, ungetwc) and iconv usage of
>    multibyte functions. This patch would increase code size in a few
>    places but would not change behavior.
>
> 3. Replace the constant MB_CUR_MAX macro with a runtime-variable value
>    dependent on CURRENT_LOCALE->cat[LC_CTYPE]. This would actually
>    activate the byte-based C locale support. locale_impl.h is actually
>    already doing this, so I think I should remove that definition
>    before making any changes and only bring it back if/when stage 3
>    here is committed.
>
> In principle stages 1 and 2 could be committed in either order;
> they're independent. Stage 3 is also independent in what it touches,
> but if it's already committed before stage 1/2, then committing stage
> 1 without stage 2 is a functional regression (stdio functions no
> longer behave according to spec; iconv stops working in C locale).

Attached is the 3-part factorization described above, as patches
against commit 536c6d5a4205e2a3f161f2983ce1e0ac3082187d. As predicted,
part 1 does not change the generated code at all, at least for my
toolchain.

If nobody has further comments/discussion, I'll probably begin
committing this soon, starting with part 1, and the rest as I test it
more.
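
For readers following along: the save/restore described for stage 2
can be pictured with the public POSIX per-thread locale API. This is
only a sketch of the idea, not what the patches do; the patches swap
the thread's locale pointer directly through the internal
CURRENT_LOCALE macro. The names with_thread_locale and
current_locale_is are made up for illustration, and a glibc or musl
system is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not a musl function): run fn(arg) with `loc`
 * installed as the calling thread's locale, then reinstall whatever
 * locale was active before the call. */
static int with_thread_locale(locale_t loc, int (*fn)(void *), void *arg)
{
	assert(fn != NULL);
	locale_t saved = uselocale(loc); /* install loc, keep the old one */
	int r = fn(arg);
	uselocale(saved);                /* restore before returning */
	return r;
}

/* Sample callback: returns 1 if `expected` is the locale currently
 * active for this thread, 0 otherwise. */
static int current_locale_is(void *expected)
{
	return uselocale((locale_t)0) == (locale_t)expected;
}
```

As in the patched stdio functions, the restore has to happen on every
exit path; routing the work through a callback keeps that in a single
place.
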
While in the past there were ideological objections, including by
myself, all the feedback this time has been from people who want this
feature for compatibility (and future standards conformance), and I
think I've managed to do it in a way that's basically cost-free and
does not compromise musl's principle that UTF-8 is first-class, but
instead just gives you a way (only if/when you want it) to process
UTF-8 as code units instead of codepoints.

Rich

--nzri8VXeXB/g5ayr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="bytelocale-part1.diff"

diff --git a/src/ctype/__ctype_get_mb_cur_max.c b/src/ctype/__ctype_get_mb_cur_max.c
index d235f4d..8e946fc 100644
--- a/src/ctype/__ctype_get_mb_cur_max.c
+++ b/src/ctype/__ctype_get_mb_cur_max.c
@@ -1,6 +1,7 @@
-#include
+#include
+#include "locale_impl.h"

 size_t __ctype_get_mb_cur_max()
 {
-	return 4;
+	return MB_CUR_MAX;
 }
diff --git a/src/locale/langinfo.c b/src/locale/langinfo.c
index a1ada24..776b447 100644
--- a/src/locale/langinfo.c
+++ b/src/locale/langinfo.c
@@ -33,7 +33,8 @@ char *__nl_langinfo_l(nl_item item, locale_t loc)
 	int idx = item & 65535;
 	const char *str;

-	if (item == CODESET) return "UTF-8";
+	if (item == CODESET)
+		return MB_CUR_MAX==1 ? "UTF-8-CODE-UNITS" : "UTF-8";

 	switch (cat) {
 	case LC_NUMERIC:
diff --git a/src/multibyte/btowc.c b/src/multibyte/btowc.c
index 9d2c3b1..8de060f 100644
--- a/src/multibyte/btowc.c
+++ b/src/multibyte/btowc.c
@@ -1,7 +1,10 @@
-#include
 #include
+#include
+#include "internal.h"

 wint_t btowc(int c)
 {
-	return c<128U ? c : EOF;
+	if (c < 128U) return c;
+	if (MB_CUR_MAX==1) return CODEUNIT(c);
+	return WEOF;
 }
diff --git a/src/multibyte/internal.h b/src/multibyte/internal.h
index cc017fa..53d62ed 100644
--- a/src/multibyte/internal.h
+++ b/src/multibyte/internal.h
@@ -23,3 +23,10 @@ extern const uint32_t bittab[];

 #define SA 0xc2u
 #define SB 0xf4u
+
+/* Arbitrary encoding for representing code units instead of characters. */
+#define CODEUNIT(c) (0xdfff & (signed char)(c))
+#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
+
+/* Get inline definition of MB_CUR_MAX. */
+#include "locale_impl.h"
diff --git a/src/multibyte/mbrtowc.c b/src/multibyte/mbrtowc.c
index e7b3654..ca7da70 100644
--- a/src/multibyte/mbrtowc.c
+++ b/src/multibyte/mbrtowc.c
@@ -4,6 +4,7 @@
  * unnecessary.
  */

+#include
 #include
 #include "internal.h"
@@ -27,6 +28,7 @@ size_t mbrtowc(wchar_t *restrict wc, const char *restrict src, size_t n, mbstate
 	if (!n) return -2;
 	if (!c) {
 		if (*s < 0x80) return !!(*wc = *s);
+		if (MB_CUR_MAX==1) return (*wc = CODEUNIT(*s)), 1;
 		if (*s-SA > SB-SA) goto ilseq;
 		c = bittab[*s++-SA]; n--;
 	}
diff --git a/src/multibyte/mbsrtowcs.c b/src/multibyte/mbsrtowcs.c
index 3c1343a..e23083d 100644
--- a/src/multibyte/mbsrtowcs.c
+++ b/src/multibyte/mbsrtowcs.c
@@ -7,6 +7,8 @@
 #include
 #include
 #include
+#include
+#include
 #include "internal.h"

 size_t mbsrtowcs(wchar_t *restrict ws, const char **restrict src, size_t wn, mbstate_t *restrict st)
@@ -24,6 +26,23 @@ size_t mbsrtowcs(wchar_t *restrict ws, const char **restrict src, size_t wn, mbs
 		}
 	}

+	if (MB_CUR_MAX==1) {
+		if (!ws) return strlen((const char *)s);
+		for (;;) {
+			if (!wn) {
+				*src = (const void *)s;
+				return wn0;
+			}
+			if (!*s) break;
+			c = *s++;
+			*ws++ = CODEUNIT(c);
+			wn--;
+		}
+		*ws = 0;
+		*src = 0;
+		return wn0-wn;
+	}
+
 	if (!ws) for (;;) {
 		if (*s-1u < 0x7f && (uintptr_t)s%4 == 0) {
 			while (!(( *(uint32_t*)s | *(uint32_t*)s-0x01010101) & 0x80808080)) {
diff --git a/src/multibyte/mbtowc.c b/src/multibyte/mbtowc.c
index 803d221..71a9506 100644
--- a/src/multibyte/mbtowc.c
+++ b/src/multibyte/mbtowc.c
@@ -4,6 +4,7 @@
  * unnecessary.
  */

+#include
 #include
 #include "internal.h"
@@ -19,6 +20,7 @@ int mbtowc(wchar_t *restrict wc, const char *restrict src, size_t n)
 	if (!wc) wc = &dummy;

 	if (*s < 0x80) return !!(*wc = *s);
+	if (MB_CUR_MAX==1) return (*wc = CODEUNIT(*s)), 1;
 	if (*s-SA > SB-SA) goto ilseq;
 	c = bittab[*s++-SA];

diff --git a/src/multibyte/wcrtomb.c b/src/multibyte/wcrtomb.c
index 59f733d..ddc37a5 100644
--- a/src/multibyte/wcrtomb.c
+++ b/src/multibyte/wcrtomb.c
@@ -4,8 +4,10 @@
  * unnecessary.
  */

+#include
 #include
 #include
+#include "internal.h"

 size_t wcrtomb(char *restrict s, wchar_t wc, mbstate_t *restrict st)
 {
@@ -13,6 +15,13 @@ size_t wcrtomb(char *restrict s, wchar_t wc, mbstate_t *restrict st)
 	if ((unsigned)wc < 0x80) {
 		*s = wc;
 		return 1;
+	} else if (MB_CUR_MAX == 1) {
+		if (!IS_CODEUNIT(wc)) {
+			errno = EILSEQ;
+			return -1;
+		}
+		*s = wc;
+		return 1;
 	} else if ((unsigned)wc < 0x800) {
 		*s++ = 0xc0 | (wc>>6);
 		*s = 0x80 | (wc&0x3f);
diff --git a/src/multibyte/wctob.c b/src/multibyte/wctob.c
index d6353ee..4aeda6a 100644
--- a/src/multibyte/wctob.c
+++ b/src/multibyte/wctob.c
@@ -1,8 +1,10 @@
-#include
 #include
+#include
+#include "internal.h"

 int wctob(wint_t c)
 {
 	if (c < 128U) return c;
+	if (MB_CUR_MAX==1 && IS_CODEUNIT(c)) return (unsigned char)c;
 	return EOF;
 }
diff --git a/src/regex/fnmatch.c b/src/regex/fnmatch.c
index 7f6b65f..978fff8 100644
--- a/src/regex/fnmatch.c
+++ b/src/regex/fnmatch.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include "locale_impl.h"

 #define END 0
 #define UNMATCHABLE -2
@@ -229,7 +230,7 @@ static int fnmatch_internal(const char *pat, size_t m, const char *str, size_t n
 	 * On illegal sequences we may get it wrong, but in that case
 	 * we necessarily have a matching failure anyway. */
 	for (s=endstr; s>str && tailcnt; tailcnt--) {
-		if (s[-1] < 128U) s--;
+		if (s[-1] < 128U || MB_CUR_MAX==1) s--;
 		else while ((unsigned char)*--s-0x80U<0x40 && s>str);
 	}
 	if (tailcnt) return FNM_NOMATCH;

--nzri8VXeXB/g5ayr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="bytelocale-part2.diff"

diff --git a/src/internal/stdio_impl.h b/src/internal/stdio_impl.h
index e1325fe..72c5519 100644
--- a/src/internal/stdio_impl.h
+++ b/src/internal/stdio_impl.h
@@ -47,6 +47,7 @@ struct _IO_FILE {
 	unsigned char *shend;
 	off_t shlim, shcnt;
 	FILE *prev_locked, *next_locked;
+	struct __locale_struct *locale;
 };

 size_t __stdio_read(FILE *, unsigned char *, size_t);
diff --git a/src/locale/iconv.c b/src/locale/iconv.c
index e6121ae..1eeea94 100644
--- a/src/locale/iconv.c
+++ b/src/locale/iconv.c
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include "locale_impl.h"

 #define UTF_32BE 0300
 #define UTF_16LE 0301
@@ -165,9 +166,12 @@ size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restr
 	int err;
 	unsigned char type = map[-1];
 	unsigned char totype = tomap[-1];
+	locale_t *ploc = &CURRENT_LOCALE, loc = *ploc;

 	if (!in || !*in || !*inb) return 0;

+	*ploc = UTF8_LOCALE;
+
 	for (; *inb; *in+=l, *inb-=l) {
 		c = *(unsigned char *)*in;
 		l = 1;
@@ -431,6 +435,7 @@ size_t iconv(iconv_t cd0, char **restrict in, size_t *restrict inb, char **restr
 			break;
 		}
 	}
+	*ploc = loc;
 	return x;
 ilseq:
 	err = EILSEQ;
@@ -445,5 +450,6 @@ starved:
 	x = -1;
 end:
 	errno = err;
+	*ploc = loc;
 	return x;
 }
diff --git a/src/stdio/fgetwc.c b/src/stdio/fgetwc.c
index b261b44..e455cfe 100644
--- a/src/stdio/fgetwc.c
+++ b/src/stdio/fgetwc.c
@@ -1,8 +1,9 @@
 #include "stdio_impl.h"
+#include "locale_impl.h"
 #include
 #include

-wint_t __fgetwc_unlocked(FILE *f)
+static wint_t __fgetwc_unlocked_internal(FILE *f)
 {
 	mbstate_t st = { 0 };
 	wchar_t wc;
@@ -10,8 +11,6 @@ wint_t __fgetwc_unlocked(FILE *f)
 	unsigned char b;
 	size_t l;

-	if (f->mode <= 0) fwide(f, 1);
-
 	/* Convert character from buffer if possible */
 	if (f->rpos < f->rend) {
 		l = mbrtowc(&wc, (void *)f->rpos, f->rend - f->rpos, &st);
@@ -39,6 +38,16 @@ wint_t __fgetwc_unlocked(FILE *f)
 	return wc;
 }

+wint_t __fgetwc_unlocked(FILE *f)
+{
+	locale_t *ploc = &CURRENT_LOCALE, loc = *ploc;
+	if (f->mode <= 0) fwide(f, 1);
+	*ploc = f->locale;
+	wchar_t wc = __fgetwc_unlocked_internal(f);
+	*ploc = loc;
+	return wc;
+}
+
 wint_t fgetwc(FILE *f)
 {
 	wint_t c;
diff --git a/src/stdio/fputwc.c b/src/stdio/fputwc.c
index 1bf165b..0be5666 100644
--- a/src/stdio/fputwc.c
+++ b/src/stdio/fputwc.c
@@ -1,4 +1,5 @@
 #include "stdio_impl.h"
+#include "locale_impl.h"
 #include
 #include
 #include
@@ -7,8 +8,10 @@ wint_t __fputwc_unlocked(wchar_t c, FILE *f)
 {
 	char mbc[MB_LEN_MAX];
 	int l;
+	locale_t *ploc = &CURRENT_LOCALE, loc = *ploc;

 	if (f->mode <= 0) fwide(f, 1);
+	*ploc = f->locale;

 	if (isascii(c)) {
 		c = putc_unlocked(c, f);
@@ -18,8 +21,11 @@ wint_t __fputwc_unlocked(wchar_t c, FILE *f)
 		else f->wpos += l;
 	} else {
 		l = wctomb(mbc, c);
-		if (l < 0 || __fwritex((void *)mbc, l, f) < l) c = WEOF;
+		if (l < 0 || __fwritex((void *)mbc, l, f) < l)
+			c = WEOF;
 	}
+	if (c==WEOF) f->flags |= F_ERR;
+	*ploc = loc;
 	return c;
 }

diff --git a/src/stdio/fputws.c b/src/stdio/fputws.c
index 317d65f..0ed02f1 100644
--- a/src/stdio/fputws.c
+++ b/src/stdio/fputws.c
@@ -1,23 +1,28 @@
 #include "stdio_impl.h"
+#include "locale_impl.h"
 #include

 int fputws(const wchar_t *restrict ws, FILE *restrict f)
 {
 	unsigned char buf[BUFSIZ];
 	size_t l=0;
+	locale_t *ploc = &CURRENT_LOCALE, loc = *ploc;

 	FLOCK(f);

 	fwide(f, 1);
+	*ploc = f->locale;

 	while (ws && (l = wcsrtombs((void *)buf, (void*)&ws, sizeof buf, 0))+1 > 1)
 		if (__fwritex(buf, l, f) < l) {
 			FUNLOCK(f);
+			*ploc = loc;
 			return -1;
 		}

 	FUNLOCK(f);

+	*ploc = loc;
 	return l; /* 0 or -1 */
 }
diff --git a/src/stdio/fwide.c b/src/stdio/fwide.c
index 8088e7a..8410b15 100644
--- a/src/stdio/fwide.c
+++ b/src/stdio/fwide.c
@@ -1,13 +1,14 @@
-#include
 #include "stdio_impl.h"
-
-#define SH (8*sizeof(int)-1)
-#define NORMALIZE(x) ((x)>>SH | -((-(x))>>SH))
+#include "locale_impl.h"

 int fwide(FILE *f, int mode)
 {
 	FLOCK(f);
-	if (!f->mode) f->mode = NORMALIZE(mode);
+	if (mode) {
+		if (!f->locale) f->locale = MB_CUR_MAX==1
+			? C_LOCALE : UTF8_LOCALE;
+		if (!f->mode) f->mode = mode>0 ? 1 : -1;
+	}
 	mode = f->mode;
 	FUNLOCK(f);
 	return mode;
diff --git a/src/stdio/ungetwc.c b/src/stdio/ungetwc.c
index d4c7de3..80d6e20 100644
--- a/src/stdio/ungetwc.c
+++ b/src/stdio/ungetwc.c
@@ -1,4 +1,5 @@
 #include "stdio_impl.h"
+#include "locale_impl.h"
 #include
 #include
 #include
@@ -8,15 +9,18 @@ wint_t ungetwc(wint_t c, FILE *f)
 {
 	unsigned char mbc[MB_LEN_MAX];
 	int l=1;
+	locale_t *ploc = &CURRENT_LOCALE, loc = *ploc;

 	FLOCK(f);

 	if (f->mode <= 0) fwide(f, 1);
+	*ploc = f->locale;

 	if (!f->rpos) __toread(f);
 	if (!f->rpos || f->rpos < f->buf - UNGET + l || c == WEOF || (!isascii(c) && (l = wctomb((void *)mbc, c)) < 0)) {
 		FUNLOCK(f);
+		*ploc = loc;
 		return WEOF;
 	}

@@ -26,5 +30,6 @@ wint_t ungetwc(wint_t c, FILE *f)
 	f->flags &= ~F_EOF;

 	FUNLOCK(f);
+	*ploc = loc;
 	return c;
 }

--nzri8VXeXB/g5ayr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="bytelocale-part3.diff"

diff --git a/include/stdlib.h b/include/stdlib.h
index 97ce5a7..d2c911f 100644
--- a/include/stdlib.h
+++ b/include/stdlib.h
@@ -76,7 +76,8 @@ size_t wcstombs (char *__restrict, const wchar_t *__restrict, size_t);
 #define EXIT_FAILURE 1
 #define EXIT_SUCCESS 0

-#define MB_CUR_MAX ((size_t)+4)
+size_t __ctype_get_mb_cur_max(void);
+#define MB_CUR_MAX (__ctype_get_mb_cur_max())

 #define RAND_MAX (0x7fffffff)

diff --git a/src/internal/locale_impl.h b/src/internal/locale_impl.h
index 85db793..f5e4d9b 100644
--- a/src/internal/locale_impl.h
+++ b/src/internal/locale_impl.h
@@ -34,4 +34,7 @@ const char *__lctrans_cur(const char *);

 #define CURRENT_UTF8 (!!__pthread_self()->locale->cat[LC_CTYPE])

+#undef MB_CUR_MAX
+#define MB_CUR_MAX (CURRENT_UTF8 ? 4 : 1)
+
 #endif

--nzri8VXeXB/g5ayr--
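
The code-unit encoding from the internal.h hunk in part 1 can be tried
outside libc. The sketch below copies the two macros verbatim;
byte_to_wc, wc_to_byte and all_bytes_roundtrip are hypothetical helpers
that mimic what the MB_CUR_MAX==1 paths in mbtowc and wctob do, and are
not musl functions. It assumes, as the patch does, that converting a
value of 0x80 or above to signed char yields a negative number. Bytes
0x80..0xff map to 0xdf80..0xdfff, which lies in the UTF-16 low
surrogate range, so these values cannot collide with anything that
valid UTF-8 decodes to.

```c
#include <assert.h>
#include <wchar.h>

/* Copied from the internal.h hunk in part 1. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper: byte -> wchar_t, following the MB_CUR_MAX==1
 * branch added to mbtowc. ASCII maps to itself. */
static wchar_t byte_to_wc(unsigned char b)
{
	return b < 0x80 ? (wchar_t)b : (wchar_t)CODEUNIT(b);
}

/* Hypothetical helper: the inverse mapping, following the patched
 * wctob. Returns -1 for values that are neither ASCII nor code units. */
static int wc_to_byte(wchar_t wc)
{
	if ((unsigned)wc < 0x80) return (int)wc;
	if (IS_CODEUNIT(wc)) return (unsigned char)wc;
	return -1;
}

/* Returns 1 if every byte value 0..255 survives the round trip. */
static int all_bytes_roundtrip(void)
{
	for (int b = 0; b < 256; b++) {
		wchar_t wc = byte_to_wc((unsigned char)b);
		/* High bytes must land in the 0xdf80..0xdfff range. */
		assert(b < 0x80 || IS_CODEUNIT(wc));
		if (wc_to_byte(wc) != b) return 0;
	}
	return 1;
}
```

Because the low 8 bits of the encoded value are the original byte, the
reverse direction needs nothing more than a range check and a cast,
which is all the patched wcrtomb and wctob do.
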