* Re: First feedback on new C locale problems
@ 2015-09-26 4:58 Felix Janda
2015-09-26 19:35 ` Rich Felker
0 siblings, 1 reply; 8+ messages in thread
From: Felix Janda @ 2015-09-26 4:58 UTC (permalink / raw)
To: musl
On 2015-09-09 05:56:48 GMT, Rich Felker wrote:
> On Tue, Sep 01, 2015 at 02:32:35AM -0400, Rich Felker wrote:
> > What I'd like to do to fix it is just always return "UTF-8" for
> > nl_langinfo(CODESET) regardless of locale (rather than returning
> > "UTF-8-CODE-UNITS" when in C locale). POSIX places no requirements on
> > nl_langinfo that would preclude this, and it seems like it would
> > restore the desired properties and fix all the regressions.
>
> Committed.
>
> Rich
GNU sed seems to care about the output from nl_langinfo:
https://bugs.gentoo.org/show_bug.cgi?id=560728
More specifically, so does lib/localecharset.c, which is used in
the replacement of re_compile_pattern.
Felix
^ permalink raw reply [flat|nested] 8+ messages in thread
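The two values this thread keeps returning to, the string from nl_langinfo(CODESET) and MB_CUR_MAX, can be inspected directly. The following is a minimal sketch assuming a POSIX system; the struct and the helper name c_locale_info are made up for illustration and are not part of musl, gnulib, or sed. What it reports depends on the libc in use.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper type, used only for this illustration. */
struct c_locale_info {
    const char *codeset;  /* result of nl_langinfo(CODESET); owned by libc */
    size_t mb_cur_max;    /* value of MB_CUR_MAX in the C locale */
};

/* Switch to the C locale and report what the libc says about it. */
struct c_locale_info c_locale_info(void)
{
    struct c_locale_info info;

    setlocale(LC_ALL, "C");
    info.codeset = nl_langinfo(CODESET);
    info.mb_cur_max = MB_CUR_MAX;
    return info;
}
```

Printing both fields (for example with printf("%s %zu\n", info.codeset, info.mb_cur_max)) shows whether a given libc pairs a "UTF-8" codeset name with a single-byte MB_CUR_MAX, which is the combination discussed here.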
* Re: Re: First feedback on new C locale problems
2015-09-26 4:58 First feedback on new C locale problems Felix Janda
@ 2015-09-26 19:35 ` Rich Felker
2015-09-27 6:17 ` Felix Janda
0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2015-09-26 19:35 UTC (permalink / raw)
To: musl
On Sat, Sep 26, 2015 at 06:58:36AM +0200, Felix Janda wrote:
> GNU sed seems to care about the output from nl_langinfo:
>
> https://bugs.gentoo.org/show_bug.cgi?id=560728
>
> More specifically, so does lib/localecharset.c, which is used in
> the replacement of re_compile_pattern.

I was able to reproduce this (with slightly different output, "a© a'")
on Alpine. Clearly this is some sort of bug in the gnulib code or sed
itself, since it's producing corrupt output. I think we should explore
why that's happening and whether it's possible to fix there. But if
there remain other reasons that returning "UTF-8" in the C locale is
not practical then perhaps we could resort to returning "ASCII".

Rich
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Re: First feedback on new C locale problems
2015-09-26 19:35 ` Rich Felker
@ 2015-09-27 6:17 ` Felix Janda
2015-09-27 13:47 ` Rich Felker
0 siblings, 1 reply; 8+ messages in thread
From: Felix Janda @ 2015-09-27 6:17 UTC (permalink / raw)
To: musl
Rich Felker wrote:
> I was able to reproduce this (with slightly different output, "a© a'")
> on Alpine. Clearly this is some sort of bug in the gnulib code or sed
> itself, since it's producing corrupt output. I think we should explore
> why that's happening and whether it's possible to fix there. But if
> there remain other reasons that returning "UTF-8" in the C locale is
> not practical then perhaps we could resort to returning "ASCII".

A possible fix is

--- ./a/sed-4.2.1/lib/regcomp.c
+++ ./a/sed-4.2.1/lib/regcomp.c
@@ -824,7 +824,7 @@ re_compile_internal (regex_t *preg, cons

 #ifdef RE_ENABLE_I18N
   /* If possible, do searching in single byte encoding to speed things up. */
-  if (dfa->is_utf8 && dfa->mb_cur_max != 1 && !(syntax & RE_ICASE) && preg->translate == NULL)
+  if (dfa->is_utf8 && !(syntax & RE_ICASE) && preg->translate == NULL)
     optimize_utf8 (dfa);
 #endif

In our case is_utf8 is 1 and mb_cur_max is also 1. The function
optimize_utf8() would change "." to match UTF-8 characters instead of
bytes. For some reason that I have not investigated further, "©" (or
any other non-ASCII character) is then not matched, but in the C locale
we want "." to also match invalid UTF-8 sequences anyway.

glibc seems to be the upstream for the code.

Felix
^ permalink raw reply [flat|nested] 8+ messages in thread
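The difference between "." matching a byte and "." matching a UTF-8 character can be made concrete with a small sketch. This is not the GNU regex code; the function names utf8_seq_len and dot_match_len are hypothetical, and the validity check is deliberately simplified (it does not reject overlong or surrogate encodings, and it ignores the newline and NUL exclusions a real regex syntax may apply). "©" is the two bytes 0xC2 0xA9 in UTF-8.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence starting at s (n bytes
 * available), or 0 if the bytes do not form a sequence.
 * Simplified: no overlong or surrogate checks. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        len = 1;
    else if (s[0] >= 0xC2 && s[0] <= 0xDF)
        len = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF)
        len = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4)
        len = 4;
    else
        return 0;               /* stray continuation or invalid lead byte */
    if (len > n)
        return 0;               /* truncated sequence */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* expected a continuation byte */
    return len;
}

/* Number of bytes a "." would consume at s; 0 means no match.
 * byte_mode != 0 models the C locale: every byte is one unit.
 * byte_mode == 0 models character matching in a UTF-8 locale. */
size_t dot_match_len(const unsigned char *s, size_t n, int byte_mode)
{
    if (n == 0)
        return 0;
    return byte_mode ? 1 : utf8_seq_len(s, n);
}
```

In byte mode a lone 0xA9 still matches, which is the behaviour wanted for the C locale; in character mode it matches nothing.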
* Re: Re: First feedback on new C locale problems
2015-09-27 6:17 ` Felix Janda
@ 2015-09-27 13:47 ` Rich Felker
2015-09-27 13:49 ` Felix Janda
0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2015-09-27 13:47 UTC (permalink / raw)
To: musl
On Sun, Sep 27, 2015 at 08:17:38AM +0200, Felix Janda wrote:
> A possible fix is
>
> --- ./a/sed-4.2.1/lib/regcomp.c
> +++ ./a/sed-4.2.1/lib/regcomp.c
> @@ -824,7 +824,7 @@ re_compile_internal (regex_t *preg, cons
>
>  #ifdef RE_ENABLE_I18N
>    /* If possible, do searching in single byte encoding to speed things up. */
> -  if (dfa->is_utf8 && dfa->mb_cur_max != 1 && !(syntax & RE_ICASE) && preg->translate == NULL)
> +  if (dfa->is_utf8 && !(syntax & RE_ICASE) && preg->translate == NULL)
>      optimize_utf8 (dfa);
>  #endif
>
> In our case is_utf8 is 1 and mb_cur_max is also 1. The function
> optimize_utf8() would change "." to match UTF-8 characters instead of
> bytes. For some reason that I have not investigated further, "©" (or
> any other non-ASCII character) is then not matched, but in the C locale
> we want "." to also match invalid UTF-8 sequences anyway.

I think this fix is misplaced; it looks like it would make GNU regex
do UTF-8 character matching rather than byte matching in the C locale.
Rather, one of the other places that has an is_utf8 check also needs to
have the mb_cur_max!=1 check added, I think.

Rich
^ permalink raw reply [flat|nested] 8+ messages in thread
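Whether a regex implementation does byte matching or character matching in the C locale can be observed from outside. The sketch below uses the POSIX regcomp()/regexec() interface of whatever libc it is linked against, so it exercises the system regex and not the gnulib copy bundled with sed; the helper name dot_bytes_in_c_locale is hypothetical. With byte semantics, one would expect "." to consume exactly one byte of the two-byte sequence for "©".

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes that "." consumes at the start
 * of s when matching in the C locale, or -1 if compilation or
 * matching fails. */
int dot_bytes_in_c_locale(const char *s)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    setlocale(LC_ALL, "C");
    if (regcomp(&re, "^.", 0) != 0)
        return -1;
    if (regexec(&re, s, 1, &m, 0) == 0)
        len = (int)(m.rm_eo - m.rm_so);  /* byte offsets of the match */
    regfree(&re);
    return len;
}
```

Calling dot_bytes_in_c_locale("\xC2\xA9") and comparing the result against 1 gives a quick check for the kind of mismatch described in this thread.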
* Re: Re: First feedback on new C locale problems
2015-09-27 13:47 ` Rich Felker
@ 2015-09-27 13:49 ` Felix Janda
2015-09-27 16:59 ` Rich Felker
0 siblings, 1 reply; 8+ messages in thread
From: Felix Janda @ 2015-09-27 13:49 UTC (permalink / raw)
To: musl
Rich Felker wrote:
> I think this fix is misplaced; it looks like it would make GNU regex
> do UTF-8 character matching rather than byte matching in the C locale.
> Rather one of the other places that has an is_utf8 check also needs to
> have the mb_cur_max!=1 check added, I think.

Oh, sorry for the confusion. The patch is inverted...

Felix
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Re: First feedback on new C locale problems
2015-09-27 13:49 ` Felix Janda
@ 2015-09-27 16:59 ` Rich Felker
2015-09-28 18:58 ` Rich Felker
0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2015-09-27 16:59 UTC (permalink / raw)
To: musl
On Sun, Sep 27, 2015 at 03:49:02PM +0200, Felix Janda wrote:
> Oh, sorry for the confusion. The patch is inverted...

Ah, ok. But in that case, it's probably best not to detect is_utf8 to
begin with if MB_CUR_MAX==1.

I should probably read the code and try to get a better understanding
of what it's doing.

Rich
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Re: First feedback on new C locale problems
2015-09-27 16:59 ` Rich Felker
@ 2015-09-28 18:58 ` Rich Felker
2015-09-29 4:00 ` Felix Janda
0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2015-09-28 18:58 UTC (permalink / raw)
To: musl
On Sun, Sep 27, 2015 at 12:59:25PM -0400, Rich Felker wrote:
> Ah, ok. But in that case, it's probably best not to detect is_utf8 to
> begin with if MB_CUR_MAX==1.
>
> I should probably read the code and try to get a better understanding
> of what it's doing.

I think the actual error is here:

http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/regcomp.c#n903

In the _LIBC code path, they check MB_CUR_MAX==6 (glibc's nonstandard
value they use for UTF-8), perhaps just as an optimization of the
non-UTF-8 case, but they don't check it for !_LIBC; they just rely on
the CODESET name matching.

I'm still somewhat concerned that returning "UTF-8" is problematic
here, but I think gnulib also has a bug; trusting their interpretation
of the string returned by nl_langinfo(CODESET) seems to be leading to
corrupt results.

Rich
^ permalink raw reply [flat|nested] 8+ messages in thread
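The idea of not treating a locale as UTF-8 when MB_CUR_MAX is 1, rather than relying on the CODESET name alone, can be sketched as follows. This is an illustration of the suggestion made in the thread, not the gnulib or glibc code: the names treat_as_utf8 and current_locale_is_utf8 are hypothetical, and the exact-match comparison against "UTF-8" is an assumption made for brevity.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical predicate: decide whether multibyte UTF-8 handling
 * should be enabled, given the codeset name and MB_CUR_MAX. */
int treat_as_utf8(const char *codeset, size_t mb_cur_max)
{
    /* A locale whose characters are all single bytes cannot need
     * UTF-8 decoding, whatever the codeset is called. */
    if (mb_cur_max <= 1)
        return 0;
    /* Assumption: only the exact name "UTF-8" is recognised. */
    return codeset != NULL && strcmp(codeset, "UTF-8") == 0;
}

/* Apply the predicate to the currently active locale. */
int current_locale_is_utf8(void)
{
    return treat_as_utf8(nl_langinfo(CODESET), MB_CUR_MAX);
}
```

With this ordering, a C locale that reports "UTF-8" from nl_langinfo(CODESET) but has MB_CUR_MAX equal to 1 stays on the byte-oriented path.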
* Re: Re: First feedback on new C locale problems
2015-09-28 18:58 ` Rich Felker
@ 2015-09-29 4:00 ` Felix Janda
0 siblings, 0 replies; 8+ messages in thread
From: Felix Janda @ 2015-09-29 4:00 UTC (permalink / raw)
To: musl
Rich Felker wrote:
> I think the actual error is here:
>
> http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/regcomp.c#n903
>
> In the _LIBC code path, they check MB_CUR_MAX==6 (glibc's nonstandard
> value they use for UTF-8), perhaps just as an optimization of the
> non-UTF-8 case, but they don't check it for !_LIBC; they just rely on
> the CODESET name matching.

After your previous mail I had come to the same conclusion. Maybe they
would not be opposed to optimizing the non-UTF-8 case when !_LIBC using
MB_CUR_MAX.

Unfortunately, the GNU regex code seems to be copied into quite a lot
of projects.

Felix

> I'm still somewhat concerned that returning "UTF-8" is problematic
> here, but I think gnulib also has a bug; trusting their interpretation
> of the string returned by nl_langinfo(CODESET) seems to be leading to
> corrupt results.
>
> Rich
^ permalink raw reply [flat|nested] 8+ messages in thread
Thread overview: 8+ messages
2015-09-26 4:58 First feedback on new C locale problems Felix Janda
2015-09-26 19:35 ` Rich Felker
2015-09-27 6:17 ` Felix Janda
2015-09-27 13:47 ` Rich Felker
2015-09-27 13:49 ` Felix Janda
2015-09-27 16:59 ` Rich Felker
2015-09-28 18:58 ` Rich Felker
2015-09-29 4:00 ` Felix Janda