From: Felix Janda <felix.janda@posteo.de>
To: musl@lists.openwall.com
Subject: Re: Re: First feedback on new C locale problems
Date: Sun, 27 Sep 2015 08:17:38 +0200
Message-ID: <20150927061738.GA311@nyan>
In-Reply-To: <20150926193542.GO17773@brightrain.aerifal.cx>
Rich Felker wrote:
> On Sat, Sep 26, 2015 at 06:58:36AM +0200, Felix Janda wrote:
> > On 2015-09-09 05:56:48 GMT, Rich Felker wrote:
> > > On Tue, Sep 01, 2015 at 02:32:35AM -0400, Rich Felker wrote:
> > > > What I'd like to do to fix it is just always return "UTF-8" for
> > > > nl_langinfo(CODESET) regardless of locale (rather than returning
> > > > "UTF-8-CODE-UNITS" when in C locale). POSIX places no requirements on
> > > > nl_langinfo that would preclude this, and it seems like it would
> > > > restore the desired properties and fix all the regressions.
> > >
> > > Committed.
> > >
> > > Rich
> >
> > GNU sed seems to care about the output from nl_langinfo:
> >
> > https://bugs.gentoo.org/show_bug.cgi?id=560728
> >
> > More specifically, so does lib/localecharset.c, which is used in
> > the replacement of re_compile_pattern.
>
> I was able to reproduce this (with slightly different output, "a© a'")
> on Alpine. Clearly this is some sort of bug in the gnulib code or sed
> itself, since it's producing corrupt output. I think we should explore
> why that's happening and whether it's possible to fix there. But if
> there remain other reasons that returning "UTF-8" in the C locale is
> not practical then perhaps we could resort to returning "ASCII".

A possible fix is:

--- ./a/sed-4.2.1/lib/regcomp.c
+++ ./a/sed-4.2.1/lib/regcomp.c
@@ -824,7 +824,7 @@ re_compile_internal (regex_t *preg, cons
 #ifdef RE_ENABLE_I18N
   /* If possible, do searching in single byte encoding to speed things up. */
-  if (dfa->is_utf8 && !(syntax & RE_ICASE) && preg->translate == NULL)
+  if (dfa->is_utf8 && dfa->mb_cur_max != 1 && !(syntax & RE_ICASE) && preg->translate == NULL)
     optimize_utf8 (dfa);
 #endif

In our case is_utf8 is 1 and mb_cur_max is also 1. The function
optimize_utf8() would change "." to match UTF-8 characters instead of
bytes. For some reason that I have not investigated further, "©" (or
any other non-ASCII character) is then not matched. In the C locale we
want "." to match bytes that are not valid UTF-8 as well, so skipping
optimize_utf8() when mb_cur_max is 1 is the desired behavior anyway.

glibc seems to be the upstream for the code.
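
As a rough illustration of the condition discussed above, here is a
minimal sketch in C. The struct dfa_view and the helpers make_view()
and wants_optimize_utf8() are hypothetical names made up for this
example; they only mirror the is_utf8 and mb_cur_max fields mentioned
above and are not the real regcomp.c definitions. The codeset
comparison is also simplified and may differ from what the regex code
actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the two dfa fields discussed in this mail.
   The real structure in the regex code has many more members. */
struct dfa_view {
    bool is_utf8;    /* codeset name was reported as "UTF-8" */
    int mb_cur_max;  /* value of MB_CUR_MAX when the pattern is compiled */
};

/* Build the view from a codeset name and an MB_CUR_MAX value.
   Assumption: a plain strcmp() against "UTF-8" is close enough for
   illustration purposes. */
static struct dfa_view make_view(const char *codeset, int mb_cur_max)
{
    struct dfa_view v;

    assert(codeset != NULL);
    v.is_utf8 = strcmp(codeset, "UTF-8") == 0;
    v.mb_cur_max = mb_cur_max;
    return v;
}

/* The condition with the extra mb_cur_max check: a single-byte locale
   never takes the UTF-8 path, even if the codeset name is "UTF-8". */
static bool wants_optimize_utf8(struct dfa_view v, bool icase,
                                bool has_translate)
{
    return v.is_utf8 && v.mb_cur_max != 1 && !icase && !has_translate;
}
```

A caller would typically pass nl_langinfo(CODESET) and MB_CUR_MAX after
calling setlocale(). With musl in the C locale that gives "UTF-8" and 1,
so the helper returns false and "." keeps matching single bytes.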
Felix