From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/8587
Path: news.gmane.org!not-for-mail
From: Felix Janda <felix.janda@posteo.de>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Re: First feedback on new C locale problems
Date: Tue, 29 Sep 2015 06:00:44 +0200
Message-ID: <20150929040044.GA432@nyan>
References: <20150926045836.GA2341@nyan>
 <20150926193542.GO17773@brightrain.aerifal.cx>
 <20150927061738.GA311@nyan>
 <20150927134712.GQ17773@brightrain.aerifal.cx>
 <20150927134902.GA5764@nyan>
 <20150927165925.GR17773@brightrain.aerifal.cx>
 <20150928185837.GD17773@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1443499607 19498 80.91.229.3 (29 Sep 2015 04:06:47 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Tue, 29 Sep 2015 04:06:47 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-8599-gllmg-musl=m.gmane.org@lists.openwall.com Tue Sep 29 06:06:47 2015
Return-path: <musl-return-8599-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-8599-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1ZgmBf-00078z-Bp
	for gllmg-musl@m.gmane.org; Tue, 29 Sep 2015 06:06:43 +0200
Original-Received: (qmail 11886 invoked by uid 550); 29 Sep 2015 04:06:40 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 11861 invoked from network); 29 Sep 2015 04:06:39 -0000
Mail-Followup-To: musl@lists.openwall.com
Content-Disposition: inline
In-Reply-To: <20150928185837.GD17773@brightrain.aerifal.cx>
User-Agent: Mutt/1.5.23 (2014-03-12)
Xref: news.gmane.org gmane.linux.lib.musl.general:8587
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/8587>

Rich Felker wrote:
> On Sun, Sep 27, 2015 at 12:59:25PM -0400, Rich Felker wrote:
> > On Sun, Sep 27, 2015 at 03:49:02PM +0200, Felix Janda wrote:
> > > Rich Felker wrote:
> > > > On Sun, Sep 27, 2015 at 08:17:38AM +0200, Felix Janda wrote:
> > > > > Rich Felker wrote:
> > > > > > On Sat, Sep 26, 2015 at 06:58:36AM +0200, Felix Janda wrote:
> > > > > > > On 2015-09-09 05:56:48 GMT, Rich Felker wrote:
> > > > > > > > On Tue, Sep 01, 2015 at 02:32:35AM -0400, Rich Felker wrote:
> > > > > > > > > What I'd like to do to fix it is just always return "UTF-8" for
> > > > > > > > > nl_langinfo(CODESET) regardless of locale (rather than returning
> > > > > > > > > "UTF-8-CODE-UNITS" when in C locale). POSIX places no requirements on
> > > > > > > > > nl_langinfo that would preclude this, and it seems like it would
> > > > > > > > > restore the desired properties and fix all the regressions.
> > > > > > > >
> > > > > > > > Committed.
> > > > > > > >
> > > > > > > > Rich
> > > > > > > 
> > > > > > > GNU sed seems to care about the output from nl_langinfo:
> > > > > > > 
> > > > > > > https://bugs.gentoo.org/show_bug.cgi?id=560728
> > > > > > > 
> > > > > > > More specifically, so does lib/localecharset.c, which is used in
> > > > > > > the replacement of re_compile_pattern.
> > > > > > 
> > > > > > I was able to reproduce this (with slightly different output, "a© a'")
> > > > > > on Alpine. Clearly this is some sort of bug in the gnulib code or sed
> > > > > > itself, since it's producing corrupt output. I think we should explore
> > > > > > why that's happening and whether it's possible to fix there. But if
> > > > > > there remain other reasons that returning "UTF-8" in the C locale is
> > > > > > not practical then perhaps we could resort to returning "ASCII".
> > > > > 
> > > > > A possible fix is
> > > > > 
> > > > > --- ./a/sed-4.2.1/lib/regcomp.c
> > > > > +++ ./a/sed-4.2.1/lib/regcomp.c
> > > > > @@ -824,7 +824,7 @@ re_compile_internal (regex_t *preg, cons
> > > > >  
> > > > >  #ifdef RE_ENABLE_I18N
> > > > >    /* If possible, do searching in single byte encoding to speed things up.  */
> > > > > -  if (dfa->is_utf8 && dfa->mb_cur_max != 1 && !(syntax & RE_ICASE) && preg->translate == NULL)
> > > > > +  if (dfa->is_utf8 && !(syntax & RE_ICASE) && preg->translate == NULL)
> > > > >      optimize_utf8 (dfa);
> > > > >  #endif
> > > > >  
> > > > > 
> > > > > In our case is_utf8 is 1 and mb_cur_max is also 1. The function
> > > > > optimize_utf8() would change "." to match utf8 characters instead of
> > > > > bytes. For some reason that I have not investigated further, "©" (or any
> > > > > other non-ASCII character) is then not matched, but in the C locale we want
> > > > > "." to match invalid UTF-8 sequences as well anyway.
> > > > 
> > > > I think this fix is misplaced; it looks like it would make GNU regex
> > > > do UTF-8 character matching rather than byte matching in the C locale.
> > > > Rather one of the other places that has an is_utf8 check also needs to
> > > > have the mb_cur_max!=1 check added, I think.
> > > 
> > > Oh, sorry for the confusion. The patch is inverted...
> > 
> > Ah, ok. But in that case, it's probably best not to detect is_utf8 to
> > begin with if MB_CUR_MAX==1.
> > 
> > I should probably read the code and try to get a better understanding
> > of what it's doing.
> 
> I think the actual error is here:
> 
> http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/regcomp.c#n903
> 
> In the _LIBC code path, they check MB_CUR_MAX==6 (glibc's nonstandard
> value they use for UTF-8) perhaps just as an optimization of the
> non-UTF-8 case, but they don't check it for !_LIBC; they just rely on
> the CODESET name matching.

After your previous mail I had come to the same conclusion. Maybe they
would not be opposed to optimizing the non-UTF-8 case with an MB_CUR_MAX
check when !_LIBC as well.
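
To make the idea concrete, here is a minimal sketch of such a check. It is
not the gnulib code: use_utf8_matching() is a hypothetical helper made up
for this mail, and it only assumes that the caller passes in the string
from nl_langinfo(CODESET) and the current value of MB_CUR_MAX.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of gnulib: decide whether a regex
   engine should take its UTF-8 (multibyte) code path.  */
int
use_utf8_matching (const char *codeset, size_t mb_cur_max)
{
  /* A locale with MB_CUR_MAX == 1 is single-byte, so byte matching is
     correct no matter which name nl_langinfo(CODESET) reports.  */
  if (mb_cur_max == 1)
    return 0;

  /* Otherwise fall back to the codeset name, as the !_LIBC path does.  */
  return strcmp (codeset, "UTF-8") == 0;
}
```

A caller would invoke it after setlocale() as
use_utf8_matching (nl_langinfo (CODESET), MB_CUR_MAX). With musl's C
locale reporting "UTF-8" and MB_CUR_MAX being 1, this returns 0, so "."
would keep matching single bytes.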

Unfortunately, the GNU regex code seems to have been copied into quite a
lot of projects, so a fix in gnulib alone would not reach every copy.

Felix

> I'm still somewhat concerned that returning "UTF-8" is problematic
> here, but I think gnulib also has a bug; trusting their interpretation
> of the string returned by nl_langinfo(CODESET) seems to be leading to
> corrupt results.
>
> Rich