To: zsh-workers@zsh.org
From: Stephane Chazelas
Subject: Re: invalid characters and multi-byte [x-y] ranges
Date: Thu, 3 Sep 2015 11:09:44 +0100
Message-ID: <20150903100943.GB7821@chaz.gmail.com>
References: <20150902230711.GA4967@chaz.gmail.com> <20150903100037.6e6ac852@pwslap01u.europe.root.pri>
In-Reply-To: <20150903100037.6e6ac852@pwslap01u.europe.root.pri>

2015-09-03 10:00:37 +0100, Peter Stephenson:
> On Thu, 3 Sep 2015 00:07:11 +0100
> Stephane Chazelas wrote:
> > is this (in a UTF-8 locale):
> >
> > $ zsh -c $'[[ \xcc = [\uaa-\udd] ]]' && echo yes
> > yes
> >
> > expected or desirable?
>
> This comes from the function charref() in pattern.c.  We discover the
> sequence is incomplete / invalid and don't know what to do with it, so we
> simply treat the single byte as a character:
>
>   return (wchar_t) STOUC(*x);
>
> (the macro ensures we get an unsigned value to cast).  Typically this
> will do what you see (though wchar_t isn't guaranteed to have that
> property).
>
> I'm not sure what else to do here.
> The function is used all over the
> pattern code so anything other than tweaking the code locally to return
> another character (what?) is horrific to get consistent.  We don't want
> [[ $'\xcc' = $'\xdd' ]] to succeed, but ideally we do want [[ $'\xcc' =
> $'\xcc' ]] to succeed comparing raw bytes (we're not morally forced to
> do that in a UTF-8 locale, I don't think, but it wouldn't be very
> helpful if it didn't work).
>
> If wchar_t is 32 bits (the only place where it wasn't used to be Cygwin
> but I think that's changed) we could cheat by adding (wchar_t)0x7FFFFF00
> to it... that would fix your problem and (I hope) keep the two above
> working, and minimise the likelihood of generating a valid character...
> that's about the least horrific I can come up with.
[...]

There was a related discussion not so long ago on the Austin
group ML (zsh was also mentioned there):

http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11118

(the whole discussion started at
http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098
is interesting)

and see:

https://www.python.org/dev/peps/pep-0383/

An approach discussed there was to internally represent bytes
not forming part of a valid character as code points in the
range D800-DFFF (specifically DC80-DCFF for bytes 0x80 to 0xff).
Those code points are reserved in Unicode for UTF-16 surrogates
and are *not* characters. In particular, the byte sequence that
would be the UTF-8 encoding of 0xD800, for example (ed a0 80),
would not form a valid character, so it would be internally
represented as DCED DCA0 DC80.

I believe that's what python3 does.

-- 
Stephane
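
For illustration, here is a minimal sketch of the PEP 383 mapping
described above, using Python 3's built-in "surrogateescape" error
handler. It only shows how python3 behaves; it is not a proposal for
how charref() in zsh should be changed, and the variable names are
made up for the example.

```python
# Sketch of the PEP 383 "surrogateescape" error handler in Python 3.
raw = b"\xcc"  # a lone lead byte, not a valid UTF-8 sequence

# The undecodable byte 0xcc is mapped to the lone surrogate U+DCCC.
escaped = raw.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in escaped])  # ['0xdccc']

# Treated as its raw byte value, 0xcc falls inside U+00AA..U+00DD;
# once mapped to U+DCCC it no longer does.
in_range_as_byte = 0xAA <= raw[0] <= 0xDD
in_range_escaped = 0xAA <= ord(escaped) <= 0xDD

# The would-be UTF-8 encoding of U+D800 (ed a0 80) is rejected by the
# decoder, so each of its three bytes is escaped individually.
surrogate_bytes = b"\xed\xa0\x80"
escaped_surrogate = surrogate_bytes.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original byte.
round_trip = escaped.encode("utf-8", errors="surrogateescape")
```

With this mapping, comparing two escaped bytes still works as a raw
byte comparison (0xcc equals 0xcc, and differs from 0xdd), while an
invalid byte can never match a range of valid characters.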