Date: Thu, 03 Sep 2015 10:00:37 +0100
From: Peter Stephenson
To: Zsh hackers list
Subject: Re: invalid characters and multi-byte [x-y] ranges
Message-id: <20150903100037.6e6ac852@pwslap01u.europe.root.pri>
In-reply-to: <20150902230711.GA4967@chaz.gmail.com>

On Thu, 3 Sep 2015 00:07:11 +0100
Stephane Chazelas wrote:
> is this (in a UTF-8 locale):
>
> $ zsh -c $'[[ \xcc = [\uaa-\udd] ]]' && echo yes
> yes
>
> expected or desirable?

This comes from the function charref() in pattern.c.  We discover the
sequence is incomplete / invalid and don't know what to do with it, so
we simply treat the single byte as a character:

    return (wchar_t) STOUC(*x);

(the macro ensures we get an unsigned value to cast).  Typically this
will do what you see, since the byte 0xcc then compares as a character
inside the range \uaa-\udd (though wchar_t isn't guaranteed to have
that property).

I'm not sure what else to do here.  The function is used all over the
pattern code, so anything other than tweaking the code locally to return
another character (which one?) would be horrific to get consistent.

We don't want

    [[ $'\xcc' = $'\xdd' ]]

to succeed, but ideally we do want

    [[ $'\xcc' = $'\xcc' ]]

to succeed by comparing raw bytes (we're not morally forced to do that
in a UTF-8 locale, I don't think, but it wouldn't be very helpful if it
didn't work).

If wchar_t is 32 bits (the only place where it wasn't used to be Cygwin,
but I think that's changed) we could cheat by adding (wchar_t)0x7FFFFF00
to it... that would fix your problem and (I hope) keep the two cases
above working, and minimise the likelihood of generating a valid
character... that's about the least horrific I can come up with.

pws
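
For illustration, here is a minimal sketch of the offset idea discussed above. It is not the actual charref() code: the helper names (byte_as_char, byte_shifted, in_range) and the INVALID_BYTE_BASE macro are hypothetical, int32_t stands in for a 32-bit wchar_t, and a plain cast to unsigned char stands in for STOUC().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the offset suggested in the message; the sum
 * with any byte value (0..0xFF) still fits in a signed 32-bit integer. */
#define INVALID_BYTE_BASE 0x7FFFFF00

/* Behaviour described in the message: the raw byte is used directly
 * as the character value (int32_t stands in for a 32-bit wchar_t). */
int32_t byte_as_char(char c)
{
    return (int32_t)(unsigned char)c;
}

/* Suggested tweak: move the raw byte far away from the values normally
 * used for real characters, so it cannot land inside a range. */
int32_t byte_shifted(char c)
{
    return INVALID_BYTE_BASE + (int32_t)(unsigned char)c;
}

/* Simplified stand-in for matching a character against [lo-hi]. */
int in_range(int32_t ch, int32_t lo, int32_t hi)
{
    return lo <= ch && ch <= hi;
}

/* Checks the three cases mentioned in the message. */
void check_examples(void)
{
    /* Unshifted, the stray byte 0xcc falls inside \uaa-\udd. */
    assert(in_range(byte_as_char('\xcc'), 0xaa, 0xdd));
    /* Shifted, it no longer does. */
    assert(!in_range(byte_shifted('\xcc'), 0xaa, 0xdd));
    /* Identical raw bytes still compare equal, different ones do not. */
    assert(byte_shifted('\xcc') == byte_shifted('\xcc'));
    assert(byte_shifted('\xcc') != byte_shifted('\xdd'));
}
```

Calling check_examples() exercises the range case from the original report plus the two raw-byte comparisons. Whether the shifted values could collide with anything else the pattern code produces is not something this sketch addresses.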