From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 7764 invoked by alias); 3 Sep 2015 14:18:21 -0000
Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm
Precedence: bulk
X-No-Archive: yes
List-Id: Zsh Workers List
List-Post:
List-Help:
X-Seq: 36415
Received: (qmail 24689 invoked from network); 3 Sep 2015 14:18:19 -0000
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on f.primenet.com.au
X-Spam-Level:
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.0
X-AuditID: cbfec7f5-f794b6d000001495-df-55e856a79bec
Date: Thu, 03 Sep 2015 15:18:11 +0100
From: Peter Stephenson
To: zsh-workers@zsh.org
Subject: Re: invalid characters and multi-byte [x-y] ranges
Message-id: <20150903151811.557a40ec@pwslap01u.europe.root.pri>
In-reply-to: <20150903100943.GB7821@chaz.gmail.com>
References: <20150902230711.GA4967@chaz.gmail.com>
	<20150903100037.6e6ac852@pwslap01u.europe.root.pri>
	<20150903100943.GB7821@chaz.gmail.com>
Organization: Samsung Cambridge Solution Centre
X-Mailer: Claws Mail 3.7.9 (GTK+ 2.22.0; i386-redhat-linux-gnu)
MIME-version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-transfer-encoding: 7bit

On Thu, 3 Sep 2015
11:09:44 +0100 Stephane Chazelas wrote:
> A discussed approach there was to internally represent bytes not
> forming part of a valid character as code points in the range
> D800-DFFF (specifically DC80 DCFF for bytes 0x80 to 0xff)

That's easy if wchar_t is actually Unicode.  I'm not sure how to do it
otherwise.  We could treat it identically to the Unicode conversion of
0xDC00 + STOUC(ch) to wchar_t, e.g. iconv UCS-4 to WCHAR_T, but is that
guaranteed to work?  This needs to be a robust fallback and it's not
clear relying on iconv is the right thing to do.

The safe option would be only to use this if #ifdef __STDC_ISO_10646__.

On the other hand, it's probably not going to be worse than the previous
code...

pws

diff --git a/Src/pattern.c b/Src/pattern.c
index 7d38988..7457cbd 100644
--- a/Src/pattern.c
+++ b/Src/pattern.c
@@ -224,6 +224,22 @@ typedef zlong zrange_t;
 typedef unsigned long zrange_t;
 #endif
 
+#ifdef MULTIBYTE_SUPPORT
+/*
+ * Handle a byte that's not part of a valid character.
+ *
+ * This range in Unicode is recommended for purposes of this
+ * kind as it corresponds to invalid characters.
+ *
+ * Note that this strictly only works if wchar_t represents
+ * Unicode code points, which isn't necessarily true; however,
+ * converting an invalid character into an unknown format is
+ * a bit tricky...
+ */
+#define WCHAR_INVALID(ch) \
+    ((wchar_t) (0xDC00 + STOUC(ch)))
+#endif /* MULTIBYTE_SUPPORT */
+
 /*
  * Array of characters corresponding to zpc_chars enum, which it must match.
  */
@@ -353,10 +369,10 @@ metacharinc(char **x)
         return wc;
     }
 
-    /* Error. Treat as single byte. */
+    /* Error. */
     /* Reset the shift state for next time. */
     memset(&shiftstate, 0, sizeof(shiftstate));
-    return (wchar_t) STOUC(*(*x)++);
+    return WCHAR_INVALID(*(*x)++);
 }
 
 #else
@@ -1867,10 +1883,10 @@ charref(char *x, char *y)
     ret = mbrtowc(&wc, x, y-x, &shiftstate);
 
     if (ret == MB_INVALID || ret == MB_INCOMPLETE) {
-        /* Error. Treat as single byte. */
+        /* Error. */
         /* Reset the shift state for next time. */
         memset(&shiftstate, 0, sizeof(shiftstate));
-        return (wchar_t) STOUC(*x);
+        return WCHAR_INVALID(*x);
     }
 
     return wc;
@@ -1913,7 +1929,7 @@ charrefinc(char **x, char *y, int *z)
     size_t ret;
 
     if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(**x) & 0x80))
-        return (wchar_t) STOUC(*(*x)++);
+        return WCHAR_INVALID(*(*x)++);
 
     ret = mbrtowc(&wc, *x, y-*x, &shiftstate);
 
@@ -1922,7 +1938,7 @@ charrefinc(char **x, char *y, int *z)
         *z = 1;
         /* Reset the shift state for next time. */
         memset(&shiftstate, 0, sizeof(shiftstate));
-        return (wchar_t) STOUC(*(*x)++);
+        return WCHAR_INVALID(*(*x)++);
     }
 
     /* Nulls here are normal characters */
diff --git a/Test/D07multibyte.ztst b/Test/D07multibyte.ztst
index 0e3e98d..3fadd80 100644
--- a/Test/D07multibyte.ztst
+++ b/Test/D07multibyte.ztst
@@ -508,3 +508,20 @@
   cd ..
   }
 0:cd with special characters
+
+  test_array=(
+  '[[ \xcc = \xcc ]]'
+  '[[ \xcc != \xcd ]]'
+  '[[ \xcc != \ucc ]]'
+  '[[ \ucc = \ucc ]]'
+  '[[ \ucc = [\ucc] ]]'
+  '[[ \xcc != [\ucc] ]]'
+  # Not clear how useful the following is...
+  '[[ \xcc = [\xcc] ]]'
+  )
+  for test in $test_array; do
+    if ! eval ${(g::)test} ; then
+      print -rl "Test $test failed" >&2
+    fi
+  done
+0:Invalid characters in pattern matching
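
For illustration only, here is a rough standalone sketch of the "safe
option" mentioned above, i.e. using the DC00 offset only when
__STDC_ISO_10646__ is defined and otherwise keeping the old
single-byte value.  This is not part of the patch: the names
invalid_byte_to_wchar(), is_invalid_marker() and decode_one() are
hypothetical, and the code uses plain mbrtowc() return values rather
than zsh's MB_INVALID / MB_INCOMPLETE macros.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: map a byte that is not part of a valid character. */
static wchar_t
invalid_byte_to_wchar(unsigned char b)
{
#ifdef __STDC_ISO_10646__
    /* wchar_t holds Unicode code points: offset into the surrogate range. */
    return (wchar_t) (0xDC00 + b);
#else
    /* Unknown wchar_t encoding: keep the old "treat as single byte" value. */
    return (wchar_t) b;
#endif
}

/* Hypothetical helper: does wc stand for an invalid byte 0x80..0xff? */
static int
is_invalid_marker(wchar_t wc)
{
#ifdef __STDC_ISO_10646__
    return wc >= (wchar_t) 0xDC80 && wc <= (wchar_t) 0xDCFF;
#else
    (void) wc;
    return 0;
#endif
}

/*
 * Decode one character from s (n bytes available, n > 0) into *out.
 * Returns the number of bytes consumed.  On an invalid or incomplete
 * sequence, resets the shift state and consumes a single byte.
 */
static size_t
decode_one(const char *s, size_t n, mbstate_t *st, wchar_t *out)
{
    size_t ret;

    assert(n > 0);
    ret = mbrtowc(out, s, n, st);
    if (ret == (size_t) -1 || ret == (size_t) -2) {
        /* Reset the shift state for next time. */
        memset(st, 0, sizeof(*st));
        *out = invalid_byte_to_wchar((unsigned char) *s);
        return 1;
    }
    /* mbrtowc() returns 0 for a null character; it still occupies a byte. */
    return ret ? ret : 1;
}
```

With this mapping a stray byte 0xcc becomes 0xDCCC where wchar_t is
Unicode, which can never compare equal to the character U+00CC; that is
the distinction the '[[ \xcc != \ucc ]]' test relies on.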