* invalid characters and multi-byte [x-y] ranges @ 2015-09-02 23:07 Stephane Chazelas 2015-09-03 9:00 ` Peter Stephenson 0 siblings, 1 reply; 8+ messages in thread From: Stephane Chazelas @ 2015-09-02 23:07 UTC (permalink / raw) To: Zsh hackers list Hello, is this (in a UTF-8 locale): $ zsh -c $'[[ \xcc = [\uaa-\udd] ]]' && echo yes yes expected or desirable? I know I can't expect much since \xcc is not a valid character, but I wouldn't have expected it to match [\uaa-\udd] there. Same for: $ zsh -c $'[[ \xcc = [[:alpha:]] ]]' && echo yes yes It seems \xcc is treated as U+00CC here (which in UTF-8 is \xc3\x8c) $ zsh -c $'[[ \xcc = \ucc ]]' && echo yes $ zsh -c $'[[ \xcc = [\ucc] ]]' && echo yes yes -- Stephane ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: invalid characters and multi-byte [x-y] ranges 2015-09-02 23:07 invalid characters and multi-byte [x-y] ranges Stephane Chazelas @ 2015-09-03 9:00 ` Peter Stephenson 2015-09-03 10:09 ` Stephane Chazelas 0 siblings, 1 reply; 8+ messages in thread From: Peter Stephenson @ 2015-09-03 9:00 UTC (permalink / raw) To: Zsh hackers list On Thu, 3 Sep 2015 00:07:11 +0100 Stephane Chazelas <stephane.chazelas@gmail.com> wrote: > is this (in a UTF-8 locale): > > $ zsh -c $'[[ \xcc = [\uaa-\udd] ]]' && echo yes > yes > > expected or desirable? This comes from the function charref() in pattern.c. We discover the sequence is incomplete / invalid and don't know what to do with it, so we simply treat the single byte as a character: return (wchar_t) STOUC(*x); (the macro ensures we get an unsigned value to cast). Typically this will do what you see (though wchar_t isn't guaranteed to have that property). I'm not sure what else to do here. The function is used all over the pattern code so anything other than tweak the code locally to return another character (what?) is horrific to get consistent. We don't want [[ $'\xcc' = $'\xdd' ]] to succeed, but ideally we do want [[ $'\xcc' = $'\xcc' ]] to succeed comparing raw bytes (we're not morally forced to do that in a UTF-8 locale, I don't think, but it wouldn't be very helpful if it didn't work). If wchar_t is 32 bits (the only place where it wasn't used to be Cygwin but I think that's changed) we could cheat by adding (wchar_t)0x7FFFFF00 to it... that would fix your problem and (I hope) keep the two above working, and minimsie the likelihood of generating a valid character... that's about the least horrific I can come up with. pws ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: invalid characters and multi-byte [x-y] ranges 2015-09-03 9:00 ` Peter Stephenson @ 2015-09-03 10:09 ` Stephane Chazelas 2015-09-03 14:18 ` Peter Stephenson 0 siblings, 1 reply; 8+ messages in thread From: Stephane Chazelas @ 2015-09-03 10:09 UTC (permalink / raw) To: zsh-workers 2015-09-03 10:00:37 +0100, Peter Stephenson: > On Thu, 3 Sep 2015 00:07:11 +0100 > Stephane Chazelas <stephane.chazelas@gmail.com> wrote: > > is this (in a UTF-8 locale): > > > > $ zsh -c $'[[ \xcc = [\uaa-\udd] ]]' && echo yes > > yes > > > > expected or desirable? > > This comes from the function charref() in pattern.c. We discover the > sequence is incomplete / invalid and don't know what to do with it, so we > simply treat the single byte as a character: > > return (wchar_t) STOUC(*x); > > (the macro ensures we get an unsigned value to cast). Typically this > will do what you see (though wchar_t isn't guaranteed to have that > property). > > I'm not sure what else to do here. The function is used all over the > pattern code so anything other than tweak the code locally to return > another character (what?) is horrific to get consistent. We don't want > [[ $'\xcc' = $'\xdd' ]] to succeed, but ideally we do want [[ $'\xcc' = > $'\xcc' ]] to succeed comparing raw bytes (we're not morally forced to > do that in a UTF-8 locale, I don't think, but it wouldn't be very > helpful if it didn't work). > > If wchar_t is 32 bits (the only place where it wasn't used to be Cygwin > but I think that's changed) we could cheat by adding (wchar_t)0x7FFFFF00 > to it... that would fix your problem and (I hope) keep the two above > working, and minimsie the likelihood of generating a valid character... > that's about the least horrific I can come up with. [...] There was a related discussion not so long ago on the Austin group ML (zsh was also mentioned there) on a related subject: http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11118 (the whole discussion started at http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098 is interesting) and see: https://www.python.org/dev/peps/pep-0383/ A discussed approach there was to internally represent bytes not forming part of a valid character as code points in the range D800-DFFF (specifically DC80 DCFF for bytes 0x80 to 0xff) (those code points are reserved in Unicode for UTF-16 surrogates and are *not* characters, in particular the byte-sequence that would be the UTF-8 encoding of a 0xD800 for example (ed a0 80) would not form a valid character so be internally represented as DCED DCA0 DC80. I beleive that's what python3 does. -- Stephane ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: invalid characters and multi-byte [x-y] ranges 2015-09-03 10:09 ` Stephane Chazelas @ 2015-09-03 14:18 ` Peter Stephenson 2015-09-04 10:53 ` Ismail Donmez 0 siblings, 1 reply; 8+ messages in thread From: Peter Stephenson @ 2015-09-03 14:18 UTC (permalink / raw) To: zsh-workers On Thu, 3 Sep 2015 11:09:44 +0100 Stephane Chazelas <stephane.chazelas@gmail.com> wrote: > A discussed approach there was to internally represent bytes not > forming part of a valid character as code points in the range > D800-DFFF (specifically DC80 DCFF for bytes 0x80 to 0xff) That's easy if wchar_t is actually Unicode. I'm not sure how to do it otherwise. We could treat it identically to the Unicode conversion of 0xdC00 + STOUCH(ch) to wchar_t, e.g. iconv UCS-4 to WCHAR_T, but is that guranteed to work? This needs to be a robust fallback and it's not clear relying on iconv is the right thing to do. The safe option would be only to use this if #ifdef __STDC_ISO_10646__. On the other hand, it's probably not going to be worse than the previous code... pws diff --git a/Src/pattern.c b/Src/pattern.c index 7d38988..7457cbd 100644 --- a/Src/pattern.c +++ b/Src/pattern.c @@ -224,6 +224,22 @@ typedef zlong zrange_t; typedef unsigned long zrange_t; #endif +#ifdef MULTIBYTE_SUPPORT +/* + * Handle a byte that's not part of a valid character. + * + * This range in Unicode is recommended for purposes of this + * kind as it corresponds to invalid characters. + * + * Note that this strictly only works if wchar_t represents + * Unicode code points, which isn't necessarily true; however, + * converting an invalid character into an unknown format is + * a bit tricky... + */ +#define WCHAR_INVALID(ch) \ + ((wchar_t) (0xDC00 + STOUC(ch))) +#endif /* MULTIBYTE_SUPPORT */ + /* * Array of characters corresponding to zpc_chars enum, which it must match. */ @@ -353,10 +369,10 @@ metacharinc(char **x) return wc; } - /* Error. Treat as single byte. */ + /* Error. */ /* Reset the shift state for next time. */ memset(&shiftstate, 0, sizeof(shiftstate)); - return (wchar_t) STOUC(*(*x)++); + return WCHAR_INVALID(*(*x)++); } #else @@ -1867,10 +1883,10 @@ charref(char *x, char *y) ret = mbrtowc(&wc, x, y-x, &shiftstate); if (ret == MB_INVALID || ret == MB_INCOMPLETE) { - /* Error. Treat as single byte. */ + /* Error. */ /* Reset the shift state for next time. */ memset(&shiftstate, 0, sizeof(shiftstate)); - return (wchar_t) STOUC(*x); + return WCHAR_INVALID(*x); } return wc; @@ -1913,7 +1929,7 @@ charrefinc(char **x, char *y, int *z) size_t ret; if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(**x) & 0x80)) - return (wchar_t) STOUC(*(*x)++); + return WCHAR_INVALID(*(*x)++); ret = mbrtowc(&wc, *x, y-*x, &shiftstate); @@ -1922,7 +1938,7 @@ charrefinc(char **x, char *y, int *z) *z = 1; /* Reset the shift state for next time. */ memset(&shiftstate, 0, sizeof(shiftstate)); - return (wchar_t) STOUC(*(*x)++); + return WCHAR_INVALID(*(*x)++); } /* Nulls here are normal characters */ diff --git a/Test/D07multibyte.ztst b/Test/D07multibyte.ztst index 0e3e98d..3fadd80 100644 --- a/Test/D07multibyte.ztst +++ b/Test/D07multibyte.ztst @@ -508,3 +508,20 @@ cd .. } 0:cd with special characters + + test_array=( + '[[ \xcc = \xcc ]]' + '[[ \xcc != \xcd ]]' + '[[ \xcc != \ucc ]]' + '[[ \ucc = \ucc ]]' + '[[ \ucc = [\ucc] ]]' + '[[ \xcc != [\ucc] ]]' + # Not clear how useful the following is... + '[[ \xcc = [\xcc] ]]' + ) + for test in $test_array; do + if ! eval ${(g::)test} ; then + print -rl "Test $test failed" >&2 + fi + done +0:Invalid characters in pattern matching ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: invalid characters and multi-byte [x-y] ranges 2015-09-03 14:18 ` Peter Stephenson @ 2015-09-04 10:53 ` Ismail Donmez 2015-09-04 11:47 ` Peter Stephenson 0 siblings, 1 reply; 8+ messages in thread From: Ismail Donmez @ 2015-09-04 10:53 UTC (permalink / raw) To: zsh-workers Hi, Peter Stephenson <p.stephenson <at> samsung.com> writes: > > On Thu, 3 Sep 2015 11:09:44 +0100 > Stephane Chazelas <stephane.chazelas <at> gmail.com> wrote: > > A discussed approach there was to internally represent bytes not > > forming part of a valid character as code points in the range > > D800-DFFF (specifically DC80 DCFF for bytes 0x80 to 0xff) > > That's easy if wchar_t is actually Unicode. > > I'm not sure how to do it otherwise. We could treat it identically to > the Unicode conversion of 0xdC00 + STOUCH(ch) to wchar_t, e.g. iconv > UCS-4 to WCHAR_T, but is that guranteed to work? This needs to be a > robust fallback and it's not clear relying on iconv is the right thing > to do. > > The safe option would be only to use this if #ifdef __STDC_ISO_10646__. > > On the other hand, it's probably not going to be worse than the previous > code... This seems to break glob tests: ./D02glob.ztst: starting. *** /tmp/zsh.ztst.out.6468 Fri Sep 4 13:52:19 2015 --- /tmp/zsh.ztst.tout.6468 Fri Sep 4 13:52:19 2015 *************** *** 91,107 **** 0: [[ [ = [[] ]] 0: [[ ] = []] ]] 0: [[ [] = [^]]] ]] ! 0: [[ fooxx = (#i)FOOXX ]] 1: [[ fooxx = (#l)FOOXX ]] ! 0: [[ FOOXX = (#l)fooxx ]] 1: [[ fooxx = (#i)FOO(#I)X(#i)X ]] ! 0: [[ fooXx = (#i)FOO(#I)X(#i)X ]] ! 0: [[ fooxx = ((#i)FOOX)x ]] 1: [[ fooxx = ((#i)FOOX)X ]] 1: [[ BAR = (bar|(#i)foo) ]] ! 0: [[ FOO = (bar|(#i)foo) ]] ! 0: [[ Modules = (#i)*m* ]] ! 0: [[ fooGRUD = (#i)(bar|(#I)foo|(#i)rod)grud ]] 1: [[ FOOGRUD = (#i)(bar|(#I)foo|(#i)rod)grud ]] 0: [[ readme = (#i)readme~README|readme ]] 0: [[ readme = (#i)readme~README|readme~README ]] --- 91,114 ---- 0: [[ [ = [[] ]] 0: [[ ] = []] ]] 0: [[ [] = [^]]] ]] ! 1: [[ fooxx = (#i)FOOXX ]] ! Test failed: [[ fooxx = (#i)FOOXX ]] 1: [[ fooxx = (#l)FOOXX ]] ! 1: [[ FOOXX = (#l)fooxx ]] ! Test failed: [[ FOOXX = (#l)fooxx ]] 1: [[ fooxx = (#i)FOO(#I)X(#i)X ]] ! 1: [[ fooXx = (#i)FOO(#I)X(#i)X ]] ! Test failed: [[ fooXx = (#i)FOO(#I)X(#i)X ]] ! 1: [[ fooxx = ((#i)FOOX)x ]] ! Test failed: [[ fooxx = ((#i)FOOX)x ]] 1: [[ fooxx = ((#i)FOOX)X ]] 1: [[ BAR = (bar|(#i)foo) ]] ! 1: [[ FOO = (bar|(#i)foo) ]] ! Test failed: [[ FOO = (bar|(#i)foo) ]] ! 1: [[ Modules = (#i)*m* ]] ! Test failed: [[ Modules = (#i)*m* ]] ! 1: [[ fooGRUD = (#i)(bar|(#I)foo|(#i)rod)grud ]] ! Test failed: [[ fooGRUD = (#i)(bar|(#I)foo|(#i)rod)grud ]] 1: [[ FOOGRUD = (#i)(bar|(#I)foo|(#i)rod)grud ]] 0: [[ readme = (#i)readme~README|readme ]] 0: [[ readme = (#i)readme~README|readme~README ]] *************** *** 110,122 **** 0: [[ 633 = <1->33 ]] 0: [[ 633 = <->33 ]] 0: [[ 12345678901234567890123456789012345678901234567890123456789012345678901234 567890foo = <42->foo ]] ! 0: [[ READ.ME = (#ia1)readme ]] 1: [[ READ..ME = (#ia1)readme ]] ! 0: [[ README = (#ia1)readm ]] ! 0: [[ READM = (#ia1)readme ]] ! 0: [[ README = (#ia1)eadme ]] ! 0: [[ EADME = (#ia1)readme ]] ! 0: [[ READEM = (#ia1)readme ]] 1: [[ ADME = (#ia1)readme ]] 1: [[ README = (#ia1)read ]] 0: [[ bob = (#a1)[b][b] ]] --- 117,135 ---- 0: [[ 633 = <1->33 ]] 0: [[ 633 = <->33 ]] 0: [[ 12345678901234567890123456789012345678901234567890123456789012345678901234 567890foo = <42->foo ]] ! 1: [[ READ.ME = (#ia1)readme ]] ! Test failed: [[ READ.ME = (#ia1)readme ]] 1: [[ READ..ME = (#ia1)readme ]] ! 1: [[ README = (#ia1)readm ]] ! Test failed: [[ README = (#ia1)readm ]] ! 1: [[ READM = (#ia1)readme ]] ! Test failed: [[ READM = (#ia1)readme ]] ! 1: [[ README = (#ia1)eadme ]] ! Test failed: [[ README = (#ia1)eadme ]] ! 1: [[ EADME = (#ia1)readme ]] ! Test failed: [[ EADME = (#ia1)readme ]] ! 1: [[ READEM = (#ia1)readme ]] ! Test failed: [[ READEM = (#ia1)readme ]] 1: [[ ADME = (#ia1)readme ]] 1: [[ README = (#ia1)read ]] 0: [[ bob = (#a1)[b][b] ]] *************** *** 138,144 **** 0: [[ aaaXaaabY = (#a1)(a##b)##Y ]] 0: [[ aaaXbaabY = (#a1)(a##b)##Y ]] 1: [[ read.me = (#ia1)README~READ.ME ]] ! 0: [[ read.me = (#ia1)README~READ_ME ]] 1: [[ read.me = (#ia1)README~(#a1)READ_ME ]] 0: [[ test = *((#s)|/)test((#e)|/)* ]] 0: [[ test/path = *((#s)|/)test((#e)|/)* ]] --- 151,158 ---- 0: [[ aaaXaaabY = (#a1)(a##b)##Y ]] 0: [[ aaaXbaabY = (#a1)(a##b)##Y ]] 1: [[ read.me = (#ia1)README~READ.ME ]] ! 1: [[ read.me = (#ia1)README~READ_ME ]] ! Test failed: [[ read.me = (#ia1)README~READ_ME ]] 1: [[ read.me = (#ia1)README~(#a1)READ_ME ]] 0: [[ test = *((#s)|/)test((#e)|/)* ]] 0: [[ test/path = *((#s)|/)test((#e)|/)* ]] *************** *** 177,180 **** 0: [[ test.bash = *.?(#c1,2)sh ]] 0: [[ test.bash = *.?(#c1,)sh ]] 0: [[ test.zsh = *.?(#c1,)sh ]] ! 0 tests failed. --- 191,194 ---- 0: [[ test.bash = *.?(#c1,2)sh ]] 0: [[ test.bash = *.?(#c1,)sh ]] 0: [[ test.zsh = *.?(#c1,)sh ]] ! 14 tests failed. Test ./D02glob.ztst failed: output differs from expected as shown above for: globtest globtests Was testing: zsh globbing ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: invalid characters and multi-byte [x-y] ranges 2015-09-04 10:53 ` Ismail Donmez @ 2015-09-04 11:47 ` Peter Stephenson 2015-09-04 12:35 ` Peter Stephenson 0 siblings, 1 reply; 8+ messages in thread From: Peter Stephenson @ 2015-09-04 11:47 UTC (permalink / raw) To: zsh-workers On Fri, 4 Sep 2015 10:53:14 +0000 Ismail Donmez <ismail@i10z.com> wrote: on <p.stephenson <at> samsung.com> writes: > This seems to break glob tests: I missed that --- it looks like there's something funny with the (#i) flag. That may mean there's a pre-existing funny since it shouldn't be encountering anything looking like invalid characters. Looks like the alias test needs updating for the new report format (presumably trivial). pws ./A02alias.ztst: starting. This test hangs the shell when it fails... *** /tmp/zsh.ztst.out.4161 Fri Sep 4 12:43:33 2015 --- /tmp/zsh.ztst.tout.4161 Fri Sep 4 12:43:33 2015 *************** *** 1,2 **** foo is a suffix alias for print ! foo: suffix alias --- 1,2 ---- foo is a suffix alias for print ! foo: alias Test ./A02alias.ztst failed: output differs from expected as shown above for: alias -s foo=print type bar.foo; type -w bar.foo unalias -as Was testing: unalias -as ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: invalid characters and multi-byte [x-y] ranges 2015-09-04 11:47 ` Peter Stephenson @ 2015-09-04 12:35 ` Peter Stephenson 2015-09-04 15:02 ` Ismail Donmez 0 siblings, 1 reply; 8+ messages in thread From: Peter Stephenson @ 2015-09-04 12:35 UTC (permalink / raw) To: zsh-workers On Fri, 4 Sep 2015 12:47:19 +0100 Peter Stephenson <p.stephenson@samsung.com> wrote: > On Fri, 4 Sep 2015 10:53:14 +0000 > Ismail Donmez <ismail@i10z.com> wrote: > on <p.stephenson <at> samsung.com> writes: > > This seems to break glob tests: > > I missed that --- it looks like there's something funny with the (#i) > flag. That may mean there's a pre-existing funny since it shouldn't be > encountering anything looking like invalid characters. It was just stupidity. I changed a line that referred to an ASCII character rather than an invalid multibyte character. > Looks like the alias test needs updating for the new report format > (presumably trivial). It waqs just stupidity. I hadn't recompiled after pulling that change. Can't get the staff. pws diff --git a/Src/pattern.c b/Src/pattern.c index 7457cbd..b4ba33e 100644 --- a/Src/pattern.c +++ b/Src/pattern.c @@ -1929,7 +1929,7 @@ charrefinc(char **x, char *y, int *z) size_t ret; if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(**x) & 0x80)) - return WCHAR_INVALID(*(*x)++); + return (wchar_t) STOUC(*(*x)++); ret = mbrtowc(&wc, *x, y-*x, &shiftstate); ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: invalid characters and multi-byte [x-y] ranges 2015-09-04 12:35 ` Peter Stephenson @ 2015-09-04 15:02 ` Ismail Donmez 0 siblings, 0 replies; 8+ messages in thread From: Ismail Donmez @ 2015-09-04 15:02 UTC (permalink / raw) To: zsh-workers Hi, Peter Stephenson <p.stephenson <at> samsung.com> writes: > --- a/Src/pattern.c > +++ b/Src/pattern.c > <at> <at> -1929,7 +1929,7 <at> <at> charrefinc(char **x, char *y, int *z) > size_t ret; > > if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(**x) & 0x80)) > - return WCHAR_INVALID(*(*x)++); > + return (wchar_t) STOUC(*(*x)++); > > ret = mbrtowc(&wc, *x, y-*x, &shiftstate); It works now, thanks! ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2015-09-04 15:02 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-09-02 23:07 invalid characters and multi-byte [x-y] ranges Stephane Chazelas 2015-09-03 9:00 ` Peter Stephenson 2015-09-03 10:09 ` Stephane Chazelas 2015-09-03 14:18 ` Peter Stephenson 2015-09-04 10:53 ` Ismail Donmez 2015-09-04 11:47 ` Peter Stephenson 2015-09-04 12:35 ` Peter Stephenson 2015-09-04 15:02 ` Ismail Donmez
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/zsh/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).