invalid characters and multi-byte [x-y] ranges

zsh-workers
 help / color / mirror / code / Atom feed

* invalid characters and multi-byte [x-y] ranges
@ 2015-09-02 23:07 Stephane Chazelas
  2015-09-03  9:00 ` Peter Stephenson
  0 siblings, 1 reply; 8+ messages in thread
From: Stephane Chazelas @ 2015-09-02 23:07 UTC (permalink / raw)
  To: Zsh hackers list

Hello,

is this (in a UTF-8 locale):

$ zsh -c $'[[ \xcc = [\uaa-\udd] ]]' && echo yes
yes

expected or desirable? I know I can't expect much since \xcc is
not a valid character, but I wouldn't have expected it to match
[\uaa-\udd] there.

Same for:

$ zsh -c $'[[ \xcc = [[:alpha:]] ]]' && echo yes
yes

It seems \xcc is treated as U+00CC here (which in UTF-8 is
\xc3\x8c)

$ zsh -c $'[[ \xcc = \ucc ]]' && echo yes
$ zsh -c $'[[ \xcc = [\ucc] ]]' && echo yes
yes

-- 
Stephane


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: invalid characters and multi-byte [x-y] ranges
  2015-09-02 23:07 invalid characters and multi-byte [x-y] ranges Stephane Chazelas
@ 2015-09-03  9:00 ` Peter Stephenson
  2015-09-03 10:09   ` Stephane Chazelas
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Stephenson @ 2015-09-03  9:00 UTC (permalink / raw)
  To: Zsh hackers list

On Thu, 3 Sep 2015 00:07:11 +0100
Stephane Chazelas <stephane.chazelas@gmail.com> wrote:
> is this (in a UTF-8 locale):
> 
> $ zsh -c $'[[ \xcc = [\uaa-\udd] ]]' && echo yes
> yes
> 
> expected or desirable?

This comes from the function charref() in pattern.c.  We discover the
sequence is incomplete / invalid and don't know what to do with it, so we
simply treat the single byte as a character:

	return (wchar_t) STOUC(*x);

(the macro ensures we get an unsigned value to cast).  Typically this
will do what you see (though wchar_t isn't guaranteed to have that
property).

I'm not sure what else to do here.  The function is used all over the
pattern code so anything other than tweak the code locally to return
another character (what?) is horrific to get consistent.  We don't want
[[ $'\xcc' = $'\xdd' ]] to succeed, but ideally we do want [[ $'\xcc' =
$'\xcc' ]] to succeed comparing raw bytes (we're not morally forced to
do that in a UTF-8 locale, I don't think, but it wouldn't be very
helpful if it didn't work).

If wchar_t is 32 bits (the only place where it wasn't used to be Cygwin
but I think that's changed) we could cheat by adding (wchar_t)0x7FFFFF00
to it... that would fix your problem and (I hope) keep the two above
working, and minimsie the likelihood of generating a valid character...
that's about the least horrific I can come up with.

pws

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: invalid characters and multi-byte [x-y] ranges
  2015-09-03  9:00 ` Peter Stephenson
@ 2015-09-03 10:09   ` Stephane Chazelas
  2015-09-03 14:18     ` Peter Stephenson
  0 siblings, 1 reply; 8+ messages in thread
From: Stephane Chazelas @ 2015-09-03 10:09 UTC (permalink / raw)
  To: zsh-workers

2015-09-03 10:00:37 +0100, Peter Stephenson:
> On Thu, 3 Sep 2015 00:07:11 +0100
> Stephane Chazelas <stephane.chazelas@gmail.com> wrote:
> > is this (in a UTF-8 locale):
> > 
> > $ zsh -c $'[[ \xcc = [\uaa-\udd] ]]' && echo yes
> > yes
> > 
> > expected or desirable?
> 
> This comes from the function charref() in pattern.c.  We discover the
> sequence is incomplete / invalid and don't know what to do with it, so we
> simply treat the single byte as a character:
> 
> 	return (wchar_t) STOUC(*x);
> 
> (the macro ensures we get an unsigned value to cast).  Typically this
> will do what you see (though wchar_t isn't guaranteed to have that
> property).
> 
> I'm not sure what else to do here.  The function is used all over the
> pattern code so anything other than tweak the code locally to return
> another character (what?) is horrific to get consistent.  We don't want
> [[ $'\xcc' = $'\xdd' ]] to succeed, but ideally we do want [[ $'\xcc' =
> $'\xcc' ]] to succeed comparing raw bytes (we're not morally forced to
> do that in a UTF-8 locale, I don't think, but it wouldn't be very
> helpful if it didn't work).
> 
> If wchar_t is 32 bits (the only place where it wasn't used to be Cygwin
> but I think that's changed) we could cheat by adding (wchar_t)0x7FFFFF00
> to it... that would fix your problem and (I hope) keep the two above
> working, and minimsie the likelihood of generating a valid character...
> that's about the least horrific I can come up with.
[...]

There was a related discussion not so long ago on the Austin
group ML (zsh was also mentioned there) on a related subject:

http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11118
(the whole discussion started at
http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098 is interesting)

and see:
https://www.python.org/dev/peps/pep-0383/

A discussed approach there was to internally represent bytes not
forming part of a valid character as code points in the range
D800-DFFF (specifically DC80 DCFF for bytes 0x80 to 0xff) (those
code points are reserved in Unicode for UTF-16 surrogates and are *not*
characters, in particular the byte-sequence that
would be the UTF-8 encoding of a 0xD800 for example (ed a0 80)
would not form a valid character so be internally represented as 
DCED DCA0 DC80.

I beleive that's what python3 does.

-- 
Stephane


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: invalid characters and multi-byte [x-y] ranges
  2015-09-03 10:09   ` Stephane Chazelas
@ 2015-09-03 14:18     ` Peter Stephenson
  2015-09-04 10:53       ` Ismail Donmez
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Stephenson @ 2015-09-03 14:18 UTC (permalink / raw)
  To: zsh-workers

On Thu, 3 Sep 2015 11:09:44 +0100
Stephane Chazelas <stephane.chazelas@gmail.com> wrote:
> A discussed approach there was to internally represent bytes not
> forming part of a valid character as code points in the range
> D800-DFFF (specifically DC80 DCFF for bytes 0x80 to 0xff)

That's easy if wchar_t is actually Unicode.

I'm not sure how to do it otherwise.  We could treat it identically to
the Unicode conversion of 0xdC00 + STOUCH(ch) to wchar_t, e.g. iconv
UCS-4 to WCHAR_T, but is that guranteed to work?  This needs to be a
robust fallback and it's not clear relying on iconv is the right thing
to do.

The safe option would be only to use this if #ifdef __STDC_ISO_10646__.

On the other hand, it's probably not going to be worse than the previous
code...

pws

diff --git a/Src/pattern.c b/Src/pattern.c
index 7d38988..7457cbd 100644
--- a/Src/pattern.c
+++ b/Src/pattern.c
@@ -224,6 +224,22 @@ typedef zlong zrange_t;
 typedef unsigned long zrange_t;
 #endif
 
+#ifdef MULTIBYTE_SUPPORT
+/*
+ * Handle a byte that's not part of a valid character.
+ *
+ * This range in Unicode is recommended for purposes of this
+ * kind as it corresponds to invalid characters.
+ *
+ * Note that this strictly only works if wchar_t represents
+ * Unicode code points, which isn't necessarily true; however,
+ * converting an invalid character into an unknown format is
+ * a bit tricky...
+ */
+#define WCHAR_INVALID(ch)			\
+    ((wchar_t) (0xDC00 + STOUC(ch)))
+#endif /* MULTIBYTE_SUPPORT */
+
 /*
  * Array of characters corresponding to zpc_chars enum, which it must match.
  */
@@ -353,10 +369,10 @@ metacharinc(char **x)
 	return wc;
     }
 
-    /* Error.  Treat as single byte. */
+    /* Error. */
     /* Reset the shift state for next time. */
     memset(&shiftstate, 0, sizeof(shiftstate));
-    return (wchar_t) STOUC(*(*x)++);
+    return WCHAR_INVALID(*(*x)++);
 }
 
 #else
@@ -1867,10 +1883,10 @@ charref(char *x, char *y)
     ret = mbrtowc(&wc, x, y-x, &shiftstate);
 
     if (ret == MB_INVALID || ret == MB_INCOMPLETE) {
-	/* Error.  Treat as single byte. */
+	/* Error. */
 	/* Reset the shift state for next time. */
 	memset(&shiftstate, 0, sizeof(shiftstate));
-	return (wchar_t) STOUC(*x);
+	return WCHAR_INVALID(*x);
     }
 
     return wc;
@@ -1913,7 +1929,7 @@ charrefinc(char **x, char *y, int *z)
     size_t ret;
 
     if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(**x) & 0x80))
-	return (wchar_t) STOUC(*(*x)++);
+	return WCHAR_INVALID(*(*x)++);
 
     ret = mbrtowc(&wc, *x, y-*x, &shiftstate);
 
@@ -1922,7 +1938,7 @@ charrefinc(char **x, char *y, int *z)
 	*z = 1;
 	/* Reset the shift state for next time. */
 	memset(&shiftstate, 0, sizeof(shiftstate));
-	return (wchar_t) STOUC(*(*x)++);
+	return WCHAR_INVALID(*(*x)++);
     }
 
     /* Nulls here are normal characters */
diff --git a/Test/D07multibyte.ztst b/Test/D07multibyte.ztst
index 0e3e98d..3fadd80 100644
--- a/Test/D07multibyte.ztst
+++ b/Test/D07multibyte.ztst
@@ -508,3 +508,20 @@
      cd ..
   }
 0:cd with special characters
+
+  test_array=(
+  '[[ \xcc = \xcc ]]'
+  '[[ \xcc != \xcd ]]'
+  '[[ \xcc != \ucc ]]'
+  '[[ \ucc = \ucc ]]'
+  '[[ \ucc = [\ucc] ]]'
+  '[[ \xcc != [\ucc] ]]'
+  # Not clear how useful the following is...
+  '[[ \xcc = [\xcc] ]]'
+  )
+  for test in $test_array; do
+    if ! eval ${(g::)test} ; then
+      print -rl "Test $test failed" >&2
+    fi
+  done
+0:Invalid characters in pattern matching


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: invalid characters and multi-byte [x-y] ranges
  2015-09-03 14:18     ` Peter Stephenson
@ 2015-09-04 10:53       ` Ismail Donmez
  2015-09-04 11:47         ` Peter Stephenson
  0 siblings, 1 reply; 8+ messages in thread
From: Ismail Donmez @ 2015-09-04 10:53 UTC (permalink / raw)
  To: zsh-workers

Hi,

Peter Stephenson <p.stephenson <at> samsung.com> writes:

> 
> On Thu, 3 Sep 2015 11:09:44 +0100
> Stephane Chazelas <stephane.chazelas <at> gmail.com> wrote:
> > A discussed approach there was to internally represent bytes not
> > forming part of a valid character as code points in the range
> > D800-DFFF (specifically DC80 DCFF for bytes 0x80 to 0xff)
> 
> That's easy if wchar_t is actually Unicode.
> 
> I'm not sure how to do it otherwise.  We could treat it identically to
> the Unicode conversion of 0xdC00 + STOUCH(ch) to wchar_t, e.g. iconv
> UCS-4 to WCHAR_T, but is that guranteed to work?  This needs to be a
> robust fallback and it's not clear relying on iconv is the right thing
> to do.
> 
> The safe option would be only to use this if #ifdef __STDC_ISO_10646__.
> 
> On the other hand, it's probably not going to be worse than the previous
> code...

This seems to break glob tests:

./D02glob.ztst: starting.
*** /tmp/zsh.ztst.out.6468      Fri Sep  4 13:52:19 2015
--- /tmp/zsh.ztst.tout.6468     Fri Sep  4 13:52:19 2015
***************
*** 91,107 ****
  0:  [[ [ = [[] ]]
  0:  [[ ] = []] ]]
  0:  [[ [] = [^]]] ]]
! 0:  [[ fooxx = (#i)FOOXX ]]
  1:  [[ fooxx = (#l)FOOXX ]]
! 0:  [[ FOOXX = (#l)fooxx ]]
  1:  [[ fooxx = (#i)FOO(#I)X(#i)X ]]
! 0:  [[ fooXx = (#i)FOO(#I)X(#i)X ]]
! 0:  [[ fooxx = ((#i)FOOX)x ]]
  1:  [[ fooxx = ((#i)FOOX)X ]]
  1:  [[ BAR = (bar|(#i)foo) ]]
! 0:  [[ FOO = (bar|(#i)foo) ]]
! 0:  [[ Modules = (#i)*m* ]]
! 0:  [[ fooGRUD = (#i)(bar|(#I)foo|(#i)rod)grud ]]
  1:  [[ FOOGRUD = (#i)(bar|(#I)foo|(#i)rod)grud ]]
  0:  [[ readme = (#i)readme~README|readme ]]
  0:  [[ readme = (#i)readme~README|readme~README ]]
--- 91,114 ----
  0:  [[ [ = [[] ]]
  0:  [[ ] = []] ]]
  0:  [[ [] = [^]]] ]]
! 1:  [[ fooxx = (#i)FOOXX ]]
! Test failed:  [[ fooxx = (#i)FOOXX ]]
  1:  [[ fooxx = (#l)FOOXX ]]
! 1:  [[ FOOXX = (#l)fooxx ]]
! Test failed:  [[ FOOXX = (#l)fooxx ]]
  1:  [[ fooxx = (#i)FOO(#I)X(#i)X ]]
! 1:  [[ fooXx = (#i)FOO(#I)X(#i)X ]]
! Test failed:  [[ fooXx = (#i)FOO(#I)X(#i)X ]]
! 1:  [[ fooxx = ((#i)FOOX)x ]]
! Test failed:  [[ fooxx = ((#i)FOOX)x ]]
  1:  [[ fooxx = ((#i)FOOX)X ]]
  1:  [[ BAR = (bar|(#i)foo) ]]
! 1:  [[ FOO = (bar|(#i)foo) ]]
! Test failed:  [[ FOO = (bar|(#i)foo) ]]
! 1:  [[ Modules = (#i)*m* ]]
! Test failed:  [[ Modules = (#i)*m* ]]
! 1:  [[ fooGRUD = (#i)(bar|(#I)foo|(#i)rod)grud ]]
! Test failed:  [[ fooGRUD = (#i)(bar|(#I)foo|(#i)rod)grud ]]
  1:  [[ FOOGRUD = (#i)(bar|(#I)foo|(#i)rod)grud ]]
  0:  [[ readme = (#i)readme~README|readme ]]
  0:  [[ readme = (#i)readme~README|readme~README ]]
***************
*** 110,122 ****
  0:  [[ 633 = <1->33 ]]
  0:  [[ 633 = <->33 ]]
  0:  [[ 
12345678901234567890123456789012345678901234567890123456789012345678901234
567890foo = <42->foo ]]
! 0:  [[ READ.ME = (#ia1)readme ]]
  1:  [[ READ..ME = (#ia1)readme ]]
! 0:  [[ README = (#ia1)readm ]]
! 0:  [[ READM = (#ia1)readme ]]
! 0:  [[ README = (#ia1)eadme ]]
! 0:  [[ EADME = (#ia1)readme ]]
! 0:  [[ READEM = (#ia1)readme ]]
  1:  [[ ADME = (#ia1)readme ]]
  1:  [[ README = (#ia1)read ]]
  0:  [[ bob = (#a1)[b][b] ]]
--- 117,135 ----
  0:  [[ 633 = <1->33 ]]
  0:  [[ 633 = <->33 ]]
  0:  [[ 
12345678901234567890123456789012345678901234567890123456789012345678901234
567890foo = <42->foo ]]
! 1:  [[ READ.ME = (#ia1)readme ]]
! Test failed:  [[ READ.ME = (#ia1)readme ]]
  1:  [[ READ..ME = (#ia1)readme ]]
! 1:  [[ README = (#ia1)readm ]]
! Test failed:  [[ README = (#ia1)readm ]]
! 1:  [[ READM = (#ia1)readme ]]
! Test failed:  [[ READM = (#ia1)readme ]]
! 1:  [[ README = (#ia1)eadme ]]
! Test failed:  [[ README = (#ia1)eadme ]]
! 1:  [[ EADME = (#ia1)readme ]]
! Test failed:  [[ EADME = (#ia1)readme ]]
! 1:  [[ READEM = (#ia1)readme ]]
! Test failed:  [[ READEM = (#ia1)readme ]]
  1:  [[ ADME = (#ia1)readme ]]
  1:  [[ README = (#ia1)read ]]
  0:  [[ bob = (#a1)[b][b] ]]
***************
*** 138,144 ****
  0:  [[ aaaXaaabY = (#a1)(a##b)##Y ]]
  0:  [[ aaaXbaabY = (#a1)(a##b)##Y ]]
  1:  [[ read.me = (#ia1)README~READ.ME ]]
! 0:  [[ read.me = (#ia1)README~READ_ME ]]
  1:  [[ read.me = (#ia1)README~(#a1)READ_ME ]]
  0:  [[ test = *((#s)|/)test((#e)|/)* ]]
  0:  [[ test/path = *((#s)|/)test((#e)|/)* ]]
--- 151,158 ----
  0:  [[ aaaXaaabY = (#a1)(a##b)##Y ]]
  0:  [[ aaaXbaabY = (#a1)(a##b)##Y ]]
  1:  [[ read.me = (#ia1)README~READ.ME ]]
! 1:  [[ read.me = (#ia1)README~READ_ME ]]
! Test failed:  [[ read.me = (#ia1)README~READ_ME ]]
  1:  [[ read.me = (#ia1)README~(#a1)READ_ME ]]
  0:  [[ test = *((#s)|/)test((#e)|/)* ]]
  0:  [[ test/path = *((#s)|/)test((#e)|/)* ]]
***************
*** 177,180 ****
  0:  [[ test.bash = *.?(#c1,2)sh ]]
  0:  [[ test.bash = *.?(#c1,)sh ]]
  0:  [[ test.zsh = *.?(#c1,)sh ]]
! 0 tests failed.
--- 191,194 ----
  0:  [[ test.bash = *.?(#c1,2)sh ]]
  0:  [[ test.bash = *.?(#c1,)sh ]]
  0:  [[ test.zsh = *.?(#c1,)sh ]]
! 14 tests failed.
Test ./D02glob.ztst failed: output differs from expected as shown above 
for:
  globtest globtests
Was testing: zsh globbing



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: invalid characters and multi-byte [x-y] ranges
  2015-09-04 10:53       ` Ismail Donmez
@ 2015-09-04 11:47         ` Peter Stephenson
  2015-09-04 12:35           ` Peter Stephenson
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Stephenson @ 2015-09-04 11:47 UTC (permalink / raw)
  To: zsh-workers

On Fri, 4 Sep 2015 10:53:14 +0000
Ismail Donmez <ismail@i10z.com> wrote:
on <p.stephenson <at> samsung.com> writes:
> This seems to break glob tests:

I missed that --- it looks like there's something funny with the (#i)
flag.  That may mean there's a pre-existing funny since it shouldn't be
encountering anything looking like invalid characters.

Looks like the alias test needs updating for the new report format
(presumably trivial).

pws

./A02alias.ztst: starting.
This test hangs the shell when it fails...
*** /tmp/zsh.ztst.out.4161      Fri Sep  4 12:43:33 2015                        
--- /tmp/zsh.ztst.tout.4161	Fri Sep  4 12:43:33 2015
***************
*** 1,2 ****
  foo is a suffix alias for print
! foo: suffix alias
--- 1,2 ----
  foo is a suffix alias for print
! foo: alias
Test ./A02alias.ztst failed: output differs from expected as shown above for:
  alias -s foo=print
  type bar.foo; type -w bar.foo
  unalias -as
Was testing: unalias -as


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: invalid characters and multi-byte [x-y] ranges
  2015-09-04 11:47         ` Peter Stephenson
@ 2015-09-04 12:35           ` Peter Stephenson
  2015-09-04 15:02             ` Ismail Donmez
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Stephenson @ 2015-09-04 12:35 UTC (permalink / raw)
  To: zsh-workers

On Fri, 4 Sep 2015 12:47:19 +0100
Peter Stephenson <p.stephenson@samsung.com> wrote:

> On Fri, 4 Sep 2015 10:53:14 +0000
> Ismail Donmez <ismail@i10z.com> wrote:
> on <p.stephenson <at> samsung.com> writes:
> > This seems to break glob tests:
> 
> I missed that --- it looks like there's something funny with the (#i)
> flag.  That may mean there's a pre-existing funny since it shouldn't be
> encountering anything looking like invalid characters.

It was just stupidity.  I changed a line that referred to an ASCII
character rather than an invalid multibyte character.

> Looks like the alias test needs updating for the new report format
> (presumably trivial).

It waqs just stupidity.  I hadn't recompiled after pulling that change.

Can't get the staff.
pws

diff --git a/Src/pattern.c b/Src/pattern.c
index 7457cbd..b4ba33e 100644
--- a/Src/pattern.c
+++ b/Src/pattern.c
@@ -1929,7 +1929,7 @@ charrefinc(char **x, char *y, int *z)
     size_t ret;
 
     if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(**x) & 0x80))
-	return WCHAR_INVALID(*(*x)++);
+	return (wchar_t) STOUC(*(*x)++);
 
     ret = mbrtowc(&wc, *x, y-*x, &shiftstate);
 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: invalid characters and multi-byte [x-y] ranges
  2015-09-04 12:35           ` Peter Stephenson
@ 2015-09-04 15:02             ` Ismail Donmez
  0 siblings, 0 replies; 8+ messages in thread
From: Ismail Donmez @ 2015-09-04 15:02 UTC (permalink / raw)
  To: zsh-workers

Hi,

Peter Stephenson <p.stephenson <at> samsung.com> writes:

> --- a/Src/pattern.c
> +++ b/Src/pattern.c
>  <at>  <at>  -1929,7 +1929,7  <at>  <at>  charrefinc(char **x, char *y, 
int *z)
>      size_t ret;
> 
>      if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(**x) & 0x80))
> -	return WCHAR_INVALID(*(*x)++);
> +	return (wchar_t) STOUC(*(*x)++);
> 
>      ret = mbrtowc(&wc, *x, y-*x, &shiftstate);


It works now, thanks!




^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-09-04 15:02 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-02 23:07 invalid characters and multi-byte [x-y] ranges Stephane Chazelas
2015-09-03  9:00 ` Peter Stephenson
2015-09-03 10:09   ` Stephane Chazelas
2015-09-03 14:18     ` Peter Stephenson
2015-09-04 10:53       ` Ismail Donmez
2015-09-04 11:47         ` Peter Stephenson
2015-09-04 12:35           ` Peter Stephenson
2015-09-04 15:02             ` Ismail Donmez

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).