zsh-workers
From: Peter Stephenson <p.stephenson@samsung.com>
To: zsh-workers@zsh.org
Subject: Re: invalid characters and multi-byte [x-y] ranges
Date: Thu, 03 Sep 2015 15:18:11 +0100
Message-ID: <20150903151811.557a40ec@pwslap01u.europe.root.pri>
In-Reply-To: <20150903100943.GB7821@chaz.gmail.com>

On Thu, 3 Sep 2015 11:09:44 +0100
Stephane Chazelas <stephane.chazelas@gmail.com> wrote:
> A discussed approach there was to internally represent bytes not
> forming part of a valid character as code points in the range
> D800-DFFF (specifically DC80-DCFF for bytes 0x80 to 0xff)

That's easy if wchar_t is actually Unicode.

I'm not sure how to do it otherwise.  We could treat it identically to
the Unicode conversion of 0xDC00 + STOUC(ch) to wchar_t, e.g. iconv
UCS-4 to WCHAR_T, but is that guaranteed to work?  This needs to be a
robust fallback and it's not clear relying on iconv is the right thing
to do.

The safe option would be to use this only under #ifdef __STDC_ISO_10646__.

On the other hand, it's probably not going to be worse than the previous
code...
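
To make the "safe option" concrete, here is a minimal standalone sketch of
the guarded fallback.  It is not the patch below: invalid_byte_to_wchar()
and decode_one() are hypothetical names used only for illustration, and the
sketch assumes the caller always passes at least one byte of input.

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not in zsh): map a byte that is not part of a
 * valid character to a wide character.
 */
static wchar_t invalid_byte_to_wchar(unsigned char ch)
{
#ifdef __STDC_ISO_10646__
    /* wchar_t holds Unicode code points: use the range DC00-DCFF. */
    return (wchar_t) (0xDC00 + ch);
#else
    /* Encoding of wchar_t unknown: fall back to the bare byte value. */
    return (wchar_t) ch;
#endif
}

/*
 * Hypothetical helper: decode one character from [*s, end) and advance
 * *s.  Assumes end > *s.  Always consumes at least one byte.
 */
static wchar_t decode_one(const char **s, const char *end, mbstate_t *st)
{
    wchar_t wc;
    size_t ret = mbrtowc(&wc, *s, (size_t) (end - *s), st);

    if (ret == (size_t) -1 || ret == (size_t) -2) {
        /* Invalid or incomplete: reset the shift state, map the byte. */
        memset(st, 0, sizeof(*st));
        return invalid_byte_to_wchar((unsigned char) *(*s)++);
    }
    /* ret == 0 means a null character was decoded; step over one byte. */
    *s += ret ? ret : 1;
    return wc;
}
```

With __STDC_ISO_10646__ defined, invalid_byte_to_wchar(0xCC) gives 0xDCCC,
which cannot collide with the code point U+00CC; without it, the sketch
keeps the old treat-as-single-byte behaviour.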

pws

diff --git a/Src/pattern.c b/Src/pattern.c
index 7d38988..7457cbd 100644
--- a/Src/pattern.c
+++ b/Src/pattern.c
@@ -224,6 +224,22 @@ typedef zlong zrange_t;
 typedef unsigned long zrange_t;
 #endif
 
+#ifdef MULTIBYTE_SUPPORT
+/*
+ * Handle a byte that's not part of a valid character.
+ *
+ * This range in Unicode is recommended for purposes of this
+ * kind as it corresponds to invalid characters.
+ *
+ * Note that this strictly only works if wchar_t represents
+ * Unicode code points, which isn't necessarily true; however,
+ * converting an invalid character into an unknown format is
+ * a bit tricky...
+ */
+#define WCHAR_INVALID(ch)			\
+    ((wchar_t) (0xDC00 + STOUC(ch)))
+#endif /* MULTIBYTE_SUPPORT */
+
 /*
  * Array of characters corresponding to zpc_chars enum, which it must match.
  */
@@ -353,10 +369,10 @@ metacharinc(char **x)
 	return wc;
     }
 
-    /* Error.  Treat as single byte. */
+    /* Error. */
     /* Reset the shift state for next time. */
     memset(&shiftstate, 0, sizeof(shiftstate));
-    return (wchar_t) STOUC(*(*x)++);
+    return WCHAR_INVALID(*(*x)++);
 }
 
 #else
@@ -1867,10 +1883,10 @@ charref(char *x, char *y)
     ret = mbrtowc(&wc, x, y-x, &shiftstate);
 
     if (ret == MB_INVALID || ret == MB_INCOMPLETE) {
-	/* Error.  Treat as single byte. */
+	/* Error. */
 	/* Reset the shift state for next time. */
 	memset(&shiftstate, 0, sizeof(shiftstate));
-	return (wchar_t) STOUC(*x);
+	return WCHAR_INVALID(*x);
     }
 
     return wc;
@@ -1922,7 +1938,7 @@ charrefinc(char **x, char *y, int *z)
 	*z = 1;
 	/* Reset the shift state for next time. */
 	memset(&shiftstate, 0, sizeof(shiftstate));
-	return (wchar_t) STOUC(*(*x)++);
+	return WCHAR_INVALID(*(*x)++);
     }
 
     /* Nulls here are normal characters */
diff --git a/Test/D07multibyte.ztst b/Test/D07multibyte.ztst
index 0e3e98d..3fadd80 100644
--- a/Test/D07multibyte.ztst
+++ b/Test/D07multibyte.ztst
@@ -508,3 +508,20 @@
      cd ..
   }
 0:cd with special characters
+
+  test_array=(
+  '[[ \xcc = \xcc ]]'
+  '[[ \xcc != \xcd ]]'
+  '[[ \xcc != \ucc ]]'
+  '[[ \ucc = \ucc ]]'
+  '[[ \ucc = [\ucc] ]]'
+  '[[ \xcc != [\ucc] ]]'
+  # Not clear how useful the following is...
+  '[[ \xcc = [\xcc] ]]'
+  )
+  for test in $test_array; do
+    if ! eval ${(g::)test} ; then
+      print -rl "Test $test failed" >&2
+    fi
+  done
+0:Invalid characters in pattern matching


Thread overview: 8+ messages
2015-09-02 23:07 Stephane Chazelas
2015-09-03  9:00 ` Peter Stephenson
2015-09-03 10:09   ` Stephane Chazelas
2015-09-03 14:18     ` Peter Stephenson [this message]
2015-09-04 10:53       ` Ismail Donmez
2015-09-04 11:47         ` Peter Stephenson
2015-09-04 12:35           ` Peter Stephenson
2015-09-04 15:02             ` Ismail Donmez