PATCH: multibyte characters in patterns.

zsh-workers

* PATCH: multibyte characters in patterns.
@ 2006-04-09 21:38 Peter Stephenson
  2006-05-31  1:16 ` Wayne Davison
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Stephenson @ 2006-04-09 21:38 UTC (permalink / raw)
  To: Zsh hackers list

This adds handling for multibyte characters in patterns when the shell
is compiled with MULTIBYTE_SUPPORT.  This is activated in two ways:

- Set the new MULTIBYTE option.  This will eventually cover parameter
expansion and anything else in the main shell that needs it.  It won't
cover ZLE; that will always use the locale directly.  The reason for the
difference is that scripts and functions may trip up on binary input or,
for example, ISO-8859-1-encoded files that used to be handled properly
before the locale was taken into account.  Whether it should be turned
on by default is still to be determined.
- Use the (#u) globbing flag.  Unfortunately (#m) was already taken;
the u is supposed to suggest "Unicode" or "UTF-8", even though we'll
handle other character sets.  (#U) is the opposite, as expected.

All tests pass, and on that basis I'll commit it, although the current
tests don't test the handling of multibyte characters at all.  We
probably need to add a test in some UTF-8 based locale if we can find
one.  (It's quite depressing how pleased I was just to find the shell
recognised o with an Umlaut as a single character.)  It's very likely
the result is still typographically challenged.

I've done virtually no optimisation of the code, and this could make a
big difference.  Where it used to skip over a character simply with an
inline test for Meta and a couple of increments, it now always enters a
function, and with multibyte mode in effect always loops over the system
test for a character.  The latter is inevitable but that doesn't mean
the code is as good as it could be.  It would be possible to convert to
wide characters, although it's complicated by the fact that we need to
support arbitrary bytes, too; it would have to be done with something
like a discriminated union of a char or wchar string.  Then we would
have to convert each test string as well.  I don't know how important
this is likely to be.
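
The "discriminated union of a char or wchar string" mentioned above could look roughly like the following. This is only a sketch of the idea, not code from the patch: the names patstr, patstr_kind and patstr_at are hypothetical, and a raw byte is simply widened as an unsigned value.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical tag saying which member of the union is valid. */
enum patstr_kind { PATSTR_BYTES, PATSTR_WIDE };

/* Hypothetical string that holds either arbitrary bytes or wide characters. */
struct patstr {
    enum patstr_kind kind;
    size_t len;              /* length in elements, not in bytes */
    union {
        const char *bytes;
        const wchar_t *wide;
    } u;
};

/* Fetch element i as a wchar_t, widening a raw byte as unsigned. */
static wchar_t
patstr_at(const struct patstr *s, size_t i)
{
    assert(i < s->len);
    if (s->kind == PATSTR_WIDE)
        return s->u.wide[i];
    return (wchar_t)(unsigned char)s->u.bytes[i];
}
```

With a representation like this the matcher would index elements directly instead of calling the library for every character, at the cost of converting each test string before matching.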

However, backtracking during the execution phase of the pattern is
already fairly well optimised: complicated cases were always done with
subroutines, so nothing needed to change, and I've made simple cases use
an extra array to mark the beginning of characters so we don't need to
look for wide characters when we backtrack.
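
The extra array used for backtracking can be illustrated in isolation. The sketch below is not the patch's code: mark_char_starts and prev_char_start are hypothetical names, the state is local rather than the patch's static shiftstate, and invalid sequences are assumed to count as single bytes.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: set starts[i] to 1 where a character begins in
 * s[0..len) and to 0 elsewhere; return the number of characters found.
 * Invalid or incomplete sequences are treated as single bytes.
 */
static size_t
mark_char_starts(const char *s, size_t len, char *starts)
{
    mbstate_t st;
    size_t i = 0, nchars = 0;

    memset(&st, 0, sizeof(st));
    memset(starts, 0, len);
    while (i < len) {
        wchar_t wc;
        size_t ret = mbrtowc(&wc, s + i, len - i, &st);

        if (ret == (size_t)-1 || ret == (size_t)-2) {
            memset(&st, 0, sizeof(st));  /* state is unspecified after an error */
            ret = 1;
        } else if (ret == 0) {
            ret = 1;                     /* a null is an ordinary character here */
        }
        starts[i] = 1;
        i += ret;
        nchars++;
    }
    return nchars;
}

/* Step back from byte offset pos to the start of the previous character. */
static size_t
prev_char_start(const char *starts, size_t pos)
{
    assert(pos > 0);
    while (!starts[--pos])
        ;
    return pos;
}
```

The forward scan pays for the library calls once; each backtracking step is then a short walk over the flag array, which relies on starts[0] being set for any non-empty input.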

Various other things that will need changing:
- Indexes for backreferences etc. don't count wide characters even
  in multibyte mode.  That's because parameters currently only count
  bytes, so this would be inconsistent.  The code is there and should
  change if we fix up parameters; holler if I forget.
- Tests for IFS and IFS space characters don't handle wide characters:
  again, the rest of the shell needs changing for these too.

I've also put in a note that I've got a single static shift state for
multibyte characters, but I'm not really sure what effect that will have.
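
One way to sidestep the question of a shared shift state is to let each caller own the state and reset it on error. The following is a sketch under that assumption, not what the patch does; decode_one is a hypothetical name.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical wrapper: decode one character from s[0..n) using a
 * conversion state owned by the caller.  On an invalid or incomplete
 * sequence the state is reset and the first byte is returned as a
 * character.  Returns the number of bytes consumed, always at least 1.
 */
static size_t
decode_one(const char *s, size_t n, mbstate_t *st, wchar_t *out)
{
    size_t ret;

    assert(n > 0);
    ret = mbrtowc(out, s, n, st);
    if (ret == (size_t)-1 || ret == (size_t)-2) {
        memset(st, 0, sizeof(*st));
        *out = (wchar_t)(unsigned char)*s;
        return 1;
    }
    /* mbrtowc() returns 0 for a null; count it as one byte. */
    return ret ? ret : 1;
}
```

After a complete character has been decoded, mbsinit() can be used to check whether the state is back in its initial condition. For stateless encodings such as UTF-8 that should always hold between characters.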

Index: Doc/Zsh/expn.yo
===================================================================
RCS file: /cvsroot/zsh/zsh/Doc/Zsh/expn.yo,v
retrieving revision 1.64
diff -u -r1.64 expn.yo
--- Doc/Zsh/expn.yo	5 Apr 2006 09:58:42 -0000	1.64
+++ Doc/Zsh/expn.yo	9 Apr 2006 21:33:17 -0000
@@ -1461,20 +1461,20 @@
 forms:
 
 startitem()
-item(i)(
+item(tt(i))(
 Case insensitive:  upper or lower case characters in the pattern match
 upper or lower case characters.
 )
-item(l)(
+item(tt(l))(
 Lower case characters in the pattern match upper or lower case
 characters; upper case characters in the pattern still only match
 upper case characters.
 )
-item(I)(
+item(tt(I))(
 Case sensitive:  locally negates the effect of tt(i) or tt(l) from
 that point on.
 )
-item(b)(
+item(tt(b))(
 Activate backreferences for parenthesised groups in the pattern;
 this does not work in filename generation.  When a pattern with a set of
 active parentheses is matched, the strings matched by the groups are
@@ -1525,11 +1525,11 @@
 
 Pattern matching with backreferences is slightly slower than without.
 )
-item(B)(
+item(tt(B))(
 Deactivate backreferences, negating the effect of the tt(b) flag from that
 point on.
 )
-item(m)(
+item(tt(m))(
 Set references to the match data for the entire string matched; this is
 similar to backreferencing and does not work in filename generation.  The
 flag must be in effect at the end of the pattern, i.e. not local to a
@@ -1550,7 +1550,7 @@
 references, other than the extra substitutions required for the
 replacement strings in cases such as the example shown.
 )
-item(M)(
+item(tt(M))(
 Deactivate the tt(m) flag, hence no references to match data will be
 created.
 )
@@ -1596,6 +1596,19 @@
 `tt((#q.))' will be ignored.  Note that colon modifiers in the glob
 qualifiers are also not applied in ordinary pattern matching.
 )
+item(tt(u))(
+Respect the current locale in determining the presence of multibyte
+characters in a pattern, provided the shell was compiled with 
+tt(MULTIBYTE_SUPPORT).  This overrides the tt(MULTIBYTE)
+option; the default behaviour is taken from the option.  Compare tt(U).
+(Mnemonic: typically multibyte characters are from Unicode in the UTF-8
+encoding, although any extension of ASCII supported by the system
+library may be used.)
+)
+item(tt(U))(
+All characters are considered to be a single byte long.  The opposite
+of tt(u).  This overrides the tt(MULTIBYTE) option.
+)
 enditem()
 
 For example, the test string tt(fooxx) can be matched by the pattern
Index: Doc/Zsh/options.yo
===================================================================
RCS file: /cvsroot/zsh/zsh/Doc/Zsh/options.yo,v
retrieving revision 1.45
diff -u -r1.45 options.yo
--- Doc/Zsh/options.yo	20 Mar 2006 11:06:25 -0000	1.45
+++ Doc/Zsh/options.yo	9 Apr 2006 21:33:19 -0000
@@ -411,6 +411,20 @@
 Append a trailing `tt(/)' to all directory
 names resulting from filename generation (globbing).
 )
+pindex(MULTIBYTE)
+cindex(characters, multibyte, in expansion and globbing)
+cindex(multibyte characters, in expansion and globbing)
+item(tt(MULTIBYTE))(
+Respect multibyte characters when found during pattern matching.
+When this option is set, character strings are examined using the
+system library to determine how many bytes form a character, depending
+on the current locale.  If the option is unset
+(or the shell was not compiled with the configuration option
+tt(MULTIBYTE_SUPPORT)) a single byte is always treated as a single
+character.  The option will eventually be extended to cover expansion.
+Note, however, that it does not affect the shell's editor, which always
+uses the locale to determine multibyte characters.
+)
 pindex(NOMATCH)
 cindex(globbing, no matches)
 item(tt(NOMATCH) (tt(PLUS()3)) <C> <Z>)(
Index: Src/options.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/options.c,v
retrieving revision 1.26
diff -u -r1.26 options.c
--- Src/options.c	7 Mar 2006 21:30:37 -0000	1.26
+++ Src/options.c	9 Apr 2006 21:33:20 -0000
@@ -166,6 +166,7 @@
 {{NULL, "markdirs",	      0},			 MARKDIRS},
 {{NULL, "menucomplete",	      0},			 MENUCOMPLETE},
 {{NULL, "monitor",	      OPT_SPECIAL},		 MONITOR},
+{{NULL, "multibyte",	      0/*TBD*/},		 MULTIBYTE},
 {{NULL, "multios",	      OPT_EMULATE|OPT_ZSH},	 MULTIOS},
 {{NULL, "nomatch",	      OPT_EMULATE|OPT_NONBOURNE},NOMATCH},
 {{NULL, "notify",	      OPT_ZSH},			 NOTIFY},
Index: Src/pattern.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/pattern.c,v
retrieving revision 1.29
diff -u -r1.29 pattern.c
--- Src/pattern.c	11 Oct 2005 16:48:06 -0000	1.29
+++ Src/pattern.c	9 Apr 2006 21:33:22 -0000
@@ -216,22 +216,6 @@
 #define P_HSTART        0x02	/* Starts with # or ##'d pattern. */
 #define P_PURESTR	0x04	/* Can be matched with a strcmp */
 
-/*
- * Increment pointer which may be on a Meta (x is a pointer variable),
- * returning the incremented value (i.e. like pre-increment).
- *
- * In future the following will need to refer to metafied multibyte
- * characters.  References to invidual characters are not turned
- * into a macro when the characters is metafied (c.f. CHARREF()
- * below then the character is not metafied) and will need tracking
- * down.
- */
-#define METAINC(x)	((x) += (*(x) == Meta) ? 2 : 1)
-/*
- * Return unmetafied char from string (x is any char *)
- */
-#define UNMETA(x)	(*(x) == Meta ? (x)[1] ^ 32 : *(x))
-
 #if defined(ZSH_64_BIT_TYPE) || defined(LONG_IS_64_BIT)
 typedef zlong zrange_t;
 #define ZRANGE_T_IS_SIGNED	(1)
@@ -288,6 +272,89 @@
 static int patflags;		    /* flags passed down to patcompile */
 static int patglobflags;  /* globbing flags & approx */
 
+/*
+ * Increment pointer to metafied multibyte string.
+ */
+#ifdef MULTIBYTE_SUPPORT
+typedef wchar_t patchar_t;
+
+#define METACHARINC(x) ((void)metacharinc(&x))
+
+/*
+ * TODO: the shiftstate isn't well handled; we don't guarantee
+ * to maintain it properly between characters.  If we don't
+ * need it we should use mbtowc() instead.
+ */
+static mbstate_t shiftstate;
+
+/*
+ * Multibyte version: it's (almost) as easy to return the
+ * value as not, so do so since we sometimes need it.
+ */
+static wchar_t
+metacharinc(char **x)
+{
+    char *inptr = *x;
+    char inchar;
+    size_t ret = MB_INVALID;
+    wchar_t wc;
+
+    /*
+     * Cheat if the top bit isn't set.  This is second-guessing
+     * the library, but we know for sure that if the character
+     * set doesn't have the property that all bytes with the 8th
+     * bit clear are single characters then we are stuffed.
+     */
+    if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(*inptr) & 0x80))
+    {
+	if (itok(*inptr))
+	    inchar = ztokens[*inptr++ - Pound];
+	else if (*inptr == Meta) {
+	    inptr++;
+	    inchar = *inptr++ ^ 32;
+	} else {
+	    inchar = *inptr++;
+	}
+	*x = inptr;
+	return (wchar_t)inchar;
+    }
+
+    while (*inptr) {
+	if (itok(*inptr))
+	    inchar = ztokens[*inptr++ - Pound];
+	else if (*inptr == Meta) {
+	    inptr++;
+	    inchar = *inptr++ ^ 32;
+	} else {
+	    inchar = *inptr++;
+	}
+	ret = mbrtowc(&wc, &inchar, 1, &shiftstate);
+
+	if (ret == MB_INVALID)
+	    break;
+	if (ret == MB_INCOMPLETE)
+	    continue;
+	*x = inptr;
+	return wc;
+    }
+
+    /* Error.  Treat as single byte. */
+    /* Reset the shift state for next time. */
+    memset(&shiftstate, 0, sizeof(shiftstate));
+    return (wchar_t) *(*x)++;
+}
+
+#else
+typedef int patchar_t;
+
+#define METACHARINC(x)	((void)((x) += (*(x) == Meta) ? 2 : 1))
+/*
+ * Return unmetafied char from string (x is any char *)
+ */
+#define UNMETA(x)	(*(x) == Meta ? (x)[1] ^ 32 : *(x))
+#endif
+
+
 /* Add n more characters, ensuring there is enough space. */
 
 enum {
@@ -353,6 +420,8 @@
 	patglobflags = 0;
     else
 	patglobflags = GF_IGNCASE;
+    if (isset(MULTIBYTE))
+	patglobflags |= GF_MULTIBYTE;
 }
 
 /*
@@ -404,7 +473,10 @@
 	patendseglen--;
 	patendstrlen--;
 	remnulargs(patparse);
-	patglobflags = 0;
+	if (isset(MULTIBYTE))
+	    patglobflags = GF_MULTIBYTE;
+	else
+	    patglobflags = 0;
     }
     /*
      * Have to be set now, since they get updated during compilation.
@@ -935,6 +1007,14 @@
 		*assertp = P_ISEND;
 		break;
 
+	    case 'u':
+		patglobflags |= GF_MULTIBYTE;
+		break;
+
+	    case 'U':
+		patglobflags &= ~GF_MULTIBYTE;
+		break;
+
 	    default:
 		return 0;
 	    }
@@ -961,11 +1041,16 @@
     long starter = 0, next, pound, op;
     int flags, flags2, kshchar, len, ch, patch, nmeta;
     union upat up;
-    char *nptr, *str0, *ptr, cbuf[2];
+    char *nptr, *str0, *ptr, *patprev;
     zrange_t from, to;
+#ifdef MULTIBYTE_SUPPORT
+    char *charstart;
+#else
+    char cbuf[2];
+#endif
 
     flags = 0;
-    str0 = patparse;
+    str0 = patprev = patparse;
     for (;;) {
 	/*
 	 * Check if we have a string. First, we need to make sure
@@ -992,7 +1077,9 @@
 			 !memchr(patendseg, patparse[1], patendseglen))))
 	    break;
 
-	METAINC(patparse);
+	/* Remember the previous character for backtracking */
+	patprev = patparse;
+	METACHARINC(patparse);
     }
 
     if (patparse > str0) {
@@ -1007,13 +1094,13 @@
 	flags |= P_PURESTR;
 	DPUTS(patparse == str0, "BUG: matched nothing in patcomppiece.");
 	/* more than one character matched? */
-	morelen = str0 + (*str0 == Meta ? 2 : 1) < patparse;
+	morelen = (patprev > str0);
 	/*
 	 * If we have more than one character, a following hash only
-	 * applies to the last, so decrement.
+	 * applies to the last, so backtrack one character.
 	 */
 	if (isset(EXTENDEDGLOB) && *patparse == Pound && morelen)
-	    patparse -= (patparse > str0 + 1 && patparse[-2] == Meta) ? 2 : 1;
+	    patparse = patprev;
 	/*
 	 * If len is 1, we can't have an active # following, so doesn't
 	 * matter that we don't make X in `XX#' simple.
@@ -1066,7 +1153,7 @@
 	    patparse++;
 
 	patch = *patparse;
-	METAINC(patparse);
+	METACHARINC(patparse);
 	switch(patch) {
 	case Quest:
 	    flags |= P_SIMPLE;
@@ -1137,27 +1224,27 @@
 			    patadd(NULL, STOUC(Meta+ch), 1, PA_NOALIGN);
 			continue;
 		}
-		if (itok(*patparse)) {
-		    cbuf[0] = ztokens[*patparse - Pound];
-		} else if (*patparse == Meta) {
-		    cbuf[0] = Meta;
-		    cbuf[1] = *++patparse;
-		} else
-		    cbuf[0] = *patparse;
-		patparse++;
+		charstart = patparse;
+		METACHARINC(patparse);
 
-		if (*patparse == '-' && patparse[1] != Outbrack) {
+		if (*patparse == '-' && patparse[1] &&
+		    patparse[1] != Outbrack) {
 		    patadd(NULL, STOUC(Meta+PP_RANGE), 1, PA_NOALIGN);
-		    patadd(cbuf, 0, (cbuf[0] == Meta) ? 2 : 1, PA_NOALIGN);
-		    if (itok(*++patparse)) {
-			patadd(0, STOUC(ztokens[*patparse - Pound]), 1,
+		    if (itok(*charstart)) {
+			patadd(0, STOUC(ztokens[*charstart - Pound]), 1,
 			       PA_NOALIGN);
-		    } else
-			patadd(patparse, 0, (*patparse == Meta) ? 2 : 1,
-			       PA_NOALIGN);
-		    METAINC(patparse);
-		} else
-		    patadd(cbuf, 0, (cbuf[0] == Meta) ? 2 : 1, PA_NOALIGN);
+		    } else {
+			patadd(charstart, 0, patparse-charstart, PA_NOALIGN);
+		    }
+		    charstart = ++patparse;	/* skip ASCII '-' */
+		    METACHARINC(patparse);
+		}
+		if (itok(*charstart)) {
+		    patadd(0, STOUC(ztokens[*charstart - Pound]), 1,
+			   PA_NOALIGN);
+		} else {
+		    patadd(charstart, 0, patparse-charstart, PA_NOALIGN);
+		}
 	    }
 	    if (*patparse != Outbrack)
 		return 0;
@@ -1475,19 +1562,140 @@
 static int globdots;			/* Glob initial dots? */
 
 /*
- * Macros which are currently trivial but are likely to be less
- * so when we handle multibyte characters.  They operate on
- * umetafied strings.
+ * Character functions operating on unmetafied strings.
+ */
+#ifdef MULTIBYTE_SUPPORT
+
+/* Get a character from the start point in a string */
+#define CHARREF(x, y)	charref((x), (y))
+static wchar_t
+charref(char *x, char *y)
+{
+    wchar_t wc;
+    size_t ret;
+
+    if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(*x) & 0x80))
+	return (wchar_t) *x;
+
+    ret = mbrtowc(&wc, x, y-x, &shiftstate);
+
+    if (ret == MB_INVALID || ret == MB_INCOMPLETE) {
+	/* Error.  Treat as single byte. */
+	/* Reset the shift state for next time. */
+	memset(&shiftstate, 0, sizeof(shiftstate));
+	return (wchar_t) *x;
+    }
+
+    return wc;
+}
+
+/* Get  a pointer to the next character */
+#define CHARNEXT(x, y)	charnext((x), (y))
+static char *
+charnext(char *x, char *y)
+{
+    wchar_t wc;
+    size_t ret;
+
+    if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(*x) & 0x80))
+	return x + 1;
+
+    ret = mbrtowc(&wc, x, y-x, &shiftstate);
+
+    if (ret == MB_INVALID || ret == MB_INCOMPLETE) {
+	/* Error.  Treat as single byte. */
+	/* Reset the shift state for next time. */
+	memset(&shiftstate, 0, sizeof(shiftstate));
+	return x + 1;
+    }
+
+    /* Nulls here are normal characters */
+    return x + (ret ? ret : 1);
+}
+
+/* Increment a pointer past the current character. */
+#define CHARINC(x, y)	((x) = charnext((x), (y)))
+
+
+/* Get a character and increment */
+#define CHARREFINC(x, y)	charrefinc(&(x), (y))
+static wchar_t
+charrefinc(char **x, char *y)
+{
+    wchar_t wc;
+    size_t ret;
+
+    if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(**x) & 0x80))
+	return (wchar_t) *(*x)++;
+
+    ret = mbrtowc(&wc, *x, y-*x, &shiftstate);
+
+    if (ret == MB_INVALID || ret == MB_INCOMPLETE) {
+	/* Error.  Treat as single byte. */
+	/* Reset the shift state for next time. */
+	memset(&shiftstate, 0, sizeof(shiftstate));
+	return (wchar_t) *(*x)++;
+    }
+
+    /* Nulls here are normal characters */
+    *x += ret ? ret : 1;
+
+    return wc;
+}
+
+
+#ifndef PARAMETER_CODE_HANDLES_MULTIBYTE
+/*
+ * TODO: We should use the other branch, but currently
+ * the parameter code doesn't handle multibyte input,
+ * so this would produce the wrong subscripts,
+ * so just use a raw byte difference for now.
  */
+/* Count the number of characters between two pointers, smaller first */
+# define CHARSUB(x,y)	((y) - (x))
+#else
+/* Count the number of characters between two pointers, smaller first */
+#define CHARSUB(x,y)	charsub(x, y)
+static ptrdiff_t
+charsub(char *x, char *y)
+{
+    ptrdiff_t res = 0;
+    size_t ret;
+    wchar_t wc;
+
+    while (x < y) {
+	ret = mbrtowc(&wc, x, y-x, &shiftstate);
+
+	if (ret == MB_INVALID || ret == MB_INCOMPLETE) {
+	    /* Error.  Treat remainder as single characters */
+	    return res + (y - x);
+	}
+
+	/* Treat nulls as normal characters */
+	if (!ret)
+	    ret = 1;
+	res++;
+	x += ret;
+    }
+
+    return res;
+}
+#endif
+
+#else /* no MULTIBYTE_SUPPORT */
 
 /* Get a character from the start point in a string */
-#define CHARREF(x)	(STOUC(*x))
+#define CHARREF(x, y)	(STOUC(*(x)))
 /* Get  a pointer to the next character */
-#define CHARNEXT(x)	(x+1)
+#define CHARNEXT(x, y)	((x)+1)
 /* Increment a pointer past the current character. */
-#define CHARINC(x)	(x++)
-/* Counter the number of characters between two pointers, largest first */
-#define CHARSUB(x,y)	(x-y)
+#define CHARINC(x, y)	((x)++)
+/* Get a character and increment */
+#define CHARREFINC(x, y)	(STOUC(*(x)++))
+/* Count the number of characters between two pointers, smaller first */
+#define CHARSUB(x,y)	(y-x)
+
+#endif /* MULTIBYTE_SUPPORT */
 
 /*
  * The following need to be accessed in the globbing scanner for
@@ -1798,7 +2006,7 @@
 		 * Remember the test pattern is already unmetafied.
 		 */
 		char *str;
-		int mlen = CHARSUB(patinput, patinstart);
+		int mlen = CHARSUB(patinstart, patinput);
 
 		str = metafy(patinstart, patinput - patinstart, META_DUP);
 		setsparam("MATCH", str);
@@ -1820,9 +2028,9 @@
 		for (i = 0; i < prog->patnpar && i < maxnpos; i++) {
 		    if (parsfound & (1 << i)) {
 			if (begp)
-			    *begp++ = CHARSUB(*sp, patinstart) + patoffset;
+			    *begp++ = CHARSUB(patinstart, *sp) + patoffset;
 			if (endp)
-			    *endp++ = CHARSUB(*ep, patinstart) + patoffset
+			    *endp++ = CHARSUB(patinstart, *ep) + patoffset
 				- 1;
 		    } else {
 			if (begp)
@@ -1862,12 +2070,12 @@
 			 * corresponds to indexing as ${foo[1,1]}.
 			 */
 			sprintf(numbuf, "%ld",
-				(long)(CHARSUB(*sp, patinstart) +
+				(long)(CHARSUB(patinstart, *sp) +
 				       patoffset +
 				       !isset(KSHARRAYS)));
 			mbeginarr[i] = ztrdup(numbuf);
 			sprintf(numbuf, "%ld",
-				(long)(CHARSUB(*ep, patinstart) +
+				(long)(CHARSUB(patinstart, *ep) +
 				       patoffset +
 				       !isset(KSHARRAYS) - 1));
 			mendarr[i] = ztrdup(numbuf);
@@ -1916,12 +2124,26 @@
  * Match literal characters with case insensitivity test:  the first
  * comes from the input string, the second the current pattern.
  */
+#ifdef MULTIBYTE_SUPPORT
+#define ISUPPER(x)	iswupper(x)
+#define ISLOWER(x)	iswlower(x)
+#define TOUPPER(x)	towupper(x)
+#define TOLOWER(x)	towlower(x)
+#define ISDIGIT(x)	iswdigit(x)
+#else
+#define ISUPPER(x)	isupper(x)
+#define ISLOWER(x)	islower(x)
+#define TOUPPER(x)	toupper(x)
+#define TOLOWER(x)	tolower(x)
+#define ISDIGIT(x)	idigit(x)
+#endif
 #define CHARMATCH(chin, chpa) (chin == chpa || \
         ((patglobflags & GF_IGNCASE) ? \
-	 ((isupper(chin) ? tolower(chin) : chin) == \
-	  (isupper(chpa) ? tolower(chpa) : chpa)) : \
+	 ((ISUPPER(chin) ? TOLOWER(chin) : chin) == \
+	  (ISUPPER(chpa) ? TOLOWER(chpa) : chpa)) : \
 	 (patglobflags & GF_LCMATCHUC) ? \
-	 (islower(chpa) && toupper(chpa) == chin) : 0))
+	 (ISLOWER(chpa) && TOUPPER(chpa) == chin) : 0))
+
 /*
  * The same but caching an expression from the first argument,
  * Requires local charmatch_cache definition.
@@ -1968,7 +2190,7 @@
 	    if (patinput == patinend)
 		fail = 1;
 	    else
-		CHARINC(patinput);
+		CHARINC(patinput, patinend);
 	    break;
 	case P_EXACTLY:
 	    /*
@@ -1984,14 +2206,16 @@
 	    }
 	    exactpos = NULL;
 	    while (chrop < chrend && patinput < patinend) {
-		int chin = CHARREF(patinput);
-		int chpa = CHARREF(chrop);
+		char *savpatinput = patinput;
+		char *savchrop = chrop;
+		patchar_t chin = CHARREFINC(patinput, patinend);
+		patchar_t chpa = CHARREFINC(chrop, chrend);
 		if (!CHARMATCH(chin, chpa)) {
 		    fail = 1;
+		    patinput = savpatinput;
+		    chrop = savchrop;
 		    break;
 		}
-		CHARINC(chrop);
-		CHARINC(patinput);
 	    }
 	    if (chrop < chrend) {
 		exactpos = chrop;
@@ -2002,18 +2226,18 @@
 	case P_ANYOF:
 	    if (patinput == patinend ||
 		!patmatchrange((char *)P_OPERAND(scan),
-			       CHARREF(patinput)))
+			       CHARREF(patinput, patinend)))
 		fail = 1;
 	    else
-		CHARINC(patinput);
+		CHARINC(patinput, patinend);
 	    break;
 	case P_ANYBUT:
 	    if (patinput == patinend ||
 		patmatchrange((char *)P_OPERAND(scan),
-			      CHARREF(patinput)))
+			      CHARREF(patinput, patinend)))
 		fail = 1;
 	    else
-		CHARINC(patinput);
+		CHARINC(patinput, patinend);
 	    break;
 	case P_NUMRNG:
 	case P_NUMFROM:
@@ -2108,7 +2332,7 @@
 	case P_NUMANY:
 	    /* This is <->: any old set of digits, don't bother comparing */
 	    start = patinput;
-	    while (patinput < patinend && idigit(CHARREF(patinput)))
+	    while (patinput < patinend && idigit(*patinput))
 		patinput++;
 	    save = patinput;
 	    no = 0;
@@ -2117,7 +2341,7 @@
 		    return 1;
 		if (!no && P_OP(next) == P_EXACTLY &&
 		    (!P_LS_LEN(next) ||
-		     !idigit(CHARREF(P_LS_STR(next)))) &&
+		     !idigit(*P_LS_STR(next))) &&
 		    !(patglobflags & 0xff))
 		    return 0;
 		patinput = --save;
@@ -2462,74 +2686,89 @@
 	    op = P_OP(scan);
 	    /* Note that no counts possibly metafied characters */
 	    start = patinput;
-	    if (op == P_STAR) {
-		for (no = 0; patinput < patinend; CHARINC(patinput))
-		    no++;
-		/* simple optimization for reasonably common case */
-		if (P_OP(next) == P_END)
-		    return 1;
-	    } else {
-		DPUTS(patglobflags & 0xff,
-		      "BUG: wrong backtracking with approximation.");
-		if (!globdots && P_NOTDOT(P_OPERAND(scan)) &&
-		    patinput == patinstart && patinput < patinend &&
-		    CHARREF(patinput) == '.')
-		    return 0;
-		no = patrepeat(P_OPERAND(scan));
-	    }
-	    min = (op == P_TWOHASH) ? 1 : 0;
-	    /*
-	     * Lookahead to avoid useless matches. This is not possible
-	     * with approximation.
-	     */
-	    if (P_OP(next) == P_EXACTLY && P_LS_LEN(next) &&
-		!(patglobflags & 0xff)) {
-		char *nextop = P_LS_STR(next);
+	    {
+		char *lastcharstart;
 		/*
-		 * If that P_EXACTLY is last (common in simple patterns,
-		 * such as *.c), then it can be only be matched at one
-		 * point in the test string, so record that.
+		 * Array to record the start of characters for
+		 * backtracking.
 		 */
-		if (P_OP(PATNEXT(next)) == P_END &&
-		    !(patflags & PAT_NOANCH)) {
-		    int ptlen = patinend - patinput;
-		    int lenmatch = patinend - (min ? CHARNEXT(start) : start);
-		    /* Are we in the right range? */
-		    if (P_LS_LEN(next) > lenmatch || P_LS_LEN(next) < ptlen)
-			return 0;
-		    /* Yes, just position appropriately and test. */
-		    patinput += ptlen - P_LS_LEN(next);
-		    /*
-		     * Here we will need to be careful that patinput is not
-		     * in the middle of a multibyte character.
-		     */
-		    /* Continue loop with P_EXACTLY test. */
-		    break;
-		}
-		nextch = CHARREF(nextop);
-	    } else
-		nextch = -1;
-	    save = patinput;
-	    savglobflags = patglobflags;
-	    saverrsfound = errsfound;
-	    while (no >= min) {
-		int charmatch_cache;
-		if (nextch < 0 ||
-		    (patinput < patinend &&
-		     CHARMATCH_EXPR(CHARREF(patinput), nextch))) {
-		    if (patmatch(next))
+		VARARR(char, charstart, patinend-patinput);
+		memset(charstart, 0, patinend-patinput);
+
+		if (op == P_STAR) {
+		    for (no = 0; patinput < patinend;
+			 CHARINC(patinput, patinend))
+		    {
+			charstart[patinput-start] = 1;
+			no++;
+		    }
+		    /* simple optimization for reasonably common case */
+		    if (P_OP(next) == P_END)
 			return 1;
+		} else {
+		    DPUTS(patglobflags & 0xff,
+			  "BUG: wrong backtracking with approximation.");
+		    if (!globdots && P_NOTDOT(P_OPERAND(scan)) &&
+			patinput == patinstart && patinput < patinend &&
+			CHARREF(patinput, patinend) == ZWC('.'))
+			return 0;
+		    no = patrepeat(P_OPERAND(scan), charstart);
 		}
-		no--;
-		save--;
+		min = (op == P_TWOHASH) ? 1 : 0;
 		/*
-		 * Here we will need to make sure save is
-		 * decremented properly to the start of
-		 * the preceeding multibyte character.
+		 * Lookahead to avoid useless matches. This is not possible
+		 * with approximation.
 		 */
-		patinput = save;
-		patglobflags = savglobflags;
-		errsfound = saverrsfound;
+		if (P_OP(next) == P_EXACTLY && P_LS_LEN(next) &&
+		    !(patglobflags & 0xff)) {
+		    char *nextop = P_LS_STR(next);
+		    int nextlen = P_LS_LEN(next);
+		    /*
+		     * If that P_EXACTLY is last (common in simple patterns,
+		     * such as *.c), then it can only be matched at one
+		     * point in the test string, so record that.
+		     */
+		    if (P_OP(PATNEXT(next)) == P_END &&
+			!(patflags & PAT_NOANCH)) {
+			int ptlen = patinend - patinput;
+			int lenmatch = patinend -
+			    (min ? CHARNEXT(start, patinend) : start);
+			/* Are we in the right range? */
+			if (P_LS_LEN(next) > lenmatch ||
+			    P_LS_LEN(next) < ptlen)
+			    return 0;
+			/* Yes, just position appropriately and test. */
+			patinput += ptlen - P_LS_LEN(next);
+			/*
+			 * Here we will need to be careful that patinput is not
+			 * in the middle of a multibyte character.
+			 */
+			/* Continue loop with P_EXACTLY test. */
+			break;
+		    }
+		    nextch = CHARREF(nextop, nextop + nextlen);
+		} else
+		    nextch = -1;
+		savglobflags = patglobflags;
+		saverrsfound = errsfound;
+		lastcharstart = charstart + (patinput - start);
+		while (no >= min) {
+		    int charmatch_cache;
+		    if (nextch < 0 ||
+			(patinput < patinend &&
+			 CHARMATCH_EXPR(CHARREF(patinput, patinend),
+					nextch))) {
+			if (patmatch(next))
+			    return 1;
+		    }
+		    no--;
+		    /* find start of previous full character */
+		    while (!*--lastcharstart)
+			;
+		    patinput = start + (lastcharstart-charstart);
+		    patglobflags = savglobflags;
+		    errsfound = saverrsfound;
+		}
 	    }
 	    /*
 	     * As with branches, the patmatch(next) stuff for *
@@ -2590,7 +2829,7 @@
 
 		/* Try omitting a character from the input string */
 		if (patinput < patinend) {
-		    CHARINC(patinput);
+		    CHARINC(patinput, patinend);
 		    /* If we are not on an exact match, then this is
 		     * our last gasp effort, so we can optimize out
 		     * the recursive call.
@@ -2605,11 +2844,11 @@
 		    char *nextexact = savexact;
 		    DPUTS(!savexact,
 			  "BUG: exact match has not set exactpos");
-		    CHARINC(nextexact);
+		    CHARINC(nextexact, exactend);
 
 		    if (save < patinend) {
 			char *nextin = save;
-			CHARINC(nextin);
+			CHARINC(nextin, patinend);
 			patglobflags = savglobflags;
 			errsfound = saverrsfound;
 			exactpos = savexact;
@@ -2620,17 +2859,17 @@
 			 */
 			if (save < patinend && nextin < patinend &&
 			    nextexact < exactend) {
-			    int cin0 = CHARREF(save);
-			    int cpa0 = CHARREF(exactpos);
-			    int cin1 = CHARREF(nextin);
-			    int cpa1 = CHARREF(nextexact);
+			    patchar_t cin0 = CHARREF(save, patinend);
+			    patchar_t cpa0 = CHARREF(exactpos, exactend);
+			    patchar_t cin1 = CHARREF(nextin, patinend);
+			    patchar_t cpa1 = CHARREF(nextexact, exactend);
 
 			    if (CHARMATCH(cin0, cpa1) &&
 				CHARMATCH(cin1, cpa0)) {
 				patinput = nextin;
-				CHARINC(patinput);
+				CHARINC(patinput, patinend);
 				exactpos = nextexact;
-				CHARINC(exactpos);
+				CHARINC(exactpos, exactend);
 				if (patmatch(scan))
 				    return 1;
 
@@ -2659,7 +2898,7 @@
 		     * This must be the last attempt, so just loop
 		     * instead of calling recursively.
 		     */
-		    CHARINC(exactpos);
+		    CHARINC(exactpos, exactend);
 		    continue;
 		}
 	    }
@@ -2673,6 +2912,122 @@
     return 0;
 }
 
+
+/**/
+#ifdef MULTIBYTE_SUPPORT
+
+/**/
+static int
+patmatchrange(char *range, wchar_t ch)
+{
+    wchar_t r1, r2;
+
+    /*
+     * Careful here: unlike other strings, range is a NULL-terminated,
+     * metafied string, because we need to treat the Posix and hyphenated
+     * ranges specially.
+     */
+    while (*range) {
+	if (imeta(STOUC(*range))) {
+	    switch (STOUC(*range++) - STOUC(Meta)) {
+	    case 0:
+		/* ordinary metafied character */
+		range--;
+		if (metacharinc(&range) == ch)
+		    return 1;
+		break;
+	    case PP_ALPHA:
+		if (iswalpha(ch))
+		    return 1;
+		break;
+	    case PP_ALNUM:
+		if (iswalnum(ch))
+		    return 1;
+		break;
+	    case PP_ASCII:
+		if ((ch & ~0x7f) == 0)
+		    return 1;
+		break;
+	    case PP_BLANK:
+		if (ch == L' ' || ch == L'\t')
+		    return 1;
+		break;
+	    case PP_CNTRL:
+		if (iswcntrl(ch))
+		    return 1;
+		break;
+	    case PP_DIGIT:
+		if (iswdigit(ch))
+		    return 1;
+		break;
+	    case PP_GRAPH:
+		if (iswgraph(ch))
+		    return 1;
+		break;
+	    case PP_LOWER:
+		if (iswlower(ch))
+		    return 1;
+		break;
+	    case PP_PRINT:
+		if (iswprint(ch))
+		    return 1;
+		break;
+	    case PP_PUNCT:
+		if (iswpunct(ch))
+		    return 1;
+		break;
+	    case PP_SPACE:
+		if (iswspace(ch))
+		    return 1;
+		break;
+	    case PP_UPPER:
+		if (iswupper(ch))
+		    return 1;
+		break;
+	    case PP_XDIGIT:
+		if (iswxdigit(ch))
+		    return 1;
+		break;
+	    case PP_IDENT:
+		if (wcsiident(ch))
+		    return 1;
+		break;
+	    case PP_IFS:
+		/* TODO */
+		if (isep(ch))
+		    return 1;
+		break;
+	    case PP_IFSSPACE:
+		/* TODO */
+		if (iwsep(ch))
+		    return 1;
+		break;
+	    case PP_WORD:
+		if (wcsiword(ch))
+		    return 1;
+		break;
+	    case PP_RANGE:
+		r1 = metacharinc(&range);
+		r2 = metacharinc(&range);
+		if (r1 <= ch && ch <= r2)
+		    return 1;
+		break;
+	    case PP_UNKWN:
+		DPUTS(1, "BUG: unknown posix range passed through.\n");
+		break;
+	    default:
+		DPUTS(1, "BUG: unknown metacharacter in range.");
+		break;
+	    }
+	} else if (metacharinc(&range) == ch)
+	    return 1;
+    }
+    return 0;
+}
+
+/**/
+#else
+
 /**/
 static int
 patmatchrange(char *range, int ch)
@@ -2756,17 +3111,13 @@
 		    return 1;
 		break;
 	    case PP_WORD:
-		/*
-		 * HERE: when we support multibyte characters,
-		 * this test needs to be wcsiword().
-		 */
 		if (iword(ch))
 		    return 1;
 		break;
 	    case PP_RANGE:
 		range++;
 		r1 = STOUC(UNMETA(range));
-		METAINC(range);
+		METACHARINC(range);
 		r2 = STOUC(UNMETA(range));
 		if (*range == Meta)
 		    range++;
@@ -2786,12 +3137,21 @@
     return 0;
 }
 
-/* repeatedly match something simple and say how many times */
+/**/
+#endif
+
+/*
+ * Repeatedly match something simple and say how many times.
+ * charstart is an array parallel to that starting at patinput
+ * and records the start of (possibly multibyte) characters
+ * to aid in later backtracking.
+ */
 
 /**/
-static int patrepeat(Upat p)
+static int patrepeat(Upat p, char *charstart)
 {
-    int count = 0, tch, charmatch_cache;
+    int count = 0;
+    patchar_t tch, charmatch_cache;
     char *scan, *opnd;
 
     scan = patinput;
@@ -2806,23 +3166,28 @@
 #endif
     case P_EXACTLY:
 	DPUTS(P_LS_LEN(p) != 1, "closure following more than one character");
-	tch = CHARREF(P_LS_STR(p));
+	tch = CHARREF(P_LS_STR(p), P_LS_STR(p) + P_LS_LEN(p));
 	while (scan < patinend &&
-	       CHARMATCH_EXPR(CHARREF(scan), tch)) {
+	       CHARMATCH_EXPR(CHARREF(scan, patinend), tch)) {
+	    charstart[scan-patinput] = 1;
 	    count++;
-	    CHARINC(scan);
+	    CHARINC(scan, patinend);
 	}
 	break;
     case P_ANYOF:
-	while (scan < patinend && patmatchrange(opnd, CHARREF(scan))) {
+	while (scan < patinend &&
+	       patmatchrange(opnd, CHARREF(scan, patinend))) {
+	    charstart[scan-patinput] = 1;
 	    count++;
-	    CHARINC(scan);
+	    CHARINC(scan, patinend);
     	}
 	break;
     case P_ANYBUT:
-	while (scan < patinend && !patmatchrange(opnd, CHARREF(scan))) {
+	while (scan < patinend &&
+	       !patmatchrange(opnd, CHARREF(scan, patinend))) {
+	    charstart[scan-patinput] = 1;
 	    count++;
-	    CHARINC(scan);
+	    CHARINC(scan, patinend);
     	}
 	break;
 #ifdef DEBUG
@@ -2846,237 +3211,3 @@
     if (prog && prog != dummy_patprog1 && prog != dummy_patprog2)
 	zfree(prog, prog->size);
 }
-
-/**/
-#ifdef ZSH_PAT_DEBUG
-
-/* Debugging stuff: print and test a regular expression */
-
-/* Dump a regexp onto stdout in vaguely comprehensible form */
-
-/**/
-static void
-patdump(Patprog r)
-{
-    char *s, *base, op = P_EXACTLY;
-    Upat up, codestart, next;
-
-    base = (char *)r;
-    s = base + r->startoff;
-
-    if (r->flags & PAT_PURES) {
-	printf("STRING:%s\n", (char *)s);
-    } else {
-	codestart = (Upat)s;
-	while (op != P_END) {
-	    up = (Upat)s;
-	    op = P_OP(up);
-	    printf("%2d%s", up-codestart, patprop(up));
-	    next = PATNEXT(up);
-	    printf("(%d)", next ? next-codestart : 0);
-	    s += sizeof(union upat);
-	    if (op == P_EXACTLY) {
-		long llen = *(long *)s;
-		s += sizeof(long);
-		while (llen--) {
-		    putchar(CHARREF(s));
-		    CHARINC(s);
-		}
-	    } else if (op == P_ANYOF || op == P_ANYBUT) {
-		while (*s != '\0') {
-		    if (itok(*s)) {
-			if (*s == Meta + PP_RANGE) {
-			    s++;
-			    printf("<RANGE:%c-", UNMETA(s));
-			    METAINC(s);
-			    printf("%c>", UNMETA(s));
-			} else {
-			    printf("<TYPE:%d>", *s - Meta);
-			    s++;
-			    continue;
-			}
-		    } else
-			putchar(UNMETA(s));
-		    METAINC(s);
-		}
-	    } else if (op == P_NUMRNG || op == P_NUMFROM || op == P_NUMTO) {
-		printf("%lu", (unsigned long)*(zrange_t *)s);
-		s += sizeof(zrange_t);
-		if (op == P_NUMRNG) {
-		    printf("-%lu", (unsigned long)*(zrange_t *)s);
-		    s += sizeof(zrange_t);
-		}
-	    } else if (op == P_GFLAGS) {
-		printf("%ld, %ld", (++up)->l & ~0xff, (++up)->l & 0xff);
-		s += sizeof(union upat);
-	    } else if (op == P_WBRANCH || op == P_EXCLUDE ||
-		       op == P_EXCLUDP) {
-		s += sizeof(union upat);
-	    }
-	    putchar('\n');
-	    s = base + (((s - base) + sizeof(union upat) - 1) &
-			~(sizeof(union upat) - 1));
-	}
-    }
-
-    printf("Total size = %ld\n", r->size);
-    if (r->patstartch)
-	printf("start `%c' ", r->patstartch);
-    if (!(r->flags & PAT_NOANCH))
-	printf("EOL-anchor ");
-    if (r->patnpar)
-	printf("%d active backreferences ", r->patnpar);
-    if (r->mustoff)
-	printf("must have \"%s\"", (char *)r + r->mustoff);
-    printf("\n");
-    if (r->globflags) {
-	printf("Globbing flags: ");
-	if (r->globflags & GF_LCMATCHUC)
-	    printf("LC matches UC ");
-	if (r->globflags & GF_IGNCASE)
-	    printf("Ignore case");
-	printf("\n");
-	if (r->globflags & 0xff)
-	    printf("Max errors = %d\n", r->globflags & 0xff);
-    }
-}
-
-/**/
-static char *
-patprop(Upat op)
-{
-    char *p = NULL;
-    static char buf[50];
-
-    strcpy(buf, ":");
-
-    switch(P_OP(op)) {
-    case P_ANY:
-	p = "ANY";
-	break;
-    case P_ANYOF:
-	p = "ANYOF";
-	break;
-    case P_ANYBUT:
-	p = "ANYBUT";
-	break;
-    case P_BRANCH:
-	p = "BRANCH";
-	break;
-    case P_WBRANCH:
-	p = "WBRANCH";
-	break;
-    case P_EXCLUDE:
-	p = "EXCLUDE";
-	break;
-    case P_EXCLUDP:
-	p = "EXCLUDP";
-	break;
-    case P_EXCSYNC:
-	p = "EXCSYNC";
-	break;
-    case P_EXCEND:
-	p = "EXCEND";
-	break;
-    case P_EXACTLY:
-	p = "EXACTLY";
-	break;
-    case P_GFLAGS:
-	p = "GFLAGS";
-	break;
-    case P_ISSTART:
-	p = "ISSTART";
-	break;
-    case P_ISEND:
-	p = "ISEND";
-	break;
-    case P_NOTHING:
-	p = "NOTHING";
-	break;
-    case P_BACK:
-	p = "BACK";
-	break;
-    case P_END:
-	p = "END";
-	break;
-    case P_OPEN:
-    case P_OPEN+1:
-    case P_OPEN+2:
-    case P_OPEN+3:
-    case P_OPEN+4:
-    case P_OPEN+5:
-    case P_OPEN+6:
-    case P_OPEN+7:
-    case P_OPEN+8:
-    case P_OPEN+9:
-	sprintf(buf+strlen(buf), "OPEN%ld", P_OP(op)-P_OPEN);
-	p = NULL;
-	break;
-    case P_CLOSE:
-    case P_CLOSE+1:
-    case P_CLOSE+2:
-    case P_CLOSE+3:
-    case P_CLOSE+4:
-    case P_CLOSE+5:
-    case P_CLOSE+6:
-    case P_CLOSE+7:
-    case P_CLOSE+8:
-    case P_CLOSE+9:
-	sprintf(buf+strlen(buf), "CLOSE%ld", P_OP(op)-P_CLOSE);
-	p = NULL;
-	break;
-    case P_STAR:
-	p = "STAR";
-	break;
-    case P_ONEHASH:
-	p = "ONEHASH";
-	break;
-    case P_TWOHASH:
-	p = "TWOHASH";
-	break;
-    case P_NUMRNG:
-	p = "NUMRNG";
-	break;
-    case P_NUMFROM:
-	p = "NUMFROM";
-	break;
-    case P_NUMTO:
-	p = "NUMTO";
-	break;
-    case P_NUMANY:
-	p = "NUMANY";
-	break;
-    default:
-	fprintf(stderr, "Bad opcode\n");
-	p = NULL;
-	break;
-    }
-    if (p)
-	strcat(buf, p);
-    return buf;
-}
-
-/**/
-int
-bin_patdebug(char *name, char **args, char *ops, int func)
-{
-    Patprog prog;
-    int ret = 0;
-
-    tokenize(*args);
-
-    if (!(prog = patcompile((char *)*args, 0, 0)))
-	return 1;
-    if (ops['p'] || !args[1]) {
-	patdump(prog);
-    }
-
-    while (*++args) {
-	if (!pattry(prog, (char *)*args))
-	    ret++;
-    }
-    return ret;
-}
-
-/**/
-#endif /* ZSH_PAT_DEBUG */
Index: Src/utils.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/utils.c,v
retrieving revision 1.119
diff -u -r1.119 utils.c
--- Src/utils.c	7 Mar 2006 21:31:35 -0000	1.119
+++ Src/utils.c	9 Apr 2006 21:33:24 -0000
@@ -2702,7 +2702,7 @@
     } else if (len == 1 && iascii(*outstr)) {
 	return iident(*outstr);
     } else {
-	/* not currently allowed, see above */
+	/* TODO: not currently allowed, see above */
 	return 0;
     }
 }
Index: Src/zsh.h
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/zsh.h,v
retrieving revision 1.87
diff -u -r1.87 zsh.h
--- Src/zsh.h	7 Mar 2006 22:05:45 -0000	1.87
+++ Src/zsh.h	9 Apr 2006 21:33:26 -0000
@@ -1161,6 +1161,7 @@
 #define GF_IGNCASE	0x0200
 #define GF_BACKREF	0x0400
 #define GF_MATCHREF	0x0800
+#define GF_MULTIBYTE	0x1000	/* Use multibyte if supported by build */
 
 /* Dummy Patprog pointers. Used mainly in executable code, but the
  * pattern code needs to know about it, too. */
@@ -1596,6 +1597,7 @@
     MARKDIRS,
     MENUCOMPLETE,
     MONITOR,
+    MULTIBYTE,
     MULTIOS,
     NOMATCH,
     NOTIFY,
@@ -1924,4 +1926,26 @@
 
 #define MB_INCOMPLETE	((size_t)-2)
 #define MB_INVALID	((size_t)-1)
+
+/*
+ * MB_CUR_MAX is the maximum number of bytes that a single wide
+ * character will convert into.  We use it to keep strings
+ * sufficiently long.  It should always be defined, but if it isn't
+ * just assume we are using Unicode which requires 6 characters.
+ * (Note that it's not necessarily defined to a constant.)
+ */
+#ifndef MB_CUR_MAX
+#define MB_CUR_MAX 6
+#endif
+
+/* Convert character or string to wide character or string */
+#define ZWC(c)	L ## c
+#define ZWS(s)	L ## s
+
+#else
+
+/* Leave character or string as is. */
+#define ZWC(c)	c
+#define ZWS(s)	s
+
 #endif
Index: Src/Zle/zle.h
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/Zle/zle.h,v
retrieving revision 1.29
diff -u -r1.29 zle.h
--- Src/Zle/zle.h	20 Mar 2006 15:22:58 -0000	1.29
+++ Src/Zle/zle.h	9 Apr 2006 21:33:26 -0000
@@ -33,20 +33,6 @@
 typedef wint_t   ZLE_INT_T;
 #define ZLE_CHAR_SIZE	sizeof(wchar_t)
 
-/*
- * MB_CUR_MAX is the maximum number of bytes that a single wide
- * character will convert into.  We use it to keep strings
- * sufficiently long.  It should always be defined, but if it isn't
- * just assume we are using Unicode which requires 6 characters.
- * (Note that it's not necessarily defined to a constant.)
- */
-#ifndef MB_CUR_MAX
-#define MB_CUR_MAX 6
-#endif
-
-/* Convert character or string to wide character or string */
-#define ZWC(c)	L ## c
-#define ZWS(s)	L ## s
 
 #define ZLEEOF	WEOF
 
@@ -96,10 +82,6 @@
 typedef int ZLE_INT_T;
 #define ZLE_CHAR_SIZE	sizeof(ZLE_CHAR_T)
 
-/* Leave character or string as is. */
-#define ZWC(c)	c
-#define ZWS(s)	s
-
 #define ZLEEOF	EOF
 
 /* Functions that operate on a ZLE_STRING_T. */

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page still at http://www.pwstephenson.fsnet.co.uk/



* Re: PATCH: multibyte characters in patterns.
  2006-04-09 21:38 PATCH: multibyte characters in patterns Peter Stephenson
@ 2006-05-31  1:16 ` Wayne Davison
  2006-05-31  9:25   ` Peter Stephenson
  0 siblings, 1 reply; 6+ messages in thread
From: Wayne Davison @ 2006-05-31  1:16 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

[-- Attachment #1: Type: text/plain, Size: 901 bytes --]

On Sun, Apr 09, 2006 at 10:38:58PM +0100, Peter Stephenson wrote:
> This adds handling for multibyte characters in patterns when the shell
> is compiled with MULTIBYTE_SUPPORT.

FYI, I just checked in a few changes to pattern.c to fix some compiler
warnings about signed/unsigned mismatches.  There were two spots that
were still using an "int" where they should have been using the new
patchar_t typedef.  One of these was assigning a literal -1 to the
variable, so I added a PEOF define to use either WEOF or EOF for this
value (like the Zle code uses -- perhaps we should make a global define
for this in zsh.h?).  Finally, I changed the multibyte version of
patchar_t to use wint_t instead of wchar_t (since this is what the
towupper()-style functions return), and in so doing, I renamed the
typedef to be patint_t (since the non-multibyte version is an "int").
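
A minimal sketch of the reasoning above, not taken from the patch: the
towupper()-style functions take and return wint_t, and WEOF is a wint_t
value that corresponds to no valid character, so the "no character"
sentinel is best tested with an equality rather than "< 0", since wint_t
may be an unsigned type.  The names demo_int_t, DEMO_EOF, demo_fold and
demo_match_nocase are hypothetical stand-ins, not identifiers from
pattern.c.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-ins for patint_t and PEOF in the multibyte build. */
typedef wint_t demo_int_t;
#define DEMO_EOF WEOF

/*
 * Fold a character to upper case unless it is the "no character"
 * sentinel.  The sentinel is tested with ==, not "< 0", because
 * wint_t may be an unsigned type.
 */
demo_int_t
demo_fold(demo_int_t ch)
{
    if (ch == DEMO_EOF)
        return DEMO_EOF;
    return towupper(ch);
}

/* Return 1 if two real characters match ignoring case, 0 otherwise. */
int
demo_match_nocase(demo_int_t a, demo_int_t b)
{
    assert(a != DEMO_EOF && b != DEMO_EOF);
    return demo_fold(a) == demo_fold(b);
}
```

With a plain "int" holding -1, the comparison against an unsigned wide
value is what produces the signed/unsigned warnings mentioned above.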

Attached is the patch.

..wayne..

[-- Attachment #2: pattern.patch --]
[-- Type: text/plain, Size: 3064 bytes --]

--- Src/pattern.c	25 Apr 2006 17:40:26 -0000	1.33
+++ Src/pattern.c	31 May 2006 00:56:56 -0000
@@ -276,7 +276,9 @@ static int patglobflags;  /* globbing fl
  * Increment pointer to metafied multibyte string.
  */
 #ifdef MULTIBYTE_SUPPORT
-typedef wchar_t patchar_t;
+typedef wint_t patint_t;
+
+#define PEOF WEOF
 
 #define METACHARINC(x) ((void)metacharinc(&x))
 
@@ -345,7 +347,9 @@ metacharinc(char **x)
 }
 
 #else
-typedef int patchar_t;
+typedef int patint_t;
+
+#define PEOF EOF
 
 #define METACHARINC(x)	((void)((x) += (*(x) == Meta) ? 2 : 1))
 /*
@@ -1689,7 +1693,7 @@ charsub(char *x, char *y)
 /* Get a character and increment */
 #define CHARREFINC(x, y)	(STOUC(*(x)++))
 /* Counter the number of characters between two pointers, smaller first */
-#define CHARSUB(x,y)	(y-x)
+#define CHARSUB(x,y)	((y) - (x))
 
 #endif /* MULTIBYTE_SUPPORT */
 
@@ -2171,8 +2175,9 @@ patmatch(Upat prog)
     /* Current and next nodes */
     Upat scan = prog, next, opnd;
     char *start, *save, *chrop, *chrend, *compend;
-    int savglobflags, op, no, min, nextch, fail = 0, saverrsfound;
+    int savglobflags, op, no, min, fail = 0, saverrsfound;
     zrange_t from, to, comp;
+    patint_t nextch;
 
     while  (scan) {
 	next = PATNEXT(scan);
@@ -2204,8 +2209,8 @@ patmatch(Upat prog)
 	    while (chrop < chrend && patinput < patinend) {
 		char *savpatinput = patinput;
 		char *savchrop = chrop;
-		patchar_t chin = CHARREFINC(patinput, patinend);
-		patchar_t chpa = CHARREFINC(chrop, chrend);
+		patint_t chin = CHARREFINC(patinput, patinend);
+		patint_t chpa = CHARREFINC(chrop, chrend);
 		if (!CHARMATCH(chin, chpa)) {
 		    fail = 1;
 		    patinput = savpatinput;
@@ -2747,14 +2752,14 @@ patmatch(Upat prog)
 		    }
 		    nextch = CHARREF(nextop, nextop + nextlen);
 		} else
-		    nextch = -1;
+		    nextch = PEOF;
 		savglobflags = patglobflags;
 		saverrsfound = errsfound;
 		lastcharstart = charstart + (patinput - start);
 		if (no >= min) {
 		    for (;;) {
-			int charmatch_cache;
-			if (nextch < 0 ||
+			patint_t charmatch_cache;
+			if (nextch == PEOF ||
 			    (patinput < patinend &&
 			     CHARMATCH_EXPR(CHARREF(patinput, patinend),
 					    nextch))) {
@@ -2862,10 +2867,10 @@ patmatch(Upat prog)
 			 */
 			if (save < patinend && nextin < patinend &&
 			    nextexact < exactend) {
-			    patchar_t cin0 = CHARREF(save, patinend);
-			    patchar_t cpa0 = CHARREF(exactpos, exactend);
-			    patchar_t cin1 = CHARREF(nextin, patinend);
-			    patchar_t cpa1 = CHARREF(nextexact, exactend);
+			    patint_t cin0 = CHARREF(save, patinend);
+			    patint_t cpa0 = CHARREF(exactpos, exactend);
+			    patint_t cin1 = CHARREF(nextin, patinend);
+			    patint_t cpa1 = CHARREF(nextexact, exactend);
 
 			    if (CHARMATCH(cin0, cpa1) &&
 				CHARMATCH(cin1, cpa0)) {
@@ -3154,7 +3159,7 @@ patmatchrange(char *range, int ch)
 static int patrepeat(Upat p, char *charstart)
 {
     int count = 0;
-    patchar_t tch, charmatch_cache;
+    patint_t tch, charmatch_cache;
     char *scan, *opnd;
 
     scan = patinput;


* Re: PATCH: multibyte characters in patterns.
  2006-05-31  1:16 ` Wayne Davison
@ 2006-05-31  9:25   ` Peter Stephenson
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Stephenson @ 2006-05-31  9:25 UTC (permalink / raw)
  To: Zsh hackers list

Wayne Davison wrote:
> FYI, I just checked in a few changes to pattern.c to fix some compiler
> warnings about signed/unsigned mismatches.

Thanks.

> One of these was assigning a literal -1 to the
> variable, so I added a PEOF define to use either WEOF or EOF for this
> value (like the Zle code uses -- perhaps we should make a global define
> for this in zsh.h?).

Yes, I was in the process of moving a whole load of stuff from zle.h to
zsh.h and turning ZLE_CHAR_T into ZCHAR_T, and so on.  It was in aid of
turning the completion code (the libraries, not the interaction with the
lexical analyser in the main shell) into wide characters instead of multibyte
characters, although this bit stood alone.  I stopped because I wondered
whether it was a good idea to have wide characters in completion after
all:  one of the good features at the moment is being able to use
completion even if the locale is a bit off base, and we would probably
lose that.  The part that needs redoing anyway is the matching control,
which is based around 256-octet character sets.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

To access the latest news from CSR copy this link into a web browser:  http://www.csr.com/email_sig.php


* Re: PATCH: multibyte characters in patterns.
@ 2006-04-10 16:00 Peter Stephenson
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Stephenson @ 2006-04-10 16:00 UTC (permalink / raw)
  To: Zsh hackers list

[-- Attachment #1: Type: text/plain, Size: 2011 bytes --]

Vincent Lefevre wrote:
> Could you give examples of what it does exactly?
> Do you mean that "?" can now match a multibyte character?

Yes.  If X is a multibyte character consisting of two bytes (say a with
a grave accent) in the current locale, the following are both true:

[[ X = (#U)?? ]]
[[ X = (#u)? ]]

> Will it also match a UTF-8 character while being in ISO-8859-1 locales?
> (The reason could be to be able to handle data that use another encoding
> than the locales, mainly when data are shared amongst different users
> who use different locales, in which case these data are encoded in UTF-8
> in general.)

You should be able to do this by locally altering the locale, since the
various variables (LANG, LC_*) are special in zsh and will perform the
appropriate setlocale() calls---as long as the system library supports
the locale, obviously.  Making the variable local should be good enough
since specials are set and restored with the correct function calls.
However, I haven't tried this.  (This ability is already present---the
only relevant thing I've changed is that patterns will obey the locale.)
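
At the C level, the set-and-restore that a local LC_CTYPE implies looks
roughly like the following.  This is a sketch of the general setlocale()
idiom under stated assumptions, not zsh's implementation of special
parameters; demo_fn, demo_ctype_is and demo_with_ctype are hypothetical
names.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback type: work to do under the temporary locale. */
typedef int (*demo_fn)(void *arg);

/* Example callback: report whether LC_CTYPE is currently named arg. */
int
demo_ctype_is(void *arg)
{
    const char *cur = setlocale(LC_CTYPE, NULL);

    return cur && strcmp(cur, (const char *)arg) == 0;
}

/*
 * Run fn(arg) with LC_CTYPE temporarily set to `name', then restore the
 * previous setting.  Returns fn's result, or -1 if the locale could not
 * be queried, copied, or set (in which case fn is not called).
 */
int
demo_with_ctype(const char *name, demo_fn fn, void *arg)
{
    const char *cur = setlocale(LC_CTYPE, NULL);
    char *saved;
    int ret;

    assert(name != NULL && fn != NULL);
    if (!cur)
        return -1;
    /* Copy: the string returned by setlocale() may be overwritten. */
    saved = malloc(strlen(cur) + 1);
    if (!saved)
        return -1;
    strcpy(saved, cur);

    if (!setlocale(LC_CTYPE, name)) {
        free(saved);
        return -1;
    }
    ret = fn(arg);
    setlocale(LC_CTYPE, saved);
    free(saved);
    return ret;
}
```

Whether a given locale name is accepted depends on the system library,
as noted above.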

> How about that in UTF-8 locales?
> 
> dixsept:~> foo="bàr"
> dixsept:~> echo $foo[2]

I haven't done anything with parameters yet, so that currently operates
on bytes, but this will be fixed eventually.  The MULTIBYTE option will
apply and we'll presumably need parameter flags equivalent to the
globbing flags; unfortunately this time even (u) and (U) are taken.

> Couldn't an "unused" area of Unicode be used for arbitrary bytes?

I suppose that's possible, but it's not actually guaranteed (and we
don't require) that a wchar_t is actually a Unicode character at all; if
I've finally understood the __STDC_ISO_10646__ stuff, there seem to be
quite a lot of systems where it isn't.
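
The macro can be tested at compile time.  A minimal sketch, with
hypothetical function names: a C implementation defines
__STDC_ISO_10646__ only if wchar_t values are ISO 10646 code points, so
portable code cannot treat a wchar_t as Unicode when the macro is
absent.

```c
#include <assert.h>
#include <wchar.h>

/*
 * Return 1 if the implementation promises that wchar_t values are
 * ISO 10646 (Unicode) code points, 0 if it makes no such promise.
 */
int
demo_wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/*
 * Return the code point stored at *wc, or -1 if the implementation
 * does not promise that wchar_t holds ISO 10646 values.
 */
long
demo_code_point(const wchar_t *wc)
{
    assert(wc != NULL);
    return demo_wchar_is_iso10646() ? (long)*wc : -1L;
}
```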

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070


* Re: PATCH: multibyte characters in patterns.
@ 2006-04-10 15:40 Vincent Lefevre
  0 siblings, 0 replies; 6+ messages in thread
From: Vincent Lefevre @ 2006-04-10 15:40 UTC (permalink / raw)
  To: Zsh hackers list

On 2006-04-09 22:38:58 +0100, Peter Stephenson wrote:
> This adds handling for multibyte characters in patterns when the shell
> is compiled with MULTIBYTE_SUPPORT.  This is activated in two ways:
> 
> - Set the new MULTIBYTE option.  This will eventually cover parameter
> expansion and anything else in the main shell that needs it.  It won't
> cover ZLE; that will always use the locale directly.  The reason for the
> difference is that scripts and functions may trip up on binary input or,
> for example, ISO-8859-1-encoded files that used to be handled properly
> before the locale was taken into account.  Whether it should be turned
> on by default is still to be determined.
> - Use the (#u) globbing flag.  Unfortunately (#m) was already taken;
> it's supposed to suggest "Unicode" or "UTF-8", even though we'll handle
> other character sets.  (#U) is the opposite, as expected.

It could also suggest mUltibyte, and looks like the Greek letter mu
(for MUltibyte). :)

Could you give examples of what it does exactly?
Do you mean that "?" can now match a multibyte character?
Will it also match a UTF-8 character while being in ISO-8859-1 locales?
(The reason could be to be able to handle data that use another encoding
than the locales, mainly when data are shared amongst different users
who use different locales, in which case these data are encoded in UTF-8
in general.)

How about that in UTF-8 locales?

dixsept:~> foo="bàr"
dixsept:~> echo $foo[2]

> I've done virtually no optimisation of the code, and this could make a
> big difference.  Where it used to skip over a character simply with an
> inline test for Meta and a couple of increments, it now always enters a
> function, and with multibyte mode in effect always loops over the system
> test for a character.  The latter is inevitable but that doesn't mean
> the code is as good as it could be.  It would be possible to convert to
> wide characters, although it's complicated by the fact that we need to
> support arbitrary bytes, too; it would have to be done with something
> like a discriminated union of a char or wchar string.  Then we would
> have to convert each test string as well.  I don't know how important
> this is likely to be.

Couldn't an "unused" area of Unicode be used for arbitrary bytes?

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA



* Re: PATCH: multibyte characters in patterns.
@ 2006-04-09 21:53 Peter Stephenson
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Stephenson @ 2006-04-09 21:53 UTC (permalink / raw)
  To: Zsh hackers list

Peter Stephenson wrote:
> It's very likely the result is still typographically challenged.

This fixes compilation with --disable-multibyte.

Index: Src/pattern.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/pattern.c,v
retrieving revision 1.30
diff -u -r1.30 pattern.c
--- Src/pattern.c	9 Apr 2006 21:47:22 -0000	1.30
+++ Src/pattern.c	9 Apr 2006 21:52:13 -0000
@@ -1043,11 +1043,7 @@
     union upat up;
     char *nptr, *str0, *ptr, *patprev;
     zrange_t from, to;
-#ifdef MULTIBYTE_SUPPORT
     char *charstart;
-#else
-    char cbuf[2];
-#endif
 
     flags = 0;
     str0 = patprev = patparse;
@@ -2133,7 +2129,7 @@
 #else
 #define ISUPPER(x)	isupper(x)
 #define ISLOWER(x)	islower(x)
-#define TOUPPER(x)	toupperr(x)
+#define TOUPPER(x)	toupper(x)
 #define TOLOWER(x)	tolower(x)
 #define ISDIGIT(x)	idigit(x)
 #endif
@@ -2722,7 +2718,10 @@
 		if (P_OP(next) == P_EXACTLY && P_LS_LEN(next) &&
 		    !(patglobflags & 0xff)) {
 		    char *nextop = P_LS_STR(next);
+#ifdef MULTIBYTE_SUPPORT
+		    /* else second argument of CHARREF isn't used */
 		    int nextlen = P_LS_LEN(next);
+#endif
 		    /*
 		     * If that P_EXACTLY is last (common in simple patterns,
 		     * such as *.c), then it can be only be matched at one

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page still at http://www.pwstephenson.fsnet.co.uk/



