Questions about character types

zsh-workers
 help / color / mirror / code / Atom feed

* Questions about character types
@ 2006-07-05 12:14 Peter Stephenson
  2006-07-05 16:31 ` Bart Schaefer
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Stephenson @ 2006-07-05 12:14 UTC (permalink / raw)
  To: Zsh hackers list

I'm just looking at handling character types consistently for multibyte
characters.

The most important issue is what to allow in identifiers (which usually
means parameter names).  The traditional zsh behaviour is to allow all
non-ASCII characters.  Now we can do it properly, it makes sense to
limit these to alphanumerics; this isn't portable, but allowing all
8-bit characters as before seems too gross to continue.  However,
according to POSIX, only characters from the "portable character set",
which in our case means the ASCII subset, are allowed.  The only way
round this I can see is to add an option POSIX_IDENTIFIERS to limit the
behaviour.  I would use an existing option if one seemed relevant but it
doesn't.

The test for module names currently approximates that for identifiers,
only allowing / as well, so I think using the same logic as above would
be the obvious thing to do.

Another question is what to do with user names.  Currently these are
just the ASCII identifier characters plus "-".  Is it useful to extend
these to include alphanumeric characters from the local character set?

Finally, I failed to interpret this code from math.c:

	    if (*ptr == '+' && (unary || !ialnum(*ptr))) {
		ptr++;
		return (unary) ? PREPLUS : POSTPLUS;
	    }

I don't see how ialnum(*ptr) could succeed.  What does this mean?

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

To access the latest news from CSR copy this link into a web browser:  http://www.csr.com/email_sig.php

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Questions about character types
  2006-07-05 12:14 Questions about character types Peter Stephenson
@ 2006-07-05 16:31 ` Bart Schaefer
  2006-07-10 12:51   ` Peter Stephenson
  0 siblings, 1 reply; 3+ messages in thread
From: Bart Schaefer @ 2006-07-05 16:31 UTC (permalink / raw)
  To: Zsh hackers list

On Jul 5,  1:14pm, Peter Stephenson wrote:
}
} according to POSIX, only characters from the "portable character set",
} which in our case means the ASCII subset, are allowed.  The only way
} round this I can see is to add an option POSIX_IDENTIFIERS

Seems OK to me.

} The test for module names currently approximates that for identifiers,
} only allowing / as well, so I think using the same logic as above would
} be the obvious thing to do.

Agreed.

} Another question is what to do with user names.  Currently these are
} just the ASCII identifier characters plus "-".  Is it useful to extend
} these to include alphanumeric characters from the local character set?

My impression is that it would be, but some non-English-speakers should
weigh in.

} Finally, I failed to interpret this code from math.c:
} 
} 	    if (*ptr == '+' && (unary || !ialnum(*ptr))) {
} 		ptr++;

I suspect that's a bug.  It probably originally said

	    if (*ptr++ == '+' && (unary || !ialnum(*ptr))) {

but someone realized it was wrong to increment the pointer if it was NOT
equal to plus, and made an incomplete fix.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Questions about character types
  2006-07-05 16:31 ` Bart Schaefer
@ 2006-07-10 12:51   ` Peter Stephenson
  0 siblings, 0 replies; 3+ messages in thread
From: Peter Stephenson @ 2006-07-10 12:51 UTC (permalink / raw)
  To: Zsh hackers list

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 25958 bytes --]

Bart Schaefer wrote:
> } Another question is what to do with user names.  Currently these are
> } just the ASCII identifier characters plus "-".  Is it useful to extend
> } these to include alphanumeric characters from the local character set?
> 
> My impression is that it would be, but some non-English-speakers should
> weigh in.

I've allowed it; it's particularly useful if you're using the extended
parameter naming rules and the reference is to a named directory.

> } Finally, I failed to interpret this code from math.c:
> } 
> } 	    if (*ptr == '+' && (unary || !ialnum(*ptr))) {
> } 		ptr++;
> 
> I suspect that's a bug.  It probably originally said
> 
> 	    if (*ptr++ == '+' && (unary || !ialnum(*ptr))) {
> 
> but someone realized it was wrong to increment the pointer if it was NOT
> equal to plus, and made an incomplete fix.

I've just assumed this is "&& 1" which is how it's evaluated as far back
as the CVS archive goes without any obvious problems, and hence removed
it.

Here is the patch.  The MULTIBYTE and POSIX_IDENTIFIERS options should
be respected whenever necessary when testing character types.

There's one remaining big job:  I have not yet fixed up IFS to handle
multibyte characters (also isep() and ISEP macros).  That looks a little
messy in places.

The other remaining cases I'm aware of where we still don't test for
multibyte characters should be harmless:
- Some idigit()s.  I don't see any good reason for allowing active
  multibyte digit characters in numerical expressions (for example,
  extra width digits), so anywhere a real digit (rather than just
  a printable character that happens to look like a digit) is required
  it must be 0 to 9 from the portable character set.
- Likewise some iblank()s when inputting text.  Whitespace has to
  be portable whitespace.
- One ialpha() when checking options to builtins, since
  all option letters come from the portable character set.

Index: README
===================================================================
RCS file: /cvsroot/zsh/zsh/README,v
retrieving revision 1.33
diff -u -r1.33 README
--- README	26 Jun 2006 09:57:17 -0000	1.33
+++ README	10 Jul 2006 12:49:17 -0000
@@ -50,11 +50,23 @@
 subsequently by the user.  It is valid for the variable to be unset.
 
 Zsh has previously been lax about whether it allows octets with the
-top bit set to be part of a shell identifier.  With --enable-multibyte set,
-this is now completely disabled.  This is a temporary fix until the main
-shell handles multibyte characters properly and the appropriate library
-tests can be used.  This change may be reviewed if no such permanent fix
-is forthcoming.
+top bit set to be part of a shell identifier.  Older versions of the shell
+assumed all such octets were allowed in identifiers, however the POSIX
+standard does not allow such characters in identifiers.  The older
+behaviour is still obtained with --disable-multibyte in effect.
+With --enable-multibyte set there are three possible cases:
+  MULTIBYTE option unset:  only ASCII characters are allowed; the
+    shell does not attempt to identify non-ASCII characters at all.
+  MULTIBYTE option set, POSIX_IDENTIFIERS option unset: in addition
+    to the POSIX characters, any alphanumeric characters in the
+    local character set are allowed.  Note that scripts and functions that
+    take advantage of this are non-portable; however, this is in the spirit
+    of previous versions of the shell.  Note also that the options must
+    be set before the shell parses the script or function; setting
+    them during execution is not sufficient.
+  MULITBYTE option set, POSIX_IDENTIFIERS set:  only ASCII characters
+    are allowed in identifiers even though the shell will recognise
+    alphanumeric multibyte characters.
 
 The completion style pine-directory must now be set to use completion
 for PINE mailbox folders; previously it had the default ~/mail.  This
Index: Doc/Zsh/options.yo
===================================================================
RCS file: /cvsroot/zsh/zsh/Doc/Zsh/options.yo,v
retrieving revision 1.46
diff -u -r1.46 options.yo
--- Doc/Zsh/options.yo	9 Apr 2006 21:47:22 -0000	1.46
+++ Doc/Zsh/options.yo	10 Jul 2006 12:49:18 -0000
@@ -1204,6 +1204,27 @@
 tt(trap) and
 tt(unset).
 )
+pindex(POSIX_IDENTIFIERS)
+cindex(identifiers, non-portable characters in)
+cindex(parameter names, non-portable characters in)
+item(tt(POSIX_IDENTIFIERS) <K> <S>)(
+When this option is set, only the ASCII characters tt(a) to tt(z), tt(A) to
+tt(Z), tt(0) to tt(9) and tt(_) may be used in identifiers (names
+of shell parameters and modules).
+
+When the option is unset and multibyte character support is enabled (i.e. it
+is compiled in and the option tt(MULTIBYTE) is set), then additionally any
+alphanumeric characters in the local character set may be used in
+identifiers.  Note that scripts and functions written with this feature are
+not portable, and also that both options must be set before the script
+or function is parsed; setting them during execution is not sufficient
+as the syntax var(variable)tt(=)var(value) has already been parsed as
+a command rather than an assignment.
+
+If multibyte character support is not compiled into the shell this option is
+ignored; all octets with the top bit set may be used in identifiers.
+This is non-standard but is the traditional zsh behaviour.
+)
 pindex(SH_FILE_EXPANSION)
 cindex(sh, expansion style)
 cindex(expansion style, sh)
Index: Src/builtin.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/builtin.c,v
retrieving revision 1.158
diff -u -r1.158 builtin.c
--- Src/builtin.c	30 May 2006 22:35:03 -0000	1.158
+++ Src/builtin.c	10 Jul 2006 12:49:20 -0000
@@ -2629,9 +2629,7 @@
 	    char *modname = NULL;
 	    char *ptr;
 
-	    for (ptr = funcname; *ptr; ptr++)
-		if (!iident(*ptr))
-		    break;
+	    ptr = itype_end(funcname, IIDENT, 0);
 	    if (idigit(*funcname) || funcname == ptr || *ptr) {
 		zwarnnam(name, "-M %s: bad math function name", funcname);
 		return 1;
Index: Src/glob.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/glob.c,v
retrieving revision 1.51
diff -u -r1.51 glob.c
--- Src/glob.c	30 May 2006 22:35:03 -0000	1.51
+++ Src/glob.c	10 Jul 2006 12:49:21 -0000
@@ -1443,9 +1443,7 @@
 
 		    if (s[-1] == '+') {
 			plus = 0;
-			tt = s;
-			while (iident(*tt))
-			    tt++;
+			tt = itype_end(s, IIDENT, 0);
 			if (tt == s)
 			{
 			    zerr("missing identifier after `+'");
Index: Src/lex.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/lex.c,v
retrieving revision 1.33
diff -u -r1.33 lex.c
--- Src/lex.c	30 May 2006 22:35:03 -0000	1.33
+++ Src/lex.c	10 Jul 2006 12:49:22 -0000
@@ -1135,10 +1135,13 @@
 		if (idigit(*t))
 		    while (++t < bptr && idigit(*t));
 		else {
-		    while (iident(*t) && ++t < bptr);
+		    int sav = *bptr;
+		    *bptr = '\0';
+		    t = itype_end(t, IIDENT, 0);
 		    if (t < bptr) {
-			*bptr = '\0';
 			skipparens(Inbrack, Outbrack, &t);
+		    } else {
+			*bptr = sav;
 		    }
 		}
 		if (*t == '+')
Index: Src/math.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/math.c,v
retrieving revision 1.25
diff -u -r1.25 math.c
--- Src/math.c	30 Jun 2006 09:41:35 -0000	1.25
+++ Src/math.c	10 Jul 2006 12:49:22 -0000
@@ -265,11 +265,12 @@
 {
     int cct = 0;
     yyval.type = MN_INTEGER;
+    char *ie;
 
     for (;; cct = 0)
 	switch (*ptr++) {
 	case '+':
-	    if (*ptr == '+' && (unary || !ialnum(*ptr))) {
+	    if (*ptr == '+') {
 		ptr++;
 		return (unary) ? PREPLUS : POSTPLUS;
 	    }
@@ -279,7 +280,7 @@
 	    }
 	    return (unary) ? UPLUS : PLUS;
 	case '-':
-	    if (*ptr == '-' && (unary || !ialnum(*ptr))) {
+	    if (*ptr == '-') {
 		ptr++;
 		return (unary) ? PREMINUS : POSTMINUS;
 	    }
@@ -469,12 +470,12 @@
 		}
 		cct = 1;
 	    }
-	    if (iident(*ptr)) {
+	    if ((ie = itype_end(ptr, IIDENT, 0)) != ptr) {
 		int func = 0;
 		char *p;
 
 		p = ptr;
-		while (iident(*++ptr));
+		ptr = ie;
 		if (*ptr == '[' || (!cct && *ptr == '(')) {
 		    char op = *ptr, cp = ((*ptr == '[') ? ']' : ')');
 		    int l;
Index: Src/module.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/module.c,v
retrieving revision 1.22
diff -u -r1.22 module.c
--- Src/module.c	30 May 2006 22:35:03 -0000	1.22
+++ Src/module.c	10 Jul 2006 12:49:23 -0000
@@ -734,12 +734,8 @@
 modname_ok(char const *p)
 {
     do {
-	if(*p != '_' && !ialnum(*p))
-	    return 0;
-	do {
-	    p++;
-	} while(*p == '_' || ialnum(*p));
-	if(!*p)
+	p = itype_end(p, IIDENT, 0);
+	if (!*p)
 	    return 1;
     } while(*p++ == '/');
     return 0;
Index: Src/options.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/options.c,v
retrieving revision 1.28
diff -u -r1.28 options.c
--- Src/options.c	30 May 2006 22:35:03 -0000	1.28
+++ Src/options.c	10 Jul 2006 12:49:23 -0000
@@ -176,6 +176,7 @@
 {{NULL, "overstrike",	      0},			 OVERSTRIKE},
 {{NULL, "pathdirs",	      OPT_EMULATE},		 PATHDIRS},
 {{NULL, "posixbuiltins",      OPT_EMULATE|OPT_BOURNE},	 POSIXBUILTINS},
+{{NULL, "posixidentifiers",   OPT_EMULATE|OPT_BOURNE},	 POSIXIDENTIFIERS},
 {{NULL, "printeightbit",      0},                        PRINTEIGHTBIT},
 {{NULL, "printexitvalue",     0},			 PRINTEXITVALUE},
 {{NULL, "privileged",	      OPT_SPECIAL},		 PRIVILEGED},
Index: Src/params.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/params.c,v
retrieving revision 1.116
diff -u -r1.116 params.c
--- Src/params.c	27 Jun 2006 16:28:46 -0000	1.116
+++ Src/params.c	10 Jul 2006 12:49:24 -0000
@@ -899,9 +899,7 @@
 		break;
     } else {
 	/* Find the first character in `s' not in the iident type table */
-	for (ss = s; *ss; ss++)
-	    if (!iident(*ss))
-		break;
+	ss = itype_end(s, IIDENT, 0);
     }
 
     /* If the next character is not [, then it is *
@@ -1653,7 +1651,7 @@
 mod_export Value
 fetchvalue(Value v, char **pptr, int bracks, int flags)
 {
-    char *s, *t;
+    char *s, *t, *ie;
     char sav, c;
     int ppar = 0;
 
@@ -1665,9 +1663,8 @@
 	else
 	    ppar = *s++ - '0';
     }
-    else if (iident(c))
-	while (iident(*s))
-	    s++;
+    else if ((ie = itype_end(s, IIDENT, 0)) != s)
+	s = ie;
     else if (c == Quest)
 	*s++ = '?';
     else if (c == Pound)
@@ -1732,7 +1729,7 @@
 		return v;
 	    }
 	} else if (!(flags & SCANPM_ASSIGNING) && v->isarr &&
-		   iident(*t) && isset(KSHARRAYS))
+		   itype_end(t, IIDENT, 1) != t && isset(KSHARRAYS))
 	    v->end = 1, v->isarr = 0;
     }
     if (!bracks && *s)
Index: Src/parse.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/parse.c,v
retrieving revision 1.56
diff -u -r1.56 parse.c
--- Src/parse.c	9 Jul 2006 14:47:22 -0000	1.56
+++ Src/parse.c	10 Jul 2006 12:49:26 -0000
@@ -1603,10 +1603,7 @@
 
 		if (*ptr == Outbrace && ptr > tokstr + 1)
 		{
-		    while (--ptr > tokstr)
-			if (!iident(*ptr))
-			    break;
-		    if (ptr == tokstr)
+		    if (itype_end(tokstr, IIDENT, 0) >= ptr - 1)
 		    {
 			char *toksave = tokstr;
 			char *idstring = dupstrpfx(tokstr+1, eptr-tokstr-1);
Index: Src/subst.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/subst.c,v
retrieving revision 1.53
diff -u -r1.53 subst.c
--- Src/subst.c	28 Jun 2006 14:34:27 -0000	1.53
+++ Src/subst.c	10 Jul 2006 12:49:28 -0000
@@ -475,15 +475,14 @@
 		return 0;
 	    *namptr = dyncat(ds, ptr);
 	    return 1;
-	} else if (iuser(str[1])) {   /* ~foo */
-	    char *ptr, *hom, save;
+	} else if ((ptr = itype_end(str+1, IUSER, 0)) != str+1) {   /* ~foo */
+	    char *hom, save;
 
-	    for (ptr = ++str; *ptr && iuser(*ptr); ptr++);
 	    save = *ptr;
 	    if (!isend(save))
 		return 0;
 	    *ptr = 0;
-	    if (!(hom = getnameddir(str))) {
+	    if (!(hom = getnameddir(++str))) {
 		if (isset(NOMATCH))
 		    zerr("no such user or named directory: %s", str);
 		*ptr = save;
@@ -1146,9 +1145,10 @@
      * Shouldn't this be a table or something?  We test for all
      * these later on, too.
      */
-    if (!ialnum(c = *s) && c != '#' && c != Pound && c != '-' &&
-	c != '!' && c != '$' && c != String && c != Qstring &&
-	c != '?' && c != Quest && c != '_' &&
+    c = *s;
+    if (itype_end(s, IIDENT, 1) == s && *s != '#' && c != Pound &&
+	c != '-' && c != '!' && c != '$' && c != String && c != Qstring &&
+	c != '?' && c != Quest &&
 	c != '*' && c != Star && c != '@' && c != '{' &&
 	c != Inbrace && c != '=' && c != Equals && c != Hat &&
 	c != '^' && c != '~' && c != Tilde && c != '+') {
@@ -1446,8 +1446,8 @@
 	    } else
 		spbreak = 2;
 	} else if ((c == '#' || c == Pound) &&
-		   (iident(cc = s[1])
-		    || cc == '*' || cc == Star || cc == '@'
+		   (itype_end(s+1, IIDENT, 0) != s + 1
+		    || (cc = s[1]) == '*' || cc == Star || cc == '@'
 		    || cc == '-' || (cc == ':' && s[2] == '-')
 		    || (isstring(cc) && (s[2] == Inbrace || s[2] == Inpar)))) {
 	    getlen = 1 + whichlen, s++;
@@ -1471,7 +1471,7 @@
 	     * Try to handle this when parameter is named
 	     * by (P) (second part of test).
 	     */
-	    if (iident(s[1]) || (aspar && isstring(s[1]) &&
+	    if (itype_end(s+1, IIDENT, 0) != s+1 || (aspar && isstring(s[1]) &&
 				 (s[2] == Inbrace || s[2] == Inpar)))
 		chkset = 1, s++;
 	    else if (!inbrace) {
Index: Src/utils.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/utils.c,v
retrieving revision 1.126
diff -u -r1.126 utils.c
--- Src/utils.c	30 Jun 2006 09:41:35 -0000	1.126
+++ Src/utils.c	10 Jul 2006 12:49:29 -0000
@@ -1921,7 +1921,7 @@
 	return;
     if (**s == String && !*t) {
 	guess = *s + 1;
-	if (*t || !ialpha(*guess))
+	if (itype_end(guess, IIDENT, 1) == guess)
 	    return;
 	ic = String;
 	d = 100;
@@ -2750,11 +2750,8 @@
  * iident() macro extended to support wide characters.
  *
  * The macro is intended to test if a character is allowed in an
- * internal zsh identifier.  Until the main shell handles multibyte
- * characters it's not a good idea to allow characters other than
- * ASCII characters; it would cause zle to allow characters that
- * the main shell would reject.  Eventually we should be able
- * to allow all alphanumerics.
+ * internal zsh identifier.  We allow all alphanumerics outside
+ * the ASCII range unless POSIXIDENTIFIERS is set.
  *
  * Otherwise similar to wcsiword.
  */
@@ -2774,14 +2771,90 @@
     } else if (len == 1 && iascii(*outstr)) {
 	return iident(*outstr);
     } else {
-	/* TODO: not currently allowed, see above */
-	return 0;
+	return !isset(POSIXIDENTIFIERS) && iswalnum(c);
     }
 }
 /**/
 #endif
 
 
+/*
+ * Find the end of a set of characters in the set specified by itype;
+ * one of IALNUM, IIDENT, IWORD or IUSER.  For non-ASCII characters, we assume
+ * alphanumerics are part of the set, with the exception that
+ * identifiers are not treated that way if POSIXIDENTIFIERS is set.
+ *
+ * See notes above for identifiers.
+ * Returns the same pointer as passed if not on an identifier character.
+ * If "once" is set, just test the first character, i.e. (outptr !=
+ * inptr) tests whether the first character is valid in an identifier.
+ *
+ * Currently this is only called with itype IIDENT or IUSER.
+ */
+
+/**/
+mod_export char *
+itype_end(const char *ptr, int itype, int once)
+{
+#ifdef MULTIBYTE_SUPPORT
+    if (isset(MULTIBYTE) &&
+	(itype != IIDENT || !isset(POSIXIDENTIFIERS))) {
+	mb_metacharinit();
+	while (*ptr) {
+	    wint_t wc;
+	    int len = mb_metacharlenconv(ptr, &wc);
+
+	    if (!len)
+		break;
+
+	    if (wc == WEOF) {
+		/* invalid, treat as single character */
+		int chr = STOUC(*ptr == Meta ? ptr[1] ^ 32 : *ptr);
+		/* in this case non-ASCII characters can't match */
+		if (chr > 127 || !zistype(chr,itype))
+		    break;
+	    } else if (len == 1 && iascii(*ptr)) {
+		/* ASCII: can't be metafied, use standard test */
+		if (!zistype(*ptr,itype))
+		    break;
+	    } else {
+		/*
+		 * Valid non-ASCII character.  Allow all alphanumerics;
+		 * if testing for words, allow all wordchars.
+		 */
+		if (!(iswalnum(wc) ||
+		      (itype == IWORD && wcschr(wordchars_wide, wc))))
+		    break;
+	    }
+	    ptr += len;
+
+	    if (once)
+		break;
+	}
+    } else
+#endif
+	for (;;) {
+	    int chr = STOUC(*ptr == Meta ? ptr[1] ^ 32 : *ptr);
+	    if (!zistype(chr,itype))
+		break;
+	    ptr += (*ptr == Meta) ? 2 : 1;
+
+	    if (once)
+		break;
+	}
+
+    /*
+     * Nasty.  The first argument is const char * because we
+     * don't modify it here.  However, we really want to pass
+     * back the same type as was passed down, to allow idioms like
+     *   p = itype_end(p, IIDENT, 0);
+     * So returning a const char * isn't really the right thing to do.
+     * Without having two different functions the following seems
+     * to be the best we can do.
+     */
+    return (char *)ptr;
+}
+
 /**/
 mod_export char **
 arrdup(char **s)
@@ -3710,9 +3783,10 @@
 
 /**/
 int
-mb_metacharlenconv(char *s, wint_t *wcp)
+mb_metacharlenconv(const char *s, wint_t *wcp)
 {
-    char inchar, *ptr;
+    char inchar;
+    const char *ptr;
     size_t ret;
     wchar_t wc;
 
Index: Src/zsh.h
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/zsh.h,v
retrieving revision 1.92
diff -u -r1.92 zsh.h
--- Src/zsh.h	9 Jul 2006 14:47:22 -0000	1.92
+++ Src/zsh.h	10 Jul 2006 12:49:30 -0000
@@ -1610,6 +1610,7 @@
     OVERSTRIKE,
     PATHDIRS,
     POSIXBUILTINS,
+    POSIXIDENTIFIERS,
     PRINTEIGHTBIT,
     PRINTEXITVALUE,
     PRIVILEGED,
Index: Src/ztype.h
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/ztype.h,v
retrieving revision 1.3
diff -u -r1.3 ztype.h
--- Src/ztype.h	1 Nov 2005 02:50:22 -0000	1.3
+++ Src/ztype.h	10 Jul 2006 12:49:30 -0000
@@ -42,22 +42,22 @@
 #define IMETA    (1 << 12)
 #define IWSEP    (1 << 13)
 #define INULL    (1 << 14)
-#define _icom(X,Y) (typtab[STOUC(X)] & Y)
-#define idigit(X) _icom(X,IDIGIT)
-#define ialnum(X) _icom(X,IALNUM)
-#define iblank(X) _icom(X,IBLANK)	/* blank, not including \n */
-#define inblank(X) _icom(X,INBLANK)	/* blank or \n */
-#define itok(X) _icom(X,ITOK)
-#define isep(X) _icom(X,ISEP)
-#define ialpha(X) _icom(X,IALPHA)
-#define iident(X) _icom(X,IIDENT)
-#define iuser(X) _icom(X,IUSER)	/* username char */
-#define icntrl(X) _icom(X,ICNTRL)
-#define iword(X) _icom(X,IWORD)
-#define ispecial(X) _icom(X,ISPECIAL)
-#define imeta(X) _icom(X,IMETA)
-#define iwsep(X) _icom(X,IWSEP)
-#define inull(X) _icom(X,INULL)
+#define zistype(X,Y) (typtab[STOUC(X)] & Y)
+#define idigit(X) zistype(X,IDIGIT)
+#define ialnum(X) zistype(X,IALNUM)
+#define iblank(X) zistype(X,IBLANK)	/* blank, not including \n */
+#define inblank(X) zistype(X,INBLANK)	/* blank or \n */
+#define itok(X) zistype(X,ITOK)
+#define isep(X) zistype(X,ISEP)
+#define ialpha(X) zistype(X,IALPHA)
+#define iident(X) zistype(X,IIDENT)
+#define iuser(X) zistype(X,IUSER)	/* username char */
+#define icntrl(X) zistype(X,ICNTRL)
+#define iword(X) zistype(X,IWORD)
+#define ispecial(X) zistype(X,ISPECIAL)
+#define imeta(X) zistype(X,IMETA)
+#define iwsep(X) zistype(X,IWSEP)
+#define inull(X) zistype(X,INULL)
 
 #define iascii(X) isascii(STOUC(X))
 #define ilower(X) islower(STOUC(X))
Index: Src/Zle/compcore.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/Zle/compcore.c,v
retrieving revision 1.83
diff -u -r1.83 compcore.c
--- Src/Zle/compcore.c	7 Mar 2006 12:52:28 -0000	1.83
+++ Src/Zle/compcore.c	10 Jul 2006 12:49:31 -0000
@@ -1081,7 +1081,7 @@
     }
     if ((*p == String || *p == Qstring) && p[1] != Inpar && p[1] != Inbrack) {
 	/* This is really a parameter expression (not $(...) or $[...]). */
-	char *b = p + 1, *e = b;
+	char *b = p + 1, *e = b, *ie;
 	int n = 0, br = 1, nest = 0;
 
 	if (*b == Inbrace) {
@@ -1124,10 +1124,16 @@
 	else if (idigit(*e))
 	    while (idigit(*e))
 		e++;
-	else if (iident(*e))
-	    while (iident(*e) ||
-		   (comppatmatch && *comppatmatch && (*e == Star || *e == Quest)))
-		e++;
+	else if ((ie = itype_end(e, IIDENT, 0)) != e) {
+	    do {
+		e = ie;
+		if (comppatmatch && *comppatmatch &&
+		    (*e == Star || *e == Quest))
+		    ie = e + 1;
+		else
+		    ie = itype_end(e, IIDENT, 0);
+	    } while (ie != e);
+	}
 
 	/* Now make sure that the cursor is inside the name. */
 	if (offs <= e - s && offs >= b - s && n <= 0) {
Index: Src/Zle/zle_tricky.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/Zle/zle_tricky.c,v
retrieving revision 1.67
diff -u -r1.67 zle_tricky.c
--- Src/Zle/zle_tricky.c	30 May 2006 22:35:04 -0000	1.67
+++ Src/Zle/zle_tricky.c	10 Jul 2006 12:49:32 -0000
@@ -551,9 +551,8 @@
 	else if (idigit(*e))
 	    while (idigit(*e))
 		e++;
-	else if (iident(*e))
-	    while (iident(*e))
-		e++;
+	else
+	    e = itype_end(e, IIDENT, 0);
 
 	/* Now make sure that the cursor is inside the name. */
 	if (offs <= e - s && offs >= b - s && n <= 0) {
@@ -740,8 +739,7 @@
 			    else if (idigit(*q))
 				do q++; while (idigit(*q));
 			    else
-				while (iident(*q))
-				    q++;
+				q = itype_end(q, IIDENT, 0);
 			    sav = *q;
 			    *q = '\0';
 			    if (zlemetacs - wb == q - s &&
@@ -1293,7 +1291,7 @@
 	if (varq)
 	    tt = clwords[clwpos];
 
-	for (s = tt; iident(*s); s++);
+	s = itype_end(tt, IIDENT, 0);
 	sav = *s;
 	*s = '\0';
 	zsfree(varname);
@@ -1360,17 +1358,29 @@
      * as being in math.                                              */
     if (inwhat != IN_MATH) {
 	int i = 0;
-	char *nnb = (iident(*s) ? s : s + 1), *nb = NULL, *ne = NULL;
-	
-	for (tt = s; ++tt < s + zlemetacs - wb;)
+	char *nnb, *nb = NULL, *ne = NULL;
+
+	MB_METACHARINIT();
+	if (itype_end(s, IIDENT, 1) == s)
+	    nnb = s + MB_METACHARLEN(s);
+	else
+	    nnb = s;
+	for (tt = s; tt < s + zlemetacs - wb;) {
 	    if (*tt == Inbrack) {
 		i++;
 		nb = nnb;
 		ne = tt;
-	    } else if (i && *tt == Outbrack)
+		tt++;
+	    } else if (i && *tt == Outbrack) {
 		i--;
-	    else if (!iident(*tt))
-		nnb = tt + 1;
+		tt++;
+	    } else {
+		int nclen = MB_METACHARLEN(tt);
+		if (itype_end(tt, IIDENT, 1) == tt)
+		    nnb = tt + nclen;
+		tt += nclen;
+	    }
+	}
 	if (i) {
 	    inwhat = IN_MATH;
 	    insubscr = 1;
@@ -1415,33 +1425,59 @@
 	    /* In mathematical expression, we complete parameter names  *
 	     * (even if they don't have a `$' in front of them).  So we *
 	     * have to find that name.                                  */
-	    for (we = zlemetacs; iident(zlemetaline[we]); we++);
-	    for (wb = zlemetacs; --wb >= 0 && iident(zlemetaline[wb]););
-	    wb++;
+	    char *cspos = zlemetaline + zlemetacs, *wptr, *cptr;
+	    we = itype_end(cspos, IIDENT, 0) - cspos;
+
+	    /*
+	     * With multibyte characters we need to go forwards,
+	     * so start at the beginning of the line and continue
+	     * until cspos.
+	     */
+	    wptr = cptr = zlemetaline;
+	    for (;;) {
+		cptr = itype_end(wptr, IIDENT, 0);
+		if (cptr == wptr) {
+		    /* not an ident character */
+		    wptr = (cptr += MB_METACHARLEN(cptr));
+		}
+		if (cptr >= cspos) {
+		    wb = wptr - zlemetaline;
+		    break;
+		}
+	    }
 	}
 	zsfree(s);
 	s = zalloc(we - wb + 1);
 	strncpy(s, zlemetaline + wb, we - wb);
 	s[we - wb] = '\0';
-	if (wb > 2 && zlemetaline[wb - 1] == '[' &&
-	    iident(zlemetaline[wb - 2])) {
-	    int i = wb - 3;
-	    char sav = zlemetaline[wb - 1];
 
-	    while (i >= 0 && iident(zlemetaline[i]))
-		i--;
+	if (wb > 2 && zlemetaline[wb - 1] == '[') {
+	    char *sqbr = zlemetaline + wb - 1, *cptr, *wptr;
 
-	    zlemetaline[wb - 1] = '\0';
-	    zsfree(varname);
-	    varname = ztrdup(zlemetaline + i + 1);
-	    zlemetaline[wb - 1] = sav;
-	    if ((keypm = (Param) paramtab->getnode(paramtab, varname)) &&
-		(keypm->node.flags & PM_HASHED)) {
-		if (insubscr != 3)
-		    insubscr = 2;
-	    } else
-		insubscr = 1;
+	    /* Need to search forward for word characters */
+	    cptr = wptr = zlemetaline;
+	    for (;;) {
+		cptr = itype_end(wptr, IIDENT, 0);
+		if (cptr == wptr) {
+		    /* not an ident character */
+		    wptr = (cptr += MB_METACHARLEN(cptr));
+		}
+		if (cptr >= sqbr)
+		    break;
+	    }
+
+	    if (wptr < sqbr) {
+		zsfree(varname);
+		varname = ztrduppfx(wptr, sqbr - wptr);
+		if ((keypm = (Param) paramtab->getnode(paramtab, varname)) &&
+		    (keypm->node.flags & PM_HASHED)) {
+		    if (insubscr != 3)
+			insubscr = 2;
+		} else
+		    insubscr = 1;
+	    }
 	}
+
 	parse_subst_string(s);
     }
     /* This variable will hold the current word in quoted form. */
@@ -1562,12 +1598,12 @@
 			*tp == '@')
 			p++, i++;
 		    else {
+			char *ie;
 			if (idigit(*tp))
 			    while (idigit(*tp))
 				tp++;
-			else if (iident(*tp))
-			    while (iident(*tp))
-				tp++;
+			else if ((ie = itype_end(tp, IIDENT, 0)) != tp)
+			    tp = ie;
 			else {
 			    tt = NULL;
 			    break;
Index: Test/D07multibyte.ztst
===================================================================
RCS file: /cvsroot/zsh/zsh/Test/D07multibyte.ztst,v
retrieving revision 1.4
diff -u -r1.4 D07multibyte.ztst
--- Test/D07multibyte.ztst	30 Jun 2006 09:41:35 -0000	1.4
+++ Test/D07multibyte.ztst	10 Jul 2006 12:49:32 -0000
@@ -165,3 +165,12 @@
 >165
 >163
 >945 945
+
+  unsetopt posix_identifiers
+  expr='hähä=3 || exit 1; print $hähä'
+  eval $expr
+  setopt posix_identifiers
+  (eval $expr)
+1:POSIX_IDENTIFIERS option
+>3
+?(eval):1: command not found: hähä=3

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2006-07-10 12:52 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-07-05 12:14 Questions about character types Peter Stephenson
2006-07-05 16:31 ` Bart Schaefer
2006-07-10 12:51   ` Peter Stephenson

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).