PATCH: character sets for internal zsh tests

zsh-workers

* PATCH: character sets for internal zsh tests
@ 2005-04-28 11:41 Peter Stephenson
  2005-04-28 14:54 ` Bart Schaefer
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Stephenson @ 2005-04-28 11:41 UTC (permalink / raw)
  To: Zsh hackers list

After the last mail I sent, I was just thinking about quoting of
separators between array elements in vared, and I drifted into thinking
about how it would be useful to have tests for whether a character
was a separator, etc.  You can do things like [$IFS], but (1) that is
fraught with difficulty because in general IFS can contain pretty much
anything, including a "-" or a "!", and (2) you need to apply
additional rules in some cases, such as "IFS whitespace" or word
characters, which always include alphanumerics.  (See my hacks for
[$WORDCHARS] in the Zle function match-words-by-style, for example.)

This patch adds [[:sep:]], [[:wsep:]], [[:ident:]], [[:word:]].  These
are trivial because the tests are already available internally, so we
can get quite a lot from little effort.  The names are simply borrowed
from the internal macros; let me know if you think there are better
names.  I think the last two are OK but maybe [[:ifs:]] and [[:ifsw:]]
or [[:ifsspace:]] would be better for the first two.  Then I will add
tests.

Index: Doc/Zsh/expn.yo
===================================================================
RCS file: /cvsroot/zsh/zsh/Doc/Zsh/expn.yo,v
retrieving revision 1.53
diff -u -r1.53 expn.yo
--- Doc/Zsh/expn.yo	24 Apr 2005 18:38:04 -0000	1.53
+++ Doc/Zsh/expn.yo	28 Apr 2005 11:30:17 -0000
@@ -1224,19 +1224,81 @@
 first character in the list.
 cindex(character classes)
 There are also several named classes of characters, in the form
-`tt([:)var(name)tt(:])' with the following meanings:  `tt([:alnum:])'
-alphanumeric, `tt([:alpha:])' alphabetic,
-`tt([:ascii:])' 7-bit,
-`tt([:blank:])' space or tab,
-`tt([:cntrl:])' control character, `tt([:digit:])' decimal
-digit, `tt([:graph:])' printable character except whitespace,
-`tt([:lower:])' lowercase letter, `tt([:print:])' printable character,
-`tt([:punct:])' printable character neither alphanumeric nor whitespace,
-`tt([:space:])' whitespace character, `tt([:upper:])' uppercase letter, 
-`tt([:xdigit:])' hexadecimal digit.  These use the macros provided by
+`tt([:)var(name)tt(:])' with the following meanings.
+The first set use the macros provided by
 the operating system to test for the given character combinations,
-including any modifications due to local language settings:  see
-manref(ctype)(3).  Note that the square brackets are additional
+including any modifications due to local language settings, see
+manref(ctype)(3):
+
+startitem()
+item(tt([:alnum:]))(
+The character is alphanumeric
+)
+item(tt([:alpha:]))
+(
+The character is alphabetic
+)
+item(tt([:ascii:]))(
+The character is 7-bit, i.e. is a single-byte character without
+the top bit set.
+)
+item(tt([:blank:]))(
+The character is either space or tab
+)
+item(tt([:cntrl:]))(
+The character is a control character
+)
+item(tt([:digit:]))(
+The character is a decimal digit
+)
+item(tt([:graph:]))(
+The character is a printable character other than whitespace
+)
+item(tt([:lower:]))(
+The character is a lowercase letter
+)
+item(tt([:print:]))(
+The character is printable
+)
+item(tt([:punct:]))(
+The character is printable but neither alphanumeric nor whitespace
+)
+item(tt([:space:]))(
+The character is whitespace
+)
+item(tt([:upper:]))(
+The character is an uppercase letter
+)
+item(tt([:xdigit:]))(
+The character is a hexadecimal digit
+)
+enditem()
+
+Another set of tests are handled internally by the shell and
+are not sensitive to the locale:
+
+startitem()
+item(tt([:ident:]))(
+The character is allowed to form part of a shell identifier, such
+as a parameter name
+)
+item(tt([:sep:]))(
+The character is a separator, i.e. is contained in the tt(IFS) parameter
+)
+item(tt([:word:]))(
+The character is treated as part of a word; this test is sensitive
+to the value of the tt(WORDCHARS) parameter
+)
+item(tt([:wsep:]))(
+The character is an IFS white space character; see the documentation
+for tt(IFS) in
+ifzman(the zmanref(zshparams) manual page)\
+ifnzman(noderef(Parameters Used By The Shell))\
+.
+)
+enditem()
+
+Note that the square brackets are additional
 to those enclosing the whole set of characters, so to test for a
 single alphanumeric character you need `tt([[:alnum:]])'.  Named
 character sets can be used alongside other types,
Index: Src/pattern.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/pattern.c,v
retrieving revision 1.26
diff -u -r1.26 pattern.c
--- Src/pattern.c	26 Apr 2005 09:51:29 -0000	1.26
+++ Src/pattern.c	28 Apr 2005 11:30:19 -0000
@@ -193,8 +193,12 @@
 #define PP_SPACE  11
 #define PP_UPPER  12
 #define PP_XDIGIT 13
-#define PP_UNKWN  14
-#define PP_RANGE  15
+#define PP_IDENT  14
+#define PP_SEP    15
+#define PP_WORD   16
+#define PP_WSEP   17
+#define PP_UNKWN  18
+#define PP_RANGE  19
 
 #define	P_OP(p)		((p)->l & 0xff)
 #define	P_NEXT(p)	((p)->l >> 8)
@@ -1118,6 +1122,14 @@
 			    ch = PP_UPPER;
 			else if (!strncmp(patparse, "xdigit", len))
 			    ch = PP_XDIGIT;
+			else if (!strncmp(patparse, "ident", len))
+			    ch = PP_IDENT;
+			else if (!strncmp(patparse, "sep", len))
+			    ch = PP_SEP;
+			else if (!strncmp(patparse, "word", len))
+			    ch = PP_WORD;
+			else if (!strncmp(patparse, "wsep", len))
+			    ch = PP_WSEP;
 			else
 			    ch = PP_UNKWN;
 			patparse = nptr + 2;
@@ -2724,6 +2736,22 @@
 		if (isxdigit(ch))
 		    return 1;
 		break;
+	    case PP_IDENT:
+		if (iident(ch))
+		    return 1;
+		break;
+	    case PP_SEP:
+		if (isep(ch))
+		    return 1;
+		break;
+	    case PP_WORD:
+		if (iword(ch))
+		    return 1;
+		break;
+	    case PP_WSEP:
+		if (iwsep(ch))
+		    return 1;
+		break;
 	    case PP_RANGE:
 		range++;
 		r1 = STOUC(UNMETA(range));

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070


**********************************************************************
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the system manager.

**********************************************************************



* Re: PATCH: character sets for internal zsh tests
  2005-04-28 11:41 PATCH: character sets for internal zsh tests Peter Stephenson
@ 2005-04-28 14:54 ` Bart Schaefer
  2005-04-28 15:09   ` Peter Stephenson
  0 siblings, 1 reply; 5+ messages in thread
From: Bart Schaefer @ 2005-04-28 14:54 UTC (permalink / raw)
  To: Zsh hackers list

On Apr 28, 12:41pm, Peter Stephenson wrote:
} Subject: PATCH: character sets for internal zsh tests

You mean character classes, right?

} This patch adds [[:sep:]], [[:wsep:]], [[:ident:]], [[:word:]].

I like the idea, but I wonder if we're in danger of running into
conflict with future POSIX extensions, particularly with [:word:].
It might be better to use [:_word:] or something.  However, I don't
feel strongly about this.  Ooh, another idea:  make them all-caps,
e.g. [:WORD:].

} I think the last two are OK but maybe [[:ifs:]] and [[:ifsw:]]
} or [[:ifsspace:]] would be better for the first two.

Given the recent austin-group discussion about how "separators" is
misused in "IFS" (they're really "internal field terminators") it may
in fact be better to avoid "sep".  However, unless you go with all-
caps [:IFS:] I think you should pick something that looks like a word
(in the English dictionary sense) or an abbreviation, rather than an
acronym.  Unfortunately I don't have any good suggestions just now.


* Re: PATCH: character sets for internal zsh tests
  2005-04-28 14:54 ` Bart Schaefer
@ 2005-04-28 15:09   ` Peter Stephenson
  2005-04-28 15:26     ` Bart Schaefer
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Stephenson @ 2005-04-28 15:09 UTC (permalink / raw)
  To: Zsh hackers list

Bart Schaefer wrote:
> Ooh, another idea:  make them all-caps, e.g. [:WORD:].

That would be OK.

> Given the recent austin-group discussion about how "separators" is
> misused in "IFS" (they're really "internal field terminators") it may
> in fact be better to avoid "sep".

I'm going to end up with an essay like
[:FIELDSEPARATOREXCEPTTHEYSOMETIMESGETABITPRUNED:]
if I do that.  Not just the variable name but also the documentation
consistently uses the word "separator".  Introducing "terminator" now is
going to confuse the issue.  Hasta la vista.

>  However, unless you go with all-
> caps [:IFS:] I think you should pick something that looks like a word
> (in the English dictionary sense) or an abbreviation, rather than an
> acronym.  Unfortunately I don't have any good suggestions just now.

[:FIELDSEP:] and [:WHITESEP:], perhaps?

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070


* Re: PATCH: character sets for internal zsh tests
  2005-04-28 15:09   ` Peter Stephenson
@ 2005-04-28 15:26     ` Bart Schaefer
  2005-04-28 16:10       ` Peter Stephenson
  0 siblings, 1 reply; 5+ messages in thread
From: Bart Schaefer @ 2005-04-28 15:26 UTC (permalink / raw)
  To: Peter Stephenson, Zsh hackers list

On Apr 28,  4:09pm, Peter Stephenson wrote:
}
} [:FIELDSEP:] and [:WHITESEP:], perhaps?

Those aren't too bad, though I'm less excited about "WHITE".

[:SPACESEP:] ?  (Too much like what the aliens arrived in?)

Just in case I wasn't clear before, in all-caps I think [:IFS:] and
[:IFSSPACE:] are OK.  I didn't like lower-case [:ifs:] or [:ifsw:].


* Re: PATCH: character sets for internal zsh tests
  2005-04-28 15:26     ` Bart Schaefer
@ 2005-04-28 16:10       ` Peter Stephenson
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Stephenson @ 2005-04-28 16:10 UTC (permalink / raw)
  To: Zsh hackers list

Bart Schaefer wrote:
> Just in case I wasn't clear before, in all-caps I think [:IFS:] and
> [:IFSSPACE:] are OK.  I didn't like lower-case [:ifs:] or [:ifsw:].

OK, let's keep the link between [:IFS:] and $IFS explicit.

Index: Doc/Zsh/expn.yo
===================================================================
RCS file: /cvsroot/zsh/zsh/Doc/Zsh/expn.yo,v
retrieving revision 1.53
diff -u -r1.53 expn.yo
--- Doc/Zsh/expn.yo	24 Apr 2005 18:38:04 -0000	1.53
+++ Doc/Zsh/expn.yo	28 Apr 2005 16:08:18 -0000
@@ -1224,19 +1224,82 @@
 first character in the list.
 cindex(character classes)
 There are also several named classes of characters, in the form
-`tt([:)var(name)tt(:])' with the following meanings:  `tt([:alnum:])'
-alphanumeric, `tt([:alpha:])' alphabetic,
-`tt([:ascii:])' 7-bit,
-`tt([:blank:])' space or tab,
-`tt([:cntrl:])' control character, `tt([:digit:])' decimal
-digit, `tt([:graph:])' printable character except whitespace,
-`tt([:lower:])' lowercase letter, `tt([:print:])' printable character,
-`tt([:punct:])' printable character neither alphanumeric nor whitespace,
-`tt([:space:])' whitespace character, `tt([:upper:])' uppercase letter, 
-`tt([:xdigit:])' hexadecimal digit.  These use the macros provided by
+`tt([:)var(name)tt(:])' with the following meanings.
+The first set use the macros provided by
 the operating system to test for the given character combinations,
-including any modifications due to local language settings:  see
-manref(ctype)(3).  Note that the square brackets are additional
+including any modifications due to local language settings, see
+manref(ctype)(3):
+
+startitem()
+item(tt([:alnum:]))(
+The character is alphanumeric
+)
+item(tt([:alpha:]))
+(
+The character is alphabetic
+)
+item(tt([:ascii:]))(
+The character is 7-bit, i.e. is a single-byte character without
+the top bit set.
+)
+item(tt([:blank:]))(
+The character is either space or tab
+)
+item(tt([:cntrl:]))(
+The character is a control character
+)
+item(tt([:digit:]))(
+The character is a decimal digit
+)
+item(tt([:graph:]))(
+The character is a printable character other than whitespace
+)
+item(tt([:lower:]))(
+The character is a lowercase letter
+)
+item(tt([:print:]))(
+The character is printable
+)
+item(tt([:punct:]))(
+The character is printable but neither alphanumeric nor whitespace
+)
+item(tt([:space:]))(
+The character is whitespace
+)
+item(tt([:upper:]))(
+The character is an uppercase letter
+)
+item(tt([:xdigit:]))(
+The character is a hexadecimal digit
+)
+enditem()
+
+Another set of named classes is handled internally by the shell and
+is not sensitive to the locale:
+
+startitem()
+item(tt([:IDENT:]))(
+The character is allowed to form part of a shell identifier, such
+as a parameter name
+)
+item(tt([:IFS:]))(
+The character is used as an input field separator, i.e. is contained in the
+tt(IFS) parameter
+)
+item(tt([:IFSSPACE:]))(
+The character is an IFS white space character; see the documentation
+for tt(IFS) in
+ifzman(the zmanref(zshparams) manual page)\
+ifnzman(noderef(Parameters Used By The Shell))\
+.
+)
+item(tt([:WORD:]))(
+The character is treated as part of a word; this test is sensitive
+to the value of the tt(WORDCHARS) parameter
+)
+enditem()
+
+Note that the square brackets are additional
 to those enclosing the whole set of characters, so to test for a
 single alphanumeric character you need `tt([[:alnum:]])'.  Named
 character sets can be used alongside other types,
Index: Src/pattern.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/pattern.c,v
retrieving revision 1.26
diff -u -r1.26 pattern.c
--- Src/pattern.c	26 Apr 2005 09:51:29 -0000	1.26
+++ Src/pattern.c	28 Apr 2005 16:08:18 -0000
@@ -193,8 +193,12 @@
 #define PP_SPACE  11
 #define PP_UPPER  12
 #define PP_XDIGIT 13
-#define PP_UNKWN  14
-#define PP_RANGE  15
+#define PP_IDENT  14
+#define PP_IFS    15
+#define PP_IFSSPACE   16
+#define PP_WORD   17
+#define PP_UNKWN  18
+#define PP_RANGE  19
 
 #define	P_OP(p)		((p)->l & 0xff)
 #define	P_NEXT(p)	((p)->l >> 8)
@@ -1118,6 +1122,14 @@
 			    ch = PP_UPPER;
 			else if (!strncmp(patparse, "xdigit", len))
 			    ch = PP_XDIGIT;
+			else if (!strncmp(patparse, "IDENT", len))
+			    ch = PP_IDENT;
+			else if (!strncmp(patparse, "IFS", len))
+			    ch = PP_IFS;
+			else if (!strncmp(patparse, "IFSSPACE", len))
+			    ch = PP_IFSSPACE;
+			else if (!strncmp(patparse, "WORD", len))
+			    ch = PP_WORD;
 			else
 			    ch = PP_UNKWN;
 			patparse = nptr + 2;
@@ -2724,6 +2736,22 @@
 		if (isxdigit(ch))
 		    return 1;
 		break;
+	    case PP_IDENT:
+		if (iident(ch))
+		    return 1;
+		break;
+	    case PP_IFS:
+		if (isep(ch))
+		    return 1;
+		break;
+	    case PP_IFSSPACE:
+		if (iwsep(ch))
+		    return 1;
+		break;
+	    case PP_WORD:
+		if (iword(ch))
+		    return 1;
+		break;
 	    case PP_RANGE:
 		range++;
 		r1 = STOUC(UNMETA(range));
Index: Test/D02glob.ztst
===================================================================
RCS file: /cvsroot/zsh/zsh/Test/D02glob.ztst,v
retrieving revision 1.9
diff -u -r1.9 D02glob.ztst
--- Test/D02glob.ztst	16 Mar 2005 11:51:15 -0000	1.9
+++ Test/D02glob.ztst	28 Apr 2005 16:08:18 -0000
@@ -323,3 +323,28 @@
  print glob.tmp/ra=1.0_et=3.5/???
 0:Bug with intermediate paths with plain strings but tokenized characters
 >glob.tmp/ra=1.0_et=3.5/foo
+
+ doesmatch() {
+   setopt localoptions extendedglob
+   print -n $1 $2\ 
+   if [[ $1 = $~2 ]]; then print yes; else print no; fi;
+ }
+ doesmatch MY_IDENTIFIER '[[:IDENT:]]##'
+ doesmatch YOUR:IDENTIFIER '[[:IDENT:]]##'
+ IFS=$'\n' doesmatch $'\n' '[[:IFS:]]'
+ IFS=' ' doesmatch $'\n' '[[:IFS:]]'
+ IFS=':' doesmatch : '[[:IFSSPACE:]]'
+ IFS=' ' doesmatch ' ' '[[:IFSSPACE:]]'
+ WORDCHARS="" doesmatch / '[[:WORD:]]'
+ WORDCHARS="/" doesmatch / '[[:WORD:]]'
+0:Named character sets handled internally
+>MY_IDENTIFIER [[:IDENT:]]## yes
+>YOUR:IDENTIFIER [[:IDENT:]]## no
+>
+> [[:IFS:]] yes
+>
+> [[:IFS:]] no
+>: [[:IFSSPACE:]] no
+>  [[:IFSSPACE:]] yes
+>/ [[:WORD:]] no
+>/ [[:WORD:]] yes

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070



end of thread, other threads:[~2005-04-28 16:10 UTC | newest]


Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/
