PATCH: PCRE support for embedded NUL characters

zsh-workers
 help / color / mirror / code / Atom feed

* PATCH: PCRE support for embedded NUL characters
@ 2012-09-16 12:50 Phil Pennock
  2012-09-17  5:59 ` Phil Pennock
  2012-09-17  8:57 ` Peter Stephenson
  0 siblings, 2 replies; 8+ messages in thread
From: Phil Pennock @ 2012-09-16 12:50 UTC (permalink / raw)
  To: zsh-workers

This patch does not touch the docs or code for non-PCRE.  It just
changes PCRE: code, docs, tests.  As I wrote this mail, I realised a
couple of issues which lead me to think I shouldn't just commit this as
is; there are open questions below for Peter/Bart.

Moritz Bunkus reported a problem which boiled down to regular
expressions containing ASCII NUL characters; the support for UTF-8
multibyte characters in regular expressions meant that we no longer
passed zsh's internal metafied forms to the regular expression
libraries, but this also meant that NUL characters in zsh now make it
down.  There's no way I can see to deal with this for zsh/regex, but for
zsh/pcre we can hack around it.

More careful tracking of length, instead of using strlen(), lets us pass
the search space (haystack) down intact.  For the regular expression, as
part of the unmetafy() I now check to see if strlen() doesn't match the
decoded length, in which case there's a NUL somewhere, and then all the
NULs get replaced with \x00 in the pattern.

One thing that occurs to me now: what's the correct expectation if the
pattern contains two characters, "backslash NUL"?  Before, that broke;
with this, it becomes \\x00 which won't match.  \CHAR for a non-letter
CHAR should remove special handling and treat the character as itself.
Fixing this seems to require full string parsing in zsh with knowledge
of the regexp escape sequences.  Document it as a limitation?

In addition, fixing the main NUL issues revealed that I didn't clean up
some meta-based length measuring for $mbegin/$mend setting; because we
only care about the offsets, and we have those _directly_ now that we're
not passing metafied strings in, that code got a lot simpler and so got
rid of an infinitish loop on hitting a NUL.

Another open question: are $mbegin/$mend offsets supposed to be in
octets or in characters?  Given the MB_ prefix, I'm guessing I just
broke this and will need to fix it tomorrow, before commit, after I get
some sleep.  Do we have a decent way to count the number of wide
characters in an unmetafied string which can contain NUL characters?

Right now, I'm not seeing a way to take offsets into an unmetafied
string and turn them into offsets into the corresponding metafied
string, to then be able to use the length measurement tools we already
have, so I _think_ the simplification stays, but I then need to
compensate separately for the wide characters.

Also, this means I need to expand the test suite for the multibyte
character regexps to report the indices of matches, because the
testsuite passed with this change, and since I now believe that to be a
regression, that seems to be a sign of incomplete coverage.  The
function I added to the %prep should be helpful there.

(And a few lines here come from leading/trailing whitespace/tab fixes I
 made while I was in)


Index: Doc/Zsh/mod_pcre.yo
===================================================================
RCS file: /home/cvsroot/remote-repos/zsh-repo/zsh/Doc/Zsh/mod_pcre.yo,v
retrieving revision 1.8
diff -a -u -p -r1.8 mod_pcre.yo
--- Doc/Zsh/mod_pcre.yo	25 Mar 2009 11:29:14 -0000	1.8
+++ Doc/Zsh/mod_pcre.yo	16 Sep 2012 12:25:58 -0000
@@ -2,6 +2,15 @@ COMMENT(!MOD!zsh/pcre
 Interface to the PCRE library.
 !MOD!)
 cindex(regular expressions, perl-compatible)
+The tt(zsh/pcre) module is one of two providers of regular expression
+support for zsh.  If the tt(RE_MATCH_PCRE) option is set, then the
+tt(=~) conditional operator will use this module.
+
+zsh's PCRE support is multi-byte aware and handles embedded NUL characters
+in both the string to be searched and the string specifying the regular
+expression; for the latter case, the NUL will be replaced with tt(\x00)
+before compiling the expression.
+
 The tt(zsh/pcre) module makes some commands available as builtins:
 
 startitem()
Index: Src/Modules/pcre.c
===================================================================
RCS file: /home/cvsroot/remote-repos/zsh-repo/zsh/Src/Modules/pcre.c,v
retrieving revision 1.21
diff -a -u -p -r1.21 pcre.c
--- Src/Modules/pcre.c	6 Jan 2012 10:08:54 -0000	1.21
+++ Src/Modules/pcre.c	16 Sep 2012 12:18:03 -0000
@@ -72,29 +72,73 @@ zpcre_utf8_enabled(void)
 }
 
 /**/
+static char *
+unmetafydup_pcre_pattern(const char *metafied)
+{
+    /*
+     * Premise: PCRE compilation does not take lengths of pattern strings,
+     * only of the haystack string, but we want to continue to let patterns
+     * with embedded NUL characters work, even though we now unmetafy to let
+     * multibyte text match correctly.
+     */
+    char *new_buffer, *long_enough, *p, *q;
+    int decoded_length, prenul_length, nul_count, avail_size, need_size;
+
+    new_buffer = ztrdup(metafied);
+    unmetafy(new_buffer, &decoded_length);
+    prenul_length = strlen(new_buffer);
+    if (prenul_length == decoded_length) {
+	return new_buffer;
+    }
+    /* must be an embedded NUL somewhere, replace with \x00 */
+    avail_size = strlen(metafied);
+    for (nul_count = 0, p = new_buffer; p < new_buffer + decoded_length; ++p) {
+	if (*p == '\0')
+	    nul_count++;
+    }
+    need_size = decoded_length + 3 * nul_count;
+    if (need_size > avail_size) {
+	long_enough = (char *)zalloc(need_size + 1);
+    } else {
+	long_enough = new_buffer;
+    }
+    q = long_enough + need_size;
+    p = new_buffer + decoded_length;
+    *p-- = *q-- = '\0';
+    while (p >= new_buffer) {
+	if (*p) {
+	    *q-- = *p--;
+	    continue;
+	}
+	p--;
+	*q-- = '0'; *q-- = '0'; *q-- = 'x'; *q-- = '\\';
+    }
+    return long_enough;
+}
+
+/**/
 static int
 bin_pcre_compile(char *nam, char **args, Options ops, UNUSED(int func))
 {
     int pcre_opts = 0, pcre_errptr;
     const char *pcre_error;
     char *target;
-    
+
     if(OPT_ISSET(ops,'a')) pcre_opts |= PCRE_ANCHORED;
     if(OPT_ISSET(ops,'i')) pcre_opts |= PCRE_CASELESS;
     if(OPT_ISSET(ops,'m')) pcre_opts |= PCRE_MULTILINE;
     if(OPT_ISSET(ops,'x')) pcre_opts |= PCRE_EXTENDED;
     if(OPT_ISSET(ops,'s')) pcre_opts |= PCRE_DOTALL;
-    
+
     if (zpcre_utf8_enabled())
 	pcre_opts |= PCRE_UTF8;
 
     pcre_hints = NULL;  /* Is this necessary? */
-    
+
     if (pcre_pattern)
 	pcre_free(pcre_pattern);
 
-    target = ztrdup(*args);
-    unmetafy(target, NULL);
+    target = unmetafydup_pcre_pattern(*args);
 
     pcre_pattern = pcre_compile(target, pcre_opts, &pcre_error, &pcre_errptr, NULL);
     
@@ -123,14 +167,14 @@ bin_pcre_study(char *nam, UNUSED(char **
 	zwarnnam(nam, "no pattern has been compiled for study");
 	return 1;
     }
-    
+
     pcre_hints = pcre_study(pcre_pattern, 0, &pcre_error);
     if (pcre_error != NULL)
     {
 	zwarnnam(nam, "error while studying regex: %s", pcre_error);
 	return 1;
     }
-    
+
     return 0;
 }
 
@@ -148,9 +192,12 @@ zpcre_get_substrings(char *arg, int *ove
 		     char *substravar, int want_offset_pair, int matchedinarr,
 		     int want_begin_end)
 {
+    /* Extracts substrings into variables, metafying the results. */
+
     char **captures, *match_all, **matches;
     char offset_all[50];
     int capture_start = 1;
+    int rv;
 
     if (matchedinarr)
 	capture_start = 0;
@@ -158,16 +205,29 @@ zpcre_get_substrings(char *arg, int *ove
 	matchvar = "MATCH";
     if (substravar == NULL)
 	substravar = "match";
-    
+
+    /*
+     * "Unfortunately, the interface to pcre_get_substring_list() is not
+     * adequate for handling strings containing binary zeros, because the end
+     * of the final string is not independently indicated." -- pcreapi(3)
+     *
+     * Well, by iterating over ovec manually, we can get each length in turn,
+     * and still use this call to get all the strings alloced in one memory
+     * block.
+     */
+
     /* captures[0] will be entire matched string, [1] first substring */
-    if (!pcre_get_substring_list(arg, ovec, ret, (const char ***)&captures)) {
+    rv = pcre_get_substring_list(arg, ovec, ret, (const char ***)&captures);
+    if (rv) {
+	zwarn("internal error getting PCRE substrings (%d)", rv);
+    } else {
 	int nelem = arrlen(captures)-1;
 	/* Set to the offsets of the complete match */
 	if (want_offset_pair) {
 	    sprintf(offset_all, "%d %d", ovec[0], ovec[1]);
 	    setsparam("ZPCRE_OP", ztrdup(offset_all));
 	}
-	match_all = metafy(captures[0], -1, META_DUP);
+	match_all = metafy(captures[0], ovec[1]-ovec[0], META_DUP);
 	setsparam(matchvar, match_all);
 	/*
 	 * If we're setting match, mbegin, mend we only do
@@ -176,11 +236,15 @@ zpcre_get_substrings(char *arg, int *ove
 	 */
 	if (!want_begin_end || nelem) {
 	    char **x, **y;
+	    int index_vec, ylen;
 	    y = &captures[capture_start];
+	    index_vec = capture_start * 2;
 	    matches = x = (char **) zalloc(sizeof(char *) * (arrlen(y) + 1));
 	    do {
+		ylen = ovec[index_vec+1] - ovec[index_vec];
+		index_vec += 2;
 		if (*y)
-		    *x++ = metafy(*y, -1, META_DUP);
+		    *x++ = metafy(*y, ylen, META_DUP);
 		else
 		    *x++ = NULL;
 	    } while (*y++);
@@ -188,22 +252,15 @@ zpcre_get_substrings(char *arg, int *ove
 	}
 
 	if (want_begin_end) {
-	    char *ptr = arg;
-	    zlong offs = 0;
-
-	    /* Count the characters before the match */
-	    MB_METACHARINIT();
-	    while (ptr < arg + ovec[0]) {
-		offs++;
-		ptr += MB_METACHARLEN(ptr);
-	    }
-	    setiparam("MBEGIN", offs + !isset(KSHARRAYS));
-	    /* Add on the characters in the match */
-	    while (ptr < arg + ovec[1]) {
-		offs++;
-		ptr += MB_METACHARLEN(ptr);
-	    }
-	    setiparam("MEND", offs + !isset(KSHARRAYS) - 1);
+	    /*
+	     * Note: the original string was metafied, the arg we are passed
+	     * is not (and may contain NUL characters).  For this routine,
+	     * we only care about character offsets as presented to the user,
+	     * so decoding with MB_METACHARINIT/MB_METACHARLEN is overkill
+	     * (and will loop forever on hitting a NUL).
+	     */
+	    setiparam("MBEGIN", ovec[0] + !isset(KSHARRAYS));
+	    setiparam("MEND", ovec[1] + !isset(KSHARRAYS) - 1);
 	    if (nelem) {
 		char **mbegin, **mend, **bptr, **eptr;
 		int i, *ipair;
@@ -216,22 +273,9 @@ zpcre_get_substrings(char *arg, int *ove
 		     ipair += 2, i++, bptr++, eptr++)
 		{
 		    char buf[DIGBUFSIZE];
-		    ptr = arg;
-		    offs = 0;
-		    /* Find the start offset */
-		    MB_METACHARINIT();
-		    while (ptr < arg + ipair[0]) {
-			offs++;
-			ptr += MB_METACHARLEN(ptr);
-		    }
-		    convbase(buf, offs + !isset(KSHARRAYS), 10);
+		    convbase(buf, ipair[0] + !isset(KSHARRAYS), 10);
 		    *bptr = ztrdup(buf);
-		    /* Continue to the end offset */
-		    while (ptr < arg + ipair[1]) {
-			offs++;
-			ptr += MB_METACHARLEN(ptr);
-		    }
-		    convbase(buf, offs + !isset(KSHARRAYS) - 1, 10);
+		    convbase(buf, ipair[1] + !isset(KSHARRAYS) - 1, 10);
 		    *eptr = ztrdup(buf);
 		}
 		*bptr = *eptr = NULL;
@@ -281,7 +325,7 @@ bin_pcre_match(char *nam, char **args, O
 	zwarnnam(nam, "no pattern has been compiled");
 	return 1;
     }
-    
+
     if(OPT_HASARG(ops,c='a')) {
 	receptacle = OPT_ARG(ops,c);
     }
@@ -297,7 +341,7 @@ bin_pcre_match(char *nam, char **args, O
     if(!*args) {
 	zwarnnam(nam, "not enough arguments");
     }
-    
+
     if ((ret = pcre_fullinfo(pcre_pattern, pcre_hints, PCRE_INFO_CAPTURECOUNT, &capcount)))
     {
 	zwarnnam(nam, "error %d in fullinfo", ret);
@@ -308,8 +352,7 @@ bin_pcre_match(char *nam, char **args, O
     ovec = zalloc(ovecsize*sizeof(int));
 
     plaintext = ztrdup(*args);
-    unmetafy(plaintext, NULL);
-    subject_len = (int)strlen(plaintext);
+    unmetafy(plaintext, &subject_len);
 
     if (offset_start < 0 || offset_start >= subject_len)
 	ret = PCRE_ERROR_NOMATCH;
@@ -326,7 +369,7 @@ bin_pcre_match(char *nam, char **args, O
     else {
 	zwarnnam(nam, "error in pcre_exec [%d]", ret);
     }
-    
+
     if (ovec)
 	zfree(ovec, ovecsize*sizeof(int));
 
@@ -341,6 +384,7 @@ cond_pcre_match(char **a, int id)
     const char *pcre_err;
     char *lhstr, *rhre, *lhstr_plain, *rhre_plain, *avar=NULL;
     int r = 0, pcre_opts = 0, pcre_errptr, capcnt, *ov, ovsize;
+    int lhstr_plain_len;
     int return_value = 0;
 
     if (zpcre_utf8_enabled())
@@ -349,9 +393,8 @@ cond_pcre_match(char **a, int id)
     lhstr = cond_str(a,0,0);
     rhre = cond_str(a,1,0);
     lhstr_plain = ztrdup(lhstr);
-    rhre_plain = ztrdup(rhre);
-    unmetafy(lhstr_plain, NULL);
-    unmetafy(rhre_plain, NULL);
+    unmetafy(lhstr_plain, &lhstr_plain_len);
+    rhre_plain = unmetafydup_pcre_pattern(rhre);
     pcre_pat = NULL;
     ov = NULL;
     ovsize = 0;
@@ -366,19 +409,19 @@ cond_pcre_match(char **a, int id)
 		    zwarn("failed to compile regexp /%s/: %s", rhre, pcre_err);
 		    break;
 		}
-                pcre_fullinfo(pcre_pat, NULL, PCRE_INFO_CAPTURECOUNT, &capcnt);
-    		ovsize = (capcnt+1)*3;
+		pcre_fullinfo(pcre_pat, NULL, PCRE_INFO_CAPTURECOUNT, &capcnt);
+		ovsize = (capcnt+1)*3;
 		ov = zalloc(ovsize*sizeof(int));
-    		r = pcre_exec(pcre_pat, NULL, lhstr_plain, strlen(lhstr_plain), 0, 0, ov, ovsize);
+		r = pcre_exec(pcre_pat, NULL, lhstr_plain, lhstr_plain_len, 0, 0, ov, ovsize);
 		/* r < 0 => error; r==0 match but not enough size in ov
 		 * r > 0 => (r-1) substrings found; r==1 => no substrings
 		 */
-    		if (r==0) {
+		if (r==0) {
 		    zwarn("reportable zsh problem: pcre_exec() returned 0");
 		    return_value = 1;
 		    break;
 		}
-	        else if (r==PCRE_ERROR_NOMATCH) {
+		else if (r==PCRE_ERROR_NOMATCH) {
 		    return_value = 0; /* no match */
 		    break;
 		}
@@ -386,7 +429,7 @@ cond_pcre_match(char **a, int id)
 		    zwarn("pcre_exec() error [%d]", r);
 		    break;
 		}
-                else if (r>0) {
+		else if (r>0) {
 		    zpcre_get_substrings(lhstr_plain, ov, r, NULL, avar, 0,
 					 isset(BASHREMATCH),
 					 !isset(BASHREMATCH));
Index: Test/V07pcre.ztst
===================================================================
RCS file: /home/cvsroot/remote-repos/zsh-repo/zsh/Test/V07pcre.ztst,v
retrieving revision 1.2
diff -a -u -p -r1.2 V07pcre.ztst
--- Test/V07pcre.ztst	26 Oct 2011 23:28:27 -0000	1.2
+++ Test/V07pcre.ztst	16 Sep 2012 12:07:10 -0000
@@ -26,6 +26,12 @@
     mkdir multibyte.tmp && cd multibyte.tmp
   fi
 
+  function dump_match_info {
+    print "$? ${MATCH//$'\0'/_} $match $MBEGIN $MEND\c"
+    for ((i=1; i<=$#mbegin; i++)) print "  ${mbegin[i]} $mend[i]\c"
+    print
+  }
+
 %test
 
   [[ 'foo→bar' =~ .([^[:ascii:]]). ]]
@@ -108,3 +114,20 @@
 >1
 >0 xo→t →t
 >0 Xo→t →t
+
+# embedded NUL characters need caution now that we unmetafy strings for
+# multibyte compatibility; different code-paths for [[=~]] vs builtins
+  [[ $'abc\0de' =~ '^([a-c]+)\x00([de]+)$' ]] ; dump_match_info
+  [[ $'abc\0de' =~ $'^([a-c]+)\0([de]+)$' ]] ; dump_match_info
+  pcre_compile '^([a-c]+)\x00([de]+)$'
+  pcre_match $'abc\0de'
+  dump_match_info
+  pcre_compile $'^([a-c]+)\0([de]+)$'
+  pcre_match $'abc\0de'
+  dump_match_info
+0:pcre matching strings with embedded NUL
+>0 abc_de abc de 1 6  1 3  5 6
+>0 abc_de abc de 1 6  1 3  5 6
+>0 abc_de abc de 1 6  1 3  5 6
+>0 abc_de abc de 1 6  1 3  5 6
+


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PATCH: PCRE support for embedded NUL characters
  2012-09-16 12:50 PATCH: PCRE support for embedded NUL characters Phil Pennock
@ 2012-09-17  5:59 ` Phil Pennock
  2012-09-19 18:24   ` Peter Stephenson
  2012-09-17  8:57 ` Peter Stephenson
  1 sibling, 1 reply; 8+ messages in thread
From: Phil Pennock @ 2012-09-17  5:59 UTC (permalink / raw)
  To: zsh-workers

On 2012-09-16 at 08:50 -0400, Phil Pennock wrote:
> Another open question: are $mbegin/$mend offsets supposed to be in
> octets or in characters?  Given the MB_ prefix, I'm guessing I just
> broke this and will need to fix it tomorrow, before commit, after I get
> some sleep.  Do we have a decent way to count the number of wide
> characters in an unmetafied string which can contain NUL characters?

Anyone have any opinions of what ZPCRE_OP should count?

It's currently documented as counting bytes, which is certainly
accurate, but seems not very useful given that you can't index strings
by byte offsets in zsh (or can you, with some option I've not noticed
before?)

-Phil


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PATCH: PCRE support for embedded NUL characters
  2012-09-16 12:50 PATCH: PCRE support for embedded NUL characters Phil Pennock
  2012-09-17  5:59 ` Phil Pennock
@ 2012-09-17  8:57 ` Peter Stephenson
  2012-09-17 19:04   ` Phil Pennock
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Stephenson @ 2012-09-17  8:57 UTC (permalink / raw)
  To: zsh-workers

On Sun, 16 Sep 2012 08:50:15 -0400
Phil Pennock <zsh-workers+phil.pennock@spodhuis.org> wrote:
> This patch does not touch the docs or code for non-PCRE.  It just
> changes PCRE: code, docs, tests.  As I wrote this mail, I realised a
> couple of issues which lead me to think I shouldn't just commit this as
> is; there are open questions below for Peter/Bart.
> 
> Moritz Bunkus reported a problem which boiled down to regular
> expressions containing ASCII NUL characters; the support for UTF-8
> multibyte characters in regular expressions meant that we no longer
> passed zsh's internal metafied forms to the regular expression
> libraries, but this also meant that NUL characters in zsh now make it
> down.  There's no way I can see to deal with this for zsh/regex, but for
> zsh/pcre we can hack around it.

I'm not sure what the right answer is for zsh/regex, but documenting it
will do for now.

> More careful tracking of length, instead of using strlen(), lets us pass
> the search space (haystack) down intact.  For the regular expression, as
> part of the unmetafy() I now check to see if strlen() doesn't match the
> decoded length, in which case there's a NUL somewhere, and then all the
> NULs get replaced with \x00 in the pattern.
> 
> One thing that occurs to me now: what's the correct expectation if the
> pattern contains two characters, "backslash NUL"?  Before, that broke;
> with this, it becomes \\x00 which won't match.  \CHAR for a non-letter
> CHAR should remove special handling and treat the character as itself.
> Fixing this seems to require full string parsing in zsh with knowledge
> of the regexp escape sequences.  Document it as a limitation?

You can just do a pre-scan of the whole string for backslashes.  If
there's a backslash followed by a non-NULL, skip checking that next
character (which may itself be a backslash that's escaped); if there's a
backslash followed by a NULL the backslash can go.  It's such an unusual
case it hardly seems worth it, though.

> Another open question: are $mbegin/$mend offsets supposed to be in
> octets or in characters?  Given the MB_ prefix, I'm guessing I just
> broke this and will need to fix it tomorrow, before commit, after I get
> some sleep.  Do we have a decent way to count the number of wide
> characters in an unmetafied string which can contain NUL characters?

They are characters.  If the string is unmetafied you can skip the
MB_METACHARLEN() stuff and use the mbrtowc()/WCWIDTH() library calls
directly (WCWIDTH() is only defined in order to be able to replace an
unusable wcwidth()), but a null probably needs to be a special case
since I think the libraries assume it's a terminator.  It looks like the
existing pattern code uses metafied strings.

-- 
Peter Stephenson <pws@csr.com>            Software Engineer
Tel: +44 (0)1223 692070                   Cambridge Silicon Radio Limited
Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, UK

Member of the CSR plc group of companies. CSR plc registered in England and Wales, registered number 4187346, registered office Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, United Kingdom
More information can be found at www.csr.com. Follow CSR on Twitter at http://twitter.com/CSR_PLC and read our blog at www.csr.com/blog

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PATCH: PCRE support for embedded NUL characters
  2012-09-17  8:57 ` Peter Stephenson
@ 2012-09-17 19:04   ` Phil Pennock
  2012-09-18  8:51     ` Peter Stephenson
  0 siblings, 1 reply; 8+ messages in thread
From: Phil Pennock @ 2012-09-17 19:04 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: zsh-workers

On 2012-09-17 at 09:57 +0100, Peter Stephenson wrote:
> You can just do a pre-scan of the whole string for backslashes.  If
> there's a backslash followed by a non-NULL, skip checking that next
> character (which may itself be a backslash that's escaped); if there's a
> backslash followed by a NULL the backslash can go.  It's such an unusual
> case it hardly seems worth it, though.

I'm inclined to document it as a limitation, rather than add complexity,
wherein will lurk bugs.

> They are characters.  If the string is unmetafied you can skip the
> MB_METACHARLEN() stuff and use the mbrtowc()/WCWIDTH() library calls
> directly (WCWIDTH() is only defined in order to be able to replace an
> unusable wcwidth()), but a null probably needs to be a special case
> since I think the libraries assume it's a terminator.  It looks like the
> existing pattern code uses metafied strings.

Yeah, but correlating offsets in unmetafied strings to the metafied
strings for then counting is non-trivial (or so it seems to me).

And wcwidth() tells how many display cells are needed for a given
character, assuming a monospace layout.  For this, instead, mblen() is
needed, on a character-by-character basis.  Given that mblen() is C99, I
opted to avoid it, and implement this just for UTF-8 with bit-pattern
examination to quickly count past characters.  We only initialise PCRE
for wide characters with UTF-8.  I've no idea how much effort we want to
put into supporting non-UTF-8 wide-character PCRE across multiple OSes.

Thoughts?

(Code not finished yesterday, will be a little bit before I get back to
it).

-Phil

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PATCH: PCRE support for embedded NUL characters
  2012-09-17 19:04   ` Phil Pennock
@ 2012-09-18  8:51     ` Peter Stephenson
  2012-09-18 10:40       ` Peter Stephenson
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Stephenson @ 2012-09-18  8:51 UTC (permalink / raw)
  To: zsh-workers

On Mon, 17 Sep 2012 15:04:23 -0400
Phil Pennock <zsh-workers+phil.pennock@spodhuis.org> wrote:
> Yeah, but correlating offsets in unmetafied strings to the metafied
> strings for then counting is non-trivial (or so it seems to me).

It's not so difficult: we already do most of this conversion for other
similar cases of pattern matching, where we need to convert offsets in
octets to characters, the only difference being the metafication which
just means the loop over the characters is slightly different.  In fact,
it's if anything marginally easier since the metafication is a pure zsh
invention.

> And wcwidth() tells how many display cells are needed for a given
> character, assuming a monospace layout.  For this, instead, mblen() is
> needed, on a character-by-character basis.  Given that mblen() is C99, I
> opted to avoid it, and implement this just for UTF-8 with bit-pattern
> examination to quickly count past characters.  We only initialise PCRE
> for wide characters with UTF-8.  I've no idea how much effort we want to
> put into supporting non-UTF-8 wide-character PCRE across multiple OSes.

Doing it just for UTF-8 is incompatible with the rest of the shell.  It
should be possible to do it similarly to mb_metastrlen() in utils.c.
Basically the only difference is using an explicit length rather than null
termination, plus not having an internal test for Meta characters.

-- 
Peter Stephenson <pws@csr.com>            Software Engineer
Tel: +44 (0)1223 692070                   Cambridge Silicon Radio Limited
Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, UK

Member of the CSR plc group of companies. CSR plc registered in England and Wales, registered number 4187346, registered office Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, United Kingdom
More information can be found at www.csr.com. Follow CSR on Twitter at http://twitter.com/CSR_PLC and read our blog at www.csr.com/blog

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PATCH: PCRE support for embedded NUL characters
  2012-09-18  8:51     ` Peter Stephenson
@ 2012-09-18 10:40       ` Peter Stephenson
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Stephenson @ 2012-09-18 10:40 UTC (permalink / raw)
  To: zsh-workers

On Tue, 18 Sep 2012 09:51:45 +0100
Peter Stephenson <Peter.Stephenson@csr.com> wrote:
> Doing it just for UTF-8 is incompatible with the rest of the shell.  It
> should be possible to do it similarly to mb_metastrlen() in utils.c.
> Basically the only difference is using an explicit length rather than null
> termination, plus not having an internal test for Meta characters.

Indeed, it would be straightforward to add extra arguments to
mb_metastrlen() and mb_metacharlenconv(_r)() to indicate they were being
passed an unmetafied string of an explicit length.

-- 
Peter Stephenson <pws@csr.com>            Software Engineer
Tel: +44 (0)1223 692070                   Cambridge Silicon Radio Limited
Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, UK

Member of the CSR plc group of companies. CSR plc registered in England and Wales, registered number 4187346, registered office Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, United Kingdom
More information can be found at www.csr.com. Follow CSR on Twitter at http://twitter.com/CSR_PLC and read our blog at www.csr.com/blog

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PATCH: PCRE support for embedded NUL characters
  2012-09-17  5:59 ` Phil Pennock
@ 2012-09-19 18:24   ` Peter Stephenson
  2012-09-19 18:49     ` Phil Pennock
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Stephenson @ 2012-09-19 18:24 UTC (permalink / raw)
  To: zsh-workers

On Mon, 17 Sep 2012 01:59:25 -0400
Phil Pennock <zsh-workers+phil.pennock@spodhuis.org> wrote:
> Anyone have any opinions of what ZPCRE_OP should count?
> 
> It's currently documented as counting bytes, which is certainly
> accurate, but seems not very useful given that you can't index strings
> by byte offsets in zsh (or can you, with some option I've not noticed
> before?)

Ideally, it's bytes with MULTIBYTE unset, and code points with it set,
i.e. what the mb/wc functions consider a "character".

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PATCH: PCRE support for embedded NUL characters
  2012-09-19 18:24   ` Peter Stephenson
@ 2012-09-19 18:49     ` Phil Pennock
  0 siblings, 0 replies; 8+ messages in thread
From: Phil Pennock @ 2012-09-19 18:49 UTC (permalink / raw)
  To: zsh-workers

On 2012-09-19 at 19:24 +0100, Peter Stephenson wrote:
> On Mon, 17 Sep 2012 01:59:25 -0400
> Phil Pennock <zsh-workers+phil.pennock@spodhuis.org> wrote:
> > Anyone have any opinions of what ZPCRE_OP should count?
> > 
> > It's currently documented as counting bytes, which is certainly
> > accurate, but seems not very useful given that you can't index strings
> > by byte offsets in zsh (or can you, with some option I've not noticed
> > before?)
> 
> Ideally, it's bytes with MULTIBYTE unset, and code points with it set,
> i.e. what the mb/wc functions consider a "character".

Okay, good; when I get back to this, I'll harmonise all the outputs to
be by character, including that one, and update the docs.  Probably this
weekend, unless I need a mental code-switch break before then.

-Phil


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2012-09-19 18:49 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-16 12:50 PATCH: PCRE support for embedded NUL characters Phil Pennock
2012-09-17  5:59 ` Phil Pennock
2012-09-19 18:24   ` Peter Stephenson
2012-09-19 18:49     ` Phil Pennock
2012-09-17  8:57 ` Peter Stephenson
2012-09-17 19:04   ` Phil Pennock
2012-09-18  8:51     ` Peter Stephenson
2012-09-18 10:40       ` Peter Stephenson

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).