From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-30670-mason-zsh=primenet.com.au@zsh.org>
Received: (qmail 15568 invoked by alias); 16 Sep 2012 12:50:27 -0000
Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm
Precedence: bulk
X-No-Archive: yes
List-Id: Zsh Workers List <zsh-workers.zsh.org>
List-Post: <mailto:zsh-workers@zsh.org>
List-Help: <mailto:zsh-workers-help@zsh.org>
X-Seq: 30670
Received: (qmail 16776 invoked from network); 16 Sep 2012 12:50:22 -0000
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on f.primenet.com.au
X-Spam-Level: 
X-Spam-Status: No, score=-3.3 required=5.0 tests=BAYES_00,DKIM_ADSP_ALL,
	DKIM_SIGNED,RCVD_IN_DNSWL_MED,T_DKIM_INVALID,UNPARSEABLE_RELAY autolearn=ham
	version=3.3.2
Received-SPF: none (ns1.primenet.com.au: domain at spodhuis.org does not designate permitted sender hosts)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=spodhuis.org; s=d201107;
	h=Content-Transfer-Encoding:Content-Type:MIME-Version:Message-ID:Subject:To:From:Date; bh=vnMAqV4BE+dttVRTb/LskejlJloIUa3YmBhK5n6q6+g=;
	b=WEyRh7e7L2b9lZEX0HO83PFnREKWNBd/4rpfkcqctkEV3VJ03TDo8m8FHAo8U9qxFqry1zaOSLHFuFxpSb/a58lGOy97pTQ/ELPdpXyhEV+NDN1HXZxXYofK7njRLM0GY5F1YEABNzq+seErmRqyWyfYPa1mMKwocu9Q8lWoMls=;
Date: Sun, 16 Sep 2012 08:50:15 -0400
From: Phil Pennock <zsh-workers+phil.pennock@spodhuis.org>
To: zsh-workers@zsh.org
Subject: PATCH: PCRE support for embedded NUL characters
Message-ID: <20120916125015.GA87764@redoubt.spodhuis.org>
Mail-Followup-To: zsh-workers@zsh.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

This patch does not touch the docs or code for non-PCRE.  It just
changes PCRE: code, docs, tests.  As I wrote this mail, I realised a
couple of issues which lead me to think I shouldn't just commit this as
is; there are open questions below for Peter/Bart.

Moritz Bunkus reported a problem which boiled down to regular
expressions containing ASCII NUL characters; the support for UTF-8
multibyte characters in regular expressions meant that we no longer
passed zsh's internal metafied forms to the regular expression
libraries, but this also meant that NUL characters in zsh now make it
down.  There's no way I can see to deal with this for zsh/regex, but for
zsh/pcre we can hack around it.

More careful tracking of length, instead of using strlen(), lets us pass
the search space (haystack) down intact.  For the regular expression, as
part of the unmetafy() I now check to see if strlen() doesn't match the
decoded length, in which case there's a NUL somewhere, and then all the
NULs get replaced with \x00 in the pattern.

One thing that occurs to me now: what's the correct expectation if the
pattern contains two characters, "backslash NUL"?  Before, that broke;
with this, it becomes \\x00 which won't match.  \CHAR for a non-letter
CHAR should remove special handling and treat the character as itself.
Fixing this seems to require full string parsing in zsh with knowledge
of the regexp escape sequences.  Document it as a limitation?

In addition, fixing the main NUL issues revealed that I didn't clean up
some meta-based length measuring for $mbegin/$mend setting; because we
only care about the offsets, and we have those _directly_ now that we're
not passing metafied strings in, that code got a lot simpler and so got
rid of an infinitish loop on hitting a NUL.

Another open question: are $mbegin/$mend offsets supposed to be in
octets or in characters?  Given the MB_ prefix, I'm guessing I just
broke this and will need to fix it tomorrow, before commit, after I get
some sleep.  Do we have a decent way to count the number of wide
characters in an unmetafied string which can contain NUL characters?

Right now, I'm not seeing a way to take offsets into an unmetafied
string and turn them into offsets into the corresponding metafied
string, to then be able to use the length measurement tools we already
have, so I _think_ the simplification stays, but I then need to
compensate separately for the wide characters.

Also, this means I need to expand the test suite for the multibyte
character regexps to report the indices of matches, because the
testsuite passed with this change, and since I now believe that to be a
regression, that seems to be a sign of incomplete coverage.  The
function I added to the %prep should be helpful there.

(And a few lines here come from leading/trailing whitespace/tab fixes I
 made while I was in)


Index: Doc/Zsh/mod_pcre.yo
===================================================================
RCS file: /home/cvsroot/remote-repos/zsh-repo/zsh/Doc/Zsh/mod_pcre.yo,v
retrieving revision 1.8
diff -a -u -p -r1.8 mod_pcre.yo
--- Doc/Zsh/mod_pcre.yo	25 Mar 2009 11:29:14 -0000	1.8
+++ Doc/Zsh/mod_pcre.yo	16 Sep 2012 12:25:58 -0000
@@ -2,6 +2,15 @@ COMMENT(!MOD!zsh/pcre
 Interface to the PCRE library.
 !MOD!)
 cindex(regular expressions, perl-compatible)
+The tt(zsh/pcre) module is one of two providers of regular expression
+support for zsh.  If the tt(RE_MATCH_PCRE) option is set, then the
+tt(=~) conditional operator will use this module.
+
+zsh's PCRE support is multi-byte aware and handles embedded NUL characters
+in both the string to be searched and the string specifying the regular
+expression; for the latter case, the NUL will be replaced with tt(\x00)
+before compiling the expression.
+
 The tt(zsh/pcre) module makes some commands available as builtins:
 
 startitem()
Index: Src/Modules/pcre.c
===================================================================
RCS file: /home/cvsroot/remote-repos/zsh-repo/zsh/Src/Modules/pcre.c,v
retrieving revision 1.21
diff -a -u -p -r1.21 pcre.c
--- Src/Modules/pcre.c	6 Jan 2012 10:08:54 -0000	1.21
+++ Src/Modules/pcre.c	16 Sep 2012 12:18:03 -0000
@@ -72,29 +72,73 @@ zpcre_utf8_enabled(void)
 }
 
 /**/
+static char *
+unmetafydup_pcre_pattern(const char *metafied)
+{
+    /*
+     * Premise: PCRE compilation does not take lengths of pattern strings,
+     * only of the haystack string, but we want to continue to let patterns
+     * with embedded NUL characters work, even though we now unmetafy to let
+     * multibyte text match correctly.
+     */
+    char *new_buffer, *long_enough, *p, *q;
+    int decoded_length, prenul_length, nul_count, avail_size, need_size;
+
+    new_buffer = ztrdup(metafied);
+    unmetafy(new_buffer, &decoded_length);
+    prenul_length = strlen(new_buffer);
+    if (prenul_length == decoded_length) {
+	return new_buffer;
+    }
+    /* must be an embedded NUL somewhere, replace with \x00 */
+    avail_size = strlen(metafied);
+    for (nul_count = 0, p = new_buffer; p < new_buffer + decoded_length; ++p) {
+	if (*p == '\0')
+	    nul_count++;
+    }
+    need_size = decoded_length + 3 * nul_count;
+    if (need_size > avail_size) {
+	long_enough = (char *)zalloc(need_size + 1);
+    } else {
+	long_enough = new_buffer;
+    }
+    q = long_enough + need_size;
+    p = new_buffer + decoded_length;
+    *p-- = *q-- = '\0';
+    while (p >= new_buffer) {
+	if (*p) {
+	    *q-- = *p--;
+	    continue;
+	}
+	p--;
+	*q-- = '0'; *q-- = '0'; *q-- = 'x'; *q-- = '\\';
+    }
+    return long_enough;
+}
+
+/**/
 static int
 bin_pcre_compile(char *nam, char **args, Options ops, UNUSED(int func))
 {
     int pcre_opts = 0, pcre_errptr;
     const char *pcre_error;
     char *target;
-    
+
     if(OPT_ISSET(ops,'a')) pcre_opts |= PCRE_ANCHORED;
     if(OPT_ISSET(ops,'i')) pcre_opts |= PCRE_CASELESS;
     if(OPT_ISSET(ops,'m')) pcre_opts |= PCRE_MULTILINE;
     if(OPT_ISSET(ops,'x')) pcre_opts |= PCRE_EXTENDED;
     if(OPT_ISSET(ops,'s')) pcre_opts |= PCRE_DOTALL;
-    
+
     if (zpcre_utf8_enabled())
 	pcre_opts |= PCRE_UTF8;
 
     pcre_hints = NULL;  /* Is this necessary? */
-    
+
     if (pcre_pattern)
 	pcre_free(pcre_pattern);
 
-    target = ztrdup(*args);
-    unmetafy(target, NULL);
+    target = unmetafydup_pcre_pattern(*args);
 
     pcre_pattern = pcre_compile(target, pcre_opts, &pcre_error, &pcre_errptr, NULL);
     
@@ -123,14 +167,14 @@ bin_pcre_study(char *nam, UNUSED(char **
 	zwarnnam(nam, "no pattern has been compiled for study");
 	return 1;
     }
-    
+
     pcre_hints = pcre_study(pcre_pattern, 0, &pcre_error);
     if (pcre_error != NULL)
     {
 	zwarnnam(nam, "error while studying regex: %s", pcre_error);
 	return 1;
     }
-    
+
     return 0;
 }
 
@@ -148,9 +192,12 @@ zpcre_get_substrings(char *arg, int *ove
 		     char *substravar, int want_offset_pair, int matchedinarr,
 		     int want_begin_end)
 {
+    /* Extracts substrings into variables, metafying the results. */
+
     char **captures, *match_all, **matches;
     char offset_all[50];
     int capture_start = 1;
+    int rv;
 
     if (matchedinarr)
 	capture_start = 0;
@@ -158,16 +205,29 @@ zpcre_get_substrings(char *arg, int *ove
 	matchvar = "MATCH";
     if (substravar == NULL)
 	substravar = "match";
-    
+
+    /*
+     * "Unfortunately, the interface to pcre_get_substring_list() is not
+     * adequate for handling strings containing binary zeros, because the end
+     * of the final string is not independently indicated." -- pcreapi(3)
+     *
+     * Well, by iterating over ovec manually, we can get each length in turn,
+     * and still use this call to get all the strings alloced in one memory
+     * block.
+     */
+
     /* captures[0] will be entire matched string, [1] first substring */
-    if (!pcre_get_substring_list(arg, ovec, ret, (const char ***)&captures)) {
+    rv = pcre_get_substring_list(arg, ovec, ret, (const char ***)&captures);
+    if (rv) {
+	zwarn("internal error getting PCRE substrings (%d)", rv);
+    } else {
 	int nelem = arrlen(captures)-1;
 	/* Set to the offsets of the complete match */
 	if (want_offset_pair) {
 	    sprintf(offset_all, "%d %d", ovec[0], ovec[1]);
 	    setsparam("ZPCRE_OP", ztrdup(offset_all));
 	}
-	match_all = metafy(captures[0], -1, META_DUP);
+	match_all = metafy(captures[0], ovec[1]-ovec[0], META_DUP);
 	setsparam(matchvar, match_all);
 	/*
 	 * If we're setting match, mbegin, mend we only do
@@ -176,11 +236,15 @@ zpcre_get_substrings(char *arg, int *ove
 	 */
 	if (!want_begin_end || nelem) {
 	    char **x, **y;
+	    int index_vec, ylen;
 	    y = &captures[capture_start];
+	    index_vec = capture_start * 2;
 	    matches = x = (char **) zalloc(sizeof(char *) * (arrlen(y) + 1));
 	    do {
+		ylen = ovec[index_vec+1] - ovec[index_vec];
+		index_vec += 2;
 		if (*y)
-		    *x++ = metafy(*y, -1, META_DUP);
+		    *x++ = metafy(*y, ylen, META_DUP);
 		else
 		    *x++ = NULL;
 	    } while (*y++);
@@ -188,22 +252,15 @@ zpcre_get_substrings(char *arg, int *ove
 	}
 
 	if (want_begin_end) {
-	    char *ptr = arg;
-	    zlong offs = 0;
-
-	    /* Count the characters before the match */
-	    MB_METACHARINIT();
-	    while (ptr < arg + ovec[0]) {
-		offs++;
-		ptr += MB_METACHARLEN(ptr);
-	    }
-	    setiparam("MBEGIN", offs + !isset(KSHARRAYS));
-	    /* Add on the characters in the match */
-	    while (ptr < arg + ovec[1]) {
-		offs++;
-		ptr += MB_METACHARLEN(ptr);
-	    }
-	    setiparam("MEND", offs + !isset(KSHARRAYS) - 1);
+	    /*
+	     * Note: the original string was metafied, the arg we are passed
+	     * is not (and may contain NUL characters).  For this routine,
+	     * we only care about character offsets as presented to the user,
+	     * so decoding with MB_METACHARINIT/MB_METACHARLEN is overkill
+	     * (and will loop forever on hitting a NUL).
+	     */
+	    setiparam("MBEGIN", ovec[0] + !isset(KSHARRAYS));
+	    setiparam("MEND", ovec[1] + !isset(KSHARRAYS) - 1);
 	    if (nelem) {
 		char **mbegin, **mend, **bptr, **eptr;
 		int i, *ipair;
@@ -216,22 +273,9 @@ zpcre_get_substrings(char *arg, int *ove
 		     ipair += 2, i++, bptr++, eptr++)
 		{
 		    char buf[DIGBUFSIZE];
-		    ptr = arg;
-		    offs = 0;
-		    /* Find the start offset */
-		    MB_METACHARINIT();
-		    while (ptr < arg + ipair[0]) {
-			offs++;
-			ptr += MB_METACHARLEN(ptr);
-		    }
-		    convbase(buf, offs + !isset(KSHARRAYS), 10);
+		    convbase(buf, ipair[0] + !isset(KSHARRAYS), 10);
 		    *bptr = ztrdup(buf);
-		    /* Continue to the end offset */
-		    while (ptr < arg + ipair[1]) {
-			offs++;
-			ptr += MB_METACHARLEN(ptr);
-		    }
-		    convbase(buf, offs + !isset(KSHARRAYS) - 1, 10);
+		    convbase(buf, ipair[1] + !isset(KSHARRAYS) - 1, 10);
 		    *eptr = ztrdup(buf);
 		}
 		*bptr = *eptr = NULL;
@@ -281,7 +325,7 @@ bin_pcre_match(char *nam, char **args, O
 	zwarnnam(nam, "no pattern has been compiled");
 	return 1;
     }
-    
+
     if(OPT_HASARG(ops,c='a')) {
 	receptacle = OPT_ARG(ops,c);
     }
@@ -297,7 +341,7 @@ bin_pcre_match(char *nam, char **args, O
     if(!*args) {
 	zwarnnam(nam, "not enough arguments");
     }
-    
+
     if ((ret = pcre_fullinfo(pcre_pattern, pcre_hints, PCRE_INFO_CAPTURECOUNT, &capcount)))
     {
 	zwarnnam(nam, "error %d in fullinfo", ret);
@@ -308,8 +352,7 @@ bin_pcre_match(char *nam, char **args, O
     ovec = zalloc(ovecsize*sizeof(int));
 
     plaintext = ztrdup(*args);
-    unmetafy(plaintext, NULL);
-    subject_len = (int)strlen(plaintext);
+    unmetafy(plaintext, &subject_len);
 
     if (offset_start < 0 || offset_start >= subject_len)
 	ret = PCRE_ERROR_NOMATCH;
@@ -326,7 +369,7 @@ bin_pcre_match(char *nam, char **args, O
     else {
 	zwarnnam(nam, "error in pcre_exec [%d]", ret);
     }
-    
+
     if (ovec)
 	zfree(ovec, ovecsize*sizeof(int));
 
@@ -341,6 +384,7 @@ cond_pcre_match(char **a, int id)
     const char *pcre_err;
     char *lhstr, *rhre, *lhstr_plain, *rhre_plain, *avar=NULL;
     int r = 0, pcre_opts = 0, pcre_errptr, capcnt, *ov, ovsize;
+    int lhstr_plain_len;
     int return_value = 0;
 
     if (zpcre_utf8_enabled())
@@ -349,9 +393,8 @@ cond_pcre_match(char **a, int id)
     lhstr = cond_str(a,0,0);
     rhre = cond_str(a,1,0);
     lhstr_plain = ztrdup(lhstr);
-    rhre_plain = ztrdup(rhre);
-    unmetafy(lhstr_plain, NULL);
-    unmetafy(rhre_plain, NULL);
+    unmetafy(lhstr_plain, &lhstr_plain_len);
+    rhre_plain = unmetafydup_pcre_pattern(rhre);
     pcre_pat = NULL;
     ov = NULL;
     ovsize = 0;
@@ -366,19 +409,19 @@ cond_pcre_match(char **a, int id)
 		    zwarn("failed to compile regexp /%s/: %s", rhre, pcre_err);
 		    break;
 		}
-                pcre_fullinfo(pcre_pat, NULL, PCRE_INFO_CAPTURECOUNT, &capcnt);
-    		ovsize = (capcnt+1)*3;
+		pcre_fullinfo(pcre_pat, NULL, PCRE_INFO_CAPTURECOUNT, &capcnt);
+		ovsize = (capcnt+1)*3;
 		ov = zalloc(ovsize*sizeof(int));
-    		r = pcre_exec(pcre_pat, NULL, lhstr_plain, strlen(lhstr_plain), 0, 0, ov, ovsize);
+		r = pcre_exec(pcre_pat, NULL, lhstr_plain, lhstr_plain_len, 0, 0, ov, ovsize);
 		/* r < 0 => error; r==0 match but not enough size in ov
 		 * r > 0 => (r-1) substrings found; r==1 => no substrings
 		 */
-    		if (r==0) {
+		if (r==0) {
 		    zwarn("reportable zsh problem: pcre_exec() returned 0");
 		    return_value = 1;
 		    break;
 		}
-	        else if (r==PCRE_ERROR_NOMATCH) {
+		else if (r==PCRE_ERROR_NOMATCH) {
 		    return_value = 0; /* no match */
 		    break;
 		}
@@ -386,7 +429,7 @@ cond_pcre_match(char **a, int id)
 		    zwarn("pcre_exec() error [%d]", r);
 		    break;
 		}
-                else if (r>0) {
+		else if (r>0) {
 		    zpcre_get_substrings(lhstr_plain, ov, r, NULL, avar, 0,
 					 isset(BASHREMATCH),
 					 !isset(BASHREMATCH));
Index: Test/V07pcre.ztst
===================================================================
RCS file: /home/cvsroot/remote-repos/zsh-repo/zsh/Test/V07pcre.ztst,v
retrieving revision 1.2
diff -a -u -p -r1.2 V07pcre.ztst
--- Test/V07pcre.ztst	26 Oct 2011 23:28:27 -0000	1.2
+++ Test/V07pcre.ztst	16 Sep 2012 12:07:10 -0000
@@ -26,6 +26,12 @@
     mkdir multibyte.tmp && cd multibyte.tmp
   fi
 
+  function dump_match_info {
+    print "$? ${MATCH//$'\0'/_} $match $MBEGIN $MEND\c"
+    for ((i=1; i<=$#mbegin; i++)) print "  ${mbegin[i]} $mend[i]\c"
+    print
+  }
+
 %test
 
   [[ 'foo→bar' =~ .([^[:ascii:]]). ]]
@@ -108,3 +114,20 @@
 >1
 >0 xo→t →t
 >0 Xo→t →t
+
+# embedded NUL characters need caution now that we unmetafy strings for
+# multibyte compatibility; different code-paths for [[=~]] vs builtins
+  [[ $'abc\0de' =~ '^([a-c]+)\x00([de]+)$' ]] ; dump_match_info
+  [[ $'abc\0de' =~ $'^([a-c]+)\0([de]+)$' ]] ; dump_match_info
+  pcre_compile '^([a-c]+)\x00([de]+)$'
+  pcre_match $'abc\0de'
+  dump_match_info
+  pcre_compile $'^([a-c]+)\0([de]+)$'
+  pcre_match $'abc\0de'
+  dump_match_info
+0:pcre matching strings with embedded NUL
+>0 abc_de abc de 1 6  1 3  5 6
+>0 abc_de abc de 1 6  1 3  5 6
+>0 abc_de abc de 1 6  1 3  5 6
+>0 abc_de abc de 1 6  1 3  5 6
+