From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 15568 invoked by alias); 16 Sep 2012 12:50:27 -0000 Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm Precedence: bulk X-No-Archive: yes List-Id: Zsh Workers List List-Post: List-Help: X-Seq: 30670 Received: (qmail 16776 invoked from network); 16 Sep 2012 12:50:22 -0000 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on f.primenet.com.au X-Spam-Level: X-Spam-Status: No, score=-3.3 required=5.0 tests=BAYES_00,DKIM_ADSP_ALL, DKIM_SIGNED,RCVD_IN_DNSWL_MED,T_DKIM_INVALID,UNPARSEABLE_RELAY autolearn=ham version=3.3.2 Received-SPF: none (ns1.primenet.com.au: domain at spodhuis.org does not designate permitted sender hosts) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=spodhuis.org; s=d201107; h=Content-Transfer-Encoding:Content-Type:MIME-Version:Message-ID:Subject:To:From:Date; bh=vnMAqV4BE+dttVRTb/LskejlJloIUa3YmBhK5n6q6+g=; b=WEyRh7e7L2b9lZEX0HO83PFnREKWNBd/4rpfkcqctkEV3VJ03TDo8m8FHAo8U9qxFqry1zaOSLHFuFxpSb/a58lGOy97pTQ/ELPdpXyhEV+NDN1HXZxXYofK7njRLM0GY5F1YEABNzq+seErmRqyWyfYPa1mMKwocu9Q8lWoMls=; Date: Sun, 16 Sep 2012 08:50:15 -0400 From: Phil Pennock To: zsh-workers@zsh.org Subject: PATCH: PCRE support for embedded NUL characters Message-ID: <20120916125015.GA87764@redoubt.spodhuis.org> Mail-Followup-To: zsh-workers@zsh.org MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit This patch does not touch the docs or code for non-PCRE. It just changes PCRE: code, docs, tests. As I wrote this mail, I realised a couple of issues which lead me to think I shouldn't just commit this as is; there are open questions below for Peter/Bart. Moritz Bunkus reported a problem which boiled down to regular expressions containing ASCII NUL characters; the support for UTF-8 multibyte characters in regular expressions meant that we no longer passed zsh's internal metafied forms to the regular expression libraries, but this also meant that NUL characters in zsh now make it down. There's no way I can see to deal with this for zsh/regex, but for zsh/pcre we can hack around it. More careful tracking of length, instead of using strlen(), lets us pass the search space (haystack) down intact. For the regular expression, as part of the unmetafy() I now check to see if strlen() doesn't match the decoded length, in which case there's a NUL somewhere, and then all the NULs get replaced with \x00 in the pattern. One thing that occurs to me now: what's the correct expectation if the pattern contains two characters, "backslash NUL"? Before, that broke; with this, it becomes \\x00 which won't match. \CHAR for a non-letter CHAR should remove special handling and treat the character as itself. Fixing this seems to require full string parsing in zsh with knowledge of the regexp escape sequences. Document it as a limitation? In addition, fixing the main NUL issues revealed that I didn't clean up some meta-based length measuring for $mbegin/$mend setting; because we only care about the offsets, and we have those _directly_ now that we're not passing metafied strings in, that code got a lot simpler and so got rid of an infinitish loop on hitting a NUL. Another open question: are $mbegin/$mend offsets supposed to be in octets or in characters? Given the MB_ prefix, I'm guessing I just broke this and will need to fix it tomorrow, before commit, after I get some sleep. Do we have a decent way to count the number of wide characters in an unmetafied string which can contain NUL characters? Right now, I'm not seeing a way to take offsets into an unmetafied string and turn them into offsets into the corresponding metafied string, to then be able to use the length measurement tools we already have, so I _think_ the simplification stays, but I then need to compensate separately for the wide characters. Also, this means I need to expand the test suite for the multibyte character regexps to report the indices of matches, because the testsuite passed with this change, and since I now believe that to be a regression, that seems to be a sign of incomplete coverage. The function I added to the %prep should be helpful there. (And a few lines here come from leading/trailing whitespace/tab fixes I made while I was in) Index: Doc/Zsh/mod_pcre.yo =================================================================== RCS file: /home/cvsroot/remote-repos/zsh-repo/zsh/Doc/Zsh/mod_pcre.yo,v retrieving revision 1.8 diff -a -u -p -r1.8 mod_pcre.yo --- Doc/Zsh/mod_pcre.yo 25 Mar 2009 11:29:14 -0000 1.8 +++ Doc/Zsh/mod_pcre.yo 16 Sep 2012 12:25:58 -0000 @@ -2,6 +2,15 @@ COMMENT(!MOD!zsh/pcre Interface to the PCRE library. !MOD!) cindex(regular expressions, perl-compatible) +The tt(zsh/pcre) module is one of two providers of regular expression +support for zsh. If the tt(RE_MATCH_PCRE) option is set, then the +tt(=~) conditional operator will use this module. + +zsh's PCRE support is multi-byte aware and handles embedded NUL characters +in both the string to be searched and the string specifying the regular +expression; for the latter case, the NUL will be replaced with tt(\x00) +before compiling the expression. + The tt(zsh/pcre) module makes some commands available as builtins: startitem() Index: Src/Modules/pcre.c =================================================================== RCS file: /home/cvsroot/remote-repos/zsh-repo/zsh/Src/Modules/pcre.c,v retrieving revision 1.21 diff -a -u -p -r1.21 pcre.c --- Src/Modules/pcre.c 6 Jan 2012 10:08:54 -0000 1.21 +++ Src/Modules/pcre.c 16 Sep 2012 12:18:03 -0000 @@ -72,29 +72,73 @@ zpcre_utf8_enabled(void) } /**/ +static char * +unmetafydup_pcre_pattern(const char *metafied) +{ + /* + * Premise: PCRE compilation does not take lengths of pattern strings, + * only of the haystack string, but we want to continue to let patterns + * with embedded NUL characters work, even though we now unmetafy to let + * multibyte text match correctly. + */ + char *new_buffer, *long_enough, *p, *q; + int decoded_length, prenul_length, nul_count, avail_size, need_size; + + new_buffer = ztrdup(metafied); + unmetafy(new_buffer, &decoded_length); + prenul_length = strlen(new_buffer); + if (prenul_length == decoded_length) { + return new_buffer; + } + /* must be an embedded NUL somewhere, replace with \x00 */ + avail_size = strlen(metafied); + for (nul_count = 0, p = new_buffer; p < new_buffer + decoded_length; ++p) { + if (*p == '\0') + nul_count++; + } + need_size = decoded_length + 3 * nul_count; + if (need_size > avail_size) { + long_enough = (char *)zalloc(need_size + 1); + } else { + long_enough = new_buffer; + } + q = long_enough + need_size; + p = new_buffer + decoded_length; + *p-- = *q-- = '\0'; + while (p >= new_buffer) { + if (*p) { + *q-- = *p--; + continue; + } + p--; + *q-- = '0'; *q-- = '0'; *q-- = 'x'; *q-- = '\\'; + } + return long_enough; +} + +/**/ static int bin_pcre_compile(char *nam, char **args, Options ops, UNUSED(int func)) { int pcre_opts = 0, pcre_errptr; const char *pcre_error; char *target; - + if(OPT_ISSET(ops,'a')) pcre_opts |= PCRE_ANCHORED; if(OPT_ISSET(ops,'i')) pcre_opts |= PCRE_CASELESS; if(OPT_ISSET(ops,'m')) pcre_opts |= PCRE_MULTILINE; if(OPT_ISSET(ops,'x')) pcre_opts |= PCRE_EXTENDED; if(OPT_ISSET(ops,'s')) pcre_opts |= PCRE_DOTALL; - + if (zpcre_utf8_enabled()) pcre_opts |= PCRE_UTF8; pcre_hints = NULL; /* Is this necessary? */ - + if (pcre_pattern) pcre_free(pcre_pattern); - target = ztrdup(*args); - unmetafy(target, NULL); + target = unmetafydup_pcre_pattern(*args); pcre_pattern = pcre_compile(target, pcre_opts, &pcre_error, &pcre_errptr, NULL); @@ -123,14 +167,14 @@ bin_pcre_study(char *nam, UNUSED(char ** zwarnnam(nam, "no pattern has been compiled for study"); return 1; } - + pcre_hints = pcre_study(pcre_pattern, 0, &pcre_error); if (pcre_error != NULL) { zwarnnam(nam, "error while studying regex: %s", pcre_error); return 1; } - + return 0; } @@ -148,9 +192,12 @@ zpcre_get_substrings(char *arg, int *ove char *substravar, int want_offset_pair, int matchedinarr, int want_begin_end) { + /* Extracts substrings into variables, metafying the results. */ + char **captures, *match_all, **matches; char offset_all[50]; int capture_start = 1; + int rv; if (matchedinarr) capture_start = 0; @@ -158,16 +205,29 @@ zpcre_get_substrings(char *arg, int *ove matchvar = "MATCH"; if (substravar == NULL) substravar = "match"; - + + /* + * "Unfortunately, the interface to pcre_get_substring_list() is not + * adequate for handling strings containing binary zeros, because the end + * of the final string is not independently indicated." -- pcreapi(3) + * + * Well, by iterating over ovec manually, we can get each length in turn, + * and still use this call to get all the strings alloced in one memory + * block. + */ + /* captures[0] will be entire matched string, [1] first substring */ - if (!pcre_get_substring_list(arg, ovec, ret, (const char ***)&captures)) { + rv = pcre_get_substring_list(arg, ovec, ret, (const char ***)&captures); + if (rv) { + zwarn("internal error getting PCRE substrings (%d)", rv); + } else { int nelem = arrlen(captures)-1; /* Set to the offsets of the complete match */ if (want_offset_pair) { sprintf(offset_all, "%d %d", ovec[0], ovec[1]); setsparam("ZPCRE_OP", ztrdup(offset_all)); } - match_all = metafy(captures[0], -1, META_DUP); + match_all = metafy(captures[0], ovec[1]-ovec[0], META_DUP); setsparam(matchvar, match_all); /* * If we're setting match, mbegin, mend we only do @@ -176,11 +236,15 @@ zpcre_get_substrings(char *arg, int *ove */ if (!want_begin_end || nelem) { char **x, **y; + int index_vec, ylen; y = &captures[capture_start]; + index_vec = capture_start * 2; matches = x = (char **) zalloc(sizeof(char *) * (arrlen(y) + 1)); do { + ylen = ovec[index_vec+1] - ovec[index_vec]; + index_vec += 2; if (*y) - *x++ = metafy(*y, -1, META_DUP); + *x++ = metafy(*y, ylen, META_DUP); else *x++ = NULL; } while (*y++); @@ -188,22 +252,15 @@ zpcre_get_substrings(char *arg, int *ove } if (want_begin_end) { - char *ptr = arg; - zlong offs = 0; - - /* Count the characters before the match */ - MB_METACHARINIT(); - while (ptr < arg + ovec[0]) { - offs++; - ptr += MB_METACHARLEN(ptr); - } - setiparam("MBEGIN", offs + !isset(KSHARRAYS)); - /* Add on the characters in the match */ - while (ptr < arg + ovec[1]) { - offs++; - ptr += MB_METACHARLEN(ptr); - } - setiparam("MEND", offs + !isset(KSHARRAYS) - 1); + /* + * Note: the original string was metafied, the arg we are passed + * is not (and may contain NUL characters). For this routine, + * we only care about character offsets as presented to the user, + * so decoding with MB_METACHARINIT/MB_METACHARLEN is overkill + * (and will loop forever on hitting a NUL). + */ + setiparam("MBEGIN", ovec[0] + !isset(KSHARRAYS)); + setiparam("MEND", ovec[1] + !isset(KSHARRAYS) - 1); if (nelem) { char **mbegin, **mend, **bptr, **eptr; int i, *ipair; @@ -216,22 +273,9 @@ zpcre_get_substrings(char *arg, int *ove ipair += 2, i++, bptr++, eptr++) { char buf[DIGBUFSIZE]; - ptr = arg; - offs = 0; - /* Find the start offset */ - MB_METACHARINIT(); - while (ptr < arg + ipair[0]) { - offs++; - ptr += MB_METACHARLEN(ptr); - } - convbase(buf, offs + !isset(KSHARRAYS), 10); + convbase(buf, ipair[0] + !isset(KSHARRAYS), 10); *bptr = ztrdup(buf); - /* Continue to the end offset */ - while (ptr < arg + ipair[1]) { - offs++; - ptr += MB_METACHARLEN(ptr); - } - convbase(buf, offs + !isset(KSHARRAYS) - 1, 10); + convbase(buf, ipair[1] + !isset(KSHARRAYS) - 1, 10); *eptr = ztrdup(buf); } *bptr = *eptr = NULL; @@ -281,7 +325,7 @@ bin_pcre_match(char *nam, char **args, O zwarnnam(nam, "no pattern has been compiled"); return 1; } - + if(OPT_HASARG(ops,c='a')) { receptacle = OPT_ARG(ops,c); } @@ -297,7 +341,7 @@ bin_pcre_match(char *nam, char **args, O if(!*args) { zwarnnam(nam, "not enough arguments"); } - + if ((ret = pcre_fullinfo(pcre_pattern, pcre_hints, PCRE_INFO_CAPTURECOUNT, &capcount))) { zwarnnam(nam, "error %d in fullinfo", ret); @@ -308,8 +352,7 @@ bin_pcre_match(char *nam, char **args, O ovec = zalloc(ovecsize*sizeof(int)); plaintext = ztrdup(*args); - unmetafy(plaintext, NULL); - subject_len = (int)strlen(plaintext); + unmetafy(plaintext, &subject_len); if (offset_start < 0 || offset_start >= subject_len) ret = PCRE_ERROR_NOMATCH; @@ -326,7 +369,7 @@ bin_pcre_match(char *nam, char **args, O else { zwarnnam(nam, "error in pcre_exec [%d]", ret); } - + if (ovec) zfree(ovec, ovecsize*sizeof(int)); @@ -341,6 +384,7 @@ cond_pcre_match(char **a, int id) const char *pcre_err; char *lhstr, *rhre, *lhstr_plain, *rhre_plain, *avar=NULL; int r = 0, pcre_opts = 0, pcre_errptr, capcnt, *ov, ovsize; + int lhstr_plain_len; int return_value = 0; if (zpcre_utf8_enabled()) @@ -349,9 +393,8 @@ cond_pcre_match(char **a, int id) lhstr = cond_str(a,0,0); rhre = cond_str(a,1,0); lhstr_plain = ztrdup(lhstr); - rhre_plain = ztrdup(rhre); - unmetafy(lhstr_plain, NULL); - unmetafy(rhre_plain, NULL); + unmetafy(lhstr_plain, &lhstr_plain_len); + rhre_plain = unmetafydup_pcre_pattern(rhre); pcre_pat = NULL; ov = NULL; ovsize = 0; @@ -366,19 +409,19 @@ cond_pcre_match(char **a, int id) zwarn("failed to compile regexp /%s/: %s", rhre, pcre_err); break; } - pcre_fullinfo(pcre_pat, NULL, PCRE_INFO_CAPTURECOUNT, &capcnt); - ovsize = (capcnt+1)*3; + pcre_fullinfo(pcre_pat, NULL, PCRE_INFO_CAPTURECOUNT, &capcnt); + ovsize = (capcnt+1)*3; ov = zalloc(ovsize*sizeof(int)); - r = pcre_exec(pcre_pat, NULL, lhstr_plain, strlen(lhstr_plain), 0, 0, ov, ovsize); + r = pcre_exec(pcre_pat, NULL, lhstr_plain, lhstr_plain_len, 0, 0, ov, ovsize); /* r < 0 => error; r==0 match but not enough size in ov * r > 0 => (r-1) substrings found; r==1 => no substrings */ - if (r==0) { + if (r==0) { zwarn("reportable zsh problem: pcre_exec() returned 0"); return_value = 1; break; } - else if (r==PCRE_ERROR_NOMATCH) { + else if (r==PCRE_ERROR_NOMATCH) { return_value = 0; /* no match */ break; } @@ -386,7 +429,7 @@ cond_pcre_match(char **a, int id) zwarn("pcre_exec() error [%d]", r); break; } - else if (r>0) { + else if (r>0) { zpcre_get_substrings(lhstr_plain, ov, r, NULL, avar, 0, isset(BASHREMATCH), !isset(BASHREMATCH)); Index: Test/V07pcre.ztst =================================================================== RCS file: /home/cvsroot/remote-repos/zsh-repo/zsh/Test/V07pcre.ztst,v retrieving revision 1.2 diff -a -u -p -r1.2 V07pcre.ztst --- Test/V07pcre.ztst 26 Oct 2011 23:28:27 -0000 1.2 +++ Test/V07pcre.ztst 16 Sep 2012 12:07:10 -0000 @@ -26,6 +26,12 @@ mkdir multibyte.tmp && cd multibyte.tmp fi + function dump_match_info { + print "$? ${MATCH//$'\0'/_} $match $MBEGIN $MEND\c" + for ((i=1; i<=$#mbegin; i++)) print " ${mbegin[i]} $mend[i]\c" + print + } + %test [[ 'foo→bar' =~ .([^[:ascii:]]). ]] @@ -108,3 +114,20 @@ >1 >0 xo→t →t >0 Xo→t →t + +# embedded NUL characters need caution now that we unmetafy strings for +# multibyte compatibility; different code-paths for [[=~]] vs builtins + [[ $'abc\0de' =~ '^([a-c]+)\x00([de]+)$' ]] ; dump_match_info + [[ $'abc\0de' =~ $'^([a-c]+)\0([de]+)$' ]] ; dump_match_info + pcre_compile '^([a-c]+)\x00([de]+)$' + pcre_match $'abc\0de' + dump_match_info + pcre_compile $'^([a-c]+)\0([de]+)$' + pcre_match $'abc\0de' + dump_match_info +0:pcre matching strings with embedded NUL +>0 abc_de abc de 1 6 1 3 5 6 +>0 abc_de abc de 1 6 1 3 5 6 +>0 abc_de abc de 1 6 1 3 5 6 +>0 abc_de abc de 1 6 1 3 5 6 +