zsh-workers
* [PATCH] Add zsh/re2 module with conditions
@ 2016-09-08  4:15 Phil Pennock
  2016-09-08 13:56 ` [PATCH] re2: fix clean-up path; fix two comments Phil Pennock
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Phil Pennock @ 2016-09-08  4:15 UTC
  To: zsh-workers

Folks,

I tend to get automatically kicked off the -workers list by ezmlm
because I reject mails which are self-declared as spam, so please CC
replies to me.  Also: my commit-bit is currently
surrendered-for-safekeeping because I've not been doing much with Zsh, so
someone else will need to merge this, if it's accepted.

RE2 is a regular expression library, written in C++, from Google.  It
offers most of the features of PCRE, excluding those which can't be
handled without backtracking.  It's BSD-licensed.  This patch adds the
zsh/re2 module.  It uses the `cre2` library for C-language bindings.

At this point, I haven't done anything about rebinding =~ to handle
this.  It's purely new infix-operators based on words.  I'm thinking
perhaps something along the lines of $zsh_reop_modules=(regex), with
`setopt rematch_pcre` becoming a compatibility interface that acts as
though `pcre` were prepended to that list and

  zsh_reop_modules=(pcre regex)

having the same effect.  Then I could use `zsh_reop_modules=(re2 regex)`.
Does this seem sane?  Anyone have better suggestions?  I do want to have
=~ able to use this module, but the current work stands alone and should
be merge-able as-is.
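
A minimal sketch of that compatibility rule, in portable POSIX shell rather
than zsh internals.  The helper name `effective_reop_modules` and its on/off
first argument are hypothetical and appear nowhere in the patch; they only
illustrate how `rematch_pcre` could act as though `pcre` were prepended.

```shell
# Hypothetical helper: print the effective module list for =~.
# $1 is "on" or "off" (standing in for the rematch_pcre option);
# the remaining arguments are the configured zsh_reop_modules entries.
effective_reop_modules() {
    rematch_pcre=$1
    shift
    if [ "$rematch_pcre" = on ]; then
        # Compatibility rule: behave as though pcre were prepended.
        set -- pcre "$@"
    fi
    printf '%s\n' "$*"
}

effective_reop_modules on regex        # prints "pcre regex"
effective_reop_modules off re2 regex   # prints "re2 regex"
```

With this reading, `setopt rematch_pcre` plus the default list and an explicit
`zsh_reop_modules=(pcre regex)` yield the same effective list.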

Is there particular interest in having command-forms too?  There's no
"study" concept, but I suppose compiling a hairy regexp only once might
be good in some situations (but why use shell for those?)

This has been tested on MacOS 10.10.5.

My ulterior motive is that I want "better than zsh/regex" available by
default on MacOS, where Apple build without GPL modules for the system
Zsh.  I hope that by offering this option, Apple's engineers might
incorporate this one day and I can be happier. :)

I've also pushed this code to a GitHub repo, philpennock/zsh-code on the
re2 branch: https://github.com/philpennock/zsh-code/tree/re2

Tested with re2 20160901 installed via Brew, cre2 installed via:

    git clone https://github.com/marcomaggi/cre2
    cd cre2
      LIBTOOLIZE=glibtoolize sh ./autogen.sh
      CXX=g++-6 CC=gcc-6 ./configure --prefix=/opt/regexps
      make doc/stamp-vti
      make
      make install

and Zsh configured with:

    CPPFLAGS=-I/opt/regexps/include LDFLAGS=-L/opt/regexps/lib \
      ./configure --prefix=/opt/zsh-devel --enable-pcre --enable-re2 \
         --enable-cap --enable-multibyte --enable-zsh-secure-free \
         --with-tcsetpgrp --enable-etcdir=/etc

Feedback welcome.
(Oh, I can't spell "tough", it seems; deferring fix for now).

Regards,
-Phil

----------------------------8< git patch >8-----------------------------
Add support for Google's BSD-licensed RE2 library, via the cre2
C-language bindings (also BSD-licensed).

Guard with --enable-re2 for now.

Adds 4 infix conditions.  Currently no commands, no support for changing
how =~ binds.

Includes tests & docs
---
 Doc/Makefile.in     |   2 +-
 Doc/Zsh/mod_re2.yo  |  65 +++++++++++
 INSTALL             |   8 ++
 Src/Modules/re2.c   | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 Src/Modules/re2.mdd |   5 +
 Test/V11re2.ztst    | 170 +++++++++++++++++++++++++++
 configure.ac        |  14 +++
 7 files changed, 587 insertions(+), 1 deletion(-)
 create mode 100644 Doc/Zsh/mod_re2.yo
 create mode 100644 Src/Modules/re2.c
 create mode 100644 Src/Modules/re2.mdd
 create mode 100644 Test/V11re2.ztst

diff --git a/Doc/Makefile.in b/Doc/Makefile.in
index 2752096..8c00876 100644
--- a/Doc/Makefile.in
+++ b/Doc/Makefile.in
@@ -65,7 +65,7 @@ Zsh/mod_datetime.yo Zsh/mod_db_gdbm.yo Zsh/mod_deltochar.yo \
 Zsh/mod_example.yo Zsh/mod_files.yo Zsh/mod_langinfo.yo \
 Zsh/mod_mapfile.yo Zsh/mod_mathfunc.yo Zsh/mod_newuser.yo \
 Zsh/mod_parameter.yo Zsh/mod_pcre.yo Zsh/mod_private.yo \
-Zsh/mod_regex.yo Zsh/mod_sched.yo Zsh/mod_socket.yo \
+Zsh/mod_re2.yo Zsh/mod_regex.yo Zsh/mod_sched.yo Zsh/mod_socket.yo \
 Zsh/mod_stat.yo  Zsh/mod_system.yo Zsh/mod_tcp.yo \
 Zsh/mod_termcap.yo Zsh/mod_terminfo.yo \
 Zsh/mod_zftp.yo Zsh/mod_zle.yo Zsh/mod_zleparameter.yo \
diff --git a/Doc/Zsh/mod_re2.yo b/Doc/Zsh/mod_re2.yo
new file mode 100644
index 0000000..5527440
--- /dev/null
+++ b/Doc/Zsh/mod_re2.yo
@@ -0,0 +1,65 @@
+COMMENT(!MOD!zsh/re2
+Interface to the RE2 regular expression library.
+!MOD!)
+cindex(regular expressions)
+cindex(re2)
+The tt(zsh/re2) module makes available the following test conditions:
+
+startitem()
+findex(re2-match)
+item(var(expr) tt(-re2-match) var(regex))(
+Matches a string against an RE2 regular expression.
+On successful match, the
+matched portion of the string will normally be placed in the tt(MATCH)
+variable.  If there are any capturing parentheses within the regex, then
+the tt(match) array variable will contain those.
+If the match is not successful, then the variables will not be altered.
+
+In addition, the tt(MBEGIN) and tt(MEND) variables are updated to point
+to the offsets within var(expr) for the beginning and end of the matched
+text, with the tt(mbegin) and tt(mend) arrays holding the beginning and
+end of each substring matched.
+
+If tt(BASH_REMATCH) is set, then the array tt(BASH_REMATCH) will be set
+instead of all of the other variables.
+
+Canonical documentation for the syntax accepted by this regular expression
+engine can be found at:
+uref(https://github.com/google/re2/wiki/Syntax)
+)
+enditem()
+
+startitem()
+findex(re2-match-posix)
+item(var(expr) tt(-re2-match-posix) var(regex))(
+Matches as per tt(-re2-match) but configuring the RE2 engine to use
+POSIX syntax.
+)
+enditem()
+
+startitem()
+findex(re2-match-posixperl)
+item(var(expr) tt(-re2-match-posixperl) var(regex))(
+Matches as per tt(-re2-match) but configuring the RE2 engine to use
+POSIX syntax, with the Perl classes and word-boundary extensions re-enabled
+too.
+
+This thus adds support for:
+tt(\d), tt(\s), tt(\w), tt(\D), tt(\S), tt(\W), tt(\b), and tt(\B).
+)
+enditem()
+
+startitem()
+findex(re2-match-longest)
+item(var(expr) tt(-re2-match-longest) var(regex))(
+Matches as per tt(-re2-match) but configuring the RE2 engine to find
+the longest match, instead of the left-most.
+
+For example, given
+
+example([[ abb -re2-match-longest ^a+LPAR()b|bb+RPAR() ]])
+
+This will match the right-branch, thus tt(abb), where tt(-re2-match) would
+instead match only tt(ab).
+)
+enditem()
diff --git a/INSTALL b/INSTALL
index 99895bd..887dd8e 100644
--- a/INSTALL
+++ b/INSTALL
@@ -558,6 +558,14 @@ only be searched for if the option --enable-pcre is passed to configure.
 
 (Future versions of the shell may have a better fix for this problem.)
 
+--enable-re2:
+
+The RE2 library is written in C++, so a C-library shim layer is needed for
+use by Zsh.  We use https://github.com/marcomaggi/cre2 for this, which is
+currently at version 0.3.1.  Both re2 and cre2 need to be installed for
+this option to successfully enable the zsh/re2 module.  The Zsh
+functionality is currently experimental.
+
 --enable-cap:
 
 This searches for POSIX capabilities; if found, the `cap' library
diff --git a/Src/Modules/re2.c b/Src/Modules/re2.c
new file mode 100644
index 0000000..e542723
--- /dev/null
+++ b/Src/Modules/re2.c
@@ -0,0 +1,324 @@
+/*
+ * re2.c
+ *
+ * This file is part of zsh, the Z shell.
+ *
+ * Copyright (c) 2016 Phil Pennock
+ * All Rights Reserved.
+ *
+ * Permission is hereby granted, without written agreement and without
+ * license or royalty fees, to use, copy, modify, and distribute this
+ * software and to distribute modified versions of this software for any
+ * purpose, provided that the above copyright notice and the following
+ * two paragraphs appear in all copies of this software.
+ *
+ * In no event shall Phil Pennock or the Zsh Development Group be liable
+ * to any party for direct, indirect, special, incidental, or consequential
+ * damages arising out of the use of this software and its documentation,
+ * even if Phil Pennock and the Zsh Development Group have been advised of
+ * the possibility of such damage.
+ *
+ * Phil Pennock and the Zsh Development Group specifically disclaim any
+ * warranties, including, but not limited to, the implied warranties of
+ * merchantability and fitness for a particular purpose.  The software
+ * provided hereunder is on an "as is" basis, and Phil Pennock and the
+ * Zsh Development Group have no obligation to provide maintenance,
+ * support, updates, enhancements, or modifications.
+ *
+ */
+
+/* This is heavily based upon my earlier regex module, with Peter's fixes
+ * for the tought stuff I had skipped / gotten wrong. */
+
+#include "re2.mdh"
+#include "re2.pro"
+
+/*
+ * re2 itself is a C++ library; zsh needs C language bindings.
+ * These come from <https://github.com/marcomaggi/cre2>.
+ */
+#include <cre2.h>
+
+/* the conditions we support */
+#define ZRE2_COND_RE2		0
+#define ZRE2_COND_POSIX		1
+#define ZRE2_COND_POSIXPERL	2
+#define ZRE2_COND_LONGEST	3
+
+/**/
+static int
+zcond_re2_match(char **a, int id)
+{
+    cre2_regexp_t *rex;
+    cre2_options_t *opt;
+    cre2_string_t *m, *matches = NULL;
+    char *lhstr, *lhstr_zshmeta, *rhre, *rhre_zshmeta;
+    char **result_array, **x;
+    char *s;
+    char **mbegin, **mend, **bptr, **eptr;
+    size_t matchessz = 0;
+    int return_value, ncaptures, matched, nelem, start, n, indexing_base;
+    int remaining_len, charlen;
+    zlong offs;
+
+    return_value = 0; /* 1 => matched successfully */
+
+    lhstr_zshmeta = cond_str(a,0,0);
+    rhre_zshmeta = cond_str(a,1,0);
+    lhstr = ztrdup(lhstr_zshmeta);
+    unmetafy(lhstr, NULL);
+    rhre = ztrdup(rhre_zshmeta);
+    unmetafy(rhre, NULL);
+
+    opt = cre2_opt_new();
+    if (!opt) {
+	zwarn("re2 opt memory allocation failure");
+	goto CLEANUP_UNMETAONLY;
+    }
+    /* nb: we can set encoding here; re2 assumes UTF-8 by default */
+    cre2_opt_set_log_errors(opt, 0); /* don't hit stderr by default */
+    if (!isset(CASEMATCH)) {
+	cre2_opt_set_case_sensitive(opt, 0);
+    }
+
+    /* "The following options are only consulted when POSIX syntax is enabled;
+     * when POSIX syntax is disabled: these features are always enabled and
+     * cannot be turned off."
+     * Seems hard to mis-parse, but I did.  Okay, Perl classes \d,\w and friends
+     * always on normally, can _also_ be enabled in POSIX mode. */
+
+    switch (id) {
+    case ZRE2_COND_RE2:
+	/* nothing to do, this is default */
+	break;
+    case ZRE2_COND_POSIX:
+	cre2_opt_set_posix_syntax(opt, 1);
+	break;
+    case ZRE2_COND_POSIXPERL:
+	cre2_opt_set_posix_syntax(opt, 1);
+	/* we enable Perl classes (\d, \s, \w, \D, \S, \W)
+	 * and boundaries/not (\b \B) */
+	cre2_opt_set_perl_classes(opt, 1);
+	cre2_opt_set_word_boundary(opt, 1);
+	break;
+    case ZRE2_COND_LONGEST:
+	cre2_opt_set_longest_match(opt, 1);
+	break;
+    default:
+	DPUTS(1, "bad re2 option");
+	goto CLEANUP_UNMETAONLY;
+    }
+
+    rex = cre2_new(rhre, strlen(rhre), opt);
+    if (!rex) {
+	zwarn("re2 regular expression memory allocation failure");
+	goto CLEANUP_OPT;
+    }
+    if (cre2_error_code(rex)) {
+	zwarn("re2 regexp compilation failed: %s", cre2_error_string(rex));
+	goto CLEANUP;
+    }
+
+    ncaptures = cre2_num_capturing_groups(rex);
+    /* the nmatch for cre2_match follows the usual pattern of index 0 holding
+     * the entire matched substring, index 1 holding the first capturing
+     * sub-expression, etc.  So we need ncaptures+1 elements. */
+    matchessz = (ncaptures + 1) * sizeof(cre2_string_t);
+    matches = zalloc(matchessz);
+
+    matched = cre2_match(rex,
+			 lhstr, strlen(lhstr), /* text to match against */
+			 0, strlen(lhstr), /* substring of text to consider */
+			 CRE2_UNANCHORED, /* user should explicitly anchor */
+			 matches, (ncaptures+1));
+    if (!matched)
+	goto CLEANUP;
+    return_value = 1;
+
+    /* We have a match, we will return success, we have array of cre2_string_t
+     * items, each with .data and .length fields pointing into the matched text,
+     * all in unmetafied format.
+     *
+     * We need to collect the results, put together various arrays and offset
+     * variables, while respecting options to change the array set, the indexing
+     * of that array and everything else that 26 years of history has endowed
+     * upon us. */
+    /* option BASHREMATCH set:
+     *    set $BASH_REMATCH instead of $MATCH/$match
+     *    entire matched portion in index 0 (useful with option KSH_ARRAYS)
+     * option _not_ set:
+     *    $MATCH scalar gets entire string
+     *    $match array gets substrings
+     *    $MBEGIN $MEND scalars get offsets of entire match
+     *    $mbegin $mend arrays get offsets of substrings
+     *    all of the offsets depend upon KSHARRAYS to determine indexing!
+     */
+
+    if (isset(BASHREMATCH)) {
+	start = 0;
+	nelem = ncaptures + 1;
+    } else {
+	start = 1;
+	nelem = ncaptures;
+    }
+    result_array = NULL;
+    if (nelem) {
+	result_array = x = (char **) zalloc(sizeof(char *) * (nelem + 1));
+	for (m = matches + start, n = start; n <= ncaptures; ++n, ++m, ++x) {
+	    /* .data is (const char *), metafy can modify in-place so takes
+	     * (char *) but doesn't modify given META_DUP, so safe to drop
+	     * the const. */
+	    *x = metafy((char *)m->data, m->length, META_DUP);
+	}
+	*x = NULL;
+    }
+
+    if (isset(BASHREMATCH)) {
+	setaparam("BASH_REMATCH", result_array);
+	goto CLEANUP;
+    }
+
+    indexing_base = isset(KSHARRAYS) ? 0 : 1;
+
+    setsparam("MATCH", metafy((char *)matches[0].data, matches[0].length, META_DUP));
+    /* count characters before the match */
+    s = lhstr;
+    remaining_len = matches[0].data - lhstr;
+    offs = 0;
+    MB_CHARINIT();
+    while (remaining_len) {
+	offs++;
+	charlen = MB_CHARLEN(s, remaining_len);
+	s += charlen;
+	remaining_len -= charlen;
+    }
+    setiparam("MBEGIN", offs + indexing_base);
+    /* then the characters within the match */
+    remaining_len = matches[0].length;
+    while (remaining_len) {
+	offs++;
+	charlen = MB_CHARLEN(s, remaining_len);
+	s += charlen;
+	remaining_len -= charlen;
+    }
+    /* zsh ${foo[a,b]} is inclusive of end-points, [a,b] not [a,b) */
+    setiparam("MEND", offs + indexing_base - 1);
+    if (!nelem) {
+	goto CLEANUP;
+    }
+
+    bptr = mbegin = (char **)zalloc(sizeof(char *)*(nelem+1));
+    eptr = mend = (char **)zalloc(sizeof(char *)*(nelem+1));
+    for (m = matches + start, n = 0;
+	 n < nelem;
+	 ++n, ++m, ++bptr, ++eptr)
+    {
+	char buf[DIGBUFSIZE];
+	if (m->data == NULL) {
+	    /* FIXME: have assumed this is the API for non-matching substrings; confirm! */
+	    *bptr = ztrdup("-1");
+	    *eptr = ztrdup("-1");
+	    continue;
+	}
+	s = lhstr;
+	remaining_len = m->data - lhstr;
+	offs = 0;
+	/* Find the start offset */
+	MB_CHARINIT();
+	while (remaining_len) {
+	    offs++;
+	    charlen = MB_CHARLEN(s, remaining_len);
+	    s += charlen;
+	    remaining_len -= charlen;
+	}
+	convbase(buf, offs + indexing_base, 10);
+	*bptr = ztrdup(buf);
+	/* Continue to the end offset */
+	remaining_len = m->length;
+	while (remaining_len) {
+	    offs++;
+	    charlen = MB_CHARLEN(s, remaining_len);
+	    s += charlen;
+	    remaining_len -= charlen;
+	}
+	convbase(buf, offs + indexing_base - 1, 10);
+	*eptr = ztrdup(buf);
+    }
+    *bptr = *eptr = NULL;
+
+    setaparam("match", result_array);
+    setaparam("mbegin", mbegin);
+    setaparam("mend", mend);
+
+CLEANUP:
+    if (matches)
+	zfree(matches, matchessz);
+    cre2_delete(rex);
+CLEANUP_OPT:
+    cre2_opt_delete(opt);
+CLEANUP_UNMETAONLY:
+    free(lhstr);
+    free(rhre);
+    return return_value;
+}
+
+
+static struct conddef cotab[] = {
+    CONDDEF("re2-match", CONDF_INFIX, zcond_re2_match, 0, 0, ZRE2_COND_RE2),
+    CONDDEF("re2-match-posix", CONDF_INFIX, zcond_re2_match, 0, 0, ZRE2_COND_POSIX),
+    CONDDEF("re2-match-posixperl", CONDF_INFIX, zcond_re2_match, 0, 0, ZRE2_COND_POSIXPERL),
+    CONDDEF("re2-match-longest", CONDF_INFIX, zcond_re2_match, 0, 0, ZRE2_COND_LONGEST),
+};
+
+
+static struct features module_features = {
+    NULL, 0,
+    cotab, sizeof(cotab)/sizeof(*cotab),
+    NULL, 0,
+    NULL, 0,
+    0
+};
+
+
+/**/
+int
+setup_(UNUSED(Module m))
+{
+    return 0;
+}
+
+/**/
+int
+features_(Module m, char ***features)
+{
+    *features = featuresarray(m, &module_features);
+    return 0;
+}
+
+/**/
+int
+enables_(Module m, int **enables)
+{
+    return handlefeatures(m, &module_features, enables);
+}
+
+/**/
+int
+boot_(UNUSED(Module m))
+{
+    return 0;
+}
+
+/**/
+int
+cleanup_(Module m)
+{
+    return setfeatureenables(m, &module_features, NULL);
+}
+
+/**/
+int
+finish_(UNUSED(Module m))
+{
+    return 0;
+}
diff --git a/Src/Modules/re2.mdd b/Src/Modules/re2.mdd
new file mode 100644
index 0000000..b20838c
--- /dev/null
+++ b/Src/Modules/re2.mdd
@@ -0,0 +1,5 @@
+name=zsh/re2
+link='if test "x$enable_re2" = xyes && test "x$ac_cv_lib_cre2_cre2_version_string" = xyes; then echo dynamic; else echo no; fi'
+load=no
+
+objects="re2.o"
diff --git a/Test/V11re2.ztst b/Test/V11re2.ztst
new file mode 100644
index 0000000..d6e327c
--- /dev/null
+++ b/Test/V11re2.ztst
@@ -0,0 +1,170 @@
+%prep
+
+  if ! zmodload -F zsh/re2 C:re2-match 2>/dev/null
+  then
+    ZTST_unimplemented="the zsh/re2 module is not available"
+    return 0
+  fi
+# Load the rest of the builtins
+  zmodload zsh/re2
+  ##FIXME#setopt rematch_pcre
+# Find a UTF-8 locale.
+  setopt multibyte
+# Don't let LC_* override our choice of locale.
+  unset -m LC_\*
+  mb_ok=
+  langs=(en_{US,GB}.{UTF-,utf}8 en.UTF-8
+	 $(locale -a 2>/dev/null | egrep 'utf8|UTF-8'))
+  for LANG in $langs; do
+    if [[ é = ? ]]; then
+      mb_ok=1
+      break;
+    fi
+  done
+  if [[ -z $mb_ok ]]; then
+    ZTST_unimplemented="no UTF-8 locale or multibyte mode is not implemented"
+  else
+    print -u $ZTST_fd Testing RE2 multibyte with locale $LANG
+    mkdir multibyte.tmp && cd multibyte.tmp
+  fi
+
+%test
+
+  [[ 'foo→bar' -re2-match .([^[:ascii:]]). ]]
+  print $MATCH
+  print $match[1]
+0:Basic non-ASCII regexp matching
+>o→b
+>→
+
+  [[ alphabeta -re2-match a([^a]+)a ]]
+  echo "$? basic"
+  print $MATCH
+  print $match[1]
+  [[ ! alphabeta -re2-match a(.+)a ]]
+  echo "$? negated op"
+  [[ alphabeta -re2-match ^b ]]
+  echo "$? failed match"
+# default matches on first, then takes longest substring
+# -longest keeps looking
+  [[ abb -re2-match a(b|bb) ]]
+  echo "$? first .${MATCH}.${match[1]}."
+  [[ abb -re2-match-longest a(b|bb) ]]
+  echo "$? longest .${MATCH}.${match[1]}."
+  [[ alphabeta -re2-match ab ]]; echo "$? unanchored"
+  [[ alphabeta -re2-match ^ab ]]; echo "$? anchored"
+  [[ alphabeta -re2-match '^a(\w+)a$' ]]
+  echo "$? perl class used"
+  echo ".${MATCH}. .${match[1]}."
+  [[ alphabeta -re2-match-posix '^a(\w+)a$' ]]
+  echo "$? POSIX-mode, should inhibit Perl class"
+  [[ alphabeta -re2-match-posixperl '^a(\w+)a$' ]]
+  echo "$? POSIX-mode with Perl classes enabled .${match[1]}."
+  unset MATCH match
+  [[ alphabeta -re2-match ^a([^a]+)a([^a]+)a$ ]]
+  echo "$? matched, set vars"
+  echo ".$MATCH. ${#MATCH}"
+  echo ".${(j:|:)match[*]}."
+  unset MATCH match
+  [[ alphabeta -re2-match fr(.+)d ]]
+  echo "$? unmatched, not setting MATCH/match"
+  echo ".$MATCH. ${#MATCH}"
+  echo ".${(j:|:)match[*]}."
+0:Basic matching & result codes
+>0 basic
+>alpha
+>lph
+>1 negated op
+>1 failed match
+>0 first .ab.b.
+>0 longest .abb.bb.
+>0 unanchored
+>1 anchored
+>0 perl class used
+>.alphabeta. .lphabet.
+>1 POSIX-mode, should inhibit Perl class
+>0 POSIX-mode with Perl classes enabled .lphabet.
+>0 matched, set vars
+>.alphabeta. 9
+>.lph|bet.
+>1 unmatched, not setting MATCH/match
+>.. 0
+>..
+
+  m() {
+    unset MATCH MBEGIN MEND match mbegin mend
+    [[ $2 -re2-match $3 ]]
+    print $? $1: m:${MATCH}: ma:${(j:|:)match}: MBEGIN=$MBEGIN MEND=$MEND mbegin="(${mbegin[*]})" mend="(${mend[*]})"
+  }
+  data='alpha beta gamma delta'
+  m uncapturing $data '\b\w+\b'
+  m capturing $data '\b(\w+)\b'
+  m 'capture 2' $data '\b(\w+)\s+(\w+)\b'
+  m 'capture repeat' $data '\b(?:(\w+)\s+)+(\w+)\b'
+0:Beginning and end testing
+>0 uncapturing: m:alpha: ma:: MBEGIN=1 MEND=5 mbegin=() mend=()
+>0 capturing: m:alpha: ma:alpha: MBEGIN=1 MEND=5 mbegin=(1) mend=(5)
+>0 capture 2: m:alpha beta: ma:alpha|beta: MBEGIN=1 MEND=10 mbegin=(1 7) mend=(5 10)
+>0 capture repeat: m:alpha beta gamma delta: ma:gamma|delta: MBEGIN=1 MEND=22 mbegin=(12 18) mend=(16 22)
+
+
+  unset match mend
+  s=$'\u00a0'
+  [[ $s -re2-match '^.$' ]] && print OK
+  [[ A${s}B -re2-match .(.). && $match[1] == $s ]] && print OK
+  [[ A${s}${s}B -re2-match A([^[:ascii:]]*)B && $mend[1] == 3 ]] && print OK
+  unset s
+0:Raw IMETA characters in input string
+>OK
+>OK
+>OK
+
+  [[ foo -re2-match f.+ ]] ; print $?
+  [[ foo -re2-match x.+ ]] ; print $?
+  [[ ! foo -re2-match f.+ ]] ; print $?
+  [[ ! foo -re2-match x.+ ]] ; print $?
+  [[ foo -re2-match f.+ && bar -re2-match b.+ ]] ; print $?
+  [[ foo -re2-match x.+ && bar -re2-match b.+ ]] ; print $?
+  [[ foo -re2-match f.+ && bar -re2-match x.+ ]] ; print $?
+  [[ ! foo -re2-match f.+ && bar -re2-match b.+ ]] ; print $?
+  [[ foo -re2-match f.+ && ! bar -re2-match b.+ ]] ; print $?
+  [[ ! ( foo -re2-match f.+ && bar -re2-match b.+ ) ]] ; print $?
+  [[ ! foo -re2-match x.+ && bar -re2-match b.+ ]] ; print $?
+  [[ foo -re2-match x.+ && ! bar -re2-match b.+ ]] ; print $?
+  [[ ! ( foo -re2-match x.+ && bar -re2-match b.+ ) ]] ; print $?
+0:Regex result inversion detection
+>0
+>1
+>1
+>0
+>0
+>1
+>1
+>1
+>1
+>1
+>0
+>1
+>0
+
+# Subshell because crash on failure
+  ( [[ test.txt -re2-match '^(.*_)?(test)' ]]
+    echo $match[2] )
+0:regression for segmentation fault (pcre, dup for re2), workers/38307
+>test
+
+  setopt BASH_REMATCH KSH_ARRAYS
+  unset MATCH MBEGIN MEND match mbegin mend BASH_REMATCH
+  [[ alphabeta -re2-match '^a([^a]+)(a)([^a]+)a$' ]]
+  echo "$? bash_rematch"
+  echo "m:${MATCH}: ma:${(j:|:)match}:"
+  echo MBEGIN=$MBEGIN MEND=$MEND mbegin="(${mbegin[*]})" mend="(${mend[*]})"
+  echo "BASH_REMATCH=[${(j:, :)BASH_REMATCH[@]}]"
+  echo "[0]=${BASH_REMATCH[0]} [1]=${BASH_REMATCH[1]}"
+0:bash_rematch works
+>0 bash_rematch
+>m:: ma::
+>MBEGIN= MEND= mbegin=() mend=()
+>BASH_REMATCH=[alphabeta, lph, a, bet]
+>[0]=alphabeta [1]=lph
+
diff --git a/configure.ac b/configure.ac
index 0e0bd53..9c23691 100644
--- a/configure.ac
+++ b/configure.ac
@@ -442,6 +442,11 @@ AC_ARG_ENABLE(pcre,
 AC_HELP_STRING([--enable-pcre],
 [enable the search for the pcre library (may create run-time library dependencies)]))
 
+dnl Do you want to look for re2 support?
+AC_ARG_ENABLE(re2,
+AC_HELP_STRING([--enable-re2],
+[enable the search for cre2 C-language bindings and re2 library]))
+
 dnl Do you want to look for capability support?
 AC_ARG_ENABLE(cap,
 AC_HELP_STRING([--enable-cap],
@@ -683,6 +688,15 @@ if test "x$ac_cv_prog_PCRECONF" = xpcre-config; then
 fi
 fi
 
+if test x$enable_re2 = xyes; then
+AC_CHECK_LIB([re2],[main],,
+  [AC_MSG_FAILURE([test for RE2 library failed])])
+AC_CHECK_LIB([cre2],[cre2_version_string],,
+  [AC_MSG_FAILURE([test for CRE2 library failed])])
+AC_CHECK_HEADERS([cre2.h],,
+  [AC_MSG_ERROR([test for CRE2 header failed])])
+fi
+
 AC_CHECK_HEADERS(sys/time.h sys/times.h sys/select.h termcap.h termio.h \
 		 termios.h sys/param.h sys/filio.h string.h memory.h \
 		 limits.h fcntl.h libc.h sys/utsname.h sys/resource.h \
-- 
2.10.0



* [PATCH] re2: fix clean-up path; fix two comments
  2016-09-08  4:15 [PATCH] Add zsh/re2 module with conditions Phil Pennock
@ 2016-09-08 13:56 ` Phil Pennock
  2016-09-08 21:14 ` [PATCH] Add zsh/re2 module with conditions Oliver Kiddle
       [not found] ` <20160908144203.GA28545@fujitsu.shahaf.local2>
  2 siblings, 0 replies; 8+ messages in thread
From: Phil Pennock @ 2016-09-08 13:56 UTC
  To: zsh-workers

On 2016-09-08 at 00:15 -0400, Phil Pennock wrote:
> I've also pushed this code to a GitHub repo, philpennock/zsh-code on the
> re2 branch: https://github.com/philpennock/zsh-code/tree/re2

This change is there too.

> (Oh, I can't spell "tough", it seems; deferring fix for now).

Fixed.  Also fixed a bug described just below in the patch body, and
swapped a FIXME comment for a TODO, referencing whatever future work
changes =~ binding.  (Feedback on that idea, outlined in previous mail,
appreciated!)

-Phil

----------------------------8< git patch >8-----------------------------

The clean-up path is for an internal function being passed an id which
it can't handle, but the ids come from this file, so it's protection
against coding mistakes in future extension.  In that hypothetical case,
we'd leak the memory of one RE2 opt object each time the matching
function was called in the unhandled id-profile.

Also clean up two comments.
---
 Src/Modules/re2.c | 4 ++--
 Test/V11re2.ztst  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Src/Modules/re2.c b/Src/Modules/re2.c
index e542723..f6a5283 100644
--- a/Src/Modules/re2.c
+++ b/Src/Modules/re2.c
@@ -28,7 +28,7 @@
  */
 
 /* This is heavily based upon my earlier regex module, with Peter's fixes
- * for the tought stuff I had skipped / gotten wrong. */
+ * for the tougher stuff I had skipped / gotten wrong. */
 
 #include "re2.mdh"
 #include "re2.pro"
@@ -106,7 +106,7 @@ zcond_re2_match(char **a, int id)
 	break;
     default:
 	DPUTS(1, "bad re2 option");
-	goto CLEANUP_UNMETAONLY;
+	goto CLEANUP_OPT;
     }
 
     rex = cre2_new(rhre, strlen(rhre), opt);
diff --git a/Test/V11re2.ztst b/Test/V11re2.ztst
index d6e327c..823a5ef 100644
--- a/Test/V11re2.ztst
+++ b/Test/V11re2.ztst
@@ -7,7 +7,7 @@
   fi
 # Load the rest of the builtins
   zmodload zsh/re2
-  ##FIXME#setopt rematch_pcre
+  # TODO: use future mechanism to switch =~ to use re2 and test =~ too
 # Find a UTF-8 locale.
   setopt multibyte
 # Don't let LC_* override our choice of locale.
-- 
2.10.0



* Re: [PATCH] Add zsh/re2 module with conditions
  2016-09-08  4:15 [PATCH] Add zsh/re2 module with conditions Phil Pennock
  2016-09-08 13:56 ` [PATCH] re2: fix clean-up path; fix two comments Phil Pennock
@ 2016-09-08 21:14 ` Oliver Kiddle
  2016-09-08 21:48   ` Phil Pennock
       [not found] ` <20160908144203.GA28545@fujitsu.shahaf.local2>
  2 siblings, 1 reply; 8+ messages in thread
From: Oliver Kiddle @ 2016-09-08 21:14 UTC
  To: zsh-workers; +Cc: Phil Pennock

Phil Pennock wrote:
> At this point, I haven't done anything about rebinding =~ to handle
> this.  It's purely new infix-operators based on words.  I'm thinking
> perhaps something along the lines of $zsh_reop_modules=(regex), with
> `setopt rematch_pcre` becoming a compatibility interface that acts as
> though `pcre` were prepended to that list and
>
>   zsh_reop_modules=(pcre regex)
>
> having the same effect.  Then I could use `zsh_reop_modules=(re2 regex)`.
> Does this seem sane?  Anyone have better suggestions?  I do want to have

If the first listed module in the array has control of =~, what is
the meaning of subsequent ones?

How about perhaps using a module alias so you would do, e.g.
  zmodload -A zsh/default/regex=zsh/re2

Oliver



* Re: [PATCH] Add zsh/re2 module with conditions
  2016-09-08 21:14 ` [PATCH] Add zsh/re2 module with conditions Oliver Kiddle
@ 2016-09-08 21:48   ` Phil Pennock
  0 siblings, 0 replies; 8+ messages in thread
From: Phil Pennock @ 2016-09-08 21:48 UTC
  To: Oliver Kiddle; +Cc: zsh-workers

On 2016-09-08 at 23:14 +0200, Oliver Kiddle wrote:
> Phil Pennock wrote:
> > At this point, I haven't done anything about rebinding =~ to handle
> > this.  It's purely new infix-operators based on words.  I'm thinking
> > perhaps something along the lines of $zsh_reop_modules=(regex), with
> > `setopt rematch_pcre` becoming a compatibility interface that acts as
> > though `pcre` were prepended to that list and
> >
> >   zsh_reop_modules=(pcre regex)
> >
> > having the same effect.  Then I could use `zsh_reop_modules=(re2 regex)`.
> > Does this seem sane?  Anyone have better suggestions?  I do want to have
> 
> If the first listed module in the array has control of =~, what is
> the meaning of subsequent ones?

Ignored, as long as the first one could be loaded.

The first loadable one gets =~, and it's bound and tied at that point.
If the variable is reassigned, the shell would work through the list
again.
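
A rough sketch of that "first loadable one wins" rule, in portable POSIX
shell.  `module_loadable`, `pick_reop_module` and the `AVAILABLE_MODULES`
list are hypothetical stand-ins invented for illustration; in zsh the
loadability check would be an attempt to load the module, which is not
modelled here.

```shell
# Hypothetical stand-in for "can this module be loaded?".
# A fixed list keeps the sketch runnable in any POSIX shell.
AVAILABLE_MODULES="regex pcre"

module_loadable() {
    case " $AVAILABLE_MODULES " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Print the first loadable module among the arguments.
# Later entries are ignored once one succeeds; fail if none loads.
pick_reop_module() {
    for mod in "$@"; do
        if module_loadable "$mod"; then
            printf '%s\n' "$mod"
            return 0
        fi
    done
    return 1
}

pick_reop_module re2 regex    # prints "regex": re2 is not in the list
pick_reop_module pcre regex   # prints "pcre": regex is never consulted
```

Reassigning the list would simply mean calling the selection again with the
new arguments.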

> How about perhaps using a module alias so you would do, e.g.
>   zmodload -A zsh/default/regex=zsh/re2

It would probably need to be more than that, so that explicit features
can be aliased.  It's the C: infix conditions which need to be grabbed,
and each module provides a differently named one.

-Phil


* Re: zsh/re2 : avoid until further notice
       [not found]             ` <20160910190924.GB4045@fujitsu.shahaf.local2>
@ 2016-09-11 19:23               ` Phil Pennock
  2016-09-11 19:27                 ` Phil Pennock
  2016-09-12  3:50                 ` Daniel Shahaf
  0 siblings, 2 replies; 8+ messages in thread
From: Phil Pennock @ 2016-09-11 19:23 UTC
  To: zsh-workers

[ returning to on-list with Daniel's permission; being a little
  repetitive with my answer to provide context for others ]

On 2016-09-10 at 19:09 +0000, Daniel Shahaf wrote:
> Phil Pennock wrote on Sat, Sep 10, 2016 at 01:04:56 +0000:
> > On 2016-09-09 at 04:57 +0000, Daniel Shahaf wrote:
> > > I don't follow the bit about "require setting a var".  When the test
> > > program crashes, AC_TRY_RUN evaluates its third parameter; when
> > > cross-compiling, it evaluates the fourth.  The first three actual
> > > parameters seem fine to me.  The fourth as written makes --with-re2 and
> > > cross-builds mutually exclusive.
> > 
> > In my testing (autoconf 2.69) I initially had:
> >   PROG  =yes  =no   =unknown
> 
> That's exactly what I would expect the result of the test to be.  
> 
> > and the =unknown for cross-compiling meant that when the build crashed,
> > autoconf declared the status 'unknown' and continued.
> > 
> 
> What do you mean, "when the build crashed"?  When cross-compiling,
> AC_TRY_RUN does not attempt to run the program it compiles and links.
> If you mean, "when the test program would crash if run in the target
> hardware", then I think there ought to be a way to enable re2 even for
> cross-compiling.  It could be, for example:

At no point have I tried this with cross-compiling.  I will also state
up-front that I don't know autotools very well.

If the cre2 shim library, which provides the C-language bindings to the
re2 C++ library, is built with a different C++ compiler from the one used
for re2, then various weird failures happen at runtime.  I have not been
able to reduce this to a simple case which will reliably "exit false" on
failure.  The test program in the `AC_TRY_RUN` is as likely to segfault
as to exit cleanly with a failure status.
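
To illustrate the distinction between a probe that exits false and one that
crashes, here is a small POSIX shell sketch.  `classify_probe` is a
hypothetical helper, not what autoconf generates; it relies only on the
shell convention that a child killed by a signal is reported with an exit
status above 128.

```shell
# classify_probe CMD [ARGS...]: run a probe, print yes, no, or crashed.
classify_probe() {
    "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo yes
    elif [ "$status" -gt 128 ]; then
        # Death by signal (for example SIGSEGV) shows up as status > 128.
        echo crashed
    else
        echo no
    fi
}

classify_probe true                         # prints "yes"
classify_probe false                        # prints "no"
classify_probe sh -c 'kill -s SEGV $$'      # prints "crashed"
```

A configure check that treats "crashed" the same as "no" would abort in both
of the bad cases described here.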

In my environment (64-bit Intel, MacOS 10.10.5, re2 built with clang++
and cre2 built with either clang++ or g++-6), I ran this test program
against cre2 built with each compiler, and it seems to detect reliably
when the combination is bad.

With the AC_TRY_RUN(PROG,=yes,=no,=unknown), when the program crashed,
the `./configure` run reported the status of that test as "unknown" and
continued on.  Yes, this was building locally, no cross-compiling.
Changing the last parameter of the AC_TRY_RUN to be =no instead of
=unknown resulted in the generated configure script correctly aborting.
This was with autoconf 2.69.

This is not my reading of the documented behaviour of autotools, and I
don't think it's your reading either.  I am not asserting anything about
how cross-compilation should work; for myself, =unknown and
proceed-without-the-safety-check seemed a reasonable approach, which is
why I tried it first and discovered that the non-cross-compiling build
was affected by that fourth parameter.

I don't know how to deal with this, other than if someone else can
reproduce both the working and crashing scenarios and suggest a better
test.  In practice, the patch as included in zsh-workers 39249 works for
me.

>     if test=unknown enableval=yes: issue a warning and go through with the build
>     if test=unknown enableval=no: build without re2
> 
> Or:
> 
>     if test=unknown enableval=force: go through with the build
>     if test=unknown enableval=yes: abort the build, instructing user to retry with --enable-re2=force
>     if test=unknown enableval=no: build without re2
> 
> (These examples are incomplete: they don't cover the non-cross case nor
> the "neither --enable-re2 nor --disable-re2 passed" case.)
> 
> (In light of your later remarks: feel free to CC the list upon reply.)

At this point, the test is only run if `--enable-re2` is passed; the
feature is hard-guarded by that flag.
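
For reference, the second policy quoted above could be sketched in shell
roughly as follows.  This is only a sketch under assumptions: the
function name `re2_decide` and the result words it prints are
hypothetical, and `force` is not a value the current patch accepts.

```shell
# Hypothetical helper: $1 is the "runtime broken" test result
# (no, yes or unknown), $2 is the --enable-re2 value (no, yes or force).
re2_decide() {
  result=$1
  enableval=$2
  if [ "$enableval" = no ]; then
    echo skip                      # build without re2
    return 0
  fi
  case "$result:$enableval" in
    no:*)          echo build ;;   # test ran and passed
    yes:*)         echo abort ;;   # test ran and failed or crashed
    unknown:force) echo build ;;   # user accepted the risk
    unknown:yes)   echo abort-suggest-force ;;
    *)             echo skip ;;    # anything unexpected: leave re2 out
  esac
}

re2_decide unknown force
re2_decide unknown yes
re2_decide unknown no
```

Like the quoted examples, this leaves out the "neither --enable-re2 nor
--disable-re2 passed" case.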

Also, I don't know if any of the ac logic should be in aclocal.m4,
aczsh.m4, should stay in configure.ac or something else, and how this
should be decided.

If anyone has tried the zsh-workers 39249 patch, success/failure reports
even on the basic functionality would be nice.  :)

Thanks,
-Phil



* Re: zsh/re2 : avoid until further notice
  2016-09-11 19:23               ` zsh/re2 : avoid until further notice Phil Pennock
@ 2016-09-11 19:27                 ` Phil Pennock
  2016-09-12  3:50                 ` Daniel Shahaf
  1 sibling, 0 replies; 8+ messages in thread
From: Phil Pennock @ 2016-09-11 19:27 UTC (permalink / raw)
  To: zsh-workers

On 2016-09-11 at 19:23 +0000, Phil Pennock wrote:
> [ returning to on-list with Daniel's permission; being a little
>   repetitive with my answer to provide context for others ]

*sigh* and because this was sent from my regular mail-reading box,
not the box where I can PGP-sign mail, and because I'd not done a
`git submodule update` in the config area, that mail again went out
with `Mail-Followup-To:` set.

Sorry.



* Re: zsh/re2 : avoid until further notice
  2016-09-11 19:23               ` zsh/re2 : avoid until further notice Phil Pennock
  2016-09-11 19:27                 ` Phil Pennock
@ 2016-09-12  3:50                 ` Daniel Shahaf
  2016-09-14 18:47                   ` Phil Pennock
  1 sibling, 1 reply; 8+ messages in thread
From: Daniel Shahaf @ 2016-09-12  3:50 UTC (permalink / raw)
  To: zsh-workers

Phil Pennock wrote on Sun, Sep 11, 2016 at 19:23:51 +0000:
> With the AC_TRY_RUN(PROG,=yes,=no,=unknown), when the program crashed,
> the `./configure` run reported the status of that test as "unknown" and
> continued on.  Yes, this was building locally, no cross-compiling.

A comma is missing after the first "yes":

+  AC_TRY_RUN([
⋮
+}],
+  zsh_cv_cre2_runtime_broken=no,
+  zsh_cv_cre2_runtime_broken=yes
+  zsh_cv_cre2_runtime_broken=yes))

(So the "=no =yes =unknown" took the 'unknown' for part of the if-false
action, rather than the if-cross action.)
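
A minimal shell sketch of the effect, assuming the native-build branch of
the generated configure script boils down to an if/else around running
the test program.  The function `run_conftest` and the variables `fixed`
and `merged` are hypothetical stand-ins, not names from the patch.

```shell
# Stand-in for a test program that fails or crashes (hypothetical).
run_conftest() { return 1; }

# With the comma in place, the if-false action is a single assignment.
if run_conftest; then
  fixed=no
else
  fixed=yes
fi

# Without the comma, both assignments land in the if-false action and
# the later one wins.
if run_conftest; then
  merged=no
else
  merged=yes
  merged=unknown
fi

echo "fixed=$fixed merged=$merged"
```

With a failing test program the merged form ends up as "unknown", which
matches what was reported for the native build.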

> Also, I don't know if any of the ac logic should be in aclocal.m4,
> aczsh.m4, should stay in configure.ac or something else, and how this
> should be decided.

I'd vote for configure.ac.  The other two seem to be about autoconf
library functions and about system tests (as opposed to dependency
libraries).

> If anyone has tried the zsh-workers 39249 patch, success/failure reports
> even on the basic functionality would be nice.  :)

I haven't, sorry :(

FWIW on my OS (debian) I see only a package for re2, not for cre2.

Cheers,

Daniel



* Re: zsh/re2 : avoid until further notice
  2016-09-12  3:50                 ` Daniel Shahaf
@ 2016-09-14 18:47                   ` Phil Pennock
  0 siblings, 0 replies; 8+ messages in thread
From: Phil Pennock @ 2016-09-14 18:47 UTC (permalink / raw)
  To: Daniel Shahaf; +Cc: zsh-workers

On 2016-09-12 at 03:50 +0000, Daniel Shahaf wrote:
> A comma is missing after the first "yes":
> 
> +  AC_TRY_RUN([
> ⋮
> +}],
> +  zsh_cv_cre2_runtime_broken=no,
> +  zsh_cv_cre2_runtime_broken=yes
> +  zsh_cv_cre2_runtime_broken=yes))

*facepalm*

Thank you.

diff --git a/configure.ac b/configure.ac
index c000d6a..f055583 100644
--- a/configure.ac
+++ b/configure.ac
@@ -718,8 +718,8 @@ int main(int argc, char **argv) {
        return 0;
 }],
   zsh_cv_cre2_runtime_broken=no,
-  zsh_cv_cre2_runtime_broken=yes
-  zsh_cv_cre2_runtime_broken=yes))
+  zsh_cv_cre2_runtime_broken=yes,
+  zsh_cv_cre2_runtime_broken=unknown))
   if test x$zsh_cv_cre2_runtime_broken = xyes; then
     AC_MSG_ERROR([cre2 library hard-unusable, rebuild with same compiler as for RE2])
   fi



end of thread, other threads:[~2016-09-14 18:47 UTC | newest]

Thread overview: 8+ messages
2016-09-08  4:15 [PATCH] Add zsh/re2 module with conditions Phil Pennock
2016-09-08 13:56 ` [PATCH] re2: fix clean-up path; fix two comments Phil Pennock
2016-09-08 21:14 ` [PATCH] Add zsh/re2 module with conditions Oliver Kiddle
2016-09-08 21:48   ` Phil Pennock
     [not found] ` <20160908144203.GA28545@fujitsu.shahaf.local2>
     [not found]   ` <20160908204737.GA12164@breadbox.private.spodhuis.org>
     [not found]     ` <20160908211643.GA4432@fujitsu.shahaf.local2>
     [not found]       ` <20160909005557.GB12371@breadbox.private.spodhuis.org>
     [not found]         ` <20160909045739.GA6623@fujitsu.shahaf.local2>
     [not found]           ` <20160910010456.GA85981@tower.spodhuis.org>
     [not found]             ` <20160910190924.GB4045@fujitsu.shahaf.local2>
2016-09-11 19:23               ` zsh/re2 : avoid until further notice Phil Pennock
2016-09-11 19:27                 ` Phil Pennock
2016-09-12  3:50                 ` Daniel Shahaf
2016-09-14 18:47                   ` Phil Pennock

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/
