zsh-workers
* zsh/bash behavior variance: regex ERE matching
@ 2018-03-14  2:40 Phil Pennock
  2018-03-14 14:37 ` Stephane Chazelas
From: Phil Pennock @ 2018-03-14  2:40 UTC
  To: zsh-workers

This is just to note that I have observed a behavior variance.  My
proposed solution is to do absolutely nothing, and accept the variance
as "sane in an insane world".

Note that, per my standing practice, I do not cause risk to a code-base
which does not belong to me by reading GPL code of a related code-base,
so still have not read the bash code.  (I like the GPL and use it
elsewhere, but Zsh isn't GPL and it's not my call to risk that, so I
stubbornly refuse to risk it).  Descriptions of bash are based on
surmise from observed behavior.

Background: when bash copied the Perl-ish `=~` syntax, they declared it
to be an ERE match.  When I saw that Bash had added the `=~` comparison
infix operator, I went "that's a good idea" and did likewise for Zsh;
during on-list discussion at the time, the core maintainers expressed a
preference for closer compatibility with Bash, so I wrote the
`zsh/regex` module to do ERE matching and introduced the `re_match_pcre`
option to let folks map `=~` onto our long-standing `-pcre-match` infix
operator.  (I think Peter chose to make zsh/regex the default always,
which was very sane.)

Situation: on macOS (10.12.6, Sierra), the regex library is based on
TRE, not on Henry Spencer's library or any other.  Further, re_format(7)
documents a number of features for `REG_ENHANCED` mode, as distinct from
`REG_EXTENDED`.  These are Perl-ish/PCRE-ish features such as `\d` for
`[[:digit:]]` and `(?:whatever)` for non-capturing grouping.

Using Zsh 5.4.2 built from Homebrew, which has no relevant patches, the
`=~` operator in Zsh is picking up features documented as `REG_ENHANCED`
when we only ask for `REG_EXTENDED`.  Homebrew reports that zsh is:

    Built from source on 2018-01-07 at 18:10:37 with: --with-unicode9 --with-gdbm --with-pcre

Specifically, the added features are the two features cited above,
`\d` and `(?:...)`.

So: we ask for ERE, we get ERE+nonstandard.
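
A quick way to see which of these features a given shell build picks up
is a probe along the following lines.  This is only a sketch: the helper
names `ere_accepts` and `probe` are made up for illustration, and the
regex is passed through an unquoted variable so that bash hands it to
regcomp() unchanged.  The result for the last two lines depends entirely
on the system regex library.

```shell
#!/usr/bin/env bash
# ere_accepts and probe are hypothetical helper names, not shell builtins.
# Usage: ere_accepts STRING REGEX
# Returns 0 on a match; non-zero if there is no match or if the regex
# library rejects the expression.
ere_accepts() {
  local subject=$1 re=$2
  [[ $subject =~ $re ]] 2>/dev/null
}

# Print one line per feature saying whether the library accepted it.
probe() {
  local label=$1 subject=$2 re=$3
  if ere_accepts "$subject" "$re"; then
    printf '%s: accepted\n' "$label"
  else
    printf '%s: not accepted\n' "$label"
  fi
}

probe 'POSIX [[:digit:]]' 5 '^[[:digit:]]$'   # standard ERE, should always work
probe 'enhanced \d'       5 '^\d$'            # platform-dependent
probe 'enhanced (?:...)'  a '^(?:a)$'         # platform-dependent
```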

On the same platform, Bash 4.4.19(1)-release from Homebrew does _NOT_
match with `REG_ENHANCED` features.

Best operating hypothesis is:

 * Darwin userland bug
 * Bash build process has logic to detect broken ERE in system libraries
   and use a GNU ERE implementation (or ships with such always?) so that
   it's immune from bugs like this

Proposed action: nothing
Reason: most folks aren't familiar enough with regexps to know the
variances and I suspect a non-trivial number of macOS users are
unwittingly relying upon TRE REG_ENHANCED features.  Fixing the
incompatibility (1) risks breaking working user scripts and (2) requires
shipping our own reliable ERE regexp library, and really I just don't
want to go there.

FWIW, somewhere lying around I also have a module which adds zsh/re2 as
a module, using Russ Cox's RE2 engine (as popularized by Go).  I suspect
that this would cause more confusion than it would solve, and I think I
dropped it part-way through converting RE_MATCH_PCRE to a compatibility
shim which edits a zsh-specific parameter which defines the engine to be
used and so can be set to any of (regex, re2, pcre).  If any of the core
team express interest, I can probably dust that off.

-Phil



* Re: zsh/bash behavior variance: regex ERE matching
  2018-03-14  2:40 zsh/bash behavior variance: regex ERE matching Phil Pennock
@ 2018-03-14 14:37 ` Stephane Chazelas
From: Stephane Chazelas @ 2018-03-14 14:37 UTC
  To: Phil Pennock; +Cc: zsh-workers

2018-03-13 22:40:33 -0400, Phil Pennock:
[...]
> So: we ask for ERE, we get ERE+nonstandard.
> 
> On the same platform, Bash 4.4.19(1)-release from Homebrew does _NOT_
> match with `REG_ENHANCED` features.
[...]

An important note about how bash's =~ works since 3.2 (in 3.1
or with the compat31 option it works more like zsh):

In bash (and to some extent in ksh93 as well, though it's very
buggy there), the shell quoting operators have an influence on
the regex matching, like they do for shell wildcards.

[[ a =~ "." ]] or [[ a =~ \. ]]

actually call regcomp() with a "\." regexp.

To do that, bash needs to parse the regexp, and it does so using
the POSIX ERE syntax.

[[ a =~ \d ]] is the same as [[ a =~ "d" ]] and calls regcomp()
with "d", while [[ a =~ '\d' ]] calls it with "\\d" (the "\"
being shell-quoted results in it being regexp-escaped).

That means that if you want to use extensions, you need to use
variables or other expansions there (which you leave unquoted).

Like:

re='\d'
[[ a =~ $re ]]

for regcomp() to be called with "\d".
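
The effect of quoting can be checked with a sketch like the one below.
It assumes bash 3.2 or later without the compat31 option; the variable
names are only for illustration.  In zsh the quoting makes no difference
and all four tests would use "." as the regexp.

```shell
#!/usr/bin/env bash
# Assumes bash >= 3.2 without compat31. Variable names are illustrative.
re='.'

# Unquoted: regcomp() gets ".", which matches any character.
[[ a =~ . ]]     && unquoted_literal=yes || unquoted_literal=no
# Quoted: regcomp() gets "\.", which only matches a literal dot.
[[ a =~ "." ]]   && quoted_literal=yes   || quoted_literal=no
# Unquoted expansion: the contents of $re are used as the regexp.
[[ a =~ $re ]]   && unquoted_var=yes     || unquoted_var=no
# Quoted expansion: the contents of $re are matched literally.
[[ a =~ "$re" ]] && quoted_var=yes       || quoted_var=no

echo "$unquoted_literal $quoted_literal $unquoted_var $quoted_var"
# prints: yes no yes no
```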

Note that (?:...) and \d are fine. We're not breaking EREs by
supporting them, as the behaviour for (?:...) and \d is unspecified
in the POSIX ERE specification.

Other regexp implementations have other backward-compatible
extensions. For instance, GNU EREs support \b, \<, \>...

Some incompatibilities I'm aware of between ERE and PCRE (I
don't know if that also applies to those macOS REs):

- In POSIX ERE, [\d] matches on \ and d while it matches on a
  digit in PCRE (see also [\]] and co)
- In POSIX ERE, alternation looks for the longest match, while
  PCRE takes the leftmost alternative that matches.

  $ echo abc | grep -oE 'a|ab'
  ab
  $ echo abc | grep -oP 'a|ab'
  a

  $ [[ abc =~ '(a|ab)' ]]; echo $match
  ab
  $ setopt rematchpcre
  $ [[ abc =~ '(a|ab)' ]]; echo $match
  a
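
The [\d] difference from the first item can be checked the same way.
The sketch below assumes bash on top of a regex library that follows
POSIX for bracket expressions (glibc does, as far as I know); the helper
name `bracket_matches` is made up for illustration, and I have not
checked what the macOS library does here.

```shell
#!/usr/bin/env bash
# bracket_matches is a hypothetical helper name.
# Inside a POSIX bracket expression a backslash is an ordinary character,
# so [\d] is the two-character set containing "\" and "d".
bracket_matches() {
  local re='^[\d]$'
  [[ $1 =~ $re ]]
}

for subject in '\' d 5; do
  if bracket_matches "$subject"; then
    printf '%s: match\n' "$subject"
  else
    printf '%s: no match\n' "$subject"
  fi
done
```

With a POSIX-conforming library the backslash and the "d" should match
and the digit should not; a PCRE-style engine would give the opposite
result.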

As long as the regex library does what is required for
POSIX-compliant regular expressions, since we document that =~ does
POSIX ERE, I'd say it doesn't matter which extensions are
implemented on top of the standard.

-- 
Stephane

