zsh-workers
* please consider using PCRE_DOLLAR_ENDONLY (and PCRE_DOTALL) for rematchpcre
From: Stephane Chazelas @ 2017-11-22 12:25 UTC
  To: Zsh hackers list

Hi.

With setopt rematchpcre

[[ $a =~ a$ ]] currently matches on a=$'a\n'

and

[[ $a =~ ^$ ]] matches on a=$'\n'

While [[ $a =~ . ]] does *not* match on a=$'\n'

That can be quite surprising, and means the behaviour is
different from when using ERE (with norematchpcre)
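
For comparison, the ERE side of this can be checked with a small sketch
like the one below. It assumes bash, or zsh with rematchpcre unset (the
default); ere_match is a hypothetical helper name, and the regex is passed
in a variable so that bash does not treat it as a literal string.

```shell
# Assumption: bash, or zsh without rematchpcre, so =~ uses POSIX ERE.
nl=$'\n'

ere_match() {
  # $1: subject string, $2: extended regular expression
  local subject=$1 re=$2
  if [[ $subject =~ $re ]]; then
    printf '%s\n' 'match'
  else
    printf '%s\n' 'no match'
  fi
}

ere_match "a$nl" 'a$'   # no match: $ only matches at the very end
ere_match "$nl" '^$'    # no match: the subject is not empty
ere_match "$nl" '.'     # match: . matches the newline
```

These are the opposite of the three rematchpcre results listed above.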

It can be worked around ([[ $a =~ 'a\z' ]], [[ $a =~ '(?s).' ]]),
but IMO at least PCRE_DOLLAR_ENDONLY (if not PCRE_DOTALL) should be
the default, at least for [[ $string =~ ... ]], as in shells
$string usually does not include the newline delimiter.

fish has the same issue with its "string" builtin.

ksh93 with [[ $var = ~(P)pcre ]] does behave as if both
PCRE_DOLLAR_ENDONLY and PCRE_DOTALL were on.

Note that even with PCRE_DOLLAR_ENDONLY, one can still do
line-based matching with (?m) or match before the trailing
newline character with \Z.

[[ $'a\nb\n' =~ '(?m)a$' ]] and [[ $'a\nb\n' =~ 'b\Z' ]] return
true.

-- 
Stephane



* Re: please consider using PCRE_DOLLAR_ENDONLY (and PCRE_DOTALL) for rematchpcre
From: Stephane Chazelas @ 2017-11-22 21:40 UTC
  To: Zsh hackers list

2017-11-22 12:25:19 +0000, Stephane Chazelas:
[...]
> It can be worked around ([[ $a =~ 'a\z' ]], [[ $a =~ '(?s).' ]]),
> but IMO at least PCRE_DOLLAR_ENDONLY (if not PCRE_DOTALL) should be
> the default, at least for [[ $string =~ ... ]], as in shells
> $string usually does not include the newline delimiter.
[...]

The situation in other tools and languages:

ksh93:

$ ksh93 -c "[[ $'a\n' = ~(P:a$) ]] || echo no; [[ $'\n' = ~(P:.) ]] && echo yes"
no
yes


(both PCRE_DOLLAR_ENDONLY and PCRE_DOTALL (or equivalent as
ksh93 comes with its own pcre-like implementation))

php:

$ php -r 'echo preg_match("/a$/", "a\n") . "\n" . preg_match("/./", "\n") . "\n";'
1
0

neither PCRE_DOLLAR_ENDONLY nor PCRE_DOTALL. Clearly documented
and has a "D" flag to enable PCRE_DOLLAR_ENDONLY
https://secure.php.net/manual/en/reference.pcre.pattern.modifiers.php

$ php -r 'echo preg_match("/a$/D", "a\n") . "\n";'
0

ssed:

printf 'a\n\n' | ssed -Rn 'N;/a$/=;/a./!='

neither PCRE_DOLLAR_ENDONLY nor PCRE_DOTALL

GNU grep:

$ printf 'a\n\0' | ltrace -e 'pcre_compile' grep -zP 'a$'
grep->pcre_compile("a$", 2080, 0x7ffcaf25aff8, 0x7ffcaf25aff4, 0x1e89280)

PCRE_DOLLAR_ENDONLY (32) but not PCRE_DOTALL
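
The option word in that ltrace output can be decoded with plain shell
arithmetic. A minimal sketch: the bit values are the ones defined in PCRE1's
pcre.h, and has_flag is a hypothetical helper name.

```shell
# Option bits as defined in PCRE1's pcre.h.
PCRE_DOTALL=4            # 0x00000004
PCRE_DOLLAR_ENDONLY=32   # 0x00000020
PCRE_UTF8=2048           # 0x00000800

has_flag() {
  # $1: option word passed to pcre_compile, $2: bit to test
  if [ $(( $1 & $2 )) -ne 0 ]; then
    echo yes
  else
    echo no
  fi
}

has_flag 2080 "$PCRE_DOLLAR_ENDONLY"   # yes
has_flag 2080 "$PCRE_DOTALL"           # no
has_flag 2080 "$PCRE_UTF8"             # yes
```

So 2080 is 2048 + 32: besides PCRE_DOLLAR_ENDONLY, the only other bit set
is the one pcre.h names PCRE_UTF8.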

python (not PCRE)

neither PCRE_DOLLAR_ENDONLY nor PCRE_DOTALL. Documented:
https://docs.python.org/3/library/re.html

Note that \Z there means the opposite of what it means in perl/PCRE:
it matches at the very end only (like \z in perl/PCRE).

fish (string match -r pcre strings...)

neither PCRE_DOLLAR_ENDONLY nor PCRE_DOTALL

So I'd understand if you leave it as it is as many other tools
do not use PCRE_DOLLAR_ENDONLY.

I still find the idea of $ not matching only at the end of the
subject dangerous, as most people assume it does (like it does
in BRE and ERE). If not changed, it would be worth clearly
documenting (if only to flag the difference with ERE and warn of
potential implications). See how the documentation currently has
this misleading example:

  [[ "$text" -pcre-match ^d+$ ]] &&
  print text variable contains only "d's".

Should be: 

  print text variable contains only "d's" optionally followed by a newline character

or:

  [[ "$text" -pcre-match '^d+\z' ]]


It affects perl and co already. Like, many people do:

rename 's/\.back$//i' ./*

When they meant:

rename 's/\.back\z//i' ./*

Same for PCRE_DOTALL

rename 's/-.*//' ./*-*

when they meant

rename 's/(?s)-.*//' ./*-*

for instance.

-- 
Stephane



* Re: please consider using PCRE_DOLLAR_ENDONLY (and PCRE_DOTALL) for rematchpcre
From: Bart Schaefer @ 2018-01-20  7:48 UTC
  To: zsh-workers

Another one that seems to have lost traction. Holiday malaise?
(workers/42044)



* Re: please consider using PCRE_DOLLAR_ENDONLY (and PCRE_DOTALL) for rematchpcre
From: Phil Pennock @ 2018-01-22  5:28 UTC
  To: Bart Schaefer; +Cc: zsh-workers

On 2018-01-19 at 23:48 -0800, Bart Schaefer wrote:
> Another one that seems to have lost traction. Holiday malaise?
> (workers/42044)

Speaking only for myself: the =~ for PCRE is a thin wrapper around the
long-standing PCRE module (introduced in 2001) and that in turn is a
thin wrapper around the PCRE defaults.

Changing the default behavior of valid semantics risks hard-to-debug
breakage of existing scripts and I am erring on the side of being
against this change.  It's not hard opposition, but I'd like to see
stronger justification before risking breaking changes.

I know that I myself have scripts which rely upon PCRE matching against
multiline data behaving as per the defaults of pcrepattern(3).

In addition, while the DOTALL change can be turned off in-regex, the
dollar-endonly one can't, AFAIK, so that becomes a breaking change which
can't be worked around.

-Phil



* Re: please consider using PCRE_DOLLAR_ENDONLY (and PCRE_DOTALL) for rematchpcre
From: Stephane Chazelas @ 2018-01-23  6:57 UTC
  To: Phil Pennock; +Cc: Bart Schaefer, zsh-workers

2018-01-22 00:28:29 -0500, Phil Pennock:
[...]
> Changing the default behavior of valid semantics risks hard-to-debug
> breakage of existing scripts and I am erring on the side of being
> against this change.  It's not hard opposition, but I'd like to see
> stronger justification before risking breaking changes.
> 
> I know that I myself have scripts which rely upon PCRE matching against
> multiline data behaving as per the defaults of pcrepattern(3).
> 
> In addition, while the DOTALL change can be turned off in-regex, the
> dollar-endonly one can't, AFAIK, so that becomes a breaking change which
> can't be worked around.
[...]

dollar-endonly is not really about multiline

[[ $'a\nb' =~ 'a$' ]]

will not match with or without it and

[[ $'a\nb' =~ '(?m)a$' ]]

will match with or without it.

It's more about single-line where the line delimiter happens to
be included (and you want the $ to match on the end of that line
as opposed to the end of the string).

$ matches before a trailing newline in a string in perl because
of how its <> operator works. perl is a text processing utility,
its regexps are primarily matched against single lines where the
newline is included (contrary to traditional text processing
utilities like sed/grep/awk where the record separator is not
included).

In:

    perl -pe 's/.$//'

(which calls <>).

you want to remove the last character of the line, not the
newline character.

That $ behaviour makes a lot of sense there. Even if you use:

   perl -lpe 's/.$//'

where that -l causes the delimiter to be removed on input and
added back on output like in sed/awk, that behaviour doesn't
harm because the record does *not* contain any newline
delimiter.

But zsh is not a text processing utility, and its "read" builtin
(the closest equivalent to perl's <>) does not include the
delimiter. It's actually hard to have a trailing newline when
processing text in shells given that $(...) strips them.
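
Both points can be checked with a short sketch like this one, which should
run in any POSIX-style shell; line_length is a hypothetical helper name.

```shell
# Command substitution strips every trailing newline.
value=$(printf 'a\n\n')
printf '%s\n' "${#value}"      # 1

line_length() {
  # Reads one line from stdin and prints its length in characters.
  # read stores the line without its newline delimiter.
  IFS= read -r line
  printf '%s\n' "${#line}"
}

printf 'abc\n' | line_length   # 3
```

In neither case does the resulting variable end in a newline.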

On the other hand, having

[[ $file =~ '\.txt$' ]]

match on files that don't end in .txt is a concern (and in my
experience, file names (as opposed to text lines with
delimiters) are the kind of thing I deal with most often in zsh).

And again, note that it only happens with PCRE matching; it works as
expected with EREs.
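
For plain suffix checks on file names, one way to sidestep the question is
a shell pattern instead of a regex, since a pattern has to match the whole
string. A minimal sketch, assuming a POSIX-style shell; has_txt_suffix is a
hypothetical helper name.

```shell
has_txt_suffix() {
  # $1: file name; prints yes only if the name ends in ".txt"
  case $1 in
    *.txt) echo yes ;;
    *)     echo no ;;
  esac
}

# A literal newline, written portably.
nl='
'

has_txt_suffix "notes.txt"       # yes
has_txt_suffix "notes.txt$nl"    # no: the name ends in a newline
```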


-- 
Stephane



* Re: please consider using PCRE_DOLLAR_ENDONLY (and PCRE_DOTALL) for rematchpcre
From: Stephane Chazelas @ 2018-01-23 13:55 UTC
  To: Phil Pennock, Bart Schaefer, zsh-workers

2018-01-23 06:57:35 +0000, Stephane Chazelas:
[...]
> It's more about single-line where the line delimiter happens to
> be included (and you want the $ to match on the end of that line
> as opposed to the end of the string).
[...]

I've been trying to imagine a use case that could be broken by
the change (of adding PCRE_DOLLAR_ENDONLY).

One case could be with $mapfile[file] (which does expand to the
full content of the file) for text files with a single line, like

set -o rematchpcre
zmodload zsh/mapfile
[[ $mapfile[/etc/debian_version] =~ '/sid$' ]]

Which would have to be changed to:

[[ $mapfile[/etc/debian_version] =~ '/sid\n?$' ]]

or

[[ $mapfile[/etc/debian_version] =~ '/(?m)sid$' ]]

though personally, I'd rather do:

[[ $(</etc/debian_version) = */sid ]]

Or 

grep -q '/sid$' /etc/debian_version

Can anyone think of less far-fetched ones?

-- 
Stephane

