zsh-workers
* =~ doesn't work with NUL characters
@ 2017-06-13 10:02 Stephane Chazelas
  2017-06-14 20:49 ` Phil Pennock
  0 siblings, 1 reply; 6+ messages in thread
From: Stephane Chazelas @ 2017-06-13 10:02 UTC
  To: Zsh hackers list

[[ $'a\0b' =~ 'a$' ]]

returns true both with and without rematchpcre

Same for

[[ abc =~ $'a\0xy' ]]

If this is not fixable, it would be worth documenting. I'd expect it
to be possible at least with PCRE, though, at least for the subject
argument if not for the pattern (where one can use '\0' to match a
NUL).

The [[ subject = pattern ]] operator seems to be OK in that
regard.

-- 
Stephane



* Re: =~ doesn't work with NUL characters
  2017-06-13 10:02 =~ doesn't work with NUL characters Stephane Chazelas
@ 2017-06-14 20:49 ` Phil Pennock
  2017-06-14 23:08   ` Bart Schaefer
                     ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Phil Pennock @ 2017-06-14 20:49 UTC
  To: Zsh hackers list

On 2017-06-13 at 11:02 +0100, Stephane Chazelas wrote:
> [[ $'a\0b' =~ 'a$' ]]
> 
> returns true both with and without rematchpcre

Let's break this down, non-PCRE and PCRE, and consider appropriate
behaviour for each separately.

Without rematchpcre, this is ERE per POSIX APIs, which don't portably
support size-supplied strings, relying instead upon C-string
null-termination.
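
To make the truncation concrete, here is a minimal Python sketch (not
zsh or libc code). It models a NUL-terminated C API by cutting both
arguments at the first NUL byte, then compares that with passing the
full byte strings. Python's re module only stands in for the regex
engine, and the helper names are made up for illustration.

```python
import re


def c_string_view(data: bytes) -> bytes:
    """Return what a NUL-terminated C API sees: the bytes before the first NUL."""
    return data.split(b"\0", 1)[0]


def match_like_c_string(pattern: bytes, subject: bytes) -> bool:
    """Match after truncating pattern and subject at their first NUL byte."""
    return re.search(c_string_view(pattern), c_string_view(subject)) is not None


def match_with_length(pattern: bytes, subject: bytes) -> bool:
    """Match on the full byte strings, as a length-aware API could."""
    return re.search(pattern, subject) is not None


# The two cases from the original report.
print(match_like_c_string(b"a$", b"a\0b"), match_with_length(b"a$", b"a\0b"))
print(match_like_c_string(b"a\0xy", b"abc"), match_with_length(b"a\0xy", b"abc"))
```

Under this model the truncated form matches in both reported cases,
while the length-aware form matches in neither.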

Current macOS has regnexec() but this is not in the system regexp
library I see on Ubuntu Trusty or FreeBSD 10.3.  It appears to be an
extension from when they switched to the TRE implementation in macOS
10.8.  <https://laurikari.net/tre/>

Trying to support this would result in variations in behaviour across
systems in a way which I think might be undesirable.  The whole point of
adding the non-PCRE implementation was to match Bash behaviour by
default, and Bash does the same thing.

So for non-PCRE, I think this current behaviour is the only sane choice.

For PCRE, I'm inclined to agree that we should be able to portably
supply the length and there would not be any cross-platform behavioural
variances.  I think it's also reasonable that PCRE matching could
diverge from ERE matching even more.  Others might disagree?

We've "always" used strlen here; the most recent change was to handle
meta/unmeta (by me), but the strlen usage has been present since the
pcre module was introduced in commit bff61cf9e1 in 2001.

Thus: do we want to change behaviour, after 16 years, to allow embedded
NUL for the PCRE case, being different from the ERE case?

There's enough room for disagreement here that I'm not rushing to write
a patch, but instead deferring to those with commit-bit.  My personal
inclination is to handle NUL in the PCRE case.  It should just be a
case of passing an int* instead of NULL as the second parameter to
unmetafy().
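
For illustration only, the difference between measuring the subject
with strlen() and taking the length reported by the unmetafy step can
be modelled in Python. The encoding used here (a Meta byte 0x83
followed by the original byte XOR 32) is an assumption taken from a
reading of zsh's source and should be checked against it; the
functions are stand-ins named after the C ones, not bindings to them.

```python
META = 0x83  # assumed value of zsh's Meta marker byte


def unmetafy(s: bytes) -> tuple:
    """Decode Meta pairs and return (raw bytes, length), like passing an int*."""
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == META and i + 1 < len(s):
            # Assumed encoding: Meta is followed by the original byte XOR 32.
            out.append(s[i + 1] ^ 32)
            i += 2
        else:
            out.append(s[i])
            i += 1
    return bytes(out), len(out)


def c_strlen(s: bytes) -> int:
    """Length as C's strlen() would report it: bytes before the first NUL."""
    idx = s.find(b"\0")
    return len(s) if idx < 0 else idx


raw, length = unmetafy(b"a\x83\x20b")  # metafied form of a, NUL, b
print(raw, length, c_strlen(raw))
```

The decoded string is three bytes long, but a strlen()-style
measurement stops at the first byte, which is the gap a
length-returning call would close.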

-Phil



* Re: =~ doesn't work with NUL characters
  2017-06-14 20:49 ` Phil Pennock
@ 2017-06-14 23:08   ` Bart Schaefer
  2017-06-15  7:38   ` Peter Stephenson
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Bart Schaefer @ 2017-06-14 23:08 UTC
  To: Zsh hackers list

On Jun 14,  4:49pm, Phil Pennock wrote:
}
} My personal inclination is to handle NUL in the PCRE case.

This harkens back to the discussion of whether =~ should implicitly
(i.e., by way of setopt) choose a regex package, or should always
mean the same thing.

Given that it's already doing the choose-implicitly thing, I have no
additional objection to it also handling NUL bytes.



* Re: =~ doesn't work with NUL characters
  2017-06-14 20:49 ` Phil Pennock
  2017-06-14 23:08   ` Bart Schaefer
@ 2017-06-15  7:38   ` Peter Stephenson
  2017-06-15  8:18   ` Stephane Chazelas
  2017-06-15  9:50   ` Stephane Chazelas
  3 siblings, 0 replies; 6+ messages in thread
From: Peter Stephenson @ 2017-06-15  7:38 UTC
  To: Zsh hackers list

On Wed, 14 Jun 2017 16:49:38 -0400
Phil Pennock <zsh-workers+phil.pennock@spodhuis.org> wrote:
> Thus: do we want to change behaviour, after 16 years, to allow embedded
> NUL for the PCRE case, being different from the ERE case?

I don't see why not --- the behaviour with NUL may be longstanding but
it's a bit cloudy what the actual intention is.  Generally, accepting
NUL as a regular character anywhere we can get it to work and shrugging
our shoulders in cases where the system fails to support it seems
reasonable to me.

pws



* Re: =~ doesn't work with NUL characters
  2017-06-14 20:49 ` Phil Pennock
  2017-06-14 23:08   ` Bart Schaefer
  2017-06-15  7:38   ` Peter Stephenson
@ 2017-06-15  8:18   ` Stephane Chazelas
  2017-06-15  9:50   ` Stephane Chazelas
  3 siblings, 0 replies; 6+ messages in thread
From: Stephane Chazelas @ 2017-06-15  8:18 UTC
  To: Phil Pennock; +Cc: Zsh hackers list

2017-06-14 16:49:38 -0400, Phil Pennock:
[...]
> Trying to support this would result in variations in behaviour across
> systems in a way which I think might be undesirable.  The whole point of
> adding the non-PCRE implementation was to match Bash behaviour by
> default, and Bash does the same thing.
[...]

Note that bash does not support the NUL character anywhere, except
(mostly by accident) for read -d '' and now readarray -d ''; it
strips it in some contexts to limit the damage.

In bash, [[ $'a\0b' = a ]] would also be true, just like
echo $'a\0b' would output a\n because $'x\0anything' is always x
there.

zsh AFAIK is the only shell that attempts to handle the NUL
character (making it the only shell that can reliably handle
arbitrary data internally).

There are still a few issues here and there like this one (or
the strcoll() one discussed not so long ago), when interacting
with the rest of the system, which can't cope with it. The most
obvious one, and one that can't be fixed, is the parameters to
the execve() system call (external command arguments and
environment variables).
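
The execve() limit is visible from other languages too. As a rough
Python sketch (Python is only a convenient way to reach the exec
interface here, and the helper name is invented): CPython refuses to
pass an argument containing a NUL byte and raises ValueError before
any process is started.

```python
import subprocess
import sys


def can_pass_as_argument(arg: str) -> bool:
    """Report whether arg can be handed to a child process as an argv entry."""
    try:
        # The child is a Python interpreter that ignores the extra argument.
        subprocess.run([sys.executable, "-c", "pass", arg], check=True)
    except ValueError:
        # CPython rejects embedded NUL bytes: argv entries are C strings.
        return False
    return True


print(can_pass_as_argument("ab"))
print(can_pass_as_argument("a\0b"))
```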

-- 
Stephane



* Re: =~ doesn't work with NUL characters
  2017-06-14 20:49 ` Phil Pennock
                     ` (2 preceding siblings ...)
  2017-06-15  8:18   ` Stephane Chazelas
@ 2017-06-15  9:50   ` Stephane Chazelas
  3 siblings, 0 replies; 6+ messages in thread
From: Stephane Chazelas @ 2017-06-15  9:50 UTC
  To: Phil Pennock; +Cc: Zsh hackers list

2017-06-14 16:49:38 -0400, Phil Pennock:
[...]
> Without rematchpcre, this is ERE per POSIX APIs, which don't portably
> support size-supplied strings, relying instead upon C-string
> null-termination.
> 
> Current macOS has regnexec() but this is not in the system regexp
> library I see on Ubuntu Trusty or FreeBSD 10.3.  It appears to be an
> extension from when they switched to the TRE implementation in macOS
> 10.8.  <https://laurikari.net/tre/>
> 
> Trying to support this would result in variations in behaviour across
> systems in a way which I think might be undesirable.  The whole point of
> adding the non-PCRE implementation was to match Bash behaviour by
> default, and Bash does the same thing.
[...]

A dirty trick in UTF-8 locales (the norm these days) may be to
encode NUL as U+7FFFFF00 (and bytes 0x80 -> 0xff that don't
form part of valid characters as U+7FFFFF{80..FF}), in both the
string and the regexp.
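
A rough Python sketch of that mapping, for illustration: Python's own
codec stops at U+10FFFF, so the encoder below hand-rolls the original
UTF-8 scheme, which went up to 6 bytes and 31 bits. The function
names are invented, and stray bytes are found with Python's
surrogateescape error handler, which is just one convenient way to
detect them.

```python
ESCAPE_BASE = 0x7FFFFF00  # NUL -> U+7FFFFF00, stray byte b -> U+7FFFFF00 + b


def encode_extended_utf8(cp: int) -> bytes:
    """Encode a code point up to 0x7FFFFFFF with the original 1-6 byte scheme."""
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, lead-byte marker, number of continuation bytes)
    for limit, lead, ntrail in ((0x800, 0xC0, 1), (0x10000, 0xE0, 2),
                                (0x200000, 0xF0, 3), (0x4000000, 0xF8, 4),
                                (0x80000000, 0xFC, 5)):
        if cp < limit:
            out = [lead | (cp >> (6 * ntrail))]
            for shift in range(6 * (ntrail - 1), -1, -6):
                out.append(0x80 | ((cp >> shift) & 0x3F))
            return bytes(out)
    raise ValueError("code point out of range")


def escape_for_regex(data: bytes) -> bytes:
    """Replace NUL and undecodable bytes with their U+7FFFFFxx encodings."""
    out = bytearray()
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if cp == 0:
            out += encode_extended_utf8(ESCAPE_BASE)
        elif 0xDC80 <= cp <= 0xDCFF:
            # surrogateescape maps a stray byte b to U+DC00 + b.
            out += encode_extended_utf8(ESCAPE_BASE + (cp - 0xDC00))
        else:
            out += ch.encode("utf-8")
    return bytes(out)


print(escape_for_regex(b"a\0b").hex())
```

The escaped subject and pattern contain no NUL bytes, so they pass
through a strlen()-based API intact.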

That wouldn't work with every regexp implementation though as
some would treat those as invalid characters if they go by
the newer definition where valid characters are only
0000->D7FF, E000->10FFFF.

But with those that do accept them, that would also make the
behaviour more consistent in cases like:

[[ $'\x80' = ? ]] vs [[ $'\x80' =~ '^.$' ]]

That wouldn't help in things like [[ x =~ $'[\0-\177]' ]] (which
anyway doesn't make sense in locales other than C/POSIX) though.

-- 
Stephane

