* Re: =~ doesn't work with NUL characters
2017-06-14 20:49 ` Phil Pennock
@ 2017-06-14 23:08 ` Bart Schaefer
2017-06-15 7:38 ` Peter Stephenson
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Bart Schaefer @ 2017-06-14 23:08 UTC
To: Zsh hackers list
On Jun 14, 4:49pm, Phil Pennock wrote:
}
} My personal inclination is to handle NULL in the PCRE case.
This harkens back to the discussion of whether =~ should implicitly
(i.e., by way of setopt) choose a regex package, or should always
mean the same thing.
Given that it's already doing the choose-implicitly thing, I have no
additional objection to it also handling nul bytes.
* Re: =~ doesn't work with NUL characters
2017-06-14 20:49 ` Phil Pennock
2017-06-14 23:08 ` Bart Schaefer
@ 2017-06-15 7:38 ` Peter Stephenson
2017-06-15 8:18 ` Stephane Chazelas
2017-06-15 9:50 ` Stephane Chazelas
3 siblings, 0 replies; 6+ messages in thread
From: Peter Stephenson @ 2017-06-15 7:38 UTC
To: Zsh hackers list
On Wed, 14 Jun 2017 16:49:38 -0400
Phil Pennock <zsh-workers+phil.pennock@spodhuis.org> wrote:
> Thus: do we want to change behaviour, after 16 years, to allow embedded
> NUL for the PCRE case, being different from the ERE case?
I don't see why not --- the behaviour with NUL may be longstanding but
it's a bit cloudy what the actual intention is. Generally, accepting
NUL as a regular character anywhere we can get it to work and shrugging
our shoulders in cases where the system fails to support it seems
reasonable to me.
pws
* Re: =~ doesn't work with NUL characters
2017-06-14 20:49 ` Phil Pennock
2017-06-14 23:08 ` Bart Schaefer
2017-06-15 7:38 ` Peter Stephenson
@ 2017-06-15 8:18 ` Stephane Chazelas
2017-06-15 9:50 ` Stephane Chazelas
3 siblings, 0 replies; 6+ messages in thread
From: Stephane Chazelas @ 2017-06-15 8:18 UTC
To: Phil Pennock; +Cc: Zsh hackers list
2017-06-14 16:49:38 -0400, Phil Pennock:
[...]
> Trying to support this would result in variations in behaviour across
> systems in a way which I think might be undesirable. The whole point of
> adding the non-PCRE implementation was to match Bash behaviour by
> default, and Bash does the same thing.
[...]
Note that bash does not support the NUL character anywhere, except
(mostly by accident) for read -d '' and now readarray -d '', and
it strips it in some contexts to limit the damage.
In bash, [[ $'a\0b' = a ]] would also be true, just like
echo $'a\0b' would output a\n because $'x\0anything' is always x
there.
zsh AFAIK is the only shell that attempts to handle the NUL
character (making it the only shell that can reliably handle
arbitrary data internally).
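A quick way to see which of the two behaviours described above the running shell has is a probe along these lines. This is only a sketch: nul_probe is a hypothetical helper name, and the code assumes a shell with [[ ]] and $'...' (bash or zsh).

```shell
# Hypothetical helper: report how the running shell treats NUL inside $'...'.
nul_probe() {
  s=$'a\0b'
  if [[ $s = a ]]; then
    # What the message above describes for bash: $'a\0b' is just "a".
    echo "truncated at NUL (length ${#s})"
  else
    # What it describes for zsh: the NUL is an ordinary character.
    echo "NUL kept (length ${#s})"
  fi
}

nul_probe
```

Running the same file under bash and then under zsh should show the two different answers.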
There are still a few issues here and there like this one (or
the strcoll() one discussed not so long ago) when zsh interacts
with parts of the system that can't cope with NUL. The most
obvious one, and one that can't be fixed, is the parameters to
the execve() system call (external command arguments and
environment variables).
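The execve() limit can be illustrated with a small sketch. The helper names arg_bytes and pipe_bytes are made up for this example, and `env` is used only as a convenient way to force an external printf instead of the shell builtin.

```shell
# Hypothetical helper: count the bytes an external printf receives in its
# argument. `env` forces a real execve() rather than the builtin printf.
arg_bytes() {
  n=$(env printf '%s' "$1" | wc -c)
  echo $((n))   # arithmetic expansion drops padding some wc versions add
}

# A pipe has no such limit: all three bytes, NUL included, get through.
pipe_bytes() {
  n=$(printf 'a\0b' | wc -c)
  echo $((n))
}

arg_bytes $'a\0b'   # only the part before the NUL can reach the command
pipe_bytes          # 3
```

arg_bytes reports 1 in bash because the string never contained the NUL in the first place; in zsh the same count is expected because the argument is handed to execve() as a C string, which ends at the first NUL.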
--
Stephane
* Re: =~ doesn't work with NUL characters
2017-06-14 20:49 ` Phil Pennock
` (2 preceding siblings ...)
2017-06-15 8:18 ` Stephane Chazelas
@ 2017-06-15 9:50 ` Stephane Chazelas
3 siblings, 0 replies; 6+ messages in thread
From: Stephane Chazelas @ 2017-06-15 9:50 UTC
To: Phil Pennock; +Cc: Zsh hackers list
2017-06-14 16:49:38 -0400, Phil Pennock:
[...]
> Without rematchpcre, this is ERE per POSIX APIs, which don't portably
> support size-supplied strings, relying instead upon C-string
> null-termination.
>
> Current macOS has regnexec() but this is not in the system regexp
> library I see on Ubuntu Trusty or FreeBSD 10.3. It appears to be an
> extension from when they switched to the TRE implementation in macOS
> 10.8. <https://laurikari.net/tre/>
>
> Trying to support this would result in variations in behaviour across
> systems in a way which I think might be undesirable. The whole point of
> adding the non-PCRE implementation was to match Bash behaviour by
> default, and Bash does the same thing.
[...]
A dirty trick in UTF-8 locales (the norm these days) may be to
encode NUL as U+7FFFFF00 (and bytes 0x80 to 0xff that don't
form part of valid characters as U+7FFFFF{80..FF}), in both the
string and the regexp.
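For reference, here is a sketch of what those stand-in code points look like as bytes. It assumes the original UTF-8 scheme (the one that predates the U+10FFFF cap), in which code points from U+4000000 to U+7FFFFFFF take six bytes; utf8_ext_bytes is a hypothetical helper, not something zsh provides.

```shell
# Hypothetical helper: print the six-byte sequence that the original UTF-8
# scheme assigns to a code point in the range U+4000000..U+7FFFFFFF.
utf8_ext_bytes() {
  cp=$(($1))
  # Lead byte 1111110x, then five continuation bytes 10xxxxxx.
  printf '%02X %02X %02X %02X %02X %02X\n' \
    $((0xFC | (cp >> 30))) \
    $((0x80 | ((cp >> 24) & 0x3F))) \
    $((0x80 | ((cp >> 18) & 0x3F))) \
    $((0x80 | ((cp >> 12) & 0x3F))) \
    $((0x80 | ((cp >> 6) & 0x3F))) \
    $((0x80 | (cp & 0x3F)))
}

utf8_ext_bytes 0x7FFFFF00   # stand-in for NUL       -> FD BF BF BF BC 80
utf8_ext_bytes 0x7FFFFF80   # stand-in for byte 0x80 -> FD BF BF BF BE 80
```

None of the six bytes is ever 0x00, which is the point of the trick: the encoded string can go through an interface that expects a NUL-terminated C string.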
That wouldn't work with every regexp implementation though, as
some would treat those as invalid characters if they go by
the newer definition where the only valid characters are
U+0000 to U+D7FF and U+E000 to U+10FFFF.
But with those that do accept them, that would also make the
behaviour more consistent in cases like:
[[ $'\x80' = ? ]] vs [[ $'\x80' =~ '^.$' ]]
That wouldn't help in things like [[ x =~ $'[\0-\177]' ]] (which
anyway doesn't make sense in locales other than C/POSIX) though.
--
Stephane