Chiming in because I'm the most recent person to mess with regex
functionality (I think).
On 2009-02-26 at 23:15 -0800, Jon Strait wrote:
1. A new '-s' option to pcre_compile. This is the frequently set
PCRE_DOTALL option, allowing the dot character to match a newline as well.
This makes sense, since we already have the other common options as
flags. In the meantime, you know about the internal option setting
feature of PCRE syntax, right? Putting (?s) at the start of the pattern
is equivalent.
Yes, I've had to use the (? prefixes on language bindings where the
command options were missing. It's a matter of taste, I guess. If
one's not expecting the prefix options, then it's a little more
confusing to read, but if one's accustomed to using the prefix options
then one may even prefer them.
On the safe side, regarding the possibility of multi-byte characters,
I'm assuming that the returned offset positions are only for sending
back to pcre_match and not for indexing on a match string, because the
offsets are in byte count, not character count.
This is dubious. I can see someone quite reasonably using
$var[start,end] for substring extraction; the shell should be internally
consistent. In a worst-case scenario, there could be another option to
select which offset semantics shall be used. Peter's work on UTF-8
support has so far managed to keep the user from ever knowing or caring
about this.
Or return four numbers instead of two, so that anyone using the
interface has to be aware of the difference and can think about it.
I'm not coming up with a more elegant solution.
I understand that the use of UTF-8 is transparent to the Z shell user.
This should mean that any character, whether represented as single or
multi-byte, will be seen as one character position in a user string.
However, the offset numbers that pcre_match will return are
representing byte lengths. If one character position is really two
bytes, then the PCRE offset will be increased by two, and a mismatch
will occur if the user tries to use the offset number for indexing on
their user string.
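To make the mismatch concrete, here is a minimal sketch in C of turning a
byte offset into a character position. It is not part of the patch or of
the zsh/pcre module; the helper name utf8_byte_to_char_offset is
hypothetical, and it assumes the subject is valid UTF-8 and that the byte
offset does not run past the end of the string.

```c
#include <stddef.h>

/* Hypothetical helper, not from the zsh/pcre module: count the characters
 * that begin before byte_off in a UTF-8 string.
 * Assumes valid UTF-8 and byte_off <= length of s in bytes. */
static size_t utf8_byte_to_char_offset(const char *s, size_t byte_off)
{
    size_t chars = 0;
    size_t i;

    for (i = 0; i < byte_off; i++) {
        /* Continuation bytes have the form 10xxxxxx; any other byte
         * starts a new character. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the subject "h\xc3\xa9llo" (an e-acute encoded as two bytes), a match
starting at byte offset 3 starts at character position 2, which is the
number a user indexing the string by character would need.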
3. A needed correction: all of the module's external variables are now
unset on each match attempt, so that a failed match will be obvious.
Well, the exit status is set already. And since the last shell release,
we've documented explicitly that nothing is altered:
2009-01-15:
* 26312: Phil Pennock: Doc/Zsh/cond.yo, Doc/Zsh/mod_pcre.yo,
Doc/Zsh/mod_regex.yo: Document no variables altered on failed
match.
On the other hand, there's value to a reset too. I don't have a strong
preference either way, but now is the time to fix it, before there's
been a release which documents the behaviour. :) Part of the problem
is that pcre_regex has been in Zsh for many years and we tend to be
cautious when changing behaviour.
I doubt that anyone is relying on the value being unchanged after a
match attempt. Thus the lack of a strong preference. Anyone else?
While you're at it, there's also the zsh/regex module, which uses the
system's normal extended regex libraries; if you're changing the
semantics of one, both should change.
Right, the exit status should really be checked anyway instead of my
checking the match results. Like so:
Could someone please point me to the doc files that would need updating
(for the zshmodule man page), or if someone here has that part
automated, I can send them whatever targeted write-up they want.
Doc/Zsh/mod_pcre.yo (and mod_regex.yo), which are in YODL format.
- ret = pcre_exec(pcre_pattern, pcre_hints, *args, strlen(*args), 0, 0, ovec, ovecsize);
+ ret = pcre_exec(pcre_pattern, pcre_hints, *args, strlen(*args), offset_start, 0, ovec, ovecsize);
How gracefully does pcre_exec() fail when offset_start is set to a value
larger than the length of the string? To maxint-smallnumber?
-Phil
Oops, it fails like a winged elephant. I'll put a check in the next
patch. Thanks Phil.
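For illustration only, the kind of check discussed above might look like
the following sketch. The function name start_offset_ok is hypothetical and
this is not the check from the follow-up patch; it assumes the offset
arrives as a signed integer and that an offset equal to the subject length
(the end of the string) should still be accepted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical guard, not from the actual patch: returns 1 if offset_start
 * is a usable starting offset for subject, 0 otherwise.
 * Assumption: an offset equal to strlen(subject) is allowed, so that a
 * pattern can still match the empty string at the end. */
static int start_offset_ok(const char *subject, long offset_start)
{
    if (offset_start < 0)
        return 0;
    return (size_t)offset_start <= strlen(subject);
}
```

A caller would run this before the pcre_exec() call and report an error
instead of matching when it returns 0.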