zsh-workers
 help / color / mirror / code / Atom feed
* Re: PATCH: New options for the PCRE module
@ 2009-02-27 23:10 Jon Strait
  0 siblings, 0 replies; 3+ messages in thread
From: Jon Strait @ 2009-02-27 23:10 UTC (permalink / raw)
  To: zsh workers

[-- Attachment #1: Type: text/plain, Size: 4904 bytes --]

Phil Pennock wrote:
> Chiming in because I'm the most recent person to mess with regex
> functionality (I think).
>
> On 2009-02-26 at 23:15 -0800, Jon Strait wrote:
>   
>> 1.  A new '-s' option to pcre_compile.  This is the frequently set 
>> PCRE_DOTALL option, allowing the dot character to match a newline as well.
>>     
>
> This makes sense, since we already have the other common options as
> flags.  In the meantime, you know about the internal option setting
> feature of PCRE syntax, right?  Putting (?s) at the start of the pattern
> is equivalent.
>
>   
Yes, I've had to use the (? prefixes on language bindings where the 
command options were missing.  It's a matter of taste, I guess.  If 
one's not expecting the prefix options, then it's a little more 
confusing to read, but if one's accustom to using the prefix options 
then they may even prefer that way.
>> On the safe side, regarding the possibility of multi-byte characters, 
>> I'm assuming that the returned offset positions are only for sending 
>> back to pcre_match and not for indexing on a match string, because the 
>> offsets are in byte count, not character count.
>>     
>
> This is dubious.  I can see someone quite reasonably using
> $var[start,end] for substring extraction; the shell should be internally
> consistent.  In a worst-case scenario, there could be another option to
> select which offset semantics shall be used.  Peter's work on UTF-8
> support has so far managed to keep the user from ever knowing or caring
> about this.
>
> Or return four numbers instead of two, so that anyone using the
> interface has to be aware of the difference and can think about it.
>
> I'm not coming up with a more elegant solution.
>
>   
I understand that the use of UTF-8 is transparent to the Z shell user.  
This should mean that any character, whether represented as single or 
multi-byte, will be seen as one character position in a user string.  
However, the offset numbers that pcre_match will return are representing 
byte lengths.  If one character position is really two bytes, then the 
PCRE offset will be increased by two, and a mismatch will occur if the 
user tries to use the offset number for indexing on their user string.
This is how I understand it, or have I missed something along the way?  
I'm not that UTF-8 savvy, so maybe there is a way around this that I'm 
not seeing.  Also, I would like to have those two extra offset numbers 
that represent character positions, but AFAIK, PCRE doesn't provide that 
info on a match.
>> 3.  A needed correction: all of the module's external variables are now 
>> unset on each match attempt, so that a failed match will be obvious.
>>     
>
> Well, the exit status is set already.  And since the last shell release,
> we've documented explicitly that nothing is altered:
>
> 2009-01-15:
>         * 26312: Phil Pennock: Doc/Zsh/cond.yo, Doc/Zsh/mod_pcre.yo,
>         Doc/Zsh/mod_regex.yo: Document no variables altered on failed
>         match.
>
> On the other hand, there's value to a reset too.  I don't have a strong
> preference either way, but now is the time to fix it, before there's
> been a release which documents the behaviour.  :)  Part of the problem
> is that pcre_regex has been in Zsh for many years and we tend to be
> cautious when changing behaviour.
>
> I doubt that anyone is relying on the value being unchanged after a
> match attempt.  Thus the lack of a strong preference.  Anyone else?
>
> While you're at it, there's also the zsh/regex module which uses the
> system's normal extended regex libraries and if you're changing the
> semantics of one, both should change.
>
>   
Right, the exit status should really be checked anyway instead of my 
checking the match results.  Like so:
~~~~~~

pcre_match -b -- $body_match

accum=()

while [[ $? -eq 0 ]] do
    b=($=ZPCRE_OP)
    accum+=$match[1]
    pcre_match -b -n $b[2] -- $body_match
done
   
print -l $accum

~~~~~~~
I'll change it back, unless others think that the new reset actions are 
worth changing to.  The standard behavior is sloppy, but then the whole 
mechanism is sloppy and it doesn't matter.
>> Could someone please point me to the doc files that would need updating 
>> (for the zshmodule man page), or if someone here has that part 
>> automated, I can send them whatever targeted write-up they want.
>>     
>
> Doc/Zsh/mod_pcre.yo (and mod_regex.yo), which are in YODL format.
>
>   
>> -    ret = pcre_exec(pcre_pattern, pcre_hints, *args, strlen(*args), 0, 0, ovec, ovecsize);
>> +    ret = pcre_exec(pcre_pattern, pcre_hints, *args, strlen(*args), offset_start, 0, ovec, ovecsize);
>>     
>
> How gracefully does pcre_exec() fail when offset_start is set to a value
> larger than the length of the string?  To maxint-smallnumber?
>
> -Phil
>   
Oops, it fails like a winged elephant.  I'll put a check in the next 
patch.  Thanks Phil.


-Jon

[-- Attachment #2: Type: text/html, Size: 5953 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: PATCH: New options for the PCRE module
  2009-02-27  7:15 Jon Strait
@ 2009-02-27  8:33 ` Phil Pennock
  0 siblings, 0 replies; 3+ messages in thread
From: Phil Pennock @ 2009-02-27  8:33 UTC (permalink / raw)
  To: Jon Strait; +Cc: zsh workers

Chiming in because I'm the most recent person to mess with regex
functionality (I think).

On 2009-02-26 at 23:15 -0800, Jon Strait wrote:
> 1.  A new '-s' option to pcre_compile.  This is the frequently set 
> PCRE_DOTALL option, allowing the dot character to match a newline as well.

This makes sense, since we already have the other common options as
flags.  In the meantime, you know about the internal option setting
feature of PCRE syntax, right?  Putting (?s) at the start of the pattern
is equivalent.

> On the safe side, regarding the possibility of multi-byte characters, 
> I'm assuming that the returned offset positions are only for sending 
> back to pcre_match and not for indexing on a match string, because the 
> offsets are in byte count, not character count.

This is dubious.  I can see someone quite reasonably using
$var[start,end] for substring extraction; the shell should be internally
consistent.  In a worst-case scenario, there could be another option to
select which offset semantics shall be used.  Peter's work on UTF-8
support has so far managed to keep the user from ever knowing or caring
about this.

Or return four numbers instead of two, so that anyone using the
interface has to be aware of the difference and can think about it.

I'm not coming up with a more elegant solution.

> 3.  A needed correction: all of the module's external variables are now 
> unset on each match attempt, so that a failed match will be obvious.

Well, the exit status is set already.  And since the last shell release,
we've documented explicitly that nothing is altered:

2009-01-15:
        * 26312: Phil Pennock: Doc/Zsh/cond.yo, Doc/Zsh/mod_pcre.yo,
        Doc/Zsh/mod_regex.yo: Document no variables altered on failed
        match.

On the other hand, there's value to a reset too.  I don't have a strong
preference either way, but now is the time to fix it, before there's
been a release which documents the behaviour.  :)  Part of the problem
is that pcre_regex has been in Zsh for many years and we tend to be
cautious when changing behaviour.

I doubt that anyone is relying on the value being unchanged after a
match attempt.  Thus the lack of a strong preference.  Anyone else?

While you're at it, there's also the zsh/regex module which uses the
system's normal extended regex libraries and if you're changing the
semantics of one, both should change.

> Could someone please point me to the doc files that would need updating 
> (for the zshmodule man page), or if someone here has that part 
> automated, I can send them whatever targeted write-up they want.

Doc/Zsh/mod_pcre.yo (and mod_regex.yo), which are in YODL format.

> -    ret = pcre_exec(pcre_pattern, pcre_hints, *args, strlen(*args), 0, 0, ovec, ovecsize);
> +    ret = pcre_exec(pcre_pattern, pcre_hints, *args, strlen(*args), offset_start, 0, ovec, ovecsize);

How gracefully does pcre_exec() fail when offset_start is set to a value
larger than the length of the string?  To maxint-smallnumber?

-Phil


^ permalink raw reply	[flat|nested] 3+ messages in thread

* PATCH: New options for the PCRE module
@ 2009-02-27  7:15 Jon Strait
  2009-02-27  8:33 ` Phil Pennock
  0 siblings, 1 reply; 3+ messages in thread
From: Jon Strait @ 2009-02-27  7:15 UTC (permalink / raw)
  To: zsh workers

[-- Attachment #1: Type: text/plain, Size: 1770 bytes --]

Hi all,

I thought the PCRE module could use a little enhancement, so I added a 
few things that may be useful to some.

Let's see if I can get through all of this before the coffee wears off...

1.  A new '-s' option to pcre_compile.  This is the frequently set 
PCRE_DOTALL option, allowing the dot character to match a newline as well.

2. For pcre_match, a  '-n offset' option for starting the search at the 
offset position in the match string, and a '-b' option for setting the 
variable ZPCRE_OP to the offset pair of positions of the entire 
successful pattern match.  For example, if a pattern matches with the 
'-b' option set, a ZPCRE_OP set to the string "32 45" indicates that the 
entire match started on byte position 32 and ended on byte position 44.  
PCRE is saying byte position 32 to 45 exclusive, zero based.

All of this is to enable the 'find all' functionality.  For example, if 
I want all of the non-overlapping matches within a string, I can now do:

accum=()

pcre_match -b -- $match_string

while [[ -n $ZPCRE_OP ]] do
    b=($=ZPCRE_OP)
    accum+=$MATCH
    pcre_match -b -n $(( b[2] )) -- $match_string
done
   
print -l $accum

On the safe side, regarding the possibility of multi-byte characters, 
I'm assuming that the returned offset positions are only for sending 
back to pcre_match and not for indexing on a match string, because the 
offsets are in byte count, not character count.

3.  A needed correction: all of the module's external variables are now 
unset on each match attempt, so that a failed match will be obvious.

Could someone please point me to the doc files that would need updating 
(for the zshmodule man page), or if someone here has that part 
automated, I can send them whatever targeted write-up they want.



[-- Attachment #2: pcre.patch --]
[-- Type: text/x-patch, Size: 4207 bytes --]

--- pcre.c	2007-07-09 02:30:42.000000000 -0700
+++ pcre-new.c	2009-02-26 22:10:46.000000000 -0800
@@ -82,6 +82,7 @@
     if(OPT_ISSET(ops,'i')) pcre_opts |= PCRE_CASELESS;
     if(OPT_ISSET(ops,'m')) pcre_opts |= PCRE_MULTILINE;
     if(OPT_ISSET(ops,'x')) pcre_opts |= PCRE_EXTENDED;
+    if(OPT_ISSET(ops,'s')) pcre_opts |= PCRE_DOTALL;
     
     if (zpcre_utf8_enabled())
 	pcre_opts |= PCRE_UTF8;
@@ -137,20 +138,23 @@
 
 /**/
 static int
-zpcre_get_substrings(char *arg, int *ovec, int ret, char *matchvar, char *substravar, int matchedinarr)
+zpcre_get_substrings(char *arg, int *ovec, int ret, char *matchvar, char *substravar, 
+    int want_offset_pair, int matchedinarr)
 {
     char **captures, *match_all, **matches;
+    char offset_all[50];
     int capture_start = 1;
 
     if (matchedinarr)
 	capture_start = 0;
-    if (matchvar == NULL)
-	matchvar = "MATCH";
-    if (substravar == NULL)
-	substravar = "match";
-
+    
     /* captures[0] will be entire matched string, [1] first substring */
-    if(!pcre_get_substring_list(arg, ovec, ret, (const char ***)&captures)) {
+    if (!pcre_get_substring_list(arg, ovec, ret, (const char ***)&captures)) {
+	/* Set to the offsets of the complete match */
+	if (want_offset_pair) {
+	    sprintf(offset_all, "%d %d", ovec[0], ovec[1]);
+	    setsparam("ZPCRE_OP", ztrdup(offset_all));
+	}
 	match_all = ztrdup(captures[0]);
 	setsparam(matchvar, match_all);
 	matches = zarrdup(&captures[capture_start]);
@@ -163,12 +167,30 @@
 
 /**/
 static int
+getposint(char *instr, char *nam)
+{
+    char *eptr;
+    int ret;
+
+    ret = (int)zstrtol(instr, &eptr, 10);
+    if (*eptr || ret < 0) {
+	zwarnnam(nam, "integer expected: %s", instr);
+	return -1;
+    }
+
+    return ret;
+}
+
+/**/
+static int
 bin_pcre_match(char *nam, char **args, Options ops, UNUSED(int func))
 {
     int ret, capcount, *ovec, ovecsize, c;
     char *matched_portion = NULL;
     char *receptacle = NULL;
     int return_value = 1;
+    int offset_start = 0;
+    int want_offset_pair = 0;
 
     if (pcre_pattern == NULL) {
 	zwarnnam(nam, "no pattern has been compiled");
@@ -181,6 +203,12 @@
     if(OPT_HASARG(ops,c='v')) {
 	matched_portion = OPT_ARG(ops,c);
     }
+    if(OPT_HASARG(ops,c='n')) { /* The offset position to start the search */
+	offset_start = getposint(OPT_ARG(ops,c), nam);
+    }
+    /* For the entire match, 'Return' the offset positions instead of the matched string */
+    if(OPT_ISSET(ops,'b')) want_offset_pair = 1; 
+    
     if(!*args) {
 	zwarnnam(nam, "not enough arguments");
     }
@@ -194,12 +222,22 @@
     ovecsize = (capcount+1)*3;
     ovec = zalloc(ovecsize*sizeof(int));
     
-    ret = pcre_exec(pcre_pattern, pcre_hints, *args, strlen(*args), 0, 0, ovec, ovecsize);
+    ret = pcre_exec(pcre_pattern, pcre_hints, *args, strlen(*args), offset_start, 0, ovec, ovecsize);
+
+    if (matched_portion == NULL)
+	matched_portion = "MATCH";
+    if (receptacle == NULL)
+	receptacle = "match";
+
+    /* Reset the external variables */
+    unsetparam(matched_portion);
+    unsetparam(receptacle);
+    unsetparam("ZPCRE_OP");
     
     if (ret==0) return_value = 0;
     else if (ret==PCRE_ERROR_NOMATCH) /* no match */;
     else if (ret>0) {
-	zpcre_get_substrings(*args, ovec, ret, matched_portion, receptacle, 0);
+	zpcre_get_substrings(*args, ovec, ret, matched_portion, receptacle, want_offset_pair, 0);
 	return_value = 0;
     }
     else {
@@ -258,7 +296,7 @@
 		    break;
 		}
                 else if (r>0) {
-		    zpcre_get_substrings(lhstr, ov, r, NULL, avar, isset(BASHREMATCH));
+		    zpcre_get_substrings(lhstr, ov, r, NULL, avar, 0, isset(BASHREMATCH));
 		    return_value = 1;
 		    break;
 		}
@@ -289,8 +327,8 @@
 #endif /* !(HAVE_PCRE_COMPILE && HAVE_PCRE_EXEC) */
 
 static struct builtin bintab[] = {
-    BUILTIN("pcre_compile", 0, bin_pcre_compile, 1, 1, 0, "aimx",  NULL),
-    BUILTIN("pcre_match",   0, bin_pcre_match,   1, 1, 0, "a:v:",    NULL),
+    BUILTIN("pcre_compile", 0, bin_pcre_compile, 1, 1, 0, "aimxs",  NULL),
+    BUILTIN("pcre_match",   0, bin_pcre_match,   1, 1, 0, "a:v:n:b",    NULL),
     BUILTIN("pcre_study",   0, bin_pcre_study,   0, 0, 0, NULL,    NULL)
 };
 

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2009-02-27 23:21 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-27 23:10 PATCH: New options for the PCRE module Jon Strait
  -- strict thread matches above, loose matches on Subject: below --
2009-02-27  7:15 Jon Strait
2009-02-27  8:33 ` Phil Pennock

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).