* [musl] regex Back reference matching result not same as glibc and tre.
@ 2020-04-18 8:44 liheng (P)
2020-04-18 10:28 ` Florian Weimer
0 siblings, 1 reply; 9+ messages in thread
From: liheng (P) @ 2020-04-18 8:44 UTC (permalink / raw)
To: Rich Felker; +Cc: musl, Xiangrui (Euler), Lizefan
Rich Felker:
Hello, I've noticed musl regex matching result is not same as glibc and tre.
The back reference maybe not supported well in latest version.
Here is a simple test case:
#include <regex.h>
#include <stdio.h>
#include <string.h>
#define str "aba"
#define N 2
static const char *expected[N] =
{
str, "a"
};
static const char pat[] = "(.?).?\\1";
int test_regex(void)
{
regex_t rbuf;
int err = regcomp(&rbuf, pat, REG_EXTENDED);
if (err != 0) {
char errstr[300];
regerror(err, &rbuf, errstr, sizeof (errstr));
puts (errstr);
return err;
}
regmatch_t m[N];
err = regexec(&rbuf, str, N, m, 0);
if (err != 0) {
puts ("regexec failed");
return 1;
}
int result = 0;
int i;
for (i = 0; i < N; ++i) {
if (m[i].rm_so == -1) {
printf ("m[%d] unused\n", i);
result = 1;
}
else {
int len = m[i].rm_eo - m[i].rm_so;
printf ("m[%d] = \"%.*s\"\n", i, len, str + m[i].rm_so);
if (strlen (expected[i]) != len
|| memcmp (expected[i], str + m[i].rm_so, len) != 0)
result = 1;
}
}
return result;
}
int main (void)
{
int result = 0;
result = test_regex();
if (result != 0) {
printf("test regex failed\n");
} else {
printf("test regex success\n");
}
return result;
}
musl:
# ./test
regexec failed
test regex failed
glibc:
# ./test
m[0] = "aba"
m[1] = "a"
m[2] = ""
test regex success
tre:
# ./test
m[0] = "aba"
m[1] = "a"
m[2] = ""
test regex success
I noticed Rich Felker made change about back reference in below commit to suppress back reference processing in ERE regcomp.
commit 7c8c86f6308c7e0816b9638465a5917b12159e8f
Author: Rich Felker <dalias@aerifal.cx>
Date: Fri Mar 20 18:25:01 2015 -0400
suppress backref processing in ERE regcomp
one of the features of ERE is that it's actually a regular language
and does not admit expressions which cannot be matched in linear time.
introduction of \n backref support into regcomp's ERE parsing was
unintentional.
diff --git a/src/regex/regcomp.c b/src/regex/regcomp.c index bce6bc15..4d80cb1c 100644
--- a/src/regex/regcomp.c
+++ b/src/regex/regcomp.c
@@ -839,7 +839,7 @@ static reg_errcode_t parse_atom(tre_parse_ctx_t *ctx, const char *s)
break;
default:
- if (isdigit(*s)) {
+ if (!ere && isdigit(*s)) {
/* back reference */
This commit reminds me that if i want to use back reference i should not to tag REG_EXTENDED, but this test case matching still failed.
And I try to support back reference in ERE regcomp by below modify and then the musl regex matching success same as glibc and tre.
--- a/src/regex/regcomp.c
+++ b/src/regex/regcomp.c
default:
+ if (!ere && isdigit(*s)) {
+ if (ere && isdigit(*s)) {
/* back reference */
Thank you for considering this.
Li Heng
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] regex Back reference matching result not same as glibc and tre.
2020-04-18 8:44 [musl] regex Back reference matching result not same as glibc and tre liheng (P)
@ 2020-04-18 10:28 ` Florian Weimer
2020-04-18 11:07 ` liheng (P)
0 siblings, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2020-04-18 10:28 UTC (permalink / raw)
To: liheng (P); +Cc: Rich Felker, musl, Xiangrui (Euler), Lizefan
* liheng:
> static const char pat[] = "(.?).?\\1";
> This commit reminds me that if i want to use back reference i should
> not to tag REG_EXTENDED, but this test case matching still failed.
Did you change the expression to this for the basic regular expression
test?
static const char pat[] = "\\(.?\\).?\\1";
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [musl] regex Back reference matching result not same as glibc and tre.
2020-04-18 10:28 ` Florian Weimer
@ 2020-04-18 11:07 ` liheng (P)
2020-04-18 11:13 ` Szabolcs Nagy
0 siblings, 1 reply; 9+ messages in thread
From: liheng (P) @ 2020-04-18 11:07 UTC (permalink / raw)
To: Florian Weimer; +Cc: Rich Felker, musl, Xiangrui (Euler), Lizefan
static const char pat[] = "\\(.?\\).?\\1";
string: "aba";
I tested this pattern by my test case just now.
musl:
# ./test
regexec failed
test regex failed
glibc:
# ./test
Invalid back reference
test regex failed
tre:
# ./test
Invalid back reference
test regex failed
-----Original Message-----
From: Florian Weimer [mailto:fw@deneb.enyo.de]
Sent: Saturday, April 18, 2020 6:29 PM
To: liheng (P) <liheng40@huawei.com>
Cc: Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
* liheng:
> static const char pat[] = "(.?).?\\1";
> This commit reminds me that if i want to use back reference i should
> not to tag REG_EXTENDED, but this test case matching still failed.
Did you change the expression to this for the basic regular expression test?
static const char pat[] = "\\(.?\\).?\\1";
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] regex Back reference matching result not same as glibc and tre.
2020-04-18 11:07 ` liheng (P)
@ 2020-04-18 11:13 ` Szabolcs Nagy
2020-04-18 11:37 ` liheng (P)
0 siblings, 1 reply; 9+ messages in thread
From: Szabolcs Nagy @ 2020-04-18 11:13 UTC (permalink / raw)
To: liheng (P); +Cc: Florian Weimer, Rich Felker, musl, Xiangrui (Euler), Lizefan
* liheng (P) <liheng40@huawei.com> [2020-04-18 11:07:20 +0000]:
> static const char pat[] = "\\(.?\\).?\\1";
> string: "aba";
? is not special in bre
it should be \{0,1\} (i think we support \? as an extension,
but unescaped ? only matches literal ?). try one of
static const char pat[] = "\\(.\\{0,1\\}\\).\\{0,1\\}\\1";
static const char pat[] = "\\(.\\?\\).\\?\\1";
>
> I tested this pattern by my test case just now.
>
> musl:
> # ./test
> regexec failed
> test regex failed
>
> glibc:
> # ./test
> Invalid back reference
> test regex failed
>
> tre:
> # ./test
> Invalid back reference
> test regex failed
>
> -----Original Message-----
> From: Florian Weimer [mailto:fw@deneb.enyo.de]
> Sent: Saturday, April 18, 2020 6:29 PM
> To: liheng (P) <liheng40@huawei.com>
> Cc: Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
> Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
>
> * liheng:
>
> > static const char pat[] = "(.?).?\\1";
>
> > This commit reminds me that if i want to use back reference i should
> > not to tag REG_EXTENDED, but this test case matching still failed.
>
>
> Did you change the expression to this for the basic regular expression test?
>
> static const char pat[] = "\\(.?\\).?\\1";
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [musl] regex Back reference matching result not same as glibc and tre.
2020-04-18 11:13 ` Szabolcs Nagy
@ 2020-04-18 11:37 ` liheng (P)
2020-04-18 14:07 ` Szabolcs Nagy
0 siblings, 1 reply; 9+ messages in thread
From: liheng (P) @ 2020-04-18 11:37 UTC (permalink / raw)
To: musl; +Cc: Florian Weimer, Rich Felker, Xiangrui (Euler), Lizefan
static const char pat[] = "\\(.?\\).?\\1";
str = "aba"
ok, I retest this pat with no tag.
regcomp(&rbuf, pat, 0);
regexec1(&rbuf, str, N, m, 0);
glibc:
# ./test
regexec failed
test regex failed
musl:
# ./test
regexec failed
test regex failed
-----Original Message-----
From: Szabolcs Nagy [mailto:nsz@port70.net]
Sent: Saturday, April 18, 2020 7:13 PM
To: liheng (P) <liheng40@huawei.com>
Cc: Florian Weimer <fw@deneb.enyo.de>; Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
* liheng (P) <liheng40@huawei.com> [2020-04-18 11:07:20 +0000]:
> static const char pat[] = "\\(.?\\).?\\1";
> string: "aba";
? is not special in bre
it should be \{0,1\} (i think we support \? as an extension, but unescaped ? only matches literal ?). try one of
static const char pat[] = "\\(.\\{0,1\\}\\).\\{0,1\\}\\1"; static const char pat[] = "\\(.\\?\\).\\?\\1";
>
> I tested this pattern by my test case just now.
>
> musl:
> # ./test
> regexec failed
> test regex failed
>
> glibc:
> # ./test
> Invalid back reference
> test regex failed
>
> tre:
> # ./test
> Invalid back reference
> test regex failed
>
> -----Original Message-----
> From: Florian Weimer [mailto:fw@deneb.enyo.de]
> Sent: Saturday, April 18, 2020 6:29 PM
> To: liheng (P) <liheng40@huawei.com>
> Cc: Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui
> (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
> Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
>
> * liheng:
>
> > static const char pat[] = "(.?).?\\1";
>
> > This commit reminds me that if i want to use back reference i should
> > not to tag REG_EXTENDED, but this test case matching still failed.
>
>
> Did you change the expression to this for the basic regular expression test?
>
> static const char pat[] = "\\(.?\\).?\\1";
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] regex Back reference matching result not same as glibc and tre.
2020-04-18 11:37 ` liheng (P)
@ 2020-04-18 14:07 ` Szabolcs Nagy
2020-04-19 12:26 ` liheng (P)
0 siblings, 1 reply; 9+ messages in thread
From: Szabolcs Nagy @ 2020-04-18 14:07 UTC (permalink / raw)
To: liheng (P); +Cc: musl, Florian Weimer, Rich Felker, Xiangrui (Euler), Lizefan
* liheng (P) <liheng40@huawei.com> [2020-04-18 11:37:13 +0000]:
> static const char pat[] = "\\(.?\\).?\\1";
> str = "aba"
>
> ok, I retest this pat with no tag.
why?
? is not special in bre.
you need "\\{0,1\\}" or "\\?" instead of "?" to match "aba"
your pat would match str="a?b?a?" in a standard conform
implementation.
>
> regcomp(&rbuf, pat, 0);
> regexec1(&rbuf, str, N, m, 0);
>
> glibc:
> # ./test
> regexec failed
> test regex failed
>
> musl:
> # ./test
> regexec failed
> test regex failed
>
>
>
> -----Original Message-----
> From: Szabolcs Nagy [mailto:nsz@port70.net]
> Sent: Saturday, April 18, 2020 7:13 PM
> To: liheng (P) <liheng40@huawei.com>
> Cc: Florian Weimer <fw@deneb.enyo.de>; Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
> Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
>
> * liheng (P) <liheng40@huawei.com> [2020-04-18 11:07:20 +0000]:
> > static const char pat[] = "\\(.?\\).?\\1";
> > string: "aba";
>
> ? is not special in bre
>
> it should be \{0,1\} (i think we support \? as an extension, but unescaped ? only matches literal ?). try one of
>
> static const char pat[] = "\\(.\\{0,1\\}\\).\\{0,1\\}\\1"; static const char pat[] = "\\(.\\?\\).\\?\\1";
>
> >
> > I tested this pattern by my test case just now.
> >
> > musl:
> > # ./test
> > regexec failed
> > test regex failed
> >
> > glibc:
> > # ./test
> > Invalid back reference
> > test regex failed
> >
> > tre:
> > # ./test
> > Invalid back reference
> > test regex failed
> >
> > -----Original Message-----
> > From: Florian Weimer [mailto:fw@deneb.enyo.de]
> > Sent: Saturday, April 18, 2020 6:29 PM
> > To: liheng (P) <liheng40@huawei.com>
> > Cc: Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui
> > (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
> > Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
> >
> > * liheng:
> >
> > > static const char pat[] = "(.?).?\\1";
> >
> > > This commit reminds me that if i want to use back reference i should
> > > not to tag REG_EXTENDED, but this test case matching still failed.
> >
> >
> > Did you change the expression to this for the basic regular expression test?
> >
> > static const char pat[] = "\\(.?\\).?\\1";
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [musl] regex Back reference matching result not same as glibc and tre.
2020-04-18 14:07 ` Szabolcs Nagy
@ 2020-04-19 12:26 ` liheng (P)
2020-04-19 13:10 ` Florian Weimer
2020-04-20 1:26 ` Rich Felker
0 siblings, 2 replies; 9+ messages in thread
From: liheng (P) @ 2020-04-19 12:26 UTC (permalink / raw)
To: Szabolcs Nagy
Cc: musl, Florian Weimer, Rich Felker, Xiangrui (Euler), Lizefan
Ok, you are right, I retest to match "aba" by pat[] = "\\(.\\?\\).\\?\\1" success without tags (basic regular expression mode I think).
regcomp1(&rbuf, pat, 0);
But my point is that why pat[] = "(.?).?\\1" to match "aba" in extended regular expression mode that success in glibc and failed in musl? Are musl-regex and glibc-regex different?
-----Original Message-----
From: Szabolcs Nagy [mailto:nsz@port70.net]
Sent: Saturday, April 18, 2020 10:07 PM
To: liheng (P) <liheng40@huawei.com>
Cc: musl@lists.openwall.com; Florian Weimer <fw@deneb.enyo.de>; Rich Felker <dalias@libc.org>; Xiangrui (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
* liheng (P) <liheng40@huawei.com> [2020-04-18 11:37:13 +0000]:
> static const char pat[] = "\\(.?\\).?\\1"; str = "aba"
>
> ok, I retest this pat with no tag.
why?
? is not special in bre.
you need "\\{0,1\\}" or "\\?" instead of "?" to match "aba"
your pat would match str="a?b?a?" in a standard conform implementation.
>
> regcomp(&rbuf, pat, 0);
> regexec1(&rbuf, str, N, m, 0);
>
> glibc:
> # ./test
> regexec failed
> test regex failed
>
> musl:
> # ./test
> regexec failed
> test regex failed
>
>
>
> -----Original Message-----
> From: Szabolcs Nagy [mailto:nsz@port70.net]
> Sent: Saturday, April 18, 2020 7:13 PM
> To: liheng (P) <liheng40@huawei.com>
> Cc: Florian Weimer <fw@deneb.enyo.de>; Rich Felker <dalias@libc.org>;
> musl@lists.openwall.com; Xiangrui (Euler) <rui.xiang@huawei.com>;
> Lizefan <lizefan@huawei.com>
> Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
>
> * liheng (P) <liheng40@huawei.com> [2020-04-18 11:07:20 +0000]:
> > static const char pat[] = "\\(.?\\).?\\1";
> > string: "aba";
>
> ? is not special in bre
>
> it should be \{0,1\} (i think we support \? as an extension, but
> unescaped ? only matches literal ?). try one of
>
> static const char pat[] = "\\(.\\{0,1\\}\\).\\{0,1\\}\\1"; static
> const char pat[] = "\\(.\\?\\).\\?\\1";
>
> >
> > I tested this pattern by my test case just now.
> >
> > musl:
> > # ./test
> > regexec failed
> > test regex failed
> >
> > glibc:
> > # ./test
> > Invalid back reference
> > test regex failed
> >
> > tre:
> > # ./test
> > Invalid back reference
> > test regex failed
> >
> > -----Original Message-----
> > From: Florian Weimer [mailto:fw@deneb.enyo.de]
> > Sent: Saturday, April 18, 2020 6:29 PM
> > To: liheng (P) <liheng40@huawei.com>
> > Cc: Rich Felker <dalias@libc.org>; musl@lists.openwall.com; Xiangrui
> > (Euler) <rui.xiang@huawei.com>; Lizefan <lizefan@huawei.com>
> > Subject: Re: [musl] regex Back reference matching result not same as glibc and tre.
> >
> > * liheng:
> >
> > > static const char pat[] = "(.?).?\\1";
> >
> > > This commit reminds me that if i want to use back reference i
> > > should not to tag REG_EXTENDED, but this test case matching still failed.
> >
> >
> > Did you change the expression to this for the basic regular expression test?
> >
> > static const char pat[] = "\\(.?\\).?\\1";
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] regex Back reference matching result not same as glibc and tre.
2020-04-19 12:26 ` liheng (P)
@ 2020-04-19 13:10 ` Florian Weimer
2020-04-20 1:26 ` Rich Felker
1 sibling, 0 replies; 9+ messages in thread
From: Florian Weimer @ 2020-04-19 13:10 UTC (permalink / raw)
To: liheng (P); +Cc: Szabolcs Nagy, musl, Rich Felker, Xiangrui (Euler), Lizefan
* liheng:
> But my point is that why pat[] = "(.?).?\\1" to match "aba" in
> extended regular expression mode that success in glibc and failed in
> musl? Are musl-regex and glibc-regex different?
They are different. Nowadays, accepting backreferences for extended
regular expressions is probably a bug: it prevents certain strategies
for implementing regular expressions because they are not, in fact,
regular.
The glibc implementation is problematic for several reasons. I cannot
recommend to use it as a reference.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] regex Back reference matching result not same as glibc and tre.
2020-04-19 12:26 ` liheng (P)
2020-04-19 13:10 ` Florian Weimer
@ 2020-04-20 1:26 ` Rich Felker
1 sibling, 0 replies; 9+ messages in thread
From: Rich Felker @ 2020-04-20 1:26 UTC (permalink / raw)
To: liheng (P); +Cc: Szabolcs Nagy, musl, Florian Weimer, Xiangrui (Euler), Lizefan
On Sun, Apr 19, 2020 at 12:26:58PM +0000, liheng (P) wrote:
> Ok, you are right, I retest to match "aba" by pat[] =
> "\\(.\\?\\).\\?\\1" success without tags (basic regular expression
> mode I think). regcomp1(&rbuf, pat, 0);
>
> But my point is that why pat[] = "(.?).?\\1" to match "aba" in
> extended regular expression mode that success in glibc and failed in
> musl? Are musl-regex and glibc-regex different?
Having backreferences in regular expressions is inherently a bug,
because it makes them non-regular and unsafe for use where the
expression is user input and finishing in finite time is a
requirement. They're supported in BRE because it's a historical
requirement impossed by the standard, but a major *feature* of ERE is
actually being regular and not having backreferences.
Rich
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2020-04-20 1:26 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-18 8:44 [musl] regex Back reference matching result not same as glibc and tre liheng (P)
2020-04-18 10:28 ` Florian Weimer
2020-04-18 11:07 ` liheng (P)
2020-04-18 11:13 ` Szabolcs Nagy
2020-04-18 11:37 ` liheng (P)
2020-04-18 14:07 ` Szabolcs Nagy
2020-04-19 12:26 ` liheng (P)
2020-04-19 13:10 ` Florian Weimer
2020-04-20 1:26 ` Rich Felker
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).