From: Szabolcs Nagy <nsz@port70.net>
To: musl@lists.openwall.com
Subject: [PATCH] handle ^ and $ in BRE subexpression start and end as anchors
Date: Thu, 24 Nov 2016 01:44:49 +0100 [thread overview]
Message-ID: <20161124004448.GV5749@port70.net> (raw)
In BRE, ^ is an anchor at the beginning of an expression, optionally
it may be an anchor at the beginning of a subexpression and must be
treated as a literal otherwise.
Previously musl treated ^ in subexpressions as literal, but at least
glibc and gnu sed treats it as an anchor and that's the more useful
behaviour: it can always be escaped to get back the literal meaning.
Same for $ at the end of a subexpression.
Portable BRE should not rely on this, but there are sed commands in
build scripts which do.
This changes the meaning of the BREs:
\(^a\)
\(a\|^b\)
\(a$\)
\(a$\|b\)
---
bit hackish solution, but turns out ctx->re was not used for anything
else than to detect if ^ was at the start of a full bre, changed that
to start of a subexpr now.
no regressions on my regex tests.
src/regex/regcomp.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/src/regex/regcomp.c b/src/regex/regcomp.c
index 65f2fd0..5a7b53a 100644
--- a/src/regex/regcomp.c
+++ b/src/regex/regcomp.c
@@ -401,8 +401,8 @@ typedef struct {
tre_ast_node_t *n;
/* Position in the regexp pattern after a parse function returns. */
const char *s;
- /* The first character of the regexp. */
- const char *re;
+ /* The first character of the last subexpression parsed. */
+ const char *start;
/* Current submatch ID. */
int submatch_id;
/* Current position (number of literal). */
@@ -876,14 +876,14 @@ static reg_errcode_t parse_atom(tre_parse_ctx_t *ctx, const char *s)
break;
case '^':
/* '^' has a special meaning everywhere in EREs, and at beginning of BRE. */
- if (!ere && s != ctx->re)
+ if (!ere && s != ctx->start)
goto parse_literal;
node = tre_ast_new_literal(ctx->mem, ASSERTION, ASSERT_AT_BOL, -1);
s++;
break;
case '$':
- /* '$' is special everywhere in EREs, and in the end of the string in BREs. */
- if (!ere && s[1])
+ /* '$' is special everywhere in EREs, and at the end of a BRE subexpression. */
+ if (!ere && s[1] && (s[1]!='\\'|| (s[2]!=')' && s[2]!='|')))
goto parse_literal;
node = tre_ast_new_literal(ctx->mem, ASSERTION, ASSERT_AT_EOL, -1);
s++;
@@ -944,7 +944,7 @@ static reg_errcode_t tre_parse(tre_parse_ctx_t *ctx)
{
tre_ast_node_t *nbranch=0, *nunion=0;
int ere = ctx->cflags & REG_EXTENDED;
- const char *s = ctx->re;
+ const char *s = ctx->start;
int subid = 0;
int depth = 0;
reg_errcode_t err;
@@ -962,6 +962,7 @@ static reg_errcode_t tre_parse(tre_parse_ctx_t *ctx)
s++;
depth++;
nbranch = nunion = 0;
+ ctx->start = s;
continue;
}
if ((!ere && *s == '\\' && s[1] == ')') ||
@@ -994,8 +995,8 @@ static reg_errcode_t tre_parse(tre_parse_ctx_t *ctx)
if (*s=='\\')
s++;
- /* handle ^* at the start of a complete BRE. */
- if (!ere && s==ctx->re+1 && s[-1]=='^')
+ /* handle ^* at the start of a BRE. */
+ if (!ere && s==ctx->start+1 && s[-1]=='^')
break;
/* extension: multiple consecutive *+?{,} is unspecified,
@@ -1038,8 +1039,10 @@ static reg_errcode_t tre_parse(tre_parse_ctx_t *ctx)
if (c == '\\' && s[1] == '|') {
s+=2;
+ ctx->start = s;
} else if (c == '|') {
s++;
+ ctx->start = s;
} else {
if (c == '\\') {
if (!depth) return REG_EPAREN;
@@ -2705,7 +2708,7 @@ regcomp(regex_t *restrict preg, const char *restrict regex, int cflags)
memset(&parse_ctx, 0, sizeof(parse_ctx));
parse_ctx.mem = mem;
parse_ctx.stack = stack;
- parse_ctx.re = regex;
+ parse_ctx.start = regex;
parse_ctx.cflags = cflags;
parse_ctx.max_backref = -1;
errcode = tre_parse(&parse_ctx);
--
2.10.2
next reply other threads:[~2016-11-24 0:44 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-24 0:44 Szabolcs Nagy [this message]
2016-11-24 14:46 ` Rich Felker
2016-11-24 15:14 ` Szabolcs Nagy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20161124004448.GV5749@port70.net \
--to=nsz@port70.net \
--cc=musl@lists.openwall.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).