mailing list of musl libc
* [PATCH] handle ^ and $ in BRE subexpression start and end as anchors
@ 2016-11-24  0:44 Szabolcs Nagy
  2016-11-24 14:46 ` Rich Felker
From: Szabolcs Nagy @ 2016-11-24  0:44 UTC
  To: musl

In BRE, ^ is an anchor at the beginning of an expression, optionally
it may be an anchor at the beginning of a subexpression and must be
treated as a literal otherwise.

Previously musl treated ^ in subexpressions as a literal, but at least
glibc and GNU sed treat it as an anchor, and that is the more useful
behaviour: it can always be escaped to get back the literal meaning.

Same for $ at the end of a subexpression.

Portable BRE should not rely on this, but there are sed commands in
build scripts which do.

This changes the meaning of the BREs:

	\(^a\)
	\(a\|^b\)
	\(a$\)
	\(a$\|b\)
---
a bit of a hackish solution, but it turns out ctx->re was not used for
anything other than detecting whether ^ was at the start of a full BRE,
so it now tracks the start of the current subexpression instead.

no regressions on my regex tests.
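
For anyone who wants to see which reading their libc uses, here is a
minimal sketch built only on the POSIX regcomp/regexec interface. The
helper names bre_matches and report_changed_bres are made up for this
example and are not part of musl or its test suite. Each test string is
chosen so that it matches only if ^ or $ is taken literally. Note that
\| is itself an extension in BRE, so some implementations may reject or
reinterpret those two patterns.

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 on match, 0 on no match, -1 if the BRE does not compile. */
static int bre_matches(const char *pattern, const char *text)
{
	regex_t re;
	int r;

	/* No REG_EXTENDED in cflags, so the pattern is compiled as a BRE. */
	if (regcomp(&re, pattern, REG_NOSUB) != 0)
		return -1;
	r = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return r == 0;
}

/* Hypothetical report helper: each text matches only under the literal reading. */
void report_changed_bres(void)
{
	static const char *const cases[][2] = {
		{ "\\(^a\\)",     "^a"  },
		{ "\\(a\\|^b\\)", "x^b" },
		{ "\\(a$\\)",     "a$x" },
		{ "\\(a$\\|b\\)", "a$x" },
	};
	size_t i;

	for (i = 0; i < sizeof cases / sizeof cases[0]; i++) {
		int r = bre_matches(cases[i][0], cases[i][1]);
		printf("%-12s on %-4s: %s\n", cases[i][0], cases[i][1],
		       r < 0 ? "compile error" : r ? "literal" : "anchor");
	}
}
```

Calling report_changed_bres() from main prints one line per pattern;
it only reports what the linked libc does and asserts nothing.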

 src/regex/regcomp.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/src/regex/regcomp.c b/src/regex/regcomp.c
index 65f2fd0..5a7b53a 100644
--- a/src/regex/regcomp.c
+++ b/src/regex/regcomp.c
@@ -401,8 +401,8 @@ typedef struct {
 	tre_ast_node_t *n;
 	/* Position in the regexp pattern after a parse function returns. */
 	const char *s;
-	/* The first character of the regexp. */
-	const char *re;
+	/* The first character of the last subexpression parsed. */
+	const char *start;
 	/* Current submatch ID. */
 	int submatch_id;
 	/* Current position (number of literal). */
@@ -876,14 +876,14 @@ static reg_errcode_t parse_atom(tre_parse_ctx_t *ctx, const char *s)
 		break;
 	case '^':
 		/* '^' has a special meaning everywhere in EREs, and at beginning of BRE. */
-		if (!ere && s != ctx->re)
+		if (!ere && s != ctx->start)
 			goto parse_literal;
 		node = tre_ast_new_literal(ctx->mem, ASSERTION, ASSERT_AT_BOL, -1);
 		s++;
 		break;
 	case '$':
-		/* '$' is special everywhere in EREs, and in the end of the string in BREs. */
-		if (!ere && s[1])
+		/* '$' is special everywhere in EREs, and at the end of a BRE subexpression. */
+		if (!ere && s[1] && (s[1]!='\\'|| (s[2]!=')' && s[2]!='|')))
 			goto parse_literal;
 		node = tre_ast_new_literal(ctx->mem, ASSERTION, ASSERT_AT_EOL, -1);
 		s++;
@@ -944,7 +944,7 @@ static reg_errcode_t tre_parse(tre_parse_ctx_t *ctx)
 {
 	tre_ast_node_t *nbranch=0, *nunion=0;
 	int ere = ctx->cflags & REG_EXTENDED;
-	const char *s = ctx->re;
+	const char *s = ctx->start;
 	int subid = 0;
 	int depth = 0;
 	reg_errcode_t err;
@@ -962,6 +962,7 @@ static reg_errcode_t tre_parse(tre_parse_ctx_t *ctx)
 				s++;
 			depth++;
 			nbranch = nunion = 0;
+			ctx->start = s;
 			continue;
 		}
 		if ((!ere && *s == '\\' && s[1] == ')') ||
@@ -994,8 +995,8 @@ static reg_errcode_t tre_parse(tre_parse_ctx_t *ctx)
 			if (*s=='\\')
 				s++;
 
-			/* handle ^* at the start of a complete BRE. */
-			if (!ere && s==ctx->re+1 && s[-1]=='^')
+			/* handle ^* at the start of a BRE. */
+			if (!ere && s==ctx->start+1 && s[-1]=='^')
 				break;
 
 			/* extension: multiple consecutive *+?{,} is unspecified,
@@ -1038,8 +1039,10 @@ static reg_errcode_t tre_parse(tre_parse_ctx_t *ctx)
 
 			if (c == '\\' && s[1] == '|') {
 				s+=2;
+				ctx->start = s;
 			} else if (c == '|') {
 				s++;
+				ctx->start = s;
 			} else {
 				if (c == '\\') {
 					if (!depth) return REG_EPAREN;
@@ -2705,7 +2708,7 @@ regcomp(regex_t *restrict preg, const char *restrict regex, int cflags)
   memset(&parse_ctx, 0, sizeof(parse_ctx));
   parse_ctx.mem = mem;
   parse_ctx.stack = stack;
-  parse_ctx.re = regex;
+  parse_ctx.start = regex;
   parse_ctx.cflags = cflags;
   parse_ctx.max_backref = -1;
   errcode = tre_parse(&parse_ctx);
-- 
2.10.2




* Re: [PATCH] handle ^ and $ in BRE subexpression start and end as anchors
  2016-11-24  0:44 [PATCH] handle ^ and $ in BRE subexpression start and end as anchors Szabolcs Nagy
@ 2016-11-24 14:46 ` Rich Felker
  2016-11-24 15:14   ` Szabolcs Nagy
From: Rich Felker @ 2016-11-24 14:46 UTC
  To: musl

On Thu, Nov 24, 2016 at 01:44:49AM +0100, Szabolcs Nagy wrote:
> In BRE, ^ is an anchor at the beginning of an expression, optionally
> it may be an anchor at the beginning of a subexpression and must be
> treated as a literal otherwise.
> 
> Previously musl treated ^ in subexpressions as a literal, but at least
> glibc and GNU sed treat it as an anchor, and that is the more useful
> behaviour: it can always be escaped to get back the literal meaning.
> 
> Same for $ at the end of a subexpression.
> 
> Portable BRE should not rely on this, but there are sed commands in
> build scripts which do.
> 
> This changes the meaning of the BREs:
> 
> 	\(^a\)
> 	\(a\|^b\)
> 	\(a$\)
> 	\(a$\|b\)
> ---
> a bit of a hackish solution, but it turns out ctx->re was not used for
> anything other than detecting whether ^ was at the start of a full BRE,
> so it now tracks the start of the current subexpression instead.

The renaming of the member from re to start is to prove that there are
no other users that get broken by this? If so, I like that.

Rich



* Re: [PATCH] handle ^ and $ in BRE subexpression start and end as anchors
  2016-11-24 14:46 ` Rich Felker
@ 2016-11-24 15:14   ` Szabolcs Nagy
From: Szabolcs Nagy @ 2016-11-24 15:14 UTC
  To: musl

* Rich Felker <dalias@libc.org> [2016-11-24 09:46:35 -0500]:
> The renaming of the member from re to start is to prove that there are
> no other users that get broken by this? If so, I like that.

yeah, without the rename the patch was puzzling,
updating ->re magically made the code work :)
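
To illustrate why moving the pointer is enough, here is a rough
standalone sketch of the same idea. It is not the musl parser:
count_bre_anchors and struct anchor_count are hypothetical names, and
the sketch assumes a BRE without bracket expressions or intervals and
ignores the ^* special case. The local variable start plays the role of
ctx->start and is reset after every \( and \|, and the $ test mirrors
the lookahead condition from the patch.

```c
/* Hypothetical helper, not part of musl: counts the '^' and '$'
 * characters that the patched BRE rules would treat as anchors. */
struct anchor_count { int bol; int eol; };

static struct anchor_count count_bre_anchors(const char *re)
{
	struct anchor_count n = {0, 0};
	const char *start = re;	/* plays the role of ctx->start */
	const char *s = re;

	while (*s) {
		if (*s == '\\' && (s[1] == '(' || s[1] == '|')) {
			/* a subexpression or branch begins after \( or \| */
			s += 2;
			start = s;
		} else if (*s == '\\' && s[1]) {
			/* any other escape, including \^ and \$, is skipped */
			s += 2;
		} else if (*s == '^') {
			if (s == start)
				n.bol++;
			s++;
		} else if (*s == '$') {
			/* anchor at end of pattern or right before \) or \| */
			if (!s[1] || (s[1] == '\\' && (s[2] == ')' || s[2] == '|')))
				n.eol++;
			s++;
		} else {
			s++;
		}
	}
	return n;
}
```

With a single pointer that only ever marks the start of the whole
pattern, the s == start test can never succeed inside \( ... \), which
is the old behaviour.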




Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
