mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] Bug: BOL/EOL anchors in regex capture groups won't match EOL
@ 2022-07-21  6:08 Mike Beattie
  2022-07-22  2:02 ` [musl] " Mike Beattie
  2022-08-03 22:43 ` [musl] " Szabolcs Nagy
  0 siblings, 2 replies; 3+ messages in thread
From: Mike Beattie @ 2022-07-21  6:08 UTC (permalink / raw)
  To: musl

FRRouting uses musl-libc in its docker container build, and it also appears
to be in use in the GNS3 appliances for frr available online.

BGP as-path matching is regex powered, and usage of a special token of '_'
allows for the easy matching of the boundary of an ASN in an as-path.
Internally, it's translated into the regex capture group of:

   (^|[,{}() ]|$)

A valid as-path is a sequence of integers such as:

   100 200 300

A BGP as-path filter might be specified as so:

   bgp as-path access-list foo seq 20 permit _300_

which would get expanded to:

   (^|[,{}() ]|$)300(^|[,{}() ]|$)

when checking for a match. The usage of the pattern "(^|$)" in musl's regex
implementation will never match EOL, but it does match BOL. Removal of the
circumflex will let the match succeed.

Here is the output of a test programs I've written to confirm this:

   $ musl-gcc -o r r.c

   $ ./r "_300_" "100 200 300"
   regex: (^|[,{}() ]|$)300(^|[,{}() ]|$)
   regexec on [100 200 300]: NOT Found

Removal of "^|" from the beginning of the trailing capture group:

   $ ./r "(^|[,{}() ]|$)300([,{}() ]|$)" "0000 1111 2222"
   regex: (^|[,{}() ]|$)300([,{}() ]|$)
   regexec on [100 200 300]: Found

Thanks,
Mike.
-- 
Mike Beattie <mike@ethernal.org>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [musl] Re: Bug: BOL/EOL anchors in regex capture groups won't match EOL
  2022-07-21  6:08 [musl] Bug: BOL/EOL anchors in regex capture groups won't match EOL Mike Beattie
@ 2022-07-22  2:02 ` Mike Beattie
  2022-08-03 22:43 ` [musl] " Szabolcs Nagy
  1 sibling, 0 replies; 3+ messages in thread
From: Mike Beattie @ 2022-07-22  2:02 UTC (permalink / raw)
  To: musl

On Thu, Jul 21, 2022 at 06:08:19PM +1200, Mike Beattie wrote:
> Removal of "^|" from the beginning of the trailing capture group:
> 
>    $ ./r "(^|[,{}() ]|$)300([,{}() ]|$)" "0000 1111 2222"
>    regex: (^|[,{}() ]|$)300([,{}() ]|$)
>    regexec on [100 200 300]: Found

Please excuse the usage of "0000 1111 2222", that should be "100 200 300", I
was testing some much longer complex paths and regexes, but was distilling
the content of this report down to be as concise as possible.

I'm also not subscribed, please CC me on replies.

Mike.
-- 
Mike Beattie <mike@ethernal.org>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [musl] Bug: BOL/EOL anchors in regex capture groups won't match EOL
  2022-07-21  6:08 [musl] Bug: BOL/EOL anchors in regex capture groups won't match EOL Mike Beattie
  2022-07-22  2:02 ` [musl] " Mike Beattie
@ 2022-08-03 22:43 ` Szabolcs Nagy
  1 sibling, 0 replies; 3+ messages in thread
From: Szabolcs Nagy @ 2022-08-03 22:43 UTC (permalink / raw)
  To: Mike Beattie; +Cc: musl

* Mike Beattie <mike@ethernal.org> [2022-07-21 18:08:19 +1200]:
> FRRouting uses musl-libc in its docker container build, and it also appears
> to be in use in the GNS3 appliances for frr available online.
> 
> BGP as-path matching is regex powered, and usage of a special token of '_'
> allows for the easy matching of the boundary of an ASN in an as-path.
> Internally, it's translated into the regex capture group of:
> 
>    (^|[,{}() ]|$)
> 
> A valid as-path is a sequence of integers such as:
> 
>    100 200 300
> 
> A BGP as-path filter might be specified as so:
> 
>    bgp as-path access-list foo seq 20 permit _300_
> 
> which would get expanded to:
> 
>    (^|[,{}() ]|$)300(^|[,{}() ]|$)
> 
> when checking for a match. The usage of the pattern "(^|$)" in musl's regex
> implementation will never match EOL, but it does match BOL. Removal of the
> circumflex will let the match succeed.

thanks for the report.

it seems to me regcomp does not handle assertions corretly if there is
a union (|) of multiple subexpressions that match the empty string.

it simply takes the assertion of the leftmost subexpression so e.g.

'(|$)a' matches 'a' but
'($|)a' does not because it matches as '$a' and the $ assertion fail.

since posix does not allow (| empty pattern in the syntax a conforming
example is e.g.

'(b*|$)a' vs '($|b*)a'

all supported assertions are affected (^, $, \b, \B, \<, \>).

the fix is not obvious: there is a regcomp step like

	tags, assertions = leftmost_empty_match(subexpr)
	process(tags, assertions)

which should be

	list = all_empty_match(subexpr)
	for tags, assertions in list:
		if assertions are weaker than previous ones:
			process(tags, assertions)

i think this can increase storage and computation requirements
significantly unless the algorithm is further optimized.


> 
> Here is the output of a test programs I've written to confirm this:
> 
>    $ musl-gcc -o r r.c
> 
>    $ ./r "_300_" "100 200 300"
>    regex: (^|[,{}() ]|$)300(^|[,{}() ]|$)
>    regexec on [100 200 300]: NOT Found
> 
> Removal of "^|" from the beginning of the trailing capture group:
> 
>    $ ./r "(^|[,{}() ]|$)300([,{}() ]|$)" "0000 1111 2222"
>    regex: (^|[,{}() ]|$)300([,{}() ]|$)
>    regexec on [100 200 300]: Found
> 
> Thanks,
> Mike.
> -- 
> Mike Beattie <mike@ethernal.org>

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-08-03 22:43 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-21  6:08 [musl] Bug: BOL/EOL anchors in regex capture groups won't match EOL Mike Beattie
2022-07-22  2:02 ` [musl] " Mike Beattie
2022-08-03 22:43 ` [musl] " Szabolcs Nagy

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).