Date: Thu, 4 Aug 2022 00:43:42 +0200
From: Szabolcs Nagy
To: Mike Beattie
Cc: musl@lists.openwall.com
Message-ID: <20220803224342.GF1320090@port70.net>
In-Reply-To: <20220721060819.GB9838@prometheus.ethernal.org>
Subject: Re: [musl] Bug: BOL/EOL anchors in regex capture groups won't match EOL

* Mike Beattie [2022-07-21 18:08:19 +1200]:
> FRRouting uses musl-libc in its docker container build, and it also appears
> to be in use in the GNS3 appliances for frr available online.
>
> BGP as-path matching is regex powered, and usage of a special token of '_'
> allows for the easy matching of the boundary of an ASN in an as-path.
> Internally, it's translated into the regex capture group of:
>
>   (^|[,{}() ]|$)
>
> A valid as-path is a sequence of integers such as:
>
>   100 200 300
>
> A BGP as-path filter might be specified as so:
>
>   bgp as-path access-list foo seq 20 permit _300_
>
> which would get expanded to:
>
>   (^|[,{}() ]|$)300(^|[,{}() ]|$)
>
> when checking for a match.
>
> The usage of the pattern "(^|$)" in musl's regex
> implementation will never match EOL, but it does match BOL. Removal of the
> circumflex will let the match succeed.

thanks for the report.

it seems to me regcomp does not handle assertions correctly
if there is a union (|) of multiple subexpressions that can
match the empty string.

it simply takes the assertion of the leftmost such subexpression,
so e.g. '(|$)a' matches 'a' but '($|)a' does not, because it is
matched as '$a' and the $ assertion fails.

since posix does not allow an empty alternative such as '(|' in
the syntax, a conforming example is e.g. '(b*|$)a' vs '($|b*)a'.

all supported assertions are affected (^, $, \b, \B, \<, \>).

the fix is not obvious: there is a regcomp step like

	tags, assertions = leftmost_empty_match(subexpr)
	process(tags, assertions)

which should be

	list = all_empty_match(subexpr)
	for tags, assertions in list:
		if assertions are weaker than previous ones:
			process(tags, assertions)

i think this can increase storage and computation requirements
significantly unless the algorithm is further optimized.

>
> Here is the output of a test program I've written to confirm this:
>
> $ musl-gcc -o r r.c
>
> $ ./r "_300_" "100 200 300"
> regex: (^|[,{}() ]|$)300(^|[,{}() ]|$)
> regexec on [100 200 300]: NOT Found
>
> Removal of "^|" from the beginning of the trailing capture group:
>
> $ ./r "(^|[,{}() ]|$)300([,{}() ]|$)" "0000 1111 2222"
> regex: (^|[,{}() ]|$)300([,{}() ]|$)
> regexec on [100 200 300]: Found
>
> Thanks,
> Mike.
> --
> Mike Beattie
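
for reference, the two conforming patterns from the analysis above can be
compared with a small helper along these lines. this is only a sketch:
re_matches() is a hypothetical name made up for illustration, it is not part
of musl and not the reporter's r.c. it only uses the standard posix
regcomp/regexec/regfree calls with REG_EXTENDED.

```c
/* sketch only: re_matches() is an illustrative helper, not musl code
 * and not the r.c test program from the report. */
#include <regex.h>
#include <stddef.h>

/* returns 1 if the posix extended regex `pattern` matches somewhere in
 * `text`, 0 if regexec does not report a match, and -1 if regcomp
 * rejects the pattern. */
int re_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	/* REG_NOSUB: only a yes/no answer is needed, no submatch offsets */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;

	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);

	/* regexec returns 0 on a match; anything else is treated as no match */
	return rc == 0 ? 1 : 0;
}
```

calling re_matches("(b*|$)a", "a") and re_matches("($|b*)a", "a") and
comparing the results is then enough to see the difference: both patterns
describe the same language, so a differing result points at the
leftmost-empty-match behaviour described above.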