From: Szabolcs Nagy
To: musl@lists.openwall.com
Cc: Robert Högberg
Subject: Re: Unexpected regex behaviour
Date: Tue, 30 Oct 2018 12:05:05 +0100
Message-ID: <20181030110505.GA2032@port70.net>
In-Reply-To: <20181029225957.GR5150@brightrain.aerifal.cx>

* Rich Felker [2018-10-29 18:59:57 -0400]:
> On Mon, Oct 29, 2018 at 11:26:19PM +0100, Robert Högberg wrote:
> > Hi,
> >
> > I've noticed that the musl regex implementation behaves slightly
> > differently than the glibc implementation.
> > I'm attaching a short program
> > showing the behaviour.
> >
> > The difference makes yate (http://yate.null.ro) misbehave when running with
> > musl (reported here: https://github.com/openwrt/telephony/issues/378).
> >
> > Yate uses a regexp like this:
> > "^\\([[:alpha:]][[:alnum:]]\\+:\\)\\?/\\?/\\?\\([^[:space:][:cntrl:]@]\\+@\\)\\?\\([[:alnum:]._+-]\\+\\|[[][[:xdigit:].:]\\+[]]\\)\\(:[0-9]\\+\\)\\?"
> >
> > ... to parse strings like:
> > "sip:012345678@11.111.11.111:5060;user=phone"
> >
> > ... and the matches produced by musl are:
> > Match 0: 0 - 32 sip:012345678@11.111.11.111:5060
> > Match 1: -1 - -1
> > Match 2: 0 - 14 sip:012345678@
> > Match 3: 14 - 27 11.111.11.111
> > Match 4: 27 - 32 :5060
> >
> > ... while glibc produces:
> > Match 0: 0 - 32 sip:012345678@11.111.11.111:5060
> > Match 1: 0 - 4 sip:
> > Match 2: 4 - 14 012345678@
> > Match 3: 14 - 27 11.111.11.111
> > Match 4: 27 - 32 :5060
> >
> > What do you think?
> >
> > I've only tested musl 1.1.19. Sorry if this is not valid for later
> > releases. I skimmed the 1.1.20 release notes and didn't find anything
> > regex related.
>
> I haven't checked which of the extensions you're using are supported
> in musl, but the above is not a conforming POSIX BRE. It would be a
> lot more readable and portable to use POSIX ERE (REG_EXTENDED) which
> has the +, ?, and | operators as standard features. This looks like it
> should work:
>
> "^([[:alpha:]][[:alnum:]]+:)?/?/?([^[:space:][:cntrl:]@]+@)?([[:alnum:]._+-]+|[[][[:xdigit:].:]+[]])(:[0-9]+)?"
>
> The only reason to use POSIX BRE is if you need backreferences, which
> are not regular and explicitly not supported in ERE.

rewriting it as ERE should not change the grouping behaviour (\+, \?
and \| are non-standard extensions in BRE, but we support those and the
same engine is used as for ERE)

the problem is that the string can be divided in multiple ways into
groups to match the pattern, in such cases posix requires that the
left-most pattern should match longest, which does not seem to work
in musl.

i think neither musl nor glibc gets this right at all times, but i think
this is a simple case that should work.

simpler example (musl busybox sed):

$ echo 'sip:0123' |sed -r 's,^(sip:)?(.+)?,1=\1\n2=\2\n,'
1=sip:
2=0123

$ echo 'sip:0123' |sed -r 's,^(sip:)?/?(.+)?,1=\1\n2=\2\n,'
1=
2=sip:0123

$ echo 'sip:0123' |sed -r 's,^(sip:)?/*(.+)?,1=\1\n2=\2\n,'
1=
2=sip:0123

in all cases \1 should match sip:, but somehow .+ wins when there
is a subpattern with empty match in the middle.
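
the sed cases above can also be checked directly against the libc regex
functions. the following is only a minimal sketch, not the program attached
to the original report: match_groups and print_groups are hypothetical
helper names made up here, and the only library calls assumed are the
standard regcomp, regexec and regfree from <regex.h>.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not from the original attachment): compile `pattern`
 * as a POSIX ERE, run it against `subject`, and store the submatch offsets
 * in `groups`. Returns 0 on a match, a nonzero REG_* code otherwise. */
int match_groups(const char *pattern, const char *subject,
                 regmatch_t *groups, size_t ngroups)
{
    regex_t re;
    int rc;

    assert(groups != NULL && ngroups > 0);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&re, subject, ngroups, groups, 0);
    regfree(&re);
    return rc;
}

/* Print each group in the "Match N: start - end text" style of the report.
 * Groups that did not participate in the match have rm_so == -1. */
void print_groups(const char *subject, const regmatch_t *groups, size_t ngroups)
{
    for (size_t i = 0; i < ngroups; i++) {
        if (groups[i].rm_so < 0) {
            printf("Match %zu: -1 - -1\n", i);
            continue;
        }
        printf("Match %zu: %d - %d %.*s\n", i,
               (int)groups[i].rm_so, (int)groups[i].rm_eo,
               (int)(groups[i].rm_eo - groups[i].rm_so),
               subject + groups[i].rm_so);
    }
}
```

calling match_groups with "^(sip:)?(.+)?", "^(sip:)?/?(.+)?" and
"^(sip:)?/*(.+)?" on "sip:0123" with three regmatch_t slots, then
print_groups on the result, gives the same information as the sed runs.
by the posix rule described above, Match 1 should be 0 - 4 in all three
cases; what is actually printed depends on the libc and version in use.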