mailing list of musl libc
 help / color / mirror / code / Atom feed
* regex libs (was Re: [musl] embedded newbies site.)
@ 2013-07-16 15:10 LM
  2013-07-16 15:32 ` Rich Felker
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: LM @ 2013-07-16 15:10 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3005 bytes --]

On Tue, Jul 16, 2013 at 10:00 AM, Rich Felker <dalias@aerifal.cx> wrote:

> The whole concept of regular expressions is that they're regular,
> meaning they're matchable in O(n) time with O(1) space. PCRE (the
> implementation) uses backtracking for everything, giving it
> exponentially-bad performance (JIT cannot fix this), and PCRE (the
> language) has a lot of features that are fundamentally not regular and
> thus can't be implemented efficiently. Also, the behavior of some of
> the features (e.g. greedy vs non-greedy matching) were not designed
> intentionally but just arose out of the backtracking implementation,
> and thus don't make a lot of sense unless you think from the
> standpoint of such an implementation.
>

Went back and rechecked the documentation (
http://www.pcre.org/readme.txt).  You're both right, PCRE is offering
the Perl regular expressions
implementation even when one uses the pcreposix interface.  Would have been
nice if they offered actual regular expressions handling if you only want
to use the POSIX compatible part of the interface.

So what are some good regex library solutions?  I'm also wondering if there
are some good cross-platform portable library solutions (or if PCRE is the
best pattern matching solution from a portability standpoint even if it's
not strictly regex compatible).

There's http://code.google.com/p/re2/ , but I've read some issues with its
performance in a few web articles and didn't have much luck with
portability to non-Linux platforms.  There's the glibc solution:
http://sourceforge.net/p/mingw/regex/ci/master/tree/  There's TRE (
https://github.com/laurikari/tre/), which some BSD systems want to use to
create their grep ( https://wiki.freebsd.org/BSDgrep ).  There's the
Oniguruma library ( http://www.geocities.jp/kosako3/oniguruma/ ). There's
Henry Spencer's regex ( http://www.arglist.com/regex ).  That looks
promising for portability.  There's http://re2c.org/ Further searching also
turns up http://tiny-rex.sourceforge.net/ which may have the same issues as
PCRE.  ICU seems to offer regex code (
http://userguide.icu-project.org/strings/regexp), probably same issue as
PCRE.  (Just my opinion, but ICU seems to do a lot of stuff for just one
library.)  There's BOOST's regex (
http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/ ) which a lot of web
sites recommend, but I've just never been a fan of the BOOST libraries.
The Heirloom Project has regex code ( http://heirloom.sourceforge.net/ ).
Have I missed any other interesting solutions?

Sounds like I need to better clarify between regex pattern matching
libraries and pattern matching libraries on the musl wiki's alternative
library page.  If you recommend any of the above libraries or possibly
others and think certain implementations would be useful to others, let me
know and I'll add the links to the wiki.  I haven't really added anything
for pattern-matching libraries beyond benchmark information.

Thanks.

Sincerely,
Laura
http://www.distasis.com

[-- Attachment #2: Type: text/html, Size: 4091 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex libs (was Re: [musl] embedded newbies site.)
  2013-07-16 15:10 regex libs (was Re: [musl] embedded newbies site.) LM
@ 2013-07-16 15:32 ` Rich Felker
  2013-07-17  5:38   ` Isaac
  2013-07-16 15:41 ` Justin Cormack
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2013-07-16 15:32 UTC (permalink / raw)
  To: musl

On Tue, Jul 16, 2013 at 11:10:34AM -0400, LM wrote:
> On Tue, Jul 16, 2013 at 10:00 AM, Rich Felker <dalias@aerifal.cx> wrote:
> 
> > The whole concept of regular expressions is that they're regular,
> > meaning they're matchable in O(n) time with O(1) space. PCRE (the
> > implementation) uses backtracking for everything, giving it
> > exponentially-bad performance (JIT cannot fix this), and PCRE (the
> > language) has a lot of features that are fundamentally not regular and
> > thus can't be implemented efficiently. Also, the behavior of some of
> > the features (e.g. greedy vs non-greedy matching) were not designed
> > intentionally but just arose out of the backtracking implementation,
> > and thus don't make a lot of sense unless you think from the
> > standpoint of such an implementation.
> >
> 
> Went back and rechecked the documentation (
> http://www.pcre.org/readme.txt).  You're both right, PCRE is offering
> the Perl regular expressions
> implementation even when one uses the pcreposix interface.  Would have been
> nice if they offered actual regular expressions handling if you only want
> to use the POSIX compatible part of the interface.
> 
> So what are some good regex library solutions?  I'm also wondering if there

POSIX systems provide the regcomp and regexec api as part of libc. On
musl, glibc, uclibc, and (afaik) all the bsd libcs, this gives you a
good implementation. However musl's is the only one I know of with no
conformance bugs, but the conformance bugs in the others are isolated
to obscure subexpression match issues.

> are some good cross-platform portable library solutions (or if PCRE is the
> best pattern matching solution from a portability standpoint even if it's
> not strictly regex compatible).

TRE, on which musl's is based, is the closest to being good, in my
opinion. It has optimal big-O but fairly suboptimal constant. However
it's unmaintained and has at least two bugs that aren't fixed
upstream, one of which may be exploitable if an untrusted party can
supply arbitrary regular expressions, and the other which just causes
some POSIX BRE expressions (the legacy style used by sed and non-e
grep) to be misinterpreted.

Rich


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex libs (was Re: [musl] embedded newbies site.)
  2013-07-16 15:10 regex libs (was Re: [musl] embedded newbies site.) LM
  2013-07-16 15:32 ` Rich Felker
@ 2013-07-16 15:41 ` Justin Cormack
  2013-07-16 16:55   ` Szabolcs Nagy
  2013-07-16 17:13 ` Strake
  2013-07-16 17:14 ` Kurt H Maier
  3 siblings, 1 reply; 8+ messages in thread
From: Justin Cormack @ 2013-07-16 15:41 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3254 bytes --]

On 16 Jul 2013 16:10, "LM" <lmemsm@gmail.com> wrote:
>
> On Tue, Jul 16, 2013 at 10:00 AM, Rich Felker <dalias@aerifal.cx> wrote:
>>
>> The whole concept of regular expressions is that they're regular,
>> meaning they're matchable in O(n) time with O(1) space. PCRE (the
>> implementation) uses backtracking for everything, giving it
>> exponentially-bad performance (JIT cannot fix this), and PCRE (the
>> language) has a lot of features that are fundamentally not regular and
>> thus can't be implemented efficiently. Also, the behavior of some of
>> the features (e.g. greedy vs non-greedy matching) were not designed
>> intentionally but just arose out of the backtracking implementation,
>> and thus don't make a lot of sense unless you think from the
>> standpoint of such an implementation.
>
>
> Went back and rechecked the documentation ( http://www.pcre.org/readme.txt).  You're both right, PCRE is offering the Perl regular expressions
implementation even when one uses the pcreposix interface.  Would have been
nice if they offered actual regular expressions handling if you only want
to use the POSIX compatible part of the interface.
>
> So what are some good regex library solutions?  I'm also wondering if
there are some good cross-platform portable library solutions (or if PCRE
is the best pattern matching solution from a portability standpoint even if
it's not strictly regex compatible).
>
> There's http://code.google.com/p/re2/ , but I've read some issues with
its performance in a few web articles and didn't have much luck with
portability to non-Linux platforms.  There's the glibc solution:
http://sourceforge.net/p/mingw/regex/ci/master/tree/  There's TRE (
https://github.com/laurikari/tre/), which some BSD systems want to use to
create their grep ( https://wiki.freebsd.org/BSDgrep ).  There's the
Oniguruma library ( http://www.geocities.jp/kosako3/oniguruma/ ). There's
Henry Spencer's regex ( http://www.arglist.com/regex ).  That looks
promising for portability.  There's http://re2c.org/ Further searching also
turns up http://tiny-rex.sourceforge.net/ which may have the same issues as
PCRE.  ICU seems to offer regex code (
http://userguide.icu-project.org/strings/regexp), probably same issue as
PCRE.  (Just my opinion, but ICU seems to do a lot of stuff for just one
library.)  There's BOOST's regex (
http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/ ) which a lot of web
sites recommend, but I've just never been a fan of the BOOST libraries.
The Heirloom Project has regex code ( http://heirloom.sourceforge.net/ ).
Have I missed any other interesting solutions?
>
> Sounds like I need to better clarify between regex pattern matching
libraries and pattern matching libraries on the musl wiki's alternative
library page.  If you recommend any of the above libraries or possibly
others and think certain implementations would be useful to others, let me
know and I'll add the links to the wiki.  I haven't really added anything
for pattern-matching libraries beyond benchmark information.

Lua includes a proper regex library. Lua is smaller than PCRE and you get a
whole programming language thrown in which shows how complex PCRE is...

Justin

> Thanks.
>
> Sincerely,
> Laura
> http://www.distasis.com
>

[-- Attachment #2: Type: text/html, Size: 4442 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex libs (was Re: [musl] embedded newbies site.)
  2013-07-16 15:41 ` Justin Cormack
@ 2013-07-16 16:55   ` Szabolcs Nagy
  0 siblings, 0 replies; 8+ messages in thread
From: Szabolcs Nagy @ 2013-07-16 16:55 UTC (permalink / raw)
  To: musl

* Justin Cormack <justin@specialbusservice.com> [2013-07-16 16:41:44 +0100]:
> Lua includes a proper regex library. Lua is smaller than PCRE and you get a
> whole programming language thrown in which shows how complex PCRE is...

no, lua uses backtracking matching just like most of the
other listed regex matchers
(it also has weird syntax that is gratuitously different
from the posix one)

you can easily try it yourself with the (a?)^n a^n pattern:

s1 = ''
s2 = ''
for i = 1,24 do
        s1 = s1 .. 'a?'
        s2 = s2 .. 'a'
end
print(string.match(s2, '^'..s1..s2..'$'))

24 is not even a big number but since the matching
is O(2^n) it quickly becomes unresponsive..
(it takes 5s here)

(in the main lua implementation there is a recursion
limit of 200 which is more than enough to render a lua
interpreter completely unresponsive if the regex comes
from untrusted input)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex libs (was Re: [musl] embedded newbies site.)
  2013-07-16 15:10 regex libs (was Re: [musl] embedded newbies site.) LM
  2013-07-16 15:32 ` Rich Felker
  2013-07-16 15:41 ` Justin Cormack
@ 2013-07-16 17:13 ` Strake
  2013-07-16 17:14 ` Kurt H Maier
  3 siblings, 0 replies; 8+ messages in thread
From: Strake @ 2013-07-16 17:13 UTC (permalink / raw)
  To: musl

On 16/07/2013, LM <lmemsm@gmail.com> wrote:
> Would have been
> nice if they offered actual regular expressions handling if you only want
> to use the POSIX compatible part of the interface.

Why? If one only needs the POSIX compatible interface, just use POSIX libc.

> So what are some good regex library solutions?  I'm also wondering if there
> are some good cross-platform portable library solutions (or if PCRE is the
> best pattern matching solution from a portability standpoint even if it's
> not strictly regex compatible).

This is no matter of compatibility; it's in general simply NOT a
regex. Actually, neither is POSIX Basic.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex libs (was Re: [musl] embedded newbies site.)
  2013-07-16 15:10 regex libs (was Re: [musl] embedded newbies site.) LM
                   ` (2 preceding siblings ...)
  2013-07-16 17:13 ` Strake
@ 2013-07-16 17:14 ` Kurt H Maier
  2013-07-16 17:38   ` Szabolcs Nagy
  3 siblings, 1 reply; 8+ messages in thread
From: Kurt H Maier @ 2013-07-16 17:14 UTC (permalink / raw)
  To: musl

On Tue, Jul 16, 2013 at 11:10:34AM -0400, LM wrote:
> 
> There's http://code.google.com/p/re2/ , but I've read some issues with its
> performance in a few web articles and didn't have much luck with
> portability to non-Linux platforms. 

Can you expand on this?  IIRC re2 is linux-focused from Google but Russ
Cox et al have implemented various similar things:
http://swtch.com/~rsc/regexp/ 

Also I believe libregex9 from plan9ports is available under an MIT-like
license.

How well does TRE deal with UTF-8?

khm


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex libs (was Re: [musl] embedded newbies site.)
  2013-07-16 17:14 ` Kurt H Maier
@ 2013-07-16 17:38   ` Szabolcs Nagy
  0 siblings, 0 replies; 8+ messages in thread
From: Szabolcs Nagy @ 2013-07-16 17:38 UTC (permalink / raw)
  To: musl

* Kurt H Maier <khm-lists@intma.in> [2013-07-16 13:14:08 -0400]:
> On Tue, Jul 16, 2013 at 11:10:34AM -0400, LM wrote:
> > 
> > There's http://code.google.com/p/re2/ , but I've read some issues with its
> > performance in a few web articles and didn't have much luck with
> > portability to non-Linux platforms. 
> 
> Can you expand on this?  IIRC re2 is linux-focused from Google but Russ

the main issue with re2 is c++

> How well does TRE deal with UTF-8?

it works with unicode codepoints (wchar_t)
so it works fine with utf8, but it needs
separate codepoint decoding and matching


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex libs (was Re: [musl] embedded newbies site.)
  2013-07-16 15:32 ` Rich Felker
@ 2013-07-17  5:38   ` Isaac
  0 siblings, 0 replies; 8+ messages in thread
From: Isaac @ 2013-07-17  5:38 UTC (permalink / raw)
  To: musl

On Tue, Jul 16, 2013 at 11:32:20AM -0400, Rich Felker wrote:
> On Tue, Jul 16, 2013 at 11:10:34AM -0400, LM wrote:
> > On Tue, Jul 16, 2013 at 10:00 AM, Rich Felker <dalias@aerifal.cx> wrote:
> > 
> > > The whole concept of regular expressions is that they're regular,
> > > meaning they're matchable in O(n) time with O(1) space. PCRE (the
> > > implementation) uses backtracking for everything, giving it
> > > exponentially-bad performance (JIT cannot fix this), and PCRE (the
> > > language) has a lot of features that are fundamentally not regular and
> > > thus can't be implemented efficiently. Also, the behavior of some of
> > > the features (e.g. greedy vs non-greedy matching) were not designed
> > > intentionally but just arose out of the backtracking implementation,
> > > and thus don't make a lot of sense unless you think from the
> > > standpoint of such an implementation.
> > >
> > 
> > Went back and rechecked the documentation (
> > http://www.pcre.org/readme.txt).  You're both right, PCRE is offering
> > the Perl regular expressions
> > implementation even when one uses the pcreposix interface.  Would have been
> > nice if they offered actual regular expressions handling if you only want
> > to use the POSIX compatible part of the interface.

The pcreposix stuff is only for those who want perl regexes with an existing
POSIX codebase.

> > So what are some good regex library solutions?  I'm also wondering if there
> 
> POSIX systems provide the regcomp and regexec api as part of libc. On
> musl, glibc, uclibc, and (afaik) all the bsd libcs, this gives you a
> good implementation. However musl's is the only one I know of with no
> conformance bugs, but the conformance bugs in the others are isolated
> to obscure subexpression match issues.
> 
> > are some good cross-platform portable library solutions (or if PCRE is the
> > best pattern matching solution from a portability standpoint even if it's
> > not strictly regex compatible).
> 
> TRE, on which musl's is based, is the closest to being good, in my
> opinion. It has optimal big-O but fairly suboptimal constant. However
> it's unmaintained and has at least two bugs that aren't fixed
> upstream, one of which may be exploitable if an untrusted party can
> supply arbitrary regular expressions, and the other which just causes
> some POSIX BRE expressions (the legacy style used by sed and non-e
> grep) to be misinterpreted.

Have these been mentioned to NetBSD? I ask because they use TRE for their libc
(the relicense happened at their request).

Thanks,
Isaac Dunham



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2013-07-17  5:38 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-16 15:10 regex libs (was Re: [musl] embedded newbies site.) LM
2013-07-16 15:32 ` Rich Felker
2013-07-17  5:38   ` Isaac
2013-07-16 15:41 ` Justin Cormack
2013-07-16 16:55   ` Szabolcs Nagy
2013-07-16 17:13 ` Strake
2013-07-16 17:14 ` Kurt H Maier
2013-07-16 17:38   ` Szabolcs Nagy

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).