From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3657 Path: news.gmane.org!not-for-mail From: Justin Cormack Newsgroups: gmane.linux.lib.musl.general Subject: Re: regex libs (was Re: [musl] embedded newbies site.) Date: Tue, 16 Jul 2013 16:41:44 +0100 Message-ID: References: Reply-To: musl@lists.openwall.com NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary=047d7b86e8c27134c104e1a2d025 X-Trace: ger.gmane.org 1373989316 10710 80.91.229.3 (16 Jul 2013 15:41:56 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Tue, 16 Jul 2013 15:41:56 +0000 (UTC) To: musl@lists.openwall.com Original-X-From: musl-return-3661-gllmg-musl=m.gmane.org@lists.openwall.com Tue Jul 16 17:42:00 2013 Return-path: Envelope-to: gllmg-musl@plane.gmane.org Original-Received: from mother.openwall.net ([195.42.179.200]) by plane.gmane.org with smtp (Exim 4.69) (envelope-from ) id 1Uz7O2-0005DX-Il for gllmg-musl@plane.gmane.org; Tue, 16 Jul 2013 17:41:58 +0200 Original-Received: (qmail 16062 invoked by uid 550); 16 Jul 2013 15:41:57 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: Original-Received: (qmail 16048 invoked from network); 16 Jul 2013 15:41:57 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=specialbusservice.com; s=google; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=GPrTcfACs9Wqdw3A+etsRb912fi+XCZw6gqfxEmtq/Q=; b=g0XjdF0NIleHKaCmdKb0KoCY2RXaC/D1R/4CK18Btpcy6+owrrT5ZU3r9DbwvJwKgV f8wVJftNv256qxTRDpg3jjUv3MlzMuKFE9ccSsqPt2rTQ+AhxrhF+oEt+K0GQKyd5LDi 9rjgdYTQIKhWbC/hFuDmTMArHU/Cpl8pRzBfk= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:x-gm-message-state; bh=GPrTcfACs9Wqdw3A+etsRb912fi+XCZw6gqfxEmtq/Q=; b=SVcvpZk4NGUqYfI8hmB2uiLMOg7P1i7ZJ3HMun/VQKlU4gKOJWg7K9VgOtqhrpKv8F qQDgCIg1Pgrb/NEBmPY/pSgahazSQnS4CifiN4vrte3QHpn0wvRrKnUA/fmgGdSsJB4q iwCsDuI2to1CFnRg1kWSTmnUT5Bb8eONfWY6DgWugKneJjaVQ5oYRC2Hu3wMVsVPbNX7 vnhqLh12wlPtUVJTWirk7dTl4kjaMJzakHq62JYU9Cc0ZeXQTQPfmPh4PQUePKUAmgJb FgdlyLkcSpu+VVpIypGErO/v+Ts5NJhn+PIVdYbc44t0vIP7HKFPCfcsnACQpgqh2QMP fE2w== X-Received: by 10.66.162.9 with SMTP id xw9mr3250876pab.64.1373989304867; Tue, 16 Jul 2013 08:41:44 -0700 (PDT) In-Reply-To: X-Gm-Message-State: ALoCoQltQZD5ozGnSUSHYz+OnsPwPiMliVD04NoKuNmH0ZmOrSw0/W8d9X/6smvqSrVvyvZZseqk Xref: news.gmane.org gmane.linux.lib.musl.general:3657 Archived-At: --047d7b86e8c27134c104e1a2d025 Content-Type: text/plain; charset=UTF-8 On 16 Jul 2013 16:10, "LM" wrote: > > On Tue, Jul 16, 2013 at 10:00 AM, Rich Felker wrote: >> >> The whole concept of regular expressions is that they're regular, >> meaning they're matchable in O(n) time with O(1) space. PCRE (the >> implementation) uses backtracking for everything, giving it >> exponentially-bad performance (JIT cannot fix this), and PCRE (the >> language) has a lot of features that are fundamentally not regular and >> thus can't be implemented efficiently. Also, the behavior of some of >> the features (e.g. greedy vs non-greedy matching) were not designed >> intentionally but just arose out of the backtracking implementation, >> and thus don't make a lot of sense unless you think from the >> standpoint of such an implementation. > > > Went back and rechecked the documentation ( http://www.pcre.org/readme.txt). You're both right, PCRE is offering the Perl regular expressions implementation even when one uses the pcreposix interface. Would have been nice if they offered actual regular expressions handling if you only want to use the POSIX compatible part of the interface. > > So what are some good regex library solutions? I'm also wondering if there are some good cross-platform portable library solutions (or if PCRE is the best pattern matching solution from a portability standpoint even if it's not strictly regex compatible). > > There's http://code.google.com/p/re2/ , but I've read some issues with its performance in a few web articles and didn't have much luck with portability to non-Linux platforms. There's the glibc solution: http://sourceforge.net/p/mingw/regex/ci/master/tree/ There's TRE ( https://github.com/laurikari/tre/), which some BSD systems want to use to create their grep ( https://wiki.freebsd.org/BSDgrep ). There's the Oniguruma library ( http://www.geocities.jp/kosako3/oniguruma/ ). There's Henry Spencer's regex ( http://www.arglist.com/regex ). That looks promising for portability. There's http://re2c.org/ Further searching also turns up http://tiny-rex.sourceforge.net/ which may have the same issues as PCRE. ICU seems to offer regex code ( http://userguide.icu-project.org/strings/regexp), probably same issue as PCRE. (Just my opinion, but ICU seems to do a lot of stuff for just one library.) There's BOOST's regex ( http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/ ) which a lot of web sites recommend, but I've just never been a fan of the BOOST libraries. The Heirloom Project has regex code ( http://heirloom.sourceforge.net/ ). Have I missed any other interesting solutions? > > Sounds like I need to better clarify between regex pattern matching libraries and pattern matching libraries on the musl wiki's alternative library page. If you recommend any of the above libraries or possibly others and think certain implementations would be useful to others, let me know and I'll add the links to the wiki. I haven't really added anything for pattern-matching libraries beyond benchmark information. Lua includes a proper regex library. Lua is smaller than PCRE and you get a whole programming language thrown in which shows how complex PCRE is... Justin > Thanks. > > Sincerely, > Laura > http://www.distasis.com > --047d7b86e8c27134c104e1a2d025 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable


On 16 Jul 2013 16:10, "LM" <lmemsm@gmail.com> wrote:
>
> On Tue, Jul 16, 2013 at 10:00 AM, Rich Felker <dalias@aerifal.cx> wrote:
>>
>> The whole concept of regular expressions is that they're regul= ar,
>> meaning they're matchable in O(n) time with O(1) space. PCRE (= the
>> implementation) uses backtracking for everything, giving it
>> exponentially-bad performance (JIT cannot fix this), and PCRE (the=
>> language) has a lot of features that are fundamentally not regular= and
>> thus can't be implemented efficiently. Also, the behavior of s= ome of
>> the features (e.g. greedy vs non-greedy matching) were not designe= d
>> intentionally but just arose out of the backtracking implementatio= n,
>> and thus don't make a lot of sense unless you think from the >> standpoint of such an implementation.
>
>
> Went back and rechecked the documentation ( http://www.pcre.org/readme.txt ).=C2=A0 You're bot= h right, PCRE is offering the Perl regular expressions implementation even = when one uses the pcreposix interface.=C2=A0 Would have been nice if they o= ffered actual regular expressions handling if you only want to use the POSI= X compatible part of the interface.
>
> So what are some good regex library solutions?=C2=A0 I'm also wond= ering if there are some good cross-platform portable library solutions (or = if PCRE is the best pattern matching solution from a portability standpoint= even if it's not strictly regex compatible).
>
> There's http://code.goog= le.com/p/re2/ , but I've read some issues with its performance in a= few web articles and didn't have much luck with portability to non-Lin= ux platforms.=C2=A0 There's the glibc solution:=C2=A0 http://sourceforge.net/p/mi= ngw/regex/ci/master/tree/=C2=A0 There's TRE ( https://github.com/laurikari/tre/), which some= BSD systems want to use to create their grep ( https://wiki.freebsd.org/BSDgrep ).=C2=A0 There's= the Oniguruma library ( http://www.geocities.jp/kosako3/oniguruma/ ). There's Henry Sp= encer's regex ( http://www.arg= list.com/regex ).=C2=A0 That looks promising for portability.=C2=A0 The= re's http://re2c.org/ Further searchin= g also turns up http://tiny-re= x.sourceforge.net/ which may have the same issues as PCRE.=C2=A0 ICU se= ems to offer regex code (http://userguide.icu-project.org/strings/regexp), probably sa= me issue as PCRE.=C2=A0 (Just my opinion, but ICU seems to do a lot of stuf= f for just one library.)=C2=A0 There's BOOST's regex ( http://www.boost.org/d= oc/libs/1_54_0/libs/regex/doc/ ) which a lot of web sites recommend, bu= t I've just never been a fan of the BOOST libraries.=C2=A0 The Heirloom= Project has regex code ( http= ://heirloom.sourceforge.net/ ).=C2=A0 Have I missed any other interesti= ng solutions?
>
> Sounds like I need to better clarify between regex pattern matching li= braries and pattern matching libraries on the musl wiki's alternative l= ibrary page.=C2=A0 If you recommend any of the above libraries or possibly = others and think certain implementations would be useful to others, let me = know and I'll add the links to the wiki.=C2=A0 I haven't really add= ed anything for pattern-matching libraries beyond benchmark information.

Lua includes a proper regex library. Lua is smaller than PCR= E and you get a whole programming language thrown in which shows how comple= x PCRE is...

Justin

> Thanks.
>
> Sincerely,
> Laura
> http://www.distasis.com
>

--047d7b86e8c27134c104e1a2d025--