From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Mon, 27 Oct 2008 13:13:57 +0000 From: Eris Discordia To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net> Message-ID: <7F426112C3E059066F0024AE@[192.168.1.2]> In-Reply-To: <45432371edb2231a8b90975a0efdf7bb@terzarima.net> References: <45432371edb2231a8b90975a0efdf7bb@terzarima.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Subject: Re: [9fans] non greedy regular expressions Topicbox-Message-UUID: 27f10c26-ead4-11e9-9d60-3106f5b1d025 > practical application. now there are big books on `regular expressions' > mainly because they are no longer regular but a big collection of ad-hoc I thought they were "regular" because they "regularly" occurred in the=20 target text. Turns out other interpretations are possible. Though, mine has = the advantage of being correct :-D The set of "big books on regular expressions" includes Jeffrey Friedl's=20 "Mastering Regular Expressions" that happens to contain a chapter by the=20 title "NFA, DFA, and POSIX" wherein he says: > DFA Speed with NFA Capabilities: Regex Nirvana? > > I've said several times that a DFA can't provide capturing parentheses or > backreferences. This is quite true, but it certainly doesn't preclude > hybrid approaches that mix technologies in an attempt to reach regex > nirvana. The sidebar in Section 4.6.3.1 told how NFAs have diverged from > the theoretical straight and narrow in search of more power, and it's > only natural that the same happens with DFAs. A DFA's construction makes > it more difficult, but that doesn't mean impossible. > > GNU grep takes a simple but effective approach. It uses a DFA when > possible, reverting to an NFA when backreferences are used. GNU awk does > something similar=E2=80=94it uses GNU grep's fast shortest-leftmost DFA = engine > for simple "does it match" checks, and reverts to a different engine for > checks where the actual extent of the match must be known. Since that > other engine is an NFA, GNU awk can conveniently offer capturing > parentheses, and it does via its special gensub function. > > Tcl's regex engine is a true hybrid, custom built by Henry Spencer (whom > you may remember having played an important part in the early development > and popularization of regular expressions see Section 3.1.1.6). The Tcl > engine sometimes appears to be an NFA=E2=80=94 it has lookaround, = capturing > parentheses, backreferences, and lazy quantifiers. Yet, it has true POSIX > longest-leftmost match (see Section 4.6.1), and doesn't suffer from some > of the NFA problems that we'll see in Chapter 6. It really seems quite > wonderful. > > Currently, this engine is available only to Tcl, but Henry tells me that > it's on his to-do list to break it out into a separate package that can > be used by others. Again, turns out the "big books on regular expressions" can give the=20 lowlife--that's me--things "hackers" deny them. --On Monday, October 27, 2008 10:18 AM +0000 Charles Forsyth=20 wrote: >> Both systems are complex enough that essentially no one completely >> understands them. > > this touches on an important point. the first introduction of regular > expressions to editors was great, because it took some formal language > theory and made it useful in an `every day' way. conversely, the theory > provided significant underpinning (in a structural sense) to that > practical application. now there are big books on `regular expressions' > mainly because they are no longer regular but a big collection of ad-hoc > additions. (i think there is a similarity to Perl's relationship to the > history of programming languages.) the result really is hard to reason > about ... because it hasn't got that underpinning and the algebra has > gone. it's worse than that: the code is a mess too, i supposed in > consequence. >