From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id JAA32096; Wed, 7 Aug 2002 09:36:03 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id JAA32149 for ; Wed, 7 Aug 2002 09:36:02 +0200 (MET DST) Received: from shiva.jussieu.fr (shiva.jussieu.fr [134.157.0.129]) by concorde.inria.fr (8.11.1/8.11.1) with ESMTP id g777a2529671 for ; Wed, 7 Aug 2002 09:36:02 +0200 (MET DST) Received: from helium.pps.jussieu.fr (helium.pps.jussieu.fr [134.157.168.2]) by shiva.jussieu.fr (8.12.5/jtpda-5.4) with ESMTP id g777a1jR005655 ; Wed, 7 Aug 2002 09:36:01 +0200 (CEST) Received: from strontium.pps.jussieu.fr (mail@strontium.pps.jussieu.fr [134.157.168.38]) by helium.pps.jussieu.fr (8.11.6/jtpda-5.3.2) with ESMTP id g777a1O38643 ; Wed, 7 Aug 2002 09:36:01 +0200 (CEST) Received: from vouillon by strontium.pps.jussieu.fr with local (Exim 3.35 #1 (Debian)) id 17cLMP-0008FY-00; Wed, 07 Aug 2002 09:36:01 +0200 Date: Wed, 7 Aug 2002 09:36:00 +0200 To: John Skaller Cc: caml-list@inria.fr Subject: Re: [Caml-list] Regarding regular expressions Message-ID: <20020807073600.GA31615@strontium.pps.jussieu.fr> References: <3D50A9E8.9060605@ozemail.com.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3D50A9E8.9060605@ozemail.com.au> User-Agent: Mutt/1.4i From: Jerome Vouillon X-Antivirus: scanned by sophie at shiva.jussieu.fr Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk > For a language where backreferences are available, I thought it would be > interesting to see how often people make use of this feature. So I > downloaded the CPAN archive and analyzed the perl scripts there. Here > is what I found: > > - I found 165228 perl scripts/modules in CPAN [1]. > - Of those, 68501 use regular expressions [2]. > - Of those, 32359 (or 47 percent) use backreferences [3]. > > So, nearly half of all perl scripts on CPAN that use regular expressions > make use of the backreference feature. IMO this argues strongly in > favor of supporting backreferences in C++. (Backreferences can only be > handled by a backtracking NFA engine, IIRC.) What he means by backreference is a way to refer to a submatch. For instance, with the regular expression "^([^ ]*) *([^ ]*)", the backreference "$1" will refer to the substring matched by the first parenthesed subexpression "([^ ]*)". As long as the references do not occur in the regular expression itself, they can be handled perfectly well with a DFA engine. So, the numbers above do not prove anything. > There are other features besides backreferences that can only be > provided by a backtracking NFA. These features include non-greedy > quantification, positive and negative look-ahead and look-behind > assertions, independent sub-expressions, conditional sub-expressions, > and backreferences within the pattern itself. I believe all these features but backreferences within a pattern can be provided by a DFA engine (though my RE library only support one of these features, non-greedy quantification, at the moment). So, the real questions are: - how often are backreferences used within a pattern? - when they are used, is it just for convenience, or would it be hard to rewrite the pattern without using backreferences? [Feel free to forward this mail back to the C++ commitee.] -- Jerome ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners