From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id HAA30169; Wed, 7 Aug 2002 07:02:58 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id HAA30142 for ; Wed, 7 Aug 2002 07:02:57 +0200 (MET DST) Received: from mail3.tpgi.com.au (mail.tpgi.com.au [203.12.160.59]) by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id g7752s520495 for ; Wed, 7 Aug 2002 07:02:56 +0200 (MET DST) Received: from ozemail.com.au (syd-ts13-2600-038.tpgi.com.au [203.213.81.38]) (authenticated (0 bits)) by mail3.tpgi.com.au (8.11.6/8.11.6) with ESMTP id g7751ZF16794 for ; Wed, 7 Aug 2002 15:01:35 +1000 Message-ID: <3D50A9E8.9060605@ozemail.com.au> Date: Wed, 07 Aug 2002 15:02:32 +1000 From: John Skaller User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901 X-Accept-Language: en-us MIME-Version: 1.0 To: caml-list@inria.fr Subject: [Caml-list] Regarding regular expressions Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk Below is some analysis of use of backreferences in Perl, done by Eric Niebler for the C++ committee. This post is stolen from the library subgroup committee reflector. The C++ committee is looking at which kind of regular expresssions to provide as standard in the C++ Standard Library. Thought I'd post it here in case it's useful to the Ocaml team which is also looking at RE libraries. ------------------------------------------------------------------------ OK, I have some numbers. As a refresher, POSIX-extended doesn't do backreferences, but guarantees complexity O(SN) where S is the number of states in the NFA, and N is the number of characters in the input string. ECMAScript does backreferences, but worst-case complexity is O(2^N). For a language where backreferences are available, I thought it would be interesting to see how often people make use of this feature. So I downloaded the CPAN archive and analyzed the perl scripts there. Here is what I found: - I found 165228 perl scripts/modules in CPAN [1]. - Of those, 68501 use regular expressions [2]. - Of those, 32359 (or 47 percent) use backreferences [3]. So, nearly half of all perl scripts on CPAN that use regular expressions make use of the backreference feature. IMO this argues strongly in favor of supporting backreferences in C++. (Backreferences can only be handled by a backtracking NFA engine, IIRC.) There are other features besides backreferences that can only be provided by a backtracking NFA. These features include non-greedy quantification, positive and negative look-ahead and look-behind assertions, independent sub-expressions, conditional sub-expressions, and backreferences within the pattern itself. I did a separate analysis to see how often these features were used. Here's what I found: - Of the 68501 scripts that use regular expressions, 12199 (or 18 percent) use at least one of the features listed above [4]. If I count the files that use either backreferences or one of these other features, I find that: - 33881 files, or 49.5 percent, use some feature that requires a backtracking NFA engine. To get some idea of how many false-positives were turning up in my tests, I ran the same search against the files which do not use regular expressions at all. I found that of the 96727 scripts that do *not* use regular expressions [2], 2300 tested positive for a backtracking regex feature. That amounts to a 2% false positive rate. So it would be fair to knock 2% off the percentages quoted above. Eric [1] - I was only looking at files with .pm and .pl extensions, after extracting all archives with .tar.gz, .tgz, and .zip extensions. [2] - A perl file was determined to use regular expressions if it matched the pattern, "[=!]~ *[^ty ]". That is, if it used the operators =~ or !~ for matching or substituting (i.e. not character translation). [3] - A file was determined to use backreferences if it contained the pattern, "\$[1-9][^0-9]". [4] - A file was included in this total if it matched one of the following patterns: \(\?= // a positive look-ahead assertion \(\?! // a negative look-ahead assertion [0-9]\}\? // a non-greedy quantifier [+*]\? // also a non-greedy quantifier \(\?> // an independent sub-expression \(\?<= // a positive look-behind assertion \(\?