From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (from weis@localhost)
	by pauillac.inria.fr (8.7.6/8.7.3) id KAA01571
	for caml-red; Fri, 8 Dec 2000 10:06:57 +0100 (MET)
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39])
	by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id SAA02779
	for ; Thu, 7 Dec 2000 18:09:26 +0100 (MET)
Received: from localhost.localdomain (cartman118.zip.com.au [61.8.20.246])
	by concorde.inria.fr (8.11.1/8.10.0) with ESMTP id eB7H9L510384
	for ; Thu, 7 Dec 2000 18:09:23 +0100 (MET)
Received: from ozemail.com.au (IDENT:root@localhost [127.0.0.1])
	by localhost.localdomain (8.9.3/8.8.7) with ESMTP id EAA24433;
	Fri, 8 Dec 2000 04:08:11 +1100
Message-ID: <3A2FC3FB.A0BB09DD@ozemail.com.au>
Date: Fri, 08 Dec 2000 04:08:11 +1100
From: John Max Skaller
X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.12-20 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: Markus Mottl
CC: OCAML
Subject: Re: features of PCRE-OCaml
References: <20001206015139.D31140@miss.wu-wien.ac.at>
	<3A2FB459.416E1E05@ozemail.com.au>
	<20001207173228.B9463@miss.wu-wien.ac.at>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: weis@pauillac.inria.fr

Markus Mottl wrote:
>
> On Fri, 08 Dec 2000, John Max Skaller wrote:
> > Funny. Python 1.5.2 used the _same_ C library by Philip Hazel. :-)
> > Given the fact this library builds DFA's instead of NFA's, Python
> > ought to be faster than Perl. :-)
>
> Well, the matching engine is not everything... ;)

It is for code doing extensive matching of long strings against a
single pattern: everything else should be dwarfed by the match time.

> > Note also, Python 2.0 uses a modified library which does something
> > PCRE-OCaml cannot: it works with Unicode characters (supposedly).
>
> To my knowledge, Phil Hazel is working on support for this. Unless the
> PCRE-library supports Unicode (and unless OCaml does ;), there is not
> much one can do about it...

What? You mean it isn't generic enough to just change 'char' to
'short' and recompile? [:-)]

> I am not sure whether it is really necessary to have a Str compatible
> interface: the regular expressions are already different so exchanging
> the old against the new library would break code anyway.

If the expressions were translated?

BTW: I think some of the features of the regex are parochial, and
should be eliminated: support for case insensitive matching, matching
'words', etc. should be dropped. Such things might make sense in
English, but are much too hard to build into a regexp facility
correctly for internationalised text.

By the way, how big can the DFA tables get? Does it eliminate
duplicate columns?

[Ocaml lex cannot support large enough tables for matching ISO-10646
identifiers, when encoded using UTF-8. This is a real pain, since all
my languages specify UTF-8 encoded ISO-10646: I have to cheat, and
assume 'almost everything' is a suitable character to put in an
identifier, and then check it afterwards. This makes it hard to use
special symbols as tokens. I'm not sure why this is, but I guess it
doesn't eliminate duplicate columns?]

-- 
John (Max) Skaller, mailto:skaller@maxtal.com.au
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper http://Vyper.sourceforge.net
download Interscript http://Interscript.sourceforge.net
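
For readers unfamiliar with the "duplicate columns" question above, here is a minimal sketch of what eliminating them could look like. It assumes a dense transition table indexed as table.(state).(symbol). The names compress_columns, class_of and compact are hypothetical, and nothing here is claimed to be what PCRE or ocamllex actually does.

```ocaml
(* Merge identical columns of a dense DFA transition table.
   table.(s).(c) is the next state from state s on input symbol c.
   Returns (class_of, compact) where class_of maps each symbol to a
   column class, and compact.(s).(class_of.(c)) = table.(s).(c). *)
let compress_columns (table : int array array) =
  let n_states = Array.length table in
  let n_syms = if n_states = 0 then 0 else Array.length table.(0) in
  let seen = Hashtbl.create 16 in        (* column contents -> class index *)
  let class_of = Array.make n_syms 0 in  (* symbol -> class index *)
  let columns = ref [] in                (* distinct columns, newest first *)
  for c = 0 to n_syms - 1 do
    let col = Array.init n_states (fun s -> table.(s).(c)) in
    match Hashtbl.find_opt seen col with
    | Some k -> class_of.(c) <- k
    | None ->
      let k = Hashtbl.length seen in
      Hashtbl.add seen col k;
      columns := col :: !columns;
      class_of.(c) <- k
  done;
  let cols = Array.of_list (List.rev !columns) in
  let compact =
    Array.init n_states (fun s -> Array.map (fun col -> col.(s)) cols) in
  (class_of, compact)

(* Toy table: 2 states, 4 symbols; symbols 0, 1 and 3 behave identically. *)
let example_table = [| [| 1; 1; 0; 1 |]; [| 0; 0; 1; 0 |] |]
let (example_classes, example_compact) = compress_columns example_table
```

The matcher then pays one extra lookup per input symbol (through class_of) in exchange for a table whose width is the number of distinct columns rather than the size of the alphabet, which is why the question matters for large character sets.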