From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id KAA01571 for caml-red; Fri, 8 Dec 2000 10:06:57 +0100 (MET)
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id SAA02779 for <caml-list@pauillac.inria.fr>; Thu, 7 Dec 2000 18:09:26 +0100 (MET)
Received: from localhost.localdomain (cartman118.zip.com.au [61.8.20.246])
	by concorde.inria.fr (8.11.1/8.10.0) with ESMTP id eB7H9L510384
	for <caml-list@inria.fr>; Thu, 7 Dec 2000 18:09:23 +0100 (MET)
Received: from ozemail.com.au (IDENT:root@localhost [127.0.0.1])
	by localhost.localdomain (8.9.3/8.8.7) with ESMTP id EAA24433;
	Fri, 8 Dec 2000 04:08:11 +1100
Message-ID: <3A2FC3FB.A0BB09DD@ozemail.com.au>
Date: Fri, 08 Dec 2000 04:08:11 +1100
From: John Max Skaller <skaller@ozemail.com.au>
X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.12-20 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: Markus Mottl <mottl@miss.wu-wien.ac.at>
CC: OCAML <caml-list@inria.fr>
Subject: Re: features of PCRE-OCaml
References: <20001206015139.D31140@miss.wu-wien.ac.at> <3A2FB459.416E1E05@ozemail.com.au> <20001207173228.B9463@miss.wu-wien.ac.at>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: weis@pauillac.inria.fr

Markus Mottl wrote:
> 
> On Fri, 08 Dec 2000, John Max Skaller wrote:
> > Funny. Python 1.5.2 used the _same_ C library by Philip Hazel. :-)
> > Given the fact this library builds DFA's instead of NFA's, Python
> > ought to be faster than Perl. :-)
> 
> Well, the matching engine is not everything... ;)

	It is for code doing extensive matching of long strings
against a single pattern: everything else should be dwarfed
by the match time.

> > Note also, Python 2.0 uses a modified library which does something
> > PCRE-OCaml cannot: it works with Unicode characters (supposedly).
> 
> To my knowledge, Phil Hazel is working on support for this. Unless the
> PCRE-library supports Unicode (and unless OCaml does ;), there is not
> much one can do about it...

	What? You mean it isn't generic enough to just change
'char' to 'short' and recompile?  [:-)]
 
> I am not sure whether it is really necessary to have a Str compatible
> interface: the regular expressions are already different so exchanging
> the old against the new library would break code anyway.

	If the expressions were translated?

	BTW: I think some of the features of the regex are
parochial and should be eliminated: support for case insensitive
matching, matching 'words', etc. should be dropped. Such things
might make sense in English, but are much too hard to build
into a regexp facility correctly for internationalised text.
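
	To illustrate why case insensitivity is awkward outside English,
here is a small sketch. It uses only Python 3's built-in string methods
(an assumption made for illustration; it says nothing about how PCRE or
Str behave), and the sample words are my own.

```python
# Illustrative only: Python 3 built-in str methods, not PCRE or Str.
german = "stra\u00dfe"            # "strasse" spelled with sharp s (U+00DF)
upper = german.upper()            # sharp s uppercases to two letters
print(upper, len(german), len(upper))

# Simple lowercasing does not make the two spellings compare equal...
simple_equal = german.lower() == upper.lower()
# ...whereas full Unicode case folding does.
folded_equal = german.casefold() == upper.casefold()

# Turkish dotless i (U+0131) uppercases to plain ASCII 'I', so a round
# trip through upper() and lower() silently turns it into dotted 'i'.
dotless_roundtrip = "\u0131".upper().lower()
print(simple_equal, folded_equal, dotless_roundtrip)
```

A matcher that compares one character at a time against a simple
upper/lower table cannot get the first case right, because the two
forms do not even have the same length.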

	By the way, how big can the DFA tables get?
Does it eliminate duplicate columns? 
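
	By "eliminate duplicate columns" I mean something like the
following sketch: input symbols whose columns in the transition table
are identical are merged into one class, and the table is indexed by
class instead of by symbol. The function name compress_columns and the
toy three-state automaton are hypothetical, made up for illustration;
this does not show what PCRE or any particular lexer generator does.

```python
def compress_columns(table):
    """Merge identical columns of a DFA transition table.

    table[state][symbol] is the next state. Returns (class_of, small)
    such that small[state][class_of[symbol]] == table[state][symbol].
    """
    n_symbols = len(table[0]) if table else 0
    class_ids = {}        # column contents -> class number
    class_of = []         # symbol -> class number
    for sym in range(n_symbols):
        column = tuple(row[sym] for row in table)
        class_of.append(class_ids.setdefault(column, len(class_ids)))
    # Remember one representative symbol for each class.
    representative = {}
    for sym, cls in enumerate(class_of):
        representative.setdefault(cls, sym)
    small = [[row[representative[cls]] for cls in range(len(class_ids))]
             for row in table]
    return class_of, small


# Toy automaton over bytes: state 0 = start, 1 = inside [a-z][a-z0-9]*,
# 2 = dead state.
DEAD = 2
table = [[DEAD] * 256 for _ in range(3)]
for b in range(256):
    ch = chr(b)
    if "a" <= ch <= "z":
        table[0][b] = 1
        table[1][b] = 1
    elif "0" <= ch <= "9":
        table[1][b] = 1

class_of, small = compress_columns(table)
print(len(table[0]), "columns reduced to", len(small[0]))
```

In this toy case 256 columns collapse to three classes (letters, digits,
everything else), at the cost of one extra lookup per input symbol.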

	[ocamllex cannot support large enough tables for matching
ISO-10646 identifiers, when encoded using UTF-8. This is a real pain,
since all my languages specify UTF-8 encoded ISO-10646: I have to 
cheat, and assume 'almost everything' is a suitable character to
put in an identifier, and then check it afterwards. This makes it
hard to use special symbols as tokens. I'm not sure why
this is, but I guess it doesn't eliminate duplicate columns?]
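
	The "cheat" described above can be sketched roughly as follows.
The lexer pattern accepts any run of ASCII word characters or non-ASCII
characters, and a separate check then decides whether the run is a real
identifier. As a loud assumption, Python's str.isidentifier() stands in
for the ISO-10646 identifier rules here; CANDIDATE and lex_identifiers
are hypothetical names, and this is not my actual lexer.

```python
import re

# Permissive pattern: any run of ASCII letters, digits, underscore, or
# any non-ASCII character. It deliberately over-accepts.
CANDIDATE = re.compile(r"[A-Za-z0-9_\u0080-\U0010FFFF]+")


def lex_identifiers(utf8_bytes):
    """Return (lexeme, is_valid_identifier) pairs for each candidate run."""
    text = utf8_bytes.decode("utf-8")
    results = []
    for match in CANDIDATE.finditer(text):
        word = match.group()
        # Validation happens after lexing. isidentifier() is a stand-in
        # for the real identifier rules (an assumption).
        results.append((word, word.isidentifier()))
    return results


# "naive" with a diaeresis, an equals sign, x, a right arrow, then 2y.
sample = "na\u00efve = x \u2192 2y".encode("utf-8")
for lexeme, ok in lex_identifiers(sample):
    print(repr(lexeme), ok)
```

Note how the arrow is swallowed as an identifier candidate and only
rejected afterwards: that is exactly why special symbols are hard to
use as tokens under this scheme.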

-- 
John (Max) Skaller, mailto:skaller@maxtal.com.au
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper http://Vyper.sourceforge.net
download Interscript http://Interscript.sourceforge.net