Date: Fri, 8 Dec 2000 10:19:51 +0100 (MET)
From: Alain Frisch
To: John Max Skaller
cc: OCAML
Subject: Re: features of PCRE-OCaml
In-Reply-To: <3A2FC3FB.A0BB09DD@ozemail.com.au>

On Fri, 8 Dec 2000, John Max Skaller wrote:

> [Ocaml lex cannot support large enough tables for matching
> ISO-10646 identifiers, when encoded using UTF-8. This is a real pain,
> since all my languages specify UTF-8 encoded ISO-10646: I have to
> cheat, and assume 'almost everything' is a suitable character to
> put in an identifier, and then check it afterwards. This makes it
> hard to use special symbols as tokens. I'm not sure why
> this is, but I guess it doesn't eliminate duplicate columns?]

Have a look at wlex:

http://www.eleves.ens.fr:8080/home/frisch/soft
http://www.eleves.ens.fr:8080/home/frisch/info/wlex-20001006.tar.gz

<<
This package consists of a lexer generator and the associated runtime
system. The new lexing model adds a "classification" layer between the
lexbuf and the lexer itself.
This layer classifies characters from the lexbuf into a small number of
classes, on which the regexps in the lexer specification are built. This
reduces the number of states and transitions in the automaton, especially
when working with large encodings such as UTF-8 (the primary motivation
for wlex).
>>

The development release of pxp may use wlex (same lexer for different
encodings: UTF-8, Latin-1).

wlex is distributed as a patch to ocamllex.

-- 
Alain Frisch
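
The classification idea described above can be illustrated with a minimal OCaml sketch. This is not wlex's code or API: the type `char_class`, the functions `classify`, `decode_utf8` and `is_identifier`, and the character ranges are all hypothetical, chosen only to show how an automaton can run over a handful of classes instead of raw UTF-8 bytes. The decoder assumes well-formed UTF-8 input.

```ocaml
(* Hypothetical character classes; the classes wlex actually uses are
   not given in the message. *)
type char_class = Letter | Digit | Space | Other

(* Map a Unicode code point to a class. The ranges are illustrative
   only (ASCII letters plus part of the Latin blocks). *)
let classify (cp : int) : char_class =
  if (cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A)
     || (cp >= 0xC0 && cp <= 0x24F && cp <> 0xD7 && cp <> 0xF7)
  then Letter
  else if cp >= 0x30 && cp <= 0x39 then Digit
  else if cp = 0x20 || cp = 0x09 || cp = 0x0A || cp = 0x0D then Space
  else Other

(* Decode one UTF-8 sequence starting at byte index i.
   Returns (code point, index of the next sequence).
   Assumption: the input is well-formed UTF-8. *)
let decode_utf8 (s : string) (i : int) : int * int =
  let b k = Char.code s.[i + k] in
  let c = b 0 in
  if c < 0x80 then (c, i + 1)
  else if c < 0xE0 then
    (((c land 0x1F) lsl 6) lor (b 1 land 0x3F), i + 2)
  else if c < 0xF0 then
    (((c land 0x0F) lsl 12) lor ((b 1 land 0x3F) lsl 6)
     lor (b 2 land 0x3F), i + 3)
  else
    (((c land 0x07) lsl 18) lor ((b 1 land 0x3F) lsl 12)
     lor ((b 2 land 0x3F) lsl 6) lor (b 3 land 0x3F), i + 4)

(* Recognize Letter (Letter | Digit)*. The transitions are defined on
   classes, so the automaton needs only a few columns however large
   the underlying character set is. *)
let is_identifier (s : string) : bool =
  let n = String.length s in
  let rec loop i seen_first =
    if i >= n then seen_first
    else
      let cp, j = decode_utf8 s i in
      match classify cp, seen_first with
      | Letter, _ -> loop j true
      | Digit, true -> loop j true
      | _ -> false
  in
  loop 0 false
```

With this split, supporting another encoding such as Latin-1 would only mean swapping the decoding step; the class-based recognizer stays the same.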