From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id KAA10149 for caml-red; Fri, 8 Dec 2000 10:23:53 +0100 (MET)
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id KAA08786 for <caml-list@pauillac.inria.fr>; Fri, 8 Dec 2000 10:19:53 +0100 (MET)
Received: from nef.ens.fr (nef.ens.fr [129.199.96.32])
	by concorde.inria.fr (8.11.1/8.10.0) with ESMTP id eB89Jqr00124
	for <caml-list@inria.fr>; Fri, 8 Dec 2000 10:19:53 +0100 (MET)
Received: from clipper.ens.fr (clipper-gw.ens.fr [129.199.1.22])
          by nef.ens.fr (8.10.1/1.01.28121999) with ESMTP id eB89JpN07538
          ; Fri, 8 Dec 2000 10:19:51 +0100 (CET)
Received: from localhost (frisch@localhost) by clipper.ens.fr (8.9.2/jb-1.1)
	id KAA04331 ; Fri, 8 Dec 2000 10:19:51 +0100 (MET)
Date: Fri, 8 Dec 2000 10:19:51 +0100 (MET)
From: Alain Frisch <frisch@clipper.ens.fr>
To: John Max Skaller <skaller@ozemail.com.au>
cc: OCAML <caml-list@inria.fr>
Subject: Re: features of PCRE-OCaml
In-Reply-To: <3A2FC3FB.A0BB09DD@ozemail.com.au>
Message-ID: <Pine.GSO.4.04.10012081013430.1583-100000@clipper.ens.fr>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: weis@pauillac.inria.fr

On Fri, 8 Dec 2000, John Max Skaller wrote:

> 	[Ocaml lex cannot support large enough tables for matching
> ISO-10646 identifiers, when encoded using UTF-8. This is a real pain,
> since all my languages specify UTF-8 encoded ISO-10646: I have to 
> cheat, and assume 'almost everything' is a suitable character to
> put in an identifier, and then check it afterwards. This makes it
> hard to use special symbols as tokens. I'm not sure why
> this is, but I guess it doesn't eliminate duplicate columns?]

Have a look at wlex:
http://www.eleves.ens.fr:8080/home/frisch/soft
http://www.eleves.ens.fr:8080/home/frisch/info/wlex-20001006.tar.gz


<< This package consists of a lexer generator and the associated runtime
system. The new lexing model adds a "classification" layer between the
lexbuf and the lexer itself. This layer classifies characters from the
lexbuf into a small number of classes, on which the regexps in the lexer
specification are built. 

 This reduces the number of states and transitions in the automaton,
especially when working with large encodings such as UTF-8 (the primary
motivation for wlex).  >>
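
To make the classification idea concrete, here is a rough sketch in plain
OCaml. It is not wlex's actual code or interface: the names cls, classify,
decode and ident_length are made up for illustration, and the character
ranges are a toy approximation rather than the real ISO-10646 identifier
tables. The point is only that the automaton is written over four classes
instead of over every UTF-8 byte sequence.

```ocaml
(* Hypothetical sketch of a classification layer; NOT wlex's real API. *)

(* A handful of classes stands in for the whole code-point alphabet. *)
type cls = Letter | Digit | Space | Other

(* Classification layer: map a code point to its class.
   ASSUMPTION: these ranges are illustrative only. *)
let classify (cp : int) : cls =
  if (cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A) || cp = 0x5F
     || (cp >= 0xC0 && cp <= 0x24F)    (* Latin ranges, approximate *)
     || (cp >= 0x391 && cp <= 0x3C9)   (* Greek range, approximate *)
  then Letter
  else if cp >= 0x30 && cp <= 0x39 then Digit
  else if cp = 0x20 || cp = 0x09 || cp = 0x0A then Space
  else Other

(* Decode the UTF-8 sequence starting at byte [i]: (code point, byte length).
   Minimal decoder: assumes well-formed input and does no validation. *)
let decode (s : string) (i : int) : int * int =
  let b k = Char.code s.[i + k] in
  let c = b 0 in
  if c < 0x80 then (c, 1)
  else if c < 0xE0 then (((c land 0x1F) lsl 6) lor (b 1 land 0x3F), 2)
  else if c < 0xF0 then
    (((c land 0x0F) lsl 12) lor ((b 1 land 0x3F) lsl 6) lor (b 2 land 0x3F), 3)
  else
    (((c land 0x07) lsl 18) lor ((b 1 land 0x3F) lsl 12)
     lor ((b 2 land 0x3F) lsl 6) lor (b 3 land 0x3F), 4)

(* Automaton for  Letter (Letter | Digit)*  expressed over classes.
   Returns the byte length of the longest identifier starting at [start],
   or 0 if there is none. *)
let ident_length (s : string) (start : int) : int =
  let n = String.length s in
  let rec loop pos first =
    if pos >= n then pos - start
    else
      let cp, len = decode s pos in
      match classify cp, first with
      | Letter, _ | Digit, false -> loop (pos + len) false
      | _ -> pos - start
  in
  loop start true
```

For example, ident_length "h\xC3\xA9llo2 x" 0 scans "héllo2" and stops at
the space. Supporting another encoding would only mean swapping decode; the
class-level automaton stays the same.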

The development release of pxp may use wlex (same lexer for different
encodings: UTF-8, Latin-1).

wlex is distributed as a patch to ocamllex.


-- 
  Alain Frisch