From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pierre.weis@inria.fr>
Delivered-To: caml-list@yquem.inria.fr
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39])
	by yquem.inria.fr (Postfix) with ESMTP id 5676ABB91
	for <caml-list@yquem.inria.fr>; Thu, 27 Jan 2005 17:56:00 +0100 (CET)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id j0RGtxHa023120
	for <caml-list@yquem.inria.fr>; Thu, 27 Jan 2005 17:55:59 +0100
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id RAA24774 for <caml-list@pauillac.inria.fr>; Thu, 27 Jan 2005 17:55:59 +0100 (MET)
Received: from smtp1.adl2.internode.on.net (smtp1.adl2.internode.on.net [203.16.214.181])
	by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id j0RGtvD6023115
	for <caml-list@inria.fr>; Thu, 27 Jan 2005 17:55:58 +0100
Received: from [192.168.1.200] (ppp205-248.lns1.syd3.internode.on.net [203.122.205.248])
	by smtp1.adl2.internode.on.net (8.12.9/8.12.9) with ESMTP id j0RGtaTX015006;
	Fri, 28 Jan 2005 03:25:43 +1030 (CST)
Subject: Re: [Caml-list] index of substring
From: skaller <skaller@users.sourceforge.net>
Reply-To: skaller@users.sourceforge.net
To: Jean-Christophe Filliatre <Jean-Christophe.Filliatre@lri.fr>
Cc: Radu Grigore <radugrigore@gmail.com>,
	"yjc01@doc.ic.ac.uk" <yjc01@doc.ic.ac.uk>,
	caml-list <caml-list@inria.fr>
In-Reply-To: <16888.49247.996632.225091@gargle.gargle.HOWL>
References: <1106786478.41f838aecb7dd@www.doc.ic.ac.uk>
	 <7f8e92aa05012622441fce509b@mail.gmail.com>
	 <1106818582.6734.154.camel@pelican.wigram>
	 <16888.49247.996632.225091@gargle.gargle.HOWL>
Content-Type: text/plain
Message-Id: <1106844935.12114.62.camel@pelican.wigram>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.2 (1.2.2-4) 
Date: 28 Jan 2005 03:55:36 +1100
Content-Transfer-Encoding: 7bit
X-Miltered: at concorde with ID 41F91D1F.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Miltered: at concorde with ID 41F91D1D.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; caml-list:01 sourceforge:01 filliatre:01 wrote:01 ocamllex:01 lexbuf:01 lexbuf:01 lexer:01 lexing:01 recursion:01 lexer:01 fallbacks:01 toying:01 token:01 iterator:01 
X-Spam-Checker-Version: SpamAssassin 3.0.0 (2004-09-13) on yquem.inria.fr
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.0.0
X-Spam-Level: 

On Thu, 2005-01-27 at 21:20, Jean-Christophe Filliatre wrote:
> 
> What's wrong with this way of handling nested comments with ocamllex:
> 
> ======================================================================
> 
> | "(*" { comment lexbuf; ... }
> ...
> 
> and comment = parse
> | "(*" { comment lexbuf; comment lexbuf }
> | "*)" { () }
> | _    { comment lexbuf }
> 

Well it doesn't handle (* or *) in strings ..

However, whilst this code (given a fix
to handle strings) would probably work,
it isn't a plain lexer, but a mix of lexing
and client code. 

In particular the recursion only works 'by luck of 
the implementation' that allows sharing of a lexbuf
between two lexer functions -- who knows how
the lexer handles fallbacks?

I have actually being toying with an implementation
that does a bit of 'Boyer/Moore' processing
to reduce or eliminate the need to rescan characters.

The idea is that whenever to reach an accepting
state, you create a new automaton, and begin
scanning from the start state. When you finally
fail you can fallback to the last success,
but to lex the next token you do not need
to rescan from the start state with the next
character, since you have already done that.

Although this idea may actually lead to faster
processing sometimes, a big advantage is that
it works with input iterators, whereas fallback
and rescan requires a forward iterator. 
An input iterator can be converted to a forward
one, but only with a potentially infinite buffer.

In theory this means that whilst a deterministic
finite state automaton can *recognize* a string
with finite memory in linear time, tokenisation
is in fact NOT finite state at all, but requires
linear state -- which is a major disadvantage.

By actually merging the automata described above,
it may be possible for the 'regexp compiler' to
bound the extra state required for some
regexps, obtaining finite state tokenisers,
whereas at present lexers are not.

Anyhow, the point is that with such an augmented
lexer it is not clear 'client code' would be allowed
to call another (or the same) lexer with the same
lexbuf.

[In other words, your code is not transparent.]

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net