From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Delivered-To: caml-list@yquem.inria.fr Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id 5676ABB91 for ; Thu, 27 Jan 2005 17:56:00 +0100 (CET) Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id j0RGtxHa023120 for ; Thu, 27 Jan 2005 17:55:59 +0100 Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id RAA24774 for ; Thu, 27 Jan 2005 17:55:59 +0100 (MET) Received: from smtp1.adl2.internode.on.net (smtp1.adl2.internode.on.net [203.16.214.181]) by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id j0RGtvD6023115 for ; Thu, 27 Jan 2005 17:55:58 +0100 Received: from [192.168.1.200] (ppp205-248.lns1.syd3.internode.on.net [203.122.205.248]) by smtp1.adl2.internode.on.net (8.12.9/8.12.9) with ESMTP id j0RGtaTX015006; Fri, 28 Jan 2005 03:25:43 +1030 (CST) Subject: Re: [Caml-list] index of substring From: skaller Reply-To: skaller@users.sourceforge.net To: Jean-Christophe Filliatre Cc: Radu Grigore , "yjc01@doc.ic.ac.uk" , caml-list In-Reply-To: <16888.49247.996632.225091@gargle.gargle.HOWL> References: <1106786478.41f838aecb7dd@www.doc.ic.ac.uk> <7f8e92aa05012622441fce509b@mail.gmail.com> <1106818582.6734.154.camel@pelican.wigram> <16888.49247.996632.225091@gargle.gargle.HOWL> Content-Type: text/plain Message-Id: <1106844935.12114.62.camel@pelican.wigram> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 (1.2.2-4) Date: 28 Jan 2005 03:55:36 +1100 Content-Transfer-Encoding: 7bit X-Miltered: at concorde with ID 41F91D1F.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Miltered: at concorde with ID 41F91D1D.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; caml-list:01 sourceforge:01 filliatre:01 wrote:01 ocamllex:01 lexbuf:01 lexbuf:01 lexer:01 lexing:01 recursion:01 lexer:01 fallbacks:01 toying:01 token:01 iterator:01 X-Spam-Checker-Version: SpamAssassin 3.0.0 (2004-09-13) on yquem.inria.fr X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.0.0 X-Spam-Level: On Thu, 2005-01-27 at 21:20, Jean-Christophe Filliatre wrote: > > What's wrong with this way of handling nested comments with ocamllex: > > ====================================================================== > > | "(*" { comment lexbuf; ... } > ... > > and comment = parse > | "(*" { comment lexbuf; comment lexbuf } > | "*)" { () } > | _ { comment lexbuf } > Well it doesn't handle (* or *) in strings .. However, whilst this code (given a fix to handle strings) would probably work, it isn't a plain lexer, but a mix of lexing and client code. In particular the recursion only works 'by luck of the implementation' that allows sharing of a lexbuf between two lexer functions -- who knows how the lexer handles fallbacks? I have actually being toying with an implementation that does a bit of 'Boyer/Moore' processing to reduce or eliminate the need to rescan characters. The idea is that whenever to reach an accepting state, you create a new automaton, and begin scanning from the start state. When you finally fail you can fallback to the last success, but to lex the next token you do not need to rescan from the start state with the next character, since you have already done that. Although this idea may actually lead to faster processing sometimes, a big advantage is that it works with input iterators, whereas fallback and rescan requires a forward iterator. An input iterator can be converted to a forward one, but only with a potentially infinite buffer. In theory this means that whilst a deterministic finite state automaton can *recognize* a string with finite memory in linear time, tokenisation is in fact NOT finite state at all, but requires linear state -- which is a major disadvantage. By actually merging the automata described above, it may be possible for the 'regexp compiler' to bound the extra state required for some regexps, obtaining finite state tokenisers, whereas at present lexers are not. Anyhow, the point is that with such an augmented lexer it is not clear 'client code' would be allowed to call another (or the same) lexer with the same lexbuf. [In other words, your code is not transparent.] -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net