Re: [Caml-list] index of substring

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: skaller <skaller@users.sourceforge.net>
To: Jean-Christophe Filliatre <Jean-Christophe.Filliatre@lri.fr>
Cc: Radu Grigore <radugrigore@gmail.com>,
	"yjc01@doc.ic.ac.uk" <yjc01@doc.ic.ac.uk>,
	caml-list <caml-list@inria.fr>
Subject: Re: [Caml-list] index of substring
Date: 28 Jan 2005 03:55:36 +1100	[thread overview]
Message-ID: <1106844935.12114.62.camel@pelican.wigram> (raw)
In-Reply-To: <16888.49247.996632.225091@gargle.gargle.HOWL>

On Thu, 2005-01-27 at 21:20, Jean-Christophe Filliatre wrote:
> 
> What's wrong with this way of handling nested comments with ocamllex:
> 
> ======================================================================
> 
> | "(*" { comment lexbuf; ... }
> ...
> 
> and comment = parse
> | "(*" { comment lexbuf; comment lexbuf }
> | "*)" { () }
> | _    { comment lexbuf }
> 

Well it doesn't handle (* or *) in strings ..

However, whilst this code (given a fix
to handle strings) would probably work,
it isn't a plain lexer, but a mix of lexing
and client code. 

In particular the recursion only works 'by luck of 
the implementation' that allows sharing of a lexbuf
between two lexer functions -- who knows how
the lexer handles fallbacks?

I have actually being toying with an implementation
that does a bit of 'Boyer/Moore' processing
to reduce or eliminate the need to rescan characters.

The idea is that whenever to reach an accepting
state, you create a new automaton, and begin
scanning from the start state. When you finally
fail you can fallback to the last success,
but to lex the next token you do not need
to rescan from the start state with the next
character, since you have already done that.

Although this idea may actually lead to faster
processing sometimes, a big advantage is that
it works with input iterators, whereas fallback
and rescan requires a forward iterator. 
An input iterator can be converted to a forward
one, but only with a potentially infinite buffer.

In theory this means that whilst a deterministic
finite state automaton can *recognize* a string
with finite memory in linear time, tokenisation
is in fact NOT finite state at all, but requires
linear state -- which is a major disadvantage.

By actually merging the automata described above,
it may be possible for the 'regexp compiler' to
bound the extra state required for some
regexps, obtaining finite state tokenisers,
whereas at present lexers are not.

Anyhow, the point is that with such an augmented
lexer it is not clear 'client code' would be allowed
to call another (or the same) lexer with the same
lexbuf.

[In other words, your code is not transparent.]

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

next prev parent reply	other threads:[~2005-01-27 16:56 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-01-27  0:41 yjc01
2005-01-27  4:21 ` [Caml-list] " Matt Gushee
2005-01-27  6:44 ` Radu Grigore
2005-01-27  9:36   ` skaller
2005-01-27 10:20     ` Jean-Christophe Filliatre
2005-01-27 16:55       ` skaller [this message]
2005-01-27 17:20         ` Radu Grigore
2005-01-28  8:51         ` Jean-Christophe Filliatre
2005-01-28 13:31           ` skaller
2005-01-27  8:37 ` Oliver Bandel
2005-01-27  9:17 ` Xavier Leroy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1106844935.12114.62.camel@pelican.wigram \
    --to=skaller@users.sourceforge.net \
    --cc=Jean-Christophe.Filliatre@lri.fr \
    --cc=caml-list@inria.fr \
    --cc=radugrigore@gmail.com \
    --cc=yjc01@doc.ic.ac.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).