Re: [Caml-list] Matching start of input in lexer created with ocamllex

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: skaller <skaller@users.sourceforge.net>
To: Janne Hellsten <jjhellst@gmail.com>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Matching start of input in lexer created with ocamllex
Date: Fri, 06 Apr 2007 15:52:35 +1000	[thread overview]
Message-ID: <1175838755.6045.90.camel@rosella.wigram> (raw)
In-Reply-To: <700d600f0704051358g294d540ey96a546f7f57a08e7@mail.gmail.com>

On Thu, 2007-04-05 at 23:58 +0300, Janne Hellsten wrote:

> I do have the latter construct in my lexer.  How does the above help
> me match the beginning of line in the rule

Ok, sorry .. lets try this:

rule headlex = 
| "!for" nonalpha { Some FOR }
| { None }

rule taillex = ...

and now use 

let make_lexer () =
  let atstart  = ref true in
  fun lexer ->
    if !atstart then begin
      atstart := false
      match headlex lexbuf with
      | None -> 
         lexbuf.curr_pos <- lexbuf.curr_pos -1;
         taillex lexbuf
      | Some FOR -> FOR
    end else taillex lexbuf

let lexer = make_lexer () in

This is an animal, because you have to 'pushback' an
unwanted character following the "for" keyword.
This will cease working if the implementation of lexbuf
changes, in particular if the record is made private.

I may also have got the details on this wrong .. it's
an undocumented feature, you'll need to read lexing.ml
to figure out if it works.

The problem I'm trying to solve here is that you want 

	rulename lexbuf

to return a single correct token. This is a requirement
of the parser.

In Felix I don't bother with that. Instead, I just return
a pre-token, I gather the whole input as a list, and then
I write filters that translate the list of pre-tokens into
tokens. Then I make a 'lexer' function that just returns
the next token in the list already prepared, and uses
a dummy lexbuf to satisfy the broken ocamlyacc interface.

In other words, I control invert by returning a list and then
actively rewriting it. For example I would do this,
assuming you have multiple lines that can start with "for":

let strip_white toks = filter (fun tok -> tok = WHITE) toks

let append_eol toks = NEWLINE :: toks 

let fixup_for toks =
  let aux out inp = match inp with
  | NEWLINE :: BANG :: IDENT "for" :: tail -> aux (FOR :: out) tail
  | NEWLINE :: tail -> aux out tail
  | head :: tail -> aux (head::out) tail
  | [] -> rev out
  in aux [] toks

let tokens = fixup_for (append_eol (strip_white toks)) 

The point is this process has arbitrary lookahead.

The downside is you have to hold the whole input in memory.
This is not going to be a problem unless it's the encoding
of an exported database.

The tagged automata underlying the lexer can probably be modified
to allow forward context checks, eg to match stuff like:

	"!for" / nonalpha

where / denotes lookahead (but I'm not sure). I know TRE can
do this, however ocamllex is using a prebuilt tagged 
deterministic automaton, whereas TRE uses a lazily 
constructed one.

The technique basically matches

 	"!for" (TAG) nonalpha

first, but sets the end location to the TAG position. I believe this
might be possible because the automata ALREADY looks ahead and
backtracks to the last success .. so we need to keep two 'success'
points to do this, the real one and the lookahead one. Normally
they're the same, but in this case they're not. The lexer engine
uses the lookahead one until it is finished, but then fudges
its result to the real one.

The two pointers are needed because in general in

	regexp1 / regexp2

we have regexp2 'success' point to track. If you allow

	regexp1 / (regexp2 / regexp3)

you might need more pointers (not sure), so this would probably 
have to be banned (not sure).

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

next prev parent reply	other threads:[~2007-04-06  5:52 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-04-05 14:37 Janne Hellsten
2007-04-05 19:55 ` [Caml-list] " skaller
2007-04-05 20:58   ` Janne Hellsten
2007-04-06  5:52     ` skaller [this message]
     [not found] <20070405205804.90509BC76@yquem.inria.fr>
2007-04-06  7:40 ` David Allsopp
2007-04-06 20:14   ` Janne Hellsten

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1175838755.6045.90.camel@rosella.wigram \
    --to=skaller@users.sourceforge.net \
    --cc=caml-list@inria.fr \
    --cc=jjhellst@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).