Matching start of input in lexer created with ocamllex

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

* Matching start of input in lexer created with ocamllex
@ 2007-04-05 14:37 Janne Hellsten
  2007-04-05 19:55 ` [Caml-list] " skaller
  0 siblings, 1 reply; 6+ messages in thread
From: Janne Hellsten @ 2007-04-05 14:37 UTC (permalink / raw)
  To: caml-list

Hi,

I'd like to match the beginning of input (or beginning of line) in my
lexer.  Is there an easy way to do that?

I have a lexer that looks something like this (simplified):

rule initial = parse
  | '!' [' ' '\t']* "for" { FOR (current_loc ()) }
  | ident as id { IDENT (id, current_loc ()) }
  | '!' { BANG (current_loc ()) }

The !for token should only be matched at the beginning of a
line/input.  However, in the above lexer, there's nothing that
prevents !for from being matched in the middle of an input string.
This causes a problem: An input string containing !forbidXyz will be
lexed FOR, IDENT "bidXyz".  I'd like to lex it as BANG, IDENT
"forbidXyz".

Thank you for your help!

Janne

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] Matching start of input in lexer created with ocamllex
  2007-04-05 14:37 Matching start of input in lexer created with ocamllex Janne Hellsten
@ 2007-04-05 19:55 ` skaller
  2007-04-05 20:58   ` Janne Hellsten
  0 siblings, 1 reply; 6+ messages in thread
From: skaller @ 2007-04-05 19:55 UTC (permalink / raw)
  To: Janne Hellsten; +Cc: caml-list

On Thu, 2007-04-05 at 17:37 +0300, Janne Hellsten wrote:
> Hi,
> 
> I'd like to match the beginning of input (or beginning of line) in my
> lexer.  Is there an easy way to do that?
> 
> I have a lexer that looks something like this (simplified):
> 
> rule initial = parse
>   | '!' [' ' '\t']* "for" { FOR (current_loc ()) }
>   | ident as id { IDENT (id, current_loc ()) }
>   | '!' { BANG (current_loc ()) }
> 
> The !for token should only be matched at the beginning of a
> line/input.  However, in the above lexer, there's nothing that
> prevents !for from being matched in the middle of an input string.
> This causes a problem: An input string containing !forbidXyz will be
> lexed FOR, IDENT "bidXyz".  I'd like to lex it as BANG, IDENT
> "forbidXyz".

I do something like this:

let table = ["for", FOR; "while", WHILE]
..
| space-not-newline + { WHITE }
| newline { NEWLINE }
| ident as id { try assoc id table with Not_found -> IDENT id }

An alternative to the WHITE and NEWLINE tokens is a tail
recursive call to the lexer:

| space + { initial lexbuf }

which just skips over the spaces.


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] Matching start of input in lexer created with ocamllex
  2007-04-05 19:55 ` [Caml-list] " skaller
@ 2007-04-05 20:58   ` Janne Hellsten
  2007-04-06  5:52     ` skaller
  0 siblings, 1 reply; 6+ messages in thread
From: Janne Hellsten @ 2007-04-05 20:58 UTC (permalink / raw)
  To: caml-list

> let table = ["for", FOR; "while", WHILE]
> ..
> | space-not-newline + { WHITE }
> | newline { NEWLINE }
> | ident as id { try assoc id table with Not_found -> IDENT id }
>
> An alternative to the WHITE and NEWLINE tokens is a tail
> recursive call to the lexer:
>
> | space + { initial lexbuf }
>
> which just skips over the spaces.

I do have the latter construct in my lexer.  How does the above help
me match the beginning of line in the rule

  | '!' [' ' '\t']* "for" { FOR (current_loc ()) }

I'd like it to look like this:

  | "^!for" { FOR }

where ^ would denote the start of input.  To simplify my question, we
can assume there are no new line chars in my input.  This regexp:

'!' [' ' '\t']* "for"

could actually be simplified to

| "!for"

and I would still need to match the "beginning of input" somehow.

I must've missed your point.

Janne

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] Matching start of input in lexer created with ocamllex
  2007-04-05 20:58   ` Janne Hellsten
@ 2007-04-06  5:52     ` skaller
  0 siblings, 0 replies; 6+ messages in thread
From: skaller @ 2007-04-06  5:52 UTC (permalink / raw)
  To: Janne Hellsten; +Cc: caml-list

On Thu, 2007-04-05 at 23:58 +0300, Janne Hellsten wrote:

> I do have the latter construct in my lexer.  How does the above help
> me match the beginning of line in the rule

Ok, sorry .. lets try this:

rule headlex = 
| "!for" nonalpha { Some FOR }
| { None }

rule taillex = ...

and now use 

let make_lexer () =
  let atstart  = ref true in
  fun lexer ->
    if !atstart then begin
      atstart := false
      match headlex lexbuf with
      | None -> 
         lexbuf.curr_pos <- lexbuf.curr_pos -1;
         taillex lexbuf
      | Some FOR -> FOR
    end else taillex lexbuf

let lexer = make_lexer () in

This is an animal, because you have to 'pushback' an
unwanted character following the "for" keyword.
This will cease working if the implementation of lexbuf
changes, in particular if the record is made private.

I may also have got the details on this wrong .. it's
an undocumented feature, you'll need to read lexing.ml
to figure out if it works.

The problem I'm trying to solve here is that you want 

	rulename lexbuf

to return a single correct token. This is a requirement
of the parser.

In Felix I don't bother with that. Instead, I just return
a pre-token, I gather the whole input as a list, and then
I write filters that translate the list of pre-tokens into
tokens. Then I make a 'lexer' function that just returns
the next token in the list already prepared, and uses
a dummy lexbuf to satisfy the broken ocamlyacc interface.

In other words, I control invert by returning a list and then
actively rewriting it. For example I would do this,
assuming you have multiple lines that can start with "for":

let strip_white toks = filter (fun tok -> tok = WHITE) toks

let append_eol toks = NEWLINE :: toks 

let fixup_for toks =
  let aux out inp = match inp with
  | NEWLINE :: BANG :: IDENT "for" :: tail -> aux (FOR :: out) tail
  | NEWLINE :: tail -> aux out tail
  | head :: tail -> aux (head::out) tail
  | [] -> rev out
  in aux [] toks

let tokens = fixup_for (append_eol (strip_white toks)) 

The point is this process has arbitrary lookahead.

The downside is you have to hold the whole input in memory.
This is not going to be a problem unless it's the encoding
of an exported database.

The tagged automata underlying the lexer can probably be modified
to allow forward context checks, eg to match stuff like:

	"!for" / nonalpha

where / denotes lookahead (but I'm not sure). I know TRE can
do this, however ocamllex is using a prebuilt tagged 
deterministic automaton, whereas TRE uses a lazily 
constructed one.

The technique basically matches

 	"!for" (TAG) nonalpha

first, but sets the end location to the TAG position. I believe this
might be possible because the automata ALREADY looks ahead and
backtracks to the last success .. so we need to keep two 'success'
points to do this, the real one and the lookahead one. Normally
they're the same, but in this case they're not. The lexer engine
uses the lookahead one until it is finished, but then fudges
its result to the real one.

The two pointers are needed because in general in

	regexp1 / regexp2

we have regexp2 'success' point to track. If you allow

	regexp1 / (regexp2 / regexp3)

you might need more pointers (not sure), so this would probably 
have to be banned (not sure).

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] Matching start of input in lexer created with ocamllex
  2007-04-06  7:40 ` David Allsopp
@ 2007-04-06 20:14   ` Janne Hellsten
  0 siblings, 0 replies; 6+ messages in thread
From: Janne Hellsten @ 2007-04-06 20:14 UTC (permalink / raw)
  To: David Allsopp; +Cc: caml-list

On 4/6/07, David Allsopp <dra-news@metastack.com> wrote:
> > I'd like to match the beginning of input (or beginning of line) in my
> > lexer.  Is there an easy way to do that?
> Ocamllex doesn't have a notion for beginning of line. Three possible
> solutions:
>
> 1. You can simulate it with a bool ref parameter to your lexer that gets set
> to true by each rule to indicate that you're no longer at the beginning of a
> line - the "!for" rule than raises Failure if it matches when this ref is
> true. Slightly tedious for code maintenance...
> 2. You use two lexers - one with the "!for" rule and one without and call
> one lexer from the other (not very nice, because ocamllex doesn't support
> code reuse between lexers so you'll be duplicating a lot of code).
> 3. Pre-process the input to ocamllex to include a special character that
> cannot appear in your text and place that at the beginning of the line (e.g.
> one of the control characters in 0..31).

Thanks, my program was already running its input through a
preprocessor and I thus chose to use solution #3.  I had already
considered solution #2 but came to the same conclusion: it leads to
code duplication.

Janne


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] Matching start of input in lexer created with ocamllex
       [not found] <20070405205804.90509BC76@yquem.inria.fr>
@ 2007-04-06  7:40 ` David Allsopp
  2007-04-06 20:14   ` Janne Hellsten
  0 siblings, 1 reply; 6+ messages in thread
From: David Allsopp @ 2007-04-06  7:40 UTC (permalink / raw)
  To: caml-list

> Hi,
>
> I'd like to match the beginning of input (or beginning of line) in my
> lexer.  Is there an easy way to do that?
>
> I have a lexer that looks something like this (simplified):
>
> rule initial = parse
>   | '!' [' ' '\t']* "for" { FOR (current_loc ()) }
>   | ident as id { IDENT (id, current_loc ()) }
>   | '!' { BANG (current_loc ()) }
>
> The !for token should only be matched at the beginning of a
> line/input.  However, in the above lexer, there's nothing that
> prevents !for from being matched in the middle of an input string.
> This causes a problem: An input string containing !forbidXyz will be
> lexed FOR, IDENT "bidXyz".  I'd like to lex it as BANG, IDENT
> "forbidXyz".

Ocamllex doesn't have a notion for beginning of line. Three possible
solutions:

1. You can simulate it with a bool ref parameter to your lexer that gets set
to true by each rule to indicate that you're no longer at the beginning of a
line - the "!for" rule than raises Failure if it matches when this ref is
true. Slightly tedious for code maintenance... 
2. You use two lexers - one with the "!for" rule and one without and call
one lexer from the other (not very nice, because ocamllex doesn't support
code reuse between lexers so you'll be duplicating a lot of code).
3. Pre-process the input to ocamllex to include a special character that
cannot appear in your text and place that at the beginning of the line (e.g.
one of the control characters in 0..31).

HTH,

David

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2007-04-06 20:14 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-05 14:37 Matching start of input in lexer created with ocamllex Janne Hellsten
2007-04-05 19:55 ` [Caml-list] " skaller
2007-04-05 20:58   ` Janne Hellsten
2007-04-06  5:52     ` skaller
     [not found] <20070405205804.90509BC76@yquem.inria.fr>
2007-04-06  7:40 ` David Allsopp
2007-04-06 20:14   ` Janne Hellsten

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).