From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id QAA20340 for caml-red; Fri, 2 Feb 2001 16:23:30 +0100 (MET)
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id PAA00060 for ; Thu, 1 Feb 2001 15:22:16 +0100 (MET)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by concorde.inria.fr (8.11.1/8.10.0) with ESMTP id f11EM7n08847; Thu, 1 Feb 2001 15:22:07 +0100 (MET)
Received: (from xleroy@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id PAA20033; Thu, 1 Feb 2001 15:22:06 +0100 (MET)
Date: Thu, 1 Feb 2001 15:22:06 +0100
From: Xavier Leroy
To: Laurent Reveillere
Cc: caml-list@inria.fr
Subject: Re: ocamlyacc/ocamllex problems
Message-ID: <20010201152206.A30653@pauillac.inria.fr>
References: <3A784863.78765C0@labri.u-bordeaux.fr>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: <3A784863.78765C0@labri.u-bordeaux.fr>; from Laurent.Reveillere@labri.u-bordeaux.fr on Wed, Jan 31, 2001 at 06:16:19PM +0100
Sender: weis@pauillac.inria.fr

> I am writing a parser that uses Parsing.rhs_start and Parsing.rhs_end
> in a rule. The problem is the following:
>
> 1) If I use a simple rule in the lexer that matches the token, all is
>    fine. Ex:
>
>      | "'" ['0' '1' '*' '.']+ "'"   { ... }
>
> 2) If I use an automaton in the lexer to match the same token, the
>    results of Parsing.rhs_start and Parsing.rhs_end are wrong. Ex:
>
>      | "'"                  { ... bits lexbuf ... }
>    and bits = parse
>      | '\''                 { ... }
>      | ['0' '1' '.' '*' ]   { ... }
>      | eof                  { ... }
>      | _                    { ... }
>
> I am not sure I understand the reason for this problem.

For terminal symbols (tokens), the locations returned by
Parsing.rhs_start and Parsing.rhs_end are those returned by
Lexing.lexeme_start and Lexing.lexeme_end.  However, these two
functions track the location of the *last* regular expression matched
by the ocamllex-generated automaton.  (This location is stored and
updated in place in the "lexbuf" argument.)

So, if your lexing rule recursively calls other lexing rules (as in
case 2 above), the locations reported correspond to the part of the
token that was last matched by a regular expression (i.e. the last
"bit" of the token in your example 2).

To get correct locations in example 2, a bit of "lexbuf" hacking is
required to restore the start location to what it was when the first
regexp was matched:

  | "'" { let start = Lexing.lexeme_start lexbuf in
          let res = ... bits lexbuf ... in
          lexbuf.Lexing.lex_start_pos <- start - lexbuf.Lexing.lex_abs_pos;
          res }
and bits = parse ...

Hope this helps,

- Xavier Leroy
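
[Editorial note, not part of the original message: below is a minimal,
self-contained sketch of how the workaround above might look in a
complete ocamllex rule.  The BITSTRING token, the Buffer-based
accumulation, and the rule names are illustrative assumptions; only the
lex_start_pos / lex_abs_pos manipulation comes from the message itself.]

  (* Hypothetical bitstring.mll: tokens like '0110.*' scanned by a
     recursive rule, with the start location restored at the end so
     Lexing.lexeme_start points at the opening quote. *)
  {
  type token = BITSTRING of string | EOF
  }

  rule token = parse
    | [' ' '\t' '\n']+
        { token lexbuf }                      (* skip whitespace *)
    | "'"
        { (* remember where the whole token started *)
          let start = Lexing.lexeme_start lexbuf in
          let buf = Buffer.create 16 in
          bits buf lexbuf;                    (* consume up to the closing quote *)
          (* restore the start location: lexeme_start now reports the
             opening quote again, while lexeme_end is the closing quote *)
          lexbuf.Lexing.lex_start_pos <- start - lexbuf.Lexing.lex_abs_pos;
          BITSTRING (Buffer.contents buf) }
    | eof
        { EOF }

  and bits buf = parse
    | '\''
        { () }                                (* closing quote: stop *)
    | ['0' '1' '.' '*']
        { Buffer.add_string buf (Lexing.lexeme lexbuf); bits buf lexbuf }
    | eof
        { failwith "unterminated bit string" }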