caml-list - the Caml user's mailing list
From: skaller <skaller@ozemail.com.au>
To: katre <katre@henchmonkey.org>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Calling ocamlyacc from ocamlyacc
Date: 06 Sep 2003 13:36:26 +1000
Message-ID: <1062819385.17265.30.camel@localhost.localdomain>
In-Reply-To: <20030902152340.GA32745@henchmonkey.org>

On Wed, 2003-09-03 at 01:23, katre wrote:
> Michal Moskal wrote:
> > 
> > You can define several start symbols in your grammar. Parsing functions
> > are defined for each. You can also define several rule ... in your lexer
> > (lexing functions are defined for each). Hope that helps, I can't help
> > more, since I don't quite understand nature of your problem.
> > 
> 
> Right, but is there a way, in a ocamlyacc action, to switch which lexer
> rule you're using?  That seems to be the main part I am missing.  Or if
> I could access the lexbuf directly, I could also use that.

I think the answer is no, you can't do that.
The reason is that yacc et al are LR(1), meaning
one token of lookahead may be needed before a reduction:
in general, when a reduction occurs there is no
guarantee about which tokens have already been fetched.

Now yacc/lex are normally driven by the parser
fetching tokens. So what you _can_ do is pretend
that the 'deviant sublanguage' you need to use
a different grammar for is a single *huge* token.

The lexer, unlike the parser, can be invoked
recursively, and in particular when a regex
is matched, the code which returns a value can
do anything.

In particular, you can switch to another lexing rule.
I do this all the time for handling comments
and strings: the lexeme for open quote is
recognised and it calls a string gathering
rule which uses the same lexbuf.

Lexbufs have a current position, which is the
end of the lexeme, even if the finite state
automaton actually looked ahead further.

So with lexbuf, you know *exactly* where you are
whenever the code associated with a matched regexp runs.

Here is the mainline lexer:
....
(* Python strings *)
| quote  { fun state -> state#inbody; parse_qstring lexbuf state }
| qqq    { fun state -> state#inbody; parse_qqqstring lexbuf state }


which invokes the sublexer:

rule parse_qstring = parse
| qstring { 
    fun state -> 
      state#inbody;
      [STRING (
        state#get_srcref lexbuf, 
        state#decode decode_qstring (lexeme lexbuf)
      )] 
  }
| _ { 
  fun state -> 
    [ERRORTOKEN (
      state#get_srcref lexbuf, 
      "' string"
    )] 
  }

and parse_qqqstring = parse
| qqqstring { 
    fun state -> 
      state#inbody;
      [STRING (
        state#get_srcref lexbuf, 
        state#decode decode_qqqstring (lexeme lexbuf)
      )] 
  }
| _ { 
  fun state -> 
    state#inbody;
    [ERRORTOKEN (
      state#get_srcref lexbuf, 
      "''' string"
    )] 
  }

-------
Now, I said you can do anything, and in the example I just
call another lexer rule, but .. there is no reason you
can't call a parser function, passing the same lexbuf.

Note that you have to do this from the LEXER code,
to ensure that the sub-parser is invoked on exactly
the correct starting character.
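
A minimal sketch of the "single huge token" idea, in plain OCaml
rather than ocamllex/ocamlyacc so that it runs on its own. Everything
here is hypothetical and is not the code from my lexer: the record
buf stands in for a lexbuf, next_token for the main lexing rule, and
parse_sum for the sub-parser. The point is only that the lexer action
for the opening delimiter hands the same buffer to the sub-parser,
which starts on exactly the next character.

```ocaml
(* Hand-rolled stand-in for a lexbuf: the input plus a current position. *)
type buf = { text : string; mutable pos : int }

type token =
  | IDENT of string
  | EMBEDDED of int   (* the whole sublanguage region, already parsed *)
  | EOF

(* Hypothetical sub-parser: sums integers separated by '+' up to the
   closing "|}". It starts exactly where the main lexer stopped. *)
let parse_sum b =
  let total = ref 0 and cur = ref 0 in
  let rec loop () =
    if b.pos + 1 < String.length b.text
       && b.text.[b.pos] = '|' && b.text.[b.pos + 1] = '}' then begin
      b.pos <- b.pos + 2;          (* consume the closing delimiter *)
      !total + !cur
    end else if b.pos >= String.length b.text then
      failwith "unterminated embedded region"
    else begin
      (match b.text.[b.pos] with
       | '0'..'9' as c -> cur := !cur * 10 + (Char.code c - Char.code '0')
       | '+' -> total := !total + !cur; cur := 0
       | ' ' -> ()
       | c -> failwith (Printf.sprintf "unexpected %c" c));
      b.pos <- b.pos + 1;
      loop ()
    end
  in
  loop ()

(* Main lexer: on "{|" it calls the sub-parser on the same buffer and
   returns the result as one token. *)
let rec next_token b =
  let n = String.length b.text in
  if b.pos >= n then EOF
  else match b.text.[b.pos] with
    | ' ' -> b.pos <- b.pos + 1; next_token b
    | '{' when b.pos + 1 < n && b.text.[b.pos + 1] = '|' ->
      b.pos <- b.pos + 2;
      EMBEDDED (parse_sum b)
    | 'a'..'z' ->
      let start = b.pos in
      while b.pos < n
            && (match b.text.[b.pos] with 'a'..'z' -> true | _ -> false) do
        b.pos <- b.pos + 1
      done;
      IDENT (String.sub b.text start (b.pos - start))
    | c -> failwith (Printf.sprintf "unexpected %c" c)

(* Collect every token, including the final EOF. *)
let tokens s =
  let b = { text = s; pos = 0 } in
  let rec go acc =
    match next_token b with
    | EOF -> List.rev (EOF :: acc)
    | t -> go (t :: acc)
  in
  go []
```

With a real ocamlyacc sub-parser the same shape should apply, assuming
the sub-grammar ends in an explicit closing token so that the
sub-parser does not need to read past the end of the region.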


You may also note that in the example every lexer
action has the form

	fun state -> ...

(which is boring to write for every lexeme).
This state is a mutable object which is passed
to the lexer as an extra argument: just add it
after the call to the rule, as in the nested example:

| quote  { fun state -> state#inbody; parse_qstring lexbuf state }
--------------------------------------*************--------*****
-                                     rule                extra arg


Note: I am returning lists of tokens not tokens. My lexer code
is NOT called by the parser. I call it myself and build a list
of tokens, pre-process them, and pass the output of that to
the parser via a dummy lexbuf.
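
A minimal sketch of the dummy-lexbuf trick, under stated assumptions:
the token type, parse_sum and lexer_of_list are hypothetical names,
and parse_sum is only a stand-in with the same shape as an
ocamlyacc-generated entry point, (Lexing.lexbuf -> token) ->
Lexing.lexbuf -> result. The "lexer" handed to the parser ignores its
lexbuf and pops tokens off a list that was built beforehand.

```ocaml
type token = INT of int | PLUS | EOF

(* Stand-in for an ocamlyacc entry point: it pulls tokens by calling
   the lexer function on the lexbuf it was given. Here it just sums
   the INT tokens. *)
let parse_sum (next : Lexing.lexbuf -> token) (lexbuf : Lexing.lexbuf) : int =
  let rec loop acc =
    match next lexbuf with
    | INT n -> loop (acc + n)
    | PLUS -> loop acc
    | EOF -> acc
  in
  loop 0

(* Turn a pre-built token list into a "lexer" that ignores its lexbuf
   and keeps returning EOF once the list is exhausted. *)
let lexer_of_list (toks : token list) : Lexing.lexbuf -> token =
  let rest = ref toks in
  fun _lexbuf ->
    match !rest with
    | [] -> EOF
    | t :: tl -> rest := tl; t

let result =
  let toks = [INT 1; PLUS; INT 2; PLUS; INT 39; EOF] in
  let dummy = Lexing.from_string "" in   (* never actually read *)
  parse_sum (lexer_of_list toks) dummy
```

One consequence of this design is that positions taken from the dummy
lexbuf are meaningless, which is presumably why source references are
carried inside the tokens themselves.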

I have in fact constructed a Python parser using Ocamlyacc,
even though the Python grammar is 'strongly not LR(1)' :-))
I do this with something like 13 filtering passes over the
token stream (to find the indentation etc) before it
is in an LR(1) form.
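
To give an idea of what one such pass might look like, here is a
sketch of an indentation filter. It is not one of the actual passes:
the token type and indent_filter are hypothetical, and it assumes an
earlier pass has already produced NEWLINE n tokens carrying the number
of leading spaces on the following line.

```ocaml
type token =
  | NEWLINE of int   (* raw: line break followed by n leading spaces *)
  | WORD of string
  | INDENT
  | DEDENT
  | EOL

(* Rewrite NEWLINE n into EOL plus INDENT/DEDENT tokens, using a stack
   of open indentation levels. Dedenting to a level that was never
   opened is not reported as an error in this sketch. *)
let indent_filter (toks : token list) : token list =
  (* Pop every level deeper than n, emitting one DEDENT per level. *)
  let rec close stack n acc =
    match stack with
    | top :: rest when top > n -> close rest n (DEDENT :: acc)
    | _ -> (stack, acc)
  in
  let rec go stack acc = function
    | [] ->
      let _, acc = close stack 0 acc in   (* close blocks still open *)
      List.rev acc
    | NEWLINE n :: rest ->
      let top = match stack with t :: _ -> t | [] -> 0 in
      if n > top then go (n :: stack) (INDENT :: EOL :: acc) rest
      else
        let stack, acc = close stack n (EOL :: acc) in
        go stack acc rest
    | t :: rest -> go stack (t :: acc) rest
  in
  go [0] [] toks
```

After a pass like this the block structure is explicit in the token
stream, so the grammar can treat INDENT and DEDENT like ordinary
brackets.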

