From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id FAA30470; Sat, 6 Sep 2003 05:36:58 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id FAA02624 for ; Sat, 6 Sep 2003 05:36:57 +0200 (MET DST) Received: from mail1.tpgi.com.au (mail.tpgi.com.au [203.12.160.57]) by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id h863atT21547 for ; Sat, 6 Sep 2003 05:36:55 +0200 (MET DST) Received: from 203-219-16-22-syd-ts21-2600.tpgi.com.au (203-219-16-22-syd-ts21-2600.tpgi.com.au [203.219.16.22]) by mail1.tpgi.com.au (8.11.6/8.11.6) with ESMTP id h863aj427776; Sat, 6 Sep 2003 13:36:48 +1000 Subject: Re: [Caml-list] Calling ocamlyacc from ocamlyacc From: skaller Reply-To: skaller@ozemail.com.au To: katre Cc: caml-list@inria.fr In-Reply-To: <20030902152340.GA32745@henchmonkey.org> References: <20030901172828.GA24208@henchmonkey.org> <20030902151922.GB4055@roke.freak> <20030902152340.GA32745@henchmonkey.org> Content-Type: text/plain Message-Id: <1062819385.17265.30.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 (1.2.2-4) Date: 06 Sep 2003 13:36:26 +1000 Content-Transfer-Encoding: 7bit X-Loop: caml-list@inria.fr X-Spam: no; 0.00; caml-list:01 ozemail:01 michal:01 moskal:01 pretend:01 recursively:01 regex:01 python:01 invokes:01 srcref:01 srcref:01 passing:01 python:01 indentation:01 regexp:01 Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk On Wed, 2003-09-03 at 01:23, katre wrote: > Michal Moskal wrote: > > > > You can define several start symbols in your grammar. Parsing functions > > are defined for each. You can also define several rule ... in your lexer > > (lexing functions are defined for each). Hope that helps, I can't help > > more, since I don't quite understand nature of your problem. > > > > Right, but is there a way, in a ocamlyacc action, to switch which lexer > rule you're using? That seems to be the main part I am missing. Or if > I could access the lexbuf directly, I could also use that. I think the answer is no, you can't do that. The reason is that yacc et al are LR(1) meaning 1 token look ahead is needed before a reduction: in general, when a reduction occurs there is no guarrantee what other tokens haven't been fetched. Now yacc/lex are normally driven by the parser fetching tokens. So what you _can_ do is pretend that the 'deviant sublanguage' you need to use a different grammar for is a single *huge* token. The lexer, unlike the parser, can be invoked recursively, and in particular when a regex is matched, the code which returns a value can do anything. In particular, you can switch to another lexing rule. I do this all the time for handling comments and strings: the lexeme for open quote is recognised and it calls a string gathering rule which uses the same lexbuf. Lexbufs have a current position, which is the end of the lexeme .. even if the finite state automaton actually looked ahead further. So with lexbuf, you know *exactly* where you are, whenever the code associated with a regexp is matched. Here is the mainline lexer: .... (* Python strings *) | quote { fun state -> state#inbody; parse_qstring lexbuf state } | qqq { fun state -> state#inbody; parse_qqqstring lexbuf state } which invokes the sublexer: rule parse_qstring = parse | qstring { fun state -> state#inbody; [STRING ( state#get_srcref lexbuf, state#decode decode_qstring (lexeme lexbuf) )] } | _ { fun state -> [ERRORTOKEN ( state#get_srcref lexbuf, "' string" )] } and parse_qqqstring = parse | qqqstring { fun state -> state#inbody; [STRING ( state#get_srcref lexbuf, state#decode decode_qqqstring (lexeme lexbuf) )] } | _ { fun state -> state#inbody; [ERRORTOKEN ( state#get_srcref lexbuf, "''' string" )] } ------- Now, I said you can do anything, and in the example I just call another lexer rule, but .. there is no reason you can't call a parser function, passing the same lexbuf. Note that you have to do this from the LEXER code, to ensure that the sub-parser is invoked on exactly the correct starting character. You may also note in the example my lexer codes have a fun state -> ... form (for every lexeme which is boring to write). This state is a mutable object which is passed to the lexer as an extra argument (just add it after the call to the rule as in the nested example: | quote { fun state -> state#inbody; parse_qstring lexbuf state } --------------------------------------*************--------***** - rule extra arg Note: I am returning lists of tokens not tokens. My lexer code is NOT called by the parser. I call it myself and build a list of tokens, pre-process them, and pass the output of that to the parser via a dummy lexbuf. I have in fact constructed a PYTHON parser using Ocamlyacc, even though Python grammar is 'strongly not LR(1)' :-)) I do this by something like 13 filterings of the token streams (to find the indentation etc) before it is in an LR(1) form. ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners