Subject: Parser state variables
From: skaller
To: Caml Mailing List
Date: Thu, 27 Oct 2005 17:35:47 +1000
Message-Id: <1130398547.7544.66.camel@rosella>

Are there any plans to extend Ocamlyacc to allow state variables, in a manner similar to the recent extensions to Ocamllex to allow them?

At present, there are two workarounds:

(a) Global variables

I hope I do not need to say how bad this is for a language that is supposed to provide good support for functional programming.

(b) Token hack trick

This trick avoids the problems of global variables by using a proper state object. The idea is that EVERY token should contain the state object, which is put into it by the lexer: thus the state variables passed to the lexer can be transmitted to the parser.

The cost of course is an extra pointer in every token, plus the messiness of actually constructing the tokens, which has to be repeated in every single lexing rule, and of course, you have to declare the tokens:

%token ATOKEN ....

An obvious use of such a facility is a C parser, which needs to modify a list of typedef names, so that the lexer can generate the right token for a given identifier. Most C parsers use a global variable for communication, which of course is very bad.

In other cases, such as the Felix parser, tokenisation is entirely divorced from parsing. Nevertheless, whilst the parse is not influenced by any data, the *user actions* could be. One such use is a 'fresh variable' counter.

Whilst LALR(1) parsers are not directly amenable to extension, one can certainly use a pushdown list of LALR(1) automata to integrate 'recursive descent' parsing techniques with LALR(1). For example, a grammar for statements might use a single token for 'expressions', and we have a separate expression grammar.
We then create a pair of lexers: one lexes statements, recognizing an expression as a single token and attaching the whole expression as an attribute. The statement parser, on seeing such a token, extracts the string from it, lexes and parses it using an expression lexer/parser, takes the resulting expression AST, and glues it into the statement AST.

Without arguing about the feasibility or quality of such a design, the point is that, for example, the expression lexer/parser might be made extensible and depend on tables. And the question is: where do the tables come from?

The only way to do this properly at the moment is the token hack trick -- the tables have to be stored IN the tokens, simply because there is no way to pass them to the parser, so that the parser can pass them to user actions.

One possible solution is to put the state data in a Lexing.lexbuf: since the parser has access to the lexbuf, it can get the user data out and pass it to the user actions. Unfortunately, this would require the lexbuf type to be polymorphic, and thus the lexers and parsers would also have to be polymorphic.

--
John Skaller
Felix, successor to C++: http://felix.sf.net
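
To make workaround (b) concrete, here is a minimal sketch of the token hack in plain OCaml, using the 'fresh variable' counter as the state. Everything in it is an assumption for illustration: the names state, token, tokenize, fresh_name and rename are hypothetical, the token type is written by hand where ocamlyacc would normally generate it from %token declarations, and a list-based function stands in for the generated lexer and parser.

```ocaml
(* Hypothetical sketch of the "token hack": every token carries the
   state object, placed there by the lexer. All names are illustrative. *)

(* Shared state object; here it only holds a fresh-variable counter. *)
type state = { mutable next_fresh : int }

(* Hand-written stand-in for the type ocamlyacc would generate from
   declarations such as  %token <state * string> IDENT  (assumed). *)
type token =
  | IDENT of state * string
  | PLUS of state
  | EOF of state

(* Stand-in for an ocamllex rule that receives the state as an argument:
   every token it builds has to be handed the state explicitly. *)
let tokenize (st : state) (words : string list) : token list =
  let one w = if w = "+" then PLUS st else IDENT (st, w) in
  List.map one words @ [EOF st]

(* Stand-in for a user action: it reaches the state through the token
   it was given, not through a global variable. *)
let fresh_name (st : state) : string =
  let n = st.next_fresh in
  st.next_fresh <- n + 1;
  Printf.sprintf "_tmp%d" n

(* Toy "parser": pairs each identifier with a fresh temporary name. *)
let rec rename (toks : token list) : (string * string) list =
  match toks with
  | IDENT (st, name) :: rest ->
      let tmp = fresh_name st in   (* bind first to fix evaluation order *)
      (name, tmp) :: rename rest
  | PLUS _ :: rest -> rename rest
  | EOF _ :: _ | [] -> []

(* Two parses with independent state objects do not interfere. *)
let a = rename (tokenize { next_fresh = 0 } ["x"; "+"; "y"])
let b = rename (tokenize { next_fresh = 10 } ["z"])
```

The two runs at the end use separate counters, which is exactly what a global variable cannot give you. The cost described above is also visible: every constructor has an extra state field, and tokenize has to thread st into each token it builds.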