From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <weis@pauillac.inria.fr>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78])
	by yquem.inria.fr (Postfix) with ESMTP id 93F82BB9A
	for <caml-list@yquem.inria.fr>; Thu, 27 Oct 2005 09:36:02 +0200 (CEST)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id j9R7a24U002821
	for <caml-list@yquem.inria.fr>; Thu, 27 Oct 2005 09:36:02 +0200
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id JAA28716 for <caml-list@pauillac.inria.fr>; Thu, 27 Oct 2005 09:36:01 +0200 (MET DST)
Received: from ash25e.internode.on.net (ash25e.internode.on.net [203.16.214.182])
	by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id j9R7ZxaP002812
	for <caml-list@inria.fr>; Thu, 27 Oct 2005 09:36:00 +0200
Received: from rosella (ppp1-105.lns1.syd7.internode.on.net [59.167.1.105])
	by ash25e.internode.on.net (8.12.9/8.12.6) with ESMTP id j9R7ZlYg072888
	for <caml-list@inria.fr>; Thu, 27 Oct 2005 17:05:51 +0930 (CST)
	(envelope-from skaller@users.sourceforge.net)
Subject: Parser state variables
From: skaller <skaller@users.sourceforge.net>
To: Caml Mailing List <caml-list@inria.fr>
Content-Type: text/plain
Date: Thu, 27 Oct 2005 17:35:47 +1000
Message-Id: <1130398547.7544.66.camel@rosella>
Mime-Version: 1.0
X-Mailer: Evolution 2.2.1.1 
Content-Transfer-Encoding: 7bit
X-Miltered: at nez-perce with ID 43608362.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Miltered: at nez-perce with ID 4360835F.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; parser:01 ocamlyacc:01 ocamllex:01 workarounds:01 token:01 token:01 lexer:01 lexer:01 parser:01 pointer:01 tokens:01 lexing:01 tokens:01 typedef:01 parsers:01 
X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.0.3

Are there any plans to extend Ocamlyacc to allow state variables,
in a manner similar to the recent extensions to Ocamllex to
allow them?

At present, there are two workarounds:

(a) global variables

I hope I do not need to say how bad this is for a language
that is supposed to provide good support for functional programming.

(b) Token hack trick

This trick avoids the problems of global variables by using
a proper state object. The idea is that EVERY token should
contain the state object, which is put into it by the
lexer: thus the state variables passed to the lexer can
be transmitted to the parser.

The cost of course is an extra pointer in every token,
plus the messiness of actually constructing the tokens,
which has to be repeated in every single lexing rule,
and of course, you have to declare the tokens:

%token<MyLexerState.lexer_state> ATOKEN
....

An obvious use of such a facility is a C parser,
which needs to modify a list of typedef names,
so the that lexer can generate the right token
for a given identifier.

Most C parsers use a global variable for communication,
which of course is very bad.

In other cases, such as the Felix parser, tokenisation
is entirely divorced from parsing. Nevertheless, whilst
the parse is not influence by any data, the *user actions*
could be. One such use is a 'fresh variable' counter.

Whilst LALR(1) parsers are not directly
amenable to extension, one can certainly use a pushdown
list of LALR(1) automata to integrate 'recursive descent'
parsing techniques with LALR(1).

For example, a grammar for statements might use a single
token for 'expressions', and we have a separate expression
grammar. We then create a pair of lexers: one lexes statements,
recognizing an expression as a single token, attaching the whole
expression as an attribute. The statement parser, on seeing
such a token, extracts the string from it, and lexes and
parses it using an expression lexer/parser, takes the expression
AST which results, and glues it into the statement AST.

Without arguing about the feasibility or quality of such a 
design, the point is that, for example, the expression lexer
parser might be made extensible and depend on tables.
And the question is: where do the tables come from?

The only way to do this properly at the moment is the
token hack trick -- the tables have to be stored IN
the tokens, simply because there is no way to pass them
to the parser, so that the parser can pass them to user
actions.

One possible solution is to put the state data in a Lexbuf.t,
unfortunately, this would require it to be polymorphic,
and thus the lexer and parsers would also have to be polymorphic.
Since the parser has access to the lexbuf .. it can get the user
data out and pass it to the user actions. 


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net