Subject: Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
From: skaller <skaller@users.sourceforge.net>
To: Francois.Pottier@inria.fr
Cc: Alessandro Baretta, Caml Mailing List
Date: Thu, 15 Dec 2005 22:03:44 +1100

On Thu, 2005-12-15 at 09:46 +0100, Francois Pottier wrote:

> Doing this properly requires *not* reading the lookahead token. As you
> point out, the problems that arise are similar to the way the end of
> stream is handled.
>
> > It would be really nice if I could write the
> > production as:
> >
> >    exprx: expr ~expr
> >
> > where ~expr means 'any token which forces a reduction
> > of the whole expression' if you know what I mean.
>
> No, I don't know exactly what you mean :) I need to think more to
> understand if this construction makes sense. It makes the grammar
> self-referential, in a way, so it at least has to be monotonic in
> order to make sense.
>
> Our plan regarding calls to external lexers or parsers was to be more
> rigid, that is, to require the current statement (or expression, or
> whatever) to be unambiguously terminated (for instance, by a
> semicolon) at the point where the external call is made. So what you
> do with statements would be accepted, but what you do with
> expressions might be impossible.

So, the idea is this: for a 1-token lookahead parser, all productions
terminate either with 0 lookahead or 1 lookahead. Now, you're saying
only to support recursion for 0 lookahead productions.

So what if you have a 1 lookahead production and you still want to
recurse? One answer is: make a new augmented production which is
0 lookahead:

  prod': prod ~prod { $1, $2 }

where ~prod is the nonterminal

  ~prod:
    | t1 { t1 $1 }
    | t2 { t2 $1 }
    | t3 { t3 $1 }
    ....

where t1, t2, t3, .. are tokens that would not be shifted by the
parser, and therefore cause prod to be reduced. The action then
returns the result of prod, PLUS the lookahead token.
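Concretely, the driver side is only a few lines of ocaml. This is just
a sketch -- Parser.exprx, Lexer.token and the types stand in for
whatever names ocamlyacc generates for your grammar, not real Felix
code:

  (* with ocamlyacc, a %start exprx entry point comes out roughly as
       Parser.exprx :
         (Lexing.lexbuf -> Parser.token) -> Lexing.lexbuf
           -> expr * Parser.token
     i.e. the augmented production hands back the parsed expr together
     with the lookahead token that forced the reduction *)

  (* one-slot buffer holding a token that was pushed back *)
  let pushed_back : Parser.token option ref = ref None

  (* lexer wrapper: deliver any pushed-back token before reading more *)
  let next_token lexbuf =
    match !pushed_back with
    | Some tok -> pushed_back := None; tok
    | None -> Lexer.token lexbuf

  let parse_expr lexbuf =
    let e, terminator = Parser.exprx next_token lexbuf in
    pushed_back := Some terminator;   (* push the terminator back *)
    e

The pushed-back terminator is then the first token the next call to
any entry point sees.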
Before recursing, we can simply push the lookahead token back into the
head of the token stream. Felix does precisely this -- it converts a
1 lookahead production into a 0 lookahead production, and therefore:

> what you do with expressions might be impossible.

seems possible by the above mechanism. It is equivalent to adding a set
of terminating tokens.

The *problem* is choosing the terminating tokens: they are the tokens
that are viable first tokens of any sequence of symbols that can follow
an expression elsewhere in the grammar. For example, consider:

  while expr do ..
  for i in expr to expr do

This grammar would be ambiguous if, whilst parsing an expression, a
'do' or 'to' token did not cause expr to be reduced. So 'do' and 'to'
are 'expression termination tokens', that is, 'do' and 'to' are members
of the set '~expr'. I forget the correct terminology about viable
prefixes and handles .. but I hope the idea is clear.

I think you did raise an important point though: the meaning of ~x is
dependent on the start symbol. So the notation should probably be
something like:

  start~x

meaning 'all the tokens which can terminate an x whilst parsing start'.

This algorithm 'works' for finding start~x: "try every token. If it
leads to a conflict it is not in start~x, otherwise it is." I am doing
that by hand at the moment. Here are the relevant parts of Felix:

  %type exprx
  %start exprx

  exprx: expr expr_terminator { $1,$2 }

  expr_terminator:
    | SEMI { SEMI $1 }
    | USER_KEYWORD { USER_KEYWORD $1 }
    | VBAR { VBAR $1 }
    | ENDMARKER { ENDMARKER }
  @for n,t in flx_expr_terminator_keywords:
    tangle(" | " + t + " { " + t + " $1 }",inhibit_sref=1)

The @ stuff is a Python script that uses a Python table to generate
code; elsewhere I have:

  flx_expr_terminator_keywords = [
    ("all", "ALL"),
    ("assert", "ASSERT"),
    ("body", "BODY"),
    ("call", "CALL"),
    ("case", "CASE"),
    ....
    ("with", "WITH"),
    ("until", "UNTIL"),
    ("_", "UNDERSCORE"),
    ("_gc_pointer", "GC_POINTER"),
    ("_svc", "SVC"),
  ]

  flx_other_keywords = [
    ("and", "AND"),
    ("as", "AS"),
    ....
    ("regmatch", "REGMATCH"),
    ("the", "THE"),
    ("typematch", "TYPEMATCH"),
  ]

  flx_keywords = flx_expr_terminator_keywords + flx_other_keywords

I generated the partition by trial and error. When I modify the grammar
I could miss some cases. In fact, I have no idea if the set of
terminators is maximal. The parser KNOWS .. if I add a wrong keyword to
the terminator set I get a conflict.

The technique I am using would be impossible to maintain if there were
20-30 such productions at 'recursion points'. [At the moment in Felix
there are just two: statements and expressions.]

I hope this makes sense: I think what I'm asking for is quite general,
since it enables any LR(1) parser to be used with an LL parser, provided
one can 'push back' the offending lookahead token. (That is, of course,
always possible to implement by buffering for mutable streams, and
persistence for functional ones; a sketch of both is below.)

--
John Skaller
Felix, successor to C++: http://felix.sf.net
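P.S. by 'buffering for mutable streams, and persistence for functional
ones' I mean something like the following -- a sketch only, the type
and function names are made up:

  (* mutable stream: push-back is a one-slot buffer in front of the
     underlying read function *)
  type 'tok buffered = {
    mutable pending : 'tok option;
    read : unit -> 'tok;
  }

  let push_back b tok = b.pending <- Some tok

  let next b =
    match b.pending with
    | Some tok -> b.pending <- None; tok
    | None -> b.read ()

  (* functional stream: the stream is persistent, so 'pushing back'
     is just consing the token onto the front *)
  type 'tok stream = Cons of 'tok * (unit -> 'tok stream)

  let push_back_f tok rest = Cons (tok, fun () -> rest)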