From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: by yquem.inria.fr (Postfix, from userid 18965) id 17862BB81; Wed, 14 Dec 2005 10:04:27 +0100 (CET) Date: Wed, 14 Dec 2005 10:04:27 +0100 From: Francois Pottier To: skaller Cc: Caml Mailing List Subject: Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml Message-ID: <20051214090427.GB1330@yquem.inria.fr> Reply-To: Francois.Pottier@inria.fr References: <20051212175838.GA8502@yquem.inria.fr> <1134540495.8980.63.camel@rosella> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <1134540495.8980.63.camel@rosella> User-Agent: Mutt/1.5.9i X-Spam: no; 0.00; caml-list:01 parser:01 ocaml:01 distro:01 ocaml:01 parser:01 functor:01 ocamlyacc:01 val:01 token:01 lexbuf:01 val:01 token:01 lexers:01 lexbuf:01 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.0.3 Hello, On Wed, Dec 14, 2005 at 05:08:15PM +1100, skaller wrote: > 0. The licence. Q public licence for the generator???? > Please NO NO NO!! Not unless it is distributed > as part of the official distro. Is there any chance of that? Our plan (not yet approved by the OCaml higher authorities) is to make it part of the OCaml distribution. > In particular, can the parser be generated *inside* > an environment such a function or let binding? No. The only way of passing arguments to the parser is through functor arguments. (Or through global state, of course.) Why is that a problem? > Ocamlyacc usesthe typing > > val parser: (lexbuf->token) -> lexbuf -> 'a > > which is just bad. A better signature is > > val parser: ( unit -> token ) -> 'a > > There is no need to provide location information: the correct > solution is to throw an exception, which is caught in a > context which can determine the location. It can be nice to have the parser automatically extract position information for you; doing it manually and explicitly within semantic actions would be quite tedious. On the other hand, the signature that you suggest would allow using lexers that do not necessarily conform to the lexbuf interface; is that your point? If so, we will consider adding a command line switch, as you suggest. It would disallow the use of the position keywords ($startpos, $endpos, etc.) > 3. I have doubts about the claim that parsers can 'share' > token types. I do not see how this is possible. It certainly is possible: have a look at demos/calc-two in the distribution. The example is artificial but illustrates the possibility of sharing a token type. > Actually, I personally find the 'yacc' technique of > generating tokens to be rather lame. Felix does this > much better -- the parser simply expects a token type > which is a variant, the type can be defined wherever > you like. With Menhir, the token type can also be hand-written or generated via other means. It doesn't have to be generated by Menhir. It has to be an algebraic data type, though, not a polymorphic variant. > %import_tokens "filename" This is called --external-tokens Filename. It's a command line switch. > 4. Just curious, but how practical is LR(1) in terms of > generated code sizes? Felix is using Elkhound as its > parser which is a GLR parser with an LALR(1) core. In theory > there is an option for choosing the core automaton, which > also allows LR(1) however I recall Scott McPeak commenting > it wasn't worth supporting because it generated tables > which were far too big. There are two questions: how big is the automaton compared to an LALR automaton? and how big is the code that implements the automaton? The answer to the first question is, in most cases, the grammar is LALR(1) anyway, and the automaton generated by Menhir is identical to the one that would be produced by ocamlyacc or by some other LALR(1) parser generator. When the grammar is not LALR(1), the LR(1) automaton has more states than an LALR(1) automaton for the same grammar. The number of extra states is usually small (often just one, or a handful). So there's no concern here. Considering the answer to the first question, the second question would not arise at all if Menhir produced tables, like ocamlyacc. However, Menhir does not produce tables; it compiles the automaton down to a bunch of mutually recursive functions. We have not yet scientifically assessed the difference in size between tables and code, but a few quick experiments indicate that it is reasonable (the code is larger than the tables by a factor of two or three). -- François Pottier Francois.Pottier@inria.fr http://pauillac.inria.fr/~fpottier/