From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <fpottier@yquem.inria.fr>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: by yquem.inria.fr (Postfix, from userid 18965)
	id 17862BB81; Wed, 14 Dec 2005 10:04:27 +0100 (CET)
Date: Wed, 14 Dec 2005 10:04:27 +0100
From: Francois Pottier <Francois.Pottier@inria.fr>
To: skaller <skaller@users.sourceforge.net>
Cc: Caml Mailing List <caml-list@yquem.inria.fr>
Subject: Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
Message-ID: <20051214090427.GB1330@yquem.inria.fr>
Reply-To: Francois.Pottier@inria.fr
References: <20051212175838.GA8502@yquem.inria.fr> <aee06c9e0512131307k3fc494a5k3591d549d552f1b@mail.gmail.com> <1134540495.8980.63.camel@rosella>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1134540495.8980.63.camel@rosella>
User-Agent: Mutt/1.5.9i
X-Spam: no; 0.00; caml-list:01 parser:01 ocaml:01 distro:01 ocaml:01 parser:01 functor:01 ocamlyacc:01 val:01 token:01 lexbuf:01 val:01 token:01 lexers:01 lexbuf:01 
X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED 
	autolearn=disabled version=3.0.3


Hello,

On Wed, Dec 14, 2005 at 05:08:15PM +1100, skaller wrote:

> 0. The licence. Q public licence for the generator????
> Please NO NO NO!! Not unless it is distributed
> as part of the official distro. Is there any chance of that?

Our plan (not yet approved by the OCaml higher authorities)
is to make it part of the OCaml distribution.

> In particular, can the parser be generated *inside*
> an environment such a function or let binding?

No. The only way of passing arguments to the parser is through
functor arguments. (Or through global state, of course.) Why is
that a problem?

> Ocamlyacc usesthe typing
> 
> 	val parser: (lexbuf->token) -> lexbuf -> 'a
> 
> which is just bad. A better signature is
> 
> 	val parser: ( unit -> token ) -> 'a
> 
> There is no need to provide location information: the correct
> solution is to throw an exception, which is caught in a 
> context which can determine the location.

It can be nice to have the parser automatically extract position information
for you; doing it manually and explicitly within semantic actions would be
quite tedious.

On the other hand, the signature that you suggest would allow using lexers
that do not necessarily conform to the lexbuf interface; is that your point?
If so, we will consider adding a command line switch, as you suggest. It
would disallow the use of the position keywords ($startpos, $endpos, etc.)

> 3. I have doubts about the claim that parsers can 'share'
> token types. I do not see how this is possible. 

It certainly is possible: have a look at demos/calc-two in the
distribution. The example is artificial but illustrates the
possibility of sharing a token type.

> Actually, I personally find the 'yacc' technique of
> generating tokens to be rather lame. Felix does this
> much better -- the parser simply expects a token type
> which is a variant, the type can be defined wherever
> you like.

With Menhir, the token type can also be hand-written or generated via other
means. It doesn't have to be generated by Menhir. It has to be an algebraic
data type, though, not a polymorphic variant.

> %import_tokens "filename"

This is called --external-tokens Filename. It's a command line switch.

> 4. Just curious, but how practical is LR(1) in terms of
> generated code sizes? Felix is using Elkhound as its 
> parser which is a GLR parser with an LALR(1) core. In theory
> there is an option for choosing the core automaton, which
> also allows LR(1) however I recall Scott McPeak commenting
> it wasn't worth supporting because it generated tables
> which were far too big. 

There are two questions: how big is the automaton compared to
an LALR automaton? and how big is the code that implements the
automaton?

The answer to the first question is, in most cases, the grammar
is LALR(1) anyway, and the automaton generated by Menhir is
identical to the one that would be produced by ocamlyacc or
by some other LALR(1) parser generator. When the grammar is
not LALR(1), the LR(1) automaton has more states than an
LALR(1) automaton for the same grammar. The number of extra
states is usually small (often just one, or a handful). So
there's no concern here.

Considering the answer to the first question, the second
question would not arise at all if Menhir produced tables, like
ocamlyacc. However, Menhir does not produce tables; it
compiles the automaton down to a bunch of mutually recursive
functions. We have not yet scientifically assessed the
difference in size between tables and code, but a few
quick experiments indicate that it is reasonable (the
code is larger than the tables by a factor of two or
three).

-- 
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/