From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id LAA26840; Tue, 3 Dec 2002 11:02:03 +0100 (MET)
X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id LAA26767 for <caml-list@pauillac.inria.fr>; Tue, 3 Dec 2002 11:02:01 +0100 (MET)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by concorde.inria.fr (8.11.1/8.11.1) with ESMTP id gB3A1sX18142;
	Tue, 3 Dec 2002 11:01:54 +0100 (MET)
Received: (from xleroy@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id LAA26111; Tue, 3 Dec 2002 11:01:53 +0100 (MET)
Date: Tue, 3 Dec 2002 11:01:53 +0100
From: Xavier Leroy <xavier.leroy@inria.fr>
To: Caml List <caml-list@inria.fr>, ostruszk@order.if.uj.edu.pl
Subject: Re: [Caml-list] yacc & lex: Does the contructor name matter?
Message-ID: <20021203110153.B7512@pauillac.inria.fr>
References: <20021202095557.A20380@order.if.uj.edu.pl>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: <20021202095557.A20380@order.if.uj.edu.pl>; from ostruszk@order.if.uj.edu.pl on Mon, Dec 02, 2002 at 09:55:57AM +0100
Sender: owner-caml-list@pauillac.inria.fr
Precedence: bulk

> 1. Does the constructor name matter in yacc and lex?
> 
> I've suddenly wanted to have an interpreter (I'm rather numeric guy so
> this comes to me as a surprise :)) so as you probably guess I started
> with a calculator :).  The problem is that I don't understand behaviour
> of the following implementation -- if I change the constructor name EoL
> into EOF then everything works OK but otherwise I have a "parse error".
> Could someone enlighten me?  If the grammar (line rule especially) is
> incorrect then I should always get "parse error", if it is correct then
> it should work no matter the name of the constructor --- at least this
> is what I think.

The EOF token is indeed special, in the sense that it is the only
valid token for the lookahead of a grammar entry point.
To understand why in full detail, you need to look at the .output
file generated by ocamlyacc -v.  But here is a simplified account.
The actual grammar that ocamlyacc compiles down to a pushdown
automaton has some additional productions:

$accept: %entry EOF       { should not reduce }
%entry:  '\001' line      { stop parsing with result $2 }
line:    expr             { Some $1 }
      | /*nothing*/       { None }
...

where "$accept" is the real entry point and EOF is either a
user-provided token with that name, or if none was provided,
an internally-generated token distinct from any other.

(The %entry and '\001' business is used to select between multiple
entry points; you can ignore it for the purposes of this explanation.)

Here, the parsing automaton cannot reduce "line" by the "nothing"
production (thus terminating the parsing) without looking at the next
token, which is EoL in your case.  However, EoL and EOF are distinct
tokens, so the input doesn't match the grammar, and a parse error is
generated.

The best way to avoid this difficulty is to make sure that all your
grammar entry points are explicitly terminated by a token, as shown
in the example in section 12.6 of the manual:

    %start line

    %%
    line:
      | expr EoL { Some $1 }
      | /* nothing */ EoL { None }
    ...

Here, the parsing automaton will reduce "line" by the correct rule
just by seeing the EoL token; no test "is the next token EOF?" will
ever be performed.

The moral of this story is thus: always terminate your entry points.

> 2. My second question is what is the most efficient representation of
> the compiled expression.  I don't care how long (with some reasonable
> limits :)) it will take to compile but I want the result be the fastest
> afterwards.  The only representations I can come up with are:
> - keep AST and recursively pattern match on it
> - compile AST to stack like evaluator: there's stack of the results and
>   a list of functions that perform operations on it: push_var,
>   push_const, plus, minus etc... and the result is left as the only
>   value on the stack after applying all functions in the list
> - compile AST to list of functions which will take in arguments the
>   operands and the "pointer" of the place where to store the result.
>   I don't have now a clear idea how to do it (I still think in C :)) but
>   the motivation is to remove pushing functions and stack management of
>   the previous solution

I don't understand your third option.  My advice would be to go for
the simplest, clearest solution, which is the first you mentioned.
Your second solution is closer to a bytecode interpreter (a virtual
machine), but to make it significantly more efficient than the first,
you'd need to go further: encode instructions as a datatype, not as
functions; represent the program by arrays or strings rather than
lists of instructions; etc.  Make sure it's worth the effort before
embarking on this task :-)
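
To make the comparison concrete, here is a minimal sketch of both
approaches for a four-operation calculator.  Everything in it is an
assumption made for illustration, not taken from your code: the type
and constructor names (expr, instr, Push, Load, ...), the use of floats,
the association-list environment, and the "slot" function that maps a
variable name to an array index.

```ocaml
(* Option 1: keep the AST and evaluate it by recursive pattern matching. *)
type expr =
  | Const of float
  | Var of string
  | Add of expr * expr
  | Sub of expr * expr
  | Mul of expr * expr
  | Div of expr * expr

(* [env] is an association list from variable names to values;
   List.assoc raises Not_found on an unbound variable. *)
let rec eval env = function
  | Const c -> c
  | Var x -> List.assoc x env
  | Add (a, b) -> eval env a +. eval env b
  | Sub (a, b) -> eval env a -. eval env b
  | Mul (a, b) -> eval env a *. eval env b
  | Div (a, b) -> eval env a /. eval env b

(* Option 2, taken further: instructions are a datatype (not closures)
   and the program is an array (not a list). *)
type instr =
  | Push of float
  | Load of int            (* index into the array of variable values *)
  | IAdd | ISub | IMul | IDiv

(* Compile to postfix code.  [slot] maps a variable name to its index.
   The accumulator holds the code in reverse order. *)
let compile slot e =
  let rec go acc = function
    | Const c -> Push c :: acc
    | Var x -> Load (slot x) :: acc
    | Add (a, b) -> IAdd :: go (go acc a) b
    | Sub (a, b) -> ISub :: go (go acc a) b
    | Mul (a, b) -> IMul :: go (go acc a) b
    | Div (a, b) -> IDiv :: go (go acc a) b
  in
  Array.of_list (List.rev (go [] e))

(* Run the code on a preallocated stack.  The stack never needs more
   cells than there are instructions. *)
let run code vars =
  let stack = Array.make (Array.length code) 0.0 in
  let sp = ref 0 in
  let push v = stack.(!sp) <- v; incr sp in
  let binop f =
    let b = stack.(!sp - 1) and a = stack.(!sp - 2) in
    sp := !sp - 2;
    push (f a b)
  in
  Array.iter
    (function
      | Push c -> push c
      | Load i -> push vars.(i)
      | IAdd -> binop ( +. )
      | ISub -> binop ( -. )
      | IMul -> binop ( *. )
      | IDiv -> binop ( /. ))
    code;
  stack.(0)

(* Example expression: x + 2 * 3 *)
let example = Add (Var "x", Mul (Const 2.0, Const 3.0))
```

With these definitions, "eval [("x", 1.0)] example" and
"run (compile (fun _ -> 0) example) [| 1.0 |]" compute the same value.
Whether the second version is actually faster than the first is
something to measure on your own expressions, not to assume.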

- Xavier Leroy