Hello,

On Wed, May 02, 2007 at 06:41:44PM +1000, skaller wrote:
> Exactly. In Ocamlyacc it is named 'eof', and you can use that
> token in your productions.

As far as I know, this is incorrect. ocamlyacc does not have a predefined
eof token. Perhaps you are thinking of ocamllex, which has an eof pattern.

> compilation_unit:
>   | statement_aster ENDMARKER { $1 }
> 
> where no non-terminal on which statements_aster depends 
> has a production containing ENDMARKER or # (eof).
> 
> Therefore, there is no conflict. When compilation_unit
> is reduced the parser returns, the next token, whether
> it is # or any other, is irrelevant.

Good. I seem to agree with you. Menhir should not report an end-of-stream
conflict here. So, what does it report?

> If you are augmenting the grammar to be:
> 
> compilation_unit:
> 	| statement_aster ENDMARKER eof { $1 }

No, actually, I am augmenting the grammar with the production
compilation_unit' -> compilation_unit, and using
compilation_unit' -> compilation_unit { # } as my initial LR(1)
item in the construction of the automaton.

> IMHO: the end-of-stream thing is a serious bug in the
> design of Ocamlyacc which unfortunately Menhir duplicates.

I believe ocamlyacc's behavior is reasonable but subtle and undocumented...
Menhir attempts to reproduce that behavior, while providing more guidance in
the form of end-of-stream conflict reports... That said, I could be missing a
point. What is the bug?

> In particular there should be no possibility of the parser
> consulting the lexbuf, the signature of parsers is very
> wrong:
> 
> 	parser: (lexbuf -> token) -> lexbuf -> ast
> 
> is a disaster. Because of this you're duplicating the
> disaster and probing the lexbuf for end of stream,
> which is just wrong. 
> 
> There should be no implicit end of stream: a stream
> is an INFINITE sequence, a parser should process the
> head of the stream, and the signature should be:
> 
> 	parser: (state -> token) -> state -> ast
> 
> where state is a USER defined type. If the user provides
> no state argument, then it defaults to unit.

I believe this is a separate issue. You are right in saying that the historic
signature, which involves lexbuf, is dubious. Following your suggestion, we
could just as well use

  parser: (state -> token * state) -> state -> ast * state

if we wish to promote a purely functional style (where values of type
state are immutable), or just

  parser: (unit -> token) -> ast

if we are willing to accept mutable state. (I am sweeping the issue of
locations under the rug; we should use token * location instead of just
token.)

That said, the historic signature 

  parser: (lexbuf -> token) -> lexbuf -> ast

is really equivalent to the previous one, in the sense that I can write
functions that convert between the two styles (see attached file).

> The point again is that the token input to the parser is infinite: it can't
> ever be an error to read a next token.

I beg to disagree. First, the input stream does not have to be infinite: if
I am reading from a file, clearly it is finite. Second, regardless of whether
the stream is finite or infinite, it *is* an error to read more tokens than
you were supposed to. If the grammar's start symbol is S, then the parser
should read a sequence of tokens that derives from S, and nothing more; it
should not overshoot and consume the first token that follows.
 
> Anyhow the problem seems to reduce to this: I think
> a parser should read the head of a stream, and Menhir
> thinks it should read the whole stream, and it uses
> the bugged lexbuf interface to test for end of stream.

As I just said in the previous paragraph, we really are in agreement: a parser
should read only a prefix of the stream.

As I said earlier, a Menhir-generated parser does not test for the physical
end of stream. It relies on the lexbuf interface only to (i) read the next
token and (ii) access the location of the current token.

> So I could fix my grammar by adding 'end-of-stream'
> to my start symbols AND hack the otherwise unused
> lexbuf to show end-of-stream after ENDMARKER is read.

Hacking the dynamic behavior of your lexbuf won't make any difference, since
end-of-stream conflicts are reported at compile time! Ponder that.

The only way of avoiding these conflicts is to change your grammar somehow.
But I still haven't understood what causes these conflicts in your grammar.
Perhaps it would be time to show it?

> Ocamlyacc doesn't complain though, suggesting it processes
> only the head of a stream .. and that there's a fundamental
> design distinction between the two parser generators.

ocamlyacc never complains. It just trusts you to know what you are doing.

-- 
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/