Re: [Caml-list] menhir - Francois Pottier

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: Francois Pottier <Francois.Pottier@inria.fr>
To: skaller <skaller@users.sourceforge.net>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] menhir
Date: Wed, 2 May 2007 14:30:58 +0200	[thread overview]
Message-ID: <20070502123058.GC21560@yquem.inria.fr> (raw)
In-Reply-To: <1178095304.13660.71.camel@rosella.wigram>

[-- Attachment #1: Type: text/plain, Size: 4788 bytes --]

Hello,

On Wed, May 02, 2007 at 06:41:44PM +1000, skaller wrote:
> Exactly. In Ocamlyacc it is named 'eof', and you can use that
> token in your productions.

As far as I know, this is incorrect. ocamlyacc does not have a predefined
eof token. Perhaps you are thinking of ocamllex, which has an eof pattern.

> compilation_unit:
>   | statement_aster ENDMARKER { $1 }
> 
> where no non-terminal on which statements_aster depends 
> has a production containing ENDMARKER or # (eof).
> 
> Therefore, there is no conflict. When compilation_unit
> is reduced the parser returns, the next token, whether
> it is # or any other, is irrelevant.

Good. I seem to agree with you. Menhir should not report an end-of-stream
conflict here. So, what does it report?

> If you are augmenting the grammar to be:
> 
> compilation_unit:
> 	| statement_aster ENDMARKER eof { $1 }

No, actually, I am augmenting the grammar with the production
compilation_unit' -> compilation_unit, and using
compilation_unit' -> compilation_unit { # } as my initial LR(1)
item in the construction of the automaton.

> IMHO: the end-of-stream thing is a serious bug in the
> design of Ocamlyacc which unfortunately Menhir duplicates.

I believe ocamlyacc's behavior is reasonable but subtle and undocumented...
Menhir attempts to reproduce that behavior, while providing more guidance in
the form of end-of-stream conflict reports... That said, I could be missing a
point. What is the bug?

> In particular there should be no possibility of the parser
> consulting the lexbuf, the signature of parsers is very
> wrong:
> 
> 	parser: (lexbuf -> token) -> lexbuf -> ast
> 
> is a disaster. Because of this you're duplicating the
> disaster and probing the lexbuf for end of stream,
> which is just wrong. 
> 
> There should be no implicit end of stream: a stream
> is an INFINITE sequence, a parser should process the
> head of the stream, and the signature should be:
> 
> 	parser: (state -> token) -> state -> ast
> 
> where state is a USER defined type. If the user provides
> no state argument, then it defaults to unit.

I believe this is a separate issue. You are right in saying that the historic
signature, which involves lexbuf, is dubious. Following your suggestion, we
could just as well use

  parser: (state -> token * state) -> state -> ast * state

if we wish to promote a purely functional style (where values of type
state are immutable), or just

  parser: (unit -> token) -> ast

if we are willing to accept mutable state. (I am sweeping the issue of
locations under the rug; we should use token * location instead of just
token.)

That said, the historic signature 

  parser: (lexbuf -> token) -> lexbuf -> ast

is really equivalent to the previous one, in the sense that I can write
functions that convert between the two styles (see attached file).

> The point again is that the token input to the parser is infinite: it can't
> ever be an error to read a next token.

I beg to disagree. First, the input stream does not have to be infinite: if
I am reading from a file, clearly it is finite. Second, regardless of whether
the stream is finite or infinite, it *is* an error to read more tokens than
you were supposed to. If the grammar's start symbol is S, then the parser
should read a sequence of tokens that derives from S, and nothing more; it
should not overshoot and consume the first token that follows.

> Anyhow the problem seems to reduce to this: I think
> a parser should read the head of a stream, and Menhir
> thinks it should read the whole stream, and it uses
> the bugged lexbuf interface to test for end of stream.

As I just said in the previous paragraph, we really are in agreement: a parser
should read only a prefix of the stream.

As I said earlier, a Menhir-generated parser does not test for the physical
end of stream. It relies on the lexbuf interface only to (i) read the next
token and (ii) access the location of the current token.

> So I could fix my grammar by adding 'end-of-stream'
> to my start symbols AND hack the otherwise unused
> lexbuf to show end-of-stream after ENDMARKER is read.

Hacking the dynamic behavior of your lexbuf won't make any difference, since
end-of-stream conflicts are reported at compile time! Ponder that.

The only way of avoiding these conflicts is to change your grammar somehow.
But I still haven't understood what causes these conflicts in your grammar.
Perhaps it would be time to show it?

> Ocamlyacc doesn't complain though, suggesting it processes
> only the head of a stream .. and that there's a fundamental
> design distinction between the two parser generators.

ocamlyacc never complains. It just trusts you to know what you are doing.

-- 
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/

[-- Attachment #2: convert.ml --]
[-- Type: text/plain, Size: 580 bytes --]

open Lexing

type ('token, 'ast) historic_parser =
    (lexbuf -> 'token) -> lexbuf -> 'ast

type ('token, 'ast) revised_parser =
    (unit -> 'token) -> 'ast

let upgrade (p : ('token, 'ast) historic_parser) : ('token, 'ast) revised_parser =
  fun (next : unit -> 'token) ->
    let (* dummy *) lexbuf =
      Lexing.from_string ""
    and lexer lexbuf =
      next()
    in
    p lexer lexbuf

let downgrade (p : ('token, 'ast) revised_parser) : ('token, 'ast) historic_parser =
  fun (lexer : lexbuf -> 'token) lexbuf ->
    let next () =
      lexer lexbuf
    in
    p next

next prev parent reply	other threads:[~2007-05-02 12:31 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-04-28 10:32 menhir skaller
2007-04-28 16:50 ` [Caml-list] menhir Francois Pottier
2007-04-28 19:47   ` Markus Mottl
2007-04-28 21:15     ` Jon Harrop
2007-04-29  4:43   ` skaller
2007-04-29  7:27     ` Christophe Raffalli
2007-05-01 15:57     ` Francois Pottier
2007-05-01 17:11       ` skaller
2007-05-01 17:34         ` Francois Pottier
2007-05-01 23:42           ` skaller
2007-05-02  5:38             ` Francois Pottier
2007-05-02  5:50               ` Francois Pottier
2007-05-02  8:41               ` skaller
2007-05-02 12:30                 ` Francois Pottier [this message]
2007-05-02 16:29                   ` skaller
2007-05-02 18:35                     ` Francois Pottier
2007-05-03  1:30                       ` skaller
2007-05-03  8:43                         ` Joel Reymont
2007-05-01 17:15       ` skaller
2007-05-01 17:31         ` Francois Pottier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070502123058.GC21560@yquem.inria.fr \
    --to=francois.pottier@inria.fr \
    --cc=caml-list@inria.fr \
    --cc=skaller@users.sourceforge.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).