caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* Yet another yacc question
@ 2007-05-23 15:09 David Allsopp
  2007-05-24 13:24 ` [Caml-list] " Francois Pottier
  0 siblings, 1 reply; 2+ messages in thread
From: David Allsopp @ 2007-05-23 15:09 UTC (permalink / raw)
  To: OCaml List

Suppose I have the context-free grammar

P -> A | A B | A B C | A C

In ocamlyacc, I code this up as:

%token A B C
%type <unit> parse
%start parse
%%

parse:
  A     {()}
| A B   {()}
| A B C {()}
| A C   {()}
;

And set-up a lexer (called lexer) so that the characters 'A', 'B' and 'C'
produce the tokens A, B and C. Then I write the following function:

let f s = parse lexer (Lexing.from_string s)

And use it a few times...

f "ABCZ"   ...   gives ()
f "ACZ"    ...   gives ()
f "AA"     ...   raises Parsing.Parse_error

The third case fails because "AA" is not in the grammar. However, the first
two work even though "ABCZ" and "ACZ" are also not in the grammar (and Z
isn't even a token!). They work because ocamlyacc doesn't need look-ahead
after the "C" in each case to determine that it can reduce to the entry
non-terminal and so return (). In the third case, look-ahead is required -
it looks ahead, sees an A and so fails.

I would quite like the third to match as well and ignore the second A
(ignore and leave on the buffer ready for a future parse... so "peek-ahead"
rather than "look-ahead", I guess). I think I'm probably right in assuming
that ocamlyacc can't do this. I'm not willing to alter my parser to return a
list of tokens which as far as I can see is the only way to make ocamlyacc
do this correctly - i.e.

parse:
  token parse {$1::$2}
| EOF {[]}
;

token:
  /as for parse in the previous grammar/ 

(Incidentally, lest anyone have it confirmed that I'm mad, I'm trying to
parse batches of SQL statements so have no obvious terminating token for a
clause - the parser needs to do a longest possible match ignoring anything
else following that would appear to be a syntax error)

So my question: can menhir, dypgen or any of the other parser generators out
there do this - i.e. return one () on the first call and then another () on
the second with the string "AA"? It would finally be a reason for abandoning
ocamlyacc :o)

Thanks! (in hope that I haven't missed something blindingly obvious...)


David


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [Caml-list] Yet another yacc question
  2007-05-23 15:09 Yet another yacc question David Allsopp
@ 2007-05-24 13:24 ` Francois Pottier
  0 siblings, 0 replies; 2+ messages in thread
From: Francois Pottier @ 2007-05-24 13:24 UTC (permalink / raw)
  To: David Allsopp; +Cc: OCaml List


Hello,

On Wed, May 23, 2007 at 04:09:58PM +0100, David Allsopp wrote:

> The third case fails because "AA" is not in the grammar. However, the first
> two work even though "ABCZ" and "ACZ" are also not in the grammar (and Z
> isn't even a token!). They work because ocamlyacc doesn't need look-ahead
> after the "C" in each case to determine that it can reduce to the entry
> non-terminal and so return (). In the third case, look-ahead is required -
> it looks ahead, sees an A and so fails.

Your analysis is correct.

> I would quite like the third to match as well and ignore the second A
> (ignore and leave on the buffer ready for a future parse... so "peek-ahead"
> rather than "look-ahead", I guess).

It's hard to ignore the second A once one has requested it... The lexer
interface does not have a "peek" or "put-back" operation, and LR parsers
aren't designed to support these operations anyway.

> So my question: can menhir, dypgen or any of the other parser generators out
> there do this - i.e. return one () on the first call and then another () on
> the second with the string "AA"? It would finally be a reason for abandoning
> ocamlyacc :o)

In theory, menhir adopts the same semantics as ocamlyacc, but it will attempt
to help you by providing more warnings -- here, it will complain that your
grammar has an end-of-stream conflict.

> (Incidentally, lest anyone have it confirmed that I'm mad, I'm trying to
> parse batches of SQL statements so have no obvious terminating token for a
> clause - the parser needs to do a longest possible match ignoring anything
> else following that would appear to be a syntax error)

If you want to parse sequences of statements without clear separators, it
seems to me that the best approach would be to invoke the parser just once for
an entire sequence, instead of trying to invoke it once per statement and have
it leave the token stream in a consistent state. This approach would require
an inversion of control (the parser would drive the rest of your code, instead
of the code invoking the parser). If you can accept that, you should be fine.

Hope this helps,

-- 
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2007-05-24 13:24 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-23 15:09 Yet another yacc question David Allsopp
2007-05-24 13:24 ` [Caml-list] " Francois Pottier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).