Re: [Caml-list] Ocamlyacc vs stream parser - Diego Olivier Fernandez Pons

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: Diego Olivier Fernandez Pons <Diego.FERNANDEZ_PONS@etu.upmc.fr>
To: caml-list@inria.fr
Subject: Re: [Caml-list] Ocamlyacc vs stream parser
Date: Thu, 5 Jun 2003 16:02:41 +0200 (DST)	[thread overview]
Message-ID: <Pine.A41.4.44.0306051548260.925930-100000@ibm1.cicrp.jussieu.fr> (raw)

    Bonjour,

Michal Moskal a écrit :
> Oh, so by proof-by-book you're right :-) But in practice LARL(1)
> seems more usefull for parsing, at parsing least programming
> languages.

It is easier to be right when one uses up-to-date books. Even if it is
an excellent book, the Dragon is almost 20 years old, which in
computer science is quite a lot.

Anyway, that is not the main point.

The question is 'are LR parsers really more powerfull/difficult to
implement/ difficult to hack/difficult to understand... than LL
parsers' ? And why ?

There are three fundamental parts in a parser :
- the control automaton
- the desambiguisation technique
- the non-determinism handling system

LL parsers are traditionaly compiled to push-down automata whereas LR
parsers usualy use LR(0) automata. In the first case the stack is
explicit and transitions have the form (q, E, t, E', q') which means q
-> q' is possible if (E, t) is the current stack * string, and if the
transition is taken, (E, t) will be removed and (E', epsilon) pushed.
In the second case the stack is implicit and contains the list of the
previously visited nodes. When a transition (q, t, q') is taken, q is
pushed in the stack. When a reduction takes place A -> abc, 3 symbols
of the stack are popped.

This may seem a big difference, but both stack-automata are
equivalent. You can transform one into the other and then write a LL
parser with a LR(0) control automata, and conversely a pda LR parser.
In fact, the Dragon book (chapter 4.4 of the french edition at least)
shows a LL parser controled by a LR(0) automaton. Wim Pijls wrote in
1993 a paper titled 'Unifying LL and LR parsing' in which he shows
that 'traversing the LR(0) automaton in a special way is equivalent to
LL(1) parsing'. (More recent papers 'LR, LC and LL parsing, some new
points of view'. All papers available on citeseer)

Then, LR and LL have the same (theoretical) power, which is the
ability of parsing algebraic (ie context-free) grammars.

Of course... but conflicts ?

When you clearly separate the control automaton and the
desambiguisation mechanism, you notice that all usual algorithms
(LL(k), LR(k), LALR(k), SLR(k)) use the same principle : approximating
algebraic languages by rational ones.

Let's take an LL conflict example :

sentence to parse 'abaa'

S -> aX | aY | cZ
X -> b...
Y -> c...
Z -> d...

Which one to chose, X or Y ?

For an LR algorithm,

a shift/reduce conflict

X -> a
Y -> aa

a reduce/reduce conflict

X -> a
Y -> a

Every non-terminal X in a grammar generates a context-free language
L(X) which can be approximated (upper bound) by a rational language
R(X). The idea is that checking if a sentence can be derivated from a
rational language is linear while it is cubic for the algebraic case.

(LL example)

L(X) is included in aE*
L(Y) is included in bE*

When your entery is ab... if it can be derived by the grammar, the
only way is to be derived by X (S -> aX -> ab...)

Both LL(k) and LR(k) use a k-prefix rational approximation technique.
That is why their tables occupate so much space (the cartesian product
of the control automaton and a k-trie). LALR and SLR follow the same
idea but merge some states according to some given rules.

Then, there is no reason for LR, LALR and SLR to be more conceptually
difficult than LL.

Notice that TAG (Tree Adjoint Grammars) and HPSG (Head driven Phrase
Structure Grammar) try to compute more discriminating approximations
using the properties of some particular terminals (named 'foot' in the
first, 'head' in the latter).

The difference of power can be explained by a few simple arguments :
- LL(1) should be compared to LR(0), not to LR(1)
- top-down algorithms elagate, bottom-up algorithms memorize. But the
approximation is essentially an elagation technique. Then, LR take
more advantage of this orthogonality. But the same power could be
achieved for LL parsers if they were added memorization (LL + CPS,
...).

Finally, the non-determinism handling question... There are some
conflicts you just cannot solve (otherwhise you would have proven
rational = algebraic). There are two techniques : breadth first search
(GLR) and depth-first search (backtracking). Both require memorization
not to compute several times the same thing.

Of course, Masaru Tomita named in 1986 his parsing technique (using
ideas of Bernard Lang in 1974, theirselves taked from Jan Earley 1970)
General LR. But it could have been General LL too, since graph
traversals are generic algorithms.

Brian Rogoff a écrit :
> I happen to think that recursive descent is the best way to write
> parsers, but note that recursive descent parsers are capable of
> parsing non-LL(1) grammars, even without the fairly obvious hacks.

Michal Moskal a écrit :
> But keyword parser build with camlp4 libraries can be modified at
> runtime

The ease of implementation is another classical discussion. The
parsing algorithm (ascending or descending) is orthogonal to its
recursive functions/table implementation or the static/dynamic data
structures used.

Most of the people think 'huge tables' as soon as they hear LR(1).
But this is only the case if you make the two following design choices
- collapsing the control automaton and the desambiguisation automata
- representing graphs by tables

You can of course implement a LR(0) automaton with recursive functions
(recursive ascent), using function calls and exceptions (or any
equivalent technique...). LR(0) has in general a quite moderated
number of states.

You can represent all your graphs (stack or desambiguisation automata)
with purely functional data structures, leading to a parser which can
be modified at runtime.

More over, separating correctly the three parts of your parsers allow
you to use different representation/optimisation techniques for every
element :  non-deterministic bit atomata for desambiguisation automata
of less than 31 states, matrix representation for dense graphs, ...
instead of one monolithic 'table compression' technique. And all this
algorithms can be hidden behind a library, not to be seen by the
low-level students, casual users, common programmers...

Last point, as said by Luc Maranget, exhaustiveness can be computed on
pda or lr(0) automata. Not having anything to do with conflict
resolution, there is no need to work on the desambiguisation automata.

        Diego Olivier

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners

next             reply	other threads:[~2003-06-05 14:03 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-06-05 14:02 Diego Olivier Fernandez Pons [this message]
2003-06-10  9:03 ` Diego Olivier Fernandez Pons
  -- strict thread matches above, loose matches on Subject: below --
2003-05-23  9:07 Lukasz Lew
2003-05-23  9:26 ` Michal Moskal
2003-05-26 12:28   ` Damien Doligez
2003-05-27 21:12   ` Pierre Weis
2003-05-28  9:20     ` Michal Moskal
2003-05-28  9:37       ` Diego Olivier Fernandez Pons
2003-05-28 10:24         ` Michal Moskal
2003-05-28 15:45           ` brogoff
2003-05-28 20:34     ` Alain.Frisch
2003-05-28 10:48   ` Anton Moscal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.A41.4.44.0306051548260.925930-100000@ibm1.cicrp.jussieu.fr \
    --to=diego.fernandez_pons@etu.upmc.fr \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).