From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id QAA30873; Thu, 5 Jun 2003 16:03:35 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id QAA30810 for ; Thu, 5 Jun 2003 16:03:33 +0200 (MET DST) Received: from shiva.jussieu.fr (shiva.jussieu.fr [134.157.0.129]) by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id h55E3XT21098 for ; Thu, 5 Jun 2003 16:03:33 +0200 (MET DST) Received: from ibm3.cicrp.jussieu.fr (ibm3.cicrp.jussieu.fr [134.157.15.3]) by shiva.jussieu.fr (8.12.9/jtpda-5.4) with ESMTP id h55E3WZA054266 for ; Thu, 5 Jun 2003 16:03:32 +0200 (CEST) Received: from ibm1.cicrp.jussieu.fr (ibm1.cicrp.jussieu.fr [134.157.15.1]) by ibm3.cicrp.jussieu.fr (8.8.8/jtpda/mob-V8) with ESMTP id QAA28034 for ; Thu, 5 Jun 2003 16:02:43 +0200 Received: from localhost (fernande@localhost) by ibm1.cicrp.jussieu.fr (8.8.8/jtpda/mob-v8) with ESMTP id QAA1720572 for ; Thu, 5 Jun 2003 16:02:42 +0200 Date: Thu, 5 Jun 2003 16:02:41 +0200 (DST) From: Diego Olivier Fernandez Pons To: caml-list@inria.fr Subject: Re: [Caml-list] Ocamlyacc vs stream parser Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE X-Antivirus: scanned by sophie at shiva.jussieu.fr X-Spam: no; 0.00; pons:01 etu:99 caml-list:01 michal:01 moskal:01 powerfull:01 transitions:01 citeseer:01 context-free:99 lalr:01 conceptually:01 memorize:01 generic:01 descent:01 hacks:01 Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk Bonjour, Michal Moskal a =E9crit : > Oh, so by proof-by-book you're right :-) But in practice LARL(1) > seems more usefull for parsing, at parsing least programming > languages. It is easier to be right when one uses up-to-date books. Even if it is an excellent book, the Dragon is almost 20 years old, which in computer science is quite a lot. Anyway, that is not the main point. The question is 'are LR parsers really more powerfull/difficult to implement/ difficult to hack/difficult to understand... than LL parsers' ? And why ? There are three fundamental parts in a parser : - the control automaton - the desambiguisation technique - the non-determinism handling system LL parsers are traditionaly compiled to push-down automata whereas LR parsers usualy use LR(0) automata. In the first case the stack is explicit and transitions have the form (q, E, t, E', q') which means q -> q' is possible if (E, t) is the current stack * string, and if the transition is taken, (E, t) will be removed and (E', epsilon) pushed. In the second case the stack is implicit and contains the list of the previously visited nodes. When a transition (q, t, q') is taken, q is pushed in the stack. When a reduction takes place A -> abc, 3 symbols of the stack are popped. This may seem a big difference, but both stack-automata are equivalent. You can transform one into the other and then write a LL parser with a LR(0) control automata, and conversely a pda LR parser. In fact, the Dragon book (chapter 4.4 of the french edition at least) shows a LL parser controled by a LR(0) automaton. Wim Pijls wrote in 1993 a paper titled 'Unifying LL and LR parsing' in which he shows that 'traversing the LR(0) automaton in a special way is equivalent to LL(1) parsing'. (More recent papers 'LR, LC and LL parsing, some new points of view'. All papers available on citeseer) Then, LR and LL have the same (theoretical) power, which is the ability of parsing algebraic (ie context-free) grammars. Of course... but conflicts ? When you clearly separate the control automaton and the desambiguisation mechanism, you notice that all usual algorithms (LL(k), LR(k), LALR(k), SLR(k)) use the same principle : approximating algebraic languages by rational ones. Let's take an LL conflict example : sentence to parse 'abaa' S -> aX | aY | cZ X -> b... Y -> c... Z -> d... Which one to chose, X or Y ? For an LR algorithm, a shift/reduce conflict X -> a Y -> aa a reduce/reduce conflict X -> a Y -> a Every non-terminal X in a grammar generates a context-free language L(X) which can be approximated (upper bound) by a rational language R(X). The idea is that checking if a sentence can be derivated from a rational language is linear while it is cubic for the algebraic case. (LL example) L(X) is included in aE* L(Y) is included in bE* When your entery is ab... if it can be derived by the grammar, the only way is to be derived by X (S -> aX -> ab...) Both LL(k) and LR(k) use a k-prefix rational approximation technique. That is why their tables occupate so much space (the cartesian product of the control automaton and a k-trie). LALR and SLR follow the same idea but merge some states according to some given rules. Then, there is no reason for LR, LALR and SLR to be more conceptually difficult than LL. Notice that TAG (Tree Adjoint Grammars) and HPSG (Head driven Phrase Structure Grammar) try to compute more discriminating approximations using the properties of some particular terminals (named 'foot' in the first, 'head' in the latter). The difference of power can be explained by a few simple arguments : - LL(1) should be compared to LR(0), not to LR(1) - top-down algorithms elagate, bottom-up algorithms memorize. But the approximation is essentially an elagation technique. Then, LR take more advantage of this orthogonality. But the same power could be achieved for LL parsers if they were added memorization (LL + CPS, =2E..). Finally, the non-determinism handling question... There are some conflicts you just cannot solve (otherwhise you would have proven rational =3D algebraic). There are two techniques : breadth first search (GLR) and depth-first search (backtracking). Both require memorization not to compute several times the same thing. Of course, Masaru Tomita named in 1986 his parsing technique (using ideas of Bernard Lang in 1974, theirselves taked from Jan Earley 1970) General LR. But it could have been General LL too, since graph traversals are generic algorithms. Brian Rogoff a =E9crit : > I happen to think that recursive descent is the best way to write > parsers, but note that recursive descent parsers are capable of > parsing non-LL(1) grammars, even without the fairly obvious hacks. Michal Moskal a =E9crit : > But keyword parser build with camlp4 libraries can be modified at > runtime The ease of implementation is another classical discussion. The parsing algorithm (ascending or descending) is orthogonal to its recursive functions/table implementation or the static/dynamic data structures used. Most of the people think 'huge tables' as soon as they hear LR(1). But this is only the case if you make the two following design choices - collapsing the control automaton and the desambiguisation automata - representing graphs by tables You can of course implement a LR(0) automaton with recursive functions (recursive ascent), using function calls and exceptions (or any equivalent technique...). LR(0) has in general a quite moderated number of states. You can represent all your graphs (stack or desambiguisation automata) with purely functional data structures, leading to a parser which can be modified at runtime. More over, separating correctly the three parts of your parsers allow you to use different representation/optimisation techniques for every element : non-deterministic bit atomata for desambiguisation automata of less than 31 states, matrix representation for dense graphs, ... instead of one monolithic 'table compression' technique. And all this algorithms can be hidden behind a library, not to be seen by the low-level students, casual users, common programmers... Last point, as said by Luc Maranget, exhaustiveness can be computed on pda or lr(0) automata. Not having anything to do with conflict resolution, there is no need to work on the desambiguisation automata. Diego Olivier ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners