From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <effbiae@ivorykite.com>
Delivered-To: caml-list@yquem.inria.fr
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78])
	by yquem.inria.fr (Postfix) with ESMTP id 2A30BBC48
	for <caml-list@yquem.inria.fr>; Sat,  2 Apr 2005 07:10:07 +0200 (CEST)
Received: from ensim.smartydns22.com (ensim.smartydns22.com [67.15.74.65])
	by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id j325A5pM010059
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
	for <caml-list@yquem.inria.fr>; Sat, 2 Apr 2005 07:10:06 +0200
Received: from www.ivorykite.com (localhost.localdomain [127.0.0.1])
	by ensim.smartydns22.com (8.12.11/8.12.10) with SMTP id j325A4b2013679
	for <caml-list@yquem.inria.fr>; Sat, 2 Apr 2005 15:10:04 +1000
Received: from 202.164.198.46
        (SquirrelMail authenticated user effbiae@ivorykite.com)
        by www.ivorykite.com with HTTP;
        Sat, 2 Apr 2005 15:10:04 +1000 (EST)
Message-ID: <50130.202.164.198.46.1112418604.squirrel@www.ivorykite.com>
In-Reply-To: <424DA923.7020106@tfb.com>
References: <49464.202.164.198.46.1112355123.squirrel@www.ivorykite.com>
    <424DA923.7020106@tfb.com>
Date: Sat, 2 Apr 2005 15:10:04 +1000 (EST)
Subject: some comments on ocaml{lex,yacc} from a novice's POV
From: "Jack Andrews" <effbiae@ivorykite.com>
To: caml-list@yquem.inria.fr
Reply-To: effbiae@ivorykite.com
User-Agent: SquirrelMail/1.4.2
MIME-Version: 1.0
Content-Type: text/plain;charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Priority: 3
Importance: Normal
X-Miltered: at nez-perce with ID 424E292D.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; ocaml:01 ocaml:01 experimented:01 parsing:01 uchicago:01 tokens:01 grammar:01 tokens:01 def:01 lineno:01 token:01 usr:01 lexer:01 def:01 expr:01 
X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on yquem.inria.fr
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.0.2
X-Spam-Level: 

hi,

this is a little long.  i'm new to ocaml, but like most, have been
educated in FLs and experimented with and applied functional languages and
techniques.  python has been the first language i turn to for a few years
now.

i need to parse text as a sequence of records (with odd variations). i
have used ply (python lex-yacc) most recently for parsing and believe it
to be one of the more elegant mechanisms i've seen. 
http://systems.cs.uchicago.edu/ply/ply.html

elegant because there are no lex and yacc input files, but rather the
tokens and grammar rules are defined in python code -- succinctly!  eg:

# calclex.py
import lex
tokens = ( 'NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN',)
t_PLUS    = r'\+'  # in python, the r prefix to a string literal
t_MINUS   = r'-'   #  means as-is.  r'\' in python is "\\" in c
[snip]
def t_NUMBER(t):
    r'\d+'
    try: t.value = int(t.value)
    except ValueError:
         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
         t.value = 0
    return t

by reflection/introspection ply finds all the token definitions in
calclex.py.  the only trick here is the first line of the t_NUMBER
function.  in python, any string literal as the first expression in a
function is the doc_string (accessible by t_NUMBER.__doc__ in this case)

#!/usr/local/bin/python
import yacc
from calclex import tokens  # this is where python builds the lexer

def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]
[snip]
def p_factor_expr(p):
    'factor : LPAREN expression RPAREN'
    p[0] = p[2]
# this is where python builds the parser
yacc.yacc()   # or yacc.yacc(method="LALR") for alternate parsing methods
while 1:
   try: s = raw_input('calc > ')
   except EOFError: break
   if not s: continue
   result = yacc.parse(s)
   print result

once again, using the names of functions and their docstrings, ply can
build a parser.

but i want to use ocaml, not python because i know i need (more) speed. 
after using ply, the ocaml{yacc,lex} implementation looks like it's just
glued on GNU tools.  not that there's anything wrong with that, but
integration with the language is nothing like that of ply.

don't get me wrong, i don't think ply is perfect, and i don't know enough
about parsing to be any kind of authority, but it seems to me a bit odd
that a comment in a caml parser is either (**) or /**/ depending on
context and in lexical analysis, a character set is expressed as ['A'-'Z'
'a'-'z' '_'] rather than usual (succinct) regexp syntax: [A-Za-z_]  (less
than half the characters)   really, the .mll and .mly look nothing like
caml

take what i say with a grain of salt, i'm no authority on anything i've said.


jack