Date: Sat, 2 Apr 2005 15:10:04 +1000 (EST)
Subject: some comments on ocaml{lex,yacc} from a novice's POV
From: "Jack Andrews"
To: caml-list@yquem.inria.fr

hi,

this is a little long.

i'm new to ocaml, but like most, have been educated in FLs and experimented with and applied functional languages and techniques.

python has been the first language i turn to for a few years now. i need to parse text as a sequence of records (with odd variations). i have used ply (python lex-yacc) most recently for parsing and believe it to be one of the more elegant mechanisms i've seen.

http://systems.cs.uchicago.edu/ply/ply.html

elegant because there are no lex and yacc input files, but rather the tokens and grammar rules are defined in python code -- succinctly! eg:

    # calclex.py
    import lex

    tokens = (
        'NUMBER',
        'PLUS',
        'MINUS',
        'TIMES',
        'DIVIDE',
        'LPAREN',
        'RPAREN',
    )

    t_PLUS  = r'\+'   # in python, the r prefix to a string literal
    t_MINUS = r'-'    # means as-is: r'\+' in python is "\\+" in c
    [snip]

    def t_NUMBER(t):
        r'\d+'
        try:
            t.value = int(t.value)
        except ValueError:
            print "Line %d: Number %s is too large!" % (t.lineno, t.value)
            t.value = 0
        return t

by reflection/introspection ply finds all the token definitions in calclex.py. the only trick here is the first line of the t_NUMBER function. in python, a string literal as the first statement in a function is its docstring (accessible as t_NUMBER.__doc__ in this case).

    #!/usr/local/bin/python
    import yacc
    from calclex import tokens   # this is where python builds the lexer

    def p_expression_plus(p):
        'expression : expression PLUS term'
        p[0] = p[1] + p[3]
    [snip]

    def p_factor_expr(p):
        'factor : LPAREN expression RPAREN'
        p[0] = p[2]

    # this is where python builds the parser
    yacc.yacc()   # or yacc.yacc(method="LALR") for alternate parsing methods

    while 1:
        try:
            s = raw_input('calc > ')
        except EOFError:
            break
        if not s:
            continue
        result = yacc.parse(s)
        print result

once again, using the names of functions and their docstrings, ply can build a parser.

but i want to use ocaml, not python, because i know i need (more) speed. after using ply, the ocaml{yacc,lex} implementation looks like it's just the GNU tools glued on. not that there's anything wrong with that, but the integration with the language is nothing like that of ply.
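
for anyone who hasn't seen the trick, here is a rough sketch of how rules can be collected from t_ names and docstrings. it is only an illustration, not ply's actual code: it assumes python 3, the names CalcRules, collect_rules and tokenize are made up for the example, the rules sit in a class just to keep the sketch in one piece, and t_NUMBER takes the matched text rather than a ply token object.

```python
import re

class CalcRules:
    # hypothetical rule container, for illustration only
    t_PLUS = r'\+'
    t_MINUS = r'-'

    def t_NUMBER(text):
        r'\d+'
        return int(text)   # action: turn the matched text into an int


def collect_rules(namespace):
    """map token name -> (regex, action or None) for every t_* entry."""
    rules = {}
    for name, obj in namespace.items():
        if not name.startswith('t_'):
            continue
        if callable(obj):
            # for a function rule, the regex is the docstring
            rules[name[2:]] = (obj.__doc__, obj)
        else:
            # for a string rule, the regex is the string itself
            rules[name[2:]] = (obj, None)
    return rules


def tokenize(text, rules):
    """return a list of (token name, value) pairs, skipping whitespace."""
    master = re.compile('|'.join(
        '(?P<%s>%s)' % (name, regex) for name, (regex, _) in rules.items()))
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = master.match(text, pos)
        if m is None:
            raise ValueError('illegal character %r at %d' % (text[pos], pos))
        name = m.lastgroup               # which named group matched
        action = rules[name][1]
        value = action(m.group()) if action else m.group()
        tokens.append((name, value))
        pos = m.end()
    return tokens
```

calling tokenize('12 + 3-4', collect_rules(vars(CalcRules))) gives a list of (name, value) pairs, with the NUMBER values already converted by the action function.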

don't get me wrong, i don't think ply is perfect, and i don't know enough about parsing to be any kind of authority, but it seems to me a bit odd that a comment in a caml parser is either (**) or /**/ depending on context, and that in lexical analysis a character set is expressed as

    ['A'-'Z' 'a'-'z' '_']

rather than the usual (succinct) regexp syntax:

    [A-Za-z_]

(less than half the characters)

really, the .mll and .mly look nothing like caml.

take what i say with a grain of salt, i'm no authority on anything i've said.

jack