From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by yquem.inria.fr (Postfix) with ESMTP id 33442BC57 for ; Fri, 22 Oct 2010 16:59:52 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AtYBABREwUzAbSoIe2dsb2JhbACTcI0gUhUBARYiBB7AdoVKBA X-IronPort-AV: E=Sophos;i="4.58,223,1286143200"; d="scan'208";a="62388955" Received: from einhorn.in-berlin.de ([192.109.42.8]) by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 22 Oct 2010 16:59:52 +0200 X-Envelope-From: oliver@first.in-berlin.de X-Envelope-To: Received: from localhost (okapi.in-berlin.de [192.109.42.117]) by einhorn.in-berlin.de (8.13.6/8.13.6/Debian-1) with ESMTP id o9MExoMF026486 for ; Fri, 22 Oct 2010 16:59:50 +0200 Received: from e178021239.adsl.alicedsl.de (e178021239.adsl.alicedsl.de [85.178.21.239]) by webmail.in-berlin.de (Horde Framework) with HTTP; Fri, 22 Oct 2010 16:59:50 +0200 Message-ID: <20101022165950.13786wcej1xp31eu@webmail.in-berlin.de> Date: Fri, 22 Oct 2010 16:59:50 +0200 From: "Oliver Bandel" To: caml-list@yquem.inria.fr Subject: Typesystem and Parsers MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; DelSp="Yes"; format="flowed" Content-Disposition: inline Content-Transfer-Encoding: quoted-printable User-Agent: Internet Messaging Program (IMP) 4.3.3 X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8 X-Spam: no; 0.00; bandel:01 in-berlin:01 parsers:01 ocaml:01 arranging:01 parser:01 ocamlyacc:01 syntax:01 parser:01 ocaml's:01 syntax:01 recursive:01 prepend:01 compiler:01 compiler:01 Hello, when reading papers or books on parsing techniques, the parsing often is done in different distinctive steps, where type checking and semantic checks are done after the parse tree is build up. This may be the classical way, for example when doing it in C. When using OCaml with it's strong type system, IMHO the big advantage is, that the type system can be used to =20 restrict the input data in a way that without additionally checks =20 correct input is enforced, otherwise a parse error is detected. Also with arranging a parser (e.g. with ocamlyacc) both ways can be =20 walked along, either by just accepting everything and build up the =20 tree, and later detect erros in syntax or type... (for example all =20 scanned entities given back as strings or string lists)... ...or the parser can just accept only what the type system would accept, which would be enforced by using sum types. I think, both ways have their advantages, but the strongly typed =20 approach seems not to be talked about in books and papers. So somehow I'm looking for arguments and techniques on how to use =20 Ocaml's type system efficiently, but maybe also take advantage of the =20 flexibility of a weak type system (e.g. by exploring the tree qith the =20 wrong syntax, to analyze it nevertheless). Can you point me to some arguments on how you would do your parsing and why and in which csituation you would chose the one or the other approac= h? For making it more specific: the current program I'm sitting on, is a simple interpreter that helps me in analyzing webpages - a DSL for doing that. It already runs and has some nice features, but in the development stage I toggled between using sum-types (changed forth and back between some =20 solutuons) and using just simple basic types (string and string list). I once read an interesting paper on which types to use... ... it was by... hmhhh forgot the name, but I think it was one of the outstanding FP-programmers. The argument was: use basic types as often as you can, and avoid specific types. This approach has helped: string and string list can be easily used in a parser, when uzsing recursive riiules: just prepend the new value =20 to the already built list. Easy done. But this throws out the type system's strength. Giving the values sum types, which is more specific, would also have =20 advantages. One reason why I'm not really decided, which way to go also is, that =20 at the moment it's an interpreter, but maybe lkater I want to make it =20 createing in-between-code maybe even optimizing. So, can you please elaborate on advantages and disadvantages, when to use which way? Ciao, Oliver P.S.: Writing a compiler I also have in mind, but it has nothing to do =20 with the web-parser-DSL, and is related to microcontroller programming... but = if you could also elaborate on the compiler issues regarding usage of typesystem in parsers, this would be also very interesting too me.