From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <oliver@first.in-berlin.de>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id 33442BC57
	for <caml-list@yquem.inria.fr>; Fri, 22 Oct 2010 16:59:52 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AtYBABREwUzAbSoIe2dsb2JhbACTcI0gUhUBARYiBB7AdoVKBA
X-IronPort-AV: E=Sophos;i="4.58,223,1286143200"; 
   d="scan'208";a="62388955"
Received: from einhorn.in-berlin.de ([192.109.42.8])
  by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 22 Oct 2010 16:59:52 +0200
X-Envelope-From: oliver@first.in-berlin.de
X-Envelope-To: <caml-list@yquem.inria.fr>
Received: from localhost (okapi.in-berlin.de [192.109.42.117])
	by einhorn.in-berlin.de (8.13.6/8.13.6/Debian-1) with ESMTP id o9MExoMF026486
	for <caml-list@yquem.inria.fr>; Fri, 22 Oct 2010 16:59:50 +0200
Received: from e178021239.adsl.alicedsl.de (e178021239.adsl.alicedsl.de
	[85.178.21.239]) by webmail.in-berlin.de (Horde Framework) with HTTP; Fri,
	22 Oct 2010 16:59:50 +0200
Message-ID: <20101022165950.13786wcej1xp31eu@webmail.in-berlin.de>
Date: Fri, 22 Oct 2010 16:59:50 +0200
From: "Oliver Bandel" <oliver@first.in-berlin.de>
To: caml-list@yquem.inria.fr
Subject: Typesystem and Parsers
MIME-Version: 1.0
Content-Type: text/plain;
	charset=ISO-8859-1;
	DelSp="Yes";
	format="flowed"
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
User-Agent: Internet Messaging Program (IMP) 4.3.3
X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8
X-Spam: no; 0.00; bandel:01 in-berlin:01 parsers:01 ocaml:01 arranging:01 parser:01 ocamlyacc:01 syntax:01 parser:01 ocaml's:01 syntax:01 recursive:01 prepend:01 compiler:01 compiler:01 

Hello,

when reading papers or books on parsing
techniques, the parsing often is done in
different distinctive steps, where type checking
and semantic checks are done after the parse tree
is build up.

This may be the classical way, for example when doing it in C.

When using OCaml with it's strong type system,
IMHO the big advantage is, that the type system can be used to =20
restrict the input data in a way that without additionally checks =20
correct input is enforced, otherwise a parse error is detected.

Also with arranging a parser (e.g. with ocamlyacc) both ways can be =20
walked along, either by just accepting everything and build up the =20
tree, and later detect erros in syntax or type... (for example all =20
scanned entities given back as strings or string lists)...

...or the parser can just accept only what the type system would accept,
which would be enforced by using sum types.

I think, both ways have their advantages, but the strongly typed =20
approach seems not to be talked about in books and papers.

So somehow I'm looking for arguments and techniques on how to use =20
Ocaml's type system efficiently, but maybe also take advantage of the =20
flexibility of a weak type system (e.g. by exploring the tree qith the =20
wrong syntax, to analyze it nevertheless).

Can you point me to some arguments on how you would do your parsing
and why and in which csituation you would chose the one or the other approac=
h?


For making it more specific: the current program I'm sitting on, is a simple
interpreter that helps me in analyzing webpages - a DSL for doing that.
It already runs and has some nice features, but in the development stage
I toggled between using sum-types (changed forth and back between some =20
solutuons) and using just simple basic types (string and string list).

I once read an interesting paper on which types to use...
... it was by... hmhhh forgot the name, but I think it was
one of the outstanding FP-programmers. The argument was: use
basic types as often as you can, and avoid specific types.

This approach has helped: string and string list can be easily used
in a parser, when uzsing recursive riiules: just prepend the new value =20
to the already built list. Easy done.
But this throws out the type system's strength.

Giving the values sum types, which is more specific, would also have =20
advantages.

One reason why I'm not really decided, which way to go also is, that =20
at the moment it's an interpreter, but maybe lkater I want to make it =20
createing in-between-code maybe even optimizing.

So, can you please elaborate on advantages and disadvantages,
when to use which way?

Ciao,
     Oliver

P.S.: Writing a compiler I also have in mind, but it has nothing to do =20
with the
       web-parser-DSL, and is related to microcontroller programming... but =
if
       you could also elaborate on the compiler issues regarding usage of
       typesystem in parsers, this would be also very interesting too me.