From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by yquem.inria.fr (Postfix) with ESMTP id 99FE4BC37 for ; Mon, 8 Jun 2009 13:41:03 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AukAACOWLEpCbwQdkWdsb2JhbACYKAEBAQEJCwoHEgSxS4QKBQ X-IronPort-AV: E=Sophos;i="4.41,324,1241388000"; d="scan'208";a="27581182" Received: from out5.smtp.messagingengine.com ([66.111.4.29]) by mail2-smtp-roc.national.inria.fr with ESMTP; 08 Jun 2009 13:41:03 +0200 Received: from compute1.internal (compute1.internal [10.202.2.41]) by out1.messagingengine.com (Postfix) with ESMTP id 9CC3435DA05 for ; Mon, 8 Jun 2009 07:41:02 -0400 (EDT) Received: from heartbeat1.messagingengine.com ([10.202.2.160]) by compute1.internal (MEProxy); Mon, 08 Jun 2009 07:41:02 -0400 X-Sasl-enc: lqF9y3cSy0wTzbIkbWfDaqadEb+Pq1pqQWqudqxNRT/3 1244461262 Received: from [192.168.1.10] (ALyon-157-1-70-183.w81-251.abo.wanadoo.fr [81.251.157.183]) by mail.messagingengine.com (Postfix) with ESMTPSA id 2BEDB144D8 for ; Mon, 8 Jun 2009 07:41:02 -0400 (EDT) Message-ID: <4A2CF796.1070504@ens-lyon.org> Date: Mon, 08 Jun 2009 13:35:50 +0200 From: Martin Jambon User-Agent: Thunderbird 2.0.0.17 (X11/20081008) MIME-Version: 1.0 To: caml-list@inria.fr Subject: Re: [Caml-list] Which types for representing HTML documents ? References: <20090608045957.GA7611@pema> In-Reply-To: <20090608045957.GA7611@pema> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-Spam: no; 0.00; ens-lyon:01 ocaml:01 non-standard:01 nodes:01 trivial:01 printf:01 conditionals:01 variants:01 constructors:01 javascript:98 polymorphic:01 wrote:01 parsing:01 inline:01 caml-list:01 Sébastien Hinderer wrote: > Dear all, > > According to you, how could an HTML document best be represented in > OCaml ? Ocamlnet's Nethtml works fine for parsing: type document = Element of (string * (string * string) list * Nethtml.document list) | Data of string If your goal is to interpret arbitrary web pages, you have to allow all kinds of standard or non-standard elements and attributes anywhere in the document. If you are creating HTML documents, beware that you can't embed Flash objects using standard HTML. I'm not even speaking of javascript happily manipulating the DOM tree with little restrictions. Personally I use text templates and validate web pages once they are in my browser (using the shortcut to validator.w3.org that opera provides). For javascript-generated nodes, I just check that it works in various browsers (the firefox "View source chart" extension is useful for debugging the DOM tree). I do not suffer at all from the absence of static type-checking of the HTML tree. I imagine that the reasons for this are: * HTML is the final product and is trivial to debug (no need to printf everything since everything is already printed...) * There are no complicated conditionals that would leave certain parts of the code untested for a long time. * Mainstream web browsers are very tolerant. Small accidental deviations from the strict W3C standards usually have no visible effect. > In particular: would you rather use classes or records, polymorphic > variants or normal constructors ? > > There are attributes which can occur in several elements, such as id, > class... How shold these be represented ? > Should the types reflect the differences between inline elements and > other types of elements ? I know this is going to annoy a lot of people on that list, but this feels very academic to me :-) Martin -- http://mjambon.com/