From mboxrd@z Thu Jan  1 00:00:00 1970
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: oliver <oliver@first.in-berlin.de>
Cc: caml-list@inria.fr
In-Reply-To: <20110307125748.GA4977@siouxsie>
References: <20110306225242.GA9087@siouxsie>
	 <1299500875.30035.31.camel@thinkpad>  <20110307125748.GA4977@siouxsie>
Content-Type: text/plain; charset="UTF-8"
Date: Mon, 07 Mar 2011 14:40:37 +0100
Message-ID: <1299505237.30035.82.camel@thinkpad>
Mime-Version: 1.0
X-Mailer: Evolution 2.28.1 
Content-Transfer-Encoding: 7bit
Subject: Re: [Caml-list] ocamlnet: Nethtml: simple-dtd: how does this work?

On Monday, 07.03.2011, at 13:57 +0100, oliver wrote:
> On Mon, Mar 07, 2011 at 01:27:55PM +0100, Gerd Stolpmann wrote:
> > On Sunday, 06.03.2011, at 23:52 +0100, oliver wrote:
> > > Hello,
> > > 
> > > I have been experimenting with the simple-dtd argument
> > > for Nethtml.parse.
> > > 
> > > It changes the behaviour compared to
> > > the default behaviour, but I could not find out
> > > how this works.
> > > 
> > > Can someone here explain this argument to me
> > > and describe how it can be used?
> > 
> > Maybe the HTML specification would be a good reference here:
> > http://www.w3.org/TR/1999/REC-html401-19991224. You will see there that
> > most HTML elements are either an inline element, a block element, or
> > both ("flow" element). The grammar of HTML is described in terms of
> > these classes. For instance, a P tag (paragraph) is a block element and
> > contains inline elements, whereas B (bold) is an inline element and
> > also contains only inline elements. From this it follows that you cannot
> > put a P inside a B: <B><P>something</P></B> is illegal.
> > 
> > The parser needs this information to resolve such input, i.e. do
> > something with bad HTML. As HTML allows tag minimization (many end tags
> > can be omitted), the parser can read this as: <B></B><P>something</P>
> > (and the </B> in the input is ignored).
> > 
> > If all start and all end tags are written out, changing the
> > simplified_dtd does not make any difference.
> > 
> > There is no normative text that says how to read bad HTML. Because of
> > this, what you put into simplified_dtd is, to a large degree, your own
> > interpretation of HTML.
> > 
> > > The description IMHO is not sufficient to explain
> > > this feature.
> > 
> > I'd say your formal knowledge about HTML is insufficient.
> [...]
> 
> If the formal HTML spec were sufficient to know the behaviour of the module,
> there would be no need for the dtd argument, which seems, following your
> explanations, to change the behaviour in a way that does NOT follow
> the formal specification.

There is no standard regarding that (except that HTML is also SGML, and
there are some rules for that in SGML). You could also reject bad HTML.
But this has not become common practice (unlike for XML, for instance).

So, it depends on your HTML documents how you want to fix bad HTML.
That's the reason why you can configure it.
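
A minimal sketch of what "configure it" can look like, assuming ocamlnet
(the netstring package) is installed. It parses the same tag-minimized
input with the two DTDs that Nethtml ships, html40_dtd and
relaxed_html40_dtd, and prints the resulting trees so they can be
compared. The helper names show and parse_with are made up for this
illustration and are not part of Nethtml.

```ocaml
(* Build, for example, with:
     ocamlfind ocamlopt -package netstring -linkpkg demo.ml -o demo *)

(* Render a Nethtml tree as a compact string (illustrative helper). *)
let rec show = function
  | Nethtml.Data s -> Printf.sprintf "%S" s
  | Nethtml.Element (name, _attrs, children) ->
      Printf.sprintf "%s[%s]" name
        (String.concat "; " (List.map show children))

(* Parse a string with the given simplified DTD (illustrative helper). *)
let parse_with dtd s =
  Nethtml.parse ~dtd (new Netchannels.input_string s)

let () =
  (* Both end tags are omitted, so the parser has to decide where the
     B element ends; that decision is driven by the DTD passed in. *)
  let input = "<b>text1 <p>text2" in
  List.iter
    (fun (label, dtd) ->
       let docs = parse_with dtd input in
       Printf.printf "%-20s %s\n" label
         (String.concat "; " (List.map show docs)))
    [ "html40_dtd", Nethtml.html40_dtd;
      "relaxed_html40_dtd", Nethtml.relaxed_html40_dtd ]
```

A custom list of type Nethtml.simplified_dtd can be passed in the same
way if neither of the predefined DTDs matches the documents at hand.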

> > It is
> > impossible to explain all the basics of HTML in the scope of an mli.
> 
> Do you mean the basics of the HTML spec or the many different ways
> bad HTML is written?!

This is connected to some degree: if you define a spec you also define
ways to violate it. For example, the spec defines for each element
whether the start tag, the end tag, or both can be omitted. But what to
do if this is not done correctly?
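
The question can be made concrete with a toy model in plain OCaml. This
is only a sketch and not Nethtml's actual algorithm; the types and the
function names (info, may_contain, on_start_tag) are hypothetical, and
the three-element table is a tiny excerpt in the spirit of the HTML 4.01
DTD. It separates three cases: the new element may be nested, the open
element may legally be closed by an implied end tag, or the input is bad
HTML and some policy has to be chosen.

```ocaml
(* Toy model only: a tiny "simplified DTD" and the decision taken when
   a start tag arrives while another element is still open. *)

type cls = Inline | Block
type content = Inline_only | Flow   (* Flow: block or inline children *)

type info = { cls : cls; content : content; end_tag_optional : bool }

(* Excerpt only; looking up any other tag raises Not_found. *)
let dtd = [
  "div", { cls = Block;  content = Flow;        end_tag_optional = false };
  "p",   { cls = Block;  content = Inline_only; end_tag_optional = true  };
  "b",   { cls = Inline; content = Inline_only; end_tag_optional = false };
]

(* Does the content model of [parent] admit an element like [child]? *)
let may_contain parent child =
  match (List.assoc parent dtd).content, (List.assoc child dtd).cls with
  | Flow, _ | Inline_only, Inline -> true
  | Inline_only, Block -> false

type verdict =
  | Nest             (* child is allowed inside the open element *)
  | Implied_end_tag  (* legal minimization: close the open element first *)
  | Needs_policy     (* bad HTML: the spec does not say what to do *)

let on_start_tag ~open_element child =
  if may_contain open_element child then Nest
  else if (List.assoc open_element dtd).end_tag_optional then Implied_end_tag
  else Needs_policy

let () =
  assert (on_start_tag ~open_element:"div" "p" = Nest);
  assert (on_start_tag ~open_element:"p" "p" = Implied_end_tag);
  (* <B><P>: B requires its end tag, so this is the case left open. *)
  assert (on_start_tag ~open_element:"b" "p" = Needs_policy)
```

The Needs_policy case is the one where a parser has to pick an
interpretation, which is why that choice is left configurable.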

The formal basics here would be the DTD (document type definition). If
you know what a DTD is, you also know the problem of omitting tags. If I
included all that in the mli, it would be 1000 lines long, and would
contain so much information that it would be unclear what the relevant
part is. In short: I don't want to write books in mli's. The user needs
background information, but the mli is not the right place to give
it.

The specific problem with HTML is that everybody knows "something" about
it, but most knowledge is second-hand. However, HTML is a systematic
definition (with syntax and semantics), and everybody who knows that
will also be able to read my mli.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------