From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id p27ErDJU031972 for ; Mon, 7 Mar 2011 15:53:13 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AnIDAFaAdE3AbSoIe2dsb2JhbACEKaIrFQEBFiIEIa1XkDINgRqDRXYE X-IronPort-AV: E=Sophos;i="4.62,277,1297033200"; d="scan'208";a="92842968" Received: from einhorn.in-berlin.de ([192.109.42.8]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 07 Mar 2011 15:53:08 +0100 X-Envelope-From: oliver@first.in-berlin.de X-Envelope-To: Received: from first (e178006068.adsl.alicedsl.de [85.178.6.68]) (authenticated bits=0) by einhorn.in-berlin.de (8.13.6/8.13.6/Debian-1) with ESMTP id p27Er7IW005515 for ; Mon, 7 Mar 2011 15:53:07 +0100 Received: by first (Postfix, from userid 1000) id 70BCD15408DE; Mon, 7 Mar 2011 15:53:07 +0100 (CET) Date: Mon, 7 Mar 2011 15:53:07 +0100 From: oliver To: caml-list@inria.fr Message-ID: <20110307145307.GA2137@siouxsie> References: <20110306225242.GA9087@siouxsie> <1299500875.30035.31.camel@thinkpad> <20110307125748.GA4977@siouxsie> <1299505237.30035.82.camel@thinkpad> <20110307144415.GA1600@siouxsie> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20110307144415.GA1600@siouxsie> User-Agent: Mutt/1.5.20 (2009-06-14) X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8 Subject: Re: [Caml-list] ocamlnet: Netheml: simple-dtd: how does this work? On Mon, Mar 07, 2011 at 03:44:15PM +0100, oliver wrote: > On Mon, Mar 07, 2011 at 02:40:37PM +0100, Gerd Stolpmann wrote: > > Am Montag, den 07.03.2011, 13:57 +0100 schrieb oliver: > > > On Mon, Mar 07, 2011 at 01:27:55PM +0100, Gerd Stolpmann wrote: > > > > Am Sonntag, den 06.03.2011, 23:52 +0100 schrieb oliver: > > > > > Hello, > > > > > > > > > > tried around using the simple-dtd argument > > > > > for Nethtme.parse. > > > > > > > > > > It changes the behaviour compared to > > > > > the default behaviour, but I could not find out > > > > > how this works. > > > > > > > > > > Someone here who can explain me this > > > > > argument and describe, how it can be used? > > > > > > > > Maybe the HTML specification would be a good reference here: > > > > http://www.w3.org/TR/1999/REC-html401-19991224. You will see there that > > > > most HTML elements are either an inline element, a block element, or > > > > both ("flow" element). The grammar of HTML is described in terms of > > > > these classes. For instance, a P tag (paragraph) is a block element and > > > > contains block elements whereas B (bold) is an inline element and > > > > contains inline elements. From this follows that you cannot put a P > > > > inside a B:

something

is illegal. > > > > > > > > The parser needs this information to resolve such input, i.e. do > > > > something with bad HTML. As HTML allows tag minimization (many end tags > > > > can be omitted), the parser can read this as:

something

> > > > (and the in the input is ignored). > > > > > > > > If all start and all end tags are written out, changing the > > > > simplified_dtd does not make any difference. > > > > > > > > There is no normative text that says how to read bad HTML. Because of > > > > this, it is - to a large degree - an interpretation of HTML what you put > > > > into simplified_dtd. > > > > > > > > > The description IMHO is not sufficient to explain > > > > > this feature. > > > > > > > > I'd say your formal knowledge about HTML is insufficient. > > > [...] > > > > > > If formal HTML spec is sufficient to know the behaviour of the module, > > > there would no need to have the dtd-argument, which seems, follwoinjg your > > > explanations, to change the behavior in a way that it does NOT follow > > > the formal specifications. > > > > There is no standard regarding that (except that HTML is also SGML, and > > there are some rules for that in SGML). You could also reject bad HTML. > > But this has not become common practice (unlike for XML, for instance). > > > > So, it depends on your HTML documents how you want to fix bad HTML. > > That's the reason why you can configure it. > [...] > > But it's not mentioned how the dtd-Argument works. [...] For example, if you use the empty string "" as a tag, it changes the parsing behaviour, even I doubt there is a tag that has "" as name. The same problem occurs with any phantasy-tag. If I change non existing tags, why does the module parse correct html different when using such a dtd-arg, compared to not using the dtd-arg? I doubt I can find the answer in the HTML-spec. Ciao, Oliver