From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id p27EiL9C031548
	for <caml-list@sympa-roc.inria.fr>; Mon, 7 Mar 2011 15:44:21 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AnIDAP59dE3AbSoIe2dsb2JhbACEKaIrFQEBFiIEIa1dkDINgRqDRXYE
X-IronPort-AV: E=Sophos;i="4.62,277,1297033200"; 
   d="scan'208";a="101469501"
Received: from einhorn.in-berlin.de ([192.109.42.8])
  by mail1-smtp-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 07 Mar 2011 15:44:16 +0100
X-Envelope-From: oliver@first.in-berlin.de
X-Envelope-To: <caml-list@inria.fr>
Received: from first (e178006068.adsl.alicedsl.de [85.178.6.68])
	(authenticated bits=0)
	by einhorn.in-berlin.de (8.13.6/8.13.6/Debian-1) with ESMTP id p27EiFbZ004831
	for <caml-list@inria.fr>; Mon, 7 Mar 2011 15:44:15 +0100
Received: by first (Postfix, from userid 1000)
	id 83EF915408DE; Mon,  7 Mar 2011 15:44:15 +0100 (CET)
Date: Mon, 7 Mar 2011 15:44:15 +0100
From: oliver <oliver@first.in-berlin.de>
To: caml-list@inria.fr
Message-ID: <20110307144415.GA1600@siouxsie>
References: <20110306225242.GA9087@siouxsie>
 <1299500875.30035.31.camel@thinkpad>
 <20110307125748.GA4977@siouxsie>
 <1299505237.30035.82.camel@thinkpad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <1299505237.30035.82.camel@thinkpad>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8
Subject: Re: [Caml-list] ocamlnet: Netheml: simple-dtd: how does this work?

On Mon, Mar 07, 2011 at 02:40:37PM +0100, Gerd Stolpmann wrote:
> Am Montag, den 07.03.2011, 13:57 +0100 schrieb oliver:
> > On Mon, Mar 07, 2011 at 01:27:55PM +0100, Gerd Stolpmann wrote:
> > > Am Sonntag, den 06.03.2011, 23:52 +0100 schrieb oliver:
> > > > Hello,
> > > > 
> > > > tried around using the simple-dtd argument
> > > > for Nethtme.parse.
> > > > 
> > > > It changes the behaviour compared to
> > > > the default behaviour, but I could not find out
> > > > how this works.
> > > > 
> > > > Someone here who can explain me this
> > > > argument and describe, how it can be used?
> > > 
> > > Maybe the HTML specification would be a good reference here:
> > > http://www.w3.org/TR/1999/REC-html401-19991224. You will see there that
> > > most HTML elements are either an inline element, a block element, or
> > > both ("flow" element). The grammar of HTML is described in terms of
> > > these classes. For instance, a P tag (paragraph) is a block element and
> > > contains block elements whereas B (bold) is an inline element and
> > > contains inline elements. From this follows that you cannot put a P
> > > inside a B: <B><P>something</P></B> is illegal.
> > > 
> > > The parser needs this information to resolve such input, i.e. do
> > > something with bad HTML. As HTML allows tag minimization (many end tags
> > > can be omitted), the parser can read this as: <B></B><P>something</P>
> > > (and the </B> in the input is ignored).
> > > 
> > > If all start and all end tags are written out, changing the
> > > simplified_dtd does not make any difference.
> > > 
> > > There is no normative text that says how to read bad HTML. Because of
> > > this, it is - to a large degree - an interpretation of HTML what you put
> > > into simplified_dtd.
> > > 
> > > > The description IMHO is not sufficient to explain
> > > > this feature.
> > > 
> > > I'd say your formal knowledge about HTML is insufficient.
> > [...]
> > 
> > If formal HTML spec is sufficient to know the behaviour of the module,
> > there would no need to have the dtd-argument, which seems, follwoinjg your
> > explanations, to change the behavior in a way that it does NOT follow
> > the formal specifications.
> 
> There is no standard regarding that (except that HTML is also SGML, and
> there are some rules for that in SGML). You could also reject bad HTML.
> But this has not become common practice (unlike for XML, for instance).
> 
> So, it depends on your HTML documents how you want to fix bad HTML.
> That's the reason why you can configure it.
[...]

But it's not mentioned how the dtd-Argument works.

Does it change the behaviour only for those tags that
are mentioned in the dtd-argument?
What about the other args? Will they stay as before (default)?

Or will using the dtd-arg change the parsing to some kind of
very relaxed fallback, and I add constraints only to the mentioned
tags?

So does the dtd-arg widen or narrow the default dtd or just replace the
default settings? What about tags that are not included in the dtd-arg?

I doubt that behaviour of your module can be found in the HTML-spec.

If it's so obvious behaviour, I wonder why nobody else could answer
the question (not here and not in irc).


> 
> > > It is
> > > impossible to explain all the basics of HTML in the scope of an mli.
> > 
> > Do you mean the basics of html spec or the many different ways,
> > bad html is written?!
> 
> This is connected to some degree - if you define a spec you also define
> ways how to violate it. For example, the spec defines for each element
> whether the start tag, the end tag, or both can be omitted. But what to
> do if this is not done correctly?

I could write my own parser, but prefer to use modules that help me.
For that the docs should explain what the arguments do.

I know that there is a DTD for HTML and that most html is not confirming to it.

But how your parser works and how that dtd-arg must be used is not clear to me.



> 
> The formal basics would here be the DTD (document type definition). If
> you know what a DTD is, you also know the problem of omitting tags. If I
> included all that in the mli, it would be 1000 lines long, and would
> contain that much information that it would be unclear what the relevant
> part is. In short - I don't want to write books in mli's. The user needs
> background information, but the mli is not the right place where to give
> it.

But the docs could explain the argument behaviour.


> 
> The specific problem with HTML is that everybody knows "something" about
> it, but most knowledge is second-hand. However, HTML is a systematic
> definition (with syntax and semantics), and everybody who knows that
> will also be able to read my mli.

Aha, that_s why nobody answered my question.

Maybe it's not that obvious.

Ciao,
   Oliver