From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id p27FEYcD000606 for ; Mon, 7 Mar 2011 16:14:34 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AuEBABGFdE3U4xEIkGdsb2JhbACEKaIrFQEBAQEJCQwHEQMirWaQMwKBJYNFdgSIBYdu X-IronPort-AV: E=Sophos;i="4.62,277,1297033200"; d="scan'208";a="92844728" Received: from moutng.kundenserver.de ([212.227.17.8]) by mail2-smtp-roc.national.inria.fr with ESMTP; 07 Mar 2011 16:14:28 +0100 Received: from office1.lan.sumadev.de (dslb-188-097-066-025.pools.arcor-ip.net [188.97.66.25]) by mrelayeu.kundenserver.de (node=mreu0) with ESMTP (Nemesis) id 0MErUc-1Pm7Uc2KSo-00Fzk4; Mon, 07 Mar 2011 16:14:20 +0100 Received: from [192.168.5.106] (dslb-188-097-066-025.pools.arcor-ip.net [188.97.66.25]) by office1.lan.sumadev.de (Postfix) with ESMTPA id 3A4D65F701; Mon, 7 Mar 2011 16:14:20 +0100 (CET) From: Gerd Stolpmann To: oliver Cc: caml-list@inria.fr In-Reply-To: <20110307144415.GA1600@siouxsie> References: <20110306225242.GA9087@siouxsie> <1299500875.30035.31.camel@thinkpad> <20110307125748.GA4977@siouxsie> <1299505237.30035.82.camel@thinkpad> <20110307144415.GA1600@siouxsie> Content-Type: text/plain; charset="UTF-8" Date: Mon, 07 Mar 2011 16:14:18 +0100 Message-ID: <1299510858.30035.95.camel@thinkpad> Mime-Version: 1.0 X-Mailer: Evolution 2.28.1 Content-Transfer-Encoding: 7bit X-Provags-ID: V02:K0:KM4yao7q/duzt2gp77i0GbZ6kP/cKTM+XQGOZroGijy +A9dHuk+iE2O2rr3LEhx/1m0KcFa41kMdbhF7b0EIAmUhQ2ppx pYn1beyy2DcePV4jscowcRobZOUOOy2gxaQfYjIAwiKivtcdvi MtjPGk71a+Tw2cHIXtuFkmRI8zeaHX80RMFVeak9/pyHyrU4lO xEUMweUoddzq6w/ou64Ag== Subject: Re: [Caml-list] ocamlnet: Netheml: simple-dtd: how does this work? Am Montag, den 07.03.2011, 15:44 +0100 schrieb oliver: > On Mon, Mar 07, 2011 at 02:40:37PM +0100, Gerd Stolpmann wrote: > > Am Montag, den 07.03.2011, 13:57 +0100 schrieb oliver: > > > On Mon, Mar 07, 2011 at 01:27:55PM +0100, Gerd Stolpmann wrote: > > > > Am Sonntag, den 06.03.2011, 23:52 +0100 schrieb oliver: > > > > > Hello, > > > > > > > > > > tried around using the simple-dtd argument > > > > > for Nethtme.parse. > > > > > > > > > > It changes the behaviour compared to > > > > > the default behaviour, but I could not find out > > > > > how this works. > > > > > > > > > > Someone here who can explain me this > > > > > argument and describe, how it can be used? > > > > > > > > Maybe the HTML specification would be a good reference here: > > > > http://www.w3.org/TR/1999/REC-html401-19991224. You will see there that > > > > most HTML elements are either an inline element, a block element, or > > > > both ("flow" element). The grammar of HTML is described in terms of > > > > these classes. For instance, a P tag (paragraph) is a block element and > > > > contains block elements whereas B (bold) is an inline element and > > > > contains inline elements. From this follows that you cannot put a P > > > > inside a B:

something

is illegal. > > > > > > > > The parser needs this information to resolve such input, i.e. do > > > > something with bad HTML. As HTML allows tag minimization (many end tags > > > > can be omitted), the parser can read this as:

something

> > > > (and the in the input is ignored). > > > > > > > > If all start and all end tags are written out, changing the > > > > simplified_dtd does not make any difference. > > > > > > > > There is no normative text that says how to read bad HTML. Because of > > > > this, it is - to a large degree - an interpretation of HTML what you put > > > > into simplified_dtd. > > > > > > > > > The description IMHO is not sufficient to explain > > > > > this feature. > > > > > > > > I'd say your formal knowledge about HTML is insufficient. > > > [...] > > > > > > If formal HTML spec is sufficient to know the behaviour of the module, > > > there would no need to have the dtd-argument, which seems, follwoinjg your > > > explanations, to change the behavior in a way that it does NOT follow > > > the formal specifications. > > > > There is no standard regarding that (except that HTML is also SGML, and > > there are some rules for that in SGML). You could also reject bad HTML. > > But this has not become common practice (unlike for XML, for instance). > > > > So, it depends on your HTML documents how you want to fix bad HTML. > > That's the reason why you can configure it. > [...] > > But it's not mentioned how the dtd-Argument works. > > Does it change the behaviour only for those tags that > are mentioned in the dtd-argument? > What about the other args? Will they stay as before (default)? I think this is pretty clear: if you set the dtd arg, you pass a completely new dtd in, overriding any default. This is how optional arguments work in Ocaml. So, if you only want to change something, take the provided html40_dtd or relaxed_html40_dtd values, and apply a change to them. > Or will using the dtd-arg change the parsing to some kind of > very relaxed fallback, and I add constraints only to the mentioned > tags? > > So does the dtd-arg widen or narrow the default dtd or just replace the > default settings? What about tags that are not included in the dtd-arg? > > I doubt that behaviour of your module can be found in the HTML-spec. No, this is Ocaml. Btw, you could have easily answered that yourself by looking at the source. I mean just for the case that you do not see the obvious. Gerd > If it's so obvious behaviour, I wonder why nobody else could answer > the question (not here and not in irc). > > > > > > > > It is > > > > impossible to explain all the basics of HTML in the scope of an mli. > > > > > > Do you mean the basics of html spec or the many different ways, > > > bad html is written?! > > > > This is connected to some degree - if you define a spec you also define > > ways how to violate it. For example, the spec defines for each element > > whether the start tag, the end tag, or both can be omitted. But what to > > do if this is not done correctly? > > I could write my own parser, but prefer to use modules that help me. > For that the docs should explain what the arguments do. > > I know that there is a DTD for HTML and that most html is not confirming to it. > > But how your parser works and how that dtd-arg must be used is not clear to me. > > > > > > > The formal basics would here be the DTD (document type definition). If > > you know what a DTD is, you also know the problem of omitting tags. If I > > included all that in the mli, it would be 1000 lines long, and would > > contain that much information that it would be unclear what the relevant > > part is. In short - I don't want to write books in mli's. The user needs > > background information, but the mli is not the right place where to give > > it. > > But the docs could explain the argument behaviour. > > > > > > The specific problem with HTML is that everybody knows "something" about > > it, but most knowledge is second-hand. However, HTML is a systematic > > definition (with syntax and semantics), and everybody who knows that > > will also be able to read my mli. > > Aha, that_s why nobody answered my question. > > Maybe it's not that obvious. > > Ciao, > Oliver > -- ------------------------------------------------------------ Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Phone: +49-6151-153855 Fax: +49-6151-997714 ------------------------------------------------------------