caml-list - the Caml user's mailing list
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: oliver <oliver@first.in-berlin.de>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] ocamlnet: Nethtml: simple-dtd: how does this work?
Date: Mon, 07 Mar 2011 16:14:18 +0100
Message-ID: <1299510858.30035.95.camel@thinkpad>
In-Reply-To: <20110307144415.GA1600@siouxsie>

On Monday, 07.03.2011, at 15:44 +0100, oliver wrote:
> On Mon, Mar 07, 2011 at 02:40:37PM +0100, Gerd Stolpmann wrote:
> > On Monday, 07.03.2011, at 13:57 +0100, oliver wrote:
> > > On Mon, Mar 07, 2011 at 01:27:55PM +0100, Gerd Stolpmann wrote:
> > > > On Sunday, 06.03.2011, at 23:52 +0100, oliver wrote:
> > > > > Hello,
> > > > > 
> > > > > I have been experimenting with the simple-dtd argument
> > > > > for Nethtml.parse.
> > > > > 
> > > > > It changes the behaviour compared to
> > > > > the default, but I could not find out
> > > > > how it works.
> > > > > 
> > > > > Can someone here explain this
> > > > > argument and describe how it can be used?
> > > > 
> > > > Maybe the HTML specification would be a good reference here:
> > > > http://www.w3.org/TR/1999/REC-html401-19991224. You will see there that
> > > > most HTML elements are either an inline element, a block element, or
> > > > both ("flow" element). The grammar of HTML is described in terms of
> > > > these classes. For instance, a P tag (paragraph) is a block element and
> > > > contains inline elements, whereas B (bold) is an inline element and
> > > > contains inline elements. It follows that you cannot put a P
> > > > inside a B: <B><P>something</P></B> is illegal.
> > > > 
> > > > The parser needs this information to resolve such input, i.e. do
> > > > something with bad HTML. As HTML allows tag minimization (many end tags
> > > > can be omitted), the parser can read this as: <B></B><P>something</P>
> > > > (and the </B> in the input is ignored).
> > > > 
> > > > If all start and all end tags are written out, changing the
> > > > simplified_dtd does not make any difference.
> > > > 
> > > > There is no normative text that says how to read bad HTML. Because of
> > > > this, what you put into simplified_dtd is, to a large degree, an
> > > > interpretation of HTML.
> > > > 
> > > > > The description IMHO is not sufficient to explain
> > > > > this feature.
> > > > 
> > > > I'd say your formal knowledge about HTML is insufficient.
> > > [...]
> > > 
> > > If the formal HTML spec were sufficient to know the behaviour of the module,
> > > there would be no need for the dtd argument, which seems, following your
> > > explanations, to change the behaviour in a way that does NOT follow
> > > the formal specification.
> > 
> > There is no standard regarding that (except that HTML is also SGML, and
> > there are some rules for that in SGML). You could also reject bad HTML.
> > But this has not become common practice (unlike for XML, for instance).
> > 
> > So, it depends on your HTML documents how you want to fix bad HTML.
> > That's the reason why you can configure it.
> [...]
> 
> But it's not mentioned how the dtd-Argument works.
> 
> Does it change the behaviour only for those tags that
> are mentioned in the dtd-argument?
> What about the other args? Will they stay as before (default)?

I think this is pretty clear: if you set the dtd arg, you pass in a
completely new DTD, overriding the default. This is how optional
arguments work in OCaml.

So, if you only want to change something, take the provided html40_dtd
or relaxed_html40_dtd values, and apply a change to them.
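
A minimal sketch of both points, not Nethtml's actual code: the types
below are simplified stand-ins for an association list that maps an
element name to a pair (element class, content constraint), and the
names default_dtd, known_elements and override are hypothetical. It
shows that a supplied optional argument replaces the default wholesale,
and how to change one entry while keeping all the others.

```ocaml
(* Simplified stand-ins, assumed for illustration only. *)
type element_class = [ `Inline | `Block ]
type constr = [ `Inline | `Block | `Flow ]
type dtd = (string * (element_class * constr)) list

(* Hypothetical default, playing the role of html40_dtd. *)
let default_dtd : dtd =
  [ "b",   (`Inline, `Inline);
    "p",   (`Block,  `Inline);
    "div", (`Block,  `Flow) ]

(* Optional argument with a default value: if the caller passes ~dtd,
   that value is used instead of default_dtd. Nothing is merged. *)
let known_elements ?(dtd = default_dtd) () : string list =
  List.map fst dtd

(* Replace the entry for one element and keep every other entry. *)
let override name spec (d : dtd) : dtd =
  (name, spec) :: List.remove_assoc name d

(* Example: let B contain flow content; "p" and "div" are unchanged. *)
let my_dtd : dtd = override "b" (`Inline, `Flow) default_dtd
```

With the real module, the same pattern would presumably be applied to
html40_dtd or relaxed_html40_dtd, and the resulting list passed as the
dtd argument of the parse function.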

> Or will using the dtd-arg change the parsing to some kind of
> very relaxed fallback, and I add constraints only to the mentioned
> tags?
> 
> So does the dtd-arg widen or narrow the default dtd or just replace the
> default settings? What about tags that are not included in the dtd-arg?
> 
> I doubt that behaviour of your module can be found in the HTML-spec.

No, this is OCaml.

Btw, you could have easily answered that yourself by looking at the
source. I mean, just in case you do not see the obvious.

Gerd

> If the behaviour is so obvious, I wonder why nobody else could answer
> the question (neither here nor on IRC).
> 
> 
> > 
> > > > It is
> > > > impossible to explain all the basics of HTML in the scope of an mli.
> > > 
> > > Do you mean the basics of the HTML spec or the many different ways
> > > bad HTML is written?!
> > 
> > This is connected to some degree - if you define a spec you also define
> > ways how to violate it. For example, the spec defines for each element
> > whether the start tag, the end tag, or both can be omitted. But what to
> > do if this is not done correctly?
> 
> I could write my own parser, but prefer to use modules that help me.
> For that the docs should explain what the arguments do.
> 
> I know that there is a DTD for HTML and that most HTML does not conform to it.
> 
> But how your parser works and how that dtd-arg must be used is not clear to me.
> 
> 
> 
> > 
> > The formal basics would here be the DTD (document type definition). If
> > you know what a DTD is, you also know the problem of omitting tags. If I
> > included all that in the mli, it would be 1000 lines long, and would
> > contain so much information that it would be unclear what the relevant
> > part is. In short: I don't want to write books in mli's. The user needs
> > background information, but the mli is not the right place to give
> > it.
> 
> But the docs could explain the argument behaviour.
> 
> 
> > 
> > The specific problem with HTML is that everybody knows "something" about
> > it, but most knowledge is second-hand. However, HTML is a systematic
> > definition (with syntax and semantics), and everybody who knows that
> > will also be able to read my mli.
> 
> Aha, that's why nobody answered my question.
> 
> Maybe it's not that obvious.
> 
> Ciao,
>    Oliver
> 


-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------

