edbrowse-dev - development list for edbrowse
From: Adam Thompson <arthompson1990@gmail.com>
To: Karl Dahlke <eklhad@comcast.net>
Cc: Edbrowse-dev@lists.the-brannons.com
Subject: Re: [Edbrowse-dev] html parser and whitespace in tag names
Date: Thu, 29 Jan 2015 08:22:59 +0000
Message-ID: <20150129082259.GE24669@toaster.adamthompson.me.uk>
In-Reply-To: <20150028155505.eklhad@comcast.net>

On Wed, Jan 28, 2015 at 03:55:05PM -0500, Karl Dahlke wrote:
> > <a href="some_url">
> > <p>
> > link text goes here.
> > </p>
> > </a>
> 
> Yes this did not work because I was trying to be clever.
> Some tags in the text I thought should close the open anchor,
> if there was an open anchor.
> I was really thinking about this.
> 
> <P>
> <A>
> link text
> </P>
> </A>
> 
> Should the </P> close the anchor?
No, that's an error. In fact, strictly speaking, the </p> isn't
necessary in html at all; it's only required in xhtml.
> Does it really matter?

That's an interesting one. It really depends on the page,
but if we're aiming to parse html rather than xhtml,
then the </p> tag is purely optional,
so it probably doesn't matter for most pages (though we should warn about
this kind of incorrect nesting).
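
To make the "warn about incorrect nesting" idea concrete, here is a rough sketch of a stack-based check. It is not edbrowse code: count_nesting_problems, is_void and the size limits are names I made up for illustration, the list of void elements is deliberately incomplete, and the scanner ignores comments, scripts and quoted '<' characters. It treats an unclosed <p> as fine (the end tag being optional) and only complains when a closing tag cuts across some other open element or has no open tag at all.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 64	/* hypothetical limit on open-tag depth */
#define MAX_NAME 16	/* hypothetical limit on tag-name length */

/* Assumption: a short, incomplete list of elements that never take an end tag. */
static int is_void(const char *name)
{
	static const char *const voids[] = { "br", "hr", "img", "input", "meta", "link" };
	size_t i;

	for (i = 0; i < sizeof(voids) / sizeof(voids[0]); ++i)
		if (strcmp(name, voids[i]) == 0)
			return 1;
	return 0;
}

/* Count closing tags that are stray or that cut across an open non-<p> element. */
static int count_nesting_problems(const char *html)
{
	char stack[MAX_DEPTH][MAX_NAME];
	int depth = 0, problems = 0;
	const char *p = html;

	assert(html != NULL);
	while ((p = strchr(p, '<')) != NULL) {
		char name[MAX_NAME];
		int closing = 0, len = 0, i, j, warn = 0;

		++p;
		if (*p == '/') {
			closing = 1;
			++p;
		}
		/* Read the tag name, folded to lower case. */
		while (isalnum((unsigned char)*p) && len < MAX_NAME - 1)
			name[len++] = (char)tolower((unsigned char)*p++);
		name[len] = '\0';
		if (len == 0)
			continue;	/* comment, doctype, or a bare '<' */
		if (!closing) {
			if (!is_void(name) && depth < MAX_DEPTH)
				strcpy(stack[depth++], name);
			continue;
		}
		/* Find the innermost open tag with this name. */
		for (i = depth - 1; i >= 0; --i)
			if (strcmp(stack[i], name) == 0)
				break;
		if (i < 0) {
			++problems;
			fprintf(stderr, "warning: </%s> has no matching open tag\n", name);
			continue;
		}
		/* Open <p> elements may be closed implicitly; anything else is misnested. */
		for (j = i + 1; j < depth; ++j)
			if (strcmp(stack[j], "p") != 0)
				warn = 1;
		if (warn) {
			++problems;
			fprintf(stderr, "warning: </%s> closes <%s> early\n", name, stack[depth - 1]);
		}
		depth = i;
	}
	return problems;
}
```

With this sketch, the <a><p>...</p></a> example from the start of the thread gives 0 problems, while <P><A>link text</P></A> gives 2: one for the </P> that cuts across the open anchor, and one for the </A> that is then left without an open tag.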
> There's so much improper html out there it makes my head spin.
> I tried to anticipate some of this and I think
> I did more harm than good.
> The latest push just comments out some code,
> in html.c from 1810 to 1828.
> #if 0
> #endif
> Code no longer being used.
> I didn't delete the code cause I don't know
> maybe it still might be used in some fashion.
> If I think it's worthless in a couple months I'll delete it
> and some other code that supports it.

Ok, thanks for doing this, that's fixed the link.

> All this makes me wonder again if I should be parsing html at all,
> or if there isn't some code out there that would do it for me,
> and turn it into a tree of nodes, and I could just work with that.
> Let somebody else worry about all this "is it nested properly" html crap.
> Trying to leverage more open source libraries.
> I was going to play with xidel but haven't got round to it.

I'd never heard of xidel until now,
but I was looking into libtidy (used by the tidy html validator) which,
as part of its functionality, exposes a tree of nodes etc.
There's a libcurl and libtidy example file with the curl distribution which is
how I found out about it. Basically it uses the tidy library to parse a webpage
then goes through printing all the nodes.
I've been toying with the idea of trying to use it
in edbrowse to replace our parsing logic.
We could then replace the printing logic from the curl example with our
rendering code (though I accept this is simplifying things a lot).
We'd probably want to put a layer between the tidy node tree and on-screen
rendering to support javascript, but we need to do this at some stage in any case.
Going down this route also means we'll get html and xhtml support in line with
the w3c recommendations, not to mention the fact that tidy is a widely used
tool for validating websites and repairing broken html,
so it can fix most issues in a relatively sane way (removing unnecessary tags
or fixing nesting etc).
It also seems to be a relatively small library, is written in c,
and appears to be fairly simple to use,
with an apparently fully featured tree of nodes
and support for returning the errors found in the html (which
the user could choose to enable when developing webpages).
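
As a rough illustration of the "layer between the node tree and rendering" idea, here is a sketch of a generic tree walk with a pluggable visitor. To be clear, this does not use the tidy API at all: struct node, walk, print_node, collect_text and struct textbuf are all hypothetical names, standing in for whatever intermediate tree we would build from tidy's output. The point is only that the printing step can be swapped for rendering code without touching the traversal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical intermediate node; not a tidy or edbrowse type. */
struct node {
	const char *name;	/* tag name, or NULL for a text node */
	const char *text;	/* text content, or NULL for an element */
	struct node *child;	/* first child */
	struct node *next;	/* next sibling */
};

typedef void (*visit_fn)(const struct node *n, int depth, void *ctx);

/* Depth-first walk in document order, calling fn on every node. */
static void walk(const struct node *n, int depth, visit_fn fn, void *ctx)
{
	assert(fn != NULL);
	for (; n != NULL; n = n->next) {
		fn(n, depth, ctx);
		walk(n->child, depth + 1, fn, ctx);
	}
}

/* Visitor 1: print an indented outline of the tree to the FILE in ctx. */
static void print_node(const struct node *n, int depth, void *ctx)
{
	FILE *out = ctx;
	const char *label = n->name ? n->name : (n->text ? n->text : "");

	fprintf(out, "%*s%s\n", depth * 2, "", label);
}

/* Visitor 2: a stand-in for rendering, appending text nodes to a buffer. */
struct textbuf {
	char data[256];
	size_t len;
};

static void collect_text(const struct node *n, int depth, void *ctx)
{
	struct textbuf *b = ctx;
	size_t room, add;

	(void)depth;
	if (n->text == NULL)
		return;
	room = sizeof(b->data) - 1 - b->len;
	add = strlen(n->text);
	if (add > room)
		add = room;	/* truncate rather than overflow */
	memcpy(b->data + b->len, n->text, add);
	b->len += add;
	b->data[b->len] = '\0';
}
```

Calling walk(root, 0, print_node, stdout) gives the outline, and walk(root, 0, collect_text, &buf) gathers the text; a javascript-facing layer could hang off the same nodes.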

Cheers,
Adam.
