[Edbrowse-dev] html parser and whitespace in tag names

edbrowse-dev - development list for edbrowse

* [Edbrowse-dev] html parser and whitespace in tag names
@ 2015-01-28 12:16 Karl Dahlke
  2015-01-28 13:08 ` Adam Thompson
  0 siblings, 1 reply; 11+ messages in thread
From: Karl Dahlke @ 2015-01-28 12:16 UTC (permalink / raw)
  To: Edbrowse-dev

Actually those tags are all recognized, even with the extra newlines.
I guess I had seen this syntax somewhere, and wrote the parser accordingly.
However, the <script> tag is different,
and a newline in that tag seems to derail things.
In your example,
http://hub.darcs.net
in the source html, delete lines 8 and 9, the script and /script tags,
and browse, and all the links appear.
I will have a look at the <script> tag parsing
and see if I can find the whitespace problem.

Karl Dahlke

* Re: [Edbrowse-dev] html parser and whitespace in tag names
  2015-01-28 12:16 [Edbrowse-dev] html parser and whitespace in tag names Karl Dahlke
@ 2015-01-28 13:08 ` Adam Thompson
  0 siblings, 0 replies; 11+ messages in thread
From: Adam Thompson @ 2015-01-28 13:08 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

On Wed, Jan 28, 2015 at 07:16:54AM -0500, Karl Dahlke wrote:
> Actually those tags are all recognized, even with the extra newlines.
> I guess I had seen this syntax somewhere, and wrote the parser accordingly.
> However, the <script> tag is different,
> and a newline in that tag seems to derail things.
> In your example,
> http://hub.darcs.net
> in the source html, delete lines 8 and 9, the script and /script tags,
> and browse, and all the links appear.

Ah ok, thanks. I didn't realise the script tag was handled specially (or
was that much different to the rest of the tags on the page).
> I will have a look at the <script> tag parsing
> and see if I can find the whitespace problem.

Thanks. I wondered how this hadn't been a problem before.

Cheers,
Adam.

* Re: [Edbrowse-dev] html parser and whitespace in tag names
  2015-02-01 20:05 Karl Dahlke
@ 2015-02-01 20:29 ` Chris Brannon
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Brannon @ 2015-02-01 20:29 UTC (permalink / raw)
  To: edbrowse-dev

Karl Dahlke <eklhad@comcast.net> writes:

> My documentation shows last update in 2000.

Well, there was some work in their cvs repo as late as 2009, and at
least one distro is packaging a snapshot of that CVS tree.

In any case, Tyler has pointed me to
https://github.com/htacg/tidy-html5
which is an experimental fork of the tidy codebase with HTML 5 support.
It looks promising.  The "experimental" is a bit scary, and I doubt the
distros are packaging it.

-- Chris

* [Edbrowse-dev] html parser and whitespace in tag names
@ 2015-02-01 20:05 Karl Dahlke
  2015-02-01 20:29 ` Chris Brannon
  0 siblings, 1 reply; 11+ messages in thread
From: Karl Dahlke @ 2015-02-01 20:05 UTC (permalink / raw)
  To: Edbrowse-dev

> Sadly, the project appears to be dead.

I get that impression as well.
My documentation shows last update in 2000.
I sent the author an email; no reply.

I'm not averse to continuing our own html parser,
it's not really hard, but we need considerable redesign,
so that it builds a proper tree of nodes,
and handles xml and other variations.
Can these changes be done incrementally? Maybe. I don't know.
I was going to add parent to the htmlTag structure, and link each to its
open tag parent, which I'm also trying to do in javascript but same should
happen in the html tree, whether js is running or not.
And then I need to create text nodes, which I'm not doing at all today.
And so on.

Karl Dahlke
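
The redesign described above (a parent link on every tag, plus text nodes) could look roughly like the following. This is only a sketch under assumptions: `struct node`, `newNode`, `appendChild` and `freeTree` are hypothetical names for illustration and are not taken from edbrowse's htmlTag structure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type, not the real htmlTag structure. */
enum nodeKind { NODE_ELEMENT, NODE_TEXT };

struct node {
    enum nodeKind kind;
    char *text;               /* tag name for elements, content for text nodes */
    struct node *parent;      /* enclosing open tag, NULL at the root */
    struct node *firstChild;
    struct node *lastChild;
    struct node *next;        /* next sibling */
};

/* Allocate a node holding a private copy of text; NULL on allocation failure. */
static struct node *newNode(enum nodeKind kind, const char *text)
{
    struct node *n = calloc(1, sizeof *n);
    size_t len = strlen(text) + 1;

    if (!n)
        return NULL;
    n->text = malloc(len);
    if (!n->text) {
        free(n);
        return NULL;
    }
    memcpy(n->text, text, len);
    n->kind = kind;
    return n;
}

/* Link child under parent, keeping document order. */
static void appendChild(struct node *parent, struct node *child)
{
    assert(parent->kind == NODE_ELEMENT);   /* text nodes are leaves */
    child->parent = parent;
    if (parent->lastChild)
        parent->lastChild->next = child;
    else
        parent->firstChild = child;
    parent->lastChild = child;
}

/* Free a node and everything below it. */
static void freeTree(struct node *n)
{
    struct node *c = n ? n->firstChild : NULL;

    while (c) {
        struct node *next = c->next;
        freeTree(c);
        c = next;
    }
    if (n) {
        free(n->text);
        free(n);
    }
}
```

With this shape, `<p><a>link text</a></p>` becomes a `p` element whose only child is an `a` element, which in turn owns one NODE_TEXT child. The tree exists whether or not js is running, and a js layer could mirror it later.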

* Re: [Edbrowse-dev] html parser and whitespace in tag names
  2015-01-29 13:24 Karl Dahlke
@ 2015-02-01 19:56 ` Chris Brannon
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Brannon @ 2015-02-01 19:56 UTC (permalink / raw)
  To: Edbrowse-dev

Karl Dahlke <eklhad@comcast.net> writes:

> I just installed libtidy and libtidy-devel,
> and will play with it as time permits.
> I'm 96% sure we want to get out of the html parsing business,

Sadly, the project appears to be dead.  Yes, it's probably usable in its
current state, but I'm betting that it won't handle HTML 5 correctly.
Looks like there hasn't been any work done on it since 2009.

-- Chris

* [Edbrowse-dev] html parser and whitespace in tag names
@ 2015-01-29 13:24 Karl Dahlke
  2015-02-01 19:56 ` Chris Brannon
  0 siblings, 1 reply; 11+ messages in thread
From: Karl Dahlke @ 2015-01-29 13:24 UTC (permalink / raw)
  To: Edbrowse-dev

Adam, thanks for the info.
I just installed libtidy and libtidy-devel,
and will play with it as time permits.
I'm 96% sure we want to get out of the html parsing business,
as we did with the js engine, url fetches / protocols, etc.

Karl Dahlke

* Re: [Edbrowse-dev] html parser and whitespace in tag names
  2015-01-28 20:55 Karl Dahlke
@ 2015-01-29  8:22 ` Adam Thompson
  0 siblings, 0 replies; 11+ messages in thread
From: Adam Thompson @ 2015-01-29  8:22 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

On Wed, Jan 28, 2015 at 03:55:05PM -0500, Karl Dahlke wrote:
> > <a href="some_url">
> > <p>
> > link text goes here.
> > </p>
> > </a>
> 
> Yes this did not work because I was trying to be clever.
> Some tags in the text I thought should close the open anchor,
> if there was an open anchor.
> I was really thinking about this.
> 
> <P>
> <A>
> link text
> </P>
> </A>
> 
> Should the </P> close the anchor?
No, that's an error, in fact, strictly speaking, in html,
the </p> isn't necessary, it's only needed in xhtml.
> Does it really matter?

That's an interesting one. It really depends on the page,
but if we're aiming to parse html rather than xhtml,
then the </p> tag is purely optional,
so probably not for most pages (though we should warn about this kind of
incorrect nesting).
> There's so much improper html out there it makes my head spin.
> I tried to anticipate some of this and I think
> I did more harm than good.
> The latest push just comments out some code,
> in html.c from 1810 to 1828.
> #if 0
> #endif
> Code no longer being used.
> I didn't delete the code cause I don't know
> maybe it still might be used in some fashion.
> If I think it's worthless in a couple months I'll delete it
> and some other code that supports it.

Ok, thanks for doing this, that's fixed the link.

> All this makes me wonder again if I should be parsing html at all,
> or if there isn't some code out there that would do it for me,
> and turn it into a tree of nodes, and I could just work with that.
> Let somebody else worry about all this "is it nested properly" html crap.
> Trying to leverage more open source libraries.
> I was going to play with xidel but haven't got round to it.

I'd never heard of xidel until now,
but I was looking into libtidy (used by the tidy html validator) which,
as part of its functionality, exposes a tree of nodes etc.
There's a libcurl and libtidy example file with the curl distribution which is
how I found out about it. Basically it uses the tidy library to parse a webpage
then goes through printing all the nodes.
I've been toying with the idea of trying to use it
in edbrowse to replace our parsing logic.
We could then replace the printing logic from the curl example with our
rendering code (though I accept this is simplifying things a lot).
We'd probably want to put a layer between the tidy node tree and on-screen
rendering to support javascript, but we need to do this at some stage in any case.
Going down this route also means we'll get html and xhtml support in line with
the w3c recommendations, not to mention the fact that tidy is a widely used
tool for validating websites and repairing broken html,
so it can fix most issues in a relatively sane way (removing unnecessary tags
or fixing nesting etc).
It also seems to be a relatively small library and is written in c,
and appears to be fairly simple to use,
with an apparently fully featured tree of nodes,
and support for returning the errors found in the html (i.e.
the user may choose to enable this behaviour when developing webpages).

Cheers,
Adam.

* [Edbrowse-dev] html parser and whitespace in tag names
@ 2015-01-28 20:55 Karl Dahlke
  2015-01-29  8:22 ` Adam Thompson
  0 siblings, 1 reply; 11+ messages in thread
From: Karl Dahlke @ 2015-01-28 20:55 UTC (permalink / raw)
  To: Edbrowse-dev

> <a href="some_url">
> <p>
> link text goes here.
> </p>
> </a>

Yes this did not work because I was trying to be clever.
Some tags in the text I thought should close the open anchor,
if there was an open anchor.
I was really thinking about this.

<P>
<A>
link text
</P>
</A>

Should the </P> close the anchor?
Does it really matter?
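
One possible recovery policy for the question above, sketched with an open-element stack: a close tag that matches an open tag further down the stack implicitly closes everything above it and records a warning, and a close tag with no match is ignored. This is an assumption for illustration, not what html.c does and not the full HTML5 algorithm. `struct openStack`, `openTag` and `closeTag` are hypothetical names, and tag names are assumed to be lowercased by the caller.

```c
#include <assert.h>
#include <string.h>

#define MAX_OPEN 64

/* Hypothetical open-element stack; none of these names come from html.c. */
struct openStack {
    const char *names[MAX_OPEN];
    int depth;
    int warnings;       /* count of nesting problems seen so far */
};

/* Record an open tag. */
static void openTag(struct openStack *st, const char *name)
{
    assert(st->depth < MAX_OPEN);
    st->names[st->depth++] = name;
}

/* Handle </name>. Returns the number of elements closed:
 * 0 for a stray close tag (ignored), 1 for a properly nested close,
 * more than 1 when inner elements had to be closed implicitly. */
static int closeTag(struct openStack *st, const char *name)
{
    int i, closed;

    /* Search from the innermost open element outward. */
    for (i = st->depth - 1; i >= 0; i--)
        if (strcmp(st->names[i], name) == 0)
            break;
    if (i < 0) {
        st->warnings++;     /* no matching open tag: ignore it */
        return 0;
    }
    closed = st->depth - i;
    if (closed > 1)
        st->warnings++;     /* misnested: inner tags closed implicitly */
    st->depth = i;
    return closed;
}
```

Under this policy, `<P><A>link text</P></A>` plays out as follows: the `</P>` closes both the anchor and the paragraph with one warning, and the trailing `</A>` is a stray close tag that is skipped with a second warning.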

There's so much improper html out there it makes my head spin.
I tried to anticipate some of this and I think
I did more harm than good.
The latest push just comments out some code,
in html.c from 1810 to 1828.
#if 0
#endif
Code no longer being used.
I didn't delete the code cause I don't know
maybe it still might be used in some fashion.
If I think it's worthless in a couple months I'll delete it
and some other code that supports it.

All this makes me wonder again if I should be parsing html at all,
or if there isn't some code out there that would do it for me,
and turn it into a tree of nodes, and I could just work with that.
Let somebody else worry about all this "is it nested properly" html crap.
Trying to leverage more open source libraries.
I was going to play with xidel but haven't got round to it.

Karl Dahlke

* Re: [Edbrowse-dev] html parser and whitespace in tag names
  2015-01-28 15:02 Karl Dahlke
@ 2015-01-28 17:33 ` Adam Thompson
  0 siblings, 0 replies; 11+ messages in thread
From: Adam Thompson @ 2015-01-28 17:33 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

On Wed, Jan 28, 2015 at 10:02:21AM -0500, Karl Dahlke wrote:
> That push should fix it.

Thanks, yes it does.

However, I noticed another thing whilst using the site. One of the links has a <p> tag in it, i.e.:
<a href="some_url">
<p>
link text goes here.
</p>
</a>

Edbrowse fails to render this as a link.
I'm not sure what the best format for this would be,
but I've seen this construct a few times on other sites as well.

Any chance these kinds of constructs could be rendered? (I've also seen other
things inside <a> tags which edbrowse doesn't seem to render.)

Cheers,
Adam.

* [Edbrowse-dev] html parser and whitespace in tag names
@ 2015-01-28 15:02 Karl Dahlke
  2015-01-28 17:33 ` Adam Thompson
  0 siblings, 1 reply; 11+ messages in thread
From: Karl Dahlke @ 2015-01-28 15:02 UTC (permalink / raw)
  To: Edbrowse-dev

That push should fix it.

Karl Dahlke

* [Edbrowse-dev] html parser and whitespace in tag names
@ 2015-01-28 11:49 Adam Thompson
  0 siblings, 0 replies; 11+ messages in thread
From: Adam Thompson @ 2015-01-28 11:49 UTC (permalink / raw)
  To: Edbrowse-dev

Hi,

Whilst trying to use:
http://hub.darcs.net
I noticed that the navigation links weren't showing up. Loading the site in links shows the navigation links (and other parts of the site as far as I remember).
Checking the source I noticed they have a lot of the following kinds of constructs:
<div
    >
Important content
</div
>

I was going to report this as a bug against darcshub,
however on googling a bit apparently this is legal html syntax.
The problem seems to be that edbrowse is unable to handle this kind of markup.
How difficult would it be to make the parser handle this?

Cheers,
Adam.
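
For reference, a tag scanner that tolerates the construct above only has to treat whitespace and newlines between the tag name and the closing `>` as skippable. A minimal sketch, assuming a hypothetical helper `scan_tag` that is not part of edbrowse and that does not parse attributes beyond skipping quoted values:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not edbrowse code: read the tag that starts at s.
 * Copies the lowercased tag name into name (capacity cap), sets *is_close
 * for </name>, and returns a pointer just past the closing '>'.
 * Returns NULL if s does not start a well-formed tag. */
static const char *scan_tag(const char *s, char *name, size_t cap, int *is_close)
{
    size_t n = 0;
    char quote = 0;

    assert(s && name && is_close && cap > 1);
    if (*s != '<')
        return NULL;
    s++;
    *is_close = (*s == '/');
    if (*is_close)
        s++;
    /* The name must follow '<' or '</' directly; "< div>" is not a tag. */
    if (!isalpha((unsigned char)*s))
        return NULL;
    while (isalnum((unsigned char)*s)) {
        if (n + 1 < cap)
            name[n++] = (char)tolower((unsigned char)*s);
        s++;
    }
    name[n] = '\0';
    /* Skip whitespace, newlines and attributes up to '>',
     * ignoring any '>' inside a quoted attribute value. */
    for (; *s; s++) {
        if (quote) {
            if (*s == quote)
                quote = 0;
        } else if (*s == '"' || *s == '\'') {
            quote = *s;
        } else if (*s == '>') {
            return s + 1;
        }
    }
    return NULL;    /* ran off the end without finding '>' */
}
```

Called on the darcshub-style markup, `scan_tag` reports an open `div` for the first tag and a closing `div` for `</div` followed by a newline and `>`.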
