edbrowse-dev - development list for edbrowse
* [Edbrowse-dev]  html parser and whitespace in tag names
@ 2015-02-01 20:05 Karl Dahlke
  2015-02-01 20:29 ` Chris Brannon
  0 siblings, 1 reply; 11+ messages in thread
From: Karl Dahlke @ 2015-02-01 20:05 UTC
  To: Edbrowse-dev

> Sadly, the project appears to be dead.

I get that impression as well.
My documentation shows the last update was in 2000.
I sent the author an email; no reply.

I'm not averse to continuing our own html parser;
it's not really hard, but it needs considerable redesign
so that it builds a proper tree of nodes
and handles xml and other variations.
Can these changes be done incrementally? Maybe. I don't know.
I was going to add a parent pointer to the htmlTag structure, and link each tag to its
enclosing open tag. I'm also trying to do this in javascript, but the same should
happen in the html tree, whether js is running or not.
Then I need to create text nodes, which I'm not doing at all today.
And so on.
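
For illustration only, a minimal sketch of a node that carries a parent link
and can also represent text. Every name here (struct node, newNode,
appendChild, freeTree) is hypothetical; none of it is taken from edbrowse, and
the real htmlTag structure has other fields.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type, not the real htmlTag structure. */
enum nodeKind { NODE_ELEMENT, NODE_TEXT };

struct node {
	enum nodeKind kind;
	char *name;		/* tag name for elements, NULL for text nodes */
	char *text;		/* content for text nodes, NULL for elements */
	struct node *parent;	/* enclosing open tag, NULL at the root */
	struct node *firstChild;
	struct node *lastChild;
	struct node *next;	/* next sibling under the same parent */
};

/* Return a heap copy of s. */
static char *copyString(const char *s)
{
	size_t n = strlen(s) + 1;
	char *t = malloc(n);
	assert(t != NULL);
	memcpy(t, s, n);
	return t;
}

/* Allocate an element named s, or a text node holding s. */
static struct node *newNode(enum nodeKind kind, const char *s)
{
	struct node *n = calloc(1, sizeof(*n));
	assert(n != NULL);
	n->kind = kind;
	if (kind == NODE_ELEMENT)
		n->name = copyString(s);
	else
		n->text = copyString(s);
	return n;
}

/* Attach child as the last child of parent and record the parent link. */
static void appendChild(struct node *parent, struct node *child)
{
	assert(parent->kind == NODE_ELEMENT);
	child->parent = parent;
	child->next = NULL;
	if (parent->lastChild)
		parent->lastChild->next = child;
	else
		parent->firstChild = child;
	parent->lastChild = child;
}

/* Free a node and everything beneath it. */
static void freeTree(struct node *n)
{
	struct node *c = n->firstChild;
	while (c) {
		struct node *following = c->next;
		freeTree(c);
		c = following;
	}
	free(n->name);
	free(n->text);
	free(n);
}
```

With something like this, the parser would call appendChild() on the
innermost open tag for each new tag or run of text, so the tree exists
whether or not js is running.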

Karl Dahlke

* [Edbrowse-dev]    html parser and whitespace in tag names
@ 2015-01-29 13:24 Karl Dahlke
  2015-02-01 19:56 ` Chris Brannon
  0 siblings, 1 reply; 11+ messages in thread
From: Karl Dahlke @ 2015-01-29 13:24 UTC
  To: Edbrowse-dev

Adam, thanks for the info.
I just installed libtidy and libtidy-devel,
and will play with it as time permits.
I'm 96% sure we want to get out of the html parsing business,
as we did with the js engine, url fetches / protocols, etc.

Karl Dahlke

* [Edbrowse-dev]   html parser and whitespace in tag names
@ 2015-01-28 20:55 Karl Dahlke
  2015-01-29  8:22 ` Adam Thompson
  0 siblings, 1 reply; 11+ messages in thread
From: Karl Dahlke @ 2015-01-28 20:55 UTC
  To: Edbrowse-dev

> <a href="some_url">
> <p>
> link text goes here.
> </p>
> </a>

Yes, this did not work because I was trying to be clever.
I thought some tags in the text should close the open anchor,
if there was one.
I was really thinking about this case:

<P>
<A>
link text
</P>
</A>

Should the </P> close the anchor?
Does it really matter?
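
One possible policy, sketched purely for illustration and not a description
of what edbrowse does: keep a stack of open tags, and let a close tag close
its nearest matching open tag together with anything still open inside it.
Under that policy </P> above closes the anchor implicitly, and the later </A>
is a stray close tag that gets ignored. The names openStack, pushTag, and
closeTag are hypothetical.

```c
#include <assert.h>
#include <ctype.h>

#define MAX_OPEN 64

/* Hypothetical stack of currently open tag names, innermost last. */
struct openStack {
	const char *names[MAX_OPEN];
	int depth;
};

/* Case-insensitive comparison of two tag names; 1 if equal, else 0. */
static int sameTagName(const char *a, const char *b)
{
	while (*a && *b) {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		++a;
		++b;
	}
	return *a == *b;
}

/* Record a newly opened tag. */
static void pushTag(struct openStack *s, const char *name)
{
	assert(s->depth < MAX_OPEN);
	s->names[s->depth++] = name;
}

/*
 * Handle a close tag. Search from the innermost open tag outward for a
 * match; if found, close it and everything opened after it.
 * Returns the number of tags closed, or 0 for a stray close tag,
 * in which case the stack is left unchanged.
 */
static int closeTag(struct openStack *s, const char *name)
{
	int i;
	for (i = s->depth - 1; i >= 0; --i) {
		if (sameTagName(s->names[i], name)) {
			int closed = s->depth - i;
			s->depth = i;
			return closed;
		}
	}
	return 0;
}
```

Properly nested input behaves as expected under the same code: each close tag
matches the top of the stack and closes exactly one tag.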

There's so much improper html out there it makes my head spin.
I tried to anticipate some of this, and I think
I did more harm than good.
The latest push just disables some code
in html.c, lines 1810 to 1828, by wrapping it in
#if 0
#endif
so that code is no longer being used.
I didn't delete it because
it still might be useful in some fashion.
If I think it's worthless in a couple of months, I'll delete it
and some other code that supports it.

All this makes me wonder again if I should be parsing html at all,
or if there isn't some code out there that would do it for me,
and turn it into a tree of nodes, and I could just work with that.
Let somebody else worry about all this "is it nested properly" html crap.
I'm trying to leverage more open source libraries.
I was going to play with xidel but haven't gotten around to it.

Karl Dahlke

* [Edbrowse-dev]  html parser and whitespace in tag names
@ 2015-01-28 15:02 Karl Dahlke
  2015-01-28 17:33 ` Adam Thompson
  0 siblings, 1 reply; 11+ messages in thread
From: Karl Dahlke @ 2015-01-28 15:02 UTC
  To: Edbrowse-dev

That push should fix it.

Karl Dahlke

* [Edbrowse-dev]  html parser and whitespace in tag names
@ 2015-01-28 12:16 Karl Dahlke
  2015-01-28 13:08 ` Adam Thompson
  0 siblings, 1 reply; 11+ messages in thread
From: Karl Dahlke @ 2015-01-28 12:16 UTC
  To: Edbrowse-dev

Actually those tags are all recognized, even with the extra newlines.
I guess I had seen this syntax somewhere, and wrote the parser accordingly.
However, the <script> tag is different,
and a newline in that tag seems to derail things.
In your example,
http://hub.darcs.net
if you delete lines 8 and 9 of the source html, the script and /script tags,
and browse again, all the links appear.
I will have a look at the <script> tag parsing
and see if I can find the whitespace problem.
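
As a rough sketch of the whitespace handling under discussion, assuming a
scanner that is handed a pointer to '<': read the optional '/', read the
name, then skip any whitespace, newlines included, before the caller looks
for attributes or '>'. The functions scanTagName and isScriptTag are
hypothetical and are not taken from html.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * s points at '<'. On success, store the lowercased tag name in name
 * (truncated to fit size), set *closing to 1 for a close tag, and return
 * a pointer to the first non-whitespace character after the name: '>',
 * '/', or the start of the first attribute.
 * Return NULL if this is not the start of a tag.
 */
static const char *scanTagName(const char *s, char *name, size_t size,
			       int *closing)
{
	size_t n = 0;
	assert(size > 0);
	if (*s != '<')
		return NULL;
	++s;
	*closing = 0;
	if (*s == '/') {
		*closing = 1;
		++s;
	}
	/* The name must start right after '<' or "</". */
	if (!isalpha((unsigned char)*s))
		return NULL;
	while (isalnum((unsigned char)*s)) {
		if (n + 1 < size)
			name[n++] = (char)tolower((unsigned char)*s);
		++s;
	}
	name[n] = '\0';
	/* isspace() accepts '\n', so "<div\n    >" and "</div\n>" both pass. */
	while (isspace((unsigned char)*s))
		++s;
	return s;
}

/* A caller that treats <script> specially might branch on this. */
static int isScriptTag(const char *name)
{
	return strcmp(name, "script") == 0;
}
```

The point of routing every tag, script included, through one name scanner is
that the newline tolerance cannot differ from tag to tag.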

Karl Dahlke

* [Edbrowse-dev] html parser and whitespace in tag names
@ 2015-01-28 11:49 Adam Thompson
  0 siblings, 0 replies; 11+ messages in thread
From: Adam Thompson @ 2015-01-28 11:49 UTC
  To: Edbrowse-dev


Hi,

Whilst trying to use:
http://hub.darcs.net
I noticed that the navigation links weren't showing up.
Loading the site in the links browser shows the navigation links (and other parts of the site, as far as I remember).
Checking the source, I noticed they have a lot of the following kind of construct:
<div
    >
Important content
</div
>

I was going to report this as a bug against darcshub;
however, after googling a bit, it appears this is legal html syntax.
The problem seems to be that edbrowse is unable to handle this kind of markup.
How difficult would it be to make the parser handle this?

Cheers,
Adam.



