On Wed, Jan 28, 2015 at 03:55:05PM -0500, Karl Dahlke wrote:
> > <a href=...>
> > link text goes here.
> > </a>
> Yes this did not work because I was trying to be clever.
> Some tags in the text I thought should close the open anchor,
> if there was an open anchor.
> I was really thinking about this.
>
> > <a href=...>
> > link text
> > </p>
>
> Should the </p> close the anchor?

No, that's an error; in fact, strictly speaking, in html the </p> isn't
necessary, it's only needed in xhtml.

> Does it really matter?

That's an interesting one. It really depends on the page, but if we're
aiming to parse html rather than xhtml, then the </p> tag is purely
optional, so probably not for most pages (though we should warn about
this kind of incorrect nesting).

> There's so much improper html out there it makes my head spin.
> I tried to anticipate some of this and I think
> I did more harm than good.
> The latest push just comments out some code,
> in html.c from 1810 to 1828.
> #if 0
> #endif
> Code no longer being used.
> I didn't delete the code cause I don't know,
> maybe it still might be used in some fashion.
> If I think it's worthless in a couple months I'll delete it,
> and some other code that supports it.

Ok, thanks for doing this; that's fixed the link.

> All this makes me wonder again if I should be parsing html at all,
> or if there isn't some code out there that would do it for me,
> and turn it into a tree of nodes, and I could just work with that.
> Let somebody else worry about all this "is it nested properly" html crap.
> Trying to leverage more open source libraries.
> I was going to play with xidel but haven't got round to it.

I'd never heard of xidel until now, but I was looking into libtidy (used
by the tidy html validator), which, as part of its functionality,
exposes a tree of nodes. There's a libcurl and libtidy example file in
the curl distribution, which is how I found out about it. Basically it
uses the tidy library to parse a webpage, then walks the tree printing
all the nodes. I've been toying with the idea of using it in edbrowse to
replace our parsing logic. We could then replace the printing logic from
the curl example with our rendering code (though I accept this is
simplifying things a lot). We'd probably want to put a layer between the
tidy node tree and the on-screen rendering to support javascript, but we
need to do this at some stage in any case.
Going down this route also means we'll get html and xhtml support in
line with the w3c recommendations, not to mention the fact that tidy is
a widely used tool for validating websites and repairing broken html, so
it can fix most issues in a relatively sane way (removing unnecessary
tags, fixing nesting, etc). It also seems to be a relatively small
library, is written in c, and appears to be fairly simple to use, with
an apparently fully featured tree of nodes and support for returning the
errors found in the html (i.e. the user may choose to enable this
behaviour when developing webpages).

Cheers,

Adam.