On Thu, Dec 22, 2016 at 09:13:32PM +0100, Geoff McLane wrote:
> > In an ideal world,
> 
> LOL! Well we all know that does not exist!

Yep that's certainly true.

> Tidy does leave the form open, waiting, as it
> should, for a close form, but then it hits
> a tr open table element, and reports -
> 
> line 5 column 1 - Warning: missing close form
> before tr
> 
> It is at this point that it *must* close the
> form... and carries on parsing the table
> row.. etc...
> 
> And that is why tidy emits an error when it
> does eventually find a close form...
> 
> I too have had the thought - does this not
> tell tidy that the earlier implicit form
> close it added was not right - but what can
> it do about it at that stage?
> 
> > postmuck with the tree
> 
> Yes, I hear you! That is *not* fun, and as you
> point out in fixing one page, you can break so
> many others...

Agreed.  The only way I can think of around this would be for tidy to keep
track of any missing close tags and then "fix" its tree once it finds the
closing tag.  This'd be messy though and fairly difficult to do well, but would
allow the forced output mode to produce complete forms etc.  That being said I'm
not sure how many pages that'd break... probably many.
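
To make the idea concrete, here is a rough sketch of what I mean, in Python
rather than in tidy's own C.  Nothing here is tidy code: the Node class, the
FORCES_CLOSE table and build() are all made up for illustration, and the token
list stands in for a real tokenizer.  It remembers every element it had to
close implicitly, and if the real close tag turns up later it moves the
siblings that followed that element back inside it.

```python
class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def show(self):
        inner = "".join(child.show() for child in self.children)
        return f"<{self.tag}>{inner}</{self.tag}>"


# Hypothetical rule table: opening the key forces these open elements closed.
FORCES_CLOSE = {"tr": {"form"}}


def build(tokens):
    """Build a tree from (kind, tag) tokens, repairing premature closes."""
    root = Node("root")
    current = root
    implicit = {}  # tag -> node that was closed implicitly
    for kind, tag in tokens:
        if kind == "open":
            if current.tag in FORCES_CLOSE.get(tag, ()):
                implicit[current.tag] = current  # remember the forced close
                current = current.parent
            child = Node(tag, current)
            current.children.append(child)
            current = child
        elif current.tag == tag:
            current = current.parent  # ordinary, well-nested close
        elif tag in implicit:
            # Late close tag: pull the siblings that followed the
            # implicitly closed node back inside it.
            node = implicit.pop(tag)
            siblings = node.parent.children
            idx = siblings.index(node)
            moved = siblings[idx + 1:]
            del siblings[idx + 1:]
            for m in moved:
                m.parent = node
            node.children.extend(moved)
        # Any other stray close tag is simply ignored in this sketch.
    return root


tokens = [
    ("open", "table"), ("open", "form"), ("open", "tr"), ("open", "td"),
    ("close", "td"), ("close", "tr"), ("close", "form"), ("close", "table"),
]
print(build(tokens).show())
```

Without the late close form token the tr stays a sibling of the empty form,
which is roughly the tree we get today.  A real version would have to decide
what to do when the moved nodes aren't valid children of the reopened
element, which is where I'd expect pages to start breaking.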

> > Using libtidy
> 
> You know, for a long time I have wondered why
> you do not write your own html parser!

We had one for quite a while but it got harder to maintain as new elements
were supported and then html5 happened.

> Not that I particularly want you to abandon
> libtidy... your participation has helped solve
> some libtidy problems... and so do hope you
> continue...
> 
> But like any std html browser, IE, firefox, chrome,
> who-ever, you are not really interested in how
> well a document is formed... browsers can just skip
> over many problems...

True, but tidy can repair most of them, which is very useful.  It's also
a full validating html parser which, although causing some problems with invalid
pages, gives us support for a lot of html which'd otherwise take quite a bit of
work and maintenance.

> If necessary, maybe levering code from text-based
> web browsers, like Lynx, but in my experimentation
> with some of these, they too can get very hairy...

Yes, and adding support for dynamic page elements only makes things worse in
that regard.  In addition, just skipping over problems means one then needs to
work around them somehow.  This may take the form of ignoring them, but most of
the time, particularly with js, some sort of special casing would be required.
This is why repairing things (see my above comment) is so useful I think.

> It is just that once you have the html text in a
> buffer, it basically consists of looking for
> `<` and the `>`, with not too many exceptions...
> 
> I have done this, with reasonable success, in several
> perl scripts I have written... as I am sure you
> probably have... like I remember in your first perl
> version...
> 
> But I understand, this is a long, LONG way around...
> quite an amount of new work initially...
> 
> But libtidy is always going to give you problems
> when it runs into invalid html, and its efforts
> to make it valid...

No more problems imho than we'd experience in getting a valid node tree from
this kind of thing.  This, actually, isn't as bad as I've seen, since the form
is actually closed.  I wonder if, in our case, we could detect from the tidy
output that there is actually a closing tag somewhere and then attempt to
post-process as Karl suggested (maybe print a warning and then have a command
or option to disable this for pages where it breaks)?
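
Something like the following is what I have in mind for the detection step.
It is only a sketch in Python: the two message patterns are my assumption,
modelled on the warning quoted above, and the real tidy wording may well
differ.  premature_closes() is a made-up helper name, not anything in libtidy.

```python
import re

# Assumed message shapes, patterned on the warning quoted in this thread.
MISSING = re.compile(
    r"line (\d+) column (\d+) - Warning: missing </(\w+)> before <(\w+)>")
DISCARD = re.compile(
    r"line (\d+) column (\d+) - Warning: discarding unexpected </(\w+)>")


def premature_closes(report):
    """Return (tag, implicit_close_line, real_close_line) tuples."""
    pending = {}  # tag -> lines where tidy inserted an implicit close
    found = []
    for line in report.splitlines():
        missing = MISSING.search(line)
        if missing:
            pending.setdefault(missing.group(3), []).append(int(missing.group(1)))
            continue
        discard = DISCARD.search(line)
        if discard:
            tag = discard.group(3)
            if pending.get(tag):
                # Pair the real close tag with the most recent implicit close.
                found.append((tag, pending[tag].pop(), int(discard.group(1))))
    return found


report = (
    "line 5 column 1 - Warning: missing </form> before <tr>\n"
    "line 12 column 1 - Warning: discarding unexpected </form>\n"
)
print(premature_closes(report))
```

Each tuple would then be a candidate for post-processing, and the same list
could drive the warning and the option to turn the repair off.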

Any thoughts?

Cheers,
Adam.