Date: Sun, 25 Dec 2016 12:53:59 +0000
From: Adam Thompson
To: Geoff McLane
Cc: Karl Dahlke, edbrowse-dev@lists.the-brannons.com
Message-ID: <20161225125359.GA16190@odin>
References: <20161120141458.eklhad@comcast.net> <4449751f-0582-add8-0cb0-74ff1d69c97f@geoffair.info> <20161122133544.eklhad@comcast.net>
Subject: Re: [Edbrowse-dev]
List-Id: Edbrowse Development List

On Thu, Dec 22, 2016 at 09:13:32PM +0100, Geoff McLane wrote:
> > In an ideal world,
>
> LOL! Well we all know that does not exist!

Yep that's certainly true.

> Tidy does leave the form open, waiting, as it
> should, for a close form, but then it hits
> a tr open table element, and reports -
>
> line 5 column 1 - Warning: missing close form
> before tr
>
> It is at this point that it *must* close the
> form... and carries on parsing the table
> row.. etc...
>
> And that is why tidy emits an error when it
> does eventually find a close form...
>
> I too have had the thought - does this not
> tell tidy that the earlier implicit form
> close it added was not right - but what can
> it do about it at that stage?
>
> > postmuck with the tree
>
> Yes, I hear you! That is *not* fun, and as you
> point out in fixing one page, you can break so
> many others...

Agreed. The only way around this that I can think of would be for tidy
to keep track of any missing close tags and then "fix" its tree once it
finds the closing tag. This'd be messy and fairly difficult to do well,
but it would allow the forced output mode to produce complete forms etc.
That being said, I'm not sure how many pages that'd break... probably
many.

> > Using libtidy
>
> You know, for a long time I have wondered why
> you do not write your own html parser!

We had one for quite a while, but it got harder to maintain as new
elements were supported, and then html5 happened.

> Not that I particularly want you to abandon
> libtidy... your participation has helped solve
> some libtidy problems... and so do hope you
> continue...
>
> But like any std html browser, IE, firefox, chrome,
> who-ever, you are not really interested in how
> well a document is formed... browsers can just skip
> over many problems...

True, but tidy can repair most of them, which is very useful. It's also
a full validating html parser which, although it causes some problems
with invalid pages, gives us support for a lot of html which'd otherwise
take quite a bit of work and maintenance.

> If necessary, maybe levering code from text-based
> web browsers, like Lynx, but in my experimentation
> with some of these, they too can get very hairy...

Yes, and adding support for dynamic page elements only makes things
worse in that regard. In addition, just skipping over problems means one
then needs to work around them somehow. This may take the form of
ignoring them, but most of the time, particularly with js, some sort of
special casing would be required. This is why repairing things (see my
above comment) is so useful, I think.

> It is just that once you have the html text in a
> buffer, it basically consists of looking for
> `<` and the `>`, with not too many exceptions...
>
> I have done this, with reasonable success, in several
> perl scripts I have written... as I am sure you
> probably have... like I remember in your first perl
> version...
>
> But I understand, this is a long, LONG way around...
> quite an amount of new work initially...
>
> But libtidy is always going to give you problems
> when it runs into invalid html, and its efforts
> to make it valid...

No more problems imho than we'd experience in getting a valid node tree
from this kind of thing. This, actually, isn't as bad as I've seen,
since the form is actually closed. I wonder if, in our case, we could
detect from the tidy output that there is actually a closing tag
somewhere and then attempt to post-process as Karl suggested (maybe
print a warning and then have a command or option to disable this for
pages where it breaks)?

Any thoughts?

Cheers,
Adam.