Date: Thu, 29 Jan 2015 08:22:59 +0000
From: Adam Thompson
To: Karl Dahlke
Cc: Edbrowse-dev@lists.the-brannons.com
Subject: Re: [Edbrowse-dev] html parser and whitespace in tag names
Message-ID: <20150129082259.GE24669@toaster.adamthompson.me.uk>
In-Reply-To: <20150028155505.eklhad@comcast.net>

On Wed, Jan 28, 2015 at 03:55:05PM -0500, Karl Dahlke wrote:
> > [...]
> > link text goes here.
> > [...]
>
> Yes this did not work because I was trying to be clever.
> Some tags in the text I thought should close the open anchor,
> if there was an open anchor.
> I was really thinking about this.
>
> [...]
> link text
> [...]
>
> Should the [...] close the anchor?

No, that's an error. In fact, strictly speaking, in html the [...] isn't
necessary; it's only needed in xhtml.

> Does it really matter?

That's an interesting one. It really depends on the page, but if we're
aiming to parse html rather than xhtml, then the [...] tag is purely
optional, so probably not for most pages (though we should warn about
this kind of incorrect nesting).

> There's so much improper html out there it makes my head spin.
> I tried to anticipate some of this and I think
> I did more harm than good.
> The latest push just comments out some code,
> in html.c from 1810 to 1828.
> #if 0
> #endif
> Code no longer being used.
> I didn't delete the code cause I don't know
> maybe it still might be used in some fashion.
> If I think it's worthless in a couple months I'll delete it
> and some other code that supports it.

Ok, thanks for doing this, that's fixed the link.

> All this makes me wonder again if I should be parsing html at all,
> or if there isn't some code out there that would do it for me,
> and turn it into a tree of nodes, and I could just work with that.
> Let somebody else worry about all this "is it nested properly" html crap.
> Trying to leverage more open source libraries.
> I was going to play with xidel but haven't got round to it.

I'd never heard of xidel until now, but I was looking into libtidy (used by
the tidy html validator) which, as part of its functionality, exposes a
tree of nodes etc.
There's a libcurl and libtidy example file with the curl distribution, which
is how I found out about it. Basically it uses the tidy library to parse a
web page, then goes through printing all the nodes.
I've been toying with the idea of trying to use it in edbrowse to replace
our parsing logic. We could then replace the printing logic from the curl
example with our rendering code (though I accept this is simplifying things
a lot).
We'd probably want to put a layer between the tidy node tree and on-screen
rendering to support javascript, but we need to do this at some stage in
any case.
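
To make the idea a bit more concrete, here's a rough sketch of what walking
a node tree and handing each node to rendering callbacks might look like.
Everything in it (struct node, struct visitor, walk, the render_* callbacks,
struct outbuf, demo_render) is made up for illustration; it is not tidy's API
and not existing edbrowse code, and a real version would walk tidy's own node
tree rather than a hand-built one. The braces around the anchor text are just
a stand-in for whatever the rendering code actually does with links.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a parsed node; NOT tidy's real node type. */
struct node {
    const char *name;       /* tag name, or NULL for a text node */
    const char *text;       /* text content for text nodes, else NULL */
    struct node *child;     /* first child */
    struct node *next;      /* next sibling */
};

/* Callbacks a renderer (or, later, a javascript layer) would supply. */
struct visitor {
    void (*open)(const struct node *n, int depth, void *ctx);
    void (*text)(const struct node *n, int depth, void *ctx);
    void (*close)(const struct node *n, int depth, void *ctx);
};

/* Depth-first walk over n and all of its following siblings. */
static void walk(const struct node *n, int depth,
                 const struct visitor *v, void *ctx)
{
    for (; n != NULL; n = n->next) {
        if (n->name == NULL) {              /* text node: no children */
            if (v->text)
                v->text(n, depth, ctx);
            continue;
        }
        if (v->open)
            v->open(n, depth, ctx);
        walk(n->child, depth + 1, v, ctx);
        if (v->close)
            v->close(n, depth, ctx);
    }
}

/* Small fixed-size output buffer standing in for the screen. */
struct outbuf {
    char data[256];
    size_t len;
};

static void out_append(struct outbuf *o, const char *s)
{
    size_t n = strlen(s);

    assert(o != NULL);
    if (o->len + n >= sizeof o->data)       /* truncate, never overflow */
        n = sizeof o->data - 1 - o->len;
    memcpy(o->data + o->len, s, n);
    o->len += n;
    o->data[o->len] = '\0';
}

static void render_open(const struct node *n, int depth, void *ctx)
{
    (void)depth;
    if (strcmp(n->name, "a") == 0)
        out_append(ctx, "{");
}

static void render_text(const struct node *n, int depth, void *ctx)
{
    (void)depth;
    out_append(ctx, n->text);
}

static void render_close(const struct node *n, int depth, void *ctx)
{
    (void)depth;
    if (strcmp(n->name, "a") == 0)
        out_append(ctx, "}");
}

/* Hand-build a paragraph holding an anchor plus trailing text, render it. */
const char *demo_render(struct outbuf *out)
{
    struct node link_text = { NULL, "link text", NULL, NULL };
    struct node tail_text = { NULL, " goes here.", NULL, NULL };
    struct node anchor = { "a", NULL, &link_text, &tail_text };
    struct node para = { "p", NULL, &anchor, NULL };
    const struct visitor render = { render_open, render_text, render_close };

    out->len = 0;
    out->data[0] = '\0';
    walk(&para, 0, &render, out);
    return out->data;
}
```

Calling demo_render() with a struct outbuf fills it with the rendered text.
The point of routing everything through struct visitor is that the same walk
could drive a different set of callbacks later without touching the parser
side.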
Going down this route also means we'll get html and xhtml support in line
with the w3c recommendations, not to mention the fact that tidy is a widely
used tool for validating websites and repairing broken html, so it can fix
most issues in a relatively sane way (removing unnecessary tags or fixing
nesting etc).
It also seems to be a relatively small library, is written in c, and appears
to be fairly simple to use, with an apparently fully featured tree of nodes
and support for returning the errors found in the html (so the user could
choose to enable this behaviour when developing webpages).

Cheers,
Adam.