From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <arthompson1990@gmail.com>
Received: from mail-wi0-x235.google.com (mail-wi0-x235.google.com
 [IPv6:2a00:1450:400c:c05::235])
 by hurricane.the-brannons.com (Postfix) with ESMTPS id CF5A478FAB
 for <Edbrowse-dev@lists.the-brannons.com>;
 Thu, 29 Jan 2015 00:26:02 -0800 (PST)
Received: by mail-wi0-f181.google.com with SMTP id fb4so21064447wid.2
 for <Edbrowse-dev@lists.the-brannons.com>;
 Thu, 29 Jan 2015 00:23:03 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=date:from:to:cc:subject:message-id:references:mime-version
 :content-type:content-disposition:in-reply-to:user-agent;
 bh=5m0DG+ewlmfUsB6xS5Br6UmGtZZNQs0mlC3g6DEi5ks=;
 b=pKG0yluCutVxTESkHPt+Bmh4LAxSsDPWiuKZ6rJQ5rhR5l2N06FhV6KdS1daDy9TZQ
 +qMao3yY6Xb3615ex+da4eXBC7hJzUmBUoSqh9KZ63/iRGn+T1lqmBvv4ed3HHy/w9p+
 siZCnMrZoRdqaUZnajE68pi1GpEGkd6yBRrqQ2AZPA32NWzMC3LTrXasXokCYoJbLZQV
 Hzd9emmVAaWkPsRJVpKOck6naqbAUtB2uhFTcYszachcMOYaFKUrEbXV01EE204GyKKF
 NPAztTNak4doE9gGbwpViGyxqsg002IrwuWRkytZC2dTA7wHiERMu31no/9GOX7LjZ33
 JXpA==
X-Received: by 10.180.103.40 with SMTP id ft8mr1887090wib.68.1422519782898;
 Thu, 29 Jan 2015 00:23:02 -0800 (PST)
Received: from toaster.adamthompson.me.uk (toaster.adamthompson.me.uk.
 [2001:8b0:1142:9042::2])
 by mx.google.com with ESMTPSA id dp8sm1389376wib.20.2015.01.29.00.23.01
 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Thu, 29 Jan 2015 00:23:01 -0800 (PST)
Date: Thu, 29 Jan 2015 08:22:59 +0000
From: Adam Thompson <arthompson1990@gmail.com>
To: Karl Dahlke <eklhad@comcast.net>
Message-ID: <20150129082259.GE24669@toaster.adamthompson.me.uk>
References: <20150028155505.eklhad@comcast.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="8w3uRX/HFJGApMzv"
Content-Disposition: inline
In-Reply-To: <20150028155505.eklhad@comcast.net>
User-Agent: Mutt/1.5.23 (2014-03-12)
Cc: Edbrowse-dev@lists.the-brannons.com
Subject: Re: [Edbrowse-dev] html parser and whitespace in tag names
X-BeenThere: edbrowse-dev@lists.the-brannons.com
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Edbrowse Development List <edbrowse-dev.lists.the-brannons.com>
List-Unsubscribe: <http://lists.the-brannons.com/mailman/options/edbrowse-dev>, 
 <mailto:edbrowse-dev-request@lists.the-brannons.com?subject=unsubscribe>
List-Archive: <http://lists.the-brannons.com/mailman/private/edbrowse-dev/>
List-Post: <mailto:edbrowse-dev@lists.the-brannons.com>
List-Help: <mailto:edbrowse-dev-request@lists.the-brannons.com?subject=help>
List-Subscribe: <http://lists.the-brannons.com/mailman/listinfo/edbrowse-dev>, 
 <mailto:edbrowse-dev-request@lists.the-brannons.com?subject=subscribe>
X-List-Received-Date: Thu, 29 Jan 2015 08:26:03 -0000



--8w3uRX/HFJGApMzv
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Jan 28, 2015 at 03:55:05PM -0500, Karl Dahlke wrote:
> > <a href="some_url">
> > <p>
> > link text goes here.
> > </p>
> > </a>
>
> Yes this did not work because I was trying to be clever.
> Some tags in the text I thought should close the open anchor,
> if there was an open anchor.
> I was really thinking about this.
>
> <P>
> <A>
> link text
> </P>
> </A>
>
> Should the </P> close the anchor?
No, that's an error. In fact, strictly speaking, the </p> isn't necessary
in html at all; it's only required in xhtml.
> Does it really matter?

That's an interesting one. It really depends on the page,
but if we're aiming to parse html rather than xhtml,
then the </p> tag is purely optional,
so probably not for most pages (though we should warn about this kind of
incorrect nesting).
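
To make the "warn about incorrect nesting" idea concrete, here is a
minimal sketch. It is not edbrowse code: count_nesting_warnings() and
MAX_DEPTH are hypothetical names, and it assumes the tags have already
been split out and lower-cased, with a leading '/' marking a close tag.
A close tag that skips over still-open elements produces a warning but
does not close them, so the </p> above would not close the anchor.

```c
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 64    /* arbitrary limit for this sketch */

/* Hypothetical helper: takes tag names in document order and returns
 * the number of nesting warnings it printed to stderr. */
int count_nesting_warnings(const char *const tags[], int ntags)
{
    const char *open[MAX_DEPTH];    /* stack of currently open elements */
    int depth = 0, warnings = 0;

    for (int i = 0; i < ntags; i++) {
        const char *t = tags[i];

        if (t[0] != '/') {          /* opening tag: push it */
            if (depth < MAX_DEPTH)  /* deeper tags are silently ignored */
                open[depth++] = t;
            continue;
        }

        /* closing tag: look for the nearest open element of that name */
        int j = depth - 1;
        while (j >= 0 && strcmp(open[j], t + 1) != 0)
            j--;

        if (j < 0) {                /* nothing open with that name */
            fprintf(stderr, "warning: stray </%s>\n", t + 1);
            warnings++;
            continue;
        }
        if (j != depth - 1) {       /* something is still open above it */
            fprintf(stderr, "warning: </%s> while <%s> is still open\n",
                    t + 1, open[depth - 1]);
            warnings++;
        }
        /* drop only the matched element; elements above it stay open */
        memmove(&open[j], &open[j + 1],
                (size_t)(depth - 1 - j) * sizeof open[0]);
        depth--;
    }
    return warnings;
}
```

Fed the sequence "p", "a", "/p", "/a" it reports one warning; the
properly nested "a", "p", "/p", "/a" reports none.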
> There's so much improper html out there it makes my head spin.
> I tried to anticipate some of this and I think
> I did more harm than good.
> The latest push just comments out some code,
> in html.c from 1810 to 1828.
> #if 0
> #endif
> Code no longer being used.
> I didn't delete the code cause I don't know
> maybe it still might be used in some fashion.
> If I think it's worthless in a couple months I'll delete it
> and some other code that supports it.

Ok, thanks for doing this, that's fixed the link.

> All this makes me wonder again if I should be parsing html at all,
> or if there isn't some code out there that would do it for me,
> and turn it into a tree of nodes, and I could just work with that.
> Let somebody else worry about all this "is it nested properly" html crap.
> Trying to leverage more open source libraries.
> I was going to play with xidel but haven't got round to it.

I'd never heard of xidel until now,
but I was looking into libtidy (used by the tidy html validator) which,
as part of its functionality, exposes a tree of nodes etc.
There's a libcurl and libtidy example file with the curl distribution,
which is how I found out about it. Basically it uses the tidy library to
parse a webpage, then goes through printing all the nodes.
I've been toying with the idea of trying to use it
in edbrowse to replace our parsing logic.
We could then replace the printing logic from the curl example with our
rendering code (though I accept this is simplifying things a lot).
We'd probably want to put a layer between the tidy node tree and on-screen
rendering to support javascript, but we need to do this at some stage in
any case.
Going down this route also means we'll get html and xhtml support in line
with the w3c recommendations, not to mention the fact that tidy is a widely
used tool for validating websites and repairing broken html,
so it can fix most issues in a relatively sane way (removing unnecessary
tags or fixing nesting etc).
It also seems to be a relatively small library and is written in c,
and appears to be fairly simple to use,
with an apparently fully featured tree of nodes,
and support for returning the errors found in the html
(which the user could choose to enable when developing webpages).
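
As a rough illustration of swapping the printing logic for rendering code,
here is a minimal sketch of walking a tree of nodes. It deliberately does
not call libtidy: struct node, append() and render_text() are hypothetical
stand-ins. As far as I can tell the real library hands back opaque nodes
that you step through with tidyGetChild() and tidyGetNext(), so the walk
would have the same shape. Wrapping link text in braces is just an
arbitrary choice for the example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a parsed node, not a libtidy type. */
struct node {
    const char *name;           /* element name, NULL for a text node */
    const char *text;           /* content of a text node, else NULL */
    const struct node *child;   /* first child */
    const struct node *next;    /* next sibling */
};

/* Append s to the NUL-terminated buffer out, truncating if it is full. */
static void append(char *out, size_t outsz, const char *s)
{
    size_t used = strlen(out);
    snprintf(out + used, outsz - used, "%s", s);
}

/* Depth-first walk in document order. out must start NUL-terminated
 * and outsz must be at least 1. */
void render_text(const struct node *n, char *out, size_t outsz)
{
    for (; n != NULL; n = n->next) {
        int is_link = n->name != NULL && strcmp(n->name, "a") == 0;

        if (n->text != NULL)
            append(out, outsz, n->text);
        if (is_link)
            append(out, outsz, "{");
        render_text(n->child, out, outsz);  /* children before siblings */
        if (is_link)
            append(out, outsz, "}");
    }
}
```

With a tree built by hand for the <a><p>...</p></a> case from earlier in
the thread, the output buffer ends up holding the link text in braces,
regardless of the <p> sitting inside the anchor.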

Cheers,
Adam.

--8w3uRX/HFJGApMzv
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUye3jAAoJELZ22lNQBzHOnwIIAMISuiz39MIqRd9H9mq9XKdm
Ll9Ze0CgKBi5dg6AXiH+dFuTWwg7Y6B0VBfJhVeYioyy6phMTxGFHxO/GiTmmEbD
oZAnx2t6UFR3adw7U7Z1voWsX1DSH3PDEh31NDpK05phq1hWyDlW9Ekwid6MvQ1t
r6KOMueYeIktSkO7OE03ktUTHvlG40bLgo3AWWGps9ykZdPhIAFrXsznCGi3yb0E
3JUgpFR8G+Z6oCQsWd4oZc7keLTjo7umvs7KuYwN522hg8oSdlMjaFH7XVMNlVnB
XP/M1UjUqfXHzzBoUqEI5FJa8xMTolxksc+y9feenWdKL5YehiK8lzNjGlMcQos=
=vKnA
-----END PGP SIGNATURE-----

--8w3uRX/HFJGApMzv--