From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 13 Oct 2022 09:08:57 +0900
From: Dominique Martinet <asmadeus@codewreck.org>
To: Karl Dahlke <eklhad@comcast.net>
Cc: edbrowse-dev@edbrowse.org
Subject: Re: I don't know shit about xml
Message-ID: <Y0dXGYdDnxu7bVLU@codewreck.org>
References: <20220912185105.eklhad@comcast.net>
X-BeenThere: edbrowse-dev@edbrowse.org
List-Id: Edbrowse Development List <edbrowse-dev.edbrowse.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20220912185105.eklhad@comcast.net>

Karl Dahlke wrote on Wed, Oct 12, 2022 at 06:51:05PM -0400:
> And that's part of my problem.

No worries, thanks for looking into it.
I've replied to points individually below but I agree with your
assessment.

> xml is more like json

Yes, xml is just a way of writing a tree down.
As far as I understand, HTML and XML both come from SGML, but people built
"incorrect" websites (for example not closing <p> tags or whatever),
some browsers said it's ok, then people asked why it's not working with
other browsers, and that tolerance became the new standard.
But I might be embellishing this.

> * xml should be syntactically correct.

Yes, I think it's ok to just return an error and no parsed tree for xml
if we see an error.

> * Bad html should be tolerated in xml (<p><p></p></p>)
> * Should not convert <p> to P upper case

Yes, definitely to both of these.
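
To make the three points above concrete, here is a rough sketch of the
kind of strict check I have in mind. Everything in it is made up for
illustration (xmlTagsBalanced is not an existing edbrowse function), and
it only understands plain <name>, </name> and <name/> tags: no comments,
no cdata, no '>' inside attribute values. It rejects bad nesting, accepts
<p><p></p></p>, and compares names case-sensitively, so nothing gets
upper-cased.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define MAX_DEPTH 64		/* arbitrary limits for this sketch */
#define MAX_NAME 32

/* Hypothetical helper: true if every tag in s is closed in order. */
bool xmlTagsBalanced(const char *s)
{
	char stack[MAX_DEPTH][MAX_NAME];
	int depth = 0;

	assert(s != NULL);
	while ((s = strchr(s, '<')) != NULL) {
		const char *end = strchr(s, '>');
		bool closing = false;
		size_t len = 0;

		if (!end)
			return false;	/* unterminated tag */
		++s;
		if (*s == '/') {
			closing = true;
			++s;
		}
		/* the tag name runs up to whitespace, '/' or '>' */
		while (s + len < end && s[len] != '/' &&
		       !isspace((unsigned char)s[len]))
			++len;
		if (len == 0 || len >= MAX_NAME)
			return false;
		if (closing) {
			/* must match the innermost open tag, exact case */
			if (depth == 0 ||
			    strlen(stack[depth - 1]) != len ||
			    strncmp(stack[depth - 1], s, len) != 0)
				return false;
			--depth;
		} else if (end[-1] != '/') {	/* <name/> opens nothing */
			if (depth == MAX_DEPTH)
				return false;
			memcpy(stack[depth], s, len);
			stack[depth][len] = '\0';
			++depth;
		}
		s = end + 1;
	}
	return depth == 0;	/* leftover open tags are an error */
}
```

A real scanner would build the tree as it goes and throw it away on the
first error; this only shows the accept/reject decision.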

> * The {cdata{ section we should only pull that out for xml.

I think so, it doesn't look like the html parser in firefox does
anything with it, and we've been ignoring it in html all the time, so
let's keep ignoring it in html.
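
For the xml side, pulling the section out could look roughly like the
sketch below. firstCdata is a name I made up, it is not in edbrowse, and
it only handles the first <![CDATA[ ... ]]> section in a string with a
caller-supplied buffer; treat it as an illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: copy the text of the first CDATA section of s
 * into out. Returns false if there is no complete section or if the
 * text does not fit in outsz bytes (including the trailing null byte).
 */
bool firstCdata(const char *s, char *out, size_t outsz)
{
	static const char opener[] = "<![CDATA[";
	static const char closer[] = "]]>";
	const char *start, *end;
	size_t len;

	assert(s != NULL && out != NULL);
	start = strstr(s, opener);
	if (!start)
		return false;
	start += sizeof(opener) - 1;	/* skip past the opener */
	end = strstr(start, closer);
	if (!end)
		return false;	/* section never closed */
	len = (size_t)(end - start);
	if (len >= outsz)
		return false;
	memcpy(out, start, len);	/* content is taken verbatim */
	out[len] = '\0';
	return true;
}
```

The content is copied as-is, so a '<' inside the section stays a plain
character instead of starting a tag.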

Looking a bit more, I found some more exceptions for xml, e.g. <!--
comment --> shows up as "#comment {}" in dumptree on firefox, but that
might be a detail.


> So for start I might need another global variable, not fond of those but you
> know, or maybe a parameter to htmlScanner(), bool isXML, to say which way we
> are scanning, then rules as above based on isXML.

Yes, xml and html are different enough to warrant some separation there.
Since we do not need to interpret xml at all (except cdata that we do
not need in html), it might actually be better to fork off to a
different function altogether, instead of a global variable?
So depending on the DOMParser argument (or mime type) we'd either run
htmlScanner or xmlScanner?

I'm not sure which is easier to do; my line of thinking is that if more
differences pop up, the code might end up simpler with two functions.
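
Roughly, the choice between the two scanners could be made like this.
pickScanner, endsWith and the enum are hypothetical names for the sake
of the example; the real htmlScanner signature is not shown here and I'm
not assuming anything about it. Besides text/html, the types DOMParser
accepts are text/xml, application/xml, application/xhtml+xml and
image/svg+xml, and all of those go to the xml side.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum scanKind { SCAN_HTML, SCAN_XML };	/* hypothetical */

/* true if str ends with suffix */
bool endsWith(const char *str, const char *suffix)
{
	size_t n = strlen(str), m = strlen(suffix);

	return n >= m && strcmp(str + n - m, suffix) == 0;
}

/*
 * Hypothetical helper: decide which scanner to run for a mime type.
 * Exact, case-sensitive match; parameters such as "; charset=utf-8"
 * are assumed to have been stripped by the caller.
 */
enum scanKind pickScanner(const char *mime)
{
	assert(mime != NULL);
	if (strcmp(mime, "text/xml") == 0 ||
	    strcmp(mime, "application/xml") == 0 ||
	    endsWith(mime, "+xml"))
		return SCAN_XML;
	return SCAN_HTML;	/* anything else is treated as html */
}
```

The caller would then switch on the result and call htmlScanner or the
new xmlScanner, with no global flag needed.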

> This is an overview but let me know if I have made it to first base, or if I
> am off in left field.

This sounds good, let's try this way.

I'm not sure how many sites actually manipulate xml in practice (apart
from my work site...), so thank you for spending time on this!
--
Dominique