Date: Thu, 13 Oct 2022 09:08:57 +0900
From: Dominique Martinet
To: Karl Dahlke
Cc: edbrowse-dev@edbrowse.org
Subject: Re: I don't know shit about xml
In-Reply-To: <20220912185105.eklhad@comcast.net>
List-Id: Edbrowse Development List

Karl Dahlke wrote on Wed, Oct 12, 2022 at 06:51:05PM -0400:
> And that's part of my problem.

No worry, thanks for looking into it.
I've replied to points individually below but I agree with your
assessment.

> xml is more like json

Yes, xml is just a way of writing a tree down.
As far as I understand, HTML was built on top of XML but people built
"incorrect" websites (for example not closing tags or whatever) and some
browsers said it's ok, then people asked why it's not working with other
browsers and that became a new standard.
But I might be embellishing this.

> * xml should be syntactically correct.

Yes, I think it's ok to just return an error and no parsed tree for xml
if we see an error.

> * Bad html should be tolerated in xml ( )
> * Should not convert <p> to P upper case

Yes, definitely to both of these.

> * The {cdata{ section we should only pull that out for xml.

I think so, it doesn't look like the html parser in firefox does
anything with it, and we've been ignoring it in html all the time, so
let's keep ignoring it in html.

Looking a bit more I found some more exceptions for xml, e.g. a comment
shows up as "#comment {}" in dumptree on firefox, but that might be a
detail.

> So for start I might need another global variable, not fond of those but you
> know, or maybe a parameter to htmlScanner(), bool isXML, to say which way we
> are scanning, then rules as above based on isXML.

Yes, xml and html are different enough to warrant some separation there.

Since we do not need to interpret xml at all (except cdata, which we do
not need in html), it might actually be better to fork off to a
different function altogether, instead of a global variable?
So depending on the DomParser argument (or mime type) we'd either run
htmlScanner or xmlScanner?

I'm not sure which is easier to do; my line of thinking is that if more
differences pop up, the code might end up simpler.

> This is an overview but let me know if I have made it to first base, or if I
> am off in left field.

This sounds good, let's try this way.

I'm not sure how many sites actually manipulate xml in practice (apart
from my work site...), so thank you for spending time on this!

-- 
Dominique
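
The split discussed above (pick a scanning mode from the DomParser mime
type, then apply per-mode rules) could be sketched roughly as follows.
This is only an illustration under stated assumptions: scan_mode,
scan_rules, scan_mode_for_mime and rules_for are hypothetical names, not
taken from the edbrowse source, and the rule flags simply restate the
three points agreed in the thread.

```c
/* Hypothetical sketch: none of these names come from the edbrowse source. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum scan_mode { SCAN_HTML, SCAN_XML };

/* Per-mode behaviour, restating the points from the thread. */
struct scan_rules {
    bool strict_syntax;  /* xml: report an error and return no tree */
    bool uppercase_tags; /* html only: tag names folded to upper case */
    bool extract_cdata;  /* xml only: pull CDATA sections out */
};

/* Choose the mode from the mime type string given to the parser.
 * Anything that is text/xml, application/xml, or ends in "+xml"
 * (application/xhtml+xml, image/svg+xml) is treated as xml;
 * everything else, including a missing type, falls back to html. */
enum scan_mode scan_mode_for_mime(const char *mime)
{
    size_t n;

    if (mime == NULL)
        return SCAN_HTML;
    if (strcmp(mime, "text/xml") == 0 || strcmp(mime, "application/xml") == 0)
        return SCAN_XML;
    n = strlen(mime);
    if (n >= 4 && strcmp(mime + n - 4, "+xml") == 0)
        return SCAN_XML;
    return SCAN_HTML;
}

/* Return the rule set for a mode. */
struct scan_rules rules_for(enum scan_mode mode)
{
    struct scan_rules r;

    if (mode == SCAN_XML) {
        r.strict_syntax = true;
        r.uppercase_tags = false;
        r.extract_cdata = true;
    } else {
        r.strict_syntax = false;
        r.uppercase_tags = true;
        r.extract_cdata = false;
    }
    return r;
}
```

A caller would compute the mode once, then either pass it down as a
parameter (the isXML variant) or use it to choose between two separate
scanner functions; both approaches avoid a global variable.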