Re: I don't know shit about xml - Dominique Martinet

edbrowse-dev - development list for edbrowse

From: Dominique Martinet <asmadeus@codewreck.org>
To: Karl Dahlke <eklhad@comcast.net>
Cc: edbrowse-dev@edbrowse.org
Subject: Re: I don't know shit about xml
Date: Thu, 13 Oct 2022 09:08:57 +0900
Message-ID: <Y0dXGYdDnxu7bVLU@codewreck.org>
In-Reply-To: <20220912185105.eklhad@comcast.net>

Karl Dahlke wrote on Wed, Oct 12, 2022 at 06:51:05PM -0400:
> And that's part of my problem.

No worry, thanks for looking into it.
I've replied to points individually below, but I agree with your
assessment.

> xml is more like json

Yes, xml is just a way of writing a tree down.
As far as I understand, HTML and XML share a common ancestry (SGML), but
people built "incorrect" websites (for example not closing <p> tags or
whatever), some browsers said it's ok, then people asked why it's not
working with other browsers, and that became a new standard.
But I might be embellishing this.

> * xml should be syntactically correct.

Yes, I think it's ok to just return an error and no parsed tree for xml
if we see an error.

> * Bad html should be tolerated in xml (<p><p></p></p>)
> * Should not convert <p> to P upper case

Yes, definitely to both of these.
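
To make the three rules above concrete (strict syntax, nesting like
<p><p></p></p> accepted, tag names compared without case conversion),
here is a rough sketch. It is not edbrowse code: xml_tags_balanced() is
a hypothetical helper, and it only understands <name>, </name> and
<name/>, with no comments, cdata or processing instructions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64
#define MAX_NAME 32

/* Hypothetical helper, for illustration only.
 * Returns true if every opening tag has a matching closing tag.
 * Names are compared case-sensitively, and any element may nest
 * inside any other, so <p><p></p></p> is accepted. */
static bool xml_tags_balanced(const char *s)
{
    char stack[MAX_DEPTH][MAX_NAME];
    int depth = 0;

    while ((s = strchr(s, '<')) != NULL) {
        bool closing = false;
        const char *name, *gt;
        size_t n;

        s++;
        if (*s == '/') {
            closing = true;
            s++;
        }
        name = s;
        n = strcspn(s, " \t\r\n/>");    /* length of the tag name */
        gt = strchr(s, '>');
        if (n == 0 || n >= MAX_NAME || !gt)
            return false;               /* malformed tag: give up */

        if (closing) {
            /* must match the innermost open tag exactly */
            if (depth == 0 || strlen(stack[depth - 1]) != n ||
                strncmp(stack[depth - 1], name, n) != 0)
                return false;
            depth--;
        } else if (gt[-1] != '/') {     /* skip self-closing <name/> */
            if (depth == MAX_DEPTH)
                return false;
            memcpy(stack[depth], name, n);
            stack[depth][n] = '\0';
            depth++;
        }
        s = gt + 1;
    }
    return depth == 0;                  /* leftover open tags are an error */
}
```

On a false return the caller would report an error and hand back no
tree at all, rather than trying to repair the input as the html path
does.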

> * The {cdata{ section we should only pull that out for xml.

I think so, it doesn't look like the html parser in firefox does
anything with it, and we've been ignoring it in html all the time, so
let's keep ignoring it in html.
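
As a sketch of what "only pull that out for xml" could look like,
assuming a flag like the isXML you mention below: cdata_section() is a
made-up name, not an existing edbrowse function, and the real scanner
may track positions quite differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * If we are scanning xml and s starts with "<![CDATA[", store the raw
 * text and its length in *body and *len and return a pointer just past
 * the closing "]]>". Return NULL in html mode, when s is not a cdata
 * section, or when the section is never closed. */
static const char *cdata_section(const char *s, bool is_xml,
                                 const char **body, size_t *len)
{
    static const char open[] = "<![CDATA[";
    const char *end;

    if (!is_xml)
        return NULL;                    /* html: keep ignoring cdata */
    if (strncmp(s, open, sizeof(open) - 1) != 0)
        return NULL;
    s += sizeof(open) - 1;
    end = strstr(s, "]]>");
    if (!end)
        return NULL;                    /* unterminated section */
    *body = s;
    *len = (size_t)(end - s);
    return end + 3;
}
```

The text between the markers is returned untouched, so something like
"a < b" inside cdata is not scanned for tags.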

Looking a bit more, I found some more exceptions for xml, e.g. a comment
shows up as "#comment {}" in dumptree on firefox, but that might be a
detail.

> So for start I might need another global variable, not fond of those but you
> know, or maybe a parameter to htmlScanner(), bool isXML, to say which way we
> are scanning, then rules as above based on isXML.

Yes, xml and html are different enough to warrant some separation there.
Since we do not need to interpret xml at all (except cdata that we do
not need in html), it might actually be better to fork off to a
different function altogether, instead of a global variable?
So depending on the DomParser argument (or mime type) we'd either run
htmlScanner or xmlScanner?

I'm not sure which is easier to do. My line of thinking is that if more
differences pop up, the code might end up simpler with two separate
functions.
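
Roughly what I have in mind for the dispatch, as a sketch only:
pick_scanner() and the enum are hypothetical names, and the list of xml
mime types is my assumption, taken from the types that DOMParser's
parseFromString accepts. The caller would then run htmlScanner or
xmlScanner depending on the result.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>    /* strncasecmp (POSIX) */

/* Hypothetical names, for illustration only. */
enum scan_kind { SCAN_HTML, SCAN_XML };

/* Decide which scanner to run from a mime type string.
 * Anything we do not recognize as xml falls back to html. */
static enum scan_kind pick_scanner(const char *mime)
{
    size_t len;

    if (!mime)
        return SCAN_HTML;
    len = strcspn(mime, ";");           /* drop parameters like charset */
    while (len > 0 && mime[len - 1] == ' ')
        len--;

    if (len == 8 && strncasecmp(mime, "text/xml", 8) == 0)
        return SCAN_XML;
    if (len == 15 && strncasecmp(mime, "application/xml", 15) == 0)
        return SCAN_XML;
    /* application/xhtml+xml, image/svg+xml and friends */
    if (len > 4 && strncasecmp(mime + len - 4, "+xml", 4) == 0)
        return SCAN_XML;
    return SCAN_HTML;
}
```

Keeping the decision in one place means the two scanners never need to
check a global flag themselves.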

> This is an overview but let me know if I have made it to first base, or if I
> am off in left field.

This sounds good, let's try this way.

I'm not sure how many sites actually manipulate xml in practice (apart
from my work site...), so thank you for spending time on this!
--
Dominique


Thread overview: 5+ messages
2022-10-12 22:51 Karl Dahlke
2022-10-13  0:08 ` Dominique Martinet [this message]
2022-10-13  0:32   ` Karl Dahlke
2022-10-19  8:14     ` Adam Thompson
2022-10-19  9:13       ` Karl Dahlke
