I don't know shit about xml

edbrowse-dev - development list for edbrowse

From: Karl Dahlke <eklhad@comcast.net>
To: edbrowse-dev@edbrowse.org
Subject: I don't know shit about xml
Date: Wed, 12 Oct 2022 18:51:05 -0400
Message-ID: <20220912185105.eklhad@comcast.net>

And that's part of my problem.

I read the wikipedia article on it. 
That increased my knowledge about 450%. 
I'm sure you know more, so please correct any of what follows, you 
know, before I write code that does the wrong thing.

* I used to think xml was more than html, an extension of html, but now
I think it is less than html.
It is more like json. A way to linearly encode a tree of objects. 
Then people put meaning on top of it as they wish.

* xml should be syntactically correct. 
This is more like javascript. 
We should not see in the wild the kind of garbage that we must deal 
with in html. 
<foo  a<b   > 
<foo bar="Hello no closing quote> 
<foo bar=at&t> 
<foo><bar></foo></bar> 
I'm reading that this stuff shouldn't happen, so if we only had xml in
the wild my scanner would be easier to write, but of course it's mostly
html, where all sorts of errors are permitted because people wrote it by
hand in the 90s and made mistakes or were just lazy.
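
To make the last example concrete, here is a minimal sketch of the kind of nesting check a strict xml scanner could rely on. It is an illustration only, not edbrowse code: tagsNestProperly(), MAX_DEPTH and MAX_NAME are hypothetical names, and it handles bare tags only (no attributes, comments, or self-closing tags).

```c
#include <stdbool.h>
#include <string.h>

#define MAX_DEPTH 64            /* assumed limit, for this sketch only */
#define MAX_NAME 32             /* assumed limit, for this sketch only */

/* Hypothetical helper: true if every <name> is closed by a matching
 * </name> in the right order. Bare tags only. */
bool tagsNestProperly(const char *s)
{
    char stack[MAX_DEPTH][MAX_NAME];
    int depth = 0;

    while ((s = strchr(s, '<')) != NULL) {
        bool closing = (s[1] == '/');
        const char *name = s + (closing ? 2 : 1);
        const char *end = strchr(name, '>');
        size_t len;

        if (end == NULL)
            return false;       /* unterminated tag */
        len = (size_t)(end - name);
        if (len == 0 || len >= MAX_NAME)
            return false;       /* empty or oversized name */

        if (closing) {
            /* a closing tag must match the innermost open tag */
            if (depth == 0)
                return false;
            depth--;
            if (strlen(stack[depth]) != len ||
                strncmp(stack[depth], name, len) != 0)
                return false;
        } else {
            if (depth == MAX_DEPTH)
                return false;   /* too deep for this sketch */
            memcpy(stack[depth], name, len);
            stack[depth][len] = '\0';
            depth++;
        }
        s = end + 1;
    }
    return depth == 0;          /* everything opened was closed */
}
```

With this, <foo><bar></foo></bar> is rejected outright, where an html scanner would have to guess what the author meant.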

* Conversely, some errors in html are semantic, not syntactic, and should
be tolerated in xml.
<p><p></p></p> 
That's wrong in html, paragraphs inside paragraphs: the second p closes
the first. I do it, tidy does it, and so on.
But p has no meaning in xml; the p elements, whatever they are, might
nest just fine,
so <p><p></p></p> is a fine construct that should create a
corresponding tree with p as child of p.
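
A rough sketch of the difference, assuming a flag that says which rules apply. maxParagraphDepth() is a hypothetical name, it recognizes only the literal tokens <p> and </p>, and it ignores a stray </p> rather than doing whatever a real html parser does with it.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: deepest level of p-inside-p reached while
 * scanning s. Only the literal tokens <p> and </p> are recognized. */
int maxParagraphDepth(const char *s, bool isXML)
{
    int depth = 0, deepest = 0;

    while ((s = strchr(s, '<')) != NULL) {
        if (strncmp(s, "<p>", 3) == 0) {
            /* html rule: a new <p> closes the one already open */
            if (!isXML && depth > 0)
                depth--;
            depth++;
            if (depth > deepest)
                deepest = depth;
            s += 3;
        } else if (strncmp(s, "</p>", 4) == 0) {
            /* simplification: a stray </p> is ignored */
            if (depth > 0)
                depth--;
            s += 4;
        } else {
            s++;                /* some other tag; skip it */
        }
    }
    return deepest;
}
```

For <p><p></p></p> the xml rules reach depth 2, p as child of p, while the html rules never get past depth 1.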

* We convert <p> to upper case P because p is a standard html tag. We
shouldn't do that in xml, where names are case sensitive; leave p in
lower case.
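
As a sketch only: normalizeTagName() is a hypothetical name, and it upper-cases every name in html mode, whereas the real rule would apply to standard html tags only.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: normalize a tag name in place.
 * xml names are left exactly as received; html names are upper-cased
 * (simplification: no check that the tag is a standard html tag). */
void normalizeTagName(char *name, bool isXML)
{
    if (isXML)
        return;
    for (; *name != '\0'; name++)
        *name = (char)toupper((unsigned char)*name);
}
```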

* The <![CDATA[ ... ]]> section: we should only pull that out for xml.
I guess we could embed one in an html document and verify it is left
alone.
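
Assuming the section meant here is the xml CDATA section, pulling it out could look roughly like this. findCdata() is a hypothetical name, and the sketch only finds the first complete section; the caller would invoke it only when scanning xml.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: locate the first CDATA section in s.
 * On success returns a pointer to the raw content and stores its
 * length in *len; returns NULL if there is no complete section. */
const char *findCdata(const char *s, size_t *len)
{
    static const char opener[] = "<![CDATA[";
    static const char closer[] = "]]>";
    const char *start = strstr(s, opener);
    const char *end;

    if (start == NULL)
        return NULL;            /* no section at all */
    start += sizeof(opener) - 1;
    end = strstr(start, closer);
    if (end == NULL)
        return NULL;            /* opened but never closed */
    *len = (size_t)(end - start);
    return start;
}
```

The content between the markers is returned untouched, so a < or & inside it is never treated as markup.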

So for a start I might need another global variable, not fond of those
but you know,
or maybe a parameter to htmlScanner(), bool isXML, to say which way we
are scanning,
then rules as above based on isXML.
Because if we read in xml and I "fix" it, like turning <p><p></p></p>
into <P></P>, the "document" received would differ from the one that
was sent, silently,
well unless somebody turned dbtags on, which a normal person would
never do, so silently,
and edbrowse would give the wrong answer and nobody would know why.

This is an overview but let me know if I have made it to first base, or 
if I am off in left field.

Karl Dahlke

