From: Karl Dahlke <eklhad@comcast.net>
To: edbrowse-dev@edbrowse.org
Subject: I don't know shit about xml
Date: Wed, 12 Oct 2022 18:51:05 -0400
Message-ID: <20220912185105.eklhad@comcast.net>
And that's part of my problem.
I read the wikipedia article on it.
That increased my knowledge by about 450%.
I'm sure you know more, so please correct any of what follows, you
know, before I write code that does the wrong thing.
* I used to think xml was more than html, an extension of html, but now
I think it is less than html.
It is more like json. A way to linearly encode a tree of objects.
Then people put meaning on top of it as they wish.
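* xml should be syntactically correct (well-formed), and a parser is
supposed to reject a document that is not.
This is more like javascript, where a syntax error stops the script.
We should not see in the wild the kind of garbage that we must deal
with in html.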
<foo a<b >
<foo bar="Hello no closing quote>
<foo bar=at&t>
<foo><bar></foo></bar>
I'm reading that this stuff shouldn't happen, so if we only had xml in
the wild my scanner would be easier to write. But of course it's mostly
html, where all sorts of errors are permitted because people wrote it by
hand in the 90s and made mistakes, or were just lazy, etc.
* Conversely, some errors in html are semantic, not syntactic, and should
be tolerated in xml.
<p><p></p></p>
That's wrong in html, paragraphs inside paragraphs; the second p closes
the first. I do it, tidy does it, and so on.
But p has no meaning in xml; the p elements, whatever they are, might
nest just fine,
so <p><p></p></p> is a fine construct that should create a
corresponding tree with p as child of p.
* Converting <p> to upper case P because p is a standard html tag:
we shouldn't do that in xml. Names in xml are case-sensitive, so leave
p in lower case.
* The <![CDATA[ ... ]]> section: we should only pull that out for xml.
I guess we could embed one in an html document and verify it is left
alone.
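To make the strictness in the points above concrete, here is a minimal
sketch, not edbrowse code. It assumes a toy subset of xml: only <name>
and </name> tags, no attributes, comments, or CDATA. The function name
xmlTagsBalanced and the two size limits are made up for illustration.
It rejects the crossed tags shown earlier and accepts p nested inside p.

```c
#include <stdbool.h>
#include <string.h>
#include <ctype.h>

#define MAX_DEPTH 64	/* arbitrary limits for this sketch */
#define MAX_NAME 32

/* Return true if every open tag is closed in strict nesting order.
 * Toy subset only: <name> and </name>, nothing else inside a tag. */
bool xmlTagsBalanced(const char *s)
{
	char stack[MAX_DEPTH][MAX_NAME];
	int depth = 0;

	while ((s = strchr(s, '<')) != NULL) {
		bool closing = false;
		char name[MAX_NAME];
		size_t n = 0;

		++s;
		if (*s == '/') {
			closing = true;
			++s;
		}
		while (isalnum((unsigned char)*s) && n < MAX_NAME - 1)
			name[n++] = *s++;
		name[n] = '\0';
		if (n == 0 || *s != '>')
			return false;	/* malformed tag, such as <foo a<b > */
		++s;

		if (!closing) {
			if (depth == MAX_DEPTH)
				return false;	/* too deep for this sketch */
			strcpy(stack[depth++], name);
		} else {
			/* exact compare: xml names are case-sensitive */
			if (depth == 0 || strcmp(stack[depth - 1], name) != 0)
				return false;
			--depth;
		}
	}
	return depth == 0;	/* anything still open is an error */
}
```

With this, xmlTagsBalanced("<p><p></p></p>") is true, while
"<foo><bar></foo></bar>" and "<foo a<b >" are both rejected. An html
scanner cannot be this strict; it has to guess what the author meant.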
So for a start I might need another global variable (not fond of those,
but you know),
or maybe a parameter to htmlScanner(), bool isXML, to say which way we
are scanning,
then apply the rules above based on isXML.
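As a rough sketch of how rules keyed on isXML might look, again not
edbrowse code: the helper names, the short htmlTags list, and the choice
to leave unknown names alone in html mode are all assumptions made for
illustration. Only the isXML flag itself comes from the proposal above.

```c
#include <stdbool.h>
#include <string.h>
#include <ctype.h>

/* Hypothetical short list; a real scanner knows many more html tags. */
static const char *const htmlTags[] = { "p", "div", "span", "a", NULL };

/* Case-insensitive equality, since html tag names ignore case. */
static bool sameNameNoCase(const char *a, const char *b)
{
	while (*a && *b) {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return false;
		++a, ++b;
	}
	return *a == *b;	/* true only if both strings ended together */
}

static bool isHtmlTag(const char *name)
{
	for (int i = 0; htmlTags[i]; ++i)
		if (sameNameNoCase(name, htmlTags[i]))
			return true;
	return false;
}

/* Rewrite name in place to the form stored in the tree. */
void normalizeTagName(char *name, bool isXML)
{
	if (isXML || !isHtmlTag(name))
		return;		/* xml: keep the name exactly as sent */
	for (; *name; ++name)
		*name = (char)toupper((unsigned char)*name);
}

/* Does opening child implicitly close a still-open parent?
 * Only the p-inside-p case from this message is modeled. */
bool openTagClosesParent(const char *parent, const char *child, bool isXML)
{
	if (isXML)
		return false;	/* xml: nest exactly as written */
	return sameNameNoCase(parent, "p") && sameNameNoCase(child, "p");
}
```

In html mode "p" becomes "P" and a second <p> closes the first; in xml
mode the name is untouched and the second p becomes a child of the
first. Passing the flag as a parameter keeps each rule a plain function
of its inputs, which is easier to test than a global.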
Because if we read in xml and I "fix" it, like turning <p><p></p></p>
into <P></P>, the document as received would differ from the document
as sent, silently.
Well, unless somebody turned dbtags on, which a normal person would
never do, so silently.
Then edbrowse would give the wrong answer and nobody would know why.
This is an overview but let me know if I have made it to first base, or
if I am off in left field.
Karl Dahlke