edbrowse-dev - development list for edbrowse
* I don't know shit about xml
@ 2022-10-12 22:51 Karl Dahlke
  2022-10-13  0:08 ` Dominique Martinet
  0 siblings, 1 reply; 5+ messages in thread
From: Karl Dahlke @ 2022-10-12 22:51 UTC (permalink / raw)
  To: edbrowse-dev

And that's part of my problem.

I read the wikipedia article on it. 
That increased my knowledge about 450%. 
I'm sure you know more, so please correct any of what follows, you 
know, before I write code that does the wrong thing.

* I used to think xml was more than html, an extension of html, but now 
I think it is less than html. 
It is more like json: a way to linearly encode a tree of objects. 
Then people put meaning on top of it as they wish.

* xml should be syntactically correct. 
This is more like javascript. 
We should not see in the wild the kind of garbage that we must deal 
with in html. 
<foo  a<b   > 
<foo bar="Hello no closing quote> 
<foo bar=at&t> 
<foo><bar></foo></bar> 
I'm reading that this stuff shouldn't happen in xml, so if we only had xml 
in the wild my scanner would be easier to write, but of course it's mostly 
html, where all sorts of errors are permitted because people wrote it by 
hand in the 90s and made mistakes or were just lazy etc.

* Conversely, some errors in html are semantic, not syntactic, and should 
be tolerated in xml. 
<p><p></p></p> 
That's wrong in html, paragraphs inside paragraphs; the second p closes 
the first. I do it, tidy does it, and so on. 
But p has no meaning in xml; the p elements, whatever they are, might 
nest just fine, 
so <p><p></p></p> is a fine construct that should create a 
corresponding tree with p as child of p.

* Converting <p> to upper case P because p is a standard html tag: we 
shouldn't do that in xml, where names are case sensitive. Leave p in 
lower case.

* The <![CDATA[ ... ]]> section: we should only pull that out for xml. 
I guess we could embed it in an html document and verify it is left 
alone.
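
To make the second and third points concrete, here is a minimal sketch of a strict nesting check of the kind an xml-only scanner could rely on. It is not edbrowse code: tagsBalanced() and the two size limits are made-up names, and it ignores attributes, comments, self-closing tags and CDATA to stay short. Names longer than the buffer are truncated, which a real scanner would have to handle properly.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_DEPTH 64	/* assumed limit, for this sketch only */
#define MAX_NAME 32	/* assumed limit, for this sketch only */

/* Hypothetical helper: true if every closing tag in s matches the most
 * recently opened tag and nothing is left open at the end. */
bool tagsBalanced(const char *s)
{
	char stack[MAX_DEPTH][MAX_NAME];
	int depth = 0;

	while ((s = strchr(s, '<')) != NULL) {
		bool closing = false;
		char name[MAX_NAME];
		size_t n = 0;

		++s;
		if (*s == '/') {
			closing = true;
			++s;
		}
		/* the tag name runs up to '>' or a space */
		while (*s && *s != '>' && *s != ' ' && n < MAX_NAME - 1)
			name[n++] = *s++;
		name[n] = '\0';
		if (n == 0)
			return false;	/* "<>" or "</>" */

		/* skip whatever else is inside the tag */
		s = strchr(s, '>');
		if (s == NULL)
			return false;	/* tag never ends */

		if (!closing) {
			if (depth == MAX_DEPTH)
				return false;	/* too deep for this sketch */
			strcpy(stack[depth++], name);
		} else {
			/* xml names are case sensitive, so plain strcmp */
			if (depth == 0 || strcmp(stack[depth - 1], name) != 0)
				return false;
			--depth;
		}
	}
	return depth == 0;
}
```

With this check, <foo><bar></foo></bar> is rejected because </foo> arrives while bar is still open, while <p><p></p></p> passes, since nothing here knows or cares what p means.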

So for a start I might need another global variable, not fond of those 
but you know, 
or maybe a parameter to htmlScanner(), bool isXML, to say which way we 
are scanning, 
then rules as above based on isXML. 
Because if we read in xml and I "fix" it, like turning <p><p></p></p> 
into <P></P>, the "document" would be received differently from the way 
it was sent, silently, 
well unless somebody turned dbtags on, which a normal person would 
never do, so silently, 
and edbrowse would give the wrong answer and nobody knows why.
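
A rough sketch of how the rules above might hang off an isXML flag. Only the flag itself comes from this message; normalizeTagName(), impliesClose() and findCdata() are hypothetical names, not existing edbrowse functions, and only the <p> inside <p> case is modeled.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* html tag names are case insensitive, so fold them to upper case;
 * xml names are case sensitive and are left exactly as sent. */
void normalizeTagName(char *name, bool isXML)
{
	if (isXML)
		return;
	for (; *name; ++name)
		*name = (char)toupper((unsigned char)*name);
}

/* Does opening newTag implicitly close the currently open openTag?
 * Expects names already passed through normalizeTagName(). */
bool impliesClose(const char *openTag, const char *newTag, bool isXML)
{
	if (isXML)
		return false;	/* xml nests whatever it is given */
	return strcmp(openTag, "P") == 0 && strcmp(newTag, "P") == 0;
}

/* Find the text of the first CDATA section, in xml mode only.
 * Returns a pointer to the text and stores its length in *len,
 * or returns NULL in html mode or if no complete section exists. */
const char *findCdata(const char *s, bool isXML, size_t *len)
{
	static const char opener[] = "<![CDATA[";
	const char *start, *end;

	if (!isXML)
		return NULL;
	start = strstr(s, opener);
	if (start == NULL)
		return NULL;
	start += sizeof(opener) - 1;
	end = strstr(start, "]]>");
	if (end == NULL)
		return NULL;
	*len = (size_t)(end - start);
	return start;
}
```

Passing the flag as a parameter rather than reading a global keeps each rule testable on its own: the same input can be run through both modes and the results compared.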

This is an overview but let me know if I have made it to first base, or 
if I am off in left field.

Karl Dahlke



Thread overview: 5+ messages
2022-10-12 22:51 I don't know shit about xml Karl Dahlke
2022-10-13  0:08 ` Dominique Martinet
2022-10-13  0:32   ` Karl Dahlke
2022-10-19  8:14     ` Adam Thompson
2022-10-19  9:13       ` Karl Dahlke
