I don't know shit about xml

edbrowse-dev - development list for edbrowse

* I don't know shit about xml
@ 2022-10-12 22:51 Karl Dahlke
  2022-10-13  0:08 ` Dominique Martinet
  0 siblings, 1 reply; 5+ messages in thread
From: Karl Dahlke @ 2022-10-12 22:51 UTC
  To: edbrowse-dev

And that's part of my problem.

I read the wikipedia article on it. 
That increased my knowledge about 450%. 
I'm sure you know more, so please correct any of what follows, you 
know, before I write code that does the wrong thing.

* I used to think xml was more than html, an extension of html, but now 
I think it is less than html. 
It is more like json. A way to linearly encode a tree of objects. 
Then people put meaning on top of it as they wish.

* xml should be syntactically correct. 
This is more like javascript. 
We should not see in the wild the kind of garbage that we must deal 
with in html. 
<foo  a<b   > 
<foo bar="Hello no closing quote> 
<foo bar=at&t> 
<foo><bar></foo></bar> 
I'm reading that that stuff shouldn't happen, so if we only had xml in 
the wild my scanner would be easier to write, but of course it's mostly 
html, where all sorts of errors are permitted cause people wrote it by 
hand in the 90s and made mistakes or were just lazy etc.

* Conversely, some errors in html are semantic not syntax, and should 
be tolerated in xml. 
<p><p></p></p> 
That's wrong in html, paragraphs inside paragraphs, the second p closes 
the first, I do it, tidy does it, and so on, 
but p has no meaning in xml, the p entities, whatever they are, might 
nest just fine, 
so <p><p></p></p> is a fine construct that should create a 
corresponding tree with p as child of p.

* Converting <p> to P upper case because p is a standard html tag, we 
shouldn't do that in xml, leave p in lower case.

* The {cdata{ section we should only pull that out for xml. 
I guess we could embed it in an html document and verify it is left 
alone.
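
A minimal sketch of that last rule, assuming the section in question is 
the standard xml form <![CDATA[ ... ]]>. The helper name skipCdata and its 
signature are hypothetical, not taken from edbrowse. It pulls the section 
out only when isXML is set and otherwise tells the caller to scan the text 
the ordinary way.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not edbrowse code. In xml mode, if s begins with
 * a CDATA section, set *text and *len to its raw content and return a
 * pointer just past the closing "]]>". Otherwise return NULL, meaning
 * the caller should scan s the ordinary way. */
const char *skipCdata(const char *s, bool isXML,
		      const char **text, size_t *len)
{
	static const char opener[] = "<![CDATA[";
	const char *body, *closer;

	assert(s != NULL && text != NULL && len != NULL);
	if (!isXML)
		return NULL;	/* html mode: leave it alone */
	if (strncmp(s, opener, sizeof(opener) - 1) != 0)
		return NULL;	/* not a CDATA section */
	body = s + sizeof(opener) - 1;
	closer = strstr(body, "]]>");
	if (closer == NULL)
		return NULL;	/* unterminated section */
	*text = body;
	*len = (size_t)(closer - body);
	return closer + 3;	/* resume scanning after "]]>" */
}
```

The content comes back as a pointer and a length rather than a copy, so 
markup characters inside the section, such as < and &, are never looked at.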

So for start I might need another global variable, not fond of those 
but you know, 
or maybe a parameter to htmlScanner(), bool isXML, to say which way we 
are scanning, 
then rules as above based on isXML. 
Because if we read in xml and I "fix" it, like turning <p><p></p></p> 
into <P></P>, the "document" would be received differently from the way 
it was sent, silently, 
well unless somebody turned dbtags on, which a normal person would 
never do, so silently, 
and edbrowse would give the wrong answer and nobody knows why.
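
As a rough sketch of what "rules based on isXML" could look like, here are 
two of the rules above written as small helpers. The names 
normalizeTagName and closesOpenTag are hypothetical, and edbrowse's real 
htmlScanner() may be organized quite differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helpers; not taken from edbrowse. */

/* html mode folds tag names to upper case; xml mode leaves them as sent. */
void normalizeTagName(char *name, bool isXML)
{
	assert(name != NULL);
	if (isXML)
		return;
	for (; *name; name++)
		*name = (char)toupper((unsigned char)*name);
}

/* Should an incoming tag implicitly close the currently open one?
 * Only the <p> inside <p> case is modeled here, and only in html mode,
 * where p has paragraph semantics. Names are compared as written in
 * lower case; a real html scanner would compare case-insensitively. */
bool closesOpenTag(const char *openName, const char *incoming, bool isXML)
{
	assert(openName != NULL && incoming != NULL);
	if (isXML)
		return false;	/* xml: p may nest inside p */
	return strcmp(openName, "p") == 0 && strcmp(incoming, "p") == 0;
}
```

With isXML set, <p><p></p></p> keeps its lower case names and builds a tree 
with p as child of p; without it, the second <p> closes the first.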

This is an overview but let me know if I have made it to first base, or 
if I am off in left field.

Karl Dahlke


* Re: I don't know shit about xml
  2022-10-12 22:51 I don't know shit about xml Karl Dahlke
@ 2022-10-13  0:08 ` Dominique Martinet
  2022-10-13  0:32   ` Karl Dahlke
  0 siblings, 1 reply; 5+ messages in thread
From: Dominique Martinet @ 2022-10-13  0:08 UTC
  To: Karl Dahlke; +Cc: edbrowse-dev

Karl Dahlke wrote on Wed, Oct 12, 2022 at 06:51:05PM -0400:
> And that's part of my problem.

No worry, thanks for looking into it.
I've replied to points individually below but I agree with your
assessment.

> xml is more like json

Yes, xml is just a way of writing a tree down.
As far as I understand, HTML was built on top of XML but people built
"incorrect" websites (for example not closing <p> tags or whatever) and
some browsers said it's ok then people asked why it's not working with
other browsers and that became a new standard..
But I might be embellishing this.

> * xml should be syntactically correct.

Yes, I think it's ok to just return an error and no parsed tree for xml
if we see an error.
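
A minimal sketch of that strict behavior, not edbrowse code: a check that
rejects crossed or unclosed tags outright instead of repairing them. The
name tagsNestProperly and the size limits are assumptions for illustration,
and attributes, comments and self-closing tags are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_DEPTH 64	/* arbitrary limits for this sketch */
#define MAX_NAME 32

/* Hypothetical helper: true if every <name> in s is closed by a matching
 * </name> in proper nesting order. Bare tag names only. */
bool tagsNestProperly(const char *s)
{
	char stack[MAX_DEPTH][MAX_NAME];
	int depth = 0;

	assert(s != NULL);
	while ((s = strchr(s, '<')) != NULL) {
		bool closing = (s[1] == '/');
		const char *name = s + (closing ? 2 : 1);
		const char *end = strchr(name, '>');
		size_t len;

		if (end == NULL)
			return false;	/* tag never ends */
		len = (size_t)(end - name);
		if (len == 0 || len >= MAX_NAME)
			return false;
		if (closing) {
			if (depth == 0)
				return false;	/* close without open */
			depth--;
			if (strlen(stack[depth]) != len ||
			    strncmp(stack[depth], name, len) != 0)
				return false;	/* crossed tags */
		} else {
			if (depth == MAX_DEPTH)
				return false;	/* too deep for this sketch */
			memcpy(stack[depth], name, len);
			stack[depth][len] = '\0';
			depth++;
		}
		s = end + 1;
	}
	return depth == 0;	/* anything still open is an error */
}
```

Here <foo><bar></foo></bar> is rejected, while <p><p></p></p> passes, since
nesting p inside p is not a syntax error.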

> * Bad html should be tolerated in xml (<p><p></p></p>)
> * Should not convert <p> to P upper case

Yes, definitely to both of these.

> * The {cdata{ section we should only pull that out for xml.

I think so, it doesn't look like the html parser in firefox does
anything with it, and we've been ignoring it in html all the time, so
let's keep ignoring it in html.

Looking a bit more I found some more exceptions for xml e.g.  shows up as "#comment {}" in dumptree on firefox, but that
might be a detail.

> So for start I might need another global variable, not fond of those but you
> know, or maybe a parameter to htmlScanner(), bool isXML, to say which way we
> are scanning, then rules as above based on isXML.

Yes, xml and html are different enough to warrant some separation there.
Since we do not need to interpret xml at all (except cdata that we do
not need in html), it might actually be better to fork off to a
different function altogether, instead of a global variable?
So depending on DomParser argument (or mime type) we'd either run
htmlScanner or xmlScanner ?

I'm not sure which is easier to do, my line of thinking is that if more
differences pop up the code might end up simpler.
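
Either way the choice has to be made somewhere. A minimal sketch of
deciding from the mime type, assuming the usual xml media types (text/xml,
application/xml, and anything ending in +xml). The name mimeIsXML is
hypothetical and the comparison assumes a lower case type string.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not edbrowse code: should a document with this
 * mime type go through the xml rules rather than the html rules?
 * Assumes the type is already lower case. */
bool mimeIsXML(const char *mime)
{
	size_t len;

	assert(mime != NULL);
	len = strcspn(mime, ";");	/* ignore parameters such as charset */
	while (len > 0 && mime[len - 1] == ' ')
		len--;
	if (len == 8 && strncmp(mime, "text/xml", 8) == 0)
		return true;
	if (len == 15 && strncmp(mime, "application/xml", 15) == 0)
		return true;
	/* structured types such as image/svg+xml */
	return len > 4 && strncmp(mime + len - 4, "+xml", 4) == 0;
}
```

The result could feed an isXML parameter or pick between two scanner
functions; the test is the same in both designs.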

> This is an overview but let me know if I have made it to first base, or if I
> am off in left field.

This sounds good, let's try this way.

I'm not sure how many sites actually manipulate xml in practice (apart
from my work site...), so thank you for spending time on this!
--
Dominique


* I don't know shit about xml
  2022-10-13  0:08 ` Dominique Martinet
@ 2022-10-13  0:32   ` Karl Dahlke
  2022-10-19  8:14     ` Adam Thompson
  0 siblings, 1 reply; 5+ messages in thread
From: Karl Dahlke @ 2022-10-13  0:32 UTC
  To: edbrowse-dev

The scanners have huge overlap, and I expect only minor differences, so 
should keep it as one function. 
All the tag cracking and attribute cracking and &element; cracking and 
building the tree it's all the same. 
I suspect html came first and xml was a direct generalization, by 
throwing away the semantics. For sure one was very quickly on the heels 
of the other.

Karl Dahlke


* Re: I don't know shit about xml
  2022-10-13  0:32   ` Karl Dahlke
@ 2022-10-19  8:14     ` Adam Thompson
  2022-10-19  9:13       ` Karl Dahlke
  0 siblings, 1 reply; 5+ messages in thread
From: Adam Thompson @ 2022-10-19  8:14 UTC
  To: Karl Dahlke; +Cc: edbrowse-dev

On Wed, Oct 12, 2022 at 08:32:37PM -0400, Karl Dahlke wrote:
> The scanners have huge overlap, and I expect only minor differences, so
> should keep it as one function. All the tag cracking and attribute cracking
> and &element; cracking and building the tree it's all the same. I suspect
> html came first and xml was a direct generalization, by throwing away the
> semantics. For sure one was very quickly on the heels of the other.

I appreciate I'm a little late to this discussion but I think (and some
quick research seems to confirm this) that they're both subsets of SGML. To
be more specific, XML is readable by a generic SGML parser whilst some SGML
(i.e. some HTML constructs) will generate errors in XML parsers. In
addition, as previously noted, XML has no inherent semantics whereas HTML
most definitely does.

To add some more confusion, an attempt was made to apply XML strictness to
HTML, called XHTML. This was, as far as I remember, the thing for a while
until HTML5 came along which (I think) went back to the pure SGML basis of
HTML.

Also, as previously noted, there's all the non-standard (and probably
incorrect in SGML though I've not bothered to read the generic standard)
garbage which people wrote (and continue to write) and browsers somehow turn
into something sane.

As such, I expect there to be quite a bit of overlap and the current
direction seems to make sense. In fact, there are other parsers which have
XML and HTML modes (and not just those used in browsers).

Cheers,
Adam.


* I don't know shit about xml
  2022-10-19  8:14     ` Adam Thompson
@ 2022-10-19  9:13       ` Karl Dahlke
  0 siblings, 0 replies; 5+ messages in thread
From: Karl Dahlke @ 2022-10-19  9:13 UTC
  To: edbrowse-dev

Others have also pointed me to sgml, just, you know, if we want to 
understand the evolution of things.

> garbage which people wrote (and continue to write) and browsers somehow turn
> into something sane.

Yes tidy did a lot of this for us, I didn't realize how much until I 
wrote my own html scanner. Ugh. 
I'm still making tweaks now and then. 
And yet, my scanner isn't much bigger than the interface code that 
connected to the tidy library, so there ya go.

> the current direction seems to make sense.

Yes I think so. Thank you. 
xml as received through xhr is now parsed as xml, and that may make a 
difference to some websites. 
We also do some of the cdata parsing and representation, which tidy 
would not be able to do for us. 
So this is the right path.

Karl Dahlke
