edbrowse-dev - development list for edbrowse
 help / color / mirror / Atom feed
* [Edbrowse-dev] tidy debug tree, and a js script
@ 2015-08-29 10:04 Karl Dahlke
  2015-08-29 13:25 ` Adam Thompson
  0 siblings, 1 reply; 12+ messages in thread
From: Karl Dahlke @ 2015-08-29 10:04 UTC (permalink / raw)
  To: Edbrowse-dev

Debug prints are in, and seem to work.
Thanks Kevin.

Here is my test page, that I was worried about.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>jf test </title></head>
<body>
hello world
<script type=text/javascript>
document.writeln("This is <A href=http://edbrowse.org>our website</A>");
</script>
</body></html>

I browse with db6, there  is lots of js debugging,
I'll leave that out, here are the relevant lines.

line 7 column 67: '<' + '/' + letter not allowed here
Node(0): Text
Text: hello world 
Node(0): script
type = text/javascript
Node(1): Text
Text: document.writeln("This is &lt;A href=http://edbrowse.org&gt;our
website&lt;\/A&gt;");
#  end of tidy debug output, next stuff is ours
execute jf at 6
< side effects
w{This is <A href=http://edbrowse.org>our website</A>
`~@}
< ok
execution complete
docwrite 62 bytes
<<
This is <A href=http://edbrowse.org>our website</A>
>>
anchorSwap 4
anchors unframed
whitespace combined

Right off the bat I'm concerned becausee tidy shows an error
where there is no error.
It is trying to interpret the </a> tag in the string, in the script,
and it shouldn't be doing that at all.
Next I look at the text node under the script,
the text that is to be passed to the js engine, and it has been html escaped.
<a> is now &lt;a&gt;
Why?
That would totally screw things up.
Is it escaped and interpreted for the benefit of printing, for us,
or is it done by cleanup?
If the latter then we can't use tidy5 unless this is fixed.
This is a show stopper.
They can't be mucking with the contents of a js script at all.
In fact they shouldn't muck with the contents of any script.

Karl Dahlke

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Edbrowse-dev] tidy debug tree, and a js script
  2015-08-29 10:04 [Edbrowse-dev] tidy debug tree, and a js script Karl Dahlke
@ 2015-08-29 13:25 ` Adam Thompson
  2015-08-29 14:36   ` Karl Dahlke
  0 siblings, 1 reply; 12+ messages in thread
From: Adam Thompson @ 2015-08-29 13:25 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

[-- Attachment #1: Type: text/plain, Size: 2660 bytes --]

On Sat, Aug 29, 2015 at 06:04:04AM -0400, Karl Dahlke wrote:
> Debug prints are in, and seem to work.
> Thanks Kevin.

Well done all for the work, appologies for being somewhat inactive recently but
I've been busy with the day job and simply haven't had any time to do anything
computing related outside of the office.

> Here is my test page, that I was worried about.
> 
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
> <html>
> <head><title>jf test </title></head>
> <body>
> hello world
> <script type=text/javascript>
> document.writeln("This is <A href=http://edbrowse.org>our website</A>");
> </script>
> </body></html>
> 
> I browse with db6, there  is lots of js debugging,
> I'll leave that out, here are the relevant lines.
> 
> line 7 column 67: '<' + '/' + letter not allowed here
> Node(0): Text
> Text: hello world 
> Node(0): script
> type = text/javascript
> Node(1): Text
> Text: document.writeln("This is &lt;A href=http://edbrowse.org&gt;our
> website&lt;\/A&gt;");
> #  end of tidy debug output, next stuff is ours
> execute jf at 6
> < side effects
> w{This is <A href=http://edbrowse.org>our website</A>
> `~@}
> < ok
> execution complete
> docwrite 62 bytes
> <<
> This is <A href=http://edbrowse.org>our website</A>
> >>
> anchorSwap 4
> anchors unframed
> whitespace combined
> 
> Right off the bat I'm concerned becausee tidy shows an error
> where there is no error.
> It is trying to interpret the </a> tag in the string, in the script,
> and it shouldn't be doing that at all.

Actually, yes it should. This is one of the corner cases with html;
everything within a script tag is not parsed except the sequence </ which ends the tag.
At least according to what I've read.
If you really don't want your data parsed at all, use a cdata section,
then access that with js (it's parsed into the DOM so this works).
> Next I look at the text node under the script,
> the text that is to be passed to the js engine, and it has been html escaped.
> <a> is now &lt;a&gt;
> Why?
> That would totally screw things up.
> Is it escaped and interpreted for the benefit of printing, for us,
> or is it done by cleanup?
> If the latter then we can't use tidy5 unless this is fixed.
> This is a show stopper.
> They can't be mucking with the contents of a js script at all.
> In fact they shouldn't muck with the contents of any script.

Actually, see above for how script tags should behave. I think it's actually our current parser which is slightly broken.
Not sure about the escaping part though.

Cheers,
Adam.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Edbrowse-dev]  tidy debug tree, and a js script
  2015-08-29 13:25 ` Adam Thompson
@ 2015-08-29 14:36   ` Karl Dahlke
  2015-08-29 14:58     ` Adam Thompson
  0 siblings, 1 reply; 12+ messages in thread
From: Karl Dahlke @ 2015-08-29 14:36 UTC (permalink / raw)
  To: Edbrowse-dev

> If we want to print tidy warnings for local files fair enough,
> but lets make it a user's choice,

We can come up with a better convention,
similar to nojs and novs in the config file,
I'm fine with that, we're just playing so to speak.
But I'm not going to worry too much abount conventions
until we know tidy5 is usable, which I'm not sure it is.

You're talking about what the specs say, I'm talking about
what's on the internet that we have to honor.
I'm sure I could find document.write with tags in it if I look around.
But ok, if you didn't like my last sample script then try this one.
I know this is to spec.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>jf test </title></head>
<body>
hello world
<script type=text/javascript>
if(3 < 4) alert("alls well");
</script>
</body></html>

Turn off js, we don't need it here, set db6, browse,
and see that < is turned into &lt;.
That would be a syntax error.
Tidy is mucking withth the script.

Karl Dahlke

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Edbrowse-dev] tidy debug tree, and a js script
  2015-08-29 14:36   ` Karl Dahlke
@ 2015-08-29 14:58     ` Adam Thompson
  2015-08-29 16:05       ` Karl Dahlke
  0 siblings, 1 reply; 12+ messages in thread
From: Adam Thompson @ 2015-08-29 14:58 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

[-- Attachment #1: Type: text/plain, Size: 1694 bytes --]

On Sat, Aug 29, 2015 at 10:36:09AM -0400, Karl Dahlke wrote:
> > If we want to print tidy warnings for local files fair enough,
> > but lets make it a user's choice,
> 
> We can come up with a better convention,
> similar to nojs and novs in the config file,
> I'm fine with that, we're just playing so to speak.
> But I'm not going to worry too much abount conventions
> until we know tidy5 is usable, which I'm not sure it is.
> 
> You're talking about what the specs say, I'm talking about
> what's on the internet that we have to honor.
> I'm sure I could find document.write with tags in it if I look around.

No, this just doesn't work in anyone's conforming implementation.
Apparently it's a difference between html and xhtml or something,
I'm not entirely sure.

> But ok, if you didn't like my last sample script then try this one.
> I know this is to spec.
> 
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
> <html>
> <head><title>jf test </title></head>
> <body>
> hello world
> <script type=text/javascript>
> if(3 < 4) alert("alls well");
> </script>
> </body></html>
> 
> Turn off js, we don't need it here, set db6, browse,
> and see that < is turned into &lt;.
> That would be a syntax error.
> Tidy is mucking withth the script.

Haven't ran the edbrowse test yet, but downloaded,compiled and installed tidy5
and ran your script through their executable.
The script came out perfectly fine (and yes I know tidy consumed the html
because it added the generator field). Not sure what we're doing differently.
I wonder if it's just a printing thing or something.

Cheers,
Adam.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Edbrowse-dev]   tidy debug tree, and a js script
  2015-08-29 14:58     ` Adam Thompson
@ 2015-08-29 16:05       ` Karl Dahlke
  2015-08-30  1:15         ` Kevin Carhart
  2015-08-30  8:26         ` Adam Thompson
  0 siblings, 2 replies; 12+ messages in thread
From: Karl Dahlke @ 2015-08-29 16:05 UTC (permalink / raw)
  To: Edbrowse-dev

> If we want to print tidy warnings for local files fair enough,
> but lets make it a user's choice,

Actually I'm kinda stupid here.
I built these mashinations for local files or remote files,
and exceptions through some mechanism, filename or config options etc,
but really all I need to do is print the tidy errors at debugLevel 3 or above.
None of our users are going to want to see them,
except a rare few who write their own html,
and they can get along with db3 to check their sourcefiles.
So that's what I should do, but still not what I'm worried about.

> I wonder if it's just a printing thing or something.

Oh I hope it is, but I fear it's not.
(This would be a great time for me to be wrong.)
We aren't passing the contents of script.text to the js engine yet,
but some day we will, when tidy replaces all my software,
and on that day we will be passing

if(3 &lt; 4)

instead of

if(3 < 4)

At least that's how it appears.

Karl Dahlke

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Edbrowse-dev] tidy debug tree, and a js script
  2015-08-29 16:05       ` Karl Dahlke
@ 2015-08-30  1:15         ` Kevin Carhart
  2015-08-30  8:26         ` Adam Thompson
  1 sibling, 0 replies; 12+ messages in thread
From: Kevin Carhart @ 2015-08-30  1:15 UTC (permalink / raw)
  To: Edbrowse-dev



Hi Adam, and I'm glad that the recent round from yesterday til 
now is going well!  Exciting stuff!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Edbrowse-dev] tidy debug tree, and a js script
  2015-08-29 16:05       ` Karl Dahlke
  2015-08-30  1:15         ` Kevin Carhart
@ 2015-08-30  8:26         ` Adam Thompson
  2015-08-30  9:30           ` Adam Thompson
  1 sibling, 1 reply; 12+ messages in thread
From: Adam Thompson @ 2015-08-30  8:26 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

[-- Attachment #1: Type: text/plain, Size: 1388 bytes --]

On Sat, Aug 29, 2015 at 12:05:50PM -0400, Karl Dahlke wrote:
> > If we want to print tidy warnings for local files fair enough,
> > but lets make it a user's choice,
> 
> Actually I'm kinda stupid here.
> I built these mashinations for local files or remote files,
> and exceptions through some mechanism, filename or config options etc,
> but really all I need to do is print the tidy errors at debugLevel 3 or above.
> None of our users are going to want to see them,
> except a rare few who write their own html,
> and they can get along with db3 to check their sourcefiles.
> So that's what I should do, but still not what I'm worried about.

I agree that's the correct approach.

> > I wonder if it's just a printing thing or something.
> 
> Oh I hope it is, but I fear it's not.
> (This would be a great time for me to be wrong.)
> We aren't passing the contents of script.text to the js engine yet,
> but some day we will, when tidy replaces all my software,
> and on that day we will be passing
> 
> if(3 &lt; 4)
> 
> instead of
> 
> if(3 < 4)
> 
> At least that's how it appears.

As I said, this doesn't tally with what I'm seeing from the actual tidy tool
(the tidy 5 version) which uses the same library as we do,
so I suspect we're doing something incorrectly somewhere.
I'll have a look at that code to see what they're doing differently.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Edbrowse-dev] tidy debug tree, and a js script
  2015-08-30  8:26         ` Adam Thompson
@ 2015-08-30  9:30           ` Adam Thompson
  2015-08-30  9:49             ` Karl Dahlke
  0 siblings, 1 reply; 12+ messages in thread
From: Adam Thompson @ 2015-08-30  9:30 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

[-- Attachment #1: Type: text/plain, Size: 1699 bytes --]

On Sun, Aug 30, 2015 at 09:26:51AM +0100, Adam Thompson wrote:
> On Sat, Aug 29, 2015 at 12:05:50PM -0400, Karl Dahlke wrote:
> > > If we want to print tidy warnings for local files fair enough,
> > > but lets make it a user's choice,
> > 
> > Actually I'm kinda stupid here.
> > I built these mashinations for local files or remote files,
> > and exceptions through some mechanism, filename or config options etc,
> > but really all I need to do is print the tidy errors at debugLevel 3 or above.
> > None of our users are going to want to see them,
> > except a rare few who write their own html,
> > and they can get along with db3 to check their sourcefiles.
> > So that's what I should do, but still not what I'm worried about.
> 
> I agree that's the correct approach.

Thanks for making this change.

> > > I wonder if it's just a printing thing or something.
> > 
> > Oh I hope it is, but I fear it's not.
> > (This would be a great time for me to be wrong.)
> > We aren't passing the contents of script.text to the js engine yet,
> > but some day we will, when tidy replaces all my software,
> > and on that day we will be passing
> > 
> > if(3 &lt; 4)
> > 
> > instead of
> > 
> > if(3 < 4)
> > 
> > At least that's how it appears.
> 
> As I said, this doesn't tally with what I'm seeing from the actual tidy tool
> (the tidy 5 version) which uses the same library as we do,
> so I suspect we're doing something incorrectly somewhere.
> I'll have a look at that code to see what they're doing differently.

It turns out what we wanted was tidyNodeGetValue which doesn't do escaping
unlike tidyNodeGetText. I've made, tested and pushed this change.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Edbrowse-dev]    tidy debug tree, and a js script
  2015-08-30  9:30           ` Adam Thompson
@ 2015-08-30  9:49             ` Karl Dahlke
  2015-08-30 10:02               ` Adam Thompson
  0 siblings, 1 reply; 12+ messages in thread
From: Karl Dahlke @ 2015-08-30  9:49 UTC (permalink / raw)
  To: Edbrowse-dev

> It turns out what we wanted was tidyNodeGetValue which doesn't do escaping

Well will you look at that!
Like I say, it's a great time to be wrong.
So I guess we're back in the saddle again.
Do keep an eye on us over here, and keep us pointed in the right direction.
Thanks.

Karl Dahlke

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Edbrowse-dev] tidy debug tree, and a js script
  2015-08-30  9:49             ` Karl Dahlke
@ 2015-08-30 10:02               ` Adam Thompson
  2015-08-30 10:31                 ` Karl Dahlke
  0 siblings, 1 reply; 12+ messages in thread
From: Adam Thompson @ 2015-08-30 10:02 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

[-- Attachment #1: Type: text/plain, Size: 670 bytes --]

On Sun, Aug 30, 2015 at 05:49:48AM -0400, Karl Dahlke wrote:
> > It turns out what we wanted was tidyNodeGetValue which doesn't do escaping
> 
> Well will you look at that!
> Like I say, it's a great time to be wrong.
> So I guess we're back in the saddle again.
> Do keep an eye on us over here, and keep us pointed in the right direction.
> Thanks.

Will do. Any chance you could have a go at converting some of the parsing logic
today (I've got another day off work tomorrow as it's a public holiday over
here so will be able to review)? I'd do it myself but would need to spend some
time getting my head around our current parser in order to unwind it.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Edbrowse-dev]     tidy debug tree, and a js script
  2015-08-30 10:02               ` Adam Thompson
@ 2015-08-30 10:31                 ` Karl Dahlke
  2015-08-30 11:16                   ` Adam Thompson
  0 siblings, 1 reply; 12+ messages in thread
From: Karl Dahlke @ 2015-08-30 10:31 UTC (permalink / raw)
  To: Edbrowse-dev

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

> Any chance you could have a go at converting some of the parsing logic

Wow - I'm good but not that good.
It's a pretty big project.
I don't want to move too slowly, having done almost nothing on edbrowse
in the past 6 months, but I don't want to run recklessly fast either.
Need to pass designs by you guys before coding etc.
And there are still some big questions to answer, like is tidy5
the right path, or perhaps libhubbub, which could be part
of a larger browser effort, larger than just parsing html.
netsurf-browser.org

I'm running another sanity check on tidy.
This generates an error because & is not escaped,
and yes it probably should be.
<A href="www.x.com/path?a=b&ccopy=d>
It even converts &copy into the copyright symbol,
now part of the url.

So ok, maybe I did a bad test because I'm not following spec
but the internet doesn't follow spec either, not all the time.
Look at the raw html from www.sciam.com
It contains these two lines, on the same home page.

<li><a href="https://www.scientificamerican.com/store/subscribe/scientific-american-all-access/?WT.mc_id=SA_Webstore_SCA_AllAccess_SubCenter&amp;responseKey=W3S03RD00" target="_blank">Subscribe to All Access <span class="red">&raquo;</span></a></li>
<li><a href="https://w1.buysub.com/servlet/OrdersGateway?cds_mag_code=SCA&cds_page_id=185258&cds_response_key=I5S03R00B" target="_blank">Subscribe to Print <span class="red">&raquo;</span></a></li>

The first one has & escaped, the second one does not.
So ok just wanted to make sure tidy is handling these two cases properly,
and it is.
Happily, my parser also handles these cases properly.
I must have run into this at some point.

I'll continue testing. Assuming I uncover no serious problems,
I think the next step is to enhance our edbrowse node, with enough attributes
to faithfully copy the information from a tidy node.
We have some of the attributs but not enough.
A blatent omission is a text string,
because we never represented text nodes before.
We'll need this, and child pointers,
and a list of attribute value pairs, and other things.
I'll post more on this later.

Karl Dahlke

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Edbrowse-dev] tidy debug tree, and a js script
  2015-08-30 10:31                 ` Karl Dahlke
@ 2015-08-30 11:16                   ` Adam Thompson
  0 siblings, 0 replies; 12+ messages in thread
From: Adam Thompson @ 2015-08-30 11:16 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

[-- Attachment #1: Type: text/plain, Size: 3408 bytes --]

On Sun, Aug 30, 2015 at 06:31:11AM -0400, Karl Dahlke wrote:
> > Any chance you could have a go at converting some of the parsing logic
> 
> Wow - I'm good but not that good.
> It's a pretty big project.
> I don't want to move too slowly, having done almost nothing on edbrowse
> in the past 6 months, but I don't want to run recklessly fast either.
> Need to pass designs by you guys before coding etc.
> And there are still some big questions to answer, like is tidy5
> the right path, or perhaps libhubbub, which could be part
> of a larger browser effort, larger than just parsing html.
> netsurf-browser.org

I looked into libhubbub before going down the tidy5 route,
but it seems to have been sucked into netsurf fully now so I gave up on it.

> I'm running another sanity check on tidy.
> This generates an error because & is not escaped,
> and yes it probably should be.
> <A href="www.x.com/path?a=b&ccopy=d>
I suspect the error here is that we're missing the closing quote in this example actually.
> It even converts &copy into the copyright symbol,
> now part of the url.

Not sure about that, but this isn't a correct attribute anyways and I've never
seen this on the internet.

> So ok, maybe I did a bad test because I'm not following spec
> but the internet doesn't follow spec either, not all the time.
> Look at the raw html from www.sciam.com
> It contains these two lines, on the same home page.
> 
> <li><a href="https://www.scientificamerican.com/store/subscribe/scientific-american-all-access/?WT.mc_id=SA_Webstore_SCA_AllAccess_SubCenter&amp;responseKey=W3S03RD00" target="_blank">Subscribe to All Access <span class="red">&raquo;</span></a></li>
> <li><a href="https://w1.buysub.com/servlet/OrdersGateway?cds_mag_code=SCA&cds_page_id=185258&cds_response_key=I5S03R00B" target="_blank">Subscribe to Print <span class="red">&raquo;</span></a></li>
> 
> The first one has & escaped, the second one does not.
> So ok just wanted to make sure tidy is handling these two cases properly,
> and it is.
> Happily, my parser also handles these cases properly.
> I must have run into this at some point.

Agreed about html not following spec;
that's why I wanted to get away from a home grown parser.

> I'll continue testing. Assuming I uncover no serious problems,
> I think the next step is to enhance our edbrowse node, with enough attributes
> to faithfully copy the information from a tidy node.
> We have some of the attributs but not enough.
> A blatent omission is a text string,
> because we never represented text nodes before.
> We'll need this, and child pointers,
> and a list of attribute value pairs, and other things.
> I'll post more on this later.

Yeah, I think we'll need to revise our tag nodes signifficantly eventually,
but as a stop gap measure I was thinking to simply run through the tidy tree
and copy what we can into our existing structure.
It's far from perfect but we need to rewrite the DOM anyway at some stage.
That's going to be a really large project requiring a large set of changes to
how our tag system works. Thus I don't see the point in enhancing our existing
tags too much, though if it naturally heads towards a functional DOM
representation that's good with me.
At least this way we get more html support and probably get rid of some parsing issues into the bargain.

Cheers,
Adam.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2015-08-30 11:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-29 10:04 [Edbrowse-dev] tidy debug tree, and a js script Karl Dahlke
2015-08-29 13:25 ` Adam Thompson
2015-08-29 14:36   ` Karl Dahlke
2015-08-29 14:58     ` Adam Thompson
2015-08-29 16:05       ` Karl Dahlke
2015-08-30  1:15         ` Kevin Carhart
2015-08-30  8:26         ` Adam Thompson
2015-08-30  9:30           ` Adam Thompson
2015-08-30  9:49             ` Karl Dahlke
2015-08-30 10:02               ` Adam Thompson
2015-08-30 10:31                 ` Karl Dahlke
2015-08-30 11:16                   ` Adam Thompson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).