edbrowse-dev - development list for edbrowse
* [Edbrowse-dev] script tags in scripts
From: Tyler Spivey @ 2015-09-11  0:17 UTC (permalink / raw)
  To: edbrowse-dev

If we do something like:
<script>document.write("<script></s");document.write("cript>")</script>
<p>paragraph</p>

Turn off js and browse; the paragraph will be ignored.

For another example, on fanfiction.net, all the stories disappear.


* [Edbrowse-dev]  script tags in scripts
From: Karl Dahlke @ 2015-09-11  1:10 UTC (permalink / raw)
  To: edbrowse-dev

I'm fairly certain, and fairly concerned, that this is a tidy bug
that we can't get around.
Source as follows.

<body>
<script>document.write("<script></s");document.write("cript>")</script>
<p>paragraph</p>
</body>

db6
js
b

undoCompare no undo map
line 1 column 1: missing <!DOCTYPE> declaration
line 2 column 34: '<' + '/' + letter not allowed here
line 2 column 69: '<' + '/' + letter not allowed here
line 3 column 14: '<' + '/' + letter not allowed here
line 4 column 5: '<' + '/' + letter not allowed here
line 2 column 1: missing </script>
line 2 column 1: missing </script>
line 1 column 1: inserting missing 'title' element
Node(0): Root {
Node(1): DOCTYPE {
@PUBLIC = (null)
}
Node(1): html {
Node(2): head {
Node(3): meta {
@name = generator
@content = HTML Tidy for HTML5 for Linux/x86 version 5.1.2
}
Node(3): title {
}
}
Node(2): body {
Node(3): script {
Node(4): Text {
Text: document.write("<script><\/s");document.write("cript>")<\/script>
<p>paragraph<\/p>
<\/body>

}
}
}
}
}
||

So you see all the text is subsumed under the script tag.
And slashes are escaped.
Tidy doesn't grasp the </script> terminator.
Thoughts?

Karl Dahlke


* Re: [Edbrowse-dev] script tags in scripts
From: Kevin Carhart @ 2015-09-11  5:28 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: edbrowse-dev



Interesting. Karl, does your certainty mean that the distinction
between the two tags is fundamentally unknowable for a parser?

I guess one good sign is that there appears to be a lot of
past literature on this issue, on Tidy listservs.  Including
one from 2006 called "Tidy barfs on split <SCRIPT> tags".
Unless it's an impossible problem, maybe these past threads
will contain something we can use.  I will read some of this
correspondence.

This reminds me of other gnarly situations with literals.
For instance, when javascript strings hold regular expression criteria
that consist solely of a close brace or close parenthesis, and I come
along and make assumptions about pairs of braces, the unmatched literal
gets me out of sync.
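
A minimal sketch of that failure mode, with function names made up for illustration: counting every brace goes out of balance when a close brace lives inside a string, while a counter that skips quoted strings stays balanced. It assumes plain quoted strings only; regex literals, template literals and comments are not handled.

```javascript
// Net brace depth, counting every "{" and "}" in the text.
function naiveDepth(code) {
  let depth = 0;
  for (const ch of code) {
    if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
    }
  }
  return depth;
}

// Same count, but characters inside quoted string literals are skipped.
// Assumption: only '...' and "..." strings exist in the input.
function depthOutsideStrings(code) {
  let depth = 0;
  let quote = null; // current string delimiter, or null when outside a string
  for (let i = 0; i < code.length; i++) {
    const ch = code[i];
    if (quote !== null) {
      if (ch === "\\") {
        i++; // skip the escaped character
      } else if (ch === quote) {
        quote = null;
      }
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
    }
  }
  return depth;
}

const sample = 'function f(s) { return s.match(new RegExp("}")); }';
console.log(naiveDepth(sample));          // -1, thrown off by the literal
console.log(depthOutsideStrings(sample)); // 0
```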

Kevin



--------
Kevin Carhart * 415 225 5306 * The Ten Ninety Nihilists


* Re: [Edbrowse-dev] script tags in scripts
From: Adam Thompson @ 2015-09-11  7:39 UTC (permalink / raw)
  To: Kevin Carhart; +Cc: Karl Dahlke, edbrowse-dev


On Thu, Sep 10, 2015 at 10:28:03PM -0700, Kevin Carhart wrote:
> 
> Interesting.. Karl, does your certainty mean that you are saying
> that the distinction between the two tags is fundamentally
> unknowable for a parser?

It's certainly difficult if the parser isn't also capable of parsing the
scripting language within the script tags.
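
For comparison, here is a rough sketch of the rule the HTML5 spec describes for script content: the element ends at the first `</script` (any case) followed by whitespace, `/` or `>`, and the JavaScript itself is never parsed. HTML 4 was stricter and treated the first `</` followed by a letter as the end, which looks like what tidy's warnings are echoing. The function name is made up, the spec's extra handling of `<!--` inside scripts is left out, and none of this is meant to describe how tidy5 or edbrowse is implemented.

```javascript
// Simplified HTML5-style search for the end of a <script> element.
// Returns the index of the terminating "</script", or -1 if there is none.
// Assumption: the "<!--" escape states from the HTML spec are ignored.
function findScriptEnd(html, contentStart) {
  const endTag = /<\/script(?=[\t\n\f\r \/>])/gi;
  endTag.lastIndex = contentStart; // start searching inside the script
  const match = endTag.exec(html);
  return match ? match.index : -1;
}

const page =
  "<body>\n" +
  '<script>document.write("<script></s");document.write("cript>")</script>\n' +
  "<p>paragraph</p>\n" +
  "</body>\n";

const start = page.indexOf("<script>") + "<script>".length;
const end = findScriptEnd(page, start);
const scriptText = page.slice(start, end); // both document.write calls
const rest = page.slice(end);              // "</script>" and the paragraph

console.log(scriptText);
console.log(rest.includes("<p>paragraph</p>")); // true
```

Under this rule the `</s"` fragment does not end the script, so the paragraph stays outside it.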

> I guess one good sign is that there appears to be a lot of
> past literature on this issue, on Tidy listservs.  Including
> one from 2006 called "Tidy barfs on split <SCRIPT> tags".
> Unless it's an impossible problem, maybe these past threads
> will contain something we can use.  I will read some of this
> correspondence.

I've also run the example through the main tidy html5 version and it also spits it out.

> This reminds me of other gnarly situations with literals.
> For instance, when there are regular expression criteria in
> javascript strings that contain just solely a close brace or close
> parenthesis, if I come along and want to make
> assumptions about pairs of braces, the unmatched literal gets me
> out of sync.

Agreed, literals in scripts can cause issues like this.
There's also the issue of json shoved in script tags etc (I've seen web apps
use this for pre-caching server responses).

I'm not sure what we can do about this,
but I'm inclined to think that whatever we do won't catch every case and that
at some stage we have to accept that and move on.
I seem to remember that the accepted "fix" for this in html is not to split
the word script in </script> but rather to split at the /, thus:
document.write("<"); document.write("/script>");
But I may be wrong there.
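
A small sketch of why such rewrites help, assuming the trouble is the sequence `</` plus a letter inside the script text (the sequence tidy warns about). It holds the same statement as source text in three forms, checks each for that sequence, and runs each against a stand-in `document` object to show that all three write the same thing. The helper names and the stand-in object are made up for illustration.

```javascript
// The same JavaScript statement written three ways, held as source text.
const sources = {
  plain: 'document.write("</script>")',
  escaped: 'document.write("<\\/script>")', // source text reads <\/script>
  split: 'document.write("<" + "/script>")',
};

// True if the source contains "</" followed by a letter.
function hasEndTagOpen(source) {
  return /<\/[a-z]/i.test(source);
}

// Run a statement against a stand-in document and return what it wrote.
function written(source) {
  let out = "";
  const fakeDocument = { write(text) { out += text; } };
  new Function("document", source)(fakeDocument);
  return out;
}

for (const [name, source] of Object.entries(sources)) {
  console.log(name, hasEndTagOpen(source), written(source));
}
```

Only the plain form contains the sequence; the escaped and split forms avoid it while still writing `</script>` at run time.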

We should probably report a bug against tidy5 in any case for this.
That's why we're using a parsing library after all.
At least this one's maintained for us so there's a reasonable chance they'll
fix these things once they work out a workable solution.

Cheers,
Adam.


* [Edbrowse-dev]  script tags in scripts
From: Karl Dahlke @ 2015-09-11 10:17 UTC (permalink / raw)
  To: edbrowse-dev

> I'm not sure what we can do about this,
> but I'm inclined to think that whatever we do won't catch every case and that
> at some stage we have to accept that and move on.

That was true of my parser, true of tidy5, and true of any parser.
However, as you point out regularly, we should handle most websites
that other browsers handle.
And when we don't,
entire web pages shouldn't disappear beyond the point of error.
This bug is produced by fanfiction.net and fictionpress.com,
two high volume sites that work on every other browser.
And by the way, my thanks to those users who exercise and test our bleeding edge software;
you're as brave as a Windows 10 insider.
In any case, tidy5 needs to fix this,
or we need to find a way to preprocess around it,
the latter meaning I'd have to keep at least half of my parser,
which I really wanted to throw away entirely.   :(

Karl Dahlke


* Re: [Edbrowse-dev] script tags in scripts
From: Chris Brannon @ 2015-09-11 16:37 UTC (permalink / raw)
  To: edbrowse-dev

Kevin Carhart <kevin@carhart.net> writes:

> I guess one good sign is that there appears to be a lot of
> past literature on this issue, on Tidy listservs.  Including
> one from 2006 called "Tidy barfs on split <SCRIPT> tags".

Well, tidy5 definitely fails to handle these fanfiction pages properly.
And here I'm talking about the command-line tool as well as the library.
The fanfic site is dynamic, so we don't have a page that breaks
consistently.  Pages go from broken to working to re-broken.
So I've cached a broken page on my server, along with the
result of running it through the tidy5 tool.
http://the-brannons.com/hp/
This should be nice fuel for a bug report of some kind.

-- Chris


* Re: [Edbrowse-dev] script tags in scripts
From: Adam Thompson @ 2015-09-11 18:02 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: edbrowse-dev


On Fri, Sep 11, 2015 at 06:17:13AM -0400, Karl Dahlke wrote:
> > I'm not sure what we can do about this,
> > but I'm inclined to think that whatever we do won't catch every case and that
> > at some stage we have to accept that and move on.
> 
> That was true of my parser, true of tidy5, and true of any parser,
> however, as you point out regularly, we should handle most websites
> that other browsers handle.
> And when we don't,
> entire web pages shouldn't disappear beyond the point of error.
> This bug is produced by fanfiction.net and fictionpress.com,
> two high volume sites that work on every other browser.

Agreed, we need to work out what's breaking here and why it's affecting tidy5
and not, say, firefox etc. I may try the pages with some other html parsing
libs (not applicable to edbrowse unfortunately as they're in, e.g.
Python or Perl) to see what they do with the pages.
I'm just saying that I think we should continue to move forward with the design
on the basis that tidy5 will be fixed.
If it's not then we'll need to look at other alternatives but there're a lot of
elements of the new design which should stay in any case I think.

> And by the way, my thanks to those users who exercise and test our bleeding edge software;
> you're as brave as a Windows 10 insider.

I second this. We need users to test this software and I appreciate the time
and effort it takes to keep on top of the latest code,
particularly when we're adding library dependencies.

> In any case, tidy5 needs to fix this,
> or we need to find a way to preprocess around it,
> the latter meaning I'd have to keep at least half of my parser,
> which I really wanted to throw away entirely.   :(

Maybe, or we keep the tidy-inspired design but rewrite the parsing logic,
maybe borrowing the parsing code from somewhere else and making it our own.
I know I said we should try to stay out of the html parsing business,
and ideally I still would like to,
but if we really can't then we can at least keep the current design direction.
There has to be a parsing lib out there somewhere which works properly...
at least I hope there is.

Cheers,
Adam.


* [Edbrowse-dev]   script tags in scripts
From: Karl Dahlke @ 2015-09-11 18:55 UTC (permalink / raw)
  To: edbrowse-dev

> I think we should continue to move forward with the design

Absolutely, and I am.
I'm working on it as we speak.
And I don't think we'll need to change libraries or go back to a handwritten parser either.
I'm pretty comfortable hitching my wagon to tidy5.
This is the first bug in many tests, and either they'll fix it,
or I'll preprocess the text to get around it:
	replace </ in a string with <" + "/
That's kinda gross, and maybe a last resort,
but we're not going to retreat.
This new design is the direction we need to travel, for so many reasons.
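
One way the preprocessing step described above might look, as a sketch only: walk the script text, track whether we are inside a quoted string, and rewrite `</` inside a string as `<" + "/` (or the single-quote equivalent). The function name is hypothetical, the input is assumed to be the script body alone, and comments, regex literals and template literals are not recognised.

```javascript
// Rewrite "</" inside quoted string literals so the sequence no longer
// appears in the script text. Text outside strings is copied unchanged.
// Assumption: input is the body of one script, with '...' or "..." strings.
function splitEndTagsInStrings(script) {
  let out = "";
  let quote = null; // current string delimiter, or null when outside a string
  for (let i = 0; i < script.length; i++) {
    const ch = script[i];
    if (quote === null) {
      if (ch === '"' || ch === "'") {
        quote = ch;
      }
      out += ch;
    } else if (ch === "\\") {
      // Copy an escape sequence untouched: backslash plus next character.
      out += ch + (script[i + 1] || "");
      i++;
    } else if (ch === quote) {
      quote = null;
      out += ch;
    } else if (ch === "<" && script[i + 1] === "/") {
      // Close the string after "<", reopen it before "/".
      out += "<" + quote + " + " + quote + "/";
      i++;
    } else {
      out += ch;
    }
  }
  return out;
}

const original = 'document.write("<script></s");document.write("cript>")';
const rewritten = splitEndTagsInStrings(original);
console.log(rewritten);
// document.write("<script><" + "/s");document.write("cript>")
```

The rewritten script concatenates to the same strings at run time, so what it writes is unchanged. The sketch does not address how to find the script body in the first place.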

Karl Dahlke
