edbrowse-dev - development list for edbrowse
 help / color / mirror / Atom feed
* [Edbrowse-dev] bad byte in NASA's vendor.js
@ 2016-12-30  1:30 Kevin Carhart
  2016-12-30  2:11 ` Chris Brannon
  0 siblings, 1 reply; 8+ messages in thread
From: Kevin Carhart @ 2016-12-30  1:30 UTC (permalink / raw)
  To: Edbrowse-dev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1475 bytes --]


Wow!  I found something that can cause JS to crash, yet is more to do 
with our retrieval of javascript code files as large ascii files, rather 
than javascript-the-language.  It happens in nasa.gov and may be one 
reason why nasa.gov stopped working.

I was looking at all of the files successfully downloaded by curl as it 
tries to render nasa.gov, but when the run is still proceeding and has not 
yet aborted (assuming db5 is on).

One of these files is:
https://www.nasa.gov/sites/all/themes/custom/nasatwo/js/vendor.js?oinn2s

Now let's say you make a test html file that will load only vendor.js.
If you go to edbrowse and load this test file with db5, it says this:
vendor.js line 89049: SyntaxError: missing ) after condition

The bad line is like:
if (!d.isValid() || !that.get('value')) {


I played around with the bad line, and found that the syntax error is 
coming from an invisible character!

So the following bytes can make the javascript processing of vendor.js 
fail, screwing things up downstream.

64
28
29
20
7C
7C
A0

The 7C 7C is the "OR" double pipe symbols.  So if you change A0 to a 
regular space, the syntax error goes away.

64
28
29
20
7C
7C
20

How about that!  So is this something we can deal with someplace way more 
low-level than in parsing javascript code itself, such as in http.c when 
the bytes are retrieved?

Will nasa.gov come back when we address this?  Don't know yet, but this is 
definitely a necessary precondition.

Kevin

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Edbrowse-dev] bad byte in NASA's vendor.js
  2016-12-30  1:30 [Edbrowse-dev] bad byte in NASA's vendor.js Kevin Carhart
@ 2016-12-30  2:11 ` Chris Brannon
  2016-12-30  3:24   ` Karl Dahlke
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Brannon @ 2016-12-30  2:11 UTC (permalink / raw)
  To: Kevin Carhart; +Cc: Edbrowse-dev

Kevin Carhart <kevin@carhart.net> writes:

> Wow!  I found something that can cause JS to crash, yet is more to do
> with our retrieval of javascript code files as large ascii files,
> rather than javascript-the-language.

Simplest testcase, http://the-brannons.com/jsbytes.html.  When we have a
fix, we should add it to jsrt.

I'm not quite sure where the mixup is happening, because as far as I can
tell from db5, the correct byte sequence is going to JS.

-- Chris

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Edbrowse-dev]  bad byte in NASA's vendor.js
  2016-12-30  2:11 ` Chris Brannon
@ 2016-12-30  3:24   ` Karl Dahlke
  2016-12-30 18:52     ` Chris Brannon
  0 siblings, 1 reply; 8+ messages in thread
From: Karl Dahlke @ 2016-12-30  3:24 UTC (permalink / raw)
  To: Edbrowse-dev

[-- Attachment #1: Type: text/plain, Size: 788 bytes --]

Wel it's easy enough to transliterate every breakspace into space, and that would fix this problem, but would it break something else?

alert("hello world");

That works now, but if I replaced breakspace mindlessly the string would change.
And probably breakspace could be part of a regexp.
I don't know, but other browsers manage it,
some using the same js engine we are using, so they must do something.

I see in that nasa js lots of other utf8 symbols in strings, just not breakspace, as it happens.

Kevin this is good detective work.

any thoughts on replacing breakspace with space, or other ideas?

You notice I remove the leading byte order mark, if any, at the start of javascript, jseng-moz.cpp line 2465.
I must have run into that one before.

Karl Dahlke

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Edbrowse-dev] bad byte in NASA's vendor.js
  2016-12-30  3:24   ` Karl Dahlke
@ 2016-12-30 18:52     ` Chris Brannon
  2016-12-30 19:58       ` Karl Dahlke
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Brannon @ 2016-12-30 18:52 UTC (permalink / raw)
  To: Edbrowse-dev

Karl Dahlke <eklhad@comcast.net> writes:

> Wel it's easy enough to transliterate every breakspace into space, and
> that would fix this problem, but would it break something else?

I think the heart of the matter is that JS_EvaluateScript cannot cope
with non-ASCII things, even if encoded in UTF8.
There are two js_Evaluate* functions:
JS_EvaluateScript and JS_EvaluateUCScript.
The documentation just says that JS_EvaluateUCScript is the Unicode
version, and it takes const jschar * rather than const char *, where
jschar is their 16-bit wide character type.
So maybe we need to be decoding our bytes to jschar * and using
JS_EvaluateUCScript?

-- Chris

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Edbrowse-dev]  bad byte in NASA's vendor.js
  2016-12-30 18:52     ` Chris Brannon
@ 2016-12-30 19:58       ` Karl Dahlke
  2016-12-30 20:07         ` Chris Brannon
  0 siblings, 1 reply; 8+ messages in thread
From: Karl Dahlke @ 2016-12-30 19:58 UTC (permalink / raw)
  To: Edbrowse-dev

> I think the heart of the matter is that JS_EvaluateScript cannot cope
> with non-ASCII things, even if encoded in UTF8.

Yes it can, as shown in my previous example.
utf8 chars can be in strings, or even regular expressions.
This works.

var x = "a b";
alert(x.replace(/ /, "-"));

It prints out a-b.

It doesn't seem to handle breakspace as whitespace however, and yet it should, and apparently does in other browsers.
I don't think converting everything to unicode would change anything.
It would still accept my code fragment above, and likely barf on breakspace elsewhere.

As a really imperfect solution, I'm thinking about changing every breakspace to space, unless:
it is in a line less 200 characters &&
the line does not start with // &&
it is not part of a string wholly contained in this line using a rather simple " " criterion.

I'm trying to avoid writing a js scanner, which is really not a trivial thing,
beyond what lex can handle, mostly because of those cursed regular expressions that aren't even quoted,
just appear free, and we're suppose to recognize them as such,
even though almost any character can appear between the slashes.
I know about this nightmare because I parsed and ran my own javascript in edbrowse version 2.
It's really best left to someone else!
But then again, that someone else isn't handling breakspace properly.

Karl Dahlke

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Edbrowse-dev] bad byte in NASA's vendor.js
  2016-12-30 19:58       ` Karl Dahlke
@ 2016-12-30 20:07         ` Chris Brannon
  2016-12-30 23:13           ` Karl Dahlke
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Brannon @ 2016-12-30 20:07 UTC (permalink / raw)
  To: Edbrowse-dev

Karl Dahlke <eklhad@comcast.net> writes:

> It doesn't seem to handle breakspace as whitespace however, and yet it should, and apparently does in other browsers.

I don't know if you have the JavaScript shell that comes with
Spidermonkey installed, but it handles it as whitespace just fine there
too.  Test case: http://the-brannons.com/test.js.txt
So why not in JS_EvaluateScript?  I added some debugging statements on
the JS side to print the bytes that were coming through, and they did in
fact come through as they should.

-- Chris

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Edbrowse-dev]  bad byte in NASA's vendor.js
  2016-12-30 20:07         ` Chris Brannon
@ 2016-12-30 23:13           ` Karl Dahlke
  2016-12-31  5:28             ` Kevin Carhart
  0 siblings, 1 reply; 8+ messages in thread
From: Karl Dahlke @ 2016-12-30 23:13 UTC (permalink / raw)
  To: Edbrowse-dev

My latest push should fix this breakspace bug. It runs for me on jsrt, and the test cases we created.
Kevin, give this a whirl with the nasa page, and march on to the next problem.

Warning - I may have totally broken things if your machine is big endian.
I couldn't really tell if JSEvaluateUCScript always wants little endian,
or if it wants what is native to your machine.
anyone able to test this for me?

Karl Dahlke

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Edbrowse-dev] bad byte in NASA's vendor.js
  2016-12-30 23:13           ` Karl Dahlke
@ 2016-12-31  5:28             ` Kevin Carhart
  0 siblings, 0 replies; 8+ messages in thread
From: Kevin Carhart @ 2016-12-31  5:28 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev



> Kevin this is good detective work
> any thoughts on replacing breakspace with space, or other ideas?

Thank you - I am glad you both have been able to run with this today 
because I may not know or may not have known where to take it.  I get 
what you mean about the unpleasant nature of regular expressions sitting 
there in JS.  We implemented the TidyEscapeScripts flag recently for 
something on a similar theme, thanks to Geoff's assistance.

I will march on & report back.

Kevin

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-12-31  5:28 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-30  1:30 [Edbrowse-dev] bad byte in NASA's vendor.js Kevin Carhart
2016-12-30  2:11 ` Chris Brannon
2016-12-30  3:24   ` Karl Dahlke
2016-12-30 18:52     ` Chris Brannon
2016-12-30 19:58       ` Karl Dahlke
2016-12-30 20:07         ` Chris Brannon
2016-12-30 23:13           ` Karl Dahlke
2016-12-31  5:28             ` Kevin Carhart

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).