* [Edbrowse-dev] regex criteria interpreted as literals
@ 2016-01-13 6:01 Kevin Carhart
2016-01-13 6:13 ` [Edbrowse-dev] fixing my semantics a little Kevin Carhart
2016-01-13 11:34 ` [Edbrowse-dev] regex criteria interpreted as literals Chris Brannon
0 siblings, 2 replies; 6+ messages in thread
From: Kevin Carhart @ 2016-01-13 6:01 UTC (permalink / raw)
To: Edbrowse-dev
I was trying to dig into this problem where Sebastian from the commandline
list was trying to read google groups with edbrowse.
There may be a few things going on with google groups, but one of them
that I could isolate as a short example is that they make use of the
inline regular expression style as follows:
<script type="text/javascript">
ua=/</g;va=/>/g;
</script>
And the routine fails because the expression criteria are taken as a
literal, so the error is then "SyntaxError: unterminated regular
expression literal".
I know this is very similar to the string contents interpreted as literals
problem from months back, which is now fixed, right? Maybe this one is
harder to deal with because it isn't delimited by quotes? It gets ambiguous
to know what /</ means.
Or should this work?
Or is it slipping my mind and we talked about the regex syntax back when
we talked about things like
document.writeln("<script language=JavaScript>document.writeln('Subject:
');<" + "/script>");
Note, I made sure my tidy was up to date before trying this. When I say:
tidy -v
I get
HTML Tidy for Linux version 5.1.33
Any idea what can be done here?
thanks
Kevin
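
For reference, here is a small sketch of what regex literals like these
mean to a JavaScript engine when the script reaches it untouched. The
names ua and va are copied from the snippet above; escapeAngles is a
made-up helper, and how Google Groups really uses these patterns is an
assumption on my part.

```javascript
// Stand-ins for the page's ua and va variables.
const ua = /</g; // matches every "<"
const va = />/g; // matches every ">"

// A regex literal and the RegExp constructor describe the same pattern.
const uaFromConstructor = new RegExp("<", "g");

// escapeAngles is a hypothetical helper, not taken from Google Groups.
function escapeAngles(text) {
  return text.replace(ua, "&lt;").replace(va, "&gt;");
}

console.log(ua.source === uaFromConstructor.source); // true
console.log(escapeAngles("<b>hi</b>")); // &lt;b&gt;hi&lt;/b&gt;
```

The slashes delimit the pattern the way quotes delimit a string, so the
"<" between them is only a character to match.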
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Edbrowse-dev] fixing my semantics a little
2016-01-13 6:01 [Edbrowse-dev] regex criteria interpreted as literals Kevin Carhart
@ 2016-01-13 6:13 ` Kevin Carhart
2016-01-13 11:34 ` [Edbrowse-dev] regex criteria interpreted as literals Chris Brannon
1 sibling, 0 replies; 6+ messages in thread
From: Kevin Carhart @ 2016-01-13 6:13 UTC (permalink / raw)
To: Edbrowse-dev
Sorry, I think my use of "literal" is backwards but I hope you can tell
what I meant from context. It's this whole cluster of questions around an
actual token with a formal meaning, versus that thing appearing as part of
a string. Or in this case, something potentially with a formal meaning
like <, only it isn't a piece of an HTML tag, it's expression criteria
intended to be matched, delimited not by quotes but by slashes. And the
parser may not have enough information to differentiate between the
situations.
--------
Kevin Carhart * 415 225 5306 * The Ten Ninety Nihilists
* Re: [Edbrowse-dev] regex criteria interpreted as literals
2016-01-13 6:01 [Edbrowse-dev] regex criteria interpreted as literals Kevin Carhart
2016-01-13 6:13 ` [Edbrowse-dev] fixing my semantics a little Kevin Carhart
@ 2016-01-13 11:34 ` Chris Brannon
2016-01-14 2:28 ` Kevin Carhart
1 sibling, 1 reply; 6+ messages in thread
From: Chris Brannon @ 2016-01-13 11:34 UTC (permalink / raw)
To: Kevin Carhart; +Cc: Edbrowse-dev
Kevin Carhart <kevin@carhart.net> writes:
> ua=/</g;va=/>/g;
*SNIP*
> this one is harder to deal with because it isn't delimited by quotes?
> It gets ambiguous to know what /</ means.
> Or should this work?
Hi Kevin,
Nah, this is perfectly fine JavaScript, but Tidy-HTML5 doesn't like it.
Apparently it converts
va = /</g;
to
va = /<\/g;
when in fact it should probably just leave it alone. I thought we'd
gotten through all of this stuff way back in October, but maybe not.
-- Chris
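
A quick way to check the two spellings is to ask a JavaScript engine to
compile each one. This is only a sketch: parses() is a hypothetical
helper, and the "escaped" string is the rewritten form as described in
this message, not output captured from Tidy.

```javascript
// parses() reports whether a snippet is syntactically valid JavaScript.
// new Function() compiles the source without running it, so nothing is
// actually assigned.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

const original = "va = /</g;";
const escaped = "va = /<\\/g;"; // the text  va = /<\/g;

console.log(parses(original)); // true
console.log(parses(escaped));  // false
```

In the rewritten form the backslash turns the closing slash into an
ordinary character inside the pattern, so the literal never ends and the
engine reports a syntax error.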
* Re: [Edbrowse-dev] regex criteria interpreted as literals
2016-01-13 11:34 ` [Edbrowse-dev] regex criteria interpreted as literals Chris Brannon
@ 2016-01-14 2:28 ` Kevin Carhart
2016-01-15 21:38 ` [Edbrowse-dev] Tidy and various tags was " Chris Brannon
0 siblings, 1 reply; 6+ messages in thread
From: Kevin Carhart @ 2016-01-14 2:28 UTC (permalink / raw)
To: Chris Brannon; +Cc: Edbrowse-dev
>
> Hi Kevin,
> Nah, this is perfectly fine JavaScript, but Tidy-HTML5 doesn't like it.
> Apparently it converts
> va = /</g;
> to
> va = /<\/g;
> when in fact it should probably just leave it alone. I thought we'd
> gotten through all of this stuff way back in October, but maybe not.
Thanks Chris! Do you think I ought to file this in the requests tracker
for Tidy, also a question for Geoff if you're around?
If you could share the logic with me of how the parser will disambiguate
this, if you know, I'm interested. I can understand why it would escape
</ to <\/ if these two characters are indeed the beginning of a closing
HTML tag. So is the relevant hook for knowing whether to do so the fact
that we're already inside <script></script>, and you don't need to escape
within that?
(More of a tidy question, but not too far afield, I don't think.)
thanks
Kevin
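
To make the question concrete, here is a short sketch of why escaping </
to <\/ is harmless inside a JavaScript string. The variable names are
only for illustration.

```javascript
// In a string literal, "\/" is just "/", so the escaped and unescaped
// spellings produce the same text.
const escapedInString = "<\/script>";
const splitInString = "<" + "/script>"; // the document.writeln trick

console.log(escapedInString === "</script>"); // true
console.log(splitInString === "</script>");   // true

// In a regex literal, "\/" is an escaped slash that does NOT close the
// literal, so a real closing slash still has to follow.
const closingTag = /<\/script>/;
console.log(closingTag.test("x</script>")); // true
```

So the rewrite only preserves meaning when the </ sits inside quotes;
between regex slashes, the slash after < may be the closing delimiter.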
* [Edbrowse-dev] Tidy and various tags was Re: regex criteria interpreted as literals
2016-01-14 2:28 ` Kevin Carhart
@ 2016-01-15 21:38 ` Chris Brannon
2016-01-15 21:46 ` [Edbrowse-dev] Tidy and various tags was regex criteria Karl Dahlke
0 siblings, 1 reply; 6+ messages in thread
From: Chris Brannon @ 2016-01-15 21:38 UTC (permalink / raw)
To: Kevin Carhart; +Cc: Edbrowse-dev
Kevin Carhart <kevin@carhart.net> writes:
> Thanks Chris! Do you think I ought to file this in the requests
> tracker for Tidy, also a question for Geoff if you're around?
Hi Kevin,
Yeah I'd say so.
> If you could share the logic with me of how the parser will
> disambiguate this, if you know, I'm interested.
I think the consensus from the edbrowse side is that if you're in
<script>, <textarea>, or <style> just leave it alone until you see the
closing tag without doing the typical escaping behavior.
-- Chris
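
The rule described here can be sketched roughly as follows. This is a
toy illustration, not Tidy's or edbrowse's parser: RAW_TAGS, rawTextEnd
and extractRaw are hypothetical names, and the opening-tag pattern would
trip over a ">" inside an attribute value.

```javascript
// Elements whose content is passed through untouched (list from the thread).
const RAW_TAGS = ["script", "textarea", "style"];

// Given the document, the raw element's tag name, and the index just past
// its opening tag, return the index where the matching closing tag begins
// (or html.length if the closing tag is missing).
function rawTextEnd(html, tagName, start) {
  const closer = new RegExp("</" + tagName + "[\\s/>]", "i");
  const match = closer.exec(html.slice(start));
  return match ? start + match.index : html.length;
}

// Return the untouched content of the first raw element found, or null.
function extractRaw(html) {
  const opener = new RegExp("<(" + RAW_TAGS.join("|") + ")\\b[^>]*>", "i");
  const open = opener.exec(html);
  if (!open) return null;
  const start = open.index + open[0].length;
  const end = rawTextEnd(html, open[1], start);
  return { tag: open[1].toLowerCase(), text: html.slice(start, end) };
}

const page = '<script type="text/javascript">\nua=/</g;va=/>/g;\n</script>';
console.log(extractRaw(page).text === "\nua=/</g;va=/>/g;\n"); // true
```

Only a </ followed by the element's own name ends the raw text, so the
</g in the regex is copied through without any escaping.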
* [Edbrowse-dev] Tidy and various tags was regex criteria...
2016-01-15 21:38 ` [Edbrowse-dev] Tidy and various tags was " Chris Brannon
@ 2016-01-15 21:46 ` Karl Dahlke
0 siblings, 0 replies; 6+ messages in thread
From: Karl Dahlke @ 2016-01-15 21:46 UTC (permalink / raw)
To: Edbrowse-dev
> I think the consensus from the edbrowse side is that if you're in
> <script>, <textarea>, or <style> just leave it alone until you see the
> closing tag without doing the typical escaping behavior.
Yes, with the caveat that the text in <textarea> has to be andTranslated,
&quot; &lt; etc.
Other than that leave it alone.
And I suspect there are other raw tags like this.
What kind of text is inside <object> for instance,
and how should that be represented in the html tidy tree?
http://www.w3schools.com/tags/tag_object.asp
We'll have to hunt up some real world examples.
No doubt tidy will evolve as edbrowse evolves.
Karl Dahlke
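
As a rough illustration of the textarea caveat, here is a minimal sketch
of entity translation. decodeTextarea and ENTITIES are hypothetical
names, not edbrowse code, and only four named entities are handled; a
real decoder would cover the full entity table and numeric references.

```javascript
// The few named entities this sketch understands.
const ENTITIES = { quot: '"', lt: "<", gt: ">", amp: "&" };

// Translate entities in a single pass, so "&amp;lt;" becomes "&lt;"
// and is not decoded a second time.
function decodeTextarea(raw) {
  return raw.replace(/&(quot|lt|gt|amp);/g, (whole, name) => ENTITIES[name]);
}

console.log(decodeTextarea("if (a &lt; b) say(&quot;hi&quot;)"));
// if (a < b) say("hi")
```

Script and style content would skip this step entirely under the rule
quoted above.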