edbrowse-dev - development list for edbrowse
* [Edbrowse-dev] regex criteria interpreted as literals
@ 2016-01-13  6:01 Kevin Carhart
  2016-01-13  6:13 ` [Edbrowse-dev] fixing my semantics a little Kevin Carhart
  2016-01-13 11:34 ` [Edbrowse-dev] regex criteria interpreted as literals Chris Brannon
  0 siblings, 2 replies; 6+ messages in thread
From: Kevin Carhart @ 2016-01-13  6:01 UTC
  To: Edbrowse-dev



I was trying to dig into the problem where Sebastian from the commandline
list was trying to read Google Groups with edbrowse.

There may be a few things going on with Google Groups, but one that I
could isolate as a short example is their use of the inline regular
expression style, as follows:

<script type="text/javascript">
ua=/</g;va=/>/g;
</script>

And the routine fails because the expression criteria are taken as a
literal, so the error is then "SyntaxError: unterminated regular
expression literal"

I know this is very similar to the string contents interpreted as
literals problems from months back, which are now fixed, right?  Maybe
this one is harder to deal with because it isn't delimited by quotes?
It gets ambiguous to know what /</ means.
Or should this work?
Or is it slipping my mind, and did we talk about the regex syntax back
when we talked about things like
document.writeln("<script language=JavaScript>document.writeln('Subject: ');<" + "/script>");


Note, I made sure my tidy was up to date before trying this.  When I say:
tidy -v
I get
HTML Tidy for Linux version 5.1.33

Any idea what can be done here?
thanks
Kevin


* [Edbrowse-dev] fixing my semantics a little
  2016-01-13  6:01 [Edbrowse-dev] regex criteria interpreted as literals Kevin Carhart
@ 2016-01-13  6:13 ` Kevin Carhart
  2016-01-13 11:34 ` [Edbrowse-dev] regex criteria interpreted as literals Chris Brannon
  1 sibling, 0 replies; 6+ messages in thread
From: Kevin Carhart @ 2016-01-13  6:13 UTC
  To: Edbrowse-dev



Sorry, I think my use of "literal" is backwards but I hope you can tell 
what I meant from context.  It's this whole cluster of questions around an 
actual token with a formal meaning, versus that thing appearing as part of 
a string.   Or in this case, something potentially with a formal meaning 
like <, only it isn't a piece of an HTML tag, it's expression criteria 
intended to be matched, delimited not by quotes but by slashes.  And the 
parser may not have enough information to differentiate between the 
situations.





* Re: [Edbrowse-dev] regex criteria interpreted as literals
  2016-01-13  6:01 [Edbrowse-dev] regex criteria interpreted as literals Kevin Carhart
  2016-01-13  6:13 ` [Edbrowse-dev] fixing my semantics a little Kevin Carhart
@ 2016-01-13 11:34 ` Chris Brannon
  2016-01-14  2:28   ` Kevin Carhart
  1 sibling, 1 reply; 6+ messages in thread
From: Chris Brannon @ 2016-01-13 11:34 UTC
  To: Kevin Carhart; +Cc: Edbrowse-dev

Kevin Carhart <kevin@carhart.net> writes:

> ua=/</g;va=/>/g;
*SNIP*
> this one is harder to deal with because it isn't delimited by quotes?
> It gets ambiguous to know what /</ means.
> Or should this work?

Hi Kevin,
Nah, this is perfectly fine JavaScript, but Tidy-HTML5 doesn't like it.
Apparently it converts
va = /</g;
to
va = /<\/g;
when in fact it should probably just leave it alone.  I thought we'd
gotten through all of this stuff way back in October, but maybe not.

-- Chris
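
A small sketch of why that rewrite breaks the script, runnable under
Node.js. The helper name `compiles` is made up for illustration, and the
exact SyntaxError wording differs between JavaScript engines. Inside a
string literal, \/ is just /, so the escape is harmless there. Inside a
regex literal, the backslash escapes the closing delimiter, so the literal
never terminates.

```javascript
// Hypothetical helper: returns true if `source` parses as JavaScript.
// new Function() parses its body without running it.
function compiles(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// What the page contains.
const original = "ua=/</g;va=/>/g;";
// What Tidy reportedly emits; the doubled backslash here is only
// string escaping, the script text is: va = /<\/g;
const rewritten = "va = /<\\/g;";

console.log(compiles(original));
console.log(compiles(rewritten));

// In a string literal the same escape changes nothing.
console.log("<\/script>" === "</script>");
```

The first call reports that the original parses, the second that the
rewritten form does not, which matches the "unterminated regular
expression literal" error from the first message.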


* Re: [Edbrowse-dev] regex criteria interpreted as literals
  2016-01-13 11:34 ` [Edbrowse-dev] regex criteria interpreted as literals Chris Brannon
@ 2016-01-14  2:28   ` Kevin Carhart
  2016-01-15 21:38     ` [Edbrowse-dev] Tidy and various tags was " Chris Brannon
  0 siblings, 1 reply; 6+ messages in thread
From: Kevin Carhart @ 2016-01-14  2:28 UTC
  To: Chris Brannon; +Cc: Edbrowse-dev

>
> Hi Kevin,
> Nah, this is perfectly fine JavaScript, but Tidy-HTML5 doesn't like it.
> Apparently it converts
> va = /</g;
> to
> va = /<\/g;
> when in fact it should probably just leave it alone.  I thought we'd
> gotten through all of this stuff way back in October, but maybe not.

Thanks Chris!  Do you think I ought to file this in the requests tracker 
for Tidy, also a question for Geoff if you're around?

If you could share the logic with me of how the parser will disambiguate 
this, if you know, I'm interested.  I can understand why it 
would escape </ to <\/ if these two characters are indeed the beginning of 
a closing HTML tag.  So is the relevant hook for knowing whether to do
so the fact that we're already inside <script></script>, where you don't
need to escape?
(More of a tidy question, but not too far afield, I don't think.)


thanks
Kevin


* [Edbrowse-dev] Tidy and various tags was Re: regex criteria interpreted as literals
  2016-01-14  2:28   ` Kevin Carhart
@ 2016-01-15 21:38     ` Chris Brannon
  2016-01-15 21:46       ` [Edbrowse-dev] Tidy and various tags was regex criteria Karl Dahlke
  0 siblings, 1 reply; 6+ messages in thread
From: Chris Brannon @ 2016-01-15 21:38 UTC
  To: Kevin Carhart; +Cc: Edbrowse-dev

Kevin Carhart <kevin@carhart.net> writes:

> Thanks Chris!  Do you think I ought to file this in the requests
> tracker for Tidy, also a question for Geoff if you're around?

Hi Kevin,
Yeah I'd say so.

> If you could share the logic with me of how the parser will
> disambiguate this, if you know, I'm interested.

I think the consensus from the edbrowse side is that if you're in
<script>, <textarea>, or <style> just leave it alone until you see the
closing tag without doing the typical escaping behavior.

-- Chris
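
One way to picture that rule, as a rough sketch and not the actual code of
edbrowse or Tidy: once the parser has consumed the opening tag of one of
these elements, it copies everything verbatim up to the matching closing
tag. The function name `rawTextBody`, its arguments, and its return shape
are assumptions made for illustration.

```javascript
// Elements whose content is copied verbatim until the closing tag,
// following the rule described above.
const RAW_ELEMENTS = ["script", "style", "textarea"];

// Hypothetical helper. `html` is the full document text and `start` is
// the index just past the opening tag. Returns the untouched body and the
// index where the closing tag begins, or null if the element is not a raw
// element or no closing tag is found.
function rawTextBody(html, tagName, start) {
  const name = tagName.toLowerCase();
  if (!RAW_ELEMENTS.includes(name)) return null;
  // Tag names are case-insensitive. The name must be followed by
  // whitespace, "/" or ">" so that "</scripted" does not count.
  const closer = new RegExp("</" + name + "(?=[\\s/>])", "ig");
  closer.lastIndex = start;
  const m = closer.exec(html);
  if (m === null) return null;
  return { body: html.slice(start, m.index), end: m.index };
}

const page = '<script type="text/javascript">\nua=/</g;va=/>/g;\n</script>';
const open = page.indexOf(">") + 1;
console.log(JSON.stringify(rawTextBody(page, "script", open).body));
```

With this approach the </ in ua=/</g; is never inspected as markup,
because only the full closing tag name ends the element.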


* [Edbrowse-dev]  Tidy and various tags was regex criteria...
  2016-01-15 21:38     ` [Edbrowse-dev] Tidy and various tags was " Chris Brannon
@ 2016-01-15 21:46       ` Karl Dahlke
  0 siblings, 0 replies; 6+ messages in thread
From: Karl Dahlke @ 2016-01-15 21:46 UTC
  To: Edbrowse-dev

> I think the consensus from the edbrowse side is that if you're in
> <script>, <textarea>, or <style> just leave it alone until you see the
> closing tag without doing the typical escaping behavior.

Yes, with the caveat that the text in <textarea> has to be andTranslated,
&quot; &lt; etc.
Other than that, leave it alone.
And I suspect there are other raw tags like this.
What kind of text is inside <object> for instance,
and how should that be represented in the html tidy tree?

http://www.w3schools.com/tags/tag_object.asp

We'll have to hunt up some real world examples.

No doubt tidy will evolve as edbrowse evolves.

Karl Dahlke
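
The textarea caveat could look roughly like the following. This is only a
sketch: `decodeEntities` and `elementText` are made-up names standing in
for edbrowse's own translation step, and the table covers just a handful
of named entities plus numeric references.

```javascript
// A deliberately small table; HTML defines many more named entities.
const NAMED = { quot: '"', lt: "<", gt: ">", amp: "&", apos: "'" };

// Hypothetical stand-in for the entity translation step. Unknown or
// out-of-range references are left exactly as they were written.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (whole, ref) => {
    if (ref[0] === "#") {
      const hex = ref[1] === "x" || ref[1] === "X";
      const code = parseInt(ref.slice(hex ? 2 : 1), hex ? 16 : 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    return Object.prototype.hasOwnProperty.call(NAMED, ref) ? NAMED[ref] : whole;
  });
}

// Script and style bodies pass through untouched; textarea bodies get
// their entities translated.
function elementText(tagName, rawBody) {
  return tagName.toLowerCase() === "textarea" ? decodeEntities(rawBody) : rawBody;
}

console.log(elementText("textarea", "say &quot;a &lt; b&quot;"));
console.log(elementText("script", "if (a &amp;&amp; b) {}"));
```

The first line printed has its entities translated; the second comes back
unchanged because script content is left alone.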
