From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from pcdesk.net (mail.pcdesk.net [70.58.191.25]) by hurricane.the-brannons.com (Postfix) with ESMTPS id C17767890C for ; Thu, 10 Sep 2015 17:14:49 -0700 (PDT)
To: edbrowse-dev@lists.the-brannons.com
From: Tyler Spivey
Message-ID: <55F21D99.7070701@pcdesk.net>
Date: Thu, 10 Sep 2015 17:17:29 -0700
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.2.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Subject: [Edbrowse-dev] script tags in scripts
X-BeenThere: edbrowse-dev@lists.the-brannons.com
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Edbrowse Development List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 11 Sep 2015 00:14:49 -0000

If we do something like:

paragraph

Turn off js and browse, and the paragraph will be ignored. For another example, on fanfiction.net, all the stories disappear.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from resqmta-ch2-07v.sys.comcast.net (resqmta-ch2-07v.sys.comcast.net [IPv6:2001:558:fe21:29:69:252:207:39]) by hurricane.the-brannons.com (Postfix) with ESMTPS id 53BDE77D0D for ; Thu, 10 Sep 2015 18:07:45 -0700 (PDT)
Received: from resomta-ch2-04v.sys.comcast.net ([69.252.207.100]) by resqmta-ch2-07v.sys.comcast.net with comcast id Fd9i1r0062AWL2D01dAPAg; Fri, 11 Sep 2015 01:10:23 +0000
Received: from eklhad ([IPv6:2601:405:4002:b0a:21e:4fff:fec2:a0f1]) by resomta-ch2-04v.sys.comcast.net with comcast id FdAP1r00F0GArqr01dAPNm; Fri, 11 Sep 2015 01:10:23 +0000
To: edbrowse-dev@lists.the-brannons.com
From: Karl Dahlke
Reply-to: Karl Dahlke
References: <55F21D99.7070701@pcdesk.net>
User-Agent: edbrowse/3.5.4.2+
Date: Thu, 10 Sep 2015 21:10:23 -0400
Message-ID: <20150810211023.eklhad@comcast.net>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20140121; t=1441933823; bh=P/yLLLX1LldkAOVKP0w2LA4n9+XYLOSMstQ3CboNVxc=; h=Received:Received:To:From:Reply-to:Subject:Date:Message-ID: Mime-Version:Content-Type; b=YrDxWl8/IEFVxMAQffoSJRNYf70UTmeZd9yE0nO8PDFpSd7o8IxNRj72VEgHAsuX1 kQR0YluovvL9oCY7uooFakhmmC5Vs9nNQgpkQk8rmab2yh1bRfUmC3XJn54zMmWZ60 fopddBoriar5pYzbJBSbvtJYjaeS+ltN/aIfpvTjwAdLXBaKEwEsdAUHq3yVPp/9n7 vBrsGiy5koF/E7Z5VeWpVDqw1/hEr0H1mMJVpmowk1DtsR52ViCjU97Hvha/RpcXno StND5noBcxK6cvhGmYaF69i/B0/FWdeSElX+8T2xTfVp8hCwjsu5HByBF0hzKLsrRn ZnduBSadRyWhA==
Subject: [Edbrowse-dev] script tags in scripts
X-BeenThere: edbrowse-dev@lists.the-brannons.com
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Edbrowse Development List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 11 Sep 2015 01:07:45 -0000

I'm fairly certain, and fairly concerned, that
this is a tidy bug that we can't get around. Source as follows.

paragraph
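
As a rough illustration of the failure discussed in this thread (not tidy's actual code), the sketch below mimics an HTML tokenizer that knows nothing about JavaScript: it ends the script element at the first "</script" it sees, even when that text sits inside a string literal. The function names and the sample markup are made up for this example, and the comment-escape states that real tokenizers also handle are ignored. The second sample writes the "<" and the "/script>" in separate calls, so the closing tag never appears literally inside the script.

```javascript
// Hypothetical helper, not part of edbrowse or tidy: returns the index at
// which a JavaScript-unaware HTML tokenizer would end the script content,
// or -1 if no end tag is found at or after position `from`.
function scriptEndIndex(html, from) {
  // An end tag is "</script" followed by whitespace, "/" or ">", in any case.
  const endTag = /<\/script[\t\n\f\r \/>]/gi;
  endTag.lastIndex = from; // start searching at `from`
  const match = endTag.exec(html);
  return match === null ? -1 : match.index;
}

// Returns the text such a tokenizer would pass to the JavaScript engine for
// the first attribute-less <script> element, or null if there is none.
function firstScriptText(html) {
  const open = "<script>";
  const start = html.indexOf(open);
  if (start < 0) return null;
  const bodyStart = start + open.length;
  const end = scriptEndIndex(html, bodyStart);
  return end < 0 ? html.slice(bodyStart) : html.slice(bodyStart, end);
}

// Made-up sample pages; the real pages in the thread are not reproduced here.
const unsplit =
  '<script>document.write("<script>var x;</script>");</script><p>paragraph</p>';
const split =
  '<script>document.write("<script>var x;<"); document.write("/script>");</script><p>paragraph</p>';

// The inner end tag cuts the script short, leaving an unterminated string.
console.log(firstScriptText(unsplit)); // document.write("<script>var x;

// With the end tag split across two writes, the whole script survives.
console.log(firstScriptText(split)); // document.write("<script>var x;<"); document.write("/script>");
```

In the first case everything after the inner end tag, including the paragraph, is parsed in whatever state the tokenizer is left in, which matches the truncated script text in the trace.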

db6
js
b

undoCompare no undo map
line 1 column 1: missing declaration
line 2 column 34: '<' + '/' + letter not allowed here
line 2 column 69: '<' + '/' + letter not allowed here
line 3 column 14: '<' + '/' + letter not allowed here
line 4 column 5: '<' + '/' + letter not allowed here
line 2 column 1: missing
line 2 column 1: missing
line 1 column 1: inserting missing 'title' element
Node(0): Root {
Node(1): DOCTYPE {
@PUBLIC = (null)
}
Node(1): html {
Node(2): head {
Node(3): meta {
@name = generator
@content = HTML Tidy for HTML5 for Linux/x86 version 5.1.2
}
Node(3): title {
}
}
Node(2): body {
Node(3): script {
Node(4): Text {
Text: document.write(" terminator.
Thoughts?

Karl Dahlke

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from out.smtp-auth.no-ip.com (smtp-auth.no-ip.com [8.23.224.61]) by hurricane.the-brannons.com (Postfix) with ESMTPS id AD1287891D for ; Thu, 10 Sep 2015 22:25:25 -0700 (PDT)
X-No-IP: carhart.net@noip-smtp
X-Report-Spam-To: abuse@no-ip.com
Received: from carhart.net (unknown [99.52.200.227]) (Authenticated sender: carhart.net@noip-smtp) by smtp-auth.no-ip.com (Postfix) with ESMTPA id 26277400BA3; Thu, 10 Sep 2015 22:28:05 -0700 (PDT)
Received: from carhart.net (localhost [127.0.0.1]) by carhart.net (8.13.8/8.13.8) with ESMTP id t8B5S4F3002697; Thu, 10 Sep 2015 22:28:04 -0700
Received: from localhost (kevin@localhost) by carhart.net (8.13.8/8.13.8/Submit) with ESMTP id t8B5S3Fh002688; Thu, 10 Sep 2015 22:28:04 -0700
Date: Thu, 10 Sep 2015 22:28:03 -0700 (PDT)
From: Kevin Carhart
To: Karl Dahlke
cc: edbrowse-dev@lists.the-brannons.com
In-Reply-To: <20150810211023.eklhad@comcast.net>
Message-ID:
References: <55F21D99.7070701@pcdesk.net> <20150810211023.eklhad@comcast.net>
User-Agent: Alpine 2.03 (LRH 1266 2009-07-14)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Subject: Re: [Edbrowse-dev] script tags in scripts
X-BeenThere: edbrowse-dev@lists.the-brannons.com
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Edbrowse Development List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 11 Sep 2015 05:25:25 -0000

Interesting.. Karl, does your certainty mean that you are saying that the distinction between the two tags is fundamentally unknowable for a parser?

I guess one good sign is that there appears to be a lot of past literature on this issue, on Tidy listservs. Including one from 2006 called "Tidy barfs on split >

paragraph

>
>
> db6
> js
> b
>
> undoCompare no undo map
> line 1 column 1: missing declaration
> line 2 column 34: '<' + '/' + letter not allowed here
> line 2 column 69: '<' + '/' + letter not allowed here
> line 3 column 14: '<' + '/' + letter not allowed here
> line 4 column 5: '<' + '/' + letter not allowed here
> line 2 column 1: missing
> line 2 column 1: missing
> line 1 column 1: inserting missing 'title' element
> Node(0): Root {
> Node(1): DOCTYPE {
> @PUBLIC = (null)
> }
> Node(1): html {
> Node(2): head {
> Node(3): meta {
> @name = generator
> @content = HTML Tidy for HTML5 for Linux/x86 version 5.1.2
> }
> Node(3): title {
> }
> }
> Node(2): body {
> Node(3): script {
> Node(4): Text {
> Text: document.write(" terminator.
> Thoughts?
>
> Karl Dahlke
> _______________________________________________
> Edbrowse-dev mailing list
> Edbrowse-dev@lists.the-brannons.com
> http://lists.the-brannons.com/mailman/listinfo/edbrowse-dev
>

--------
Kevin Carhart * 415 225 5306 * The Ten Ninety Nihilists

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wi0-x22b.google.com (mail-wi0-x22b.google.com [IPv6:2a00:1450:400c:c05::22b]) by hurricane.the-brannons.com (Postfix) with ESMTPS id D60F67891D for ; Fri, 11 Sep 2015 00:37:07 -0700 (PDT)
Received: by wicge5 with SMTP id ge5so51985751wic.0 for ; Fri, 11 Sep 2015 00:39:46 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=75FPOPwG1SDUrML/kzEOEAiH2A8Bw/4p1PAL8PDel2Y=; b=nQtL6as4BNwMShURy+scoJtxKDPXTUUYlV3ylF0MqnucRyoRvOFtwUqjMl7G5AGE2m LkDPIZMXFHDy4Ch15s604thSuFcacwgDII4Y7KSFLIuADWn/F3ZKWLqB8NKxEp+kKUAb 1eDf58YD5xXUQ+LAroZUT0+fHo/kHU0mUN+q1gFjAsa1uwV9PEY+hjphiol7ElYB0Ssz Hqsvj+CbiswrZrpjAarzSznm0uLYCjNLafwKLan0Gfbs5ktGQQgi2JCmn8cwmNy9qyxb +RBNt8U/6NmxeoSmHBP9sOhvuwubo+sULwg3EWHXPd5T7jAZJ9VnWh+syRqKjoazrPkR tIVQ==
X-Received: by 10.180.74.148 with SMTP id t20mr14579040wiv.31.1441957186305; Fri, 11 Sep 2015 00:39:46 -0700 (PDT)
Received: from toaster.adamthompson.me.uk (toaster.adamthompson.me.uk. [2001:8b0:1142:9042::2]) by smtp.gmail.com with ESMTPSA id 12sm342643wjw.15.2015.09.11.00.39.44 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Sep 2015 00:39:45 -0700 (PDT)
Date: Fri, 11 Sep 2015 08:39:39 +0100
From: Adam Thompson
To: Kevin Carhart
Cc: Karl Dahlke , edbrowse-dev@lists.the-brannons.com
Message-ID: <20150911073939.GA29720@toaster.adamthompson.me.uk>
References: <55F21D99.7070701@pcdesk.net> <20150810211023.eklhad@comcast.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="7AUc2qLy4jB3hD7Z"
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
Subject: Re: [Edbrowse-dev] script tags in scripts
X-BeenThere: edbrowse-dev@lists.the-brannons.com
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Edbrowse Development List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 11 Sep 2015 07:37:08 -0000

--7AUc2qLy4jB3hD7Z
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Sep 10, 2015 at 10:28:03PM -0700, Kevin Carhart wrote:
>
> Interesting.. Karl, does your certainty mean that you are saying
> that the distinction between the two tags is fundamentally
> unknowable for a parser?

It's certainly difficult if the parser isn't also capable of parsing the scripting language within the script tags.

> I guess one good sign is that there appears to be a lot of
> past literature on this issue, on Tidy listservs. Including
> one from 2006 called "Tidy barfs on split but rather to split it at the / thus:
document.write("<");
document.write("/script>");
But I may be wrong there.

We should probably report a bug against tidy5 in any case for this.
That's why we're using a parsing library after all. At least this one's maintained for us, so there's a reasonable chance they'll fix these things once they work out a workable solution.

Cheers,
Adam.

--7AUc2qLy4jB3hD7Z
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJV8oU7AAoJELZ22lNQBzHOLgkH/RfwWcvRbRMgdchW753iOdbh
tB/fT3SdQOdDgPGEz+pFGJ3P652CeziKqceCGsLR8Y9csHynIfE3Z0/Ivy7VbWAg
Ijcr48lFE1BZDxNcVvODNR1BM8IsV/6Sk8H9MheSpWLOdkZ0dXdZMd8UN4sX++4Q
434G+Qg1BUygFlTs524u3FsLenxLNJ/fZFo1DCs0qg7ErCEZYWatQ8QSrC4FVTas
FgpfA7OcN0CreZCGq2enfZmeRMbg0WxAGd2UUqxTxL/Sth7Bq/zXSgTqup9SX7jf
IDHvurTld/g7p9CRyggNHypdQRgkVo3IcuJFfVy6l3ssskSmsWRDXZthf9HKWwg=
=SaJF
-----END PGP SIGNATURE-----

--7AUc2qLy4jB3hD7Z--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from resqmta-ch2-02v.sys.comcast.net (resqmta-ch2-02v.sys.comcast.net [IPv6:2001:558:fe21:29:69:252:207:34]) by hurricane.the-brannons.com (Postfix) with ESMTPS id 535E47890C for ; Fri, 11 Sep 2015 03:14:34 -0700 (PDT)
Received: from resomta-ch2-14v.sys.comcast.net ([69.252.207.110]) by resqmta-ch2-02v.sys.comcast.net with comcast id FmGu1r0032PT3Qt01mHDjb; Fri, 11 Sep 2015 10:17:13 +0000
Received: from eklhad ([IPv6:2601:405:4002:b0a:21e:4fff:fec2:a0f1]) by resomta-ch2-14v.sys.comcast.net with comcast id FmHD1r00A0GArqr01mHDfu; Fri, 11 Sep 2015 10:17:13 +0000
To: edbrowse-dev@lists.the-brannons.com
From: Karl Dahlke
Reply-to: Karl Dahlke
References: <55F21D99.7070701@pcdesk.net> <20150911073939.GA29720@toaster.adamthompson.me.uk>
User-Agent: edbrowse/3.5.4.2+
Date: Fri, 11 Sep 2015 06:17:13 -0400
Message-ID: <20150811061713.eklhad@comcast.net>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20140121; t=1441966633; bh=GTnrrTRFFphDbyYZrXqKIrHP9xLgLlim6P85oP+/t5o=; h=Received:Received:To:From:Reply-to:Subject:Date:Message-ID: Mime-Version:Content-Type; b=boZOq5H0F0hGIRz9nIZU4TSdWDcn3AngDA7SzQEd6zZA3F+zm7z1p2NhSuGgFZVvc TH9cbaJpLnCZASNVOQTvxvTGJmB5/3gm60/6G/df7qoizTwAaHB66MsTL9DAFXk1sX OblEZMvVBtM7hxxFb2aSNcWBZhed7uCk1NuPyHGZuhMAdZiU6QZkRnQOK3iSEwV0L2 NMjLMwD9Xv7qY/V+bCs50j7glzgttsFBbjkirlr0EARm3rQoYw0ijQIiLdKd6yml+g zoLPi3owth0dMoZtScX+CBZARhp59HiPjk6jykk1jURSM4mBvxdJgVI1tNA8+Iqi/O NLMyeSu4X6EPw==
Subject: [Edbrowse-dev] script tags in scripts
X-BeenThere: edbrowse-dev@lists.the-brannons.com
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Edbrowse Development List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 11 Sep 2015 10:14:34 -0000

> I'm not sure what we can do about this,
> but I'm inclined to think that whatever we do won't catch every case and that
> at some stage we have to accept that and move on.

That was true of my parser, true of tidy5, and true of any parser. However, as you point out regularly, we should handle most websites that other browsers handle. And when we don't, entire web pages shouldn't disappear beyond the point of error. This bug is produced by fanfiction.net and fictionpress.com, two high-volume sites that work on every other browser.

And by the way, my thanks to those users who exercise and test our bleeding-edge software; you're as brave as a Windows 10 insider.

In any case, tidy5 needs to fix this, or we need to find a way to preprocess around it, the latter meaning I'd have to keep at least half of my parser, which I really wanted to throw away entirely.
:(

Karl Dahlke

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (71-38-131-64.ptld.qwest.net [71.38.131.64]) by hurricane.the-brannons.com (Postfix) with ESMTPSA id 0F6FC7890C for ; Fri, 11 Sep 2015 09:35:00 -0700 (PDT)
From: Chris Brannon
To: edbrowse-dev@lists.the-brannons.com
References: <55F21D99.7070701@pcdesk.net> <20150810211023.eklhad@comcast.net>
Date: Fri, 11 Sep 2015 09:37:40 -0700
In-Reply-To: (Kevin Carhart's message of "Thu, 10 Sep 2015 22:28:03 -0700 (PDT)")
Message-ID: <87vbbgud2j.fsf@mushroom.localdomain>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Subject: Re: [Edbrowse-dev] script tags in scripts
X-BeenThere: edbrowse-dev@lists.the-brannons.com
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Edbrowse Development List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 11 Sep 2015 16:35:00 -0000

Kevin Carhart writes:

> I guess one good sign is that there appears to be a lot of
> past literature on this issue, on Tidy listservs. Including
> one from 2006 called "Tidy barfs on split