From: Karl Dahlke
To: Edbrowse-dev@lists.the-brannons.com
Date: Sun, 30 Aug 2015 06:31:11 -0400
Subject: [Edbrowse-dev] tidy debug tree, and a js script
Message-ID: <20150730063111.eklhad@comcast.net>
References: <20150729060404.eklhad@comcast.net> <20150830100239.GC17154@toaster.adamthompson.me.uk>
User-Agent: edbrowse/3.5.4.2+

> Any chance you could have a go at converting some of the parsing logic

Wow - I'm good but not that good.
It's a pretty big project.
I don't want to move too slowly, having done almost nothing on edbrowse in the past 6 months, but I don't want to run recklessly fast either.
I need to pass designs by you guys before coding, etc.

And there are still some big questions to answer, like whether tidy5 is the right path, or perhaps libhubbub, which could be part of a larger browser effort, larger than just parsing html.

netsurf-browser.org

I'm running another sanity check on tidy.
This generates an error because & is not escaped, and yes it probably should be.
It even converts &copy into the copyright symbol, now part of the url.
So ok, maybe I did a bad test because I'm not following spec, but the internet doesn't follow spec either, not all the time.
Look at the raw html from www.sciam.com.
It contains these two lines, on the same home page.
  • Subscribe to All Access »
  • Subscribe to Print »

The first one has & escaped, the second one does not.
So ok, I just wanted to make sure tidy is handling these two cases properly, and it is.
Happily, my parser also handles these cases properly.
I must have run into this at some point.

I'll continue testing.
Assuming I uncover no serious problems, I think the next step is to enhance our edbrowse node with enough attributes to faithfully copy the information from a tidy node.
We have some of the attributes but not enough.
A blatant omission is a text string, because we never represented text nodes before.
We'll need this, and child pointers, and a list of attribute value pairs, and other things.
I'll post more on this later.

Karl Dahlke
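
The node described above (a text string, child pointers, and a list of attribute value pairs) might be sketched in C roughly as follows. This is only an illustration under stated assumptions: the struct and function names (htmlNode, attrPair, newNode, addAttr, appendChild, getAttr, freeTree) are hypothetical and are not edbrowse's or tidy's actual definitions, and the "other things" mentioned above are left out.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout for discussion; not edbrowse's real node. */
struct attrPair {
	char *name;
	char *value;
	struct attrPair *next;
};

struct htmlNode {
	char *name;			/* tag name, NULL for a text node */
	char *text;			/* text content, NULL for an element */
	struct attrPair *attrs;		/* attribute value pairs, in source order */
	struct htmlNode *parent;
	struct htmlNode *firstChild;
	struct htmlNode *lastChild;
	struct htmlNode *nextSibling;
};

/* Copy a string onto the heap; NULL in, NULL out. */
char *dupString(const char *s)
{
	char *t;
	if (!s)
		return NULL;
	t = malloc(strlen(s) + 1);
	if (t)
		strcpy(t, s);
	return t;
}

/* Pass a name for an element, or a text string for a text node. */
struct htmlNode *newNode(const char *name, const char *text)
{
	struct htmlNode *n = calloc(1, sizeof(*n));
	if (!n)
		return NULL;
	n->name = dupString(name);
	n->text = dupString(text);
	return n;
}

/* Append an attribute at the end of the list; returns 0 on success. */
int addAttr(struct htmlNode *n, const char *name, const char *value)
{
	struct attrPair **p;
	struct attrPair *a = calloc(1, sizeof(*a));
	if (!a)
		return -1;
	a->name = dupString(name);
	a->value = dupString(value);
	for (p = &n->attrs; *p; p = &(*p)->next)
		;
	*p = a;
	return 0;
}

/* Look up an attribute value by name; NULL if absent. */
const char *getAttr(const struct htmlNode *n, const char *name)
{
	const struct attrPair *a;
	for (a = n->attrs; a; a = a->next)
		if (a->name && strcmp(a->name, name) == 0)
			return a->value;
	return NULL;
}

/* Link child as the last child of parent. */
void appendChild(struct htmlNode *parent, struct htmlNode *child)
{
	child->parent = parent;
	if (parent->lastChild)
		parent->lastChild->nextSibling = child;
	else
		parent->firstChild = child;
	parent->lastChild = child;
}

/* Free a node, its attributes, and all of its descendants. */
void freeTree(struct htmlNode *n)
{
	struct htmlNode *c, *nextc;
	struct attrPair *a, *nexta;
	if (!n)
		return;
	for (c = n->firstChild; c; c = nextc) {
		nextc = c->nextSibling;
		freeTree(c);
	}
	for (a = n->attrs; a; a = nexta) {
		nexta = a->next;
		free(a->name);
		free(a->value);
		free(a);
	}
	free(n->name);
	free(n->text);
	free(n);
}
```

With a layout like this, copying a tidy tree would amount to walking tidy's nodes and calling newNode, addAttr, and appendChild for each one; a text node is simply a node with no name and a non-NULL text field.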