edbrowse-dev - development list for edbrowse
* [Edbrowse-dev] tidy and db5
@ 2015-08-28  2:55 Kevin Carhart
  2015-08-28 21:35 ` [Edbrowse-dev] tidy tree Kevin Carhart
  0 siblings, 1 reply; 7+ messages in thread
From: Kevin Carhart @ 2015-08-28  2:55 UTC (permalink / raw)
  To: Edbrowse-dev

Good progress!  OK, I could bring in the db5 block unless you have 
zipped on to this already.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Edbrowse-dev] tidy tree
  2015-08-28  2:55 [Edbrowse-dev] tidy and db5 Kevin Carhart
@ 2015-08-28 21:35 ` Kevin Carhart
  2015-08-28 21:55   ` Chris Brannon
  0 siblings, 1 reply; 7+ messages in thread
From: Kevin Carhart @ 2015-08-28 21:35 UTC (permalink / raw)
  To: Edbrowse-dev



I think now that we have the TidyDoc, there is some variety possible 
in how to lay things out, so that the output is actually usable and 
easy to sit down with, in the ways that we need.

Here is what I have right now.

The node name is shown.
The number in parentheses is the number of levels of nesting.
After each node is each attribute name-value pair.
What else?

Here is a snippet of amazon.com.
Node(3): div
Attribute: class = navFooterLine navFooterLinkLine navFooterPadItemLine
Node(4): ul
Node(5): li
Attribute: class = nav_first
Node(6): a
Attribute: href = /gp/help/customer/display.html/ref=footer_cou?ie=UTF8&nodeId=508088
Attribute: class = nav_a
Node(7): Text
Node(5): li
Node(6): a
Attribute: href = /gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=468496
Attribute: class = nav_a
Node(7): Text
Node(5): li
Node(6): a
Attribute: href = /interestbasedads
Attribute: class = nav_a
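
To make the layout above concrete, here is a minimal sketch that prints
the same "Node(depth): name" and "Attribute: name = value" lines. It does
not use libtidy: struct dump_node, struct dump_attr, appendf and
dump_children are hypothetical stand-ins, invented for illustration, for
the TidyNode and TidyAttr handles a real traversal would use.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for TidyAttr / TidyNode, for illustration only. */
struct dump_attr {
	const char *name;
	const char *value;
	const struct dump_attr *next;
};

struct dump_node {
	const char *name;
	const struct dump_attr *attrs;
	const struct dump_node *child;	/* first child */
	const struct dump_node *next;	/* next sibling */
};

/* Append formatted text to the NUL-terminated string in out, never past cap. */
void appendf(char *out, size_t cap, const char *fmt, ...)
{
	size_t used;
	va_list ap;

	assert(out != NULL && cap > 0);
	used = strlen(out);
	if (used >= cap)
		return;
	va_start(ap, fmt);
	vsnprintf(out + used, cap - used, fmt, ap);
	va_end(ap);
}

/* Write every descendant of parent: one line per node, then its attributes. */
void dump_children(const struct dump_node *parent, int depth,
		   char *out, size_t cap)
{
	const struct dump_node *c;
	const struct dump_attr *a;

	assert(parent != NULL);
	for (c = parent->child; c != NULL; c = c->next) {
		appendf(out, cap, "Node(%d): %s\n", depth, c->name);
		for (a = c->attrs; a != NULL; a = a->next)
			appendf(out, cap, "Attribute: %s = %s\n",
				a->name, a->value);
		dump_children(c, depth + 1, out, cap);
	}
}
```

Writing into a caller-supplied buffer rather than straight to stdout is
only a choice made here so the output can be compared in a test.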

thanks
Kevin




On Thu, 27 Aug 2015, Kevin Carhart wrote:

> Good progress!  OK, I could bring in the db5 block unless you have
> zipped on to this already.

--------
Kevin Carhart * 415 225 5306 * The Ten Ninety Nihilists


* Re: [Edbrowse-dev] tidy tree
  2015-08-28 21:35 ` [Edbrowse-dev] tidy tree Kevin Carhart
@ 2015-08-28 21:55   ` Chris Brannon
  2015-08-28 22:38     ` Kevin Carhart
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Brannon @ 2015-08-28 21:55 UTC (permalink / raw)
  To: Kevin Carhart; +Cc: Edbrowse-dev

Kevin Carhart <kevin@carhart.net> writes:

> Here is what I have right now.
>
> The node name is shown.
> The number in parentheses is the number of levels of nesting.
> After each node is each attribute name-value pair.
> What else?

That's very awesome!  Sorry, I must have missed seeing that you had started
on this.  The only thing I might add is perhaps a display of the
contents of text nodes.  Or at least the length of the underlying
string?  On the other hand, maybe that's a little too much noise?
Perhaps we could show that on higher debug levels?
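
One way to picture that trade-off, as a rough sketch only: show just the
length of a text node at one debug level and the contents at the next.
The function name describe_text, the thresholds 5 and 6, and the
"Text(n)" layout are all assumptions made up for this example, not
anything edbrowse does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one text-node line into out; the detail depends on debug_level.
 * Below 5 nothing is shown, at 5 only the length, at 6 and up the text too.
 * Returns what snprintf returns, or 0 when nothing is written.
 */
int describe_text(int debug_level, const char *text, char *out, size_t cap)
{
	assert(text != NULL && out != NULL && cap > 0);
	if (debug_level >= 6)
		return snprintf(out, cap, "Text(%zu): %s", strlen(text), text);
	if (debug_level >= 5)
		return snprintf(out, cap, "Text(%zu)", strlen(text));
	out[0] = '\0';
	return 0;
}
```

For example, describe_text(5, "Help", buf, sizeof buf) leaves "Text(4)" in
buf, and the same call at level 6 leaves "Text(4): Help".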

-- Chris


* Re: [Edbrowse-dev] tidy tree
  2015-08-28 21:55   ` Chris Brannon
@ 2015-08-28 22:38     ` Kevin Carhart
  0 siblings, 0 replies; 7+ messages in thread
From: Kevin Carhart @ 2015-08-28 22:38 UTC (permalink / raw)
  To: Chris Brannon; +Cc: Edbrowse-dev



Hi Chris
No problem.  Echoing the text at a higher db level is a good idea.  I'll 
go add that to db6 right now if I can figure it out swiftly and then send 
it to you.  I did not write very much. 
I grabbed existing work, such as this, so thank you Suman Srinivasan.

http://sumancolumbia.blogspot.com/2006/03/parsing-html-using-tidy-and-tidylib.html

Kevin


On Fri, 28 Aug 2015, Chris Brannon wrote:

> Kevin Carhart <kevin@carhart.net> writes:
>
>> Here is what I have right now.
>>
>> The node name is shown.
>> The number in parentheses is the number of levels of nesting.
>> After each node is each attribute name-value pair.
>> What else?
>
> That's very awesome!  Sorry, I must have missed seeing that you had started
> on this.  The only thing I might add is perhaps a display of the
> contents of text nodes.  Or at least the length of the underlying
> string?  On the other hand, maybe that's a little too much noise?
> Perhaps we could show that on higher debug levels?
>
> -- Chris
>

--------
Kevin Carhart * 415 225 5306 * The Ten Ninety Nihilists


* Re: [Edbrowse-dev] tidy tree
  2015-08-29  2:25 ` Karl Dahlke
@ 2015-08-29  3:04   ` Kevin Carhart
  0 siblings, 0 replies; 7+ messages in thread
From: Kevin Carhart @ 2015-08-29  3:04 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev



> If no one objects I will probably make some small tweaks and push this,
> it's just db5 prints so shouldn't hurt anything.

Sounds great!  Yes, that one main example has been helpful.  The library 
documentation on htacg.org often doesn't have any actual notes or prose on 
the functions (it has that auto-generated feel), but there seems to be an 
adequate amount of tidy code around.  For instance, I picked up from 
someplace that the node name is this string type I know nothing about, 
ctmbstr, but someone had cast it to a char *, so that's how I knew to try 
that one.
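
On ctmbstr: as far as I can tell from tidy's platform header, it is just
a typedef for a const char pointer, so the cast may not be needed at all.
The sketch below repeats those typedefs locally (an assumption, so that
it builds without tidy.h; with the real header they would be dropped),
and format_node_label is a hypothetical helper, not a tidy function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Local mirror of what tidy's platform header is believed to declare. */
typedef char tmbchar;
typedef tmbchar *tmbstr;
typedef const tmbchar *ctmbstr;

/* Hypothetical helper: build a "Node(depth): name" label from a ctmbstr. */
int format_node_label(ctmbstr name, int depth, char *out, size_t cap)
{
	assert(name != NULL && out != NULL && cap > 0);
	/* name is already a const char *, so %s accepts it without a cast */
	return snprintf(out, cap, "Node(%d): %s", depth, name);
}
```

If that reading of the header is right, strcmp(name, "Text") should also
compile as is, with no (char *) in front of name.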

Kevin

>
> There's no \r in your patch so I should have no trouble applying it.
>
> Once in, we'll want to test it on all sorts of html,
> especially embedded with javascript, to make sure the nodes are correct.
> After all, we don't know for sure that we can even use this yet,
> the conversion from html to nodes has to be almost perfect.
> Though I suppose we could report any bugs to the tidy5 team.
>
> Karl Dahlke

--------
Kevin Carhart * 415 225 5306 * The Ten Ninety Nihilists


* [Edbrowse-dev]  tidy tree
  2015-08-29  1:45 Kevin Carhart
@ 2015-08-29  2:25 ` Karl Dahlke
  2015-08-29  3:04   ` Kevin Carhart
  0 siblings, 1 reply; 7+ messages in thread
From: Karl Dahlke @ 2015-08-29  2:25 UTC (permalink / raw)
  To: Edbrowse-dev

> I introduced two routines from the sample code: dumpBody and dumpNode.

Obviously the sample code has saved us all some time.
This template is far more valuable than just debugging and visualization;
we will want to traverse the tree in the same way and fold those nodes into our nodes,
and create the js objects, and put all the attributes in as object members, and so on.
So this is all good, and surprisingly concise.
If no one objects I will probably make some small tweaks and push this,
it's just db5 prints so shouldn't hurt anything.
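
A rough sketch of what reusing the traversal could look like: the walk
takes a callback, so the same loop could serve a debug print today and
node building later. Everything here is hypothetical (struct tree_node,
visit_fn, walk, collect_stats); it is a stand-in for the tidy tree, not
edbrowse code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a parsed node. */
struct tree_node {
	const char *name;
	const struct tree_node *child;	/* first child */
	const struct tree_node *next;	/* next sibling */
};

/* Called once per node, in document order, with its nesting depth. */
typedef void (*visit_fn)(const struct tree_node *n, int depth, void *ctx);

/* Visit every descendant of parent, depth first. */
void walk(const struct tree_node *parent, int depth, visit_fn visit, void *ctx)
{
	const struct tree_node *c;

	assert(parent != NULL && visit != NULL);
	for (c = parent->child; c != NULL; c = c->next) {
		visit(c, depth, ctx);
		walk(c, depth + 1, visit, ctx);
	}
}

/* Example visitor: count the nodes and remember the deepest level seen. */
struct stats {
	int count;
	int max_depth;
};

void collect_stats(const struct tree_node *n, int depth, void *ctx)
{
	struct stats *s = ctx;

	(void)n;		/* the node itself is not needed for counting */
	s->count++;
	if (depth > s->max_depth)
		s->max_depth = depth;
}
```

A second visitor that creates objects instead of counting would plug
into walk the same way, without touching the loop.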

There's no \r in your patch so I should have no trouble applying it.

Once in, we'll want to test it on all sorts of html,
especially embedded with javascript, to make sure the nodes are correct.
After all, we don't know for sure that we can even use this yet,
the conversion from html to nodes has to be almost perfect.
Though I suppose we could report any bugs to the tidy5 team.

Karl Dahlke


* [Edbrowse-dev] tidy tree
@ 2015-08-29  1:45 Kevin Carhart
  2015-08-29  2:25 ` Karl Dahlke
  0 siblings, 1 reply; 7+ messages in thread
From: Kevin Carhart @ 2015-08-29  1:45 UTC (permalink / raw)
  To: chris, kevin, Edbrowse-dev

[-- Attachment #1: Type: text/plain, Size: 1405 bytes --]

Well, that took a while but I have the contents of text nodes now showing if you are at db6.

- I bring in the tidybuffio.h so that I can make a TidyBuffer
- I bring in a TidyBuffer because the tidyNodeGetText routine puts its output in one.
- Unlike TidyDoc, it seems that in order to free a TidyBuffer, you must first test for null.
Otherwise the program seg faults, based on a thread I was reading.
I have a check at the end of the routine for whether it is safe to call tidyBufFree.
I test the .size.  I'm not sure if this is correct; does anyone know?  It wouldn't let me compare the TidyBuffer object with null.
- The mechanics of the node traversal, I brought in from the Tidy example as well as other snippets of work.  
So I introduced two routines from the sample code: dumpBody and dumpNode.  
I placed them just after encodeTags.
- Why do they hardcode those first several cases when they are switching on the node name?
I assume it is something to do with the laws of the W3C spec?
Like, are these branches that terminate, so you don't have to worry about additional levels?
Anyway, I left them alone.
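
On the tidyBufFree question above, a hedged sketch. In C a struct value
cannot be compared with NULL, only a pointer can, which is probably why
the compiler refused. The size field counts bytes in use and, as far as I
can tell, drops back to 0 on a clear while the storage stays allocated,
so testing the bp pointer looks like the more reliable check. The code
below does not use libtidy: struct demo_buf and the demo_buf_* functions
are hypothetical stand-ins that only mirror the bp, size and allocated
fields of TidyBuffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for TidyBuffer; not the real structure. */
struct demo_buf {
	unsigned char *bp;	/* storage, NULL until first use */
	unsigned int size;	/* bytes currently in use */
	unsigned int allocated;	/* bytes reserved */
};

/* Store a NUL-terminated copy of text, growing the storage if needed. */
int demo_buf_set(struct demo_buf *b, const char *text)
{
	size_t need;

	assert(b != NULL && text != NULL);
	need = strlen(text) + 1;
	if (need > b->allocated) {
		unsigned char *p = realloc(b->bp, need);

		if (p == NULL)
			return -1;
		b->bp = p;
		b->allocated = (unsigned int)need;
	}
	memcpy(b->bp, text, need);
	b->size = (unsigned int)(need - 1);
	return 0;
}

/* Forget the contents but keep the storage, as a clear-style call would. */
void demo_buf_clear(struct demo_buf *b)
{
	assert(b != NULL);
	b->size = 0;
}

/* Release the storage; the test is on the pointer, not on size. */
void demo_buf_free(struct demo_buf *b)
{
	assert(b != NULL);
	if (b->bp != NULL) {
		free(b->bp);
		b->bp = NULL;
	}
	b->size = 0;
	b->allocated = 0;
}
```

After a set followed by a clear, size is 0 but bp is still non-null, so a
free guarded by size > 0 would skip the release and leak. Calling
tidyBufInit on the real buffer before first use may also make an
unconditional tidyBufFree safe, though that is worth checking against the
libtidy source.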

Over to you!  I'm sure this will raise some fertile issues in what to do from here.

I hope there will not be \r introduced into this attachment.  If there is, the email client is ruled out as a culprit, and I'll worry about other causes.
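
For anyone who wants to check a patch for stray \r bytes before applying
it, a minimal sketch. count_cr is a hypothetical helper that works on
bytes already read into memory; reading the file is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Count carriage-return bytes in a buffer, e.g. the contents of a patch. */
size_t count_cr(const char *data, size_t len)
{
	size_t i, n = 0;

	assert(data != NULL || len == 0);
	for (i = 0; i < len; i++)
		if (data[i] == '\r')
			n++;
	return n;
}
```

A result of 0 means no line in the buffer carries a DOS-style ending.
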
thanks
Kevin

[-- Attachment #2: /home/kevin/public_html/c7/edbrowse/KC_patch20150828.txt --]
[-- Type: text/plain, Size: 2838 bytes --]

diff -Naur 1/edbrowse-master/src/html.c 2/edbrowse-master/src/html.c
--- 1/edbrowse-master/src/html.c	2015-08-27 14:18:35.000000000 -0700
+++ 2/edbrowse-master/src/html.c	2015-08-28 17:59:09.092626328 -0700
@@ -5,7 +5,7 @@
 
 #include "eb.h"
 #include "tidy.h"
-
+#include "tidybuffio.h"
 #define handlerPresent(obj, name) (has_property(obj, name) == EJ_PROP_FUNCTION)
 
 static TidyDoc tdoc;
@@ -1695,6 +1695,10 @@
 		showTidyMessages = false;
 	tidySetCharEncoding(tdoc, (cons_utf8 ? "utf8" : "latin1"));
 	tidyParseString(tdoc, html);
+	if (debugLevel >= 5) {
+		tidyCleanAndRepair(tdoc);
+		dumpBody(tdoc);
+	}
 
 	ns = initString(&ns_l);
 	preamble = initString(&preamble_l);
@@ -2641,6 +2645,88 @@
 	return ns;
 }				/* encodeTags */
 
+void dumpBody(TidyDoc tdoc)
+{
+/* just for debugging - we only reach this routine at db5 or above */
+	dumpNode(tidyGetBody(tdoc), 0);
+}
+
+void dumpNode(TidyNode tnod, int indent)
+{
+/* just for debugging - we only reach this routine at db5 or above */
+	TidyNode child;
+	TidyBuffer tnv = { 0 };	/* text-node value */
+	for (child = tidyGetChild(tnod); child; child = tidyGetNext(child)) {
+		ctmbstr name;
+		tidyBufClear(&tnv);
+		switch (tidyNodeGetType(child)) {
+		case TidyNode_Root:
+			name = "Root";
+			break;
+		case TidyNode_DocType:
+			name = "DOCTYPE";
+			break;
+		case TidyNode_Comment:
+			name = "Comment";
+			break;
+		case TidyNode_ProcIns:
+			name = "Processing Instruction";
+			break;
+		case TidyNode_Text:
+			name = "Text";
+			break;
+		case TidyNode_CDATA:
+			name = "CDATA";
+			break;
+		case TidyNode_Section:
+			name = "XML Section";
+			break;
+		case TidyNode_Asp:
+			name = "ASP";
+			break;
+		case TidyNode_Jste:
+			name = "JSTE";
+			break;
+		case TidyNode_Php:
+			name = "PHP";
+			break;
+		case TidyNode_XmlDecl:
+			name = "XML Declaration";
+			break;
+		case TidyNode_Start:
+		case TidyNode_End:
+		case TidyNode_StartEnd:
+		default:
+			name = tidyNodeGetName(child);
+			break;
+		}
+		assert(name != NULL);
+		printf("Node(%d): %s\n", (indent / 4), ((char *)name));
+		if (debugLevel >= 6) {
+/* the ifs could be combined with && */
+			if (strcmp(((char *)name), "Text") == 0) {
+				tidyNodeGetText(tdoc, child, &tnv);
+				printf("Text: %s", tnv.bp);
+/* no trailing newline because it appears that there already is one */
+			}
+		}
+
+/* Get the first attribute for all nodes */
+		TidyAttr tattr = tidyAttrFirst(child);
+		while (tattr != NULL) {
+/* Print the node and its attribute */
+			printf("Attribute: %s = %s\n", tidyAttrName(tattr),
+			       tidyAttrValue(tattr));
+/* Get the next attribute */
+			tattr = tidyAttrNext(tattr);
+		}
+		dumpNode(child, indent + 4);
+	}
+	if (tnv.size > 0) {
+		tidyBufFree(&tnv);
+	}
+}
+
 void preFormatCheck(int tagno, bool * pretag, bool * slash)
 {
 	const struct htmlTag *t;


Thread overview: 7+ messages
2015-08-28  2:55 [Edbrowse-dev] tidy and db5 Kevin Carhart
2015-08-28 21:35 ` [Edbrowse-dev] tidy tree Kevin Carhart
2015-08-28 21:55   ` Chris Brannon
2015-08-28 22:38     ` Kevin Carhart
2015-08-29  1:45 Kevin Carhart
2015-08-29  2:25 ` Karl Dahlke
2015-08-29  3:04   ` Kevin Carhart
