[Edbrowse-dev] tidy5

edbrowse-dev - development list for edbrowse

* [Edbrowse-dev]  tidy5
@ 2015-02-03 22:15 Karl Dahlke
  2015-02-03 23:41 ` Adam Thompson
  0 siblings, 1 reply; 13+ messages in thread
From: Karl Dahlke @ 2015-02-03 22:15 UTC
  To: Edbrowse-dev

> if tidy5 builds as a library (I hope it does)
> then we build and install their code same as any other library.

Sure, that makes sense.
So we'll see who has a chunk of time first to look into this.

If said library swallows html and gives us a tree of nodes, should we:

A) convert those nodes into the struct htmlTags we have today
and use most of our existing machinery, incremental, or

B) follow that tree directly, build js nodes off of it, use those nodes,
don't use our structures any more; this is more of a rewrite.

I'm not expecting an answer, because we'd probably have to look
at the code, and library, and resulting tree to answer the question.
Just something to chew on.

Karl Dahlke

* Re: [Edbrowse-dev] tidy5
  2015-02-03 22:15 [Edbrowse-dev] tidy5 Karl Dahlke
@ 2015-02-03 23:41 ` Adam Thompson
  0 siblings, 0 replies; 13+ messages in thread
From: Adam Thompson @ 2015-02-03 23:41 UTC
  To: Karl Dahlke; +Cc: Edbrowse-dev

On Tue, Feb 03, 2015 at 05:15:17PM -0500, Karl Dahlke wrote:
> > if tidy5 builds as a library (I hope it does)
> > then we build and install their code same as any other library.
> 
> Sure, that makes sense.
> So we'll see who has a chunk of time first to look into this.

I may have some time this weekend to have a look.
> If said library swallows html and gives us a tree of nodes, should we:
> 
> A) convert those nodes into the struct htmlTags we have today
> and use most of our existing machinery, incremental, or
> 
> B) follow that tree directly, build js nodes off of it, use those nodes,
> don't use our structures any more; this is more of a rewrite.
> 
> I'm not expecting an answer, because we'd probably have to look
> at the code, and library, and resulting tree to answer the question.
> Just something to chew on.

From what I've seen of tidy (based on the curl example code),
I'd go for something between the two.
Basically I'm thinking that our existing tag machinery needs work anyway,
so we'd want to adapt it, but then we'd follow the tidy-generated tree,
building our DOM based on that.
We'd then have js hooks into our DOM which allow js to alter it since I suspect
doing that with the tidy tree would be somewhat more involved with the tidy
code-base than we want to get.
This also gives us greater flexibility in DOM implementation.
Once js's finished with the DOM, we'd then render it.
This logic would have to be repeated each time js alters things (with some
optimisations) to ensure we get an accurate representation of what js's done to the page.

On another DOM-related note, it seems that we probably need to move the
contents of startwindow.js out of js and into our DOM implementation since DOM
objects are supposed to be "host" objects rather than javascript objects as
we're implementing them.
This is another reason to get a fully functional c DOM implementation,
since then we can plug in tidy5 to generate the initial structure and js to do
its thing, whilst allowing the js stuff to be in c++ and the tidy5 stuff to be
in whatever we need it to be.
This isn't going to be incremental, but at some stage I think we need to just do it and
stop making things "just work" for a few days until the next thing which "just
doesn't work" pops up.

Cheers,
Adam.

* Re: [Edbrowse-dev] tidy5
  2015-08-14  3:37           ` Karl Dahlke
@ 2015-08-16 18:10             ` Adam Thompson
  0 siblings, 0 replies; 13+ messages in thread
From: Adam Thompson @ 2015-08-16 18:10 UTC
  To: Karl Dahlke; +Cc: Edbrowse-dev

On Thu, Aug 13, 2015 at 11:37:54PM -0400, Karl Dahlke wrote:
> > In terms of an architecture I'm thinking of aiming to have the DOM as an
> > abstraction which can be used by both the rendering code and the js. Thus:
> > html is parsed into a node tree which is converted to our DOM objects
> > These objects are exposed to js via wrapper objects in the js world such that
> > any changes js makes are automatically passed through to the DOM
> > The renderer renders the DOM automatically on page load,
> > with support for re-rendering on a user command (with some sort of
> > notifications for js induced changes)
> > Form fields are altered in the DOM, which may or may not trigger a re-rendering
> 
> Yes this can cause a rerender, example onchange or onselect code,
> as exercised by the regression tests in jsrt.

Yeah, that was my thinking.

> > Any re-rendering would be partial, i.e.
> > only the changed segments of the DOM are re-rendered
> 
> This sounds like a diff between the old dom and the new,
> but it's easier to just rerender and then diff the old buffer against the new,
> and then report the lines that have changed, which is how edbrowse works today.
> Realize that a small change in dom could change the buffer
> on down the page, even into dom elements that have not changed.
> So I think you always want to just call render() and then
> diff the two buffers.
> Maybe even a diff library we can use, if not /bin/diff itself.

Yeah, I guess I'm just concerned about the js intensive pages which are
becoming much more common taking a long time to re-render,
but I can see that always doing a full re-render is the easiest and probably most
robust approach.
I'd quite like to have some smart approach to avoid making copies of unchanged
buffer lines whilst rendering, but I'm not too sure how that'd work.
> These are minor points; and you are definitely on track.
> This is where we need to be.

Thanks.

Regards,
Adam.

* [Edbrowse-dev]  tidy5
  2015-08-16  5:54                 ` Kevin Carhart
@ 2015-08-16 10:38                   ` Karl Dahlke
  0 siblings, 0 replies; 13+ messages in thread
From: Karl Dahlke @ 2015-08-16 10:38 UTC
  To: Edbrowse-dev

> And would some or most of the old case blocks be preserved, such as:
> the old case TAGACT_TABLE might resemble a new case TidyTag_TABLE

Yes I'm sure we would need to do that,
but I would save all that for step 2, step 1 is just calling tidy
and holding the resulting tree in-window,
until the window is freed.

> We're still building the new string 'ns'.. hmmm...

Let ns build as it does today in step 1, but by step 2
a routine render(), perhaps in render.c, will build it by traversing our dom tree.
So we will need to catch and retain text nodes, which aren't even part of our world today.
We have some tag nodes, but no text nodes.
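
As a rough illustration of step 2, here is a minimal sketch of a render()
that walks a tree containing both tag and text nodes and builds the output
string. The struct domNode layout, the strbuf helper, and the rule that only
p and br end a line are assumptions made for this example, not edbrowse code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical node kinds; edbrowse's real structures may differ. */
enum nodeKind { NODE_TAG, NODE_TEXT };

struct domNode {
	enum nodeKind kind;
	const char *name;	/* tag name for NODE_TAG, else NULL */
	const char *text;	/* text for NODE_TEXT, else NULL */
	struct domNode *firstChild;
	struct domNode *nextSibling;
};

/* Growable string, standing in for the buffer the thread calls ns. */
struct strbuf {
	char *s;
	size_t len, cap;
};

static void sbAppend(struct strbuf *b, const char *t)
{
	size_t n = strlen(t);
	if (b->len + n + 1 > b->cap) {
		size_t cap = b->cap ? b->cap : 64;
		char *p;
		while (cap < b->len + n + 1)
			cap *= 2;
		p = realloc(b->s, cap);
		if (!p)
			abort();
		b->s = p;
		b->cap = cap;
	}
	memcpy(b->s + b->len, t, n + 1);	/* copies the trailing NUL too */
	b->len += n;
}

/* Depth-first walk: text nodes add their text, p and br end a line. */
static void renderNode(const struct domNode *n, struct strbuf *out)
{
	for (; n; n = n->nextSibling) {
		if (n->kind == NODE_TEXT) {
			sbAppend(out, n->text);
			continue;
		}
		renderNode(n->firstChild, out);
		if (!strcmp(n->name, "p") || !strcmp(n->name, "br"))
			sbAppend(out, "\n");
	}
}

/* Returns a freshly allocated string; the caller frees it. */
char *render(const struct domNode *root)
{
	struct strbuf b = { NULL, 0, 0 };
	sbAppend(&b, "");	/* guarantees a valid empty string */
	renderNode(root, &b);
	return b.s;
}
```

With a body holding two p elements whose text nodes are "hello" and "world",
render() returns the two words on separate lines.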

> there is a name collision  ... mkdir

I found the same collision when I tried to recompile an old math program
I wrote 15 years ago.
The call used to be mkdir(file); now, in most libraries, it is
mkdir(file, mode), yet sometimes mkdir(file) still works,
sometimes not.
I'll check into this and most likely change to the second form,
which will most likely fix the problem.
Notice mkdir has the second form in main.c.
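
For reference, a minimal sketch of the two-argument form, which POSIX
declares in <sys/stat.h>. The helper name ensureDir and the 0700 mode are
assumptions for illustration, not what plugin.c or main.c actually do.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper, not an edbrowse function.
 * Creates a directory with the POSIX two-argument mkdir().
 * Returns 0 if the directory was created or already exists, -1 otherwise. */
int ensureDir(const char *path)
{
	struct stat st;

	if (mkdir(path, 0700) == 0)
		return 0;
	if (errno != EEXIST)
		return -1;
	/* Something is already there; accept it only if it is a directory. */
	if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
		return -1;
	return 0;
}
```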

> thanks.. this is fun..

It is, but a bit concerning in that I don't know if tidy5
will be maintained long term, but if not, worst case,
we can take it over which is better than writing our own html parser
from scratch as we were doing.
Hurray for open source.

Karl Dahlke

* Re: [Edbrowse-dev] tidy5
  2015-08-14 20:17               ` Chris Brannon
@ 2015-08-16  5:54                 ` Kevin Carhart
  2015-08-16 10:38                   ` Karl Dahlke
  0 siblings, 1 reply; 13+ messages in thread
From: Kevin Carhart @ 2015-08-16  5:54 UTC
  To: Chris Brannon; +Cc: Edbrowse-dev

Chris said:
> Essentially true.  The details are a bit more complicated, but this is
> the idea.  We call tidy to parse the html, and we get back a structure
> from tidy called a document.  It contains our tree of nodes, and we
> can iterate over it.  The problem is, this is a usable parse tree for
> the html, but it isn't a true DOM.  We can remove nodes and attributes
> from the tree, but we can't add them.  That causes problems for JS that
> needs to add new nodes.  So we're going to have to take that parse tree
> we get back from Tidy5, build our own DOM out of it, and eventually
> render it.

So the switch statement over (action) goes away, and the painstaking 
character-by-character tag recognition goes away, but maybe in return we need 
a switch statement with handling for every value, maybe grouped together 
in cases, that they list as "Known HTML element types" in the tidyenum.h 
file?

And would some or most of the old case blocks be preserved, such as:
the old case TAGACT_TABLE might resemble a new case TidyTag_TABLE
the old case TAGACT_TR might resemble a new case TidyTag_TR
...

Like you get some work done by the library, but also want a crack at these 
node types differentiated by what they are.  Is that correct?  We're still 
building the new string 'ns'.. hmmm... is more standardization possible, 
or do you still have to do a variety of things in order to add to ns 
properly?
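
One possible shape for such a switch, as a sketch only. The enum below is a
hypothetical stand-in for the element ids in tidyenum.h, and the separators
returned are illustrative rather than what edbrowse emits today.

```c
/* Stand-in for a parser's element ids. With tidy these would come from
 * tidyenum.h (for example TidyTag_TABLE); the names here are hypothetical. */
enum tagId { TAG_UNKNOWN, TAG_TABLE, TAG_TR, TAG_TD, TAG_P, TAG_BR };

/* Text to append to the output after an element has been rendered.
 * Tags that behave alike share one case group. */
const char *tagSeparator(enum tagId id)
{
	switch (id) {
	case TAG_TABLE:
	case TAG_TR:
	case TAG_P:
	case TAG_BR:
		return "\n";	/* these end the current line */
	case TAG_TD:
		return "|";	/* cells in a row share one line */
	default:
		return "";	/* unknown or inline tags add nothing */
	}
}
```

Grouping the cases keeps the per-tag work small: the parser supplies the id,
and only the tags that need special output get their own branch.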

---
Here is a second note on what Karl said (paraphrasing), as a first step, 
how about bringing libtidy into html.c, run their parse method and just 
bring the output around as part of the ebWindow struct, for further 
examination without breaking what now exists.

My note on this.  I went to eb.h to see what would happen if I included 
tidy.h and added a TidyDoc to the ebWindow struct.  Interestingly, because 
of includes from includes, there is a name collision when I try to 
compile.. I think... over mkdir in plugin.c and mkdir in 
/usr/include/sys/stat.h.  Uh, maybe it's a client thing though. 
Disregard if it doesn't sound salient..

thanks.. this is fun.. I hope tidy will work
Kevin

--------
Kevin Carhart * 415 225 5306 * The Ten Ninety Nihilists

* Re: [Edbrowse-dev] tidy5
  2015-08-14  3:45             ` Karl Dahlke
@ 2015-08-14 20:17               ` Chris Brannon
  2015-08-16  5:54                 ` Kevin Carhart
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Brannon @ 2015-08-14 20:17 UTC
  To: Edbrowse-dev

Karl Dahlke <eklhad@comcast.net> writes:

> Forgive me I haven't looked at the code at all,
> but I would guess there's a tidy5 encodeTags() that takes the
> html text and makes the tree.

Essentially true.  The details are a bit more complicated, but this is
the idea.  We call tidy to parse the html, and we get back a structure
from tidy called a document.  It contains our tree of nodes, and we
can iterate over it.  The problem is, this is a usable parse tree for
the html, but it isn't a true DOM.  We can remove nodes and attributes
from the tree, but we can't add them.  That causes problems for JS that
needs to add new nodes.  So we're going to have to take that parse tree
we get back from Tidy5, build our own DOM out of it, and eventually
render it.
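
To make the point concrete, a minimal sketch of copying a read-only parse
tree into nodes we own, so that new nodes can be added afterwards.
struct parseNode is a hypothetical stand-in for what tidy hands back, not
its real API, and the dom* names are likewise invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Read-only tree as a parser might hand it back (hypothetical layout). */
struct parseNode {
	const char *name;
	const struct parseNode *firstChild;
	const struct parseNode *nextSibling;
};

/* Our own mutable node; the field names are illustrative. */
struct domNode {
	char *name;
	struct domNode *parent, *firstChild, *lastChild, *nextSibling;
};

struct domNode *domNewNode(const char *name)
{
	struct domNode *n = calloc(1, sizeof(*n));
	if (!n)
		abort();
	n->name = malloc(strlen(name) + 1);
	if (!n->name)
		abort();
	strcpy(n->name, name);
	return n;
}

/* The operation the parse tree lacks: adding a node after parsing. */
void domAppendChild(struct domNode *parent, struct domNode *child)
{
	child->parent = parent;
	child->nextSibling = NULL;
	if (parent->lastChild)
		parent->lastChild->nextSibling = child;
	else
		parent->firstChild = child;
	parent->lastChild = child;
}

/* Deep-copy the parse tree into nodes we own and can change. */
struct domNode *domFromParseTree(const struct parseNode *p)
{
	struct domNode *n = domNewNode(p->name);
	const struct parseNode *c;

	for (c = p->firstChild; c; c = c->nextSibling)
		domAppendChild(n, domFromParseTree(c));
	return n;
}

/* Free a node and everything below it. */
void domFree(struct domNode *n)
{
	struct domNode *c = n->firstChild, *next;

	for (; c; c = next) {
		next = c->nextSibling;
		domFree(c);
	}
	free(n->name);
	free(n);
}
```

After the copy, the parser's document can be released, and js-driven
additions only ever touch the nodes built by domNewNode().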

-- Chris

* [Edbrowse-dev]  tidy5
  2015-08-14  0:54           ` Kevin Carhart
@ 2015-08-14  3:45             ` Karl Dahlke
  2015-08-14 20:17               ` Chris Brannon
  0 siblings, 1 reply; 13+ messages in thread
From: Karl Dahlke @ 2015-08-14  3:45 UTC
  To: Edbrowse-dev

> Am I on the right track in thinking, well tidy has a central "switch-case"
> section over various tag types, and we have a central "switch-case" in

Forgive me I haven't looked at the code at all,
but I would guess there's a tidy5 encodeTags() that takes the
html text and makes the tree.
We would just call that instead of our encodeTags(),
thus slicing out all that home grown html parsing code that I wrote,
I don't want to be in that business any more.
We would then follow up with software to traverse their node tree
and build our node tree.
The new tree will have more nodes than ours does today,
a node for every tag, not just some tags,
a node for each block of text, a node for each html comment.
So a lot more nodes, but perhaps somewhat backward compatible
with what we have today, at least for the first pass,
at least to get us going.
Then we improve and improve and improve.
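
A small sketch of what such a tree could look like, with a kind on every
node so that code which only cares about tags can skip the rest. The names
are hypothetical and the layout is an assumption, not the tidy or edbrowse
structures.

```c
#include <stddef.h>

/* Every tag, block of text, and comment gets its own node. */
enum nodeKind { KIND_TAG, KIND_TEXT, KIND_COMMENT };

struct treeNode {
	enum nodeKind kind;
	const char *content;	/* tag name, text, or comment body */
	const struct treeNode *firstChild;
	const struct treeNode *nextSibling;
};

/* Count nodes of one kind in n, its following siblings, and all
 * of their descendants. */
size_t countKind(const struct treeNode *n, enum nodeKind kind)
{
	size_t total = 0;

	for (; n; n = n->nextSibling) {
		if (n->kind == kind)
			total++;
		total += countKind(n->firstChild, kind);
	}
	return total;
}
```

Code written against today's tag-only view could walk the same tree and
simply ignore anything whose kind is not KIND_TAG.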

Karl Dahlke

* [Edbrowse-dev]  tidy5
  2015-08-13 20:07         ` Adam Thompson
  2015-08-14  0:54           ` Kevin Carhart
@ 2015-08-14  3:37           ` Karl Dahlke
  2015-08-16 18:10             ` Adam Thompson
  1 sibling, 1 reply; 13+ messages in thread
From: Karl Dahlke @ 2015-08-14  3:37 UTC
  To: Edbrowse-dev

> In terms of an architecture I'm thinking of aiming to have the DOM as an
> abstraction which can be used by both the rendering code and the js. Thus:
> html is parsed into a node tree which is converted to our DOM objects
> These objects are exposed to js via wrapper objects in the js world such that
> any changes js makes are automatically passed through to the DOM
> The renderer renders the DOM automatically on page load,
> with support for re-rendering on a user command (with some sort of
> notifications for js induced changes)
> Form fields are altered in the DOM, which may or may not trigger a re-rendering

Yes this can cause a rerender, example onchange or onselect code,
as exercised by the regression tests in jsrt.

> Any re-rendering would be partial, i.e.
> only the changed segments of the DOM are re-rendered

This sounds like a diff between the old dom and the new,
but it's easier to just rerender and then diff the old buffer against the new,
and then report the lines that have changed, which is how edbrowse works today.
Realize that a small change in dom could change the buffer
on down the page, even into dom elements that have not changed.
So I think you always want to just call render() and then
diff the two buffers.
Maybe even a diff library we can use, if not /bin/diff itself.
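
A minimal sketch of the render-then-diff idea, assuming both buffers are
already split into lines. It only trims the common prefix and suffix to find
one changed region, which is a simplification and not a full diff algorithm;
the names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* The changed region, as half-open line ranges (0-based). */
struct lineDelta {
	size_t start;	/* first differing line, same index in both buffers */
	size_t oldEnd;	/* one past the last changed line in the old buffer */
	size_t newEnd;	/* one past the last changed line in the new buffer */
};

/* Compare two rendered buffers given as arrays of lines. */
struct lineDelta diffLines(const char *const *oldv, size_t oldn,
			   const char *const *newv, size_t newn)
{
	struct lineDelta d;
	size_t lo = 0, oe = oldn, ne = newn;

	/* Skip lines that match at the top. */
	while (lo < oe && lo < ne && !strcmp(oldv[lo], newv[lo]))
		++lo;
	/* Skip lines that match at the bottom. */
	while (oe > lo && ne > lo && !strcmp(oldv[oe - 1], newv[ne - 1])) {
		--oe;
		--ne;
	}
	d.start = lo;
	d.oldEnd = oe;
	d.newEnd = ne;
	return d;
}
```

If the old buffer is "Title", "Price: 10", "Footer" and the new one is
"Title", "Price: 12", "In stock", "Footer", the result says that old line 1
was replaced by new lines 1 and 2, even though only one DOM element changed.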

These are minor points; and you are definitely on track.
This is where we need to be.

Karl Dahlke

* Re: [Edbrowse-dev] tidy5
  2015-08-13 20:07         ` Adam Thompson
@ 2015-08-14  0:54           ` Kevin Carhart
  2015-08-14  3:45             ` Karl Dahlke
  2015-08-14  3:37           ` Karl Dahlke
  1 sibling, 1 reply; 13+ messages in thread
From: Kevin Carhart @ 2015-08-14  0:54 UTC
  To: Edbrowse-dev

> I know I said I'd look into a new js engine,
> but I really think we need to get the html and DOM stuff sorted before that.
>
> In terms of an architecture I'm thinking of aiming to have the DOM as an
> abstraction which can be used by both the rendering code and the js. Thus:

Thanks Adam!  Exciting.

OK, so far I compiled the tidy code and ran their sample program with 
libtidy calls.  The possibility of interoperability is very cool.
Am I on the right track in thinking, well tidy has a central "switch-case" 
section over various tag types, and we have a central "switch-case" in 
encodeTags, so this would be the place where you bring in tidy calls?

For methodology of how to proceed, I am happy with any & all methods.  I 
don't write C professionally but I know some parts of the edbrowse source 
pretty well at this point.  At least I'm now on a first-name basis with 
encodeTags.  (javaParseExecute and I used to babysit each others' 
children.)

> As for Edbrowse being used in cyber security,
> this isn't a good idea since most systems which analyse web pages for threats
> use highly advanced techniques to scan for malware which don't involve
> executing the javascript directly, and any such execution would probably
> require analysis on the js engine level to detect suspicious behaviours.

Ahhhh, I see.  That makes sense.

Kevin

--------
Kevin Carhart * 415 225 5306 * The Ten Ninety Nihilists

* Re: [Edbrowse-dev] tidy5
  2015-08-13  4:36       ` [Edbrowse-dev] tidy5 Kevin Carhart
@ 2015-08-13 20:07         ` Adam Thompson
  2015-08-14  0:54           ` Kevin Carhart
  2015-08-14  3:37           ` Karl Dahlke
  0 siblings, 2 replies; 13+ messages in thread
From: Adam Thompson @ 2015-08-13 20:07 UTC
  To: Kevin Carhart; +Cc: Edbrowse-dev

On Wed, Aug 12, 2015 at 09:36:51PM -0700, Kevin Carhart wrote:
> 
> 
> Hi all
> 
> This sounds great-  thanks for the suggestion.  I hope the software works
> for our purposes.
> 
> >forward the mail I have.  Or we can collaborate on this together, if
> 
> Yes, whatever works, thanks Chris!  Please let me know your findings so far.

I'm also happy to help if there's something I can do.
I know I said I'd look into a new js engine,
but I really think we need to get the html and DOM stuff sorted before that.

In terms of an architecture I'm thinking of aiming to have the DOM as an
abstraction which can be used by both the rendering code and the js. Thus:
html is parsed into a node tree which is converted to our DOM objects
These objects are exposed to js via wrapper objects in the js world such that
any changes js makes are automatically passed through to the DOM
The renderer renders the DOM automatically on page load,
with support for re-rendering on a user command (with some sort of
notifications for js induced changes)
Form fields are altered in the DOM, which may or may not trigger a re-rendering
Any re-rendering would be partial, i.e.
only the changed segments of the DOM are re-rendered
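
A minimal sketch of the notification idea, assuming a single dirty flag on
the document. All names here are hypothetical, and it re-renders the whole
document rather than only the changed segments.

```c
#include <stdio.h>
#include <string.h>

#define VALUE_MAX 64

/* Hypothetical DOM objects; none of these names come from edbrowse. */
struct domDocument {
	int dirty;		/* set whenever something changes the DOM */
	int renderCount;	/* how many times we have rendered */
};

struct domInput {
	struct domDocument *doc;
	char value[VALUE_MAX];
};

/* Every writer, whether a js wrapper or the user editing a form field,
 * goes through this setter, so no change is missed. */
void domSetValue(struct domInput *in, const char *v)
{
	if (!strcmp(in->value, v))
		return;		/* no real change, no re-render needed */
	snprintf(in->value, sizeof(in->value), "%s", v);
	in->doc->dirty = 1;
}

/* Re-render only when something changed. Returns 1 if it rendered. */
int renderIfDirty(struct domDocument *doc)
{
	if (!doc->dirty)
		return 0;
	/* A real renderer would rebuild the text buffer here. */
	doc->renderCount++;
	doc->dirty = 0;
	return 1;
}
```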

This is going to be a *lot* of work and I don't expect it to all be done at
once, but that's certainly where I think we should be headed. Any thoughts?

As for Edbrowse being used in cyber security,
this isn't a good idea since most systems which analyse web pages for threats
use highly advanced techniques to scan for malware which don't involve
executing the javascript directly, and any such execution would probably
require analysis on the js engine level to detect suspicious behaviours.
None of these tasks would be possible with Edbrowse,
and altering it to make such things possible would mean we weren't writing a
web browser any more.
That's before we get into the security of the browser itself,
which probably could do with some careful analysis at some stage anyway,
particularly as we plan on making this a larger project.

However, I can see a definite place for Edbrowse for page automation etc once
we are more standards compliant.

Cheers,
Adam.

* [Edbrowse-dev] tidy5
  2015-08-13  1:08     ` Chris Brannon
@ 2015-08-13  4:36       ` Kevin Carhart
  2015-08-13 20:07         ` Adam Thompson
  0 siblings, 1 reply; 13+ messages in thread
From: Kevin Carhart @ 2015-08-13  4:36 UTC
  To: Chris Brannon; +Cc: Edbrowse-dev




Hi all

This sounds great-  thanks for the suggestion.  I hope the software works 
for our purposes.

> forward the mail I have.  Or we can collaborate on this together, if

Yes, whatever works, thanks Chris!  Please let me know your findings so 
far.

Kevin

* Re: [Edbrowse-dev] tidy5
  2015-02-02 19:58 Karl Dahlke
@ 2015-02-03 21:18 ` Adam Thompson
  0 siblings, 0 replies; 13+ messages in thread
From: Adam Thompson @ 2015-02-03 21:18 UTC
  To: Karl Dahlke; +Cc: Edbrowse-dev

On Mon, Feb 02, 2015 at 02:58:35PM -0500, Karl Dahlke wrote:
> I'm trying to get my head around this, and one problem is I don't
> know hardly anything about git.
> If we wanted to use and follow the tidy5 package, how would we do it?
> Could we, or should we,
> git clone the package under src, so there is then an src/tidy
> directory, that would build via make,
> that we could fold into our product and build upon?
> Could we git pull from them to keep up to date with them,
> and continue to do our work on top of it?
> Or is it impossible to put one git structure beneath another?

It's probably possible, but a *really* bad idea imo for a whole number of reasons.
> If that doesn't work then what are the mechanics of following and incorporating
> another project in ours?

Well.... at the risk of stating the obvious,
if tidy5 builds as a library (I hope it does)
then we build and install their code same as any other library.
We document the requirement for whatever version we need and go from there.

No need for nesting git repos or anything like that.
If we *really need* to fork (why?) then we copy the changes between their code
and ours and take on all the pain associated with this,
but let's not unless anyone has a good reason why (like my assumption about the
librified nature of the code is incorrect).
Even then, I'd much rather take on the work of making a libtidy5 and
contributing it to the tidy5 project and then proceeding with the library
integration.

Cheers,
Adam.

* [Edbrowse-dev] tidy5
@ 2015-02-02 19:58 Karl Dahlke
  2015-02-03 21:18 ` Adam Thompson
  0 siblings, 1 reply; 13+ messages in thread
From: Karl Dahlke @ 2015-02-02 19:58 UTC
  To: Edbrowse-dev

I'm trying to get my head around this, and one problem is I don't
know hardly anything about git.
If we wanted to use and follow the tidy5 package, how would we do it?
Could we, or should we,
git clone the package under src, so there is then an src/tidy
directory, that would build via make,
that we could fold into our product and build upon?
Could we git pull from them to keep up to date with them,
and continue to do our work on top of it?
Or is it impossible to put one git structure beneath another?
If that doesn't work then what are the mechanics of following and incorporating
another project in ours?

Karl Dahlke

Thread overview: 13+ messages
2015-02-03 22:15 [Edbrowse-dev] tidy5 Karl Dahlke
2015-02-03 23:41 ` Adam Thompson
  -- strict thread matches above, loose matches on Subject: below --
2015-08-10  8:56 [Edbrowse-dev] startwindow / class NodeList Kevin Carhart
2015-08-12 19:55 ` Kevin Carhart
2015-08-12 20:56   ` Karl Dahlke
2015-08-13  1:08     ` Chris Brannon
2015-08-13  4:36       ` [Edbrowse-dev] tidy5 Kevin Carhart
2015-08-13 20:07         ` Adam Thompson
2015-08-14  0:54           ` Kevin Carhart
2015-08-14  3:45             ` Karl Dahlke
2015-08-14 20:17               ` Chris Brannon
2015-08-16  5:54                 ` Kevin Carhart
2015-08-16 10:38                   ` Karl Dahlke
2015-08-14  3:37           ` Karl Dahlke
2015-08-16 18:10             ` Adam Thompson
2015-02-02 19:58 Karl Dahlke
2015-02-03 21:18 ` Adam Thompson
