[Edbrowse-dev] tag list

edbrowse-dev - development list for edbrowse

* [Edbrowse-dev]   tag list
@ 2014-03-02 14:15 Karl Dahlke
  2014-03-02 19:28 ` Adam Thompson
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Dahlke @ 2014-03-02 14:15 UTC (permalink / raw)
  To: Edbrowse-dev

> At some stage I really need to familiarise myself with the html code.

Yes, but please ask if unsure. A question can sometimes replace
days of reading code, especially my code, which isn't well commented.

> but why are we storing a list of pointers?

Precisely so the structures don't move.
Each tag can point, by a pointer, to its parent or children
and those pointers will remain valid,
even if c++ vector does a realloc on the list of pointers,
which it will do as new tags are created.
Chris and I went through this - I even started writing
vector<struct htmlTag> code, but then I could see the structures
were moving, and the parent and child links became invalid.
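
A minimal sketch of the point above, assuming a stripped-down htmlTag (the fields and the function name are hypothetical, not the real struct in eb.h). The vector holds only pointers, so when it reallocates only the pointer list moves. The tags stay where they were allocated and the parent links stay valid.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, stripped-down stand-in for edbrowse's struct htmlTag.
struct htmlTag {
    int seqno;
    htmlTag *parent;
};

// Returns true if parent links are still valid after the vector of
// pointers has grown (and therefore reallocated) many times.
bool parentLinkSurvivesGrowth()
{
    std::vector<htmlTag *> tags;

    tags.push_back(new htmlTag{0, nullptr});
    htmlTag *body = tags[0];              // address of the first tag
    tags.push_back(new htmlTag{1, body});

    // Force repeated reallocation of the pointer list itself.
    for (int i = 2; i < 10000; ++i)
        tags.push_back(new htmlTag{i, body});

    // The tags did not move, so every stored address is unchanged.
    bool ok = tags[0] == body
           && tags[1]->parent == body
           && tags[9999]->parent == body
           && body->seqno == 0;

    for (htmlTag *t : tags)
        delete t;
    return ok;
}
```

With vector<struct htmlTag> the same growth would copy or move the structs themselves, which is exactly what made the links go stale.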

> We also need to store a list of children in each tag, i.e. in the code:

Yes, and javascript has set for us a partial standard;
should we follow it?
A form contains input elements, right?
Well js standard says form contains a member named elements, which is an array
of all the input tags, in order.
Now suppose one of those inputs is a dropdown list, a select.
Dom standard says that input tag contains a member named options.
It is an array of objects, each object an option in the select list.
So children seem to be held in an array that is owned by the parent.

> Example of <body> <div>
>
> The body would have a list of two pointers to the two div tags,

js already has an array of div tags.
It is called divs, I think.
I know it has an array of link tags called links, an array of image tags called images, and so on.
What I don't know is whether this, in the standard, is a global array of all images on the page,
or a local array of images in the current structure,
like elements in a form or options in a select.
We would want the latter.
More research is needed.
domlink() in jsdom.cpp is supposed to do all of this.
And it looks like it treats elements and options as a local list,
in the current structure, but images and links and heads and metas
and anchors as a global list under document.
I don't know if this is right.
If this is the standard perhaps we can do both,
document/images[] for all image tags on the page,
and local/images[] for the array of images that are inside the current paragraph
or whatever.

Another aspect of this js standard is it is type specific.
Here is the list of elements in the form, here is the list
of images, here is the list of anchors, etc.
Maybe that's ok, but maybe we also need an array of all tags in order
within each construct.
IDK

> No need to do this rewrite at the moment,

Absolutely agree.
I think we all agree here.
Let's get 3.5.1 stable and working with distributed libraries.
We're just talking, and thinking, and planning for the future,
and I think it is helpful.

Karl Dahlke

* Re: [Edbrowse-dev] tag list
  2014-03-02 14:15 [Edbrowse-dev] tag list Karl Dahlke
@ 2014-03-02 19:28 ` Adam Thompson
  0 siblings, 0 replies; 6+ messages in thread
From: Adam Thompson @ 2014-03-02 19:28 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

On Sun, Mar 02, 2014 at 09:15:44AM -0500, Karl Dahlke wrote:
> > At some stage I really need to familiarise myself with the html code.
> 
> Yes, but please ask if unsure. A question can sometimes replace
> days of reading code, especially my code, which isn't well commented.

Tbh it's far from the worst I've seen,
though some more comments on the global and file-scoped variables would be useful.

> > but why are we storing a list of pointers?
> 
> Precisely so the structures don't move.
> Each tag can point, by a pointer, to its parent or children
> and those pointers will remain valid,
> even if c++ vector does a realloc on the list of pointers,
> which it will do as new tags are created.
> Chris and I went through this - I even started writing
> vector<struct htmlTag> code, but then I could see the structures
> were moving, and the parent and child links became invalid.

From a purely optimisation perspective, I wonder if storing vector<htmlTag> and
creating the tree from indices would be better as then the tags would be stored contiguously rather
than all over the heap, thus reducing heap fragmentation.
Of course this increases the possibility of failures due to insufficient
contiguous memory being available. At the end of the day it's not a big problem.

> > We also need to store a list of children in each tag, i.e. in the code:
> 
> Yes, and javascript has set for us a partial standard;
> should we follow it?
[snip]
I'd rather not, I'd just have a list of children in each htmlTag struct,
then write images, divs etc as wrappers that access this list.
Otherwise we get into sub-classing htmlTag which is something I'd like to avoid if possible.

> > Example of <body> <div>
> >
> > The body would have a list of two pointers to the two div tags,
> 
> js already has an array of div tags.
> It is called divs, I think.
> I know it has an array of link tags called links, an array of image tags called images, and so on.
> What I don't know is whether this, in the standard, is a global array of all images on the page,
> or a local array of images in the current structure,
> like elements in a form or options in a select.
> We would want the latter.
> More research is needed.

I'm not sure, however for rendering we need the generic approach above.
The other arrays can be generated from this as necessary.
With some care this shouldn't be too bad performance-wise I think.
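
As a rough sketch of the generic approach, assuming a hypothetical htmlTag with a single ordered child list: the typed arrays become small wrappers that walk that list. The names collectByName and images are illustrative only and do not exist in edbrowse.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical generic node: one ordered child list, no per-type subclasses.
struct htmlTag {
    std::string name;
    htmlTag *parent = nullptr;
    std::vector<htmlTag *> children;
};

// Depth-first walk in document order, collecting every descendant of
// `root` whose tag name matches `name`.
void collectByName(const htmlTag *root, const std::string &name,
                   std::vector<const htmlTag *> &out)
{
    for (const htmlTag *child : root->children) {
        if (child->name == name)
            out.push_back(child);
        collectByName(child, name, out);
    }
}

// Wrapper in the spirit of images[]: generated on demand from the tree.
std::vector<const htmlTag *> images(const htmlTag *root)
{
    std::vector<const htmlTag *> out;
    collectByName(root, "img", out);
    return out;
}
```

Called on the document root this gives the page-wide list; called on a div or paragraph it gives the local list, so both views come from the same data.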

> domlink() in jsdom.cpp is supposed to do all of this.
> And it looks like it treats elements and options as a local list,
> in the current structure, but images and links and heads and metas
> and anchors as a global list under document.
> I don't know if this is right.
> If this is the standard perhaps we can do both,
> document/images[] for all image tags on the page,
> and local/images[] for the array of images that are inside the current paragraph
> or whatever.

I'm not sure, I need to read the DOM spec at some stage to work out exactly what's needed.
Remember that there's a core DOM and then the html extensions to this.
I think we need to get our core DOM working,
then look at the html extensions (images[] etc) on top of this.
At the moment, we've got a partial html DOM but not really the core underneath
it so appendChild (core DOM I think) and friends are awkward to implement.
If we fix the core DOM, then the html side of it,
this will hopefully make rendering better,
add support for appendChild and tag creation by JS,
as well as probably removing special-case code.

> Another aspect of this js standard is it is type specific.
> Here is the list of elements in the form, here is the list
> of images, here is the list of anchors, etc.
> Maybe that's ok, but maybe we also need an array of all tags in order
> within each construct.

We definitely need an array of all children.
Remember this isn't so much a js standard;
rather, js provides an implementation of an interface to the DOM defined by the W3C.
I think it's probably better to look at it this way
(i.e. implement our DOM, then the js interface).
It also means that when Mozilla totally change SpiderMonkey again (which they
say they may do at any time) we have a working DOM which just needs new wrapping.

To take this to its logical extreme,
I'd like to separate the html parsing and DOM creation from js,
providing an api to this DOM which is capable enough to support all the stuff JS needs.
We could then also write the rendering code using this DOM api as well.
I wonder if there's an html parsing library we can use for some of this.

> > No need to do this rewrite at the moment,
> 
> Absolutely agree.
> I think we all agree here.
> Let's get 3.5.1 stable and working with distributed libraries.
> We're just talking, and thinking, and planning for the future,
> and I think it is helpful.

Yeah. I think future planning's a good idea,
particularly when discussing the kind of changes above.
One discussion we should probably have at some stage in this area is what
systems, language standards etc we want to support.

Personally, I'd kind of like to keep most of edbrowse in C,
with interfaces to whatever we need in whatever language (c++ for SpiderMonkey js for
example), however I know people seem to want to use c++ for various things.

Cheers,
Adam.

* Re: [Edbrowse-dev] tag list
  2014-03-01 19:24 Karl Dahlke
@ 2014-03-02 13:47 ` Adam Thompson
  0 siblings, 0 replies; 6+ messages in thread
From: Adam Thompson @ 2014-03-02 13:47 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

On Sat, Mar 01, 2014 at 02:24:32PM -0500, Karl Dahlke wrote:
> Yeah, this is something I got confused about too,
> until Chris set me straight.
> Duh - I wrote it - and then I got confused about it.
> I can be dumb as a box of rocks sometimes.

I remembered something about this after I replied but couldn't remember the 
details.
At some stage I really need to familiarise myself with the html code.

> The linked list or array or vector or whatever holds pointers
> to struct htmlTag, not the struct itself.
> So structs can go ahead and point to each other as parents and children,
> because the structs don't move.
> The growing vector simply reallocates the list of pointers to those structures.

Stupid question, and apologies if this has already been asked,
but why are we storing a list of pointers?
> 
> I already do this, don't I?
> t->controller is the form that owns the input tag,
> and for an option t->controller is the select that owns the option.
> Just rename controller parent and you're halfway there.

We also need to store a list of children in each tag, i.e. in the code:
<body>
<div>
<p>whatever</p>
<p>Some more text</p>
</div>
<div>
<p>Footer text</p>
</div>
</body>

The body would have a list of two pointers to the two div tags,
the first div tag would need to hold a list of two pointers to the two p tags
under it, whilst the second div tag only has one pointer to the p tag under it.
As you say though, each tag only needs a single parent link,
which simplifies things.
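
To make the layout concrete, here is a small sketch of that tree, assuming a hypothetical htmlTag with one parent pointer and a vector of child pointers. appendChild here is an illustrative helper, not existing edbrowse code. It keeps both sides of the link consistent and detaches the child from any previous parent first.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical tag: a single parent link plus an ordered list of children.
struct htmlTag {
    std::string name;
    htmlTag *parent = nullptr;
    std::vector<htmlTag *> children;
};

// Attach `child` as the last child of `parent`.
// If the child already had a parent, remove it from that child list first
// so it never appears under two parents at once.
void appendChild(htmlTag &parent, htmlTag &child)
{
    if (child.parent) {
        std::vector<htmlTag *> &old = child.parent->children;
        old.erase(std::remove(old.begin(), old.end(), &child), old.end());
    }
    child.parent = &parent;
    parent.children.push_back(&child);
}
```

Building the example is then appendChild(body, div1), appendChild(body, div2), appendChild(div1, p1) and so on, which leaves body with two children, the first div with two and the second div with one.
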
> 
> So with this in mind 
> 
> static list < struct htmlTag *>htmlStack;
> 
> becomes
> 
> static vector < struct htmlTag *>htmlStack;
> 
> Then sure it's all normal after that, and I'd just love to
> set cw->tags to htmlStack, but cw->tags
> is one of those things that is in C, not C++.
> In fact it's in eb.h, thus in every C file,
> so we'd have to use void * or some such, or convert the whole project to C++.

Or according to [1] set it to:
cw->tags = &htmlStack.front();
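
One caveat, shown in the sketch below: the raw pointer is only good until the vector next reallocates, so it has to be refreshed whenever a tag is added. The ebWindow struct and pushTag helper are hypothetical stand-ins for the C-side window struct and are not taken from eb.h.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct htmlTag { int seqno; };

// Hypothetical C-visible view of the tag list (plain pointer plus count).
struct ebWindow {
    struct htmlTag **tags;
    std::size_t numTags;
};

static std::vector<htmlTag *> htmlStack;

// Append a tag, then refresh the C view, because push_back may have
// reallocated the vector's storage and invalidated the old pointer.
void pushTag(ebWindow *cw, htmlTag *t)
{
    htmlStack.push_back(t);
    cw->tags = htmlStack.data();   // same address as &htmlStack.front()
    cw->numTags = htmlStack.size();
}
```

C code can then index cw->tags[i] as before, provided nothing holds on to an old value of cw->tags across an insertion.
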
> But that's the idea, and we can certainly move forward there.

No need to do this rewrite at the moment,
and I think we need to get the js stuff sorted before we start contemplating
any possible benefits (I'm still not entirely convinced honestly) of doing this.

> Then there is no trouble adding new tags as we need to,
> as js creates new thingees for us.

Yeah, as long as we ensure we append the correct child list and set the parent
pointer correctly.

Cheers,
Adam.
[1] http://stackoverflow.com/questions/6485496/how-to-get-stdvector-pointer-to-the-raw-data

* [Edbrowse-dev]  tag list
@ 2014-03-01 19:24 Karl Dahlke
  2014-03-02 13:47 ` Adam Thompson
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Dahlke @ 2014-03-01 19:24 UTC (permalink / raw)
  To: Edbrowse-dev

Yeah, this is something I got confused about too,
until Chris set me straight.
Duh - I wrote it - and then I got confused about it.
I can be dumb as a box of rocks sometimes.
The linked list or array or vector or whatever holds pointers
to struct htmlTag, not the struct itself.
So structs can go ahead and point to each other as parents and children,
because the structs don't move.
The growing vector simply reallocates the list of pointers to those structures.

I already do this, don't I?
t->controller is the form that owns the input tag,
and for an option t->controller is the select that owns the option.
Just rename controller parent and you're halfway there.

So with this in mind 

static list < struct htmlTag *>htmlStack;

becomes

static vector < struct htmlTag *>htmlStack;

Then sure it's all normal after that, and I'd just love to
set cw->tags to htmlStack, but cw->tags
is one of those things that is in C, not C++.
In fact it's in eb.h, thus in every C file,
so we'd have to use void * or some such, or convert the whole project to C++.
But that's the idea, and we can certainly move forward there.
Then there is no trouble adding new tags as we need to,
as js creates new thingees for us.

Karl Dahlke

* Re: [Edbrowse-dev] tag list
  2014-03-01 14:00 Karl Dahlke
@ 2014-03-01 19:01 ` Adam Thompson
  0 siblings, 0 replies; 6+ messages in thread
From: Adam Thompson @ 2014-03-01 19:01 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev

On Sat, Mar 01, 2014 at 09:00:07AM -0500, Karl Dahlke wrote:
> Well the tags are built in a growing linked list as the page is parsed,
> but at the end of parse I plop them all into an array,
> so it is easy to grab tag 243 (example), just index the array,
> because the text in your buffer actually has encoded
> 
> tag 243 {go to this link}
> 
> You don't see tag code 243 but it's there,
> and it is accessed when you go to that link.
> 
> If we want tags to continue to grow after page parse,
> new tags because of javascript creating html structures etc,
> then we want something with the dynamic power of a linked list or a tree
> but also an easy way to index like an array.
> Not sure if C++ list has this much power, or vector,
> but in C there wasn't anything, which is why I made the compromise I did.
> Anyways we might be able to forget the array and stay with the c++ list,
> and use its power to move forward.
> Create tags whenever javascript tells us to,
> and they can have parent links inside them to define the tree structure
> of forms and tables and so on.

Yeah, I know vectors will do this for us,
with the allocation policy being implementation dependent,
but usually not too bad I think. In C I'd implement this manually using realloc
but as we're using c++ for the js stuff we can probably put the tags in a vector.
The index into the tag vector for creating the tree structure sounds kind of
nice and prevents a lot of pointers being passed around.

What this means in terms of changes I think is that we stop using the list
class (is this ok to do?) and change tagArray from an array to a vector<htmlTag>
(not sure of capitalisation off the top of my head),
then use tagArray.push_back to add tags.
The parent links in the tags would be size_t variables,
and each tag would also need to maintain a vector of indices to its children
(of type vector<size_t> I think). This is a little messy,
but I'm not sure of a better implementation right now as efficient tree
traversal requires an easy way to work out what a tag's children are.
This implementation would allow that whilst keeping the efficient indexing
required for normal browsing.

The reason I'm using size_t for the parent "links"
is that they'll be indices into the tag vector which means that,
if in a realloc the whole lot gets moved,
pointers don't suddenly become invalid etc.
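
A minimal sketch of this index-based layout, assuming a cut-down htmlTag. NO_PARENT and addTag are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sentinel meaning "this tag has no parent" (hypothetical convention).
const std::size_t NO_PARENT = static_cast<std::size_t>(-1);

// Tags are stored by value; all links are indices into tagArray.
struct htmlTag {
    std::string name;
    std::size_t parent;
    std::vector<std::size_t> children;
};

std::vector<htmlTag> tagArray;

// Append a tag under `parent` and return its index.
// tagArray[parent] is looked up only after push_back, so no reference
// into the vector is held across a possible reallocation.
std::size_t addTag(const std::string &name, std::size_t parent)
{
    std::size_t self = tagArray.size();
    tagArray.push_back(htmlTag{name, parent, {}});
    if (parent != NO_PARENT)
        tagArray[parent].children.push_back(self);
    return self;
}
```

Tag 243 is still just tagArray[243], and a tag's children are reached with tagArray[tagArray[i].children[k]], whatever the vector has done with its storage in the meantime.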

Cheers,
Adam.

* [Edbrowse-dev] tag list
@ 2014-03-01 14:00 Karl Dahlke
  2014-03-01 19:01 ` Adam Thompson
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Dahlke @ 2014-03-01 14:00 UTC (permalink / raw)
  To: Edbrowse-dev

Well the tags are built in a growing linked list as the page is parsed,
but at the end of parse I plop them all into an array,
so it is easy to grab tag 243 (example), just index the array,
because the text in your buffer actually has encoded

tag 243 {go to this link}

You don't see tag code 243 but it's there,
and it is accessed when you go to that link.

If we want tags to continue to grow after page parse,
new tags because of javascript creating html structures etc,
then we want something with the dynamic power of a linked list or a tree
but also an easy way to index like an array.
Not sure if C++ list has this much power, or vector,
but in C there wasn't anything, which is why I made the compromise I did.
Anyways we might be able to forget the array and stay with the c++ list,
and use its power to move forward.
Create tags whenever javascript tells us to,
and they can have parent links inside them to define the tree structure
of forms and tables and so on.
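
For what it's worth, here is a rough sketch of how a c++ vector of pointers could cover both needs, growth after the parse and lookup by tag number. The htmlTag fields and the newTag / tagByNumber helpers are hypothetical, made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical cut-down tag: its number and, for links, the target.
struct htmlTag {
    std::size_t seqno;
    std::string href;
};

static std::vector<htmlTag *> tags;

// Create a tag at any time (during the parse or later, on behalf of
// javascript). Its number is simply its position in the vector.
// Tags are deliberately never freed in this sketch.
htmlTag *newTag(const std::string &href)
{
    htmlTag *t = new htmlTag{tags.size(), href};
    tags.push_back(t);
    return t;
}

// Look up "tag 243" as encoded in the buffer text; null if out of range.
htmlTag *tagByNumber(std::size_t n)
{
    return n < tags.size() ? tags[n] : nullptr;
}
```

A c++ list would grow just as easily but has no constant-time indexing, which is why the vector looks like the better fit for the tag-number lookup.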

Karl Dahlke
