Gnus development mailing list
* Built-in HTML parsing and rendering library
@ 2010-09-05 22:58 Lars Magne Ingebrigtsen
  2010-09-06  4:27 ` Daniel Pittman
  2010-09-06  7:32 ` Steinar Bang
  0 siblings, 2 replies; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-05 22:58 UTC (permalink / raw)
  To: ding

For parsing xml, there's libxml2, but are there any C/C++ libraries for
parsing HTML in use out there that could be compiled into Emacs, by any
chance?  Then Gnus wouldn't have to rely on the external w3m library...

I mean, something that parses real-world HTML as well as w3m does, and
generates a parse tree based on that.  I guess if it returned a
convenient parse tree back to elisp, the HTML could be rendered fast
enough from elisp.

I've tried googling, but I don't really know what to google for...

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Built-in HTML parsing and rendering library
  2010-09-05 22:58 Built-in HTML parsing and rendering library Lars Magne Ingebrigtsen
@ 2010-09-06  4:27 ` Daniel Pittman
  2010-09-06  7:53   ` Steinar Bang
  2010-09-06 11:33   ` Lars Magne Ingebrigtsen
  2010-09-06  7:32 ` Steinar Bang
  1 sibling, 2 replies; 12+ messages in thread
From: Daniel Pittman @ 2010-09-06  4:27 UTC (permalink / raw)
  To: ding

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> For parsing xml, there's libxml2, but are there any C/C++ libraries for
> parsing HTML in use out there that could be compiled into Emacs, by any
> chance?  Then Gnus wouldn't have to rely on the external w3m library...
>
> I mean, something that parses real-world HTML as well as w3m does, and
> generates a parse tree based on that.  I guess if it returned a
> convenient parse tree back to elisp, the HTML could be rendered fast
> enough from elisp.

http://www.xmlsoft.org/html/libxml-HTMLparser.html

> I've tried googling, but I don't really know what to google for...

Me either, but I happened across that reference and boggled at it not that
long ago, so it was semi-fresh in my mind.

        Daniel
-- 
✣ Daniel Pittman            ✉ daniel@rimspace.net            ☎ +61 401 155 707
               ♽ made with 100 percent post-consumer electrons





* Re: Built-in HTML parsing and rendering library
  2010-09-05 22:58 Built-in HTML parsing and rendering library Lars Magne Ingebrigtsen
  2010-09-06  4:27 ` Daniel Pittman
@ 2010-09-06  7:32 ` Steinar Bang
  2010-09-06  8:29   ` David Engster
  1 sibling, 1 reply; 12+ messages in thread
From: Steinar Bang @ 2010-09-06  7:32 UTC (permalink / raw)
  To: ding

>>>>> Lars Magne Ingebrigtsen <larsi@gnus.org>:

> I mean, something that parses real-world HTML as well as w3m does, and
> generates a parse tree based on that.

Just HTML?  Or the combination of HTML+CSS+JavaScript that is often used
today to create a web page?

I guess if we're talking about emails in HTML, we're talking about
relatively simple HTML with a little CSS thrown in?

What does w3m do?  Consider the CSS and use it, before delivering
something for emacs/gnus to use?





* Re: Built-in HTML parsing and rendering library
  2010-09-06  4:27 ` Daniel Pittman
@ 2010-09-06  7:53   ` Steinar Bang
  2010-09-06 11:33   ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 12+ messages in thread
From: Steinar Bang @ 2010-09-06  7:53 UTC (permalink / raw)
  To: ding

>>>>> Daniel Pittman <daniel@rimspace.net>:

> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
>> For parsing xml, there's libxml2, but are there any C/C++ libraries for
>> parsing HTML in use out there that could be compiled into Emacs, by any
>> chance?  Then Gnus wouldn't have to rely on the external w3m library...
>> 
>> I mean, something that parses real-world HTML as well as w3m does, and
>> generates a parse tree based on that.  I guess if it returned a
>> convenient parse tree back to elisp, the HTML could be rendered fast
>> enough from elisp.

> http://www.xmlsoft.org/html/libxml-HTMLparser.html

Another one is "tidylib"
	http://tidy.sourceforge.net/
(the parsing of HTML Tidy repackaged as a library).

There is also the HTML parser of the w3c libwww, but that is awfully
outdated.
	http://www.w3.org/Library/src/HTML.html
(said to support HTML 4 in what's checked into the CVS, though)







* Re: Built-in HTML parsing and rendering library
  2010-09-06  7:32 ` Steinar Bang
@ 2010-09-06  8:29   ` David Engster
  0 siblings, 0 replies; 12+ messages in thread
From: David Engster @ 2010-09-06  8:29 UTC (permalink / raw)
  To: ding

Steinar Bang writes:
> What does w3m do?  Consider the CSS and use it, before delivering
> something for emacs/gnus to use?

w3m ignores CSS.

-David




* Re: Built-in HTML parsing and rendering library
  2010-09-06  4:27 ` Daniel Pittman
  2010-09-06  7:53   ` Steinar Bang
@ 2010-09-06 11:33   ` Lars Magne Ingebrigtsen
  2010-09-06 12:20     ` Ted Zlatanov
  1 sibling, 1 reply; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-06 11:33 UTC (permalink / raw)
  To: ding

Daniel Pittman <daniel@rimspace.net> writes:

> http://www.xmlsoft.org/html/libxml-HTMLparser.html

Oh, libxml2 has a built-in "real world" HTML parser?  Interesting.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





* Re: Built-in HTML parsing and rendering library
  2010-09-06 11:33   ` Lars Magne Ingebrigtsen
@ 2010-09-06 12:20     ` Ted Zlatanov
  2010-09-06 12:28       ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 12+ messages in thread
From: Ted Zlatanov @ 2010-09-06 12:20 UTC (permalink / raw)
  To: ding

On Mon, 06 Sep 2010 13:33:48 +0200 Lars Magne Ingebrigtsen <larsi@gnus.org> wrote: 

LMI> Daniel Pittman <daniel@rimspace.net> writes:
>> http://www.xmlsoft.org/html/libxml-HTMLparser.html

LMI> Oh, libxml2 has a built-in "real world" HTML parser?  Interesting.

It's what Gnome uses so it's pretty good.  Because of the Gnome link it
would probably be the easiest one to bring into the Emacs core.

Ted





* Re: Built-in HTML parsing and rendering library
  2010-09-06 12:20     ` Ted Zlatanov
@ 2010-09-06 12:28       ` Lars Magne Ingebrigtsen
  2010-09-06 12:40         ` Julien Danjou
                           ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-06 12:28 UTC (permalink / raw)
  To: ding

Ted Zlatanov <tzz@lifelogs.com> writes:

> It's what Gnome uses so it's pretty good.  Because of the Gnome link it
> would probably be the easiest one to bring into the Emacs core.

Yes.  I had a quick peek at the interface, and it seemed to return a
nice DOM that could probably be exported to Emacs pretty easily as an
elisp list tree like (:html (:head ...) (:body ...)) etc.
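A minimal sketch of the list-tree idea above, not the libxml2 interface itself: Python's standard-library `html.parser` stands in for libxml2's HTML parser, and the names `TreeBuilder`, `parse`, and `to_sexp` are made up for illustration. It builds nested lists from sloppy markup and prints them in the `(:html (:head ...) (:body ...))` shape.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Builds nested lists shaped like [tag, child, child, ...]."""

    def __init__(self):
        super().__init__()
        self.root = ["document"]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag]  # attributes are dropped to keep the sketch short
        self.stack[-1].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: close back to the nearest matching open
        # tag, and ignore end tags that match nothing.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].append(text)


def parse(html):
    """Parse an HTML string and return the synthetic root node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def to_sexp(node):
    """Format a node as an elisp-style list with keyword tag names."""
    if isinstance(node, str):
        return '"' + node.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return "(:" + " ".join([node[0]] + [to_sexp(c) for c in node[1:]]) + ")"


if __name__ == "__main__":
    # Note the unclosed <b>: the tree builder closes it when </p> arrives.
    tree = parse("<html><head><title>T</title></head>"
                 "<body><p>Hi <b>there</p></body></html>")
    print(to_sexp(tree[1]))
```

A C version against libxml2 would walk the document's node tree in the same way and cons up Lisp lists instead of Python lists.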

And since libxml2 is already installed on 99% of Linux machines, linking
Emacs to it should be no big deal.

So the question is: If we have the parse tree in Emacs Lisp, would we be
able to render it quickly enough for it to make sense to use?  I haven't
really thought about it much, but it strikes me that rendering heavily
nested tables and the like might be a time-consuming task in a language
that's as slow as Emacs Lisp.  But it might be fine; I'm not sure at all.
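To make the rendering question concrete, here is a rough sketch of the simplest possible renderer over such a tree: one pass that treats a few tags as paragraph breaks and fills the text to a fixed width. The `[tag, child, ...]` node shape, the `BLOCK_TAGS` set, and the `render` function are assumptions for illustration. The sketch says nothing about how fast the equivalent Emacs Lisp would be, and it sidesteps table layout entirely by flattening each row to a line of text.

```python
import textwrap

# Tags assumed to start and end a paragraph; everything else is inline.
BLOCK_TAGS = {"body", "div", "p", "h1", "h2", "h3",
              "ul", "ol", "li", "table", "tr"}


def render(node, width=72):
    """Return plain text for a [tag, child, ...] tree, one pass over it."""
    blocks = []   # finished, filled paragraphs
    current = []  # words of the paragraph being built

    def flush():
        if current:
            blocks.append(textwrap.fill(" ".join(current), width))
            current.clear()

    def walk(n):
        if isinstance(n, str):
            current.extend(n.split())
            return
        is_block = n[0] in BLOCK_TAGS
        if is_block:
            flush()
        for child in n[1:]:
            walk(child)
        if is_block:
            flush()

    walk(node)
    flush()
    return "\n\n".join(blocks)


if __name__ == "__main__":
    tree = ["body",
            ["p", "Hello", ["b", "big"], "world"],
            ["table", ["tr", ["td", "a"], ["td", "b"]],
                      ["tr", ["td", "c"]]]]
    print(render(tree))
```

The walk visits each node once, so the cost grows with the size of the tree. Real table rendering needs column widths computed before any cell can be placed, which is where nested tables become expensive.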

Is there a component of libxml2 (or some other handy library) that does
HTML rendering, too?  :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





* Re: Built-in HTML parsing and rendering library
  2010-09-06 12:28       ` Lars Magne Ingebrigtsen
@ 2010-09-06 12:40         ` Julien Danjou
  2010-09-06 13:09         ` Ted Zlatanov
  2010-09-06 18:26         ` Sivaram Neelakantan
  2 siblings, 0 replies; 12+ messages in thread
From: Julien Danjou @ 2010-09-06 12:40 UTC (permalink / raw)
  To: ding


On Mon, Sep 06 2010, Lars Magne Ingebrigtsen wrote: 

> Is there a component of libxml2 (or some other handy library) 
> that does HTML rendering, too?  :-)

Webkit[1] ? *g*

[1]  http://webkit.org/

-- 
Julien Danjou
// ᐰ <julien@danjou.info>   http://julien.danjou.info



* Re: Built-in HTML parsing and rendering library
  2010-09-06 12:28       ` Lars Magne Ingebrigtsen
  2010-09-06 12:40         ` Julien Danjou
@ 2010-09-06 13:09         ` Ted Zlatanov
  2010-09-06 18:26         ` Sivaram Neelakantan
  2 siblings, 0 replies; 12+ messages in thread
From: Ted Zlatanov @ 2010-09-06 13:09 UTC (permalink / raw)
  To: ding

On Mon, 06 Sep 2010 14:28:10 +0200 Lars Magne Ingebrigtsen <larsi@gnus.org> wrote: 

LMI> Ted Zlatanov <tzz@lifelogs.com> writes:
>> It's what Gnome uses so it's pretty good.  Because of the Gnome link it
>> would probably be the easiest one to bring into the Emacs core.

LMI> Yes.  I had a quick peek at the interface, and it seemed to return a
LMI> nice DOM that could probably be exported to Emacs pretty easily as an
LMI> elisp list tree like (:html (:head ...) (:body ...)) etc.

HTML and XML are SGML which is a crappy Lisp, so yeah :)  Parsing them
with libxml2 would improve many corners of Emacs.

LMI> And since libxml2 is already installed on 99% of Linux machines, linking
LMI> Emacs to it should be no big deal.

Yes.  The patch would be small.  I don't know if the Emacs maintainers
will have objections but it's kind of weird no one has proposed it yet.

LMI> So the question is: If we have the parse tree in Emacs Lisp, would we be
LMI> able to render it quickly enough for it to make sense to use?  I haven't
LMI> really thought about it much, but it strikes me that rendering heavily
LMI> nested tables and the like might be a time-consuming task in a language
LMI> that's as slow as Emacs Lisp.  But it might be fine; I'm not sure at all.

LMI> Is there a component of libxml2 (or some other handy library) that does
LMI> HTML rendering, too?  :-)

These days Mozilla's Gecko is getting less popular.  http://webkit.org/
is really popular and it's LGPL.  I know it's been proposed for Emacs
inclusion before and I think it's just been general laziness not to
include it.

IMO this is a really deep hole that is measured in man-years of work.
HTML parsing is easy; rendering it is a nightmare compounded by years of
legacy crap.  So I am pessimistic this is a good use of your time.  If
the Emacs project took interest in this, there would be many more
hackers and users available and it could happen.  Or it could all
devolve into endless arguments about keyboard stickers and DVCS supremacy.

Ted





* Re: Built-in HTML parsing and rendering library
  2010-09-06 12:28       ` Lars Magne Ingebrigtsen
  2010-09-06 12:40         ` Julien Danjou
  2010-09-06 13:09         ` Ted Zlatanov
@ 2010-09-06 18:26         ` Sivaram Neelakantan
  2010-09-06 19:58           ` Lars Magne Ingebrigtsen
  2 siblings, 1 reply; 12+ messages in thread
From: Sivaram Neelakantan @ 2010-09-06 18:26 UTC (permalink / raw)
  To: ding

On Mon, Sep 06 2010, Lars Magne Ingebrigtsen wrote:


[snipped 9 lines]

> And since libxml2 is already installed on 99% of Linux machines, linking
> Emacs to it should be no big deal.
>

[snipped 8 lines]

erm...about the blighters like me who use Emacs on Win32.

Brother, could you spare a thought? ;)

 sivaram
 -- 





* Re: Built-in HTML parsing and rendering library
  2010-09-06 18:26         ` Sivaram Neelakantan
@ 2010-09-06 19:58           ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 12+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-06 19:58 UTC (permalink / raw)
  To: ding

Sivaram Neelakantan <nsivaram.net@gmail.com> writes:

> erm...about the blighters like me who use Emacs on Win32.

I'd be surprised if libxml2 doesn't exist on Windows...

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





Thread overview: 12+ messages
2010-09-05 22:58 Built-in HTML parsing and rendering library Lars Magne Ingebrigtsen
2010-09-06  4:27 ` Daniel Pittman
2010-09-06  7:53   ` Steinar Bang
2010-09-06 11:33   ` Lars Magne Ingebrigtsen
2010-09-06 12:20     ` Ted Zlatanov
2010-09-06 12:28       ` Lars Magne Ingebrigtsen
2010-09-06 12:40         ` Julien Danjou
2010-09-06 13:09         ` Ted Zlatanov
2010-09-06 18:26         ` Sivaram Neelakantan
2010-09-06 19:58           ` Lars Magne Ingebrigtsen
2010-09-06  7:32 ` Steinar Bang
2010-09-06  8:29   ` David Engster
