[Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors


* [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
@ 2015-11-16 21:01 Anton Bachin
  2015-11-17  9:31 ` François Bobot
  0 siblings, 1 reply; 11+ messages in thread
From: Anton Bachin @ 2015-11-16 21:01 UTC (permalink / raw)
  To: caml-list

Greetings,

I would like to announce the release of Lambda Soup, a library for
manipulating HTML documents with CSS selector support. In brief, it
allows expressions such as

    (* Print all links. *)

    read_file "index.html" |> parse
    $$ "a[href]"
    |> iter (fun a -> a |> R.attribute "href" |> print_endline)

and

    (* Add ids to all <h2> tags. *)

    read_channel stdin |> parse
    $$ "h2"
    |> iter (fun h2 -> h2 |> set_attribute "id" (R.leaf_text h2))
    |> write_channel stdout

The library is based on a set of lazy node traversals (to parents,
children, siblings, etc.). The CSS syntax maps onto these. Types are
used to distinguish HTML node classes (such as text, element, and
document) and reduce the need for error-checking.
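
The idea of lazy traversals with selectors layered on top can be sketched in a few lines of plain OCaml. This is a toy model under stated assumptions, not Lambda Soup's actual implementation: the `node` type and the functions `descendants` and `select_tag` are hypothetical names invented for illustration, and only the standard library's `Seq` module is used.

```ocaml
(* A toy DOM; NOT Lambda Soup's real representation. *)
type node =
  | Text of string
  | Element of string * node list   (* tag name, children *)

(* Lazily enumerate all descendants of a node in document order.
   Nothing is visited until the sequence is consumed. *)
let rec descendants (n : node) : node Seq.t =
  match n with
  | Text _ -> Seq.empty
  | Element (_, children) ->
    Seq.flat_map
      (fun child () -> Seq.Cons (child, descendants child))
      (List.to_seq children)

(* A descendant selector such as "h2" becomes a filter over the traversal. *)
let select_tag (tag : string) (root : node) : node Seq.t =
  Seq.filter
    (function Element (t, _) -> t = tag | Text _ -> false)
    (descendants root)

let doc =
  Element ("body", [
    Element ("h2", [Text "Intro"]);
    Element ("div", [Element ("h2", [Text "Usage"]); Text "..."]);
  ])

(* Collect the text of every <h2> that contains a single text node. *)
let headings =
  select_tag "h2" doc
  |> Seq.filter_map (function
       | Element (_, [Text s]) -> Some s
       | _ -> None)
  |> List.of_seq
```

Because `descendants` returns a `Seq.t`, a consumer that only needs the first match does not force a walk of the whole tree.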

The library can be found here:

    https://github.com/aantron/lambda-soup

and the associated documentation is at

    http://aantron.github.io/lambda-soup

OCaml, as an impure functional language with terse syntax, seems very
well-suited to this kind of work. I currently have Lambda Soup
postprocessing its own ocamldoc documentation, and I found this
postprocessor more pleasant to write and maintain than the equivalent
program using Python's Beautiful Soup would have been.

There is some discussion of implementing a new lax HTML(5) parser. This
may be the next thing I will do. Any comments on this, and on Lambda
Soup, are welcome.

Lambda Soup is in OPAM as package "lambdasoup".

Best,
Anton


* Re: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
  2015-11-16 21:01 [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors Anton Bachin
@ 2015-11-17  9:31 ` François Bobot
  2015-11-22  7:58   ` Anton Bachin
  0 siblings, 1 reply; 11+ messages in thread
From: François Bobot @ 2015-11-17  9:31 UTC (permalink / raw)
  To: caml-list

On 16/11/2015 22:01, Anton Bachin wrote:
> I would like to announce the release of Lambda Soup, a library for
> manipulating HTML documents with CSS selector support.

Nice! It is a very nice way to select nodes.

> The library is based on a set of lazy node traversals (to parents,
> children, siblings, etc.). The CSS syntax maps onto these. Types are
> used to distinguish HTML node classes (such as text, element, and
> document) and reduce the need for error-checking.

Are the types as powerful as the ones in tyxml? Is it possible to have this kind of selector on 
top of tyxml?

regards,

-- 
François



* Re: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
  2015-11-17  9:31 ` François Bobot
@ 2015-11-22  7:58   ` Anton Bachin
  2015-11-23 10:44     ` François Bobot
  0 siblings, 1 reply; 11+ messages in thread
From: Anton Bachin @ 2015-11-22  7:58 UTC (permalink / raw)
  To: François Bobot; +Cc: caml-list

Hello François,

First, I have to apologize. It seems my mail client thought all of Caml-list is
spam. I have been disappointed that the list is so quiet. Turns out I should be
disappointed at my mail client instead.

> Are the types as powerful as the ones in tyxml? Is it possible to have
> this kind of selector on top of tyxml?

The types are not as powerful as the ones in tyxml, in the sense of being
precise. Lambda Soup only distinguishes between "definitely elements",
"definitely documents", and "anything, including an element, or document, or any
other node". It doesn't know about element types or any constraints on their
composition.
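
The three-way distinction described above can be modelled with a phantom type parameter. The following is a minimal sketch, assuming a made-up `Dom` module; the names `forget`, `as_element`, `tag_name` and the internal `raw` type are hypothetical and are not claimed to match Lambda Soup's interface.

```ocaml
module Dom : sig
  type element               (* "definitely an element" *)
  type document              (* "definitely a document" *)
  type general               (* "anything" *)
  type 'kind node

  val document : general node list -> document node
  val element : string -> general node list -> element node
  val text : string -> general node

  val forget : 'a node -> general node           (* always safe *)
  val as_element : 'a node -> element node option (* checked at run time *)
  val tag_name : element node -> string
  val children : 'a node -> general node list
end = struct
  type element
  type document
  type general

  type raw =
    | Document of raw list
    | Element of string * raw list
    | Text of string

  (* The parameter is phantom: it exists only for the type checker. *)
  type 'kind node = raw

  let document children = Document children
  let element name children = Element (name, children)
  let text s = Text s
  let forget n = n
  let as_element n = match n with Element _ -> Some n | _ -> None
  let tag_name = function
    | Element (name, _) -> name
    | Document _ | Text _ -> assert false (* ruled out by the signature *)
  let children = function
    | Document cs | Element (_, cs) -> cs
    | Text _ -> []
end

let page =
  Dom.document [
    Dom.forget (Dom.element "h1" [Dom.text "Title"]);
    Dom.text "stray text";
    Dom.forget (Dom.element "p" []);
  ]

(* Only nodes that pass the checked coercion reach tag_name. *)
let tags =
  Dom.children page
  |> List.filter_map Dom.as_element
  |> List.map Dom.tag_name
```

Note that nothing here knows which elements may contain which others; that finer level of precision is what tyxml's types provide.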

It does, however, seem to be possible to have the selectors on top of tyxml. I
am not very familiar with tyxml, but I do see types Xml.elt and Xml.econtent,
and value content. It may be possible to use those directly as the
internal DOM representation of Lambda Soup, and build traversals over them, and
it is almost certainly possible to convert them to Lambda Soup's current
internal representation, the same way Lambda Soup currently converts Nethtml’s
parse trees.

If there is interest in this, perhaps with some study Lambda Soup can be
modified to use tyxml – perhaps functorized.

Regards,
Anton


* Re: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
  2015-11-22  7:58   ` Anton Bachin
@ 2015-11-23 10:44     ` François Bobot
  2015-11-23 16:26       ` Anton Bachin
  0 siblings, 1 reply; 11+ messages in thread
From: François Bobot @ 2015-11-23 10:44 UTC (permalink / raw)
  To: Anton Bachin; +Cc: caml-list

On 22/11/2015 08:58, Anton Bachin wrote:
>> Are the types as powerful as the ones in tyxml? Is it possible to have
>> this kind of selector on top of tyxml?
>
> The types are not as powerful as the ones in tyxml, in the sense of being
> precise. Lambda Soup only distinguishes between "definitely elements",
> "definitely documents", and "anything, including an element, or document, or any
> other node". It doesn't know about element types or any constraints on their
> composition.
>
> It does, however, seem to be possible to have the selectors on top of tyxml. I
> am not very familiar with tyxml, but I do see types Xml.elt and Xml.econtent,
> and value content. It may be possible to use those directly as the
> internal DOM representation of Lambda Soup, and build traversals over them, and
> it is almost certainly possible to convert them to Lambda Soup's current
> internal representation, the same way Lambda Soup currently converts Nethtml’s
> parse trees.

Nice to know.

> If there is interest in this, perhaps with some study Lambda Soup can be
> modified to use tyxml – perhaps functorized.

The point is simply to avoid creating yet another XML type and to make OCaml libraries easier to compose.

The functor solution used by cohttp (for allowing lwt or async) for example is a nice one, and 
flambda should eliminate most of the cost of it (on a side note, does @inline work for functor 
application?)

However, in this case, since the difference is about typing and not just the implementation, I'm not 
sure you can define a generic functor that could be applied to instances (NetHttp, tyxml, yours) 
that restrict the function applications differently.

-- 
François





* Re: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
  2015-11-23 10:44     ` François Bobot
@ 2015-11-23 16:26       ` Anton Bachin
  2015-11-23 17:16         ` Drup
  2015-11-24  8:35         ` François Bobot
  0 siblings, 2 replies; 11+ messages in thread
From: Anton Bachin @ 2015-11-23 16:26 UTC (permalink / raw)
  To: François Bobot; +Cc: caml-list

> The point is simply to avoid creating yet another XML type and to make OCaml libraries easier to compose.

That’s a worthwhile goal. I certainly intend to look into it. I want
there to be some “canonical” set of libraries that work together nicely.

> The functor solution used by cohttp (for allowing lwt or async) for example is a nice one, and flambda should eliminate most of the cost of it (on a side note, does @inline work for functor application?)

Since you just made me aware of the existence of @inline (thank you),
I’m going to have to hope someone else answers the side note
question :) I would have to gain some experience with it first in
normal usage.

> However, in this case, since the difference is about typing and not just the implementation, I'm not sure you can define a generic functor that could be applied to instances (NetHttp, tyxml, yours) that restrict the function applications differently.

Let me make sure we are focusing on the same issue. I am supposing that
you would like to construct a tree using tyxml’s “richly-typed”
functions, then be able to walk around it using Lambda Soup, and make
edits in the tree using the mutation interface (or some improved
variant of it) or tyxml. However, when making the edits, you would want
the “rich” tyxml types preserved, to benefit from the static constraints
they impose on your code. Is this right?

I think that is the only trouble. Lambda Soup doesn’t actually expose
any XML type, and I think it can be ported or functorized to work with just
about any low-level XML representation.
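
As a rough illustration of what "functorized" could mean here, the sketch below parameterizes a tiny selection layer over any tree representation. The `TREE` signature, the `Select` functor and both backends are hypothetical, written only for this example; they do not reproduce Lambda Soup, Nethtml or tyxml interfaces, and they say nothing about preserving richer static types.

```ocaml
(* Hypothetical minimal view of a tree that a traversal layer would need. *)
module type TREE = sig
  type t
  val name : t -> string option      (* None for text nodes *)
  val children : t -> t list
end

(* The traversal code is written once, against the signature only. *)
module Select (T : TREE) = struct
  let rec descendants (n : T.t) : T.t list =
    List.concat (List.map (fun c -> c :: descendants c) (T.children n))

  let by_tag (tag : string) (root : T.t) : T.t list =
    List.filter (fun n -> T.name n = Some tag) (descendants root)
end

(* One possible backend: a simple variant type. *)
module Simple = struct
  type t = Text of string | Elt of string * t list
  let name = function Text _ -> None | Elt (n, _) -> Some n
  let children = function Text _ -> [] | Elt (_, cs) -> cs
end

(* Another backend with a different shape: records, empty tag for text. *)
module Record = struct
  type t = { tag : string; kids : t list }
  let name n = if n.tag = "" then None else Some n.tag
  let children n = n.kids
end

module S = Select (Simple)
module R = Select (Record)

let simple_count =
  List.length
    (S.by_tag "a" Simple.(Elt ("p", [Elt ("a", []); Text "x"; Elt ("a", [])])))

let record_count =
  List.length
    (R.by_tag "a" Record.{ tag = "p"; kids = [ { tag = "a"; kids = [] } ] })
```

The functor handles differences in representation; whether it can also carry different typing restrictions per backend is the open question raised earlier in the thread.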

Regards,
Anton


* Re: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
  2015-11-23 16:26       ` Anton Bachin
@ 2015-11-23 17:16         ` Drup
  2015-11-23 17:35           ` Anton Bachin
  2015-11-24  8:35         ` François Bobot
  1 sibling, 1 reply; 11+ messages in thread
From: Drup @ 2015-11-23 17:16 UTC (permalink / raw)
  To: Anton Bachin, François Bobot; +Cc: caml-list

There seems to be a slight misunderstanding about how tyxml is 
constructed, so let me clarify things a bit.

- Tyxml doesn't have a canonical xml datatype, it's functorized over a 
generic Xml signature (implemented in [Xml_sigs.T]). As far as tyxml is 
concerned, xml nodes are a fully abstract type and can only be 
constructed. Multiple modules implement this signature in the ocsigen 
stack (two in js_of_ocaml's Tyxml_js, three in eliom) that present 
different characteristics. In particular some of them are really 
abstract (React signals ...) and I doubt you could construct selectors 
over them in a meaningful way (but I would be happy to be proven wrong).
- Another signature, [Xml_sigs.ITERABLE], implements global iteration 
over xml trees. It is not necessary for an XML implementation used by 
tyxml to respect it and, in particular, it is not implemented for 
js_of_ocaml's Tyxml_js. As pointed out previously, it doesn't make sense 
for all implementations, but we could implement it for some of them.
- There is no signature for mutation (at the moment). This may be an 
interesting improvement.
- The [Xml] module implements a "bare" XML datatype that is not really 
used by ocsigen, but can be used to build simple xml trees in a typeful 
manner (and then print them). It also answers ITERABLE.

Now, in order to type lambda_soup using tyxml's types: It's going to be 
a bit of work. You can perfectly well reuse all of tyxml's types, but you need 
typeful combinators instead of strings, otherwise you have no way to 
know what your selection is going to return. You may be able to cheat 
your way through by creating a fake xml module and instantiating tyxml's 
functors on it to create all the combinators (that would be fun :p)

In any case, you will pay for type safety with a significant increase in 
verbosity and awkwardness. I'm not sure it's worth the effort, since a 
lot of real world html trees are not correct and you never really 
need to select tyxml-constructed trees anyway. Simple compatibility with 
tyxml is much easier: you just have to agree with tyxml's signatures 
(which would deserve a bit of a cleanup).
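
To make the "typeful combinators instead of strings" point concrete, here is a minimal sketch using a GADT whose index records what a selector returns. Everything in it (`raw`, `a_kind`, `div_kind`, `elt`, `selector`, `select`) is a hypothetical name for illustration; it is neither tyxml's nor Lambda Soup's interface, and a real design would need far more combinators.

```ocaml
type raw = Text of string | Elt of string * raw list

(* Phantom markers for two element kinds; purely illustrative. *)
type a_kind
type div_kind

(* A typed wrapper: the parameter records which element this is. *)
type 'k elt = Typed of raw

(* Typed selectors: the index says what kind of element a match will be.
   A plain string selector could not carry this information. *)
type 'k selector =
  | A : a_kind selector
  | Div : div_kind selector

let tag_of : type k. k selector -> string = function
  | A -> "a"
  | Div -> "div"

let rec descendants = function
  | Text _ -> []
  | Elt (_, cs) -> List.concat (List.map (fun c -> c :: descendants c) cs)

let select : type k. k selector -> raw -> k elt list =
  fun sel root ->
    let tag = tag_of sel in
    descendants root
    |> List.filter (function Elt (t, _) -> t = tag | Text _ -> false)
    |> List.map (fun n -> Typed n)

(* A function that only accepts <a> elements. *)
let count_links (links : a_kind elt list) : int = List.length links

let tree =
  Elt ("body", [Elt ("div", [Elt ("a", []); Text "x"]); Elt ("a", [])])

let n_links = count_links (select A tree)
(* count_links (select Div tree) would be rejected by the type checker. *)
```

Even in this tiny form, every selectable tag needs its own constructor and marker type, which hints at the verbosity cost mentioned above.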

[Xml_sigs.T]: 
https://github.com/ocsigen/tyxml/blob/master/lib/xml_sigs.mli#L21
[Xml_sigs.ITERABLE]: 
https://github.com/ocsigen/tyxml/blob/master/lib/xml_sigs.mli#L70
[Xml]: https://github.com/ocsigen/tyxml/blob/master/lib/xml.mli
[Tyxml_js]: 
https://github.com/ocsigen/js_of_ocaml/blob/master/lib/tyxml/tyxml_js.mli


* Re: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
  2015-11-23 17:16         ` Drup
@ 2015-11-23 17:35           ` Anton Bachin
  2015-11-23 17:41             ` Anton Bachin
  2015-11-23 18:20             ` Drup
  0 siblings, 2 replies; 11+ messages in thread
From: Anton Bachin @ 2015-11-23 17:35 UTC (permalink / raw)
  To: Drup; +Cc: François Bobot, caml-list

> There seems to be a slight misunderstanding about how tyxml is constructed, so let me clarify things a bit.

Thanks. I will still have to look at tyxml, though.

> Now, in order to type lambda_soup using tyxml's types: It's going to be a bit of work. You can perfectly well reuse all of tyxml's types, but you need typeful combinators instead of strings, otherwise you have no way to know what your selection is going to return. You may be able to cheat your way through by creating a fake xml module and instantiating tyxml's functors on it to create all the combinators (that would be fun :p)

Does tyxml have checked coercions? I was thinking of something like
filtering a traversal by a checked coercion. This is how Lambda Soup
currently does it for traversing elements. While traversing nodes, it
filters by a checked coercion to elements. Typed selection, as you
suggest, is another possibility, but my guess is that it would take
quite a while to design something that is easily learnable and not very
challenging to type, if that is possible at all – as you seem to agree.

> In any case, you will pay for type safety with a significant increase in verbosity and awkwardness. I'm not sure it's worth the effort, since a lot of real world html trees are not correct and you never really need to select tyxml-constructed trees anyway. Simple compatibility with tyxml is much easier: you just have to agree with tyxml's signatures (which would deserve a bit of a cleanup).

This is what I would be going for by default, since without resorting
to coercions, that is the best, in terms of typing, that you could hope
for when parsing. My main concern beyond that, as expressed in my
previous message, is how Lambda Soup could best interact at the type
level with trees constructed by tyxml, not what types it (or any other
library in OCaml) could assign to a tree constructed from arbitrary
input. I suppose that if people really never need to select on tyxml
trees, as you say, then Lambda Soup and tyxml are simply addressing
different usages that don’t interact very much, which is what I
suspected from the beginning. Having no experience with tyxml, however,
I would like more feedback :)

Regards,
Anton


* Re: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
  2015-11-23 17:35           ` Anton Bachin
@ 2015-11-23 17:41             ` Anton Bachin
  2015-11-23 18:20             ` Drup
  1 sibling, 0 replies; 11+ messages in thread
From: Anton Bachin @ 2015-11-23 17:41 UTC (permalink / raw)
  To: Drup; +Cc: François Bobot, caml-list

I suppose also with this kind of coercion, you could parse some input,
assert that you have a valid element of some type after selecting it,
and then use tyxml to insert or manipulate a subtree at that point. I
am really out of my depth at this point, so please correct me if this
is not a possible second kind of interaction between Lambda Soup and
tyxml’s types.


* Re: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
  2015-11-23 17:35           ` Anton Bachin
  2015-11-23 17:41             ` Anton Bachin
@ 2015-11-23 18:20             ` Drup
  2015-11-23 19:02               ` Anton Bachin
  1 sibling, 1 reply; 11+ messages in thread
From: Drup @ 2015-11-23 18:20 UTC (permalink / raw)
  To: Anton Bachin; +Cc: François Bobot, caml-list

> Does tyxml have checked coercions? I was thinking of something like
> filtering a traversal by a checked coercion. This is how Lambda Soup
> currently does it for traversing elements. While traversing nodes, it
> filters by a checked coercion to elements. Typed selection, as you
> suggest, is another possibility, but my guess is that it would take
> quite a while to design something that is easily learnable and not very
> challenging to type, if that is possible at all – as you seem to agree.
>

Kind of, but it's in Tyxml_js, not in tyxml directly. It needs knowledge 
over the type of nodes which is not available in Xml_sigs.T
It's also nominal, so you can't say `coerce ~to:"a"`



* Re: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
  2015-11-23 18:20             ` Drup
@ 2015-11-23 19:02               ` Anton Bachin
  0 siblings, 0 replies; 11+ messages in thread
From: Anton Bachin @ 2015-11-23 19:02 UTC (permalink / raw)
  To: Drup; +Cc: François Bobot, caml-list

> It's also nominal, so you can't say `coerce ~to:"a"`

That’s fine, I’m not proposing something like that, in part because I
don’t think it is even possible to have a type coercion with such usage
in OCaml. Maybe it is with some GADT abuse, if “a” was a constructor, but
it’s still not what I meant. I was just thinking of a set of functions

    coerce_blah : any_node -> blah_node option

for some more or less verbose actual types any_node and blah_node. For
comparison, see Dom_html.CoerceTo from js_of_ocaml [1]. Then you can do

    root $$ select "div > whatever"
    |> filter_map coerce_whatever
    |> iter (fun typed_whatever_node -> ...)
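
A self-contained version of this pattern, using only the standard library, might look like the sketch below. The `any_node` type and the `Div` module are hypothetical stand-ins for the "more or less verbose actual types" mentioned above; the point is only that the typed view is abstract, so the checked coercion is the sole way to obtain one.

```ocaml
type any_node = Text of string | Elt of string * any_node list

(* Hypothetical typed view of a <div>. The type is abstract, so a value
   of type Div.t can only come from a successful coercion. *)
module Div : sig
  type t
  val coerce : any_node -> t option
  val child_count : t -> int
end = struct
  type t = any_node list   (* the children of a node known to be a div *)
  let coerce = function
    | Elt ("div", cs) -> Some cs
    | _ -> None
  let child_count = List.length
end

let nodes =
  [Elt ("div", [Text "a"; Text "b"]); Text "x"; Elt ("p", []); Elt ("div", [])]

(* Filter a traversal by the checked coercion, then use the typed values. *)
let counts =
  nodes |> List.filter_map Div.coerce |> List.map Div.child_count
```

Nodes that fail the coercion are silently dropped, which matches how a selector followed by `filter_map` would behave.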

Anyway, I will study tyxml in some detail. I am working on something
else right now, so it will probably be in a few weeks.

Thanks,
Anton

[1]: http://ocsigen.org/js_of_ocaml/2.6/api/Dom_html.CoerceTo


* Re: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
  2015-11-23 16:26       ` Anton Bachin
  2015-11-23 17:16         ` Drup
@ 2015-11-24  8:35         ` François Bobot
  1 sibling, 0 replies; 11+ messages in thread
From: François Bobot @ 2015-11-24  8:35 UTC (permalink / raw)
  To: Anton Bachin; +Cc: caml-list

On 23/11/2015 17:26, Anton Bachin wrote:
>> However, in this case, since the difference is about typing and not just the implementation, I'm not sure you can define a generic functor that could be applied to instances (NetHttp, tyxml, yours) that restrict the function applications differently.
>
> Let me make sure we are focusing on the same issue. I am supposing that
> you would like to construct a tree using tyxml’s “richly-typed”
> functions, then be able to walk around it using Lambda Soup, and make
> edits in the tree using the mutation interface (or some improved
> variant of it) or tyxml. However, when making the edits, you would want
> the “rich” tyxml types preserved, to benefit from the static constraints
> they impose on your code. Is this right?

Yes, I forgot that you don't modify the tree. However even just for walking, you need to be able to 
return an element of the right type. But it is, indeed, easier without modification.

>
> I think that is the only trouble. Lambda Soup doesn’t actually expose
> any XML type, and I think it can be ported or functorized to work with just
> about any low-level XML representation.

It would be interesting to know how that works out.


Best,

-- 
François


