Greetings,

I would like to announce the release of Lambda Soup, a library for
manipulating HTML documents with CSS selector support. In brief, it
allows expressions such as

    (* Print all links. *)
    read_file "index.html" |> parse
    $$ "a[href]"
    |> iter (fun a -> a |> R.attribute "href" |> print_endline)

and

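    (* Add ids to all <h2> tags. *)
    let soup = read_channel stdin |> parse in
    soup $$ "h2"
    |> iter (fun h2 -> h2 |> set_attribute "id" (R.leaf_text h2));
    (* iter returns unit, so serialize the modified document itself. *)
    soup |> to_string |> write_channel stdout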

The library is based on a set of lazy node traversals (to parents,
children, siblings, etc.). The CSS syntax maps onto these. Types are
used to distinguish HTML node classes (such as text, element, and
document) and reduce the need for error-checking.
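
As a rough illustration of the idea, here is a toy model in plain OCaml.
It is only a sketch and not Lambda Soup's actual code or API: the names
tree, descendants, select_tag, as_element and name are all made up for
this example. It shows a lazy traversal, a type selector expressed as a
filter over that traversal, and a phantom type marking element nodes.

```ocaml
(* Toy document tree; hypothetical, for illustration only. *)
type tree =
  | Text of string
  | Element of string * tree list

(* Lazy pre-order traversal of all descendants. Nothing is visited
   until the resulting sequence is consumed. *)
let rec descendants (t : tree) : tree Seq.t =
  match t with
  | Text _ -> Seq.empty
  | Element (_, kids) ->
    List.to_seq kids
    |> Seq.flat_map (fun k -> fun () -> Seq.Cons (k, descendants k))

(* A type selector such as "h2" maps onto a filter over the traversal. *)
let select_tag (tag : string) (root : tree) : tree Seq.t =
  descendants root
  |> Seq.filter (function Element (n, _) -> n = tag | Text _ -> false)

(* Phantom type parameter: the node class lives in the type. In a real
   library the constructor would be hidden behind a module signature. *)
type element
type 'a node = Node of tree

(* The class check happens once, here. *)
let as_element (t : tree) : element node option =
  match t with
  | Element _ -> Some (Node t)
  | Text _ -> None

(* Accepts only element nodes, so callers need no run-time check. *)
let name (Node t : element node) : string =
  match t with
  | Element (n, _) -> n
  | Text _ -> assert false (* unreachable when built via as_element *)

let () =
  let doc =
    Element ("body",
      [ Element ("h2", [ Text "Intro" ]);
        Element ("div", [ Element ("h2", [ Text "Usage" ]) ]) ])
  in
  (* Print the tag name of every matching element. *)
  select_tag "h2" doc
  |> Seq.filter_map as_element
  |> Seq.iter (fun n -> print_endline (name n))
```

Because the traversal is a Seq.t, a consumer that only needs the first
match stops walking the tree as soon as it has found it.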

The library can be found here:

and the associated documentation is at

OCaml, as an impure functional language with terse syntax, seems very
well-suited to this kind of work. I currently have Lambda Soup
postprocessing its own ocamldoc documentation, and I found this
postprocessor more pleasant to write and maintain than the equivalent
program using Python's Beautiful Soup would have been.

There is some discussion of implementing a new lax HTML(5) parser. This
may be the next thing I will do. Any comments on this, and on Lambda
Soup, are welcome.

Lambda Soup is in OPAM as package "lambdasoup".

Best,

Anton