I evaluated HTML parsers back in February. Most of the available options are really XML parsers, and many of them are very old. Other than Nethtml, I came up with two alternatives to consider:

http://erratique.ch/software/xmlm
https://github.com/facebook/pfff/tree/master/lang_html

I didn't end up spending much time on either. It quickly became clear that Nethtml was what I needed. It handles content that isn't strictly valid, which was important to me, and it performs well.

Cheers,
Andy


On Mon, Aug 11, 2014 at 4:57 PM, Jacques du Preez <jacquesdpz@gmail.com> wrote:
Thanks. I eventually discovered ocamlnet, but I'm hoping there is more than one option?

==============================
Jacques du Preez

Web: OpenLandscape.net
Twitter: @jacquesdp


On Sun, Aug 10, 2014 at 10:42 PM, Christophe Troestler <Christophe.Troestler@umons.ac.be> wrote:
Hi,

On Sun, 10 Aug 2014 19:38:39 +0200, Jacques du Preez wrote:
>
> I've been searching for an OCaml library to parse HTML, and then be able to
> query and manipulate it similar to jQuery.
>
> The JSoup Java library, http://jsoup.org, allows me to do this. Is there
> something like this for OCaml?

Nethtml in ocamlnet partly does what you need (you can easily write
recursive functions to extract the desired data from the HTML tree).
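
A minimal sketch of what such recursive functions might look like. To keep it
runnable without ocamlnet installed, it declares a local `document` type that
mirrors the shape of `Nethtml.document`; the helper names `hrefs` and
`text_of` and the `sample` tree are made up for illustration. With ocamlnet
you would drop the local type and pass in the list returned by `Nethtml.parse`
instead.

```ocaml
(* Local stand-in with the same shape as Nethtml.document (assumption:
   used here only so the sketch compiles without ocamlnet). *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Collect the href attribute of every <a> element, in document order. *)
let rec hrefs nodes =
  List.concat (List.map hrefs_of_node nodes)

and hrefs_of_node = function
  | Data _ -> []
  | Element (name, attrs, children) ->
    let here =
      if name = "a" then
        (try [List.assoc "href" attrs] with Not_found -> [])
      else []
    in
    here @ hrefs children

(* Concatenate all text nodes below the given nodes, ignoring markup. *)
let rec text_of nodes =
  String.concat ""
    (List.map
       (function
         | Data s -> s
         | Element (_, _, children) -> text_of children)
       nodes)

(* Hypothetical hand-built tree standing in for parser output. *)
let sample =
  [ Element ("html", [], [
      Element ("body", [], [
        Element ("a", [("href", "http://jsoup.org")], [Data "JSoup"]);
        Element ("p", [], [
          Data "See ";
          Element ("a", [("href", "http://erratique.ch/software/xmlm")],
                   [Data "xmlm"]) ]) ]) ]) ]

let () =
  List.iter print_endline (hrefs sample);
  print_endline (text_of sample)
```

This is nowhere near a jQuery-style selector engine, but the same
pattern-matching shape extends to filtering on other tag names or attributes,
or to rebuilding a modified tree by returning new `Element` values.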

Best,
C.