From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id D33D47F75C for ; Sun, 24 Aug 2014 23:54:46 +0200 (CEST) Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of andrew.herron@gmail.com) identity=pra; client-ip=209.85.220.177; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="andrew.herron@gmail.com"; x-sender="andrew.herron@gmail.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of andrew.herron@gmail.com designates 209.85.220.177 as permitted sender) identity=mailfrom; client-ip=209.85.220.177; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="andrew.herron@gmail.com"; x-sender="andrew.herron@gmail.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-vc0-f177.google.com) identity=helo; client-ip=209.85.220.177; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="andrew.herron@gmail.com"; x-sender="postmaster@mail-vc0-f177.google.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AlUGAF9e+lPRVdyxm2dsb2JhbABYg2BXBIJ4gRyrCQaDHIEKmHEBCYdNgQMIFhABAQEBAQYLCwkUKYIKAYF4AQEBAwESER0BOAEDAQsBBQUEBw0qAgIhARIBBQEcBhMiiAsBAwkIDZoMg16DN2uLK4UCiUMWDScNhSIRAQUOhW6HIx2CDIJDDzISgUEFhQSPclmEaoIQgViMf4RDGCmBbIM1Ky+CTwEBAQ X-IPAS-Result: AlUGAF9e+lPRVdyxm2dsb2JhbABYg2BXBIJ4gRyrCQaDHIEKmHEBCYdNgQMIFhABAQEBAQYLCwkUKYIKAYF4AQEBAwESER0BOAEDAQsBBQUEBw0qAgIhARIBBQEcBhMiiAsBAwkIDZoMg16DN2uLK4UCiUMWDScNhSIRAQUOhW6HIx2CDIJDDzISgUEFhQSPclmEaoIQgViMf4RDGCmBbIM1Ky+CTwEBAQ X-IronPort-AV: E=Sophos;i="5.04,393,1406584800"; d="scan'208";a="76438880" Received: from mail-vc0-f177.google.com ([209.85.220.177]) by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/RC4-SHA; 24 Aug 2014 23:54:45 +0200 Received: by mail-vc0-f177.google.com with SMTP id hy4so14017389vcb.8 for ; Sun, 24 Aug 2014 14:54:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=1SlUSjhQwmZJ6GlOd3UidZnRs2c2id3ETWdrHNiRERg=; b=Dk5yuw+dr0x6CZbj5gRNCSTiTaadRPCRqpD0eYJhkCu3nIQ0aJ1+B4EZcBMCXmDolM QAwRu/a1Kfjs4hz1dihOrKT9e7ODWdqTl8wFctmFBEi+xbgAngH4svKxEQJWsn2asiV8 hOROHLleelD6bMaL7zUsP7ClcHnQSCJb53f9Wxo0IakyChNqCEioM+MDPVacWBAdHSEg 5wuQKGeZFui4Z5K12cSzkVvf/KwDWXzgFpfCwE33c8vMy1BkcYU+FiH7RalRB4ZgHk/Z fozCO2Gk8lg0/kOm3Zuwzx6IpDD/0M4sdarr57kaWBpreZAdSxM4I7LR2WCGCd3+hAQL gDMQ== MIME-Version: 1.0 X-Received: by 10.52.69.172 with SMTP id f12mr7643270vdu.9.1408917284450; Sun, 24 Aug 2014 14:54:44 -0700 (PDT) Sender: andrew.herron@gmail.com Received: by 10.52.97.194 with HTTP; Sun, 24 Aug 2014 14:54:44 -0700 (PDT) In-Reply-To: References: <20140810.224256.1353397051109538039.Christophe.Troestler@umons.ac.be> Date: Mon, 25 Aug 2014 07:54:44 +1000 X-Google-Sender-Auth: JZmaABYJRsBVjlwszMc5uFrTVPM Message-ID: From: Andrew Herron To: Jacques du Preez Cc: OCaml Mailing List Content-Type: multipart/alternative; boundary=20cf307cffd641fe2f0501671ea5 Subject: Re: [Caml-list] OCaml HTML parsing & manipulation --20cf307cffd641fe2f0501671ea5 Content-Type: text/plain; charset=UTF-8 I did an evaluation of HTML parsers back in February. Most of the options are XML parsers, and a lot of them are very old. Other than Nethtml, I came up with two alternatives to consider: http://erratique.ch/software/xmlm https://github.com/facebook/pfff/tree/master/lang_html I didn't end up spending much time on either. It quickly became clear that Nethtml was what I needed. It handles content that isn't strictly valid, which was important to me, and has good performance. Cheers, Andy On Mon, Aug 11, 2014 at 4:57 PM, Jacques du Preez wrote: > Thanks. I eventually discovered ocamlnet, but I'm hoping there's maybe > more than 1 option? > > ============================== > Jacques du Preez > > Web: OpenLandscape.net > Twitter: @jacquesdp > > > On Sun, Aug 10, 2014 at 10:42 PM, Christophe Troestler < > Christophe.Troestler@umons.ac.be> wrote: > >> Hi, >> >> On Sun, 10 Aug 2014 19:38:39 +0200, Jacques du Preez wrote: >> > >> > I've been searching for an OCaml library to parse HTML, and then be >> able to >> > query and manipulate it similar to jQuery. >> > >> > The JSoup Java library, http://jsoup.org, allows me to do this. Is >> there >> > something like this for OCaml? >> >> Nethtml in ocamlnet partly does what you need (you can easily write >> recursive functions to extract the desired data from the HTML tree). >> >> Best, >> C. >> > > --20cf307cffd641fe2f0501671ea5 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
I did an evaluation of HTML parsers back in February. Most= of the options are XML parsers, and a lot of them are very old. Other than= Nethtml, I came up with two alternatives to consider:


I didn't end up spending much time on either.= It quickly became clear that Nethtml was what I needed. It handles content= that isn't strictly valid, which was important to me, and has good per= formance.

Cheers,
Andy
=

On Mon, Aug 11, 2014 at 4:57 PM, Jacques= du Preez <jacquesdpz@gmail.com> wrote:
Thanks. I eventually discov= ered ocamlnet, but I'm hoping there's maybe more than 1 option?

=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Jacques du Preez<= br>
Web: OpenLandscape.net
Twitter: @jacquesdp


On Sun, Aug 10, 20= 14 at 10:42 PM, Christophe Troestler <Christophe.Troestler@= umons.ac.be> wrote:
Hi,

On Sun, 10 Aug 2014 19:38:39 +0200, Jacques du Preez wrote:
>
> I've been searching for an OCaml library to parse HTML, and then b= e able to
> query and manipulate it similar to jQuery.
>
> The JSoup Java library, http://jsoup.org, allows me to do this. Is there
> something like this for OCaml?

Nethtml in ocamlnet partly does what you need (you can easily write recursive functions to extract the desired data from the HTML tree).

Best,
C.


--20cf307cffd641fe2f0501671ea5--