From: Anton Bachin
Date: Mon, 16 Nov 2015 15:01:15 -0600
To: caml-list@inria.fr
Subject: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
Message-Id: <4824377F-4045-4D47-9BAB-E06B0C939988@yahoo.com>

Greetings,

I would like to announce the release of Lambda Soup, a library for
manipulating HTML documents with CSS selector support. In brief, it
allows expressions such as

    (* Print all links. *)
    read_file "index.html" |> parse
    $$ "a[href]"
    |> iter (fun a -> a |> R.attribute "href" |> print_endline)

and

    (* Add ids to all <h2> tags. *)
    read_channel stdin |> parse
    $$ "h2"
    |> iter (fun h2 -> h2 |> set_attribute "id" (R.leaf_text h2))
    |> write_channel stdout

The library is based on a set of lazy node traversals (to parents,
children, siblings, etc.). The CSS syntax maps onto these. Types are
used to distinguish HTML node classes (such as text, element, and
document) and reduce the need for error-checking.

The library can be found here:

    https://github.com/aantron/lambda-soup

and the associated documentation is at

    http://aantron.github.io/lambda-soup

OCaml, as an impure functional language with terse syntax, seems very
well-suited to this kind of work. I currently have Lambda Soup
postprocessing its own ocamldoc documentation, and I found this
postprocessor more pleasant to write and maintain than the equivalent
program using Python's Beautiful Soup would have been.

There is some discussion of implementing a new lax HTML(5) parser. This
may be the next thing I will do. Any comments on this, and on Lambda
Soup, are welcome.

Lambda Soup is in OPAM as package "lambdasoup".

Best,
Anton
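
As a rough illustration of the two design points mentioned in the announcement (lazy traversals, and types that separate node classes), here is a minimal sketch that uses only the OCaml standard library. It is a toy model and not Lambda Soup's implementation: the types `element` and `node` and the functions `descendants`, `select`, `attribute`, and `hrefs` are hypothetical names invented for this example, and `select` matches on tag name only rather than on full CSS selectors.

```ocaml
(* Toy document model. All names below are hypothetical and are not
   taken from Lambda Soup. *)
type element = {
  tag : string;
  attrs : (string * string) list;
  children : node list;
}
and node =
  | Element of element
  | Text of string

(* Lazy pre-order traversal of everything below [e]. Nothing is visited
   until the resulting sequence is forced. *)
let rec descendants (e : element) : node Seq.t =
  List.to_seq e.children
  |> Seq.flat_map (fun child ->
         match child with
         | Text _ -> Seq.return child
         | Element e' -> fun () -> Seq.Cons (child, descendants e'))

(* Keep only elements with the given tag. The result type is
   [element Seq.t], so callers no longer need to handle text nodes. *)
let select (tag : string) (root : element) : element Seq.t =
  descendants root
  |> Seq.filter_map (function
       | Element e when e.tag = tag -> Some e
       | _ -> None)

(* Attribute lookup is defined on elements only, so asking a text node
   for an attribute is a compile-time error, not a runtime check. *)
let attribute (name : string) (e : element) : string option =
  List.assoc_opt name e.attrs

let doc =
  { tag = "body"; attrs = [];
    children = [
      Element { tag = "a"; attrs = [ ("href", "one.html") ];
                children = [ Text "one" ] };
      Element { tag = "p"; attrs = [];
                children = [
                  Text "see ";
                  Element { tag = "a"; attrs = [ ("href", "two.html") ];
                            children = [ Text "two" ] } ] };
      Element { tag = "a"; attrs = []; children = [ Text "no href" ] } ] }

(* Collect the href of every <a> that has one. *)
let hrefs (root : element) : string list =
  select "a" root |> Seq.filter_map (attribute "href") |> List.of_seq

let () = List.iter print_endline (hrefs doc)
```

Because `select` returns a sequence, a caller that only needs the first match can stop early without walking the rest of the tree, which is the practical benefit of keeping the traversal lazy.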