From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id 150667EE49 for ; Fri, 22 Feb 2013 09:43:03 +0100 (CET) Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of info@gerd-stolpmann.de) identity=pra; client-ip=212.227.17.9; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="info@gerd-stolpmann.de"; x-sender="info@gerd-stolpmann.de"; x-conformance=sidf_compatible Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of info@gerd-stolpmann.de) identity=mailfrom; client-ip=212.227.17.9; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="info@gerd-stolpmann.de"; x-sender="info@gerd-stolpmann.de"; x-conformance=sidf_compatible Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of postmaster@moutng.kundenserver.de designates 212.227.17.9 as permitted sender) identity=helo; client-ip=212.227.17.9; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="info@gerd-stolpmann.de"; x-sender="postmaster@moutng.kundenserver.de"; x-conformance=sidf_compatible; x-record-type="v=spf1" X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgIBAM8tJ1HU4xEJjWdsb2JhbABFFq8LkXiBCRYOAQEBAQkJFBIFJIImbgtAXQkSBhMJCQKHbAMTCLV5A4lYFYh1hV4mB4NAA417iVySTg X-IPAS-Result: AgIBAM8tJ1HU4xEJjWdsb2JhbABFFq8LkXiBCRYOAQEBAQkJFBIFJIImbgtAXQkSBhMJCQKHbAMTCLV5A4lYFYh1hV4mB4NAA417iVySTg X-IronPort-AV: E=Sophos;i="4.84,714,1355094000"; d="scan'208";a="4019874" Received: from moutng.kundenserver.de ([212.227.17.9]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 22 Feb 2013 09:43:02 +0100 Received: from office1.lan.sumadev.de (dslb-188-097-015-126.pools.arcor-ip.net [188.97.15.126]) by mrelayeu.kundenserver.de (node=mrbap2) with ESMTP (Nemesis) id 0LdL6H-1UZ2AO2I5p-00iGgQ; Fri, 22 Feb 2013 09:43:01 +0100 Received: from samsung (ip-5-146-55-186.unitymediagroup.de [5.146.55.186]) by office1.lan.sumadev.de (Postfix) with ESMTPSA id 17A70C00D0; Fri, 22 Feb 2013 09:43:01 +0100 (CET) Date: Fri, 22 Feb 2013 09:43:00 +0100 From: Gerd Stolpmann To: =?iso-8859-1?b?Sm9z6Q==?= Romildo Malaquias Cc: caml-list@inria.fr In-Reply-To: <20130123205229.GA2673@jrm.no-ip.org> (from j.romildo@gmail.com on Wed Jan 23 21:52:29 2013) X-Mailer: Balsa 2.4.11 Message-Id: <1361522580.4875.1@samsung> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15; DelSp=Yes; Format=Flowed Content-Disposition: inline Content-Transfer-Encoding: quoted-printable X-Provags-ID: V02:K0:SwfYbl2Lb2evmiZOxju9uEnw0QdNWYakAsvLPWVokek uu1veucT19IlFRuHKPU6DP3rhbQxFxetXKvyb17zWoxN4gC18e P94YzWr926ROTlzNX86hQEpPmv68Qc+570zbeF4fhPPUj6FMo4 P+bRvJ+2+LSQ5BdM9PQM7wiz+9q5Rrv/2Md4enHWn34YmTNNU0 8JmmW5uclesuhhRVcTtFljXDAj2h0eBv4jmRBfe4LJVAlrDFl8 2Y6EaSprzhKd6XwqbfTI3/eS2TX7kwWTQn0Lnm5BxXTkjKqCSZ 7tMMyB4q4FlM4udDO8o9r23YTrIoOxkV2swqji3bJ/FkSff01t wzS7ncRQS4ce23mLsJCQ= Subject: AW: [Caml-list] Extracting information from HTML documents Well, not really identical, but there is at least a robust HTML parser=20= =20 in OCamlnet: http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Netht= ml.html Homepage: http://projects.camlcity.org/projects/ocamlnet.html This parser was once used for Mylife's profile extractor (grabbing data=20= =20 from profile pages of social networks), and is proven to handle=20=20 absolutely bad HTML well. XML should also be no problem. Gerd Am 23.01.2013 21:52:29 schrieb(en) Jos=E9 Romildo Malaquias: > Hello. >=20 > tagsoup[1][2] is a Haskell library for parsing and extracting > information from (possibly malformed) HTML/XML documents. >=20 > tagsoup provides a basic data type for a list of unstructured tags, a > parser to convert HTML into this tag type, and useful functions and > combinators for finding and extracting information. >=20 > Is there a similar library for OCaml? >=20 > I want to write an application which will need to extract some > information from HTML documents from the web. tagsoup helps a lot in=20= =20 > the > Haskell version of my program. Which OCaml libraries can help me with > that when porting the application to OCaml? >=20 > [1] http://community.haskell.org/~ndm/tagsoup/ > [2] http://hackage.haskell.org/package/tagsoup >=20 >=20 > Romildo >=20 > -- > Caml-list mailing list. Subscription management and archives: > https://sympa.inria.fr/sympa/arc/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs >=20 >=20 --=20 ------------------------------------------------------------ Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de Creator of GODI and camlcity.org. Contact details: http://www.camlcity.org/contact.html Company homepage: http://www.gerd-stolpmann.de ------------------------------------------------------------=