From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id EF1A37EE49 for ; Sat, 23 Feb 2013 13:40:29 +0100 (CET) Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of monnier.florent@gmail.com) identity=pra; client-ip=74.125.82.51; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="monnier.florent@gmail.com"; x-sender="monnier.florent@gmail.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of monnier.florent@gmail.com designates 74.125.82.51 as permitted sender) identity=mailfrom; client-ip=74.125.82.51; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="monnier.florent@gmail.com"; x-sender="monnier.florent@gmail.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-wg0-f51.google.com) identity=helo; client-ip=74.125.82.51; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="monnier.florent@gmail.com"; x-sender="postmaster@mail-wg0-f51.google.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AusBALu3KFFKfVIzjWdsb2JhbABFFsE7gQAIFg4BAQEBCQkLCRIGI4IgAQVAARsdAQMMBgULAzgiAREBBQEcBhMIDIdsAQMPDKETjDKCe4QSChknDVmIZwEFDI5PMweDQAOWPYEdjWYWKYQr X-IPAS-Result: AusBALu3KFFKfVIzjWdsb2JhbABFFsE7gQAIFg4BAQEBCQkLCRIGI4IgAQVAARsdAQMMBgULAzgiAREBBQEcBhMIDIdsAQMPDKETjDKCe4QSChknDVmIZwEFDI5PMweDQAOWPYEdjWYWKYQr X-IronPort-AV: E=Sophos;i="4.84,721,1355094000"; d="scan'208";a="3219460" Received: from mail-wg0-f51.google.com ([74.125.82.51]) by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/RC4-SHA; 23 Feb 2013 13:40:29 +0100 Received: by mail-wg0-f51.google.com with SMTP id 8so1194810wgl.6 for ; Sat, 23 Feb 2013 04:40:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type:content-transfer-encoding; bh=yjdG8rY6BiC3XOQeACkMZWAF3v207LjhVEnolD5nlts=; b=BKapNSeip7b+ZmcdxOua++d5jcZSxuNzvkafz0eCw7JlKUs05p4o4XH7G3qEiL8Els TyxJRuKfeIhYXPRkeFI+TYOluoA/l674jPk7wwx7nkK7BakphdSVd0u2WgaemD6R+yVC bnDtJKB2P20Q9hKF7iRSvVitnc5hsymlQGd6S/W7Hbsn/vuJ54VT93q3xbbhNXlpbgGd LfZYIFTSjGpPXOaRnb8z9fr9UwtBJYHrFO5XIRJ82co7HiQtl3D/K2Y9OpLUtB9jkwMC r6OqkwI1hiCkcv1zslzBbWFrFc+zJwX07/CMoxegskbHSnXxWtAzl5uK/2Tjyzk9XhgT TSQQ== MIME-Version: 1.0 X-Received: by 10.194.119.200 with SMTP id kw8mr8970439wjb.31.1361623228555; Sat, 23 Feb 2013 04:40:28 -0800 (PST) Received: by 10.194.7.195 with HTTP; Sat, 23 Feb 2013 04:40:28 -0800 (PST) In-Reply-To: <1361522580.4875.1@samsung> References: <20130123205229.GA2673@jrm.no-ip.org> <1361522580.4875.1@samsung> Date: Sat, 23 Feb 2013 13:40:28 +0100 Message-ID: From: Florent Monnier To: Gerd Stolpmann Cc: =?ISO-8859-1?Q?Jos=E9_Romildo_Malaquias?= , caml-list@inria.fr Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Validation-by: monnier.florent@gmail.com Subject: Re: [Caml-list] Extracting information from HTML documents 2013/2/22, Gerd Stolpmann : > Am 23.01.2013 21:52:29 schrieb(en) Jos=E9 Romildo Malaquias: >> Hello. >> >> tagsoup[1][2] is a Haskell library for parsing and extracting >> information from (possibly malformed) HTML/XML documents. >> >> tagsoup provides a basic data type for a list of unstructured tags, a >> parser to convert HTML into this tag type, and useful functions and >> combinators for finding and extracting information. >> >> Is there a similar library for OCaml? >> >> I want to write an application which will need to extract some >> information from HTML documents from the web. tagsoup helps a lot in >> the Haskell version of my program. Which OCaml libraries can help me >> with that when porting the application to OCaml? >> >> [1] http://community.haskell.org/~ndm/tagsoup/ >> [2] http://hackage.haskell.org/package/tagsoup >> >> >> Romildo > > Well, not really identical, but there is at least a robust HTML parser > in OCamlnet: > > http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Net= html.html > > Homepage: http://projects.camlcity.org/projects/ocamlnet.html > > This parser was once used for Mylife's profile extractor (grabbing data > from profile pages of social networks), and is proven to handle > absolutely bad HTML well. XML should also be no problem. > > Gerd There's also xmlerr: http://www.linux-nantes.org/%7Efmonnier/ocaml/xmlerr/ but xmlerr is an alpha, experimental, hobbyist, not professional thing. Not all but some parts of its code are very quick-n-dirty. I've written it for my own use to read HTML web-pages, and I'm using it quite often since several years now. 99.9% of the time it does what I expect from it. It's not able to read XML files that are several Go, because it first loads the content in a string and then parses from it which was a very poor choice, but at the beginning I was only using it to load HTML web-pages. Don't expect something of the quality of Nethtml, xmlm and xml-light! I've never used Nethtml so I cannot say anything about it, but from what I can see from the interface is that the type is: type document =3D | Element of (string * (string * string) list * document list) | Data of string XmlErr's type is: type attr =3D string * string type t =3D | Tag of string * attr list (** opening tag *) | ETag of string (** closing tag *) | Data of string (** PCData *) | Comm of string (** Comments *) type html =3D t list As a result xmlerr will be able to return a plain representation of: text So it seems that Nethtml will return something corrected. Xmlerr doesn't, it only returns what it seems. Also Xmlerr parses comments because sometimes what I want to get is there. Xmlerr only returns junk for the very XML specific things like