caml-list - the Caml user's mailing list
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: "José Romildo Malaquias" <j.romildo@gmail.com>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Extracting information from HTML documents
Date: Fri, 22 Feb 2013 09:43:00 +0100
Message-ID: <1361522580.4875.1@samsung>
In-Reply-To: <20130123205229.GA2673@jrm.no-ip.org> (from j.romildo@gmail.com on Wed Jan 23 21:52:29 2013)

Well, not really identical, but there is at least a robust HTML parser  
in OCamlnet:

http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html

Homepage: http://projects.camlcity.org/projects/ocamlnet.html

This parser was once used for Mylife's profile extractor (grabbing data
from profile pages of social networks), and has proven to handle even
badly malformed HTML well. XML should also be no problem.
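
A minimal sketch of how link extraction with Nethtml might look, assuming
OCamlnet is installed and its findlib package netstring is linked in. The
helper names links_of and extract_links and the sample HTML are made up
for illustration; only Nethtml.parse, the Nethtml.document type, and
Netchannels.input_string come from the library:

```ocaml
(* Build (assumed setup):
   ocamlfind ocamlopt -package netstring -linkpkg links.ml -o links *)

(* Collect the href attribute of every <a> element, in document order.
   Nethtml lowercases element and attribute names by default. *)
let rec links_of (doc : Nethtml.document) : string list =
  match doc with
  | Nethtml.Data _ -> []
  | Nethtml.Element (name, attrs, children) ->
      let here =
        if name = "a" then
          (try [List.assoc "href" attrs] with Not_found -> [])
        else []
      in
      here @ List.concat (List.map links_of children)

(* Parse an HTML string with the default DTD and return all link targets. *)
let extract_links (html : string) : string list =
  let ch = new Netchannels.input_string html in
  let docs = Nethtml.parse ch in
  List.concat (List.map links_of docs)

let () =
  (* Deliberately sloppy input: <p> and <body> are never closed. *)
  let html =
    "<html><body><p>See <a href=\"http://example.org/a\">A</a> \
     and <a href=\"b.html\">B</a>"
  in
  List.iter print_endline (extract_links html)
```

The parse result is a plain tree of Element and Data nodes, so extraction
is ordinary pattern matching; there is no tagsoup-style combinator layer,
which is why it is "not really identical".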

Gerd


On 23.01.2013 21:52:29, José Romildo Malaquias wrote:
> Hello.
> 
> tagsoup[1][2] is a Haskell library for parsing and extracting
> information from (possibly malformed) HTML/XML documents.
> 
> tagsoup provides a basic data type for a list of unstructured tags, a
> parser to convert HTML into this tag type, and useful functions and
> combinators for finding and extracting information.
> 
> Is there a similar library for OCaml?
> 
> I want to write an application which will need to extract some
> information from HTML documents from the web. tagsoup helps a lot in
> the Haskell version of my program. Which OCaml libraries can help me
> with that when porting the application to OCaml?
> 
> [1] http://community.haskell.org/~ndm/tagsoup/
> [2] http://hackage.haskell.org/package/tagsoup
> 
> 
> Romildo
> 



-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------
