AW: [Caml-list] Extracting information from HTML documents

caml-list - the Caml user's mailing list

From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Florent Monnier <monnier.florent@gmail.com>
Cc: "José Romildo Malaquias" <j.romildo@gmail.com>, caml-list@inria.fr
Subject: AW: [Caml-list] Extracting information from HTML documents
Date: Sat, 23 Feb 2013 14:23:40 +0100
Message-ID: <1361625820.2723.2@samsung>
In-Reply-To: <CAE1DttBdrpW1-gq7GVTN6eVSY73bma95Lt2VKTyYDsSXRis1zg@mail.gmail.com> (from monnier.florent@gmail.com on Sat Feb 23 13:40:28 2013)

On 23.02.2013 13:40:28, Florent Monnier wrote:
> > Well, not really identical, but there is at least a robust HTML parser
> > in OCamlnet:
> >
> > http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html
> >
> > Homepage: http://projects.camlcity.org/projects/ocamlnet.html
> >
> > This parser was once used for Mylife's profile extractor (grabbing data
> > from profile pages of social networks), and is proven to handle
> > absolutely bad HTML well. XML should also be no problem.
> >
> > Gerd
> 
> There's also xmlerr:
> http://www.linux-nantes.org/%7Efmonnier/ocaml/xmlerr/
> 
> but xmlerr is an alpha, experimental, hobbyist, not professional thing.
> ...
> I've never used Nethtml so I cannot say anything about it, but
> from what I can see from the interface is that the type is:
> 
> type document =
>   | Element of (string * (string * string) list * document list)
>   | Data of string
> 
> XmlErr's type is:
> 
> type attr = string * string
> type t =
>   | Tag of string * attr list  (** opening tag *)
>   | ETag of string  (** closing tag *)
>   | Data of string  (** PCData *)
>   | Comm of string  (** Comments *)
> 
> type html = t list
> 
> As a result xmlerr will be able to return a plain representation of:
> <bold><i>text</bold></i>

Right, browsers in quirks mode understand this, although it has always
been against the specs. Note that even this is possible in quirks mode:
<b>bold <i>bold+italics </b>only italics </i>normal text

Nethtml cannot interpret this in the obviously intended way. In
practice, though, this was never a problem (fortunately, 99% of the
code on the web is cleaner than this).
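
To make the difference concrete, here is a minimal sketch of how a flat token list can hold the overlapping example while a tree cannot represent it directly. The token type is copied from the XmlErr interface quoted above; the token list is written by hand (it is not the output of either parser), and the name well_nested is hypothetical.

```ocaml
(* Token type copied from the XmlErr interface quoted above. *)
type attr = string * string
type t =
  | Tag of string * attr list  (* opening tag *)
  | ETag of string             (* closing tag *)
  | Data of string             (* PCData *)
  | Comm of string             (* comments *)

(* Hand-written token list for:
   <b>bold <i>bold+italics </b>only italics </i>normal text *)
let overlapping = [
  Tag ("b", []); Data "bold ";
  Tag ("i", []); Data "bold+italics ";
  ETag "b"; Data "only italics ";
  ETag "i"; Data "normal text";
]

(* True if every closing tag matches the most recently opened tag
   and nothing is left open at the end. *)
let well_nested tokens =
  let rec go stack = function
    | [] -> stack = []
    | Tag (name, _) :: rest -> go (name :: stack) rest
    | ETag name :: rest ->
        (match stack with
         | top :: stack' when top = name -> go stack' rest
         | _ -> false)
    | (Data _ | Comm _) :: rest -> go stack rest
  in
  go [] tokens
```

The flat list keeps the tags exactly as written, so the overlap survives. A tree of elements has no place for it: well_nested returns false for this input, which is the situation where a tree-building parser has to pick some repair.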

> So it seems that Nethtml will return something corrected.
> Xmlerr doesn't; it only returns what it sees.

Nethtml returns the logical view, i.e. it doesn't return tags but
elements. (NB: tags are the lexical delimiters of elements.) This is
normally what you want to see, because HTML is specified in terms of
elements (unless you are writing something like an HTML editor, where
knowing the tags as such is also important). Nethtml also handles
omitted tags, e.g. for <a><b>text</a> it will implicitly close the "b"
element when closing "a". Or even this: <p>para1 <p>para2. Here, Nethtml
closes the first "p" when it sees the second (because it knows that "p"
elements cannot contain other "p" elements). Note that this was always
the tricky part of HTML parsing, and it is the area where we had the
most problems.
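
The two implicit-closing rules described above can be sketched as follows. This is only an illustration under stated assumptions, not Nethtml's actual algorithm: the token type, the no_self_nesting table and all function names are hypothetical, and only the document type is taken from the Nethtml interface quoted earlier.

```ocaml
(* Tree type as quoted from the Nethtml interface. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Hypothetical token type for this sketch. *)
type token =
  | Open of string * (string * string) list
  | Close of string
  | Text of string

(* Hypothetical rule table: elements that may not contain themselves. *)
let no_self_nesting = ["p"; "li"]

(* State: (stack of open elements, innermost first; finished top-level
   nodes in reverse order). Each stack entry is (name, attrs, children
   in reverse order). *)

(* Attach a node to the innermost open element, or to the top level. *)
let add node (stack, top) =
  match stack with
  | [] -> ([], node :: top)
  | (n, a, k) :: rest -> ((n, a, node :: k) :: rest, top)

(* Close the innermost open element and attach it to its parent. *)
let pop (stack, top) =
  match stack with
  | [] -> (stack, top)
  | (name, attrs, kids) :: rest ->
      add (Element (name, attrs, List.rev kids)) (rest, top)

(* Close open elements until the one called [name] has been closed. *)
let rec close_through name state =
  match fst state with
  | [] -> state
  | (n, _, _) :: _ ->
      let state' = pop state in
      if n = name then state' else close_through name state'

let step state = function
  | Text s -> add (Data s) state
  | Open (name, attrs) ->
      (* Rule 2: a second <p> implicitly closes the open <p>. *)
      let state =
        match fst state with
        | (n, _, _) :: _ when n = name && List.mem name no_self_nesting ->
            pop state
        | _ -> state
      in
      ((name, attrs, []) :: fst state, snd state)
  | Close name ->
      (* Rule 1: closing an outer element closes everything inside it. *)
      if List.exists (fun (n, _, _) -> n = name) (fst state)
      then close_through name state
      else state (* stray closing tag: ignored in this sketch *)

let build tokens =
  let rec finish state =
    match fst state with
    | [] -> List.rev (snd state)
    | _ -> finish (pop state)
  in
  finish (List.fold_left step ([], []) tokens)
```

With this sketch, the tokens for <a><b>text</a> give a single "a" element containing "b", and the tokens for <p>para1 <p>para2 give two sibling "p" elements. A real parser needs a much larger rule table than the two-entry list assumed here.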

> Also Xmlerr parses comments because sometimes what I want to get is there.

This is also possible with Nethtml, but optional. Nethtml can also parse
processing instructions, but these are rarely used even in XML files.
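
A minimal sketch of collecting comment text from a parsed tree. It rests on an assumption that should be checked against the Nethtml documentation: that when comments are requested (believed to be the ?return_comments argument of Nethtml.parse), each comment is returned as a pseudo-element named "--" whose text sits in a "contents" attribute. The document type is redeclared locally so the sketch does not depend on OCamlnet; the function name comments is hypothetical.

```ocaml
(* Local copy of the tree type quoted from the Nethtml interface. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Assumption: a comment appears as Element ("--", ["contents", text], []).
   Collect the text of all comments in document order. *)
let rec comments doc =
  match doc with
  | Data _ -> []
  | Element ("--", attrs, _) ->
      (match List.assoc_opt "contents" attrs with
       | Some text -> [text]
       | None -> [])
  | Element (_, _, children) ->
      List.concat (List.map comments children)
```

If the assumed representation is right, the same traversal works for processing instructions by matching on their pseudo-element name instead of "--".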

> Xmlerr only returns junk for the very XML-specific things like <?xml
> and <! things;
> as a result it's not possible to use xmlerr to read, correct and print
> back corrected HTML when there are these kinds of elements.

But anyway, an XML token reader like Xmlerr is certainly something useful.

Gerd


> The last release also provides a command line utility "htmlxtr".
> This "thing" doesn't require any ocaml programming, it's a basic
> command line tool.
> What htmlxtr does is to "untemplate" templated parts of a web page
> (but in a very basic way) and print the extracted things on stdout
> (read man ./htmlxtr.1 for more information).
> I'm interested in suggestions to improve it.
> 
> I'm using xmlerr to make quickly written scripts. For example,
> Xmlerr.print_code prints HTML content as OCaml code of type Xmlerr.t,
> so that I can just quickly copy-paste a piece of it into a
> pattern match and get something from this piece in less than one
> minute.
> When the template of a website changes, I can usually fix my script in
> less than 3 minutes.
> 
> I know that some other programming languages provide utilities and
> libraries for this kind of task and that some use tricks and
> concepts to extract things from web pages as easily as possible,
> but I don't know them. If you do and have some time, please tell me
> about it.
> 
> Anyway, even if xmlerr is very amateurish,
> I would be interested to get any kind of suggestions about how to improve it.
> 
> --
> Cheers
> Florent
> 



-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------

Thread overview: 4+ messages
2013-01-23 20:52 José Romildo Malaquias
2013-02-22  8:43 ` AW: " Gerd Stolpmann
2013-02-23 12:40   ` Florent Monnier
2013-02-23 13:23     ` Gerd Stolpmann [this message]
