From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sat, 23 Feb 2013 14:23:40 +0100
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Florent Monnier <monnier.florent@gmail.com>
Cc: José Romildo Malaquias <j.romildo@gmail.com>,
	caml-list@inria.fr
In-Reply-To: <CAE1DttBdrpW1-gq7GVTN6eVSY73bma95Lt2VKTyYDsSXRis1zg@mail.gmail.com>
	(from monnier.florent@gmail.com on Sat Feb 23 13:40:28 2013)
X-Mailer: Balsa 2.4.11
Message-Id: <1361625820.2723.2@samsung>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; DelSp=Yes; Format=Flowed
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Caml-list] Extracting information from HTML documents

On 23.02.2013 13:40:28, Florent Monnier wrote:
> > Well, not really identical, but there is at least a robust HTML parser
> > in OCamlnet:
> >
> > http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html
> >
> > Homepage: http://projects.camlcity.org/projects/ocamlnet.html
> >
> > This parser was once used for Mylife's profile extractor (grabbing data
> > from profile pages of social networks), and is proven to handle
> > absolutely bad HTML well. XML should also be no problem.
> >
> > Gerd
>
> There's also xmlerr:
> http://www.linux-nantes.org/%7Efmonnier/ocaml/xmlerr/
>
> but xmlerr is an alpha, experimental, hobbyist, not professional thing.
> ...
> I've never used Nethtml so I cannot say anything about it, but
> from what I can see from the interface is that the type is:
>
> type document =
>   | Element of (string * (string * string) list * document list)
>   | Data of string
>
> XmlErr's type is:
>
> type attr = string * string
> type t =
>   | Tag of string * attr list  (** opening tag *)
>   | ETag of string  (** closing tag *)
>   | Data of string  (** PCData *)
>   | Comm of string  (** Comments *)
>
> type html = t list
>
> As a result xmlerr will be able to return a plain representation of:
> <bold><i>text</bold></i>

Right, in quirks mode browsers understand this, although it has always
been against the specs. Note that even this is possible in quirks mode:
<b>bold <i>bold+italics </b>only italics </i>normal text

Nethtml cannot interpret this in the obviously intended way. In
practice, though, this was never a problem (fortunately, 99% of the
code on the web is cleaner than this).
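
To illustrate what the "obviously intended" reading would be, here is a
minimal sketch in plain OCaml. It reuses the Xmlerr token type as quoted
above; the token list is written by hand (it is not output produced by
Xmlerr), and styled_runs is a hypothetical helper invented for this
example. It walks the tokens and records which tags are open for each
text run, which is roughly how a lenient browser treats the overlap:

```ocaml
(* Token type copied from the Xmlerr interface quoted in this thread. *)
type attr = string * string
type t =
  | Tag of string * attr list  (* opening tag *)
  | ETag of string             (* closing tag *)
  | Data of string             (* PCData *)
  | Comm of string             (* comments *)

(* Hand-written token list (an assumption, not real Xmlerr output) for:
   <b>bold <i>bold+italics </b>only italics </i>normal text *)
let tokens = [
  Tag ("b", []); Data "bold ";
  Tag ("i", []); Data "bold+italics ";
  ETag "b"; Data "only italics ";
  ETag "i"; Data "normal text";
]

(* Remove the first occurrence of [name] from the list of open tags. *)
let rec remove_first name = function
  | [] -> []
  | x :: rest -> if x = name then rest else x :: remove_first name rest

(* Hypothetical helper: pair every text run with the sorted list of
   tags that are open at that point. Overlapping tags are fine here
   because nothing forces the tags to nest properly. *)
let styled_runs tokens =
  let rec loop open_tags acc = function
    | [] -> List.rev acc
    | Tag (name, _) :: rest -> loop (name :: open_tags) acc rest
    | ETag name :: rest -> loop (remove_first name open_tags) acc rest
    | Data s :: rest ->
        loop open_tags ((List.sort compare open_tags, s) :: acc) rest
    | Comm _ :: rest -> loop open_tags acc rest
  in
  loop [] [] tokens

let () =
  List.iter
    (fun (styles, text) ->
       Printf.printf "[%s] %S\n" (String.concat "," styles) text)
    (styled_runs tokens)
```

For the token list above this prints the four runs with the tag sets
[b], [b,i], [i] and [] respectively. A tree of elements cannot express
the third run directly, because "i" would have to be both inside and
outside of "b".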

> So it seems that Nethtml will return something corrected.
> Xmlerr doesn't, it only returns what it seems.

Nethtml returns the logical view, i.e. it doesn't return tags but
elements. (NB: tags are the lexical delimiters of elements.) This is
what you normally want to see, because HTML is specified in terms of
elements (unless you are writing something like an HTML editor, where
knowing the tags as such is also important). Nethtml also handles
omitted tags, e.g. for <a><b>text</a> it will implicitly close the "b"
element when closing "a". Or even this: <p>para1 <p>para2 - here,
Nethtml closes the first "p" when it sees the second (because it knows
that "p" elements cannot contain other "p" elements). Note that this
was always the tricky part of HTML parsing, and most of our problems
were in this area.
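
The two implicit-close rules can be sketched in a few lines of plain
OCaml. This is only a toy under stated assumptions, not the algorithm
Nethtml actually uses (which is driven by a DTD): the token type, the
no_self_nesting list and all function names are hypothetical, and only
the document type has the same shape as the Nethtml type quoted in this
thread.

```ocaml
(* Same shape as the Nethtml document type quoted in this thread. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Hypothetical token type, for this sketch only. *)
type token =
  | Open of string
  | Close of string
  | Text of string

(* Toy rule set: elements that may not contain themselves. *)
let no_self_nesting = ["p"; "li"]

(* The stack holds the open elements, innermost first, as pairs of
   (name, children in reverse order). The last frame is a root frame
   with the empty name. *)

(* Close the innermost element and attach it to its parent. *)
let close_top = function
  | (name, kids) :: (pname, pkids) :: rest ->
      (pname, Element (name, [], List.rev kids) :: pkids) :: rest
  | stack -> stack  (* only the root frame is left *)

let is_open name stack = List.exists (fun (n, _) -> n = name) stack

(* Close elements until the innermost [name] has been closed. *)
let rec close_until name stack =
  match stack with
  | (n, _) :: _ :: _ when n = name -> close_top stack
  | _ :: _ :: _ -> close_until name (close_top stack)
  | _ -> stack

let add_child child = function
  | (name, kids) :: rest -> (name, child :: kids) :: rest
  | [] -> [("", [child])]

let step stack = function
  | Text s -> add_child (Data s) stack
  | Open name ->
      (* <p>para1 <p>para2: a second "p" closes the first one. *)
      let stack =
        if List.mem name no_self_nesting && is_open name stack
        then close_until name stack
        else stack
      in
      (name, []) :: stack
  | Close name ->
      (* <a><b>text</a>: closing "a" also closes the open "b". *)
      if is_open name stack then close_until name stack
      else stack  (* stray closing tag: ignored *)

(* Build the element tree; anything still open at the end is closed. *)
let build tokens =
  let rec finish = function
    | [(_, kids)] -> List.rev kids
    | [] -> []
    | stack -> finish (close_top stack)
  in
  finish (List.fold_left step [("", [])] tokens)
```

With this toy rule set, build [Open "a"; Open "b"; Text "text"; Close "a"]
yields a single "a" element containing the "b" element, and
build [Open "p"; Text "para1 "; Open "p"; Text "para2"] yields two
sibling "p" elements.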

> Also Xmlerr parses comments because sometimes what I want to get is
> there.

This is also possible with Nethtml, but optional. Nethtml can also parse
processing instructions, but these are rarely used even in XML files.
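
A small sketch of how comments could be pulled out of the resulting
tree. It assumes the convention described in the Nethtml interface
documentation: when the optional return_comments argument of
Nethtml.parse is set, each comment comes back as an Element named "--"
with its text in a "contents" attribute. Please check this against the
documentation of your OCamlnet version. To stay runnable without
OCamlnet, the sketch uses a local copy of the document type and a
hand-built tree; the names comments and page are hypothetical.

```ocaml
(* Local copy of the document type, same shape as in Nethtml. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* With OCamlnet installed, the tree would come from something like
     Nethtml.parse ~return_comments:true
       (new Netchannels.input_string html)
   which is not executed in this sketch. *)

(* Collect the text of all comment nodes in document order, assuming
   comments are represented as Element ("--", ["contents", text], []). *)
let rec comments doc =
  match doc with
  | Data _ -> []
  | Element ("--", attrs, _) ->
      (match List.assoc_opt "contents" attrs with
       | Some text -> [text]
       | None -> [])
  | Element (_, _, children) -> List.concat (List.map comments children)

(* Hand-built example tree, standing in for parser output. *)
let page =
  Element ("div", [("class", "item")], [
    Element ("--", [("contents", " price: 42 ")], []);
    Element ("p", [], [Data "Widget"]);
  ])

let () = List.iter (Printf.printf "comment: %S\n") (comments page)
```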

> Xmlerr only returns junk for the very XML-specific things like <?xml
> and <! things.
> As a result, it's not possible to use xmlerr to read, correct and print
> back corrected HTML when there are these kinds of elements.

But anyway, an XML token reader like Xmlerr is certainly something
useful.

Gerd


> The last release also provides a command line utility "htmlxtr".
> This "thing" doesn't require any ocaml programming, it's a basic
> command line tool.
> What htmlxtr does is to "untemplate" templated parts of a web-page
> (but in a very basic way) and print the extracted things on stdout
> (read man ./htmlxtr.1 for more information).
> I'm interested in suggestions to improve it.
>
> I'm using xmlerr to make quickly written scripts, for example
> Xmlerr.print_code prints an HTML content as ocaml code with Xmlerr.t
> type, so that I can just quickly copy-paste a piece of it in a
> pattern match and get something from this piece in less than one
> minute.
> When the template of a website changes, I can usually fix my script in
> less than 3 minutes.
>
> I know that some other programming languages provide utilities and
> libraries for these kinds of tasks and that some use tricks and
> concepts to extract things from web-pages as easily as possible,
> but I don't know them. If you do and have some time, please tell me
> about it.
>
> Anyway even if xmlerr is very amateurish,
> I would be interested to get any kind of suggestions about how to
> improve it.
>
> --
> Cheers
> Florent


-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------