Date: Sat, 23 Feb 2013 14:23:40 +0100
From: Gerd Stolpmann
To: Florent Monnier
Cc: José Romildo Malaquias, caml-list@inria.fr
Subject: Re: [Caml-list] Extracting information from HTML documents

On 23.02.2013 13:40:28, Florent Monnier wrote:
> > Well, not really identical, but there is at least a robust HTML parser
> > in OCamlnet:
> >
> > http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html
> >
> > Homepage: http://projects.camlcity.org/projects/ocamlnet.html
> >
> > This parser was once used for Mylife's profile extractor (grabbing data
> > from profile pages of social networks), and is proven to handle
> > absolutely bad HTML well. XML should also be no problem.
> >
> > Gerd
>
> There's also xmlerr:
> http://www.linux-nantes.org/%7Efmonnier/ocaml/xmlerr/
>
> but xmlerr is an alpha, experimental, hobbyist, not professional thing.
> ...
> I've never used Nethtml so I cannot say anything about it, but
> from what I can see from the interface is that the type is:
>
> type document =
>   | Element of (string * (string * string) list * document list)
>   | Data of string
>
> XmlErr's type is:
>
> type attr = string * string
> type t =
>   | Tag of string * attr list  (** opening tag *)
>   | ETag of string  (** closing tag *)
>   | Data of string  (** PCData *)
>   | Comm of string  (** Comments *)
>
> type html = t list
>
> As a result xmlerr will be able to return a plain representation of:
> text

Right, in quirks mode browsers understand this, although this has always
been against the specs. Note that even this is possible in quirks mode:

<b>bold <i>bold+italics</b> only italics</i> normal text

Nethtml cannot interpret this in the obviously intended way. In
practice, this was never a problem, though (fortunately, 99% of the
code on the web is cleaner than this).

> So it seems that Nethtml will return something corrected.
> Xmlerr doesn't, it only returns what it seems.

Nethtml returns the logical view, i.e. it doesn't return tags but
elements. (NB Tags are the lexical delimiters of elements.) This is
actually what you normally want to see because HTML is specified in
terms of elements (unless you are writing something like an HTML editor,
where knowing the tags as such is also important). Nethtml also processes
omitted tags, e.g. for <a><b>text</a> it will implicitly close the "b"
element when closing "a". Or even this:

<p>para1 <p>para2 - here, Nethtml closes the first "p" when it sees the
second (because it knows that "p" elements cannot contain other "p"
elements). Note that this was always the tricky part of HTML parsing,
and we had most problems in this area.

> Also Xmlerr parses comments because sometimes what I want to get is
> there.

This is also possible with Nethtml, but optional. Nethtml can also parse
processing instructions, but these are rarely used even in XML files.

> Xmlerr only returns junk for the very XML specific things like
> and as a result it's not possible to use xmlerr to read, correct and print
> back corrected HTML when there are these kind of elements.

But anyway, an XML token reader like Xmlerr is certainly something
useful.

Gerd

> The last release also provides a command line utility "htmlxtr".
> This "thing" doesn't require any ocaml programming, it's a basic
> command line tool.
> What htmlxtr does is to "untemplate" templated parts of a web-page
> (but in a very basic way) and print the extracted things on stdout
> (read man ./htmlxtr.1 for more informations).
> I'm interested by suggestions to improve it.
>
> I'm using xmlerr to make quickly written scripts, for example
> Xmlerr.print_code prints an HTML content as ocaml code with Xmlerr.t
> type, so that I can just quickly copy-paste a piece of it in a
> parttern match and get something from this piece in less than one
> minute.
> When the template of a website changes, I can usually fix my script in
> less than 3 minutes.
>
> I know that some other programming languages provide utilities and
> libraries for these kind of tasks and that some uses some tricks and
> concepts to extract things from web-pages the more easily possible,
> but I don't know them. If you do and have some time, please tell me
> about it.
>
> Anyway even if xmlerr is very amateurish,
> I would be interested to get any kind of suggestions about how to
> improve it.
>
> --
> Cheers
> Florent

-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------
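
The difference between the tag view and the element view discussed in
this message can be illustrated with a small, self-contained sketch. It
is not Nethtml's or Xmlerr's actual code: the two types only mirror the
shapes quoted above (the token constructor is renamed to Text to avoid a
clash with Data, and comments are left out), and the names frame, step
and build are hypothetical. The only "DTD knowledge" assumed here is that
a "p" element cannot contain another "p"; a real parser needs far more
rules than that.

```ocaml
(* Tag (lexical) view, shaped like the Xmlerr type quoted in the mail. *)
type attr = string * string
type token =
  | Tag of string * attr list   (* opening tag *)
  | ETag of string              (* closing tag *)
  | Text of string              (* character data *)

(* Element (logical) view, shaped like the Nethtml type quoted in the mail. *)
type document =
  | Element of (string * attr list * document list)
  | Data of string

(* An element that has been opened but not yet closed (hypothetical helper). *)
type frame = { name : string; attrs : attr list; rev_children : document list }

(* State: stack of open elements (innermost first) and finished top-level
   nodes in reverse order. *)

(* Attach a finished node to the innermost open element, or to the top level. *)
let add_node node (stack, rev_top) =
  match stack with
  | f :: rest -> ({ f with rev_children = node :: f.rev_children } :: rest, rev_top)
  | [] -> ([], node :: rev_top)

(* Close the innermost open element and attach it to its parent. *)
let close_one (stack, rev_top) =
  match stack with
  | [] -> ([], rev_top)
  | f :: rest ->
    add_node (Element (f.name, f.attrs, List.rev f.rev_children)) (rest, rev_top)

let is_open name (stack, _) = List.exists (fun f -> f.name = name) stack

(* Close open elements until the innermost one called [name] has been closed;
   everything nested inside it is closed implicitly. *)
let rec close_through name state =
  match fst state with
  | [] -> state
  | f :: _ ->
    let state' = close_one state in
    if f.name = name then state' else close_through name state'

let step state tok =
  match tok with
  | Text s -> add_node (Data s) state
  | ETag name ->
    (* A closing tag with no matching open element is simply ignored. *)
    if is_open name state then close_through name state else state
  | Tag (name, attrs) ->
    (* Assumed rule: an opening "p" implicitly closes a "p" that is still open. *)
    let state =
      if name = "p" && is_open "p" state then close_through "p" state else state
    in
    let (stack, rev_top) = state in
    ({ name; attrs; rev_children = [] } :: stack, rev_top)

(* At end of input, close whatever is still open. *)
let rec close_all state =
  match fst state with
  | [] -> state
  | _ -> close_all (close_one state)

let build tokens =
  let (_, rev_top) = close_all (List.fold_left step ([], []) tokens) in
  List.rev rev_top
```

With these definitions, the token list for <p>para1 <p>para2 yields two
sibling "p" elements, and the token list for <a><b>text</a> yields a "b"
element nested inside "a", closed implicitly when "a" is closed. A token
list keeps whatever was in the input, which is why it is the more
convenient form when the tags themselves matter.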