From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id SAA19492; Mon, 21 Jun 2004 18:12:56 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id SAA19269 for ; Mon, 21 Jun 2004 18:12:55 +0200 (MET DST) Received: from mail.npc.de (fw.npc.de [62.225.140.214]) by nez-perce.inria.fr (8.12.10/8.12.10) with ESMTP id i5LGCsEV008260 for ; Mon, 21 Jun 2004 18:12:54 +0200 Received: from styx.ffm.npc.de (lion.npc.de [192.168.129.1]) by mail.npc.de (Postfix) with ESMTP id 86C30181B; Mon, 21 Jun 2004 18:12:53 +0200 (CEST) Received: from hera.ffm.npc.de (hera.ffm.npc.de [192.168.130.8]) by styx.ffm.npc.de (Postfix STYX NPC GmbH) with ESMTP id 546D5C437; Mon, 21 Jun 2004 18:12:52 +0200 (CEST) Received: from ares.ffm.npc.de (ares.ffm.npc.de [192.168.130.32]) by hera.ffm.npc.de (Postfix HERA NPC GmbH) with ESMTP id C5683159C9; Mon, 21 Jun 2004 18:12:51 +0200 (CEST) Received: from ares.ffm.npc.de (ares.ffm.npc.de [192.168.130.32]) by ares.ffm.npc.de (Postfix ARES NPC GmbH) with ESMTP id 3C3E11855; Mon, 21 Jun 2004 18:12:51 +0200 (CEST) Subject: Re: [Caml-list] Parse crazy HTML, output XML From: Gerd Stolpmann To: Richard Jones Cc: caml-list@inria.fr In-Reply-To: <20040621160328.GA28952@redhat.com> References: <20040621160328.GA28952@redhat.com> Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.8 Date: 21 Jun 2004 18:12:50 +0200 Message-Id: <1087834371.15824.1.camel@ares> Mime-Version: 1.0 X-Miltered: at nez-perce with ID 40D70906.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Loop: caml-list@inria.fr X-Spam: no; 0.00; caml-list:01 gerd:01 stolpmann:01 2004:99 fragment:01 pxp:01 nethtml:01 ocamlnet:01 ocamlnet:01 sourceforge:01 gerd:01 stolpmann:01 viktoriastr:01 64293:01 darmstadt:01 Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk Am Mon, 2004-06-21 um 18.03 schrieb Richard Jones: > I have a bunch of HTML documents from an external source which I do > not control. They aren't valid XML, by any means. I need to read > them in, do a "best effort" to build a DOM, do various manipulations > over the DOM (such as removing