From: James Leifer
To: Caml list
Subject: Re: [Caml-list] Parse crazy HTML, output XML
Date: Mon, 21 Jun 2004 18:18:15 +0200
In-Reply-To: (Alain Frisch's message of "Mon, 21 Jun 2004 18:08:49 +0200 (MET DST)")

Alain Frisch writes:

> On Mon, 21 Jun 2004, Richard Jones wrote:
>
>> The problem is the parsing phase. Both PXP and XmlLight will only
>> parse valid XML (as far as I can see). Is there any simple pure OCaml
>> library for parsing HTML and producing a DOM?

If you've got really broken documents then perhaps "tidy" is your
friend. Yes, I know it may be outside the scope of your request
because it's an external program but apparently it can do wonders for
syntactically dubious tag-soup html.
I believe that it can produce pure xhtml, for example.

-J