From: "Oliver Bandel"
To: caml-list@inria.fr
Date: Tue, 13 Apr 2010 21:15:49 +0200
Subject: Parsing with two scanners(ources) as input (?)
Message-ID: <20100413211549.14246792pcyfgnol@webmail.in-berlin.de>

Hello,

I want to parse HTML, and by that I mean I want to parse the structure of the tags as well as the contents of the data elements.

At the moment I'm hacking together a special parser for this case, but it is ugly, because I have to hand-code the parser's state machine.

It would be easier and more elegant if I could combine the HTML tag parsing with the text parsing of the data elements.

For HTML parsing I use Nethtml. For text scanning I use Pcre.

I want to be able to select certain tags and text that occur at certain positions. To detect tags I match on Nethtml's Element (name, args, subnodes), and to detect the data strings I look into Nethtml's Data string with Pcre.

I would like to find certain data that occurs after certain sequences in the tree, and then look for certain strings inside those data strings.

Any ideas on how to create the parser? I thought about somehow wrapping the stuff and giving it to ocamlyacc. Maybe menhir is better for that task?

At the moment I use the Element match just to call the recursive parser on the next doclist. All my parsing uses the Data match and looks at the contents there. This is because the information I want to parse out of the document is flat text inside that data string. But some of that information could also be found via tag sequences.

So I'm looking for a way to combine both approaches. How can I do it?

Oliver
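
For illustration, one way to look at tag sequences and data strings in the same pass is to carry the path of enclosing tag names through the recursive traversal. What follows is only a sketch under assumptions, not a finished design: the type is a local copy of the shape of Nethtml's document type so that the code compiles without ocamlnet, `contains` is a hypothetical plain-substring stand-in for a Pcre match, and `collect`, `in_cell` and the sample document are made-up names and data.

```ocaml
(* Local copy of the shape of Nethtml.document, so that this sketch
   compiles on its own. With ocamlnet, use Nethtml.document instead. *)
type document =
  | Element of string * (string * string) list * document list
  | Data of string

(* Hypothetical stand-in for a Pcre match: true if [sub] occurs in [s]. *)
let contains ~sub s =
  let n = String.length sub and m = String.length s in
  let rec go i = i + n <= m && (String.sub s i n = sub || go (i + 1)) in
  go 0

(* Walk the tree and carry the names of the enclosing tags in [path],
   innermost first. Return, in document order, every Data string for
   which [keep path text] holds. *)
let rec collect keep path docs =
  List.concat
    (List.map
       (function
         | Data s -> if keep path s then [ s ] else []
         | Element (name, _args, subnodes) ->
             collect keep (name :: path) subnodes)
       docs)

(* Made-up sample document. *)
let doc =
  [ Element ("table", [],
      [ Element ("tr", [],
          [ Element ("td", [], [ Data "Price: 42 EUR" ]);
            Element ("td", [], [ Data "n/a" ]) ]) ]);
    Element ("p", [], [ Data "Price: unrelated" ]) ]

(* Tag condition and text condition in one predicate: the text must sit
   directly inside a <td> whose parent is a <tr>, and mention "Price". *)
let in_cell path text =
  match path with
  | "td" :: "tr" :: _ -> contains ~sub:"Price" text
  | _ -> false

let hits = collect in_cell [] doc
```

Here `hits` holds only the string from the first table cell. The text inside the `p` element also mentions "Price", but it is rejected because its tag path does not match. A grammar-based variant (ocamlyacc or menhir) would need the tree flattened into a token stream first; this sketch does not attempt that.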