From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <oliver@first.in-berlin.de>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by yquem.inria.fr (Postfix) with ESMTP id 23DC0BC57
	for <caml-list@yquem.inria.fr>; Tue, 13 Apr 2010 21:15:51 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AoIBAAtgxEvAbSoIkWdsb2JhbACPYotiFQEBAQEJCwoHEQQevGKFDQQ
X-IronPort-AV: E=Sophos;i="4.52,199,1270418400"; 
   d="scan'208";a="56987560"
Received: from einhorn.in-berlin.de ([192.109.42.8])
  by mail1-smtp-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 13 Apr 2010 21:15:50 +0200
X-Envelope-From: oliver@first.in-berlin.de
X-Envelope-To: <caml-list@inria.fr>
Received: from localhost (okapi.in-berlin.de [192.109.42.117])
	by einhorn.in-berlin.de (8.13.6/8.13.6/Debian-1) with ESMTP id o3DJFniU019433
	for <caml-list@inria.fr>; Tue, 13 Apr 2010 21:15:49 +0200
Received: from e178023226.adsl.alicedsl.de (e178023226.adsl.alicedsl.de
	[85.178.23.226]) by webmail.in-berlin.de (Horde Framework) with HTTP; Tue,
	13 Apr 2010 21:15:49 +0200
Message-ID: <20100413211549.14246792pcyfgnol@webmail.in-berlin.de>
Date: Tue, 13 Apr 2010 21:15:49 +0200
From: "Oliver Bandel" <oliver@first.in-berlin.de>
To: caml-list@inria.fr
Subject: Parsing with two scanners(ources) as input (?)
MIME-Version: 1.0
Content-Type: text/plain;
	charset=ISO-8859-1;
	DelSp="Yes";
	format="flowed"
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
User-Agent: Internet Messaging Program (IMP) 4.3.3
X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8
X-Spam: no; 0.00; bandel:01 in-berlin:01 parser:01 hand-code:01 parser:01 pcre:01 subnodes:01 pcre:01 ocamlyacc:01 recursive:01 infomation:98 oliver:01 oliver:01 parsing:01 parsing:01 

Hello,

I want to parse HTML, and by that I mean I want to parse the
structure of the tags as well as the contents of the data elements.

At the moment I'm hacking together a special parser for this case,
but I have to hand-code the parser's state machine,
and the result is getting ugly.

It would be easier and more elegant if I could combine the
HTML tag parsing with the text parsing of the data elements.

For HTML-parsing I use Nethtml.
For Text-Scanning I use Pcre.

I want to be able to select certain tags and text that will occur at
certain positions.

For detecting the found tags I look for Nethtml's
   Element (name, args, subnodes)
and for detecting the data-strings I look into Nethtml's
   Data string
with Pcre.

I would like to find certain data that occurs after certain
sequences in the tree and then look for certain strings inside those
Data strings.
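
A minimal sketch of that idea, under several assumptions: the local type
below only mirrors the shape of Nethtml's document type so that the example
needs no external library, "sequence in the tree" is read as the chain of
enclosing tags, and the names `ends_with`, `select`, `wanted` and `accept`
are hypothetical. In real code the `accept` predicate would be a Pcre match
on the Data string.

```ocaml
(* Local stand-in mirroring the shape of Nethtml's document type. *)
type document =
  | Element of string * (string * string) list * document list
  | Data of string

(* Is [suffix] the innermost part of [path]?
   Both lists are ordered innermost tag first. *)
let rec ends_with suffix path =
  match suffix, path with
  | [], _ -> true
  | _, [] -> false
  | s :: ss, p :: ps -> s = p && ends_with ss ps

(* Collect, in document order, every Data string whose enclosing tags
   end with [wanted] and whose text satisfies [accept]. *)
let select ~wanted ~accept docs =
  let rec go path acc = function
    | [] -> acc
    | Data s :: rest ->
      let acc = if ends_with wanted path && accept s then s :: acc else acc in
      go path acc rest
    | Element (name, _, children) :: rest ->
      (* Descend with the tag pushed on the path, then continue with siblings. *)
      let acc = go (name :: path) acc children in
      go path acc rest
  in
  List.rev (go [] [] docs)

(* Example document and a plain-string predicate standing in for a regex. *)
let doc =
  [ Element ("table", [],
      [ Element ("tr", [],
          [ Element ("td", [], [ Data "price: 42" ]);
            Element ("th", [], [ Data "price: 7" ]) ]) ]);
    Element ("p", [], [ Data "price: 1" ]) ]

let starts_with_price s =
  String.length s >= 6 && String.sub s 0 6 = "price:"

(* Only Data directly inside a td that is itself inside a tr is kept. *)
let hits = select ~wanted:[ "td"; "tr" ] ~accept:starts_with_price doc
```

The tag path and the text test are kept as separate arguments here, so the
tree position and the string contents can be varied independently.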

Any idea on how to create the parser?

I thought about somehow wrapping the stuff and give it to ocamlyacc.
Maybe menhir is better for that task?
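
One possible way to do the wrapping, sketched under assumptions: the
document type below is a local stand-in with the same shape as Nethtml's,
and the token names `OPEN`, `CLOSE`, `TEXT` and `EOF` as well as the
functions `tokens_of_doclist` and `make_lexer` are hypothetical. The sketch
relies on the fact that an ocamlyacc entry point takes a function of type
`Lexing.lexbuf -> token` as its lexer, so the token source does not have to
be an ocamllex scanner.

```ocaml
(* Local stand-in mirroring the shape of Nethtml's document type. *)
type document =
  | Element of string * (string * string) list * document list
  | Data of string

(* Hypothetical tokens; a grammar file would declare the same ones. *)
type token =
  | OPEN of string
  | CLOSE of string
  | TEXT of string
  | EOF

(* Flatten one node into an OPEN ... CLOSE bracketed token sequence. *)
let rec tokens_of_doc = function
  | Data s -> [ TEXT s ]
  | Element (name, _, children) ->
    (OPEN name :: List.concat (List.map tokens_of_doc children))
    @ [ CLOSE name ]

let tokens_of_doclist docs = List.concat (List.map tokens_of_doc docs)

(* Wrap a token list as a lexer function. The lexbuf argument is ignored;
   it is only there to fit the type a generated parser expects. *)
let make_lexer toks =
  let remaining = ref toks in
  fun (_ : Lexing.lexbuf) ->
    match !remaining with
    | [] -> EOF
    | t :: rest ->
      remaining := rest;
      t

(* Example: pull tokens one at a time, passing a dummy lexbuf. *)
let next = make_lexer (tokens_of_doclist [ Element ("b", [], [ Data "hi" ]) ])
let dummy = Lexing.from_string ""
let first = next dummy
```

With this arrangement the grammar would describe the tag sequences, while
each `TEXT` payload would still have to be inspected separately, for example
with Pcre in the semantic action.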

At the moment I use the Element match only to call the recursive
parser on the next doclist.
All my actual parsing happens in the Data match, where I look at the contents.
This is because the information I want to extract from the document
is flat text inside that data string.
But some of that information could also be found via tag sequences.

So I'm looking for a way to combine both approaches.

How to do it?


   Oliver