caml-list - the Caml user's mailing list
From: "Oliver Bandel" <oliver@first.in-berlin.de>
To: caml-list@inria.fr
Subject: Parsing with two scanners(ources) as input (?)
Date: Tue, 13 Apr 2010 21:15:49 +0200
Message-ID: <20100413211549.14246792pcyfgnol@webmail.in-berlin.de>

Hello,

I want to parse HTML, and by that I mean I want to parse the
structure of the tags as well as the contents of the data elements.

At the moment I'm hacking together a special parser for this case,
but it's getting ugly, because I need to hand-code the parser's
state machine.

It would be easier and more elegant if I could combine the
HTML tag parsing with the text parsing of the data elements.

For HTML-parsing I use Nethtml.
For Text-Scanning I use Pcre.

I want to be able to select certain tags and text that will occur at  
certain positions.

For detecting the found tags I look for Nethtml's
   Element (name, args, subnodes)
and for detecting the data-strings I look into Nethtml's
   Data string
with Pcre.
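
A minimal sketch of that traversal, for illustration only. To keep it
self-contained, it declares a local type that mirrors the shape of
Nethtml's document type instead of linking against Ocamlnet; the names
collect and collect_all are hypothetical. It pairs every Data string
with the path of enclosing tag names, so a later text scan knows where
in the tree each string came from.

```ocaml
(* Local stand-in mirroring the shape of Nethtml's document type, so the
   sketch compiles without Ocamlnet installed. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Collect every Data string together with the names of the tags that
   enclose it, outermost tag first. [path] is kept in reverse order
   while descending. *)
let rec collect path doc =
  match doc with
  | Data s -> [ (List.rev path, s) ]
  | Element (name, _args, subnodes) ->
      List.concat (List.map (collect (name :: path)) subnodes)

(* Entry point for a whole document list, as returned by an HTML parser. *)
let collect_all docs = List.concat (List.map (collect []) docs)
```

Each resulting pair could then be filtered by its path and its string
handed to the text scanner.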

I would like to find certain data that occurs after certain
sequences in the tree, and then look for certain strings inside those
Data strings.

Any idea on how to create the parser?

I thought about somehow wrapping the stuff and give it to ocamlyacc.
Maybe menhir is better for that task?
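
One way the wrapping could look, as a sketch under assumptions: the
tree is flattened into a token list, and a closure hands out one token
per call with the lexer signature that an ocamlyacc-generated parser
expects (Lexing.lexbuf -> token). The token type here is hand-written
and hypothetical; in a real grammar it would come from the %token
declarations. The document type is again a local stand-in for
Nethtml's.

```ocaml
(* Local stand-in mirroring the shape of Nethtml's document type. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Hypothetical token type; a parser generator would normally define it. *)
type token = Open of string | Close of string | Text of string | Eof

(* Flatten the tree into a linear token stream, in document order. *)
let rec tokens_of_doc doc =
  match doc with
  | Data s -> [ Text s ]
  | Element (name, _args, subnodes) ->
      (Open name :: tokens_of_list subnodes) @ [ Close name ]

and tokens_of_list docs = List.concat (List.map tokens_of_doc docs)

(* Build a lexer-shaped function that ignores its lexbuf argument and
   returns the next stored token on each call, then Eof forever. *)
let make_lexer tokens =
  let rest = ref tokens in
  fun (_ : Lexing.lexbuf) ->
    match !rest with
    | [] -> Eof
    | t :: ts ->
        rest := ts;
        t
```

The generated parser entry point would then be called with
make_lexer applied to the token list, plus a dummy buffer such as
Lexing.from_string "".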

At the moment I use the Element match only to call the recursive
parser on the next document list.
All my parsing happens in the Data match, which looks at the contents there.
This is because the information I want to extract from the document
is flat text inside those Data strings.
But some of that information could also be found via tag sequences.

So I'm looking for a way to combine both approaches.
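
A rough sketch of such a combination, working on a flat token stream:
it returns the text items that directly follow a given sequence of
opening tags and that also satisfy a predicate on the string. All
names here (token, contains, strip_prefix, find_after) are
hypothetical, and the plain substring test is only a standard-library
stand-in for a Pcre match.

```ocaml
(* Hypothetical token type for a flattened HTML tree. *)
type token = Open of string | Close of string | Text of string

(* True when [sub] occurs somewhere in [s]. Stand-in for a regex match. *)
let contains ~sub s =
  let n = String.length sub and m = String.length s in
  let rec go i = i + n <= m && (String.sub s i n = sub || go (i + 1)) in
  go 0

(* If [toks] begins with opening tags named as in [tags], in that order,
   return the remaining tokens. *)
let rec strip_prefix tags toks =
  match tags, toks with
  | [], rest -> Some rest
  | t :: ts, Open name :: rest when t = name -> strip_prefix ts rest
  | _ -> None

(* Try the tag sequence at every position of the stream; keep each text
   that immediately follows it and satisfies [pred]. *)
let rec find_after tags pred toks =
  match toks with
  | [] -> []
  | _ :: tl ->
      let here =
        match strip_prefix tags toks with
        | Some (Text s :: _) when pred s -> [ s ]
        | _ -> []
      in
      here @ find_after tags pred tl
```

For example, find_after ["tr"; "td"] (contains ~sub:"Price") toks would
select only the cell texts that mention "Price" and sit directly inside
a td that opens right after a tr.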

How to do it?


   Oliver

