9fans - fans of the OS Plan 9 from Bell Labs
From: Bakul Shah <bakul+plan9@BitBlocks.com>
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] webscript
Date: Fri,  7 Oct 2005 10:06:51 -0700
Message-ID: <200510071706.j97H6pqi031134@gate.bitblocks.com>
In-Reply-To: Your message of "Fri, 07 Oct 2005 03:01:01 EDT." <ee9e417a0510070001r4a2eb88fx263c748a2bdcfb94@mail.gmail.com>

> I would like to be able to write scripts like this:
> 
> 	load "http://apc-reset/outlets.htm"
> 	find "yoshimi"
> 	nearest option, set "Immediate Reboot"
> 	submit
> 
> or like this:
> 
> 	load "http://www.fedex.com/Tracking"
> 	find form
> 	enter "792544024753"
> 	submit
> 	
> 	if (find "No information") {
> 	   select enclosing td
> 	   print
> 	} else if (find "Ship date") {
> 	   select enclosing table
> 	   select enclosing table
> 	   print
> 	} else {
> 	   print ">>> Unexpected Results\n"
> 	   print
> 	}
> 
> Does anyone know of programs/languages that let you
> script web sessions like that?  Searching around finds lots
> of mentions of web scraping but no actual programs.
> 
> I have a rough idea of the general structure of the language
> and grammar, and I think that libhtml does most of the
> heavy lifting already.

There are lots of html parsers but the interesting bit here
is that the parse tree seems to be operated on as a whole --
at least that is how I envision operators like find and
select-enclosing working.  This is useful for all sorts of
things: represent some data as a tree, stick probes in it,
walk around the tree, transform it, reuse parts of it in
other trees etc.  Then you can use it for munging any
structured document (email, source code, rcs files, excel,
xml, ...).  You'd need a parser to map a document's structure
into an s-expr, and then you can do all the interesting stuff
in this awk-for-s-expr language.
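
To make the idea concrete, here is a minimal sketch in Python of
what I mean by operating on the tree as a whole.  Everything in it
is hypothetical: the Node class and the find, enclosing and sexpr
names are made up for illustration, and it assumes some parser has
already turned the document into such a tree.  It is not how libhtml
or the proposed script language works.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical tree node; a real parser would build these."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    # Parent link is excluded from repr/compare to avoid infinite recursion.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        """Attach child under this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Node"]:
        """Yield this node and all descendants, depth first."""
        yield self
        for c in self.children:
            yield from c.walk()

    def find(self, needle: str) -> Optional["Node"]:
        """Return the first node whose text contains needle, or None."""
        for n in self.walk():
            if needle in n.text:
                return n
        return None

    def enclosing(self, tag: str) -> Optional["Node"]:
        """Return the nearest ancestor with the given tag, or None."""
        n = self.parent
        while n is not None and n.tag != tag:
            n = n.parent
        return n

    def sexpr(self) -> str:
        """Render the subtree rooted here as an s-expression string."""
        parts = [self.tag]
        if self.text:
            parts.append('"%s"' % self.text)
        parts.extend(c.sexpr() for c in self.children)
        return "(" + " ".join(parts) + ")"


# A tiny hand-built tree standing in for a parsed tracking page.
table = Node("table")
row = table.add(Node("tr"))
row.add(Node("td", "Ship date"))
row.add(Node("td", "Oct 7, 2005"))

# Roughly: find "Ship date"; select enclosing table; print
hit = table.find("Ship date")
print(hit.enclosing("table").sexpr())
# prints (table (tr (td "Ship date") (td "Oct 7, 2005")))
```

The parent pointer is what makes select-enclosing cheap; without it
you would have to search down from the root each time.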

Regular-tree expressions by Shivers & Bagrak may be of
some interest to you.  See
    http://www.cc.gatech.edu/fac/Olin.Shivers/papers/trx.pdf

