From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <a1bd7c6491e4cac08c653e3379e6b1aa@cat-v.org>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] More 'Sam I am'
Date: Wed,  8 Feb 2006 19:20:38 +0100
From: uriel@cat-v.org
In-Reply-To: <20060208173110.GJ1620@augusta.math.psu.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Topicbox-Message-UUID: f8065ae6-ead0-11e9-9d60-3106f5b1d025

http://homepages.inf.ed.ac.uk/wadler/language.pdf

I think sam is a much safer bet than some hideous lib that pretends to
be capable of parsing (pseudo)HTML.

Years ago some people tried to write a web browser in Python.  Some
years later they gave up; all they had produced was a spec for an XML
format to store bookmarks.  Quoting boyd: "hysterical."

uriel

> On Tue, Feb 07, 2006 at 10:50:22PM -0800, Lyndon Nerenberg wrote:
>> So I thought, but something's not right.  I can't demonstrate more
>> until I get to work in the morning.
> 
> Hmm.  I'm going to make an unpopular but pragmatic suggestion: Don't use
> sed or sam, but instead, use a language with an HTML parser available.
> There are some jobs for which regular expressions aren't the best tool;
> I personally think this is one of them.  Here's a script I posted to
> USENET years ago to extract data from a table.
> 
> #!/usr/local/bin/python
> 
> import sys
> import htmllib
> import formatter
> 
> class MyParser(htmllib.HTMLParser):
>         def __init__(self, format):
>                 htmllib.HTMLParser.__init__(self, format)
>                 self.state = 0
> 
>         def do_tr(self, data):
>                 if self.state:
>                         print htmllib.HTMLParser.save_end(self)
>                         self.state = 0
> 
>         def do_td(self, data):
>                 if self.state:
>                         print "%s, " % htmllib.HTMLParser.save_end(self),
>                 self.state = 1
>                 htmllib.HTMLParser.save_bgn(self)
> 
> parse = MyParser(formatter.NullFormatter())
> for file in sys.argv[1:]:
>         parse.feed(open(file,"r").read())
> parse.close()
> 
> I wonder if this even still works.....
> 
> 	- Dan C.
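
A note on the quoted script: htmllib was removed in Python 3, so it
will not run unmodified on a current interpreter.  What follows is only
a rough sketch of the same idea using the standard html.parser module.
The names TableRows and extract_rows are made up for illustration and
are not from the original post.  It assumes reasonably sane table
markup and makes no attempt to handle nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []    # finished rows, each a list of cell strings
        self.row = None   # cells of the row being read, or None
        self.cell = None  # text pieces of the cell being read, or None

    def _end_cell(self):
        # Flush the cell in progress (if any) into the current row.
        if self.cell is not None and self.row is not None:
            self.row.append("".join(self.cell).strip())
        self.cell = None

    def _end_row(self):
        # Flush the row in progress (if any); empty rows are dropped.
        self._end_cell()
        if self.row:
            self.rows.append(self.row)
        self.row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new <tr> implicitly closes an unterminated previous row.
            self._end_row()
            self.row = []
        elif tag in ("td", "th"):
            # A new cell implicitly closes an unterminated previous cell.
            self._end_cell()
            if self.row is None:
                self.row = []
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self.cell is not None:
            self.cell.append(data)

    def close(self):
        super().close()
        self._end_row()


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

To get output similar to the quoted script, read each file named on
the command line, pass its contents to extract_rows, and print
", ".join(row) for every row returned.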