From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID:
To: 9fans@cse.psu.edu
Subject: Re: [9fans] More 'Sam I am'
Date: Wed, 8 Feb 2006 19:20:38 +0100
From: uriel@cat-v.org
In-Reply-To: <20060208173110.GJ1620@augusta.math.psu.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Topicbox-Message-UUID: f8065ae6-ead0-11e9-9d60-3106f5b1d025

http://homepages.inf.ed.ac.uk/wadler/language.pdf

I think sam is a much safer bet than some hideous lib that pretends to
be capable of parsing (pseudo)HTML.

Years ago some people tried to write a web browser in python... some
years later they gave up, all they had produced was a spec for an XML
format to store bookmarks. Quoting boyd: "hysterical."

uriel

> On Tue, Feb 07, 2006 at 10:50:22PM -0800, Lyndon Nerenberg wrote:
>> So I thought, but something's not right. I can't demonstrate more
>> until I get to work in the morning.
>
> Hmm. I'm going to make an unpopular but pragmatic suggestion: Don't use
> sed or sam, but instead, use a language with an HTML parser available.
> There are some jobs for which regular expressions aren't the best tool;
> I personally think this is one of them. Here's a script I posted to
> USENET years ago to extract data from a table.
>
> #!/usr/local/bin/python
>
> import sys
> import htmllib
> import formatter
>
> class MyParser(htmllib.HTMLParser):
>     def __init__(self, format):
>         htmllib.HTMLParser.__init__(self, format)
>         self.state = 0
>
>     def do_tr(self, data):
>         if self.state:
>             print htmllib.HTMLParser.save_end(self)
>         self.state = 0
>
>     def do_td(self, data):
>         if self.state:
>             print "%s, " % htmllib.HTMLParser.save_end(self),
>         self.state = 1
>         htmllib.HTMLParser.save_bgn(self)
>
> parse = MyParser(formatter.NullFormatter())
> for file in sys.argv[1:]:
>     parse.feed(open(sys.argv[1],"r").read())
> parse.close()
>
> I wonder if this even still works.....
>
> - Dan C.
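
For anyone trying the quoted script today: htmllib was removed in
Python 3, so it will not run there as posted. Below is a rough sketch
of the same idea (print each table row as comma-separated cell text)
on top of the standard html.parser module. It is not Dan's code. The
names TableRows and extract_rows are made up for illustration, the
input is assumed to be UTF-8, and nested tables are not handled. Unlike
the quoted loop, which opens sys.argv[1] on every pass, this one reads
each file named on the command line.

```python
#!/usr/bin/env python3
# Sketch only: a Python 3 take on the quoted table extractor, using
# html.parser from the standard library. TableRows and extract_rows
# are hypothetical names chosen for this example.
import sys
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the open cell, or None

    def _end_cell(self):
        if self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()       # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._end_cell()      # tolerate a missing </td>
            if self._row is None:
                self._row = []    # cell without an enclosing <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._end_row()           # flush a row left open at end of input


def extract_rows(html_text):
    """Return the table rows found in html_text as lists of strings."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for name in sys.argv[1:]:
        with open(name, encoding="utf-8", errors="replace") as f:
            for row in extract_rows(f.read()):
                print(", ".join(row))
```

Run it as "python3 tablerows.py page.html" (the script name is
arbitrary). Whether this copes with real-world (pseudo)HTML any better
than a sam or sed one-liner depends on the page. html.parser is
lenient about unclosed tags, but it does no error recovery beyond that.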