9fans - fans of the OS Plan 9 from Bell Labs
From: uriel@cat-v.org
To: 9fans@cse.psu.edu
Subject: Re: [9fans] More 'Sam I am'
Date: Wed,  8 Feb 2006 19:20:38 +0100
Message-ID: <a1bd7c6491e4cac08c653e3379e6b1aa@cat-v.org>
In-Reply-To: <20060208173110.GJ1620@augusta.math.psu.edu>

http://homepages.inf.ed.ac.uk/wadler/language.pdf

I think sam is a much safer bet than some hideous lib that pretends to
be capable of parsing (pseudo)HTML.

Years ago some people tried to write a web browser in Python...  some
years later they gave up; all they had produced was a spec for an XML
format to store bookmarks.  Quoting boyd: "hysterical."

uriel

> On Tue, Feb 07, 2006 at 10:50:22PM -0800, Lyndon Nerenberg wrote:
>> So I thought, but something's not right.  I can't demonstrate more  
>> until I get to work in the morning.
> 
> Hmm.  I'm going to make an unpopular but pragmatic suggestion: Don't use
> sed or sam, but instead, use a language with an HTML parser available.
> There are some jobs for which regular expressions aren't the best tool;
> I personally think this is one of them.  Here's a script I posted to
> USENET years ago to extract data from a table.
> 
> #!/usr/local/bin/python
> 
> import sys
> import htmllib
> import formatter
> 
> class MyParser(htmllib.HTMLParser):
>         def __init__(self, format):
>                 htmllib.HTMLParser.__init__(self, format)
>                 self.state = 0
> 
>         def do_tr(self, data):
>                 if self.state:
>                         print htmllib.HTMLParser.save_end(self)
>                         self.state = 0
> 
>         def do_td(self, data):
>                 if self.state:
>                         print "%s, " % htmllib.HTMLParser.save_end(self),
>                 self.state = 1
>                 htmllib.HTMLParser.save_bgn(self)
> 
> parse = MyParser(formatter.NullFormatter())
> for file in sys.argv[1:]:
>         parse.feed(open(file, "r").read())
> parse.close()
> 
> I wonder if this even still works.....
> 
> 	- Dan C.
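
The quoted script depends on htmllib and formatter, which exist only in
Python 2. For anyone who wants to try the same idea on Python 3, here is
a rough sketch using the standard html.parser module. It is not Dan's
code: the names TableExtractor and extract_rows are made up for
illustration, and it assumes a single flat table (no nesting). Like the
original, it collects the text of each cell and emits one row per <tr>.

```python
#!/usr/bin/env python3
# Sketch only: a Python 3 take on the table-extraction idea quoted above.
# TableExtractor and extract_rows are illustrative names, not from the thread.
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()      # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._end_cell()     # tolerate a missing </td>
            if self._row is None:
                self._row = []   # cell without an enclosing <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        # Text outside a cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._end_row()          # flush a row left open at end of input


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

To get roughly the output of the original script, read each file named
on the command line, pass its contents to extract_rows, and print
", ".join(row) for every row returned.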


