From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <43EA351D.7080605@orthanc.ca>
Date: Wed,  8 Feb 2006 10:14:53 -0800
From: Lyndon Nerenberg <lyndon@orthanc.ca>
User-Agent: Thunderbird 1.5 (X11/20060201)
MIME-Version: 1.0
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] More 'Sam I am'
References: <acc1656533ece8429f432cb08b097c5c@collyer.net>	<5E9D7F0F-985B-4D58-B0FD-7064CD25A8F2@orthanc.ca>
	<20060208173110.GJ1620@augusta.math.psu.edu>
In-Reply-To: <20060208173110.GJ1620@augusta.math.psu.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Topicbox-Message-UUID: f7fdff68-ead0-11e9-9d60-3106f5b1d025

> Hmm.  I'm going to make an unpopular but pragmatic suggestion: Don't use
> sed or sam, but instead, use a language with an HTML parser available.
> There are some jobs for which regular expressions aren't the best tool;
> I personally think this is one of them.  Here's a script I posted to
> USENET years ago to extract data from a table.

The problem with this is the data I want is interspersed with data that 
I don't want.  And the bits I don't want are variable-length, 
inconsistent, multi-line text that is a bitch to filter out of the 
rendered output stream.  It turns out that sam (against the raw HTML) 
was the only tool that was able to do the job.  I just wish I could wrap 
it in a shell script that I could throw at the directory containing all 
the .html files.

--lyndon