From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed,  8 Feb 2006 16:28:50 -0500
From: Dan Cross <cross@math.psu.edu>
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] More 'Sam I am'
Message-ID: <20060208212850.GK1620@augusta.math.psu.edu>
References: <acc1656533ece8429f432cb08b097c5c@collyer.net>
	<5E9D7F0F-985B-4D58-B0FD-7064CD25A8F2@orthanc.ca>
	<20060208173110.GJ1620@augusta.math.psu.edu>
	<43EA351D.7080605@orthanc.ca>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <43EA351D.7080605@orthanc.ca>
User-Agent: Mutt/1.4.1i
Topicbox-Message-UUID: f8140a6a-ead0-11e9-9d60-3106f5b1d025

On Wed, Feb 08, 2006 at 10:14:53AM -0800, Lyndon Nerenberg wrote:
> The problem with this is the data I want is interspersed with data that 
> I don't want.  And the bits I don't want are variable length 
> inconsistent multi-line text that is a bitch to filter out of the 
> rendered output stream.  It turns out that sam (against the raw HTML) 
> was the only tool that was able to do the job.  I just wish I could wrap 
> it in a shell script that I could throw at the directory containing all 
> the .html files.

I'm not talking about rendering, just parsing.  Ultimately, what
matters is that you get what you need out of the solution, I guess.
Still, regular expressions alone give you only part of the story.
I submit that actually parsing the tokens in the data, as opposed to
just matching them, is the more powerful approach, even if the
regular expression language you're using is strong enough to match
the structure of the document.  But hey, if sam floats your boat,
fish on that river!
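
To make the parsing-versus-matching point concrete, here is a rough
sketch using Python's standard html.parser module.  I'm making up the
markup: I assume the wanted data lives in <td class="keep"> cells and
that everything else is noise.  The class name WantedText and the
helper extract() are hypothetical too, so treat this as an
illustration and not a drop-in solution.

```python
from html.parser import HTMLParser


class WantedText(HTMLParser):
    """Collect the text of <td class="keep"> cells and ignore the rest.

    The tag and class names are assumptions made for illustration.
    """

    def __init__(self):
        super().__init__()
        self.inside = False   # True while inside a wanted cell
        self.chunks = []      # text fragments of the current cell
        self.results = []     # one cleaned-up string per wanted cell

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "keep") in attrs:
            self.inside = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "td" and self.inside:
            self.inside = False
            # collapse line breaks and runs of whitespace
            self.results.append(" ".join("".join(self.chunks).split()))

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def extract(html):
    """Return the text of every wanted cell in an HTML string."""
    parser = WantedText()
    parser.feed(html)
    parser.close()
    return parser.results
```

Because the parser tracks which element it is in, the variable-length
multi-line junk never has to be described by a pattern; it is simply
never collected.  Wrapping extract() in a loop over the .html files in
a directory would give the batch behaviour you were after.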

	- Dan C.