From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 8 Feb 2006 16:28:50 -0500
From: Dan Cross
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] More 'Sam I am'
Message-ID: <20060208212850.GK1620@augusta.math.psu.edu>
References: <5E9D7F0F-985B-4D58-B0FD-7064CD25A8F2@orthanc.ca>
	<20060208173110.GJ1620@augusta.math.psu.edu>
	<43EA351D.7080605@orthanc.ca>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <43EA351D.7080605@orthanc.ca>
User-Agent: Mutt/1.4.1i
Topicbox-Message-UUID: f8140a6a-ead0-11e9-9d60-3106f5b1d025

On Wed, Feb 08, 2006 at 10:14:53AM -0800, Lyndon Nerenberg wrote:
> The problem with this is the data I want is interspersed with data that
> I don't want.  And the bits I don't want are variable length
> inconsistent multi-line text that is a bitch to filter out of the
> rendered output stream.  It turns out that sam (against the raw HTML)
> was the only tool that was able to do the job.  I just wish I could wrap
> it in a shell script that I could throw at the directory containing all
> the .html files.  I'm not talking about rendering, just parsing.

Well, ultimately, what's important is that you get what you need out of
the solution, I guess.

Still, regular expressions alone give you part of the story, but not the
whole thing.  I submit that the power to actually parse the tokens in the
data as opposed to just matching them (even if the regular expression
language you're using is powerful enough to match the structure of the
document) is more powerful.

But hey, if sam floats your boat, fish on that river!

	- Dan C.
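
To make the matching-versus-parsing distinction above concrete, here is a
rough sketch using the tokenizer in Python's standard library
(html.parser).  The markup is an assumption made up for illustration:
wanted text is taken to live in <p class="keep"> elements, with
variable-length multi-line noise in between.  The names KeepText, extract
and SAMPLE are hypothetical too; none of this comes from the actual pages
under discussion, and it is not a claim about how sam does the job.

```python
from html.parser import HTMLParser


class KeepText(HTMLParser):
    """Collect the text of every <p class="keep"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while inside a wanted <p>
        self.chunks = []      # text fragments of the current wanted element
        self.results = []     # one cleaned-up string per wanted element

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by the tokenizer.
        if tag == "p" and ("class", "keep") in attrs:
            self.inside = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            # Collapse runs of whitespace, including embedded newlines.
            self.results.append(" ".join("".join(self.chunks).split()))

    def handle_data(self, data):
        # Text outside wanted elements is dropped, however many lines it spans.
        if self.inside:
            self.chunks.append(data)


def extract(html):
    """Return the text of each wanted element, in document order."""
    parser = KeepText()
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """<div>
<p class="junk">variable-length
noise spanning
several lines</p>
<p class="keep">first <b>wanted</b>
item</p>
<p class="junk">more noise</p>
<p class="keep">second item</p>
</div>"""

if __name__ == "__main__":
    for text in extract(SAMPLE):
        print(text)
```

The point of the sketch is that the handler sees start tags, end tags and
text as separate tokens, so the length or line count of the unwanted
parts never enters into it.  Running it over a directory of .html files
would presumably be a loop that reads each file and calls extract() on
its contents.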