From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 10 Feb 2006 11:47:53 +0200
From: Aharon Robbins
Subject: Re: [9fans] More 'Sam I am'
In-reply-to: <20060208212850.GK1620@augusta.math.psu.edu>
To: 9fans@cse.psu.edu
Cc:
Message-id: <200602100947.k1A9lrbK018110@skeeve.com>
Content-transfer-encoding: 7BIT
References:
Topicbox-Message-UUID: f8fd34f6-ead0-11e9-9d60-3106f5b1d025

In article <20060208212850.GK1620@augusta.math.psu.edu> you write:
>On Wed, Feb 08, 2006 at 10:14:53AM -0800, Lyndon Nerenberg wrote:
>> The problem with this is the data I want is interspersed with data that
>> I don't want. And the bits I don't want are variable length
>> inconsistent multi-line text that is a bitch to filter out of the
>> rendered output stream. It turns out that sam (against the raw HTML)
>> was the only tool that was able to do the job. I just wish I could wrap
>> it in a shell script that I could throw at the directory containing all
>> the .html files.
>
>I'm not talking about rendering, just parsing. Well, ultimately,
>what's important is that you get what you need out of the solution, I
>guess. Still, regular expressions alone give you part of the story,
>but not the whole thing. I submit that the power to actually parse
>the tokens in the data as opposed to just matching them (even if the
>regular expression language you're using is powerful enough to match
>the structure of the document) is more powerful. But hey, if sam
>floats your boat, fish on that river!
>
>	- Dan C.

Possibly of interest is the xmlgawk project:

	http://www.sourceforge.net/projects/xmlgawk

This is an extended version of GNU Awk with an XML parser module add-on.
The idea is that instead of reading lines, you get XML tokens (tags,
fields in the tags, and marked-up data).

I am not directly involved in it, but it looks like a rather promising
alternative for people who would like to process XML-type data in the
more traditional Unixy fashion.

Arnold
-- 
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd.
arnold AT skeeve DOT com
P.O. Box 354            Home Phone: +972 8 979-0381     Fax: +1 206 350 8765
Nof Ayalon              Cell Phone: +972 50 729-7545
D.N. Shimshon 99785     ISRAEL
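
To make the "tokens instead of lines" idea above concrete, here is a rough
sketch in Python. It is not xmlgawk and does not use xmlgawk's interface; it
only shows the same style of processing with the expat bindings in the Python
standard library. The function name xml_tokens and the sample document are
made up for illustration.

```python
import xml.parsers.expat


def xml_tokens(document):
    """Return the XML tokens in `document` as a list of tuples.

    Each tuple is ("start", name, attrs), ("data", text) or ("end", name).
    The name and tuple layout are illustrative, not any xmlgawk convention.
    """
    tokens = []
    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to deliver adjacent character data as one chunk.
    parser.buffer_text = True

    def on_start(name, attrs):
        tokens.append(("start", name, dict(attrs)))

    def on_end(name):
        tokens.append(("end", name))

    def on_data(text):
        # Skip whitespace-only runs between tags.
        if text.strip():
            tokens.append(("data", text))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_data
    parser.Parse(document, True)
    return tokens


if __name__ == "__main__":
    sample = '<list><item id="1">sam</item><item id="2">awk</item></list>'
    for token in xml_tokens(sample):
        print(token)
```

Each record the program sees is a tag, its attributes, or a piece of
marked-up text rather than a line, so multi-line markup that is awkward to
match with line-oriented regular expressions arrives already split into
tokens.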