From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <43EA351D.7080605@orthanc.ca>
Date: Wed, 8 Feb 2006 10:14:53 -0800
From: Lyndon Nerenberg
User-Agent: Thunderbird 1.5 (X11/20060201)
MIME-Version: 1.0
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] More 'Sam I am'
References: <5E9D7F0F-985B-4D58-B0FD-7064CD25A8F2@orthanc.ca> <20060208173110.GJ1620@augusta.math.psu.edu>
In-Reply-To: <20060208173110.GJ1620@augusta.math.psu.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Topicbox-Message-UUID: f7fdff68-ead0-11e9-9d60-3106f5b1d025

> Hmm. I'm going to make an unpopular but pragmatic suggestion: Don't use
> sed or sam, but instead, use a language with an HTML parser available.
> There are some jobs for which regular expressions aren't the best tool;
> I personally think this is one of them. Here's a script I posted to
> USENET years ago to extract data from a table.

The problem with this is the data I want is interspersed with data that
I don't want. And the bits I don't want are variable length inconsistent
multi-line text that is a bitch to filter out of the rendered output
stream.

It turns out that sam (against the raw HTML) was the only tool that was
able to do the job. I just wish I could wrap it in a shell script that
I could throw at the directory containing all the .html files.

--lyndon