9fans - fans of the OS Plan 9 from Bell Labs
From: Aharon Robbins <arnold@skeeve.com>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] More 'Sam I am'
Date: Fri, 10 Feb 2006 11:47:53 +0200
Message-ID: <200602100947.k1A9lrbK018110@skeeve.com>
In-Reply-To: <20060208212850.GK1620@augusta.math.psu.edu>

In article <20060208212850.GK1620@augusta.math.psu.edu> you write:
>On Wed, Feb 08, 2006 at 10:14:53AM -0800, Lyndon Nerenberg wrote:
>> The problem with this is the data I want is interspersed with data that 
>> I don't want.  And the bits I don't want are variable length 
>> inconsistent multi-line text that is a bitch to filter out of the 
>> rendered output stream.  It turns out that sam (against the raw HTML) 
>> was the only tool that was able to do the job.  I just wish I could wrap 
>> it in a shell script that I could throw at the directory containing all 
>> the .html files.
>
>I'm not talking about rendering, just parsing.  Well, ultimately,
>what's important is that you get what you need out of the solution, I
>guess.  Still, regular expressions alone give you part of the story,
>but not the whole thing.  I submit that the power to actually parse
>the tokens in the data as opposed to just matching them (even if the
>regular expression language you're using is powerful enough to match
>the structure of the document) is more powerful.  But hey, if sam
>floats your boat, fish on that river!
>
>	- Dan C.

Possibly of interest is the xmlgawk project:

        http://www.sourceforge.net/projects/xmlgawk

This is an extended version of GNU Awk with an XML parser module add-on.
The idea is that instead of reading lines, you get XML tokens (tags, fields
in the tags, and marked-up data).  I am not directly involved in it, but
it looks like a rather promising alternative for people who would like
to process XML type data in the more traditional Unixy fashion.
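
To make the "tokens instead of lines" idea concrete, here is a rough
sketch in Python rather than in xmlgawk itself. It uses the standard
xml.parsers.expat module; the helper names xml_tokens and link_targets
and the SAMPLE document are made up for illustration and say nothing
about how xmlgawk names or delivers its tokens.

```python
import xml.parsers.expat


def xml_tokens(text):
    """Parse `text` and return a list of (kind, value, attrs) tokens.

    kind is "start", "end", or "data". This is a hypothetical helper,
    not part of xmlgawk or of the expat module.
    """
    tokens = []
    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to hand over each run of character data in one piece.
    parser.buffer_text = True

    def on_start(name, attrs):
        tokens.append(("start", name, dict(attrs)))

    def on_end(name):
        tokens.append(("end", name, {}))

    def on_data(data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            tokens.append(("data", data.strip(), {}))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_data
    parser.Parse(text, True)
    return tokens


def link_targets(text):
    """Awk-style pattern/action: for every <a> start tag, keep its href."""
    return [
        attrs["href"]
        for kind, name, attrs in xml_tokens(text)
        if kind == "start" and name == "a" and "href" in attrs
    ]


# Illustrative input only; the line break inside <p> does not matter,
# which is the point of matching on tokens rather than on lines.
SAMPLE = (
    '<doc><p>See <a href="http://example.org/one">one</a> and\n'
    '<a href="http://example.org/two">two</a>.</p></doc>'
)

if __name__ == "__main__":
    for href in link_targets(SAMPLE):
        print(href)
```

Note that expat insists on well-formed XML, so a sketch like this would
likely choke on the sort of raw HTML discussed earlier in the thread.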

Arnold
-- 
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd.	arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381	Fax: +1 206 350 8765
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL

