The Unix Heritage Society mailing list
From: "Nelson H. F. Beebe" <beebe@math.utah.edu>
To: tuhs@minnie.tuhs.org
Subject: Re: [TUHS] The most surprising Unix programs
Date: Thu, 19 Mar 2020 14:57:59 -0600
Message-ID: <CMM.0.95.0.1584651479.beebe@gamma.math.utah.edu>

Tomasz Rola writes on Thu, 19 Mar 2020 21:01:20 +0100 about awk:

>> One task I would be afraid to use awk for, is html processing. Most of
>> html sources I look at nowadays seems discouraging. Extracting
>> anything of value from the mess requires something more potent, I
>> think.

If you want to tackle raw HTML from arbitrary sources, then I agree
with you: most HTML on the Web is not grammar-conformant, there are
numerous vendor extensions, and the HTML is hideously idiosyncratic
and irregularly formatted.

The solution that I adopted 25 years ago was to write a
grammar-recognizing, but violation-lenient, prettyprinter for HTML.
It has served well, and I use it many times daily for my work in the
BibNet Project and TeX Users Group bibliography archives, now
approaching 1.55 million entries.  The latest public release is
available here:

	http://www.math.utah.edu/pub/sgml/

I notice that the last version there is 1.01; I'll get that updated in
a couple of days to the latest 1.03 [subject to delays from major work
dislocations caused by the virus].  The code should install anywhere
in the Unix family without problems: I build and validate it on more
than 300 O/Ses in our test farm.

With standardized HTML, applying awk is easy, and I have more than 450
awk programs, and 380,000 lines of code, that process publisher
metadata to produce rough BibTeX entries that numerous other tools,
and some manual editing, turn into clean data for free access on the
Web.
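
As a rough illustration of the metadata-to-BibTeX step described above,
here is a minimal sketch. It is written in Python rather than awk, and
it is not the author's code. It assumes, purely for illustration, that
the publisher page carries <meta name="citation_..."> tags; the names
MetaCollector and rough_bibtex, and the citation key in the usage note,
are hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="citation_*" content="..."> pairs from one page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        if name.startswith("citation_") and content:
            # Repeated names (for example several authors) accumulate in order.
            short = name[len("citation_"):]
            self.fields.setdefault(short, []).append(content.strip())


def rough_bibtex(html_text, key):
    """Turn the collected metadata into an unpolished @Article entry."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    f = collector.fields

    def first(name):
        values = f.get(name, [])
        return values[0] if values else ""

    pairs = [
        ("author", " and ".join(f.get("author", []))),
        ("title", first("title")),
        ("journal", first("journal_title")),
        ("volume", first("volume")),
        # Assumption: the date starts with a four-digit year.
        ("year", first("publication_date")[:4]),
    ]
    lines = ["@Article{%s," % key]
    for name, value in pairs:
        if value:  # fields missing from the page are simply omitted
            lines.append('  %-8s = "%s",' % (name, value))
    lines.append("}")
    return "\n".join(lines)
```

Calling rough_bibtex(page_text, "Doe:2020:AT") returns an entry that
still needs the later cleanup passes and manual editing mentioned above.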

For some journals, I run a single command of fewer than 15 characters
to download Web pages for journal issues for which I do not yet have
data, and then a single journal-specific command with no arguments
that runs a large shell script with a long pipeline that outputs
relatively clean BibTeX that then normally takes me only a couple of
minutes to visually validate in an editor session.  The major work
there is bracing of proper nouns in titles that my software did not
already handle, thereby preventing downcasing of those words in the
many bibliography styles that do so.
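
The bracing step can be sketched as follows. This is a hypothetical
Python illustration, not the author's software: the function name
brace_proper_nouns and the word list passed to it are assumptions, and
it only recognizes existing brace groups that are not nested.

```python
import re


def brace_proper_nouns(title, proper_nouns):
    """Wrap each listed proper noun in braces, skipping text already braced."""
    # Longest first, so that "VAX VMS" is preferred over "VAX" alone.
    ordered = sorted(proper_nouns, key=len, reverse=True)
    if not ordered:
        return title
    alternatives = "|".join(re.escape(word) for word in ordered)
    # Match whole words only: no word character on either side.
    pattern = re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternatives)
    # Split out existing {...} groups; they land at the odd indices
    # and are left untouched.
    pieces = re.split(r"(\{[^{}]*\})", title)
    for i in range(0, len(pieces), 2):
        pieces[i] = pattern.sub(lambda m: "{" + m.group(0) + "}", pieces[i])
    return "".join(pieces)
```

For example, with the word list ["Unix", "TOPS-20"], the title
"A history of awk on Unix and {TOPS-20}" gains braces around Unix only,
so a style that downcases titles would leave both names intact.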

I'm on journal announcement lists for many publishers, so I often have
new data released to the Web just 5 to 10 minutes after receiving
e-mail about new issues.

The above-mentioned archives are at

	http://www.math.utah.edu/pub/bibnet
	http://www.math.utah.edu/pub/tex/bib
	http://www.math.utah.edu/pub/tex/bib/index-table.html
	http://www.math.utah.edu/pub/tex/bib/idx
	http://www.math.utah.edu/pub/tex/bib/toc

They are mirrored at Universität Karlsruhe, Oak Ridge National
Laboratory, Sandia National Laboratories, and elsewhere.

Like Al Aho, Doug McIlroy, and Arnold Robbins, I'm a huge fan of awk;
I believe that I was the first to port it to PDP-10 TOPS-20 and VAX
VMS in the mid-1980s, and it is one of the first mandatory tools that
I install on any new computer.

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe@math.utah.edu  -
- 155 S 1400 E RM 233                       beebe@acm.org  beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------