Tomasz Rola writes on Thu, 19 Mar 2020 21:01:20 +0100 about awk:

>> One task I would be afraid to use awk for, is html processing. Most of
>> html sources I look at nowadays seems discouraging. Extracting
>> anything of value from the mess requires something more potent, I
>> think.

If you want to tackle raw HTML from an arbitrary source, then I agree
with you: most HTML on the Web is not grammar conformant, there are
numerous vendor extensions, and the HTML is hideously idiosyncratic
and irregularly formatted.

The solution that I adopted 25 years ago was to write a
grammar-recognizing, but violation-lenient, prettyprinter for HTML.
It has served well, and I use it many times daily for my work in the
BibNet Project and TeX User Group bibliography archives, now
approaching 1.55 million entries.

The latest public release is available here:

        http://www.math.utah.edu/pub/sgml/

I notice that the last version there is 1.01; I'll get that updated in
a couple of days to the latest 1.03 [subject to delays from major work
dislocations caused by the virus].

The code should install anywhere in the Unix family without problems:
I build and validate it on more than 300 O/Ses in our test farm.

With standardized HTML, applying awk is easy, and I have more than 450
awk programs, and 380,000 lines of code, that process publisher
metadata to produce rough BibTeX entries that numerous other tools,
and some manual editing, turn into clean data for free access on the
Web.

For some journals, I run a single command of fewer than 15 characters
to download Web pages for journal issues for which I do not yet have
data, and then a single journal-specific command with no arguments
that runs a large shell script with a long pipeline. Its output is
relatively clean BibTeX that normally takes me only a couple of
minutes to visually validate in an editor session.
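
To illustrate why prettyprinted HTML makes the awk step easy, here is a
minimal sketch. It is not code from the tools described above. It
assumes the prettyprinter has already put each <meta> element on its
own line, and the citation_* attribute names, the sample data, the
helper name extract_bibtex, and the BibTeX key are all hypothetical.

```shell
#!/bin/sh
# Hypothetical input: prettyprinted HTML with one <meta> element per line.
# The citation_* names are an assumption for illustration only.
sample='<meta name="citation_title" content="On Computable Numbers">
<meta name="citation_author" content="A. M. Turing">
<meta name="citation_year" content="1936">'

# extract_bibtex (hypothetical helper): read such HTML on stdin and
# print a rough BibTeX entry on stdout.
extract_bibtex() {
    awk '
    # Return the value of the given attribute in a tag line, or "" if absent.
    function attr(line, name,    re, s) {
        re = name "=\"[^\"]*\""
        if (match(line, re) == 0)
            return ""
        s = substr(line, RSTART, RLENGTH)
        sub(/^[^"]*"/, "", s)    # strip everything through the opening quote
        sub(/"$/, "", s)         # strip the closing quote
        return s
    }

    # One element per line means a simple line pattern is enough.
    /^<meta / {
        field[attr($0, "name")] = attr($0, "content")
    }

    END {
        print "@Article{hypothetical-key,"
        printf "  author = \"%s\",\n", field["citation_author"]
        printf "  title = \"%s\",\n", field["citation_title"]
        printf "  year = \"%s\",\n", field["citation_year"]
        print "}"
    }'
}

printf '%s\n' "$sample" | extract_bibtex
```

Because every element sits on one predictable line, the line-oriented
pattern /^<meta / suffices. On raw HTML, where a tag may span several
lines or share a line with others, the same program would silently
miss data.
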

The major work in that validation step is bracing of proper nouns in
titles that my software did not already handle, thereby preventing
downcasing of those words in the many bibliography styles that do so.

I'm on journal announcement lists for many publishers, so I often have
new data released to the Web just 5 to 10 minutes after receiving
e-mail about new issues.

The above-mentioned archives are at

        http://www.math.utah.edu/pub/bibnet
        http://www.math.utah.edu/pub/tex/bib
        http://www.math.utah.edu/pub/tex/bib/index-table.html
        http://www.math.utah.edu/pub/tex/bib/idx
        http://www.math.utah.edu/pub/tex/bib/toc

They are mirrored at Universität Karlsruhe, Oak Ridge National
Laboratory, Sandia National Laboratory, and elsewhere.

Like Al Aho, Doug McIlroy, and Arnold Robbins, I'm a huge fan of awk;
I believe that I was the first to port it to PDP-10 TOPS-20 and VAX
VMS in the mid-1980s, and it is one of the first mandatory tools that
I install on any new computer.

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe@math.utah.edu  -
- 155 S 1400 E RM 233                       beebe@acm.org  beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------