Date: Thu, 19 Mar 2020 22:18:33 +0100
From: Tomasz Rola
To: tuhs@minnie.tuhs.org
Message-ID: <20200319211833.GD16996@tau1.ceti.pl>
Subject: Re: [TUHS] The most surprising Unix programs

On Thu, Mar 19, 2020 at 02:57:59PM -0600, Nelson H. F. Beebe wrote:
[...]
>
> If you want to tackle raw HTML from arbitrary source, then I agree with
> you: most HTML on the Web is not grammar conformant, there are
> numerous vendor extensions, and the HTML is hideously idiosyncratic
> and irregularly formatted.
>
> The solution that I adopted 25 years ago was to write a grammar
> recognizing, but violation lenient, prettyprinter for HTML. It has
> served well and I use it many times daily for my work in the BibNet
> Project and TeX User Group bibliography archives, now approaching 1.55
> million entries. The latest public release is available here:
>
> http://www.math.utah.edu/pub/sgml/

Thank you, I will have a longer look at those archives. My plan so far
was to explore HTML files with CL and Slime (interactive mode for CL
inside Emacs), which would allow me to actually find out what I want
to be looking for - well, hopefully :-).

-- 
Regards,
Tomasz Rola

--
** A C programmer asked whether computer had Buddha's nature.   **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened...   **
**                                                              **
** Tomasz Rola                   mailto:tomasz_rola@bigfoot.com **