From: gnwiii@gmail.com
Newsgroups: gmane.comp.tex.context
Subject: Re: counting the words in a TeX document
Date: Sat, 5 Aug 2006 14:52:28 -0300

On 8/5/06, Mojca Miklavec wrote:

> Hello,
>
> I would like to ask how difficult it would be to count the number of
> words in a TeX/ConTeXt document. If it's too complex, please ignore
> the rest of the message.

It wasn't too complex for Michael Downes using LaTeX:

\ProvidesFile{wordcount.tex}[2000/09/27 v1.5 Michael Downes]
% Copyright 2000 Michael John Downes
% This file has no restrictions on its use, distribution, or sale.
%
% If you run LaTeX on wordcount.tex it will prompt you for the name of a
% document to be counted. For most people, however, it will be more
% convenient to run the shell script wordcount.sh, giving the document
% name as the first argument. The comments in wordcount.sh
% give further information about the usage and limitations of this tool.
% The fundamental idea is to mark each character and interword space
% with a unique tag that will show up in TeX "showbox" output. Then
% arrange to make the output routine trigger a TeX overfull vbox message
% for the page box so that everything gets reported in the TeX log.
% Then run grep -c (or an equivalent text search utility, e.g., perl) on
% the log file to count the occurrences.
% [....]

> Most recipes for LaTeX say that it's best to do something like
> "pdftotext" and then issue "wc" to count the words in the resulting
> text file, but Windows users don't have "wc" and sometimes you only
> need to know the length of the abstract or so ...

Many GNU utilities have been ported to Windows (GnuWin32), or can be
implemented in perl/ruby, which ConTeXt uses anyway.

> Some time ago Hans mentioned that he counts the number of appearances
> of single characters, but I don't know how difficult it would be to
> extend it to count the number of words.
>
> The problem is not that well defined (how to handle equations, some
> would probably want to exclude headers, footers, buttons, ...), but it
> only needs to be an approximation and "backward compatibility" (in the
> sense that counter would have to result in the same number after some
> years) is not needed at all since algorithms might improve with time
> and the resulting document doesn't really depend on that number, it
> would only be written to the log file.
>
> My idea for the interface would be something like
>
> \startwordcount[abstract]
> \startframedtext
> Bla bla.
> \stopframedtext
> \stopwordcount
>
> which would write something like "abstract: 2 words" to the log file
>
> or
>
> \startstatistics[abstract][words]
> \startframedtext
> Bla bla.
> \stopframedtext
> \stopstatistics
>
> But this is really a low priority. I'm currently using Acrobat to copy
> the text, then I paste it into Office and take a look at statistics
> there when I need to obey some limitations.
>
> So, if there's a simple solution, I would be glad to use it, but if it
> takes too much time to implement it, it's probably not worth the
> effort.
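
The counting step in Downes's recipe (grep -c on the log) is easy to
reproduce where grep is not available. A minimal sketch follows, written
in Python; that choice is an assumption on my part, since the thread
only mentions perl and ruby. The tag string and the sample log are made
up for illustration and are not what wordcount.tex actually writes.

```python
import re


def count_matching_lines(log_text: str, pattern: str) -> int:
    """Count the lines of log_text that contain a match for pattern.

    This mirrors what `grep -c` reports: the number of matching lines,
    not the total number of matches.
    """
    regex = re.compile(pattern)
    return sum(1 for line in log_text.splitlines() if regex.search(line))


# HYPOTHETICAL log excerpt: "<SPACE-TAG>" is a placeholder marker chosen
# for this example, not the tag that wordcount.tex emits.
sample_log = "glyph a\n<SPACE-TAG>\nglyph b\n<SPACE-TAG>\n"

# Count the lines carrying the placeholder tag.
print(count_matching_lines(sample_log, r"<SPACE-TAG>"))
```

Like grep -c, this counts matching lines rather than individual
matches, so the count is right only when each tag sits on its own line
in the log.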
ConTeXt already analyzes the "scratch" files with perl or ruby, so if
you can adapt MD's idea it shouldn't be a big deal to have texexec
print the result.

-- 
George N. White III
Head of St. Margarets Bay, Nova Scotia