From: Aditya Mahajan
Newsgroups: gmane.comp.tex.context
Subject: Re: counting the words in a TeX document
Date: Sat, 5 Aug 2006 13:02:34 -0400 (EDT)

On Sat, 5 Aug 2006, Mojca Miklavec wrote:

> I would like to ask how difficult it would be to count the number of
> words in a TeX/ConTeXt document. If it's too complex, please ignore
> the rest of the message.
>
> Most recipes for LaTeX say that it's best to do something like
> "pdftotext" and then issue "wc" to count the words in the resulting
> text file, but Windows users don't have "wc" and sometimes you only
> need to know the length of the abstract or so ...
>
> Some time ago Hans mentioned that he counts the number of appearances
> of single characters, but I don't know how difficult it would be to
> extend it to count the number of words.
>
> The problem is not that well defined (how to handle equations, some
> would probably want to exclude headers, footers, buttons, ...), but it
> only needs to be an approximation and "backward compatibility" (in the
> sense that counter would have to result in the same number after some
> years) is not needed at all since algorithms might improve with time
> and the resulting document doesn't really depend on that number, it
> would only be written to the log file.
>
> My idea for the interface would be something like
>
> \startwordcount[abstract]
> \startframedtext
> Bla bla.
> \stopframedtext
> \stopwordcount
>
> which would write something like "abstract: 2 words" to the log file
>
> or
>
> \startstatistics[abstract][words]
> \startframedtext
> Bla bla.
> \stopframedtext
> \stopstatistics
>
> But this is really a low priority. I'm currently using Acrobat to copy
> the text, then I paste it into Office and take a look at statistics
> there when I need to obey some limitations.
>
> So, if there's a simple solution, I would be glad to use it, but if it
> takes too much time to implement it, it's probably not worth the
> effort.

A very crude approach: there is a program called detex

http://ctan.org/tex-archive/support/detex/

I have not used it, but I think it strips every command of the form
\something from the TeX file. You can then pipe the result through wc
to get a rough estimate of the number of words.

One approach that will work is to have

\startstatistics[filename][words|letters|lines]

map to

\startbuffer[\jobname-statistics-filename]

and \stopstatistics map to

\stopbuffer
\getbuffer[\jobname-statistics-filename]
\executesystemcommand{detex \jobname-statistics-filename.tmp | wc }

and possibly prettify the output so that it is more clearly visible in
the log.

Another approach would be to write a Vim script that counts the number
of words in a visually highlighted area.

Aditya
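
P.S. For systems that have neither detex nor wc, the crude "strip the commands, then count" idea above can be approximated in a few lines of Python. This is only a rough sketch under my own assumptions: the helper name crude_word_count and the regular expressions are made up for illustration, they are not what detex actually does, and equations, headers and the like are not treated specially.

```python
import re

# Hypothetical helper, not part of detex or ConTeXt: a rough stand-in for
# "detex file | wc -w" on systems that lack both tools.
COMMENT = re.compile(r"(?<!\\)%.*")                 # unescaped % up to end of line
COMMAND = re.compile(r"\\[A-Za-z]+(\[[^\]]*\])*")   # \name plus any [...] options
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")   # runs of letters, allowing ' and -


def crude_word_count(tex_source: str) -> int:
    """Approximate the number of words in a TeX/ConTeXt snippet."""
    text = COMMENT.sub("", tex_source)   # drop comments
    text = COMMAND.sub(" ", text)        # drop commands and their bracketed options
    text = re.sub(r"[{}]", " ", text)    # drop grouping braces, keep their contents
    return len(WORD.findall(text))


sample = r"""
\startframedtext
Bla bla.   % a comment that is not counted
\stopframedtext
"""
print("abstract:", crude_word_count(sample), "words")  # abstract: 2 words
```

Text inside braces is kept and counted, while anything inside square brackets directly after a command is treated as an option and dropped; whether that is the right call depends on the document, so the number is only an estimate.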