From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.comp.tex.context/30054
Path: news.gmane.org!not-for-mail
From: "Mojca Miklavec" <mojca.miklavec.lists@gmail.com>
Newsgroups: gmane.comp.tex.context
Subject: Re: counting the words in a TeX document
Date: Mon, 7 Aug 2006 10:24:32 +0200
Message-ID: <6faad9f00608070124h2162d8ddj163fd308ca30348a@mail.gmail.com>
References: <6faad9f00608050945g5f829eaeka4afdee9858c7df8@mail.gmail.com>
	<44D4FA91.3080808@wxs.nl>
	<6faad9f00608051731t1dc00da2v73ad192dedd4835c@mail.gmail.com>
	<Pine.WNT.4.63.0608061320170.2092@nqvgln>
Reply-To: mailing list for ConTeXt users <ntg-context@ntg.nl>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Trace: sea.gmane.org 1154939135 25883 80.91.229.2 (7 Aug 2006 08:25:35 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Mon, 7 Aug 2006 08:25:35 +0000 (UTC)
Cc: Benjamin Gorinsek <benjamingorinsek@yahoo.co.uk>
Original-X-From: ntg-context-bounces@ntg.nl Mon Aug 07 10:25:33 2006
Return-path: <ntg-context-bounces@ntg.nl>
Envelope-to: gctc-ntg-context-518@m.gmane.org
Original-Received: from ronja.vet.uu.nl ([131.211.172.88] helo=ronja.ntg.nl)
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1GA0Ps-0005dh-34
	for gctc-ntg-context-518@m.gmane.org; Mon, 07 Aug 2006 10:24:52 +0200
Original-Received: from localhost (localhost [127.0.0.1])
	by ronja.ntg.nl (Postfix) with ESMTP id 9BE731FCDF;
	Mon,  7 Aug 2006 10:24:51 +0200 (CEST)
Original-Received: from ronja.ntg.nl ([127.0.0.1])
 by localhost (smtp.ntg.nl [127.0.0.1]) (amavisd-new, port 10024) with LMTP
 id 07423-04-13; Mon,  7 Aug 2006 10:24:42 +0200 (CEST)
Original-Received: from ronja.vet.uu.nl (localhost [127.0.0.1])
	by ronja.ntg.nl (Postfix) with ESMTP id B811E1FDDA;
	Mon,  7 Aug 2006 10:24:41 +0200 (CEST)
Original-Received: from localhost (localhost [127.0.0.1])
	by ronja.ntg.nl (Postfix) with ESMTP id 521471FDDA
	for <ntg-context@ntg.nl>; Mon,  7 Aug 2006 10:24:39 +0200 (CEST)
Original-Received: from ronja.ntg.nl ([127.0.0.1])
	by localhost (smtp.ntg.nl [127.0.0.1]) (amavisd-new,
	port 10024) with LMTP id 07423-04-12 for <ntg-context@ntg.nl>;
	Mon,  7 Aug 2006 10:24:33 +0200 (CEST)
Original-Received: from nf-out-0910.google.com (nf-out-0910.google.com [64.233.182.187])
	by ronja.ntg.nl (Postfix) with SMTP id 9CEBE1FCDF
	for <ntg-context@ntg.nl>; Mon,  7 Aug 2006 10:24:32 +0200 (CEST)
Original-Received: by nf-out-0910.google.com with SMTP id x29so919652nfb
	for <ntg-context@ntg.nl>; Mon, 07 Aug 2006 01:24:32 -0700 (PDT)
Original-Received: by 10.78.175.14 with SMTP id x14mr2316063hue;
	Mon, 07 Aug 2006 01:24:32 -0700 (PDT)
Original-Received: by 10.78.175.15 with HTTP; Mon, 7 Aug 2006 01:24:32 -0700 (PDT)
Original-To: "mailing list for ConTeXt users" <ntg-context@ntg.nl>
In-Reply-To: <Pine.WNT.4.63.0608061320170.2092@nqvgln>
Content-Disposition: inline
X-Virus-Scanned: amavisd-new at ntg.nl
X-BeenThere: ntg-context@ntg.nl
X-Mailman-Version: 2.1.7
Precedence: list
List-Id: mailing list for ConTeXt users <ntg-context.ntg.nl>
List-Unsubscribe: <http://www.ntg.nl/mailman/listinfo/ntg-context>,
	<mailto:ntg-context-request@ntg.nl?subject=unsubscribe>
List-Archive: <http://www.ntg.nl/pipermail/ntg-context>
List-Post: <mailto:ntg-context@ntg.nl>
List-Help: <mailto:ntg-context-request@ntg.nl?subject=help>
List-Subscribe: <http://www.ntg.nl/mailman/listinfo/ntg-context>,
	<mailto:ntg-context-request@ntg.nl?subject=subscribe>
Original-Sender: ntg-context-bounces@ntg.nl
Errors-To: ntg-context-bounces@ntg.nl
Xref: news.gmane.org gmane.comp.tex.context:30054
Archived-At: <http://permalink.gmane.org/gmane.comp.tex.context/30054>

On 8/6/06, Aditya Mahajan wrote:
> On Sun, 6 Aug 2006, Mojca Miklavec wrote:
> > Based on those three answers I got a clearer idea of two (different,
> > but complementary) methods that might be sensible:
> >
> > a) ctxtools --wordcount filename[tex|pdf]
> > to do the wordcount for the whole document using pdftotext + ruby regexp
> >
> > b)
> > \usemodule[wordcount]
> >
> > whatever
> >
> > \startstatistics[name][words|letters|lines]
> > some more-or-less plain text
> > \stopstatistics
> >
> > whatever
> >
> > and according to Aditya's idea, run a (ruby) regular expression
> > (instead of detex) on it which would write the nicely formatted desired
> > number to the output/log file. (I don't know if it's possible to use
> > the first approach for the second problem, but it doesn't make sense
> > to complicate things too much.)
>
> If you have a script that counts words in a ConTeXt document, the
> second approach is straightforward. Write everything to a buffer and
> run the script on the buffer. However, such a mechanism will never be
> perfect (or close to perfect) in the sense of parsing arbitrary input.

The dumbest solution that I could think of (using a slightly modified
version of Hans's Ruby script):

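\unprotect

% \startstatistics[name][words|letters|lines] ... \stopstatistics
\def\startstatistics
  {\dodoubleempty\dostartstatistics}

% #1 = buffer name, #2 = what to count (not used yet), #3 = the text
\def\dostartstatistics[#1][#2]#3\stopstatistics
  {\setbuffer[#1]#3\endbuffer % store the text in buffer #1
    \executesystemcommand{ruby wordcount.rb \jobname-#1.tmp}%
    \getbuffer[#1]} % typeset the text as usual

\protect \doifnotmode{demo}{\endinput}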

... but a friend who asked me for a favour actually wants to use
abbreviations and bibliography as well, so only the first method
(creating the PDF first) would work. He currently keeps copy-pasting
the text of the resulting PDF into Word and uses Word's statistics to
count the words and/or characters for him.

But I guess his wishes will have to wait a while longer in this case.
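
For illustration only, here is a rough sketch of what a wordcount.rb
covering both cases might look like. This is not Hans's script (which
is not shown in this thread). The function names strip_tex, statistics
and read_text are made up for the example, the TeX stripping is a
handful of crude regular expressions that will miscount anything
clever, and the PDF branch assumes that pdftotext (xpdf/poppler) is on
the PATH.

```ruby
# wordcount.rb: hypothetical sketch, not the script used in the macro above.
# Usage: ruby wordcount.rb jobname-name.tmp
#        ruby wordcount.rb document.pdf    (assumes pdftotext is installed)

# Crudely remove TeX markup. Text inside braces is kept, so \emph{word}
# counts as one word, but so does the key in \cite{key}.
def strip_tex(source)
  text = source.gsub(/(?<!\\)%.*$/, '')                         # comments (unescaped %)
  text = text.gsub(/\\[a-zA-Z]+[ \t]*(?:\[[^\]]*\])*/, ' ')     # commands and [..] options
  text = text.gsub(/\\./, ' ')                                  # control symbols such as \% or \,
  text.gsub(/[{}$~]/, ' ')                                      # grouping, math shift, ties
end

# Count words, letters and non-empty lines in plain text.
def statistics(text)
  {
    words:   text.scan(/[[:alpha:]][[:alpha:]'\-]*/).size,
    letters: text.scan(/[[:alpha:]]/).size,
    lines:   text.lines.count { |line| !line.strip.empty? }
  }
end

# Return plain text for either a PDF or a TeX/buffer file.
def read_text(path)
  if File.extname(path).downcase == '.pdf'
    # pdftotext writes to stdout when the output file is "-"
    IO.popen(['pdftotext', path, '-'], &:read)
  else
    strip_tex(File.read(path))
  end
end

if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  statistics(read_text(ARGV[0])).each do |name, count|
    puts "#{name}: #{count}"
  end
end
```

With the macro above, the script would be called on \jobname-name.tmp
and print its counts to the terminal. Picking only words, letters or
lines according to the second argument of \startstatistics is left
out of this sketch.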

> ftp://tug.ctan.org/pub/tex-archive/macros/plain/contrib/misc/xii.tex
>
> But of course, you will not write anything like this in an abstract
> :-)

Nevertheless, I love the story (and esp. the document which creates it)!

All the best,
    Mojca