From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.comp.tex.context/30007
Path: news.gmane.org!not-for-mail
From: "Mojca Miklavec" <mojca.miklavec.lists@gmail.com>
Newsgroups: gmane.comp.tex.context
Subject: Re: counting the words in a TeX document
Date: Sun, 6 Aug 2006 02:31:31 +0200
Message-ID: <6faad9f00608051731t1dc00da2v73ad192dedd4835c@mail.gmail.com>
References: <6faad9f00608050945g5f829eaeka4afdee9858c7df8@mail.gmail.com>
	<44D4FA91.3080808@wxs.nl>
Reply-To: mailing list for ConTeXt users <ntg-context@ntg.nl>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Trace: sea.gmane.org 1154824320 22315 80.91.229.2 (6 Aug 2006 00:32:00 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Sun, 6 Aug 2006 00:32:00 +0000 (UTC)
Cc: Benjamin Gorinsek <benjamingorinsek@yahoo.co.uk>
Original-X-From: ntg-context-bounces@ntg.nl Sun Aug 06 02:31:56 2006
Return-path: <ntg-context-bounces@ntg.nl>
Envelope-to: gctc-ntg-context-518@m.gmane.org
Original-To: "mailing list for ConTeXt users" <ntg-context@ntg.nl>
In-Reply-To: <44D4FA91.3080808@wxs.nl>
Content-Disposition: inline
X-Virus-Scanned: amavisd-new at ntg.nl
X-BeenThere: ntg-context@ntg.nl
X-Mailman-Version: 2.1.7
Precedence: list
List-Id: mailing list for ConTeXt users <ntg-context.ntg.nl>
List-Unsubscribe: <http://www.ntg.nl/mailman/listinfo/ntg-context>,
	<mailto:ntg-context-request@ntg.nl?subject=unsubscribe>
List-Archive: <http://www.ntg.nl/pipermail/ntg-context>
List-Post: <mailto:ntg-context@ntg.nl>
List-Help: <mailto:ntg-context-request@ntg.nl?subject=help>
List-Subscribe: <http://www.ntg.nl/mailman/listinfo/ntg-context>,
	<mailto:ntg-context-request@ntg.nl?subject=subscribe>
Original-Sender: ntg-context-bounces@ntg.nl
Errors-To: ntg-context-bounces@ntg.nl
Xref: news.gmane.org gmane.comp.tex.context:30007
Archived-At: <http://permalink.gmane.org/gmane.comp.tex.context/30007>

On 8/5/06, Hans Hagen wrote:
> Mojca Miklavec wrote:
> > Hello,
> >
> > I would like to ask how difficult it would be to count the number of
> > words in a TeX/ConTeXt document. If it's too complex, please ignore
> > the rest of the message.
> the way i do such things (and worse trickery) is using pdftotext
>
> you can of course use tex, but then there can be generated words and so on, and it is insane to use tex (or adapt a tex style) for that; it may help to run with (nondestructive)
>
> \setupalign[nothyphenated]
>
> anyhow, here is a script (i could not locate my normal one)
>
> === wordcount.rb ===
>
> if (file = ARGV[0]) && file && FileTest.file?(file) then
>     begin
>         system("pdftotext #{ARGV[0]} wc.log")
>         data = IO.read("wc.log")
>         data.gsub!(/\d[\.\:]*\w+/o) do ' ' end  # remove suffixes
>         data.gsub!(/\d/o)           do ' ' end  # remove numbers
>         data.gsub!(/\-\s+/mo)       do ' ' end  # remove hyphenation
>         data.gsub!(/\-/mo)          do ' ' end  # split compound words
>         data.gsub!(/[\.\,\<\>\/\?\\\|\'\"\;\:\]\{\}\{\+\=\-\_\)\(\*\&\^\%\$\#\@\!\~\`]/mo) do ' ' end
>         words = data.split(/\s+/)
>         count = Hash.new
>         words.each do |w|
>             count[w] = (count[w] || 0) + 1
>         end
>     rescue
>         puts("some error #{$!}")
>     else
>         puts("words  : #{words.size}")
>         puts("unique : #{count.size}")
>     end
>     if ARGV[1] =~ /list/ then
>         puts("\n")
>         count.sort.each do |k,v|
>             puts("#{k} : #{v}")
>         end
>     end
> end
>
>
> usage: wc filename.pdf [list]
>
> if this kind of stuff is useful, we can add it to one of the scripts that come with context

Thanks a lot! I guess that's *it*! I always forget about the most
powerful feature of ConTeXt in comparison to LaTeX - scripting can be
added to almost any place (and the user doesn't need to install any
additional executables, such as "detex" mentioned by Aditya).

Here's some of my feedback:
- pdftotext is far from being useful for pdf to text conversion
(doesn't handle any accents), but is perfectly suitable for wordcount
- \[ is missing in the last gsub (only the right bracket is deleted)
- something strange (but not critical) happens to en-dashes

But everything else looks like perfect functionality for ctxtools --wordcount.
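
A minimal sketch of the bracket fix, not a replacement for Hans's script:
it applies the same cleanup steps to text that has already been extracted,
but with both "[" and "]" in the punctuation class. The helper name
split_words and the constant PUNCTUATION are made up for illustration.
Adding the en and em dash to the class is only a guess at what goes wrong
with en-dashes (that the dash survives as a separate "word").

```ruby
# Characters blanked out before splitting. Unlike the quoted script, the
# class contains both "[" and "]". The en and em dash (U+2013, U+2014) are
# added on the ASSUMPTION that they otherwise survive as separate "words".
PUNCTUATION = /[.,<>\/?\\|'";:\[\]{}+=\-_()*&^%$\#\@!~`\u2013\u2014]/

# Hypothetical helper: split already extracted plain text into words.
def split_words(text)
  data = text.gsub(/\d[.:]*\w+/, ' ')  # digits followed by a suffix
  data = data.gsub(/\d/, ' ')          # remaining digits
  data = data.gsub(PUNCTUATION, ' ')   # punctuation, hyphens, brackets, dashes
  data.split                           # split on whitespace, no empty strings
end

sample = "See [section 2] for details \u2013 the well-known case."
words  = split_words(sample)
puts "words  : #{words.size}"
puts "unique : #{words.uniq.size}"
```

For the sample line this reports 8 words, with "well-known" counted as two,
which matches the "split compound words" step of the original.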

On 8/5/06, Aditya Mahajan wrote:
> A very crude approach. There is a program called detex
> http://ctan.org/tex-archive/support/detex/ I have not used it, but I
> think that it strips off every command \something from the tex file.
> Then you can filter the file through wc to get a rough estimate of
> the number of words. One approach that will work is
>
> \startstatistics[filename][words|letters|lines]
>
> maps to
>
> \startbuffer[\jobname-statistics-filename]
>
> and
>
> \stopstatistics maps to
>
> \stopbuffer
> \getbuffer[\jobname-statistics-filename]
> \executesystemcommand{detex \jobname-statistics-filename.tmp | wc
> <flags corresponding to words|lines|letters> }

I took a look, but it merely looks like a parser for hardcoded (La)TeX
(someone should correct me if I'm wrong).

However, since abstracts for which one might need a word count usually
don't involve much trickery (they're usually almost pure plain text),
doing the same with a simple ruby script instead of compiling/installing
some external LaTeX-aware C program might already lead to satisfactory
results.
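
As a rough illustration of what such a simple ruby script could look like
(a sketch only: the helper names strip_tex and count_words are invented
here, and the regular expressions handle just the easy cases of comments,
\commands with optional [..] arguments, control symbols and braces):

```ruby
# Hypothetical helper: a very rough plain-text approximation of TeX source.
def strip_tex(source)
  text = source.gsub(/(?<!\\)%.*$/, '')                 # comments (unescaped %)
  text = text.gsub(/\\[a-zA-Z]+\s*(\[[^\]]*\])*/, ' ')  # \command[..][..]
  text = text.gsub(/\\./, ' ')                          # control symbols: \% \, \\
  text.gsub(/[{}$~]/, ' ')                              # grouping, math shift, ties
end

# Hypothetical helper: count whitespace-separated words, as "wc -w" would.
def count_words(source)
  strip_tex(source).split.size
end

abstract = 'This is a \emph{short} abstract, with 100\% plain text. % a comment'
puts "words : #{count_words(abstract)}"
```

Text inside {...} is kept and counted, so anything that generates or hides
words (macros with text arguments that are not typeset, for example) will
throw the number off; for near plain-text abstracts that may not matter.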

> It wasn't too complex for Michael Downes using LaTeX:
>
> \ProvidesFile{wordcount.tex}[2000/09/27 v1.5 Michael Downes]
> % Copyright 2000 Michael John Downes
> % This file has no restrictions on its use, distribution, or sale.
> %
> % If you run LaTeX on wordcount.tex it will prompt you for the name of a
> % document to be counted. For most people, however, it will be more

This solution is likely to produce better results (it just involves
slightly more work). It actually runs (La)TeX, but redefines a few
commands beforehand, so that counting the words then becomes a
straightforward matter of parsing the log file, based on the number of
certain boxes.


Based on those three answers, I got a clearer idea of two (different,
but complementary) methods that might be sensible:

a) ctxtools --wordcount filename[tex|pdf]
to do the wordcount for the whole document using pdftotext + ruby regexp

b)
\usemodule[wordcount]

whatever

\startstatistics[name][words|letters|lines]
some more-or-less plain text
\stopstatistics

whatever

and, following Aditya's idea, run a (ruby) regular expression
(instead of detex) on it, which would write the nicely formatted desired
number to the output/log file. (I don't know if it's possible to use
the first approach for the second problem, but it doesn't make sense
to complicate things too much.)
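
To make (b) a bit more concrete, here is a sketch of the ruby side only.
It assumes the \startstatistics[name][what] ... \stopstatistics markup
proposed above (which does not exist yet), reads the blocks straight from
the source text instead of from a buffer file, and treats the body as
plain text. The names BLOCK and statistics are invented for the example.

```ruby
# Matches \startstatistics[name][what] ... \stopstatistics (proposed markup).
BLOCK = /\\startstatistics\[([^\]]*)\]\[([^\]]*)\](.*?)\\stopstatistics/m

# Hypothetical helper: returns [name, what, value] for each block found.
def statistics(source)
  source.scan(BLOCK).map do |name, what, body|
    value = case what
            when 'words'   then body.split.size
            when 'letters' then body.scan(/[[:alpha:]]/).size
            when 'lines'   then body.strip.lines.size
            end
    [name, what, value]
  end
end

source = <<~'TEX'
  whatever
  \startstatistics[abstract][words]
  some more-or-less plain text
  \stopstatistics
  whatever
TEX

statistics(source).each do |name, what, value|
  puts "#{name}: #{value} #{what}"
end
```

An unknown keyword in the second bracket simply yields no value here; a
real module would presumably complain instead.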

As long as the command names are carefully chosen (and extensible if
the need for more complex behaviour arises in the future), that should
be about everything, and it doesn't seem so difficult to implement
after all. (But I would note in the documentation that the resulting
numbers might change slightly in the future if the algorithm for
counting the words is improved.)

Any thoughts?

Thanks a lot,
     Mojca