ntg-context - mailing list for ConTeXt users
From: Hans Hagen <pragma@wxs.nl>
Subject: Re: counting the words in a TeX document
Date: Sat, 05 Aug 2006 22:07:45 +0200
Message-ID: <44D4FA91.3080808@wxs.nl>
In-Reply-To: <6faad9f00608050945g5f829eaeka4afdee9858c7df8@mail.gmail.com>

Mojca Miklavec wrote:
> Hello,
>
> I would like to ask how difficult it would be to count the number of
> words in a TeX/ConTeXt document. If it's too complex, please ignore
> the rest of the message.
>   
the way i do such things (and worse trickery) is using pdftotext 

you can of course use tex, but then there can be generated words and so on, and it is insane to use tex (or adapt a tex style) for that; it may help to run with (nondestructive)

\setupalign[nothyphenated]

anyhow, here is a script (i could not locate my normal one) 

=== wordcount.rb ===

file = ARGV[0]
if file && File.file?(file) then
    begin
        # pass the arguments separately so file names with spaces work
        system("pdftotext", file, "wc.log") or raise "pdftotext failed"
        data = IO.read("wc.log")
        data.gsub!(/\d[\.\:]*\w+/, ' ')  # remove numbers with suffixes
        data.gsub!(/\d/, ' ')            # remove numbers
        data.gsub!(/\-\s+/, ' ')         # remove hyphenation
        data.gsub!(/\-/, ' ')            # split compound words
        # remove punctuation
        data.gsub!(/[\.\,\<\>\/\?\\\|\'\"\;\:\]\[\{\}\+\=\-\_\)\(\*\&\^\%\$\#\@\!\~\`]/, ' ')
        words = data.split               # plain split drops a leading empty string
        count = Hash.new(0)
        words.each do |w|
            count[w] += 1
        end
    rescue
        puts("some error #{$!}")
    else
        puts("words  : #{words.size}")
        puts("unique : #{count.size}")
        # only list the words when the counting succeeded
        if ARGV[1] =~ /list/ then
            puts("\n")
            count.sort.each do |k,v|
                puts("#{k} : #{v}")
            end
        end
    end
end


usage: ruby wordcount.rb filename.pdf [list]
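
to try the cleanup rules without a pdf at hand, the same steps can be put in a function that works on a string. this is only a sketch: the name count_words and the sample sentence are made up for illustration, and the punctuation class assumes that the doubled \{ in the script was meant to be \[

```ruby
# punctuation that gets blanked out before splitting (assumed to include '[')
PUNCTUATION = /[\.\,\<\>\/\?\\\|\'\"\;\:\]\[\{\}\+\=\-\_\)\(\*\&\^\%\$\#\@\!\~\`]/

# hypothetical helper: returns a hash that maps each word to its frequency
def count_words(text)
    data = text.dup
    data.gsub!(/\d[\.\:]*\w+/, ' ')  # numbers with suffixes (2nd, 3.5cm)
    data.gsub!(/\d/, ' ')            # remaining digits
    data.gsub!(/\-\s+/, ' ')         # hyphen followed by whitespace
    data.gsub!(/\-/, ' ')            # split compound words
    data.gsub!(PUNCTUATION, ' ')
    count = Hash.new(0)
    data.split.each do |w|
        count[w] += 1
    end
    count
end

count = count_words("The well-known test, the 2nd test.")
puts("words  : #{count.values.sum}")  # 6
puts("unique : #{count.size}")        # 5
```

words are compared case sensitive, so "The" and "the" count as two unique words, just as in the script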

if this kind of stuff is useful, we can add it to one of the scripts that come with context

Hans 
-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------
