From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.comp.tex.context/30005
From: Hans Hagen
Newsgroups: gmane.comp.tex.context
Subject: Re: counting the words in a TeX document
Date: Sat, 05 Aug 2006 22:07:45 +0200
Message-ID: <44D4FA91.3080808@wxs.nl>
References: <6faad9f00608050945g5f829eaeka4afdee9858c7df8@mail.gmail.com>
In-Reply-To: <6faad9f00608050945g5f829eaeka4afdee9858c7df8@mail.gmail.com>
Reply-To: mailing list for ConTeXt users
List-Id: mailing list for ConTeXt users
Xref: news.gmane.org gmane.comp.tex.context:30005

Mojca Miklavec wrote:
> Hello,
>
> I would like to ask how difficult it would be to count the number of
> words in a TeX/ConTeXt document. If it's too complex, please ignore
> the rest of the message.
>

the way i do such things (and worse trickery) is using pdftotext

you can of course use tex, but then there can be generated words and so on, and it is insane to use tex (or adapt a tex style) for that; it may help to run with (nondestructive)

\setupalign[nothyphenated]

anyhow, here is a script (i could not locate my normal one)

=== wordcount.rb ===

# count the words in a pdf file by way of pdftotext
file = ARGV[0]
if file && FileTest.file?(file) then
    begin
        # separate arguments, so file names with spaces survive
        system("pdftotext", file, "wc.log") or raise "pdftotext failed"
        data = IO.read("wc.log")
        data.gsub!(/\d[\.\:]*\w+/) do ' ' end # remove suffixes
        data.gsub!(/\d/) do ' ' end # remove numbers
        data.gsub!(/\-\s*\n\s*/) do '' end # remove hyphenation (rejoin words broken at a line end)
        data.gsub!(/\-/) do ' ' end # split compound words
        data.gsub!(/[\.\,\<\>\/\?\\\|\'\"\;\:\[\]\{\}\+\=\-\_\)\(\*\&\^\%\$\#\@\!\~\`]/) do ' ' end # remove punctuation
        words = data.split # no argument: leading whitespace gives no empty first word
        count = Hash.new(0)
        words.each do |w|
            count[w] += 1
        end
    rescue
        puts("some error #{$!}")
    else
        puts("words : #{words.size}")
        puts("unique : #{count.size}")
        # list per word counts only when the counting itself succeeded
        if ARGV[1] =~ /list/ then
            puts("\n")
            count.sort.each do |k,v|
                puts("#{k} : #{v}")
            end
        end
    end
end

usage: ruby wordcount.rb filename.pdf [list]

if this kind of stuff is useful, we can add it to one of the scripts that come with context

Hans

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------
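
A postscript to the script above, as a rough sketch rather than anything that ships with context: the cleanup steps can be tried on a plain string, without a pdf or pdftotext at hand. The helper name count_words is hypothetical, and the single \p{L} pattern is a simplifying assumption that stands in for the explicit punctuation list; counting stays case sensitive, as in the script.

```ruby
# Hypothetical helper, not part of wordcount.rb: applies the same kind of
# cleanup to a string and returns [total word count, per-word counts].
def count_words(text)
  data = text.dup
  data.gsub!(/\d[.:]*\w+/, ' ')            # drop number suffixes such as "1st"
  data.gsub!(/\d/, ' ')                     # drop remaining digits
  data.gsub!(/(\w)-\s*\n\s*(\w)/, '\1\2')   # rejoin words hyphenated at a line end
  data.gsub!(/[^\p{L}\s]/, ' ')             # assumption: anything that is not a letter separates words
  words = data.split                        # split on runs of whitespace
  counts = Hash.new(0)
  words.each { |w| counts[w] += 1 }
  [words.size, counts]
end

total, counts = count_words("The 1st co-\noperative test, the test: 42 items.")
puts("words : #{total}")
puts("unique : #{counts.size}")
```

The word broken over the line end is rejoined and counted once, while a hyphen inside a line still splits a compound into two words.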