counting the words in a TeX document

ntg-context - mailing list for ConTeXt users

* counting the words in a TeX document
@ 2006-08-05 16:45 Mojca Miklavec
  2006-08-05 17:02 ` Aditya Mahajan
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Mojca Miklavec @ 2006-08-05 16:45 UTC (permalink / raw)


Hello,

I would like to ask how difficult it would be to count the number of
words in a TeX/ConTeXt document. If it's too complex, please ignore
the rest of the message.


Most recipes for LaTeX say that it's best to do something like
"pdftotext" and then issue "wc" to count the words in the resulting
text file, but windows users don't have "wc" and sometimes you only
need to know the length of the abstract or so ...

Some time ago Hans mentioned that he counts the number of occurrences
of single characters, but I don't know how difficult it would be to
extend it to count the number of words.

The problem is not that well defined (how to handle equations; some
would probably want to exclude headers, footers, buttons, ...), but it
only needs to be an approximation. "Backward compatibility" (in the
sense that the counter would have to give the same number after some
years) is not needed at all, since algorithms might improve with time
and the resulting document doesn't really depend on that number; it
would only be written to the log file.

My idea for the interface would be something like

\startwordcount[abstract]
\startframedtext
Bla bla.
\stopframedtext
\stopwordcount

which would write something like "abstract: 2 words" to the log file

or

\startstatistics[abstract][words]
\startframedtext
Bla bla.
\stopframedtext
\stopstatistics

But this is really a low priority. I'm currently using Acrobat to copy
the text, then I paste it into Office and take a look at statistics
there when I need to obey some limitations.

So, if there's a simple solution, I would be glad to use it, but if it
takes too much time to implement it, it's probably not worth the
effort.

Thanks a lot,
    Mojca


* Re: counting the words in a TeX document
  2006-08-05 16:45 counting the words in a TeX document Mojca Miklavec
@ 2006-08-05 17:02 ` Aditya Mahajan
  2006-08-05 17:52 ` gnwiii
  2006-08-05 20:07 ` Hans Hagen
  2 siblings, 0 replies; 14+ messages in thread
From: Aditya Mahajan @ 2006-08-05 17:02 UTC (permalink / raw)


On Sat, 5 Aug 2006, Mojca Miklavec wrote:
> I would like to ask how difficult it would be to count the number of
> words in a TeX/ConTeXt document. If it's too complex, please ignore
> the rest of the message.
>
>
> Most recipes for LaTeX say that it's best to do something like
> "pdftotext" and then issue "wc" to count the words in the resulting
> text file, but windows users don't have "wc" and sometimes you only
> need to know the length of the abstract or so ...
>
> Some time ago Hans mentioned that he counts the number of occurrences
> of single characters, but I don't know how difficult it would be to
> extend it to count the number of words.
>
> The problem is not that well defined (how to handle equations, some
> would probably want to exclude headers, footers, buttons, ...), but it
> only needs to be an approximation and "backward compatibility" (in the
> sense that counter would have to result in the same number after some
> years) is not needed at all since algorithms might improve with time
> and the resulting document doesn't really depend on that number, it
> would only be written to the log file.
>
> My idea for the interface would be something like
>
> \startwordcount[abstract]
> \startframedtext
> Bla bla.
> \stopframedtext
> \stopwordcount
>
> which would write something like "abstract: 2 words" to the log file
>
> or
>
> \startstatistics[abstract][words]
> \startframedtext
> Bla bla.
> \stopframedtext
> \stopstatistics
>
> But this is really a low priority. I'm currently using Acrobat to copy
> the text, then I paste it into Office and take a look at statistics
> there when I need to obey some limitations.
>
> So, if there's a simple solution, I would be glad to use it, but if it
> takes too much time to implement it, it's probably not worth the
> effort.

A very crude approach. There is a program called detex 
http://ctan.org/tex-archive/support/detex/ I have not used it, but I 
think that it strips off every command \something from the tex file. 
Then you can filter the file through wc to get a rough estimate of 
the number of words. One approach that will work is

\startstatistics[filename][words|letters|lines]

maps to

\startbuffer[\jobname-statistics-filename]

and

\stopstatistics maps to

\stopbuffer
\getbuffer[\jobname-statistics-filename]
\executesystemcommand{detex \jobname-statistics-filename.tmp | wc <flags corresponding to words|lines|letters>}

and possibly prettify output to be more clearly visible in the log.
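
The "<flags corresponding to ...>" placeholder above could be filled in roughly as follows. This is only a sketch: the helper name statistics_command and the sample file name are made up for illustration, and it assumes both detex and wc are on the PATH. Note that "wc -m" counts all characters (including spaces and newlines), so "letters" is only an approximation.

```ruby
require "shellwords"

# Hypothetical mapping from the [words|letters|lines] key of
# \startstatistics to the corresponding wc flag.
WC_FLAGS = {
  "words"   => "-w",  # number of words
  "letters" => "-m",  # number of characters (includes whitespace)
  "lines"   => "-l"   # number of lines
}.freeze

# Build the shell pipeline that \executesystemcommand would run.
def statistics_command(buffer_file, kind)
  flag = WC_FLAGS.fetch(kind) do
    raise ArgumentError, "unknown statistic: #{kind}"
  end
  "detex #{Shellwords.escape(buffer_file)} | wc #{flag}"
end

puts statistics_command("myjob-statistics-abstract.tmp", "words")
# prints: detex myjob-statistics-abstract.tmp | wc -w
```

Keeping the mapping in one table means an unknown key fails loudly instead of silently running wc without a flag.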

Another approach would be to write a Vim script so that you can count
the number of words in a visually highlighted area.

Aditya


* Re: counting the words in a TeX document
  2006-08-05 16:45 counting the words in a TeX document Mojca Miklavec
  2006-08-05 17:02 ` Aditya Mahajan
@ 2006-08-05 17:52 ` gnwiii
  2006-08-05 20:07 ` Hans Hagen
  2 siblings, 0 replies; 14+ messages in thread
From: gnwiii @ 2006-08-05 17:52 UTC (permalink / raw)

On 8/5/06, Mojca Miklavec <mojca.miklavec.lists@gmail.com> wrote:
> Hello,
>
> I would like to ask how difficult it would be to count the number of
> words in a TeX/ConTeXt document. If it's too complex, please ignore
> the rest of the message.

It wasn't too complex for Michael Downes using LaTeX:

\ProvidesFile{wordcount.tex}[2000/09/27 v1.5 Michael Downes]
% Copyright 2000 Michael John Downes
% This file has no restrictions on its use, distribution, or sale.
%
% If you run LaTeX on wordcount.tex it will prompt you for the name of a
% document to be counted. For most people, however, it will be more
% convenient to run the shell script wordcount.sh, giving the document
% name as the first argument. The comments in wordcount.sh
% give further information about the usage and limitations of this tool.

% The fundamental idea is to mark each character and interword space
% with a unique tag that will show up in TeX "showbox" output. Then
% arrange to make the output routine trigger a TeX overfull vbox message
% for the page box so that everything gets reported in the TeX log.
% Then run grep -c (or an equivalent text search utility, e.g., perl) on
% the log file to count the occurrences.
% [....]
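
The last step described in those comments (running "grep -c" on the log) is easy to reproduce in a scripting language. A minimal sketch, assuming the log has already been read into a string; the marker "WORDTAG" and the sample lines are invented for illustration and are not the tags that wordcount.tex actually writes:

```ruby
# Count the lines that contain a marker, like `grep -c marker file`.
# A line with several markers still counts once, as with grep -c.
def count_marked_lines(log_text, marker)
  log_text.each_line.count { |line| line.include?(marker) }
end

# Made-up sample input, not real TeX log output.
sample_log = <<~LOG
  first line with WORDTAG
  a line without the marker
  second line with WORDTAG
LOG

puts count_marked_lines(sample_log, "WORDTAG")
# prints: 2
```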

> Most recipes for LaTeX say that it's best to do something like
> "pdftotext" and then issue "wc" to count the words in the resulting
> text file, but windows users don't have "wc" and sometimes you only
> need to know the length of the abstract or so ...

Many GNU utilities have been ported (GNUWin32), or can be implemented
in perl/ruby which context uses anyway.

> Some time ago Hans mentioned that he counts the number of occurrences
> of single characters, but I don't know how difficult it would be to
> extend it to count the number of words.
>
> The problem is not that well defined (how to handle equations, some
> would probably want to exclude headers, footers, buttons, ...), but it
> only needs to be an approximation and "backward compatibility" (in the
> sense that counter would have to result in the same number after some
> years) is not needed at all since algorithms might improve with time
> and the resulting document doesn't really depend on that number, it
> would only be written to the log file.
>
> My idea for the interface would be something like
>
> \startwordcount[abstract]
> \startframedtext
> Bla bla.
> \stopframedtext
> \stopwordcount
>
> which would write something like "abstract: 2 words" to the log file
>
> or
>
> \startstatistics[abstract][words]
> \startframedtext
> Bla bla.
> \stopframedtext
> \stopstatistics
>
> But this is really a low priority. I'm currently using Acrobat to copy
> the text, then I paste it into Office and take a look at statistics
> there when I need to obey some limitations.
>
> So, if there's a simple solution, I would be glad to use it, but if it
> takes too much time to implement it, it's probably not worth the
> effort.

ConTeXt already analyzes the "scratch" files with perl or ruby, so if
you can adapt MD's idea it shouldn't be a big deal to have texexec
print the result.

-- 
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia


* Re: counting the words in a TeX document
  2006-08-05 16:45 counting the words in a TeX document Mojca Miklavec
  2006-08-05 17:02 ` Aditya Mahajan
  2006-08-05 17:52 ` gnwiii
@ 2006-08-05 20:07 ` Hans Hagen
  2006-08-06  0:31   ` Mojca Miklavec
  2 siblings, 1 reply; 14+ messages in thread
From: Hans Hagen @ 2006-08-05 20:07 UTC (permalink / raw)


Mojca Miklavec wrote:
> Hello,
>
> I would like to ask how difficult it would be to count the number of
> words in a TeX/ConTeXt document. If it's too complex, please ignore
> the rest of the message.
>   
the way i do such things (and worse trickery) is using pdftotext 

you can of course use tex, but then there can be generated words and so on, and it is insane to use tex (or adapt a tex style) for that; it may help to run with (nondestructive)

\setupalign[nothyphenated]

anyhow, here is a script (i could not locate my normal one) 

=== wordcount.rb ===

if (file = ARGV[0]) && file && FileTest.file?(file) then
    begin
        system("pdftotext #{ARGV[0]} wc.log")
        data = IO.read("wc.log")
        data.gsub!(/\d[\.\:]*\w+/o) do ' ' end  # remove suffixes
        data.gsub!(/\d/o)           do ' ' end  # remove numbers
        data.gsub!(/\-\s+/mo)       do ' ' end  # remove hyphenation
        data.gsub!(/\-/mo)          do ' ' end  # split compound words
        data.gsub!(/[\.\,\<\>\/\?\\\|\'\"\;\:\]\{\}\{\+\=\-\_\)\(\*\&\^\%\$\#\@\!\~\`]/mo) do ' ' end
        words = data.split(/\s+/)
        count = Hash.new
        words.each do |w|
            count[w] = (count[w] || 0) + 1
        end
    rescue
        puts("some error #{$!}")
    else
        puts("words  : #{words.size}")
        puts("unique : #{count.size}")
    end
    if ARGV[1] =~ /list/ then
        puts("\n")
        count.sort.each do |k,v|
            puts("#{k} : #{v}")
        end
    end
end


usage: wc filename.pdf [list] 

if this kind of stuff is useful, we can add it to one of the scripts that come with context

Hans 
-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------


* Re: counting the words in a TeX document
  2006-08-05 20:07 ` Hans Hagen
@ 2006-08-06  0:31   ` Mojca Miklavec
  2006-08-06 15:00     ` Hans Hagen
  2006-08-06 17:27     ` Aditya Mahajan
  0 siblings, 2 replies; 14+ messages in thread
From: Mojca Miklavec @ 2006-08-06  0:31 UTC (permalink / raw)
  Cc: Benjamin Gorinsek

On 8/5/06, Hans Hagen wrote:
> Mojca Miklavec wrote:
> > Hello,
> >
> > I would like to ask how difficult it would be to count the number of
> > words in a TeX/ConTeXt document. If it's too complex, please ignore
> > the rest of the message.
> the way i do such things (and worse trickery) is using pdftotext
>
> you can of course use tex, but then there can be generated words and so on, and it is insane to use tex (or adapt a tex style) for that; it may help to run with (nondestructive)
>
> \setupalign[nothyphenated]
>
> anyhow, here is a script (i could not locate my normal one)
>
> === wordcount.rb ===
>
> if (file = ARGV[0]) && file && FileTest.file?(file) then
>     begin
>         system("pdftotext #{ARGV[0]} wc.log")
>         data = IO.read("wc.log")
>         data.gsub!(/\d[\.\:]*\w+/o) do ' ' end  # remove suffixes
>         data.gsub!(/\d/o)           do ' ' end  # remove numbers
>         data.gsub!(/\-\s+/mo)       do ' ' end  # remove hyphenation
>         data.gsub!(/\-/mo)          do ' ' end  # split compound words
>         data.gsub!(/[\.\,\<\>\/\?\\\|\'\"\;\:\]\{\}\{\+\=\-\_\)\(\*\&\^\%\$\#\@\!\~\`]/mo) do ' ' end
>         words = data.split(/\s+/)
>         count = Hash.new
>         words.each do |w|
>             count[w] = (count[w] || 0) + 1
>         end
>     rescue
>         puts("some error #{$!}")
>     else
>         puts("words  : #{words.size}")
>         puts("unique : #{count.size}")
>     end
>     if ARGV[1] =~ /list/ then
>         puts("\n")
>         count.sort.each do |k,v|
>             puts("#{k} : #{v}")
>         end
>     end
> end
>
>
> usage: wc filename.pdf [list]
>
> if this kind of stuff is useful, we can add it to one of the scripts that come with context

Thanks a lot! I guess that's *it*! I always forget about the most
powerful feature of ConTeXt in comparison to LaTeX - scripting can be
added to almost any place (and the user doesn't need to install any
additional executables, such as "detex" mentioned by Aditya).

Here's some of my feedback:
- pdftotext is far from being useful for pdf to text conversion
(doesn't handle any accents), but is perfectly suitable for wordcount
- \[ is missing in the last gsub (only the right bracket is deleted)
- something strange (but not critical) happens to en-dashes

But everything else looks like a perfect functionality for ctxtools --wordcount.
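
The second and third points could be handled along these lines. This is only a sketch, not the script that ended up in ctxtools: the name split_words is made up, and it assumes the pdftotext output is UTF-8, so that en-dashes and em-dashes arrive as U+2013 and U+2014.

```ruby
# Punctuation to be replaced by spaces. Unlike the class in the quoted
# script, this one contains both "[" and "]", plus en/em dashes.
PUNCTUATION = /[.,<>\/?\\|'";:\[\]{}+=\-_()*&^%$\#@!~`\u2013\u2014]/

# Replace every punctuation character by a space and split on whitespace.
def split_words(text)
  text.gsub(PUNCTUATION, " ").split
end

p split_words("a [b] c\u2013d e--f")
# prints: ["a", "b", "c", "d", "e", "f"]
```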

On 8/5/06, Aditya Mahajan wrote:
> A very crude approach. There is a program called detex
> http://ctan.org/tex-archive/support/detex/ I have not used it, but I
> think that it strips off every command \something from the tex file.
> Then you can filter the file through wc to get a rough estimate of
> the number of words. One approach that will work is
>
> \startstatistics[filename][words|letters|lines]
>
> maps to
>
> \startbuffer[\jobname-statistics-filename]
>
> and
>
> \stopstatistics maps to
>
> \stopbuffer
> \getbuffer[\jobname-statistics-filename]
> \executesystemcommand{detex \jobname-statistics-filename.tmp | wc
> <flags correspondingto words|lines|letters> }

I took a look, but it merely looks like a parser for hardcoded (La)TeX
(someone should correct me if I'm wrong).

However, since abstracts for which one might need a word count
usually don't involve too much trickery (they're usually almost
pure plain text), doing the same with a simple ruby script
instead of compiling/installing some external LaTeX-aware C program
might already lead to satisfactory results.

> It wasn't too complex for Michael Downes using LaTeX:
>
> \ProvidesFile{wordcount.tex}[2000/09/27 v1.5 Michael Downes]
> % Copyright 2000 Michael John Downes
> % This file has no restrictions on its use, distribution, or sale.
> %
> % If you run LaTeX on wordcount.tex it will prompt you for the name of a
> % document to be counted. For most people, however, it will be more

This solution is more likely to produce better results (just that it
includes slightly more work). It actually runs (La)TeX, just redefines
a few commands before, so that counting the words is then a
straightforward parsing of log files based on the number of some
boxes.

Based on those three answers I got a clearer idea of two (different,
but complementary) methods that might be sensible:

a) ctxtools --wordcount filename[tex|pdf]
to do the wordcount for the whole document using pdftotext + ruby regexp

b)
\usemodule[wordcount]

whatever

\startstatistics[name][words|letters|lines]
some more-or-less plain text
\stopstatistics

whatever

and according to Aditya's idea, run a (ruby) regular expression
(instead of detex) on it which would write the nicely formatted desired
number to the output/log file. (I don't know if it's possible to use
the first approach for the second problem, but it doesn't make sense
to complicate things too much.)

As long as the command names are carefully chosen (and extensible if
the need for more complex behaviour arises in the future), that should
be about everything and it doesn't seem so difficult to implement
after all. (But I would write to the documentation that the resulting
numbers might change slightly in the future if the algorithm for
counting the words is improved.)

Any thoughts?

Thanks a lot,
     Mojca


* Re: counting the words in a TeX document
  2006-08-06  0:31   ` Mojca Miklavec
@ 2006-08-06 15:00     ` Hans Hagen
  2006-08-06 17:27     ` Aditya Mahajan
  1 sibling, 0 replies; 14+ messages in thread
From: Hans Hagen @ 2006-08-06 15:00 UTC (permalink / raw)


Mojca Miklavec wrote:
> but complementary) methods that might be sensible:
>
> a) ctxtools --wordcount filename[tex|pdf]
> to do the wordcount for the whole document using pdftotext + ruby regexp
>   
counting words in tex docs is not that hard:

it needs in addition:

delete all the words starting with \
delete everything (nested) between [ ]
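
A rough sketch of those two extra steps in ruby, under the assumption that the input is ordinary ConTeXt source without comments or verbatim material; the function names are made up for illustration:

```ruby
# Delete everything between [ and ], including nested pairs, by
# removing innermost pairs repeatedly until none are left.
def strip_brackets(text)
  result = text.dup
  loop { break unless result.gsub!(/\[[^\[\]]*\]/, " ") }
  result
end

# Crude word count for TeX source: drop bracketed arguments, control
# words (\foo) and control symbols (\%), and grouping braces.
def count_tex_words(source)
  text = strip_brackets(source)
  text = text.gsub(/\\[A-Za-z]+|\\./m, " ")
  text = text.delete("{}")
  text.split.size
end

sample = <<~'TEX'
  \startwordcount[abstract]
  \startframedtext
  Bla bla.
  \stopframedtext
  \stopwordcount
TEX

puts "abstract: #{count_tex_words(sample)} words"
# prints: abstract: 2 words
```

Punctuation is left attached to the words here, so "bla." counts as one word; the substitutions from the earlier script can be applied on top if needed.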

Hans

 



* Re: counting the words in a TeX document
  2006-08-06  0:31   ` Mojca Miklavec
  2006-08-06 15:00     ` Hans Hagen
@ 2006-08-06 17:27     ` Aditya Mahajan
  2006-08-07  8:24       ` Mojca Miklavec
  1 sibling, 1 reply; 14+ messages in thread
From: Aditya Mahajan @ 2006-08-06 17:27 UTC (permalink / raw)
  Cc: Benjamin Gorinsek

On Sun, 6 Aug 2006, Mojca Miklavec wrote:
> Based on those three answers I got a clearer idea of two (different,
> but complementary) methods that might be sensible:
>
> a) ctxtools --wordcount filename[tex|pdf]
> to do the wordcount for the whole document using pdftotext + ruby regexp
>
> b)
> \usemodule[wordcount]
>
> whatever
>
> \startstatistics[name][words|letters|lines]
> some more-or-less plain text
> \stopstatistics
>
> whatever
>
> and according to Aditya's idea, run a (ruby) regular expression
> (instead of detex) on it which would write the nicely formatted desired
> number to the output/log file. (I don't know if it's possible to use
> the first approach for the second problem, but it doesn't make sense
> to complicate things too much.)

If you have a script that counts words in a ConTeXt document, the
second approach is straightforward. Write everything to a buffer and
run the script on the buffer. However, such a mechanism will never be
perfect (or close to perfect) in the sense of parsing arbitrary input.

ftp://tug.ctan.org/pub/tex-archive/macros/plain/contrib/misc/xii.tex

But of course, you will not write anything like this in an abstract 
:-)

Aditya


* Re: counting the words in a TeX document
  2006-08-06 17:27     ` Aditya Mahajan
@ 2006-08-07  8:24       ` Mojca Miklavec
  2006-08-07  9:22         ` Hans Hagen
  0 siblings, 1 reply; 14+ messages in thread
From: Mojca Miklavec @ 2006-08-07  8:24 UTC (permalink / raw)
  Cc: Benjamin Gorinsek

On 8/6/06, Aditya Mahajan wrote:
> On Sun, 6 Aug 2006, Mojca Miklavec wrote:
> > Based on those three answers I got a clearer idea of two (different,
> > but complementary) methods that might be sensible:
> >
> > a) ctxtools --wordcount filename[tex|pdf]
> > to do the wordcount for the whole document using pdftotext + ruby regexp
> >
> > b)
> > \usemodule[wordcount]
> >
> > whatever
> >
> > \startstatistics[name][words|letters|lines]
> > some more-or-less plain text
> > \stopstatistics
> >
> > whatever
> >
> > and according to Aditya's idea, run a (ruby) regular expression
> > (instead of detex) on it which would write the nicely formatted desired
> > number to the output/log file. (I don't know if it's possible to use
> > the first approach for the second problem, but it doesn't make sense
> > to complicate things too much.)
>
> If you have a script that counts words in a ConTeXt document, the
> second approach is straightforward. Write everything to a buffer and
> run the script on the buffer. However, such a mechanism will never be
> perfect (or close to perfect) in the sense of parsing arbitrary input.

The most naive solution that I could think of (using a slightly
modified version of Hans's ruby script):

\unprotect

\def\startstatistics
  {\dodoubleempty\dostartstatistics}

\def\dostartstatistics[#1][#2]#3\stopstatistics
  {\setbuffer[#1]#3\endbuffer
    \executesystemcommand{ruby wordcount.rb \jobname-#1.tmp}%
    \getbuffer[#1]}

\protect \doifnotmode{demo}{\endinput}

... but a friend who asked me for a favour actually wants to use
abbreviations and bibliography as well, so only the first method (to
create PDF first) would work. He currently keeps copy-pasting the
resulting PDF to Word and uses Word's statistics to count the words
and/or characters for him.

But I guess that his wishes will have to wait for some more time in this case.

> ftp://tug.ctan.org/pub/tex-archive/macros/plain/contrib/misc/xii.tex
>
> But of course, you will not write anything like this in an abstract
> :-)

Nevertheless, I love the story (and esp. the document which creates it)!

All the best,
    Mojca


* Re: counting the words in a TeX document
  2006-08-07  8:24       ` Mojca Miklavec
@ 2006-08-07  9:22         ` Hans Hagen
  2006-08-07 18:54           ` Mojca Miklavec
  0 siblings, 1 reply; 14+ messages in thread
From: Hans Hagen @ 2006-08-07  9:22 UTC (permalink / raw)

Mojca Miklavec wrote:
>> ftp://tug.ctan.org/pub/tex-archive/macros/plain/contrib/misc/xii.tex
>>
>>     
yeah, a famous tex master piece!
>> But of course, you will not write anything like this in an abstract
>> :-)
>>     
hm, let me provide a word counter for that one before you get the idea to ask for it -) 

\starttext

\setbox0\vbox\bgroup  % \tracingall -) 
\forgetall \nohyphens \hsize1mm
\let\bye\egroup \bgroup
\let~\catcode~`76~`A13~`F1~`j00~`P2jdefA71F~`7113jdefPALLF
PA''FwPA;;FPAZZFLaLPA//71F71iPAHHFLPAzzFenPASSFthP;A$$FevP
A@@FfPARR717273F737271P;ADDFRgniPAWW71FPATTFvePA**FstRsamP
AGGFRruoPAqq71.72.F717271PAYY7172F727171PA??Fi*LmPA&&71jfi
Fjfi71PAVVFjbigskipRPWGAUU71727374 75,76Fjpar71727375Djifx
:76jelse&U76jfiPLAKK7172F71l7271PAXX71FVLnOSeL71SLRyadR@oL
RrhC?yLRurtKFeLPFovPgaTLtReRomL;PABB71 72,73:Fjif.73.jelse
B73:jfiXF71PU71 72,73:PWs;AMM71F71diPAJJFRdriPAQQFRsreLPAI
I71Fo71dPA!!FRgiePBt'el@ lTLqdrYmu.Q.,Ke;vz vzLqpip.Q.,tz;
;Lql.IrsZ.eap,qn.i. i.eLlMaesLdRcna,;!;h htLqm.MRasZ.ilk,%
s$;z zLqs'.ansZ.Ymi,/sx ;LYegseZRyal,@i;@ TLRlogdLrDsW,@;G
LcYlaDLbJsW,SWXJW ree @rzchLhzsW,;WERcesInW qt.'oL.Rtrul;e
doTsW,Wk;Rri@stW aHAHHFndZPpqar.tridgeLinZpe.LtYer.W,:jbye
\egroup

\newcounter\NOfLines
\beginshapebox \unvcopy0 \endshapebox
\reshapebox{\doglobal\increment\NOfLines}

\getnoflines{\ht0}

lines: \the\noflines
words: \NOfLines\par

% \unvbox0

\stoptext



* Re: counting the words in a TeX document
  2006-08-07  9:22         ` Hans Hagen
@ 2006-08-07 18:54           ` Mojca Miklavec
  2006-08-07 20:55             ` Hans Hagen
  0 siblings, 1 reply; 14+ messages in thread
From: Mojca Miklavec @ 2006-08-07 18:54 UTC (permalink / raw)


On 8/7/06, Hans Hagen wrote:
> Mojca Miklavec wrote:
> >> ftp://tug.ctan.org/pub/tex-archive/macros/plain/contrib/misc/xii.tex
> >>
> >>
> yeah, a famous tex master piece!
> >> But of course, you will not write anything like this in an abstract
> >> :-)
> >>
> hm, let me provide a word counter for that one before you get the idea to ask for it -)

Whom did you have in mind? I would never have thought about asking
such a question. ;)

> \starttext
>
> \setbox0\vbox\bgroup  % \tracingall -)
> \forgetall \nohyphens \hsize1mm
> \let\bye\egroup \bgroup
> \let~\catcode~`76~`A13~`F1~`j00~`P2jdefA71F~`7113jdefPALLF
> PA''FwPA;;FPAZZFLaLPA//71F71iPAHHFLPAzzFenPASSFthP;A$$FevP
> A@@FfPARR717273F737271P;ADDFRgniPAWW71FPATTFvePA**FstRsamP
> AGGFRruoPAqq71.72.F717271PAYY7172F727171PA??Fi*LmPA&&71jfi
> Fjfi71PAVVFjbigskipRPWGAUU71727374 75,76Fjpar71727375Djifx
> :76jelse&U76jfiPLAKK7172F71l7271PAXX71FVLnOSeL71SLRyadR@oL
> RrhC?yLRurtKFeLPFovPgaTLtReRomL;PABB71 72,73:Fjif.73.jelse
> B73:jfiXF71PU71 72,73:PWs;AMM71F71diPAJJFRdriPAQQFRsreLPAI
> I71Fo71dPA!!FRgiePBt'el@ lTLqdrYmu.Q.,Ke;vz vzLqpip.Q.,tz;
> ;Lql.IrsZ.eap,qn.i. i.eLlMaesLdRcna,;!;h htLqm.MRasZ.ilk,%
> s$;z zLqs'.ansZ.Ymi,/sx ;LYegseZRyal,@i;@ TLRlogdLrDsW,@;G
> LcYlaDLbJsW,SWXJW ree @rzchLhzsW,;WERcesInW qt.'oL.Rtrul;e
> doTsW,Wk;Rri@stW aHAHHFndZPpqar.tridgeLinZpe.LtYer.W,:jbye
> \egroup
>
> \newcounter\NOfLines
> \beginshapebox \unvcopy0 \endshapebox
> \reshapebox{\doglobal\increment\NOfLines}
>
> \getnoflines{\ht0}
>
> lines: \the\noflines
> words: \NOfLines\par
>
> % \unvbox0
>
> \stoptext

(I'll spare you the fun with sections for some other time,) but since
you reminded me that I might have some questions left, here you have
another one: how do I replace hyphens, en-dashes and em-dashes with
"spaces/line breaks"?
	 \catcode`~=13\let~=\space
does what I want, but none of the following works:
	 \def\-{\space}
	 \def-{\space}
	 \let\-=\space

Thanks to the magicians,
    Mojca


* Re: counting the words in a TeX document
  2006-08-07 18:54           ` Mojca Miklavec
@ 2006-08-07 20:55             ` Hans Hagen
  2006-08-07 21:31               ` Mojca Miklavec
  0 siblings, 1 reply; 14+ messages in thread
From: Hans Hagen @ 2006-08-07 20:55 UTC (permalink / raw)


Mojca Miklavec wrote:
> On 8/7/06, Hans Hagen wrote:
>   
>> Mojca Miklavec wrote:
>>     
>>>> ftp://tug.ctan.org/pub/tex-archive/macros/plain/contrib/misc/xii.tex
>>>>
>>>>
>>>>         
>> yeah, a famous tex master piece!
>>     
>>>> But of course, you will not write anything like this in an abstract
>>>> :-)
>>>>
>>>>         
>> hm, let me provide a word counter for that one before you get the idea to ask for it -)
>>     
>
> Whom did you have in mind? I would never have thought about asking
> such a question. ;)
>
>   
>> \starttext
>>
>> \setbox0\vbox\bgroup  % \tracingall -)
>> \forgetall \nohyphens \hsize1mm
>> \let\bye\egroup \bgroup
>> \let~\catcode~`76~`A13~`F1~`j00~`P2jdefA71F~`7113jdefPALLF
>> PA''FwPA;;FPAZZFLaLPA//71F71iPAHHFLPAzzFenPASSFthP;A$$FevP
>> A@@FfPARR717273F737271P;ADDFRgniPAWW71FPATTFvePA**FstRsamP
>> AGGFRruoPAqq71.72.F717271PAYY7172F727171PA??Fi*LmPA&&71jfi
>> Fjfi71PAVVFjbigskipRPWGAUU71727374 75,76Fjpar71727375Djifx
>> :76jelse&U76jfiPLAKK7172F71l7271PAXX71FVLnOSeL71SLRyadR@oL
>> RrhC?yLRurtKFeLPFovPgaTLtReRomL;PABB71 72,73:Fjif.73.jelse
>> B73:jfiXF71PU71 72,73:PWs;AMM71F71diPAJJFRdriPAQQFRsreLPAI
>> I71Fo71dPA!!FRgiePBt'el@ lTLqdrYmu.Q.,Ke;vz vzLqpip.Q.,tz;
>> ;Lql.IrsZ.eap,qn.i. i.eLlMaesLdRcna,;!;h htLqm.MRasZ.ilk,%
>> s$;z zLqs'.ansZ.Ymi,/sx ;LYegseZRyal,@i;@ TLRlogdLrDsW,@;G
>> LcYlaDLbJsW,SWXJW ree @rzchLhzsW,;WERcesInW qt.'oL.Rtrul;e
>> doTsW,Wk;Rri@stW aHAHHFndZPpqar.tridgeLinZpe.LtYer.W,:jbye
>> \egroup
>>
>> \newcounter\NOfLines
>> \beginshapebox \unvcopy0 \endshapebox
>> \reshapebox{\doglobal\increment\NOfLines}
>>
>> \getnoflines{\ht0}
>>
>> lines: \the\noflines
>> words: \NOfLines\par
>>
>> % \unvbox0
>>
>> \stoptext
>>     
>
> (I'll spare you the fun with sections for some other time,) but since
> you reminded me that I might have some questions left, here you have
> another one: how do I replace hyphens, en-dashes and em-dashes with
> "spaces/line breaks"?
> 	 \catcode`~=13\let~=\space
> does what I want, but none of the following works:
> 	 \def\-{\space}
> 	 \def-{\space}
> 	 \let\-=\space
>   
\catcode`-=\active \def-{ }

Hans 



* Re: counting the words in a TeX document
  2006-08-07 20:55             ` Hans Hagen
@ 2006-08-07 21:31               ` Mojca Miklavec
  2006-08-08  0:49                 ` Aditya Mahajan
  2006-08-08  7:54                 ` Hans Hagen
  0 siblings, 2 replies; 14+ messages in thread
From: Mojca Miklavec @ 2006-08-07 21:31 UTC (permalink / raw)


On 8/7/06, Hans Hagen wrote:
>
> > (I'll spare you the fun with sections for some other time,) but since
> > you reminded me that I might have some questions left, here you have
> > another one: how do I replace hyphens, en-dashes and em-dashes with
> > "spaces/line breaks"?
> >        \catcode`~=13\let~=\space
> > does what I want, but none of the following works:
> >        \def\-{\space}
> >        \def-{\space}
> >        \let\-=\space
> >
> \catcode`-=\active \def-{ }

I tried that one already, but it didn't work. Now I figured out that
it was because of nesting the definitions (perhaps even some
interference with negative numbers?), not because the definition
itself was wrong.

I'm sorry.

Mojca

(But my fear is that the whole problem is too complex anyway (tables,
...) to be solved elegantly.)

\long\def\startstatistics#1\stopstatistics
	{\setbox0\vbox\bgroup  % \tracingall -)
	 \forgetall \nohyphens \hsize1mm
	 % treat non-breakable space as a normal one
	 \catcode`~=13\let~=\space
	 % treat en-dash as a normal one
	 \catcode`-=\active \def-{ } % ERROR
	 \bgroup#1\egroup\egroup
	 \newcounter\NOfLines
	 \beginshapebox \unvcopy0 \endshapebox
	 \reshapebox{\doglobal\increment\NOfLines}
	 #1\crlf\unvbox0\crlf words: \NOfLines\crlf}

\starttext

\startstatistics
abc~def ghi-jkl -- mno --- prs
\stopstatistics

% works OK
%abc-def -- ghi --- jkl
% \catcode`-=\active \def-{ }
%abc-def -- ghi --- jkl

\stoptext


* Re: counting the words in a TeX document
  2006-08-07 21:31               ` Mojca Miklavec
@ 2006-08-08  0:49                 ` Aditya Mahajan
  2006-08-08  7:54                 ` Hans Hagen
  1 sibling, 0 replies; 14+ messages in thread
From: Aditya Mahajan @ 2006-08-08  0:49 UTC (permalink / raw)


On Mon, 7 Aug 2006, Mojca Miklavec wrote:

> On 8/7/06, Hans Hagen wrote:
>>
>>> (I'll spare you the fun with sections for some other time,) but since
>>> you reminded me that I might have some questions left, here you have
>>> another one: how do I replace hyphens, en-dashes and em-dashes with
>>> "spaces/line breaks"?
>>>        \catcode`~=13\let~=\space
>>> does what I want, but none of the following works:
>>>        \def\-{\space}
>>>        \def-{\space}
>>>        \let\-=\space
>>>
>> \catcode`-=\active \def-{ }
>
> I tried that one already, but it didn't work. I have now figured out
> that the cause was nesting the definitions (perhaps even some
> interference with negative numbers?), not a wrong definition in
> itself.
>
> I'm sorry.
>
> Mojca
>
> (But my fear is that the whole problem is too complex anyway (tables,
> ...) to be solved elegantly.)

You should not be writing tables in abstracts!

Here is my attempt. It seems to work correctly for simple text,
references, simple markup, etc. Try anything too fancy and you are in
trouble. I changed the names to \startstats and \stopstats, as I kept
mistyping \startstatistics :-).

\starttext

\bgroup

\catcode`~=\active
\catcode`-=\active

\gdef\ignorestats%
   {% treat non-breakable space as a normal one
     \catcode`~=\active
     \let~=\space
     % treat endash, emdash and - as normal space
     \catcode`-=\active
     \def-{ }
     %\setupframed[align=normal]%Frames do not work correctly
   }

\gdef\startdostats%
   {\bgroup
     \setbox0\vbox\bgroup  % \tracingall -)
     \forgetall \nohyphens \hsize1mm}


\gdef\stopdostats%
    {\egroup
     \newcounter\NOfLines
     \dontcomplain %Why do I still get overfull \hbox warnings
     \beginshapebox \unvcopy0 \endshapebox
     \reshapebox{\doglobal\increment\NOfLines}
     \getnoflines{\ht0}
     \unvbox0 %Uncomment for debug
     \par lines: \the\noflines\space
     words: \NOfLines\par\egroup}

\long\gdef\startstats#1\stopstats%
   {\bgroup\ignorestats
   \startdostats\scantokens{#1}\stopdostats\egroup}

\egroup

\def\ShowStats#1{\hairline#1\par\startstats#1\stopstats}

\ShowStats{abc~def ghi-jkl -- mno --- prs}

\ShowStats{abc-def -- ghi --- jkl}

\ShowStats{a, b}

\section[a]{one}

\ShowStats{We do some great things in \in{section}[a]}
% I do not know the internals, but section 1 seems unbreakable

\ShowStats{$a=b$} %What did you expect? It may be possible to treat
                   %each math token as mathord and allow it to break
                   %but that will not give any better results.

\startbuffer
This is a test
\stopbuffer
\ShowStats{\getbuffer}

\ShowStats{\startformula a = b + c \stopformula}

\ShowStats{\framed{This is a test}}

\ShowStats{\starthiding Another test \stophiding Does this work?}
% Buffers do not work and fail silently.

\ShowStats{This is {\bf Bold} and {\it Italic}}

\ShowStats{\input tufte}



\stoptext


Aditya


* Re: counting the words in a TeX document
  2006-08-07 21:31               ` Mojca Miklavec
  2006-08-08  0:49                 ` Aditya Mahajan
@ 2006-08-08  7:54                 ` Hans Hagen
  1 sibling, 0 replies; 14+ messages in thread
From: Hans Hagen @ 2006-08-08  7:54 UTC (permalink / raw)


Mojca Miklavec wrote:
> On 8/7/06, Hans Hagen wrote:
>   
>>> (I'll spare you the fun with sections for some other time,) but since
>>> you reminded me that I might have some questions left, here you have
>>> another one: how do I replace hyphens, en-dashes and em-dashes with
>>> "spaces/line breaks"?
>>>        \catcode`~=13\let~=\space
>>> does what I want, but none of the following works:
>>>        \def\-{\space}
>>>        \def-{\space}
>>>        \let\-=\space
>>>
>>>       
>> \catcode`-=\active \def-{ }
>>     
>
> I tried that one already, but it didn't work. I have now figured out
> that the cause was nesting the definitions (perhaps even some
> interference with negative numbers?), not a wrong definition in
> itself.
>
> I'm sorry.
>
> Mojca
>
> (But my fear is that the whole problem is too complex anyway (tables,
> ....) to be solved elegantly.)
>   
I'm nearly 100% sure that you will never manage to make that work;
if you want to count words, you need to intercept them at the input
stage and/or interpret the output.

Hans

-- 

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------
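
A rough sketch of what counting at the input stage could look like,
assuming plain TeX macros only; \countwords, \countnextword,
\wordcount and the sentinel macros are hypothetical names, not part of
ConTeXt. It splits its argument at spaces and writes the count to the
terminal and log. A brace group such as {\bf two words} counts as one
word, a lone "--" counts as a word, and math and tables are not
handled, so the result is only an approximation.

\newcount\wordcount
\let\wordsentinel\relax                 % end marker, never expanded
\def\sentinelword{\wordsentinel{}}      % used only for the \ifx comparison

\long\def\countwords#1%
  {\wordcount=0
   \countnextword #1 \wordsentinel{} %  append the end marker and a final space
   \message{words: \the\wordcount}}

\long\def\countnextword#1 %             % #1 runs up to the next space
  {\def\thisword{#1}%
   \ifx\thisword\sentinelword
     \let\next\relax                    % end marker reached: stop
   \else
     \ifx\thisword\empty\else \advance\wordcount 1 \fi
     \let\next\countnextword            % otherwise look at the next word
   \fi
   \next}

With these definitions, \countwords{Bla bla.} reports "words: 2". The
argument is only counted, not typeset.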


Thread overview: 14+ messages
2006-08-05 16:45 counting the words in a TeX document Mojca Miklavec
2006-08-05 17:02 ` Aditya Mahajan
2006-08-05 17:52 ` gnwiii
2006-08-05 20:07 ` Hans Hagen
2006-08-06  0:31   ` Mojca Miklavec
2006-08-06 15:00     ` Hans Hagen
2006-08-06 17:27     ` Aditya Mahajan
2006-08-07  8:24       ` Mojca Miklavec
2006-08-07  9:22         ` Hans Hagen
2006-08-07 18:54           ` Mojca Miklavec
2006-08-07 20:55             ` Hans Hagen
2006-08-07 21:31               ` Mojca Miklavec
2006-08-08  0:49                 ` Aditya Mahajan
2006-08-08  7:54                 ` Hans Hagen
