* PDF document statistics (character count incl. spaces)?
From: Jörg Weger @ 2015-02-01 8:31 UTC
To: mailing list for ConTeXt users

Is there a way to report the “character count including spaces” of the resulting PDF in ConTeXt?

Greetings
Jörg

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________
* Re: PDF document statistics (character count incl. spaces)?
From: Aditya Mahajan @ 2015-02-01 19:11 UTC
To: mailing list for ConTeXt users

On Sun, 1 Feb 2015, Jörg Weger wrote:

> Is there a way to report the “character count including spaces” of the
> resulting PDF in ConTeXt?

Given that these counts are never accurate, how about

    pdftotext filename

followed by

    wc filename

Aditya
* Re: PDF document statistics (character count incl. spaces)?
From: Jörg Weger @ 2015-02-01 21:06 UTC
To: mailing list for ConTeXt users, Aditya Mahajan

Does the character count that “wc --char <textfile>” returns include blank spaces or not? (That is important for me.) “man wc” doesn’t say.

I had hoped there was a better way than editing the output of “pdftotext” in my text editor or in LibreOffice Writer, both of which can do the count I need, by deleting unnecessary carriage returns and spaces with regular-expression searches. In fact I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …

Greetings
Jörg

On 01.02.2015 20:11, Aditya Mahajan wrote:
> On Sun, 1 Feb 2015, Jörg Weger wrote:
>
>> Is there a way to report the “character count including spaces” of the
>> resulting PDF in ConTeXt?
>
> Given that these counts are never accurate, how about
>
> pdftotext filename
>
> followed by
>
> wc filename
>
> Aditya
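On the question of spaces: `wc -m` (long form `--chars` in GNU coreutils) counts every character in the file, spaces and newlines included, while `wc -c` counts bytes. The clean-up described above (collapsing the hard line breaks and repeated spaces that pdftotext leaves behind) can be sketched in Python. This is a minimal sketch, not anything ConTeXt provides; the function name `count_characters` and the sample string are hypothetical, and treating every whitespace run as a single space is an assumption about what “including spaces” should mean.

```python
import re


def count_characters(text: str) -> dict:
    """Count characters with and without spaces after normalizing whitespace."""
    # Assumption: any run of spaces, tabs, newlines or form feeds
    # stands for exactly one space between words.
    normalized = re.sub(r"\s+", " ", text).strip()
    with_spaces = len(normalized)
    # Remove the remaining single spaces to get the count without spaces.
    without_spaces = len(normalized.replace(" ", ""))
    return {"with_spaces": with_spaces, "without_spaces": without_spaces}


if __name__ == "__main__":
    # Hypothetical sample imitating text extracted from a PDF.
    sample = "D\u0101r is the Arabic\nword  for home.\n"
    print(count_characters(sample))
```

Note that `len` counts Unicode code points, so an accented letter stored as base letter plus combining mark would count as two characters.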
* Re: PDF document statistics (character count incl. spaces)?
From: Wolfgang Schuster @ 2015-02-01 21:12 UTC
To: mailing list for ConTeXt users

> On 01.02.2015 at 22:06, Jörg Weger <joerg73.muc@googlemail.com> wrote:
>
> Does the character count that “wc --char <textfile>” returns include blank spaces or not? (That is important for me.) “man wc” doesn’t say.
>
> I had hoped there was a better way than editing the output of “pdftotext” in my text editor or in LibreOffice Writer, both of which can do the count I need, by deleting unnecessary carriage returns and spaces with regular-expression searches. In fact I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …

ConTeXt has an option to count the words in a document (you find the result in <jobname>.words), but words shorter than four letters aren’t taken into account.

\setupspellchecking[state=start,method=2]

\starttext
\input knuth
\stoptext

Wolfgang
* Re: PDF document statistics (character count incl. spaces)?
From: Idris Samawi Hamid ادريس سماوي حامد @ 2015-02-01 21:32 UTC
To: mailing list for ConTeXt users

On Sun, 01 Feb 2015 14:12:54 -0700, Wolfgang Schuster <schuster.wolfgang@gmail.com> wrote:

> \setupspellchecking[state=start,method=2]
> \starttext
> \input knuth
> \stoptext

Slightly off-topic: Just as Wolfgang's reply came in I was setting up a new version of

http://tinyspell.com/

Editor-based spell-checkers are usually not very useful (although some LaTeX-centric editors are pretty good at it).

I never knew about \setupspellchecking before now. Perhaps it could evolve into something very useful.

Part of spell-checking involves getting uppercase vs. lowercase right. I see that the .words output of \setupspellchecking ignores case and treats '-' (the simple dash) as a word separator. I'd like to see this evolve into something more precise.

> words shorter than four letters aren’t taken into account.

I get *some* words shorter than four letters in the output, so there must be some other logic going on...

Thanks for pointing out this utility, Wolfgang, and

Best wishes
Idris

--
Idris Samawi Hamid
Professor of Philosophy
Colorado State University
Fort Collins, CO 80523
* Re: PDF document statistics (character count incl. spaces)?
From: Wolfgang Schuster @ 2015-02-01 22:11 UTC
To: mailing list for ConTeXt users

> On 01.02.2015 at 22:32, Idris Samawi Hamid ادريس سماوي حامد <ishamid@colostate.edu> wrote:
>
>> words shorter than four letters aren’t taken into account.
>
> I get *some* words shorter than four letters in the output, so there must be some other logic going on…

Do you have a few examples?

Wolfgang
* Re: PDF document statistics (character count incl. spaces)?
From: Idris Samawi Hamid ادريس سماوي حامد @ 2015-02-01 22:27 UTC
To: mailing list for ConTeXt users

On Sun, 01 Feb 2015 15:11:48 -0700, Wolfgang Schuster <schuster.wolfgang@gmail.com> wrote:

>> On 01.02.2015 at 22:32, Idris Samawi Hamid ادريس سماوي حامد
>> <ishamid@colostate.edu> wrote:
>>
>>> words shorter than four letters aren’t taken into account.
>>
>> I get *some* words shorter than four letters in the output, so there
>> must be some other logic going on…
>
> Do you have a few examples?

A quick one:

=======
\setupspellchecking[state=start,method=2]

\starttext
Dār is the Arabic word for home.
\stoptext
=======

--
Idris Samawi Hamid
Professor of Philosophy
Colorado State University
Fort Collins, CO 80523
* Re: PDF document statistics (character count incl. spaces)?
From: Keith Schultz @ 2015-02-02 9:20 UTC
To: mailing list for ConTeXt users

Hello All,

As a linguist, I can say that not counting the shorter words is an absolute NO-GO for an accurate word count, and thereby for an accurate character count!

See below for a non-representative proof!

> On 01.02.2015 at 22:12, Wolfgang Schuster <schuster.wolfgang@gmail.com> wrote:
>
[snip, snip]
> ConTeXt has an option to count the words in a document (you find the result in <jobname>.words),
> but words shorter than four letters aren’t taken into account.

word length under 4 characters : 10
word length >= 4 characters    : 20

Here you are missing a third of the words! That is about 33%.

regards
Keith
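The arithmetic behind this tally is simple: 10 short words out of 10 + 20 = 30 words is one third. A small Python sketch can produce the same kind of tally for any text. The helper name `short_word_share` is hypothetical, and the regex-based notion of a “word” (a run of letters) is an assumption that need not match how ConTeXt's spell checker splits words.

```python
import re
from typing import Tuple


def short_word_share(text: str, min_length: int = 4) -> Tuple[int, int, float]:
    """Return (short words, other words, share of short words)."""
    # Assumption: a word is a maximal run of Unicode letters.
    words = re.findall(r"[^\W\d_]+", text)
    short = sum(1 for word in words if len(word) < min_length)
    other = len(words) - short
    # Avoid dividing by zero for empty input.
    share = short / len(words) if words else 0.0
    return short, other, share


if __name__ == "__main__":
    short, other, share = short_word_share("D\u0101r is the Arabic word for home.")
    print(f"under 4 letters: {short}, 4 or more: {other}, share: {share:.0%}")
```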
* Re: PDF document statistics (character count incl. spaces)?
From: Alan BRASLAU @ 2015-02-02 15:39 UTC
To: Keith Schultz
Cc: mailing list for ConTeXt users

On Mon, 2 Feb 2015 10:20:15 +0100 Keith Schultz <keithjschultz@icloud.com> wrote:

> Hello All,
>
> As a linguist, I can say that not counting the shorter words is
> an absolute NO-GO for an accurate word count, and thereby for an
> accurate character count!
>
> See below for a non-representative proof!
>
> > On 01.02.2015 at 22:12, Wolfgang Schuster
> > <schuster.wolfgang@gmail.com> wrote:
> >
> [snip, snip]
> > ConTeXt has an option to count the words in a document (you find
> > the result in <jobname>.words), but words shorter than four
> > letters aren’t taken into account.
> word length under 4 characters : 10
> word length >= 4 characters    : 20
>
> Here you are missing a third of the words! That is about 33%.
>
> regards
> Keith

See also: Zipf, G. K. (1949), "Human Behavior and the Principle of Least Effort", Cambridge, MA: Addison-Wesley; in particular, Chapter 2: On the Economy of Words.

As well as: Shannon, C. E. (1951), "The redundancy of English", Cybernetics, 248-272.

54% for English, so we can afford to be sloppy (wch s wy txt compr qte ll).

Alan
* Re: PDF document statistics (character count incl. spaces)?
From: Hans Hagen @ 2015-02-02 16:55 UTC
To: ntg-context

On 2/2/2015 4:39 PM, Alan BRASLAU wrote:

>>> ConTeXt has an option to count the words in a document (you find
>>> the result in <jobname>.words), but words shorter than four
>>> letters aren’t taken into account.
>> word length under 4 characters : 10
>> word length >= 4 characters    : 20
>>
>> Here you are missing a third of the words! That is about 33%.

this feature relates to (simple) spell checking and collecting words for dedicated spell check lists, and 4 chars is nearly always a valid word, which is why we discard them

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
* Re: PDF document statistics (character count incl. spaces)?
From: Alan BRASLAU @ 2015-02-03 3:19 UTC
To: Hans Hagen
Cc: ntg-context

On Mon, 2 Feb 2015 17:55:35 +0100 Hans Hagen <pragma@wxs.nl> wrote:

> this feature relates to (simple) spell checking and collecting words
> for dedicated spell check lists, and 4 chars is nearly always a valid
> word, which is why we discard them

English is rich in "four-letter words"!

Alan ;-)
* Re: PDF document statistics (character count incl. spaces)?
From: Hans Hagen @ 2015-02-01 23:56 UTC
To: ntg-context

On 2/1/2015 10:06 PM, Jörg Weger wrote:

> Does the character count that “wc --char <textfile>” returns include
> blank spaces or not? (That is important for me.) “man wc” doesn’t say.
>
> I had hoped there was a better way than editing the output of
> “pdftotext” in my text editor or in LibreOffice Writer, both of which
> can do the count I need, by deleting unnecessary carriage returns and
> spaces with regular-expression searches. In fact I had hoped
> that ConTeXt was able to count the characters and spaces it renders to
> PDF (is that theoretically possible?) …

it's not too hard so maybe when i'm bored or see a good reason ..

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
* Re: PDF document statistics (character count incl. spaces)?
From: Jörg Weger @ 2015-02-02 22:39 UTC
To: mailing list for ConTeXt users, Hans Hagen

So I hope you might get bored once in a while before I have to write my bachelor's thesis :)

Greetings
Jörg

On 02.02.2015 00:56, Hans Hagen wrote:
> On 2/1/2015 10:06 PM, Jörg Weger wrote:
>> In fact I had hoped
>> that ConTeXt was able to count the characters and spaces it renders to
>> PDF (is that theoretically possible?) …
>
> it's not too hard so maybe when i'm bored or see a good reason ..
>
> Hans
* Re: PDF document statistics (character count incl. spaces)?
From: Marcin Borkowski @ 2015-02-02 20:45 UTC
To: mailing list for ConTeXt users

On 2015-02-01, at 22:06, Jörg Weger <joerg73.muc@googlemail.com> wrote:

> Does the character count that “wc --char <textfile>” returns include
> blank spaces or not? (That is important for me.) “man wc” doesn’t say.
>
> I had hoped there was a better way than editing the output of
> “pdftotext” in my text editor or in LibreOffice Writer, both of which
> can do the count I need, by deleting unnecessary carriage returns and
> spaces with regular-expression searches. In fact I had hoped
> that ConTeXt was able to count the characters and spaces it renders to
> PDF (is that theoretically possible?) …

I am pretty sure that you can make sed filter out blank characters. So then you can just chain pdftotext, sed and wc.

OTOH, here's a relevant question (and a simple answer) on SO. (It seems to count newlines, though.)

JFF, I've just coded this in Emacs Lisp:

--8<---------------cut here---------------start------------->8---
;; Count non-blank characters in a buffer
(defun how-many-visible-chars ()
  "Count visible (i.e., other than spaces, tabs and newlines) characters in the buffer."
  (interactive)
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (unless (looking-at-p "[ \t\n]")
          (setq count (1+ count)))
        (forward-char)))
    (message "%d visible characters" count)))
--8<---------------cut here---------------end--------------->8---

It's terribly unoptimized, but I ran it on a 300+ kB file on my low-end netbook and it ran in something like 2 seconds, so it's not that bad in practice. Also, it's not well-coded: it should e.g. return the number instead of displaying the message when called non-interactively, it might take active region into account etc. - but as a proof-of-concept, it works surprisingly well (i.e., fast).

> Greetings Jörg

Best,

--
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University
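The chain suggested here (extract the text, filter out blanks, count) can also be sketched in Python, returning the number rather than printing a message. This is only a sketch under stated assumptions: it presumes the Poppler/Xpdf `pdftotext` tool is on the PATH (called with `-` as the output file so the text goes to standard output), the function names are hypothetical, and it skips every whitespace character via `str.isspace()`, which is broader than the space/tab/newline set of the Emacs Lisp version because pdftotext emits form feeds between pages.

```python
import subprocess
from typing import Callable


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text (assumes pdftotext is installed)."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],  # "-" sends text to stdout
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")


def count_visible_chars(text: str) -> int:
    """Count characters that are not whitespace (spaces, tabs, newlines, form feeds)."""
    return sum(1 for ch in text if not ch.isspace())


def count_pdf_visible_chars(
    pdf_path: str, extractor: Callable[[str], str] = extract_text
) -> int:
    """Chain extraction and counting; the extractor can be swapped out."""
    return count_visible_chars(extractor(pdf_path))


# Usage (hypothetical file name; requires pdftotext):
#     print(count_pdf_visible_chars("thesis.pdf"))
```

Passing the extractor as a parameter keeps the counting logic usable on text from any source, for example a file already produced by pdftotext.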
Thread overview: 14+ messages

2015-02-01  8:31 PDF document statistics (character count incl. spaces)? Jörg Weger
2015-02-01 19:11 ` Aditya Mahajan
2015-02-01 21:06   ` Jörg Weger
2015-02-01 21:12     ` Wolfgang Schuster
2015-02-01 21:32       ` Idris Samawi Hamid ادريس سماوي حامد
2015-02-01 22:11         ` Wolfgang Schuster
2015-02-01 22:27           ` Idris Samawi Hamid ادريس سماوي حامد
2015-02-02  9:20       ` Keith Schultz
2015-02-02 15:39         ` Alan BRASLAU
2015-02-02 16:55           ` Hans Hagen
2015-02-03  3:19             ` Alan BRASLAU
2015-02-01 23:56     ` Hans Hagen
2015-02-02 22:39       ` Jörg Weger
2015-02-02 20:45     ` Marcin Borkowski