public inbox archive for pandoc-discuss@googlegroups.com
* lua filter to count words in a document
@ 2017-11-13 16:55 John MACFARLANE
  2017-12-19  1:08 ` Greg Tucker-Kellogg
  0 siblings, 1 reply; 5+ messages in thread
From: John MACFARLANE @ 2017-11-13 16:55 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

This lua filter can be used to count the words in a
document, in any format pandoc can read.  It omits
metadata words (title, abstract, authors), and of course
it ignores all of the non-content words like HTML tags,
LaTeX commands, the `#` that marks an ATX header, and
so on.

To use, save as wordcount.lua and do

    pandoc --lua-filter wordcount.lua inputfile

```lua
-- counts words in a document

local words = 0

-- element handlers applied while walking the document body
local wordcount = {
  Str = function(el)
    -- we don't count a word if it's entirely punctuation:
    if el.text:match("%P") then
        words = words + 1
    end
  end,

  Code = function(el)
    -- count whitespace-separated chunks of inline code
    local _, n = el.text:gsub("%S+","")
    words = words + n
  end,

  CodeBlock = function(el)
    -- same rule for code blocks
    local _, n = el.text:gsub("%S+","")
    words = words + n
  end
}

function Pandoc(el)
    -- skip metadata, just count body:
    pandoc.walk_block(pandoc.Div(el.blocks), wordcount)
    print(words .. " words in body")
    os.exit(0)
end
```
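
The two counting rules in the filter can be tried without pandoc. The following is a minimal plain-Lua sketch, not part of the filter itself: `count_str`, `count_code`, and `count_prose` are hypothetical helper names, and splitting a line on whitespace only approximates the `Str` tokens that pandoc would actually produce.

```lua
-- Plain-Lua sketch of the filter's counting rules (no pandoc needed).
-- All function names here are hypothetical, chosen for illustration.

-- Rule used for Str elements: a token counts as one word unless it
-- consists entirely of punctuation ("%P" matches any non-punctuation).
local function count_str(token)
  if token:match("%P") then
    return 1
  end
  return 0
end

-- Rule used for Code and CodeBlock: count whitespace-separated chunks.
-- gsub returns the number of substitutions as its second result.
local function count_code(text)
  local _, n = text:gsub("%S+", "")
  return n
end

-- Approximate pandoc's tokenization by splitting on whitespace.
local function count_prose(line)
  local total = 0
  for token in line:gmatch("%S+") do
    total = total + count_str(token)
  end
  return total
end

print(count_prose("Hello , world -- again !"))  --> 3
print(count_code("x = y + 1"))                  --> 5
```

Note the asymmetry: in prose, free-standing punctuation such as `,` or `--` is skipped, while in code every chunk counts, including operators like `=` and `+`.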



* Re: lua filter to count words in a document
  2017-11-13 16:55 lua filter to count words in a document John MACFARLANE
@ 2017-12-19  1:08 ` Greg Tucker-Kellogg
       [not found]   ` <d39a7269-42e4-4ca4-8f2b-946415d174af-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Greg Tucker-Kellogg @ 2017-12-19  1:08 UTC (permalink / raw)
  To: pandoc-discuss



This filter appears to fail when the metadata includes structured
elements, such as multiple authors with keys. For example, the document
below fails when the metadata block is included (but not if the metadata
block is a simple list of authors).

The error message is

    Error running filter wordcount.lua:
    attempt to call a nil value

and the exit status is 83.


    ---
    title: The document title
    author:
    - name: Author One
      affiliation: University of Somewhere
    - name: Author Two
      affiliation: University of Nowhere
    ---

    # This document has a few words (14)

    This is a test of some words


-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/d39a7269-42e4-4ca4-8f2b-946415d174af%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: lua filter to count words in a document
       [not found]   ` <d39a7269-42e4-4ca4-8f2b-946415d174af-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-12-19  6:22     ` John MacFarlane
       [not found]       ` <20171219062247.GB765-9Rnp8PDaXcadBw3G0RLmbRFnWt+6NQIA@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: John MacFarlane @ 2017-12-19  6:22 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

@tarleb, any idea what is happening here?




* Re: lua filter to count words in a document
       [not found]       ` <20171219062247.GB765-9Rnp8PDaXcadBw3G0RLmbRFnWt+6NQIA@public.gmane.org>
@ 2017-12-19  8:26         ` Albert Krewinkel
       [not found]           ` <87wp1jt978.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Albert Krewinkel @ 2017-12-19  8:26 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Mea culpa.  I broke MetaMaps when adding the List feature. Should be
fixed in master now.

As a quick fix, you can simply add a file `init.lua` to your pandoc data
directory (usually `~/.pandoc`) in which you re-add the function to the
pandoc module:


    pandoc = require 'pandoc'
    pandoc.mediabag = require 'pandoc.mediabag'

    pandoc.MetaMap = pandoc.MetaValue:create_constructor(
      "MetaMap",
      function (mm) return mm end
    )

That should fix it. Sorry for the inconvenience.
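
The reported message, "attempt to call a nil value", is what Lua raises when code calls a table field that does not exist. The following plain-Lua sketch illustrates the failure and the shape of the workaround. It assumes a hypothetical table `mod` standing in for the `pandoc` module, and a trivial pass-through constructor; it does not use pandoc's real `create_constructor`.

```lua
-- Hypothetical stand-in for a module table that has lost a constructor.
local mod = {}

-- Calling the missing field raises a runtime error; pcall captures it.
local ok, err = pcall(function() return mod.MetaMap({}) end)
print(ok)   --> false
print(err)  -- exact wording differs between Lua versions

-- Workaround in the same spirit as the init.lua above:
-- re-add the function only if it is absent.
if mod.MetaMap == nil then
  mod.MetaMap = function(mm) return mm end
end

-- The same call now succeeds and returns the table unchanged.
local ok2, value = pcall(function()
  return mod.MetaMap({ title = "x" })
end)
print(ok2, value.title)  --> true    x
```

Guarding with `== nil` means the stand-in is skipped once a fixed release supplies the real function.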



-- 
Albert Krewinkel
GPG: 8eed e3e2 e8c5 6f18 81fe  e836 388d c0b2 1f63 1124



* Re: lua filter to count words in a document
       [not found]           ` <87wp1jt978.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
@ 2017-12-19 14:10             ` Greg Tucker-Kellogg
  0 siblings, 0 replies; 5+ messages in thread
From: Greg Tucker-Kellogg @ 2017-12-19 14:10 UTC (permalink / raw)
  To: pandoc-discuss



Thanks for such a quick and helpful reply!


