Extending lua wordcount filter to count specific parts of text

public inbox archive for pandoc-discuss@googlegroups.com

From: h gv @ 2020-08-17 19:42 UTC
  To: pandoc-discuss


I'd like to extend the Lua wordcount filter to tell me a bit more about 
specific parts of my text: how many words are in the footnotes and how 
many words are in "original quotations," which I mark off with the 
<qu></qu> tag in my markdown (and which I then strip later via another 
filter for certain versions). I got the footnote part to work but can't 
figure out the RawInline HTML part. Any guidance would be appreciated.

Here's my filter, followed by a simple markdown doc and the results:

```
-- counts words in a document 

words = 0 
notewords = 0
quotewords = 0
notenoquotewords = 0
noquotewords = 0

wordcount = { 
  
    Note = function(el)
        pandoc.walk_inline(el, {
            Str = function(el) 
                if el.text:match("%P") then 
                    notewords = notewords + 1
                end 
            end })
    end,

    RawInline = function(el)
        if el.text == '<qu>' then
            pandoc.walk_inline(el, {
                Str = function(el)
                    if el.text:match("%P") then 
                        quotewords = quotewords + 1
                    end 
            end })
        end
    end,

    Str = function(el) 
        -- we don't count a word if it's entirely punctuation: 
        if el.text:match("%P") then 
            words = words + 1 
        end 
    end, 

    Code = function(el) 
        _,n = el.text:gsub("%S+","") 
        words = words + n 
    end, 

    CodeBlock = function(el) 
        _,n = el.text:gsub("%S+","") 
        words = words + n 
    end 
} 

function Pandoc(el) 
    -- skip metadata, just count body: 
    pandoc.walk_block(pandoc.Div(el.blocks), wordcount) 
    mainwords = words - notewords
    notenoquotewords = notewords - quotewords
    noquotewords = words - quotewords
    print(words .. " total words")
    print(mainwords .. " words in main text") 
    print(notewords .. " words in notes")
    print(noquotewords .. " total words minus original quotes")
    print(quotewords .. " words in original quotes")
    print (notenoquotewords .. " words in notes minus original quotes")
    os.exit(0) 
end
```

test.md, a minimal working example:
```
Suspendisse malesuada venenatis mauris. Curabitur ornare mollis velit. Sed 
vitae metus.
"Morbi posuere mi id odio."[^1]

[^1]: Citation. <qu>("Original quotation here.")</qu>
```
`pandoc --lua-filter wordcount.lua test.md`

> 20 total words
> 16 words in main text
> 4 words in notes
> 20 total words minus original quotes
> 0 words in original quotes
> 4 words in notes minus original quotes

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/49b04b07-285b-47f5-8b6b-b123db559b07o%40googlegroups.com.


* Re: Extending lua wordcount filter to count specific parts of text
From: hgvhgvhgv @ 2020-08-19 21:02 UTC
  To: pandoc-discuss


Now I understand why what I wrote doesn't work: a RawInline object holds 
only the tag itself (what's inside the <>) and isn't a traversable 
container like Note. But after reading through the list archives, I'm not 
sure whether I can do what I want with two arbitrary tags as RawInline 
objects. It would be easier if my <qu> tags were <span class="qu"> (though 
less ideal from a readability standpoint). I can convert them to spans 
this way (
https://groups.google.com/g/pandoc-discuss/c/yQjvOhIQ40A/m/RclMzdtiCAAJ). 
But then it seems I would have to go back to markdown to get the Lua 
filter to recognize them as Span objects? Or is there some way to do it 
all in one pass? Maybe a completely different approach is necessary 
(somehow collecting what's between the two RawInline tags into a table or 
list?). Sorry for my obtuse first attempts.
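
The "collect what's between the two RawInline tags" idea might look 
roughly like the sketch below. It is not a tested filter. The helper names 
count_words and count_quoted are made up for illustration, and plain Lua 
tables with `t`, `text`, and `content` fields stand in for pandoc inline 
elements so the logic can be run without pandoc. It also assumes that 
<qu> and </qu> always appear in the same list of inlines (the same 
paragraph, for example) and that quoted spans are not nested.

```lua
-- Sketch only: elements are assumed to expose `t`, `text`, and `content`
-- the way pandoc inline elements do; plain tables are used here instead.

-- Count Str elements that contain at least one non-punctuation character,
-- descending into containers (Quoted, Emph, ...) through `content`.
local function count_words(el)
  if el.t == "Str" then
    return el.text:match("%P") and 1 or 0
  end
  local n = 0
  if type(el.content) == "table" then
    for _, child in ipairs(el.content) do
      n = n + count_words(child)
    end
  end
  return n
end

-- Scan one flat list of inlines and count the words that sit between a
-- RawInline "<qu>" and the next RawInline "</qu>".
local function count_quoted(inlines)
  local n, inside = 0, false
  for _, el in ipairs(inlines) do
    if el.t == "RawInline" and el.text == "<qu>" then
      inside = true
    elseif el.t == "RawInline" and el.text == "</qu>" then
      inside = false
    elseif inside then
      n = n + count_words(el)
    end
  end
  return n
end

-- Hand-built stand-in for the footnote text in test.md, assuming the
-- double quotes are parsed into a Quoted element.
local note = {
  { t = "Str", text = "Citation." },
  { t = "Space" },
  { t = "RawInline", format = "html", text = "<qu>" },
  { t = "Str", text = "(" },
  { t = "Quoted", content = {
      { t = "Str", text = "Original" }, { t = "Space" },
      { t = "Str", text = "quotation" }, { t = "Space" },
      { t = "Str", text = "here." } } },
  { t = "Str", text = ")" },
  { t = "RawInline", format = "html", text = "</qu>" },
}

print(count_quoted(note) .. " words in original quotes")
-- prints "3 words in original quotes"
```

Inside a real filter, count_quoted could presumably be called from an 
`Inlines` function (available in recent pandoc versions) in place of the 
RawInline handler, adding its result to quotewords. That wiring is an 
assumption and has not been checked against the filter above.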

