Re: Extending lua wordcount filter to count specific parts of text

From: hgvhgvhgv <jbauchner-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: Extending lua wordcount filter to count specific parts of text
Date: Wed, 19 Aug 2020 14:02:10 -0700 (PDT)
Message-ID: <e516c89b-05fc-4607-9237-98d2d01577c0n@googlegroups.com>
In-Reply-To: <49b04b07-285b-47f5-8b6b-b123db559b07o-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>


Now I understand why what I wrote doesn't work: a RawInline object holds
only the tag itself (what's inside the <>) and isn't a traversable
container like Note. But from reading through the list archives, I'm not
sure I can do what I want with two arbitrary tags that are parsed as
separate RawInline objects. It would be easier if my <qu> tags were
<span class="qu"> (though less ideal from a readability standpoint). I can
convert them to spans in this way (
https://groups.google.com/g/pandoc-discuss/c/yQjvOhIQ40A/m/RclMzdtiCAAJ).
But then it seems I would have to write the result back out to markdown and
read it in again before the Lua filter recognizes them as Span objects? Or
is there some way to do it all in one pass? Maybe a completely different
approach is necessary (somehow collecting what's between the two RawInline
tags into a table or list?). Sorry for my obtuse first attempts.
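
To make the "collect what's between the two tags" idea concrete, here is a
rough sketch in plain Lua. It is not a working pandoc filter: plain tables
with `t`, `text`, and `content` fields stand in for pandoc elements, and
`wrap_quotes` and `sample` are names made up for illustration. It folds
everything between a <qu> marker and the matching </qu> into one
span-like table, which could then be walked the same way as a Note.

```lua
-- Sketch: fold everything between <qu> and </qu> into one span-like table.
-- Plain tables stand in for pandoc elements (an assumption for illustration).

local function wrap_quotes(inlines)
  local out, buffer = {}, nil
  for _, el in ipairs(inlines) do
    if el.t == "RawInline" and el.text == "<qu>" then
      buffer = {}                        -- opening marker: start collecting
    elseif el.t == "RawInline" and el.text == "</qu>" and buffer then
      -- closing marker: emit the collected elements as one container
      out[#out + 1] = { t = "Span", classes = { "qu" }, content = buffer }
      buffer = nil
    elseif buffer then
      buffer[#buffer + 1] = el           -- inside the markers
    else
      out[#out + 1] = el                 -- outside the markers
    end
  end
  -- an unclosed <qu> is put back unchanged rather than dropped
  if buffer then
    out[#out + 1] = { t = "RawInline", text = "<qu>" }
    for _, el in ipairs(buffer) do out[#out + 1] = el end
  end
  return out
end

-- Hand-built stand-in for: Citation. <qu>Original quotation</qu>
local sample = {
  { t = "Str", text = "Citation." },
  { t = "Space" },
  { t = "RawInline", text = "<qu>" },
  { t = "Str", text = "Original" },
  { t = "Space" },
  { t = "Str", text = "quotation" },
  { t = "RawInline", text = "</qu>" },
}

local result = wrap_quotes(sample)
print(#result, result[3].t, #result[3].content)
```

In a real filter the container would presumably be built with
pandoc.Span(buffer, pandoc.Attr("", {"qu"})) and the function registered
as an `Inlines` handler, so no round trip through markdown should be
needed. I have not tested that part, and it only works when both markers
sit in the same inline list.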

On Monday, August 17, 2020 at 3:42:54 PM UTC-4 h...-97jfqw80gc5Wk0Htik3J/w@public.gmane.org wrote:

> I'd like to extend the lua wordcount filter to tell me a bit more about 
> specific parts of my text, specifically how many words are in the footnotes 
> and how many words are in "original quotations," which I mark off with the 
> <qu></qu> tag in my markdown (and which I then strip later via another 
> filter for certain versions). I got the footnote part to work but can't 
> figure out the RawInline html bit. Any guidance would be appreciated.
>
> Here's my filter, followed by a simple markdown doc and the results:
>
> ```
> -- counts words in a document 
>
> words = 0 
> notewords = 0
> quotewords = 0
> notenoquotewords = 0
> noquotewords = 0
>
> wordcount = { 
>   
>     Note = function(el)
>         pandoc.walk_inline(el, {
>             Str = function(el) 
>                 if el.text:match("%P") then 
>                     notewords = notewords + 1
>                 end 
>             end })
>     end,
>
>     RawInline = function(el)
>         if el.text == '<qu>' then
>             pandoc.walk_inline(el, {
>                 Str = function(el)
>                     if el.text:match("%P") then 
>                         quotewords = quotewords + 1
>                     end 
>             end })
>         end
>     end,
>
>     Str = function(el) 
>         -- we don't count a word if it's entirely punctuation: 
>         if el.text:match("%P") then 
>             words = words + 1 
>         end 
>     end, 
>
>     Code = function(el) 
>         _,n = el.text:gsub("%S+","") 
>         words = words + n 
>     end, 
>
>     CodeBlock = function(el) 
>         _,n = el.text:gsub("%S+","") 
>         words = words + n 
>     end 
> } 
>
> function Pandoc(el) 
>     -- skip metadata, just count body: 
>     pandoc.walk_block(pandoc.Div(el.blocks), wordcount) 
>     mainwords = words - notewords
>     notenoquotewords = notewords - quotewords
>     noquotewords = words - quotewords
>     print(words .. " total words")
>     print(mainwords .. " words in main text") 
>     print(notewords .. " words in notes")
>     print(noquotewords .. " total words minus original quotes")
>     print(quotewords .. " words in original quotes")
>     print(notenoquotewords .. " words in notes minus original quotes")
>     os.exit(0) 
> end
> ```
>
> test.md (minimal working example, markdown):
> ```
> Suspendisse malesuada venenatis mauris. Curabitur ornare mollis velit. Sed 
> vitae metus.
> "Morbi posuere mi id odio."[^1]
>
> [^1]: Citation. <qu>("Original quotation here.")</qu>
> ```
> `pandoc --lua-filter wordcount.lua test.md`
>
> > 20 total words
> > 16 words in main text
> > 4 words in notes
> > 20 total words minus original quotes
> > 0 words in original quotes
> > 4 words in notes minus original quotes
>
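
For the footnote in test.md above, a counter that carries an
inside/outside flag across the inline list might look like the sketch
below. It is plain Lua, not a tested pandoc filter: tables with `t`,
`text`, and `content` fields stand in for pandoc elements, and
`count_quoted` and `sample` are hypothetical names. The hand-built sample
assumes pandoc turns the straight double quotes into a Quoted element, so
the counter descends into `content` instead of scanning only the top level.

```lua
-- Sketch: count words between <qu> and </qu> markers in one pass.
-- Plain tables stand in for pandoc elements (an assumption for illustration).

local function count_quoted(inlines, state)
  state = state or { inside = false, count = 0 }
  for _, el in ipairs(inlines) do
    if el.t == "RawInline" and el.text == "<qu>" then
      state.inside = true
    elseif el.t == "RawInline" and el.text == "</qu>" then
      state.inside = false
    elseif el.t == "Str" then
      -- same rule as the filter above: skip strings that are all punctuation
      if state.inside and el.text:match("%P") then
        state.count = state.count + 1
      end
    elseif type(el.content) == "table" then
      -- descend into containers such as Quoted or Emph, keeping the flag
      count_quoted(el.content, state)
    end
  end
  return state.count
end

-- Hand-built stand-in for: Citation. <qu>("Original quotation here.")</qu>
local sample = {
  { t = "Str", text = "Citation." },
  { t = "Space" },
  { t = "RawInline", text = "<qu>" },
  { t = "Str", text = "(" },
  { t = "Quoted", content = {
      { t = "Str", text = "Original" },
      { t = "Space" },
      { t = "Str", text = "quotation" },
      { t = "Space" },
      { t = "Str", text = "here." },
  } },
  { t = "Str", text = ")" },
  { t = "RawInline", text = "</qu>" },
}

print(count_quoted(sample) .. " words in original quotes")
```

On this sample the sketch reports 3 words in original quotes instead of 0,
which would leave 1 word in notes minus original quotes. The flag lives in
a shared `state` table so that it survives the recursion into nested
containers.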