public inbox archive for pandoc-discuss@googlegroups.com
From: Bastien DUMONT <bastien.dumont-VwIFZPTo/vqsTnJN9+BGXg@public.gmane.org>
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Subject: Re: Glossary Filter for MD2Tex
Date: Wed, 19 Oct 2022 21:28:41 +0000
Message-ID: <Y1BsCdqttFxOi/pa@localhost>
In-Reply-To: <B93B3CA7-A461-4056-929D-592B578B184F-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 11846 bytes --]

I think that the attached script could be a good starting point.

On Wednesday 19 October 2022 at 04:50:25PM, Bernardo C.D.A. Vasconcelos wrote:
> I have found this little script that takes me nearly there:
> 
> local vars = {}
> 
> function Meta(meta)
>     for k, v in pairs(meta) do
>         vars["%" .. k .. "%"] = v
>     end
> end
> 
> function Str(elem)
>     if vars[elem.text] then
>         return vars[elem.text]
>     else
>         return elem
>     end
> end
> 
> return {
>     { Meta = Meta },
>     { Str  = Str  }
> }
> 
> 
> Instead, we would use: meta.glossary.entries. The crux for me is looping
> through the list of entries, adding all the values of the to_match field
> (a.k.a. known forms) (of each entry) to vars as a key with the content of some
> other field (e.g. glslink) as value. E.g. vars[ .. entry.to_match.each .. ] =
> entry.glslink.
> 
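The loop described above can be sketched in plain Lua, without pandoc, as follows. This is only a minimal illustration under assumptions: the `entries` table stands in for `meta.glossary.entries` with its values already converted to plain strings, the field is called `match` as in the YAML example rather than `to_match`, and `build_lookup` is a hypothetical helper name.

```lua
-- Plain-Lua stand-in for meta.glossary.entries (assumed layout; inside a
-- pandoc filter the values would first go through pandoc.utils.stringify).
local entries = {
  { headword = 'ἀγαθός', transliteration = 'agathos',
    match = { 'κἀγαθά', 'κἀγαθοί' } },
  { headword = 'ἀγαπᾶν', transliteration = 'agapan',
    match = { 'ἀγαπᾶς', 'ἀγάπην' } },
}

-- Map every known form to the entry it belongs to.
local function build_lookup(list)
  local lookup = {}
  for _, entry in ipairs(list) do
    for _, form in ipairs(entry.match) do
      lookup[form] = entry  -- the entry table is shared, not copied
    end
  end
  return lookup
end

local vars = build_lookup(entries)
print(vars['κἀγαθά'].transliteration)  --> agathos
print(vars['ἀγαπᾶς'].headword)
```

Because each form points at the same entry table, any field of the entry (headword, transliteration, text) is reachable from any of its forms with a single lookup.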
> On 18 Oct 2022, at 19:06, Bastien DUMONT wrote:
> 
>     Yes, it could! You would have access to the corresponding metadata object
>     in the AST.
> 
>     On Tuesday 18 October 2022 at 06:43:48PM, Bernardo C.D.A. Vasconcelos
>     wrote:
> 
>         The data is mostly in database format and could be output in the best
>         format for the task, but I wanted to make it friendly for other people
>         to use as well. Could a YAML metadata block be a solution?
> 
>         glossary:
>           glossary_lang: grc
>           entries:
>           - headword: ἀγαθός
>             text: "□ *pt.* bom; □ *en.* good; and so on and so forth"
>             match:
>             - γαθέ
>             - γαθοί
>             - κἀγάθ
>             - κἀγαθά
>             - κἀγαθάς
>             - κἀγαθή
>             - κἀγαθήν
>             - κἀγαθαί
>             - κἀγαθοί
>             - κἀγαθος
>           - headword: ἀγαπᾶν
>             transliteration: agapan
>             text: "□ *pt.* estar satisfeito, gostar; □ *en.* be satisfied, like;"
>             match:
>             - ἀγάπα
>             - ἀγάπαις
>             - ἀγάπη
>             - ἀγάπην
>             - ἀγάπης
>             - ἀγάπῃ
>             - ἀγαπᾶ
>             - ἀγαπᾶν
>             - ἀγαπᾶς
> 
>         On 18 Oct 2022, at 14:34, Bastien DUMONT wrote:
> 
>         No, citeproc receives a data structure produced by pandoc. Pandoc is
>         responsible for the parsing. I think that your script would not be so
>         hard to rewrite in Lua; the main problem is to know whether you can
>         achieve your goals this way. If your main concern is portability, then
>         writing a Lua filter with no dependencies certainly is a good solution,
>         provided that you feed it with a Lua data structure (or embed the code
>         responsible for JSON parsing in your script).
> 
>         On Tuesday 18 October 2022 at 02:16:16PM, Bernardo C.D.A. Vasconcelos
>         wrote:
> 
>         Thank you for the suggestions, Bastien. There is technically no need
>         for regex, as all the forms are spelled out to avoid the need to create
>         ad hoc regex rules for each term. Now that I think about it, the
>         principle is the same as Citeproc's: a tagged inline element will be
>         matched against a lookup table and replaced. I will look at the
>         citeproc code to see if it leads anywhere or if it could be reused in
>         any way.
> 
>         On 18 Oct 2022, at 13:34, Bastien DUMONT wrote:
> 
>         Yes, but it is limited to this utf8 library. For instance, if you
>         perform a pattern search like `string.match('ἀγαθός', '[γδ]')`, it
>         tries to match one of the four bytes inside the square brackets
>         against the string 'ἀγαθός', so it will return the first byte of γ,
>         not γ. To circumvent this limitation, you would be forced to test γ
>         and δ separately. Nevertheless, if you always perform comparisons
>         between whole strings as you currently do in your script, this should
>         not be a problem.
> 
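The byte-versus-character behaviour described above can be checked with a few lines of plain Lua (5.3 or later, for the `utf8` library). This is a small sketch, not part of the attached filter: the strings are built from code points so that their byte layout is unambiguous, and `find_any` is a hypothetical helper illustrating the "test γ and δ separately" workaround.

```lua
-- Build the strings from code points so the UTF-8 bytes are known exactly.
local word  = utf8.char(0x1F00, 0x3B3, 0x3B1, 0x3B8, 0x3CC, 0x3C2) -- ἀγαθός
local gamma = utf8.char(0x3B3)  -- γ
local delta = utf8.char(0x3B4)  -- δ

-- The length operator counts bytes; utf8.len counts characters.
print(#word)           --> 13
print(utf8.len(word))  --> 6

-- A character class is a set of bytes, so the match is one single byte
-- (the first byte of γ), not the letter γ itself.
local hit = string.match(word, '[' .. gamma .. delta .. ']')
print(#hit)            --> 1

-- Workaround: search for each letter separately as a plain substring.
-- Returns the first letter of the list that occurs in s, or nil.
local function find_any(s, letters)
  for _, letter in ipairs(letters) do
    local from, to = string.find(s, letter, 1, true)  -- true = no patterns
    if from then
      return string.sub(s, from, to)
    end
  end
  return nil
end

local found = find_any(word, { gamma, delta })
print(found == gamma)  --> true
```

Whole-string comparisons, such as using a form as a table key, are unaffected, since two strings are equal exactly when their bytes are equal.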
>         As for your concern with dependencies, you most probably would have to
>         rely on a JSON library such as lunajson. However, if your JSON files
>         are not supposed to change, you could also convert them to a Lua file
>         using a JSON library and a serialization library, so as to be able to
>         import the resulting Lua data structure directly in your filter.
> 
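The serialization step mentioned above could look roughly like the following in plain Lua. This is only a sketch under assumptions: `serialize` is a hypothetical hand-written helper rather than an existing library, it handles only strings, numbers, booleans and nested tables, and the JSON decoding that would produce the input table is left out.

```lua
-- Turn a value into Lua source code that recreates it when loaded.
local function serialize(value, indent)
  indent = indent or ''
  local kind = type(value)
  if kind == 'string' then
    return string.format('%q', value)  -- quoted and escaped string literal
  elseif kind == 'number' or kind == 'boolean' then
    return tostring(value)
  elseif kind == 'table' then
    local inner = indent .. '  '
    local lines = {}
    for k, v in pairs(value) do
      lines[#lines + 1] = inner .. '[' .. serialize(k) .. '] = ' ..
                          serialize(v, inner) .. ','
    end
    return '{\n' .. table.concat(lines, '\n') .. '\n' .. indent .. '}'
  end
  error('cannot serialize a ' .. kind)
end

-- Stand-in for the table a JSON library would have produced.
local glossary = {
  { headword = 'ἀγαθός', match = { 'κἀγαθά', 'κἀγαθοί' } },
}

-- In practice this string would be written once to a file
-- (for example glossary.lua, a hypothetical name) and read with dofile.
local source = 'return ' .. serialize(glossary)

-- Loading the generated source gives back an equivalent table.
local reloaded = load(source)()
print(reloaded[1].headword == glossary[1].headword)
```

The filter itself then needs no JSON library at run time; the conversion only has to be repeated when the data changes.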
>         On Tuesday 18 October 2022 at 12:36:03PM, Bernardo C.D.A. Vasconcelos
>         wrote:
>
>         As for translating the filter, note that Lua can't really handle
>         UTF-8. There is some rudimentary support for converting codepoint
>         number ↔ UTF-8 byte sequences and for iterating through a string of
>         bytes representing UTF-8 encoded characters, but no concept of chars
>         as opposed to bytes. This may become a show stopper if you need to
>         manipulate strings containing UTF-8 text.
> 
>         Thanks, @BPJ, for the explanation. Apparently, Lua 5.3 onwards
>         includes UTF-8 support. Have you seen it? E.g.
>         [1]https://q-syshelp.qsc.com/Content/Control_Scripting/Lua_5.3_Reference_Manual/Standard_Libraries/4_-_Basic_UTF-8_Support.htm
>
>         For Ancient Greek you want grc as the language tag.
>
>         Indeed it is (and that is generally what I use), but ἀγαθός is just
>         Polytonic Greek, which is not the same as Ancient Greek.
> 
>         --
>         You received this message because you are subscribed to the Google
>         Groups "pandoc-discuss" group.
>         To unsubscribe from this group and stop receiving emails from it, send
>         an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
>         To view this discussion on the web visit
>         [2]https://groups.google.com/d/msgid/pandoc-discuss/3307993F-F813-405F-BFEC-F17FAF27BEA5%40gmail.com.
> 
>         References:
>
>         [1] https://q-syshelp.qsc.com/Content/Control_Scripting/Lua_5.3_Reference_Manual/Standard_Libraries/4_-_Basic_UTF-8_Support.htm
>         [2] https://groups.google.com/d/msgid/pandoc-discuss/3307993F-F813-405F-BFEC-F17FAF27BEA5%40gmail.com
>         [3] https://groups.google.com/d/msgid/pandoc-discuss/Y07VnbuRsuqUg8US%40localhost
>         [4] https://groups.google.com/d/msgid/pandoc-discuss/7072522D-F2FE-4BAC-A575-93426852FCFB%40gmail.com
>         [5] https://groups.google.com/d/msgid/pandoc-discuss/Y07ji07FFokQdOR%2B%40localhost
>         [6] mailto:pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
>         [7] https://groups.google.com/d/msgid/pandoc-discuss/D4CB4B20-A1D5-49C8-BA96-2E37BA4FB779%40gmail.com?utm_medium=email&utm_source=footer
> 


[-- Attachment #2: tag-greek-words.lua --]
[-- Type: text/plain, Size: 2123 bytes --]

-- I suppose that you always use plain strings in the
-- "headword", "transliteration" and "match" fields,
-- so I stringify the corresponding Inlines to be able
-- to more easily insert the values in the LaTeX string.
local stringify = pandoc.utils.stringify

local open_glslink_scd_arg = pandoc.RawInline('latex', '{')
local close_glslink_scd_arg = pandoc.RawInline('latex', '}')

-- I use two tables: one to store the data relative to the headwords,
-- the other to map the forms to the corresponding entries
-- in headwords_data.
-- Since the entries in headwords_data are tables
-- and tables are always passed by reference in Lua,
-- this approach avoids storing redundant copies in memory.
local headwords_data = {}
local forms_to_headwords = {}

local function get_glossary_data(meta)
  for _, entry in ipairs(meta.glossary.entries) do
    local headword = stringify(entry.headword)
    headwords_data[headword] = {
      headword = headword,
      text = entry.text,
      -- Stringify only when the field is present, so that a missing
      -- "transliteration" is simply stored as nil.
      transliteration = entry.transliteration
        and stringify(entry.transliteration)
    }
    for _, form in ipairs(entry.match) do
      forms_to_headwords[stringify(form)] = headwords_data[headword]
    end
  end
end

local function tag_words(span)
  if span.attributes.lang == 'el' then
    local content = stringify(span.content)
    local word_data = forms_to_headwords[content]
    if word_data then
      -- If the "transliteration" field is missing, Lua will throw an error.
      -- I suppose that this should not happen, but if it can,
      -- uncomment the following line (supposing that the lonely @
      -- will not cause problems). It is placed before the assignment
      -- so that uncommenting it leaves the code valid:
      -- word_data.transliteration = word_data.transliteration or ''
      local linguistic_tags =
        pandoc.RawInline('latex',
                         '\\index{' .. word_data.transliteration ..
                         '@' .. word_data.headword .. '}' ..
                         '\\glslink{' .. word_data.transliteration .. '}')
      return { linguistic_tags, open_glslink_scd_arg, span, close_glslink_scd_arg }
    end
  end
end

return {
  { Meta = get_glossary_data },
  { Span = tag_words }
}

[-- Attachment #3: test.md --]
[-- Type: text/markdown, Size: 777 bytes --]

---
glossary:
  glossary_lang: grc
  entries:
  - headword: ἀγαθός
    transliteration: agathos
    text: "□ *pt.* bom;  □ *en.* good; and so on and so forth"
    match:
    - γαθέ
    - γαθοί
    - κἀγάθ
    - κἀγαθά
    - κἀγαθάς
    - κἀγαθή
    - κἀγαθήν
    - κἀγαθαί
    - κἀγαθοί
    - κἀγαθος
  - headword: ἀγαπᾶν
    transliteration: agapan
    text: "□ *pt.* estar satisfeito, gostar;  □ *en.* be satisfied, like;"
    match:
    - ἀγάπα
    - ἀγάπαις
    - ἀγάπη
    - ἀγάπην
    - ἀγάπης
    - ἀγάπῃ
    - ἀγαπᾶ
    - ἀγαπᾶν
    - ἀγαπᾶς
---

The words [κἀγαθά]{lang=el} and [ἀγαπᾶς]{lang=el}.

