The data is mostly in a database format and could be exported in whatever format suits the task best, but I also wanted to make it easy for other people to use. Could a YAML metadata block be a solution?

glossary:
  glossary_lang: grc
  entries:
  - headword: ἀγαθός
    text: "□ *pt.* bom;   *en.* good; and so on and so forth"
    match:
    - γαθέ
    - γαθοί
    - κἀγάθ
    - κἀγαθά
    - κἀγαθάς
    - κἀγαθή
    - κἀγαθήν
    - κἀγαθαί
    - κἀγαθοί
    - κἀγαθος
  - headword: ἀγαπᾶν
    transliteration: agapan
    text: "□ *pt.* estar satisfeito, gostar;   *en.* be satisfied, like;"
    match:
    - ἀγάπα
    - ἀγάπαις
    - ἀγάπη
    - ἀγάπην
    - ἀγάπης
    - ἀγάπῃ
    - ἀγαπᾶ
    - ἀγαπᾶν
    - ἀγαπᾶς

On 18 Oct 2022, at 14:34, Bastien DUMONT wrote:

No, citeproc receives a data structure produced by pandoc; pandoc is responsible for the parsing. I think your script would not be too hard to rewrite in Lua. The main question is whether you can achieve your goals this way. If your main concern is portability, then writing a Lua filter with no dependencies is certainly a good solution, provided that you feed it a Lua data structure (or embed the JSON-parsing code in your script).

On Tuesday 18 October 2022 at 02:16:16PM, Bernardo C.D.A. Vasconcelos wrote:

Thank you for the suggestions, Bastien. There is technically no need for
regex, as all the forms are spelled out to avoid the need to create ad hoc
regex rules for each term. Now that I think about it, the principle is the
same as Citeproc's: a tagged inline element will be matched against a lookup
table and replaced. I will look at the citeproc code to see if it leads
anywhere or if it could be reused in any way.
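
A minimal sketch of that lookup-table idea in plain Lua, assuming the glossary has already been loaded into a table shaped like the YAML above. The names `build_index` and `index` are made up for illustration, and only a few of the listed forms are copied here:

```lua
-- Hypothetical data: a hand-written table mirroring the YAML glossary.
local glossary = {
  glossary_lang = 'grc',
  entries = {
    { headword = 'ἀγαθός', match = { 'γαθέ', 'γαθοί', 'κἀγαθά' } },
    { headword = 'ἀγαπᾶν', transliteration = 'agapan',
      match = { 'ἀγάπα', 'ἀγάπην', 'ἀγαπᾶν' } },
  },
}

-- Map every spelled-out form to its entry, so a lookup is one table access.
local function build_index(g)
  local index = {}
  for _, entry in ipairs(g.entries) do
    for _, form in ipairs(entry.match or {}) do
      index[form] = entry
    end
  end
  return index
end

local index = build_index(glossary)

-- Whole-string keys: the comparison is byte for byte, no patterns involved.
local entry = index['ἀγάπην']
print(entry and entry.headword)   -- ἀγαπᾶν
print(index['λόγος'])             -- nil (not in the glossary)
```

Because the keys are compared byte for byte, the document and the glossary need to use the same Unicode normalization form (NFC or NFD); otherwise visually identical forms will not match.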

On 18 Oct 2022, at 13:34, Bastien DUMONT wrote:

Yes, but UTF-8 support is limited to this utf8 library. For instance, if
you perform a pattern search like `string.match('ἀγαθός', '[γδ]')`, it
tries to match any of the four bytes inside the square brackets against
the string 'ἀγαθός', so it will return the first byte of γ, not γ. To
circumvent this limitation, you would be forced to test γ and δ
separately. Nevertheless, if you always perform comparisons between
whole strings as you currently do in your script, this should not be a
problem.
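
To make the byte-level behaviour concrete, here is a small sketch for Lua 5.3 or later (it relies on the standard `utf8.charpattern`). The helper `find_char` is a hypothetical name, not something from the filter under discussion:

```lua
local word = 'ἀγαθός'

-- Lua patterns see bytes: '[γδ]' is really a set of individual bytes,
-- so the match is a single byte rather than a whole character.
local hit = string.match(word, '[γδ]')
print(#hit)  -- 1

-- Character-aware alternative: walk the string one UTF-8 sequence at a
-- time and test each whole character against a set.
local function find_char(s, set)
  for ch in s:gmatch(utf8.charpattern) do
    if set[ch] then
      return ch
    end
  end
  return nil
end

local found = find_char(word, { ['γ'] = true, ['δ'] = true })
print(found)  -- γ

-- Whole-string equality is unaffected by any of this.
print(word == 'ἀγαθός')  -- true
```

The set-based helper is one way of testing γ and δ "separately" without writing one pattern per character.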

As for your concern about dependencies, you would most probably have to
rely on a JSON library such as lunajson. However, if your JSON files are
not supposed to change, you could also convert them to a Lua file using
a JSON library and a serialization library, so as to be able to import
the resulting Lua data structure directly in your filter.
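
A rough sketch of the serialization half of that idea, using only the standard library. It assumes the JSON has already been decoded into a plain Lua table (the decoding step itself would need a JSON library and is not shown), and `serialize` is an illustrative helper rather than an existing library function. It handles strings, numbers, booleans and nested tables with array parts and string keys, nothing more:

```lua
-- Turn a table into Lua source text that evaluates back to the same data.
local function serialize(value, indent)
  indent = indent or ''
  if type(value) == 'table' then
    local inner = indent .. '  '
    local parts = {}
    -- Array part first, in order.
    for _, v in ipairs(value) do
      parts[#parts + 1] = inner .. serialize(v, inner)
    end
    -- Then string keys, sorted so the output is stable.
    local keys = {}
    for k in pairs(value) do
      if type(k) == 'string' then keys[#keys + 1] = k end
    end
    table.sort(keys)
    for _, k in ipairs(keys) do
      parts[#parts + 1] = inner .. string.format('[%q] = ', k)
        .. serialize(value[k], inner)
    end
    return '{\n' .. table.concat(parts, ',\n') .. '\n' .. indent .. '}'
  elseif type(value) == 'string' then
    return string.format('%q', value)
  else
    return tostring(value)
  end
end

-- Stand-in for data decoded from JSON.
local glossary = {
  { headword = 'ἀγαθός', match = { 'γαθέ', 'γαθοί' } },
}

local source = 'return ' .. serialize(glossary)

-- Round trip: loading the generated source gives the data back.
local reloaded = load(source)()
print(reloaded[1].match[2])  -- γαθοί
```

In practice `source` would be written to a `.lua` file once, and the filter would load that file with `dofile` or `require` instead of parsing JSON at run time.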

On Tuesday 18 October 2022 at 12:36:03PM, Bernardo C.D.A. Vasconcelos wrote:

As for translating the filter, note that Lua can't really handle UTF-8.
There is some rudimentary support for converting codepoint number ↔
UTF-8 byte sequences and for iterating through a string of bytes
representing UTF-8 encoded characters, but no concept of chars as
opposed to bytes. This may become a show stopper if you need to
manipulate strings containing UTF-8 text.

Thanks, @BPJ, for the explanation. Apparently, Lua 5.3 onwards includes
UTF-8 support. Have you seen it? E.g.
https://q-syshelp.qsc.com/Content/Control_Scripting/Lua_5.3_Reference_Manual/Standard_Libraries/4_-_Basic_UTF-8_Support.htm

For Ancient Greek you want grc as the language tag.

Indeed it is (and that is generally what I use), but ἀγαθός is just
Polytonic Greek, which is not the same as Ancient Greek.

--
You received this message because you are subscribed to the Google
Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/3307993F-F813-405F-BFEC-F17FAF27BEA5%40gmail.com.
