Glossary Filter for MD2Tex - Bernardo C. D. A. Vasconcelos

public inbox archive for pandoc-discuss@googlegroups.com

From: "Bernardo C. D. A. Vasconcelos" <bernardovasconcelos-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Glossary Filter for MD2Tex
Date: Mon, 17 Oct 2022 11:25:13 -0700 (PDT)
Message-ID: <88a14108-f2e4-40d0-a98e-5c6f84b8ff41n@googlegroups.com>


Hello everyone,

I am wondering whether anyone would be willing to help me translate (or
point me toward how to translate) a small pandoc filter from Ruby to Lua.
The idea is this: we feed the filter JSON data describing the glossary.
The filter checks each entry's `filter_match` list and, when a span in the
text matches one of the forms, tags it so that it points to the correct
glossary entry. The script works as it is, but it has dependencies (which
makes it harder to share), and it seems a bit slow (perhaps the logic I am
applying is faulty).

*JSON Example*

```
{
  "entries": [
    {
    "title": "ἀγαθός",
    "subtitle": "□ *pt.* bom;  □ *en.* good",
    "filter_match": ["γαθέ", "γαθοί", "κἀγάθ", "κἀγαθά", "κἀγαθάς", 
"κἀγαθή", "κἀγαθήν", "κἀγαθαί", "κἀγαθοί", "κἀγαθος", "κἀγαθούς", 
"κἀγαθοῖς", "κἀγαθοῦ", "κἀγαθόν", "κἀγαθός", "κἀγαθώ", "κἀγαθῆς", 
"κἀγαθῶν", "κἀγαθῶς", "κἀγαθῷ", "τἀγάθ", "τἀγαθά", "τἀγαθοῦ", "τἀγαθόν", 
"τἀγαθῇ", "τἀγαθῷ", "τὠγαθοῦ", "τὠγαθόν", "ἀγάθ", "ἀγάθων", "ἀγαθά", 
"ἀγαθάν", "ἀγαθάς", "ἀγαθέ", "ἀγαθή", "ἀγαθήν", "ἀγαθαί", "ἀγαθαῖν", 
"ἀγαθαῖς", "ἀγαθαῖσιν", "ἀγαθοί", "ἀγαθούς", "ἀγαθοῖν", "ἀγαθοῖο", 
"ἀγαθοῖς", "ἀγαθοῖσι", "ἀγαθοῖσιν", "ἀγαθοῦ", "ἀγαθόν", "ἀγαθός", "ἀγαθώ", 
"ἀγαθᾶν", "ἀγαθᾶς", "ἀγαθᾷ", "ἀγαθῆισι", "ἀγαθῆισιν", "ἀγαθῆς", "ἀγαθῇ", 
"ἀγαθῇσι", "ἀγαθῇσιν", "ἀγαθῶ", "ἀγαθῶι", "ἀγαθῶν", "ἀγαθῶς", "ἀγαθῷ", 
"ἁγαθή", "ἁγαθαί", "ἁγαθοί", "ἁγαθός", "ὠγαθέ", "ὦγαθ", "ὦγαθε"], 
    "transliteration": "agathos"
    },
    {
    "title": "ἀγαπᾶν",
    "subtitle": "□ *pt.* estar satisfeito, gostar;  □ *en.* be satisfied, like;",
    "filter_match": ["ἀγάπα", "ἀγάπαις", "ἀγάπη", "ἀγάπην", "ἀγάπης", 
"ἀγάπῃ", "ἀγαπᾶ", "ἀγαπᾶν", "ἀγαπᾶς", "ἀγαπᾷ", "ἀγαπᾷν", "ἀγαπᾷς", "ἀγαπῇ", 
"ἀγαπῶν"],
    "transliteration": "agapan"
    }
  ]
}
```
(I am using JSON here only because it seemed convenient. Perhaps it would
be more interesting to pull this data from a definition list (with
extended attributes) in the same document?)

*The Ruby script*

```
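#!/usr/bin/env ruby

Encoding.default_internal = Encoding::UTF_8
Encoding.default_external = Encoding::UTF_8

require 'paru/filter'
require 'json'

# Keys follow the JSON example above: entries, title, filter_match.
GLOSSARY = JSON.parse(File.read("#{__dir__}/data.json"))['entries']

# The setup of these three names was not shown in the original message;
# the values below are assumptions. A filter's stdout carries the
# document, so the log goes to stderr.
start_time = Time.now
log = []
log_file = $stderr

Paru::Filter.run do
  with 'Span' do |p|
    next unless p.attr['lang'] == 'el'

    # Text of the span, or '' when the span is empty.
    span_content = p.inner_markdown.nil? ? '' : p.inner_markdown.chomp
    result = GLOSSARY.select { |g| g['filter_match'].include?(span_content) }

    next if result.empty?

    # Replace the span content with raw LaTeX: an index entry sorted by
    # the transliteration, followed by a link to the glossary entry.
    p.inner_markdown =
      "\\index{#{result[0]['transliteration']}@#{result[0]['title']}}" \
      "\\glslink{#{result[0]['transliteration']}}{#{span_content}}"
    log << result[0]['title']
  end
end

log_file.puts "Paru::Filter took #{Time.now - start_time}s.\n\n"
log_file.puts "#{log.length} total entries (#{log.uniq.length} unique) were " \
              "tagged:\n#{log.uniq.sort.join("\n")}\n\n"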
```
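
On the speed concern, one possible factor is that the script scans every entry's match list for each Greek span. A minimal sketch of an alternative, assuming the entries have the shape of the JSON example: build a Hash from each inflected form to its entry once, then do a single lookup per span. The name `build_lookup` and the shortened sample data are hypothetical, and whether this is the actual bottleneck has not been measured.

```ruby
require 'json'

# Sample data in the shape of the JSON example, with shortened match lists.
sample_json = <<~JSON
  {
    "entries": [
      { "title": "ἀγαθός", "filter_match": ["ἀγαθός", "ἀγαθόν"],
        "transliteration": "agathos" },
      { "title": "ἀγαπᾶν", "filter_match": ["ἀγάπη", "ἀγαπᾶν"],
        "transliteration": "agapan" }
    ]
  }
JSON

# Map every inflected form to its glossary entry.
# If a form is listed under two entries, the first one wins,
# which mirrors the use of result[0] in the script.
def build_lookup(entries)
  entries.each_with_object({}) do |entry, index|
    entry['filter_match'].each { |form| index[form] ||= entry }
  end
end

lookup = build_lookup(JSON.parse(sample_json)['entries'])

# One hash access per span replaces the scan over all entries.
entry = lookup['ἀγαθόν']
puts entry['transliteration'] if entry
```

The same idea carries over to a Lua filter, where a table keyed by the inflected form would play the role of the Hash.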

So if my markdown input were:

```
Lorem, etc. [ἀγαθὸς]{lang=el} is a greek word.
```

The LaTeX output would be:

```
Lorem, etc. \index{agathos@ἀγαθὸς}\glslink{agathos}{ἀγαθὸς} is a greek word.
```
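
One detail in this example: the span contains ἀγαθὸς with a grave accent, while the sample `filter_match` list shown earlier has only the acute form ἀγαθός, so an exact string comparison would not find it unless the full data also lists grave forms. A minimal sketch of one way to handle this, assuming it is acceptable to treat a grave accent as equivalent to an acute one when matching; the helper name `fold_grave` is hypothetical, and it would need to be applied both to the span text and to the stored forms.

```ruby
# Fold grave accents to acute so that running-text forms can match
# dictionary forms. Also brings both strings to the same Unicode form (NFC).
def fold_grave(word)
  word.unicode_normalize(:nfd)   # split letters from combining marks
      .tr("\u0300", "\u0301")    # combining grave -> combining acute
      .unicode_normalize(:nfc)   # recompose into precomposed letters
end

# true when the two forms differ only by grave vs. acute
puts fold_grave('ἀγαθὸς') == fold_grave('ἀγαθός')
```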

Please note that the glossary headword must be *agathos*, the 
transliterated form, instead of ἀγαθός, due to weird sorting issues with 
LaTeX.
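
The tag itself follows the `sort-key@displayed-text` convention that makeindex uses for `\index`, with the transliteration doubling as the label passed to `\glslink`. A small sketch of a helper that assembles the string; the name `glossary_tag` and its parameter names are hypothetical, chosen only for illustration.

```ruby
# Build the LaTeX replacement for one matched span.
#   sort_key: transliteration, used for sorting and as the glossary label
#   headword: Greek form displayed in the index
#   surface:  the word as it appears in the running text
def glossary_tag(sort_key, headword, surface)
  "\\index{#{sort_key}@#{headword}}\\glslink{#{sort_key}}{#{surface}}"
end

puts glossary_tag('agathos', 'ἀγαθός', 'ἀγαθὸς')
```

Keeping the string assembly in one place would make it easier to change the markup later without touching the matching logic.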

Any input is appreciated.

Bernardo
https://github.com/bcdavasconcelos



