public inbox archive for pandoc-discuss@googlegroups.com
From: BPJ <bpj-J3H7GcXPSITLoDKTGw+V6w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: Glossary Filter for MD2Tex
Date: Mon, 17 Oct 2022 20:38:10 +0200
Message-ID: <CADAJKhCVT-PNRsSgr5hU7Zzwaq3fN+CF3SGA5mTLrc2As+R6rw@mail.gmail.com>
In-Reply-To: <88a14108-f2e4-40d0-a98e-5c6f84b8ff41n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>

For Ancient Greek you want grc as the language tag; el denotes Modern
Greek.

As for translating the filter, note that Lua can't really handle UTF-8.
There is some rudimentary support for converting between codepoint numbers
and UTF-8 byte sequences, and for iterating through a string of bytes
representing UTF-8 encoded characters, but there is no concept of
characters as opposed to bytes. This may become a showstopper if you need
to manipulate strings containing UTF-8 text.
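
To make the byte/character distinction concrete, here is a minimal sketch
in Ruby (the language of the quoted script, which exposes both views of a
string). It is only an illustration, not pandoc or Lua API; the word is the
first glossary headword, written with escapes so the codepoints are
unambiguous.

```ruby
# The glossary headword "ἀγαθός", spelled with escapes so the exact
# codepoints are known: U+1F00 takes 3 bytes in UTF-8, the rest 2 each.
word = "\u1F00\u03B3\u03B1\u03B8\u03CC\u03C2"

char_count = word.length     # character (codepoint) view
byte_count = word.bytesize   # byte view, which is all a byte string has

# Taking the "first character" by bytes cuts the 3-byte alpha after one
# byte and leaves a fragment that is not valid UTF-8.
first_byte = word.byteslice(0, 1)
first_char = word[0]

puts "chars=#{char_count} bytes=#{byte_count}"
puts "first byte valid UTF-8? #{first_byte.valid_encoding?}"
puts "first char valid UTF-8? #{first_char.valid_encoding?}"
```

Comparing two whole strings for equality is unaffected, since identical
UTF-8 text is an identical byte sequence. The trouble starts with slicing,
counting, case changes or pattern matching on individual characters.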

On Mon, 17 Oct 2022 at 20:26, Bernardo C. D. A. Vasconcelos <
bernardovasconcelos@gmail.com> wrote:

> Hello everyone,
>
> I am curious if anyone would be willing to lend me a hand in (or give me
> directions) translating a small script from Ruby to Lua. The idea is this:
> we feed the filter a JSON string with the glossary data. The filter will
> check the JSON for each entry's `filter_match` and tag these accordingly in
> the text, pointing them to the correct glossary entry. It works as it is,
> but it has dependencies (which makes it harder to share), and it seems a
> bit slow (perhaps the logic I am applying is faulty).
>
> *JSON Example*
>
> ```
> {
>   "entries": [
>     {
>     "title": "ἀγαθός",
>     "subtitle": "□ *pt.* bom;  □ *en.* good",
>     "filter_match": ["γαθέ", "γαθοί", "κἀγάθ", "κἀγαθά", "κἀγαθάς",
> "κἀγαθή", "κἀγαθήν", "κἀγαθαί", "κἀγαθοί", "κἀγαθος", "κἀγαθούς",
> "κἀγαθοῖς", "κἀγαθοῦ", "κἀγαθόν", "κἀγαθός", "κἀγαθώ", "κἀγαθῆς",
> "κἀγαθῶν", "κἀγαθῶς", "κἀγαθῷ", "τἀγάθ", "τἀγαθά", "τἀγαθοῦ", "τἀγαθόν",
> "τἀγαθῇ", "τἀγαθῷ", "τὠγαθοῦ", "τὠγαθόν", "ἀγάθ", "ἀγάθων", "ἀγαθά",
> "ἀγαθάν", "ἀγαθάς", "ἀγαθέ", "ἀγαθή", "ἀγαθήν", "ἀγαθαί", "ἀγαθαῖν",
> "ἀγαθαῖς", "ἀγαθαῖσιν", "ἀγαθοί", "ἀγαθούς", "ἀγαθοῖν", "ἀγαθοῖο",
> "ἀγαθοῖς", "ἀγαθοῖσι", "ἀγαθοῖσιν", "ἀγαθοῦ", "ἀγαθόν", "ἀγαθός", "ἀγαθώ",
> "ἀγαθᾶν", "ἀγαθᾶς", "ἀγαθᾷ", "ἀγαθῆισι", "ἀγαθῆισιν", "ἀγαθῆς", "ἀγαθῇ",
> "ἀγαθῇσι", "ἀγαθῇσιν", "ἀγαθῶ", "ἀγαθῶι", "ἀγαθῶν", "ἀγαθῶς", "ἀγαθῷ",
> "ἁγαθή", "ἁγαθαί", "ἁγαθοί", "ἁγαθός", "ὠγαθέ", "ὦγαθ", "ὦγαθε"],
>     "transliteration": "agathos"
>     },
>     {
>     "title": "ἀγαπᾶν",
>     "subtitle": "□ *pt.* estar satisfeito, gostar;  □ *en.* be satisfied, like;",
>     "filter_match": ["ἀγάπα", "ἀγάπαις", "ἀγάπη", "ἀγάπην", "ἀγάπης",
> "ἀγάπῃ", "ἀγαπᾶ", "ἀγαπᾶν", "ἀγαπᾶς", "ἀγαπᾷ", "ἀγαπᾷν", "ἀγαπᾷς", "ἀγαπῇ",
> "ἀγαπῶν"],
>     "transliteration": "agapan"
>     }
>   ]
> }
> ```
> (I am using JSON here just because it seemed to make sense. Perhaps it
> would be better to pull this data from a definition list (with extended
> attributes) in the same document?)
>
> *The Ruby script*
>
> ```
> #!/usr/bin/env ruby
>
> Encoding.default_internal = Encoding::UTF_8
> Encoding.default_external = Encoding::UTF_8
>
> require 'paru/filter'
> require 'json'
>
> # Keys follow the JSON example above: entries, filter_match, title.
> GLOSSARY = JSON.parse(File.read("#{__dir__}/data.json"))['entries']
>
> # Not defined in the script as posted; assumed here so that it runs.
> start_time = Time.now
> log = []
> log_file = $stderr # stdout is reserved for pandoc's JSON
>
> Paru::Filter.run do
>   with 'Span' do |p|
>     next unless p.attr['lang'] == 'el'
>
>     span_content = p.inner_markdown.nil? ? '' : p.inner_markdown.chomp
>     # First entry whose list of forms contains the span text, if any.
>     entry = GLOSSARY.find { |g| g['filter_match'].include?(span_content) }
>     next if entry.nil?
>
>     p.inner_markdown =
>       "\\index{#{entry['transliteration']}@#{entry['title']}}" \
>       "\\glslink{#{entry['transliteration']}}{#{span_content}}"
>     log << entry['title']
>   end
> end
>
> log_file.puts "Paru::Filter took #{Time.now - start_time}s.\n\n"
> log_file.puts "#{log.length} total entries (#{log.uniq.length} unique) " \
>               "were tagged:\n#{log.uniq.sort.join("\n")}\n\n"
> ```
>
> So if my markdown input were:
>
> ```
> Lorem, etc. [ἀγαθὸς]{lang=el} is a greek word.
> ```
>
> The LaTeX output would be:
>
> ```
> Lorem, etc. \index{agathos@ἀγαθὸς}\glslink{agathos}{ἀγαθὸς} is a greek word.
> ```
>
> Please note that the glossary headword must be *agathos*, the
> transliterated form, instead of ἀγαθός, due to weird sorting issues with
> LaTeX.
>
> Any input is appreciated.
>
> Bernardo
> https://github.com/bcdavasconcelos
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pandoc-discuss+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/pandoc-discuss/88a14108-f2e4-40d0-a98e-5c6f84b8ff41n%40googlegroups.com
> <https://groups.google.com/d/msgid/pandoc-discuss/88a14108-f2e4-40d0-a98e-5c6f84b8ff41n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
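
On the speed concern: the quoted script scans every glossary entry for
every span. A minimal sketch of one alternative, assuming the JSON layout
from the example above (entries, filter_match, title, transliteration),
builds a hash from each inflected form to its entry once, so that each span
costs a single lookup. The names build_index and gloss are hypothetical,
the form lists are shortened, and this is plain Ruby without paru, not the
original filter.

```ruby
require 'json'

# Sample data in the shape of the JSON example (form lists shortened).
DATA = JSON.parse(<<~EOS)
  {
    "entries": [
      { "title": "ἀγαθός", "filter_match": ["ἀγαθός", "ἀγαθόν"],
        "transliteration": "agathos" },
      { "title": "ἀγαπᾶν", "filter_match": ["ἀγάπη", "ἀγαπᾶν"],
        "transliteration": "agapan" }
    ]
  }
EOS

# Map every inflected form to its entry once.
# If two entries list the same form, the first one wins.
def build_index(entries)
  entries.each_with_object({}) do |entry, index|
    entry['filter_match'].each { |form| index[form] ||= entry }
  end
end

# Return the LaTeX markup for a span's text, or nil when nothing matches.
def gloss(index, text)
  entry = index[text]
  return nil unless entry

  key = entry['transliteration']
  "\\index{#{key}@#{entry['title']}}\\glslink{#{key}}{#{text}}"
end

INDEX = build_index(DATA['entries'])
puts gloss(INDEX, DATA['entries'][0]['filter_match'][1])
```

The lookup is an exact string comparison, so it works on bytes and needs no
character-level operations. The same table-keyed-by-form idea should carry
over to a Lua filter, as long as the forms in the glossary and in the text
are encoded identically.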
