From: bapt a
Newsgroups: gmane.text.pandoc
Subject: Re: Lua filter to automatically tag keywords for TeX indexing
Date: Fri, 4 Nov 2022 14:26:55 -0700 (PDT)
Message-ID: <23ccfb2a-c458-49e1-a275-dad452f5d2e3n@googlegroups.com>
References: <7f570676-2876-4e29-a8c0-9a765617f141n@googlegroups.com>
To: pandoc-discuss

Thank you both for the helpful replies; I don't fully understand how these boolean tables are used in Lua (clearly it works, I just don't fully get it), but it seems to be an important concept so I'll read up on it.

Thanks,

baptiste

On Friday, 4 November 2022 at 08:39:29 UTC+13 fiddlosopher wrote:


> On Nov 2, 2022, at 6:20 PM, bapt a <augu...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> Hi all,
>
> I've started writing a technical book using Quarto markdown, which uses pandoc with Lua filters under the hood to produce a website as well as the publisher's pdf format (via LaTeX).
> I quite like to keep the source document as plain as possible, and I'm wondering if I could avoid the use of [concept]{.index}, which gets turned into \index{concept}, and instead write a Lua filter with my custom list of keywords, and have pandoc automatically match them as they appear in the text.
> As a proof of principle I wrote the following code (see below), which matches specific keywords, and reformats them as small-caps. I quickly realised that trailing punctuation, such as "concept, ..." will fail to match, so I'm using gsub to strip such punctuation before matching. It works, but I'm a bit worried:
>
> - what's the overhead of such a filter, in practice? From what I understand, every single string element in the AST will be processed by gsub then tested for a match. Are Lua filters walking down the AST fast enough that I shouldn't worry about it? (as far as I can tell on small examples, it seems fine)

The AST walking is very fast. See the benchmarks at the beginning of https://pandoc.org/lua-filters.html for one example.
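
For reference, the per-string work described in the question can be sketched in plain Lua. This is only an illustration: the original filter code is not included in this message, so the keyword list and the names keywords, strip_trailing_punctuation and is_keyword are hypothetical.

```lua
-- Hypothetical keyword list; the list used in the original filter is not shown here.
local keywords = { "concept", "pandoc" }

-- Remove trailing punctuation so that "concept," can still match "concept".
local function strip_trailing_punctuation(s)
  -- The extra parentheses drop gsub's second return value (the substitution count).
  return (s:gsub("%p+$", ""))
end

-- Linear scan over the keyword array, one comparison per keyword.
local function is_keyword(s)
  local bare = strip_trailing_punctuation(s)
  for _, k in ipairs(keywords) do
    if bare == k then
      return true
    end
  end
  return false
end

-- When this file is used as a pandoc Lua filter, pandoc calls Str for every
-- string element of the AST; outside pandoc this function is never called.
-- Returning nil (no match) leaves the element unchanged.
function Str(el)
  if is_keyword(el.text) then
    return pandoc.SmallCaps({ el })
  end
end
```

The cost per string element is one gsub call plus a scan over the keyword list, so it grows with the number of keywords rather than with anything pandoc does while walking the AST.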

> - assuming this idea is reasonable, I might want to do a few similar operations, e.g. reformatting program languages (as in this example code), wrapping keywords in \index{}, etc., and the exact format will often depend on the output target (html vs TeX etc.). Is there a better construct for this than successive if/else statements to look for matches? (I don't know much Lua)

In Lua you can do

string.gsub(val, "(%l+)", function (word)
  if indexable[word] then
    -- ... whatever ...
  end
end)

This will run the function on every group of lowercase letters in the string val.
Here I'm assuming you have a Lua table indexable that maps words to true, e.g.

{ cow = true, horse = true }

That will be much faster than iterating through an array as you're doing here.
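
To make the table lookup concrete, here is a minimal sketch in plain Lua. Indexing a table with a key that is absent yields nil, which counts as false in a condition, so indexable[word] is a membership test that needs no loop. The wrappers table and the tag_keywords function are hypothetical names added for illustration; they show one possible way to pick the output by format with a table lookup instead of successive if/else tests.

```lua
-- Keyword set: the words are the keys, the values are simply `true`.
local indexable = { cow = true, horse = true }

-- Hypothetical per-format wrappers, selected by a table lookup on the format name.
local wrappers = {
  latex = function (word)
    return word .. "\\index{" .. word .. "}"
  end,
  html = function (word)
    return '<span class="index">' .. word .. "</span>"
  end,
}

-- Tag every indexable word in `text` using the wrapper for `format`.
local function tag_keywords(text, format)
  local wrap = wrappers[format]
  if not wrap then
    return text -- unknown format: leave the text untouched
  end
  -- gsub calls the function once per run of lowercase letters. When the
  -- function returns nil, gsub keeps the original match unchanged.
  return (text:gsub("%l+", function (word)
    if indexable[word] then
      return wrap(word)
    end
  end))
end

print(tag_keywords("a cow, a horse and a goat.", "latex"))
--> a cow\index{cow}, a horse\index{horse} and a goat.
```

Inside a pandoc filter the format name would typically come from the global FORMAT variable, and raw LaTeX or HTML would need to be wrapped in pandoc.RawInline rather than returned as a plain string; both points are left out of this sketch to keep it runnable without pandoc.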
