regex captures on a Header element

public inbox archive for pandoc-discuss@googlegroups.com

* regex captures on a Header element
@ 2022-08-24 12:30 Randy Josleyn
       [not found] ` <03fcdfd9-2811-4622-897e-98d2303e54e1n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Randy Josleyn @ 2022-08-24 12:30 UTC (permalink / raw)
  To: pandoc-discuss



Hi group,

I am writing a multilingual document and I want to convert a Markdown 
header to a LaTeX command like so:

## A header | 中文标题
->
\bisection{A header}{中文标题}

Using this example 
<https://pandoc.org/lua-filters.html#modifying-pandocs-manual.txt-for-man-pages> 
about man pages from the documentation, I have come up with something like 
the following filter:

~~~lua
local text = pandoc.text
local raw = function (content)
  return pandoc.RawInline('latex', content)
end

function Header(el)
  local pattern = "(%a+)%s+|%s+(.*)"
  headertext = table.unpack(el.content).text
  local _, _, enh, zhh = string.find(headertext, pattern)
  return raw('\\bisection{'..enh..'}{'..zhh..'}')
end
~~~

However, Lua tells me I'm trying to concatenate a nil value `zhh`. I guess 
it could be that my pattern is wrong, or that I'm using string.find 
incorrectly; I copied the pattern from Programming in Lua, Section 20.3, 
"Captures" <https://www.lua.org/pil/20.3.html>. Can anyone give me any 
pointers?

Thank you all!

Randy

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/03fcdfd9-2811-4622-897e-98d2303e54e1n%40googlegroups.com.


* Re: regex captures on a Header element
       [not found] ` <03fcdfd9-2811-4622-897e-98d2303e54e1n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2022-08-24 16:42   ` John MacFarlane
       [not found]     ` <79D67508-3478-4C1D-9637-7084AE959EDD-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: John MacFarlane @ 2022-08-24 16:42 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

One thing to keep in mind is that Lua's string functions are not Unicode-aware:
they operate on bytes. So things like %a+ are probably not going to work as
expected on Chinese text. Lua 5.3 (which is the default version we include) has
some support for UTF-8; see https://www.lua.org/manual/5.3/manual.html#6.5
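
To make the byte-versus-character distinction concrete, here is a small
sketch in plain Lua (no pandoc needed). It assumes Lua 5.3 or later for the
`utf8` library and the default "C" locale, in which `%a` only matches ASCII
letters; the sample string is the Chinese title from the original question.

~~~lua
-- Assumes Lua 5.3+ (utf8 library) and the default "C" locale.
local zh = "中文标题"

-- The string library counts bytes; each of these characters is 3 bytes in UTF-8.
print(#zh)              -- 12
-- The utf8 library counts code points instead.
print(utf8.len(zh))     -- 4

-- %a is tested byte by byte, so in the "C" locale it matches no byte
-- of a multi-byte character.
print(zh:match("%a+"))  -- nil

-- utf8.codes walks the string by byte position and code point.
for pos, cp in utf8.codes(zh) do
  print(pos, utf8.char(cp))
end
~~~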




* Re: regex captures on a Header element
       [not found]     ` <79D67508-3478-4C1D-9637-7084AE959EDD-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2022-08-25  3:29       ` Randy Josleyn
       [not found]         ` <85a8a23d-7e4e-4f1a-b54e-b2c0fb4e6182n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Randy Josleyn @ 2022-08-25  3:29 UTC (permalink / raw)
  To: pandoc-discuss



Thank you for the heads-up. I just tested out `utf8.charpattern`, but I 
could only get it to match one character and not a contiguous string of 
them; I'll default to `.+` for now.
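
For reference, `utf8.charpattern` matches exactly one UTF-8 sequence, and a
Lua quantifier applies only to a single character class, so appending `+`
to it does not produce a run of characters. Below is a minimal sketch of two
workarounds, assuming Lua 5.3+ and UTF-8 input: iterate with
`string.gmatch`, or match a run of non-ASCII bytes directly. The
`[\x80-\xFF]+` class is only an illustration; it treats all non-ASCII bytes
alike and does not validate the encoding.

~~~lua
local header = "A header | 中文标题"

-- utf8.charpattern matches a single UTF-8 encoded character.
print(header:match(utf8.charpattern))   -- A

-- Collect all characters, one match at a time.
local chars = {}
for ch in header:gmatch(utf8.charpattern) do
  chars[#chars + 1] = ch
end
print(#chars)                           -- 15

-- A run of non-ASCII bytes covers a contiguous block of CJK text.
local zh = header:match("[\x80-\xFF]+")
print(zh)                               -- 中文标题
~~~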

After more experimentation, I was able to get it doing what I wanted. I 
realized I should have been using `table.concat` instead of `unpack`. My 
final code is below for reference. My next task is to get pairs of 
paragraphs and put their contents in a custom latex command which typesets 
them in parallel. Thank you for your help!

~~~lua
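function Header(el)
  -- The lazy first capture takes everything before the separator, so
  -- multi-word English titles stay whole; `|` and `%s` are ASCII-only
  -- and therefore safe to match byte-wise in UTF-8 text.
  local pattern = "^(.-)%s+|%s+(.+)$"
  local content = {}
  for _, v in ipairs(el.content) do
    -- Append instead of indexing by position, so skipped inlines
    -- (e.g. Emph) leave no holes that would break table.concat.
    if v.t == 'Str' then
      content[#content + 1] = v.text
    elseif v.t == 'Space' then
      content[#content + 1] = ' '
    end
  end
  local headertext = table.concat(content)
  local headers = (string.gsub(headertext, pattern, '{%1}{%2}'))
  -- A Header is a block element, so return a raw block.
  return pandoc.RawBlock('latex', '\\bisection' .. headers)
end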
~~~
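
Why the first attempt failed can be reproduced without pandoc. In the sketch
below, plain tables stand in for the `Str` and `Space` inlines (an
assumption made purely for illustration). `table.unpack(list).text` keeps
only the first element's text, so the pattern never sees the separator and
`string.find` returns nil for every capture.

~~~lua
-- Hypothetical stand-ins for el.content of "## A header | 中文标题".
local content = {
  { t = 'Str', text = 'A' }, { t = 'Space' },
  { t = 'Str', text = 'header' }, { t = 'Space' },
  { t = 'Str', text = '|' }, { t = 'Space' },
  { t = 'Str', text = '中文标题' },
}

-- A call followed by an index is truncated to its first result: only "A" is left.
local first = table.unpack(content).text
print(first)                                   -- A
print(string.find(first, "(%a+)%s+|%s+(.*)"))  -- nil

-- Joining all pieces gives the full heading text to match against.
local parts = {}
for _, v in ipairs(content) do
  parts[#parts + 1] = v.text or ' '
end
local joined = table.concat(parts)
print(joined)                                  -- A header | 中文标题
~~~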


* Re: regex captures on a Header element
       [not found]         ` <85a8a23d-7e4e-4f1a-b54e-b2c0fb4e6182n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2022-08-25 17:21           ` Albert Krewinkel
  0 siblings, 0 replies; 4+ messages in thread
From: Albert Krewinkel @ 2022-08-25 17:21 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


If you have markup in your headings, you may want to iterate over `el.content` to find the separator; then you don't have to worry about Unicode. Something along the lines of

~~~lua

local sep_seen = false
local en = pandoc.Inlines{}
local zh = pandoc.Inlines{}
for i, v in ipairs(el.content) do
  if sep_seen then
    -- everything after the separator belongs to the Chinese title
    zh:insert(v)
  elseif v.text == '|' then
    -- the separator itself is dropped
    sep_seen = true
  else
    en:insert(v)
  end
end

~~~

You may also like the function `pandoc.utils.stringify` as an alternative to using table.concat.

https://pandoc.org/lua-filters#pandoc.utils.stringify
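
Putting these suggestions together, a complete filter might look like the
following. This is a sketch under assumptions, not something tested against
every pandoc version: the helper names `split_at_separator` and `trim` are
made up for illustration, each half is wrapped in a `pandoc.Span` so that
`pandoc.utils.stringify` receives a single element, and any inline markup in
the heading is flattened to plain text.

~~~lua
-- Split a list of inlines at the first Str whose text is exactly '|'.
-- Works on anything ipairs can walk, including el.content.
local function split_at_separator(inlines)
  local before, after, sep_seen = {}, {}, false
  for _, v in ipairs(inlines) do
    if sep_seen then
      after[#after + 1] = v
    elseif v.t == 'Str' and v.text == '|' then
      sep_seen = true
    else
      before[#before + 1] = v
    end
  end
  return before, after, sep_seen
end

-- Strip the spaces left over on either side of the separator.
local function trim(s)
  return (s:gsub('^%s+', ''):gsub('%s+$', ''))
end

function Header(el)
  local en, zh, found = split_at_separator(el.content)
  if not found then
    return nil  -- no separator: leave the header unchanged
  end
  local stringify = pandoc.utils.stringify
  local en_text = trim(stringify(pandoc.Span(en)))
  local zh_text = trim(stringify(pandoc.Span(zh)))
  return pandoc.RawBlock('latex',
    '\\bisection{' .. en_text .. '}{' .. zh_text .. '}')
end
~~~

Saved under a name such as `bisection.lua` (a placeholder), it would be
passed to pandoc with `--lua-filter=bisection.lua`; the `\bisection` command
itself still has to be defined in the LaTeX preamble.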


