ntg-context - mailing list for ConTeXt users
* String substitution using regular expressions and backreferences
From: Thangalin via ntg-context @ 2022-08-01 19:58 UTC (permalink / raw)
  To: mailing list for ConTeXt users; +Cc: Thangalin


Hi list,

I'm looking to perform text replacements.

\definereplacement[SubstPostmeridian][
  match={[Pp].[Mm].},
  replace={\cap{pm}}
]

The \replaceword command doesn't handle periods well. The translate module
doesn't seem flexible enough to cover edge cases. Consider the following
example document containing both sample inputs and sample outputs:

\starttext
  {\bf Markdown Input}

  Our grandmother clock rang 11 p.m. and we fled.

  Our grandmother clock rang 11 p.m., so we fled.

  Our grandmother clock rang 11 p.m. We fled.

  \blank[big]

  {\bf \ConTeXt{} Output}

  Our grandmother clock rang 11 \cap{pm} and we fled.

  Our grandmother clock rang 11 \cap{pm}, so we fled.

  Our grandmother clock rang 11 \cap{pm}. We fled.
\stoptext

It would be most convenient to write:

% Strip periods from p.m.
\definereplacement[SubstPostmeridianLowercase][
  match={[Pp].[Mm]. ([^[:upper:]])},
  replace={\cap{pm} \1}
]

% Preserve terminal period for p.m. (e.e. cummings notwithstanding)
\definereplacement[SubstPostmeridianTerminal][
  match={[Pp].[Mm]. ([[:upper:]])},
  replace={\cap{pm}. \1}
]

% Apply a macron for lowercase 'c' (McAnulty, McGenius, etc.)
% Well, not quite a macron: https://tex.stackexchange.com/q/364024/2148
\definereplacement[SubstMac][
  match={Mc([[:upper:]]\w)},
  replace={M\macronbelow{c}\1}
]

The \1 may be problematic. Other sigils include $1 and #1, which may also
have issues.
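
To pin down the intended behaviour, here is a minimal sketch in plain Lua,
run outside ConTeXt. It is not a ConTeXt interface: it uses Lua patterns
(which are not regular expressions) with string.gsub, where %1 is the
backreference. The function name substitute is hypothetical, and \cap and
\macronbelow are only the macro names used above, emitted as plain text.

```lua
-- Sketch only: plain Lua patterns via string.gsub, not a ConTeXt command.
-- The function name `substitute` is hypothetical.
local function substitute(s)
  -- "p.m." followed by whitespace and an uppercase letter ends a sentence:
  -- keep one period and put the captured text back with %1.
  s = s:gsub("[Pp]%.[Mm]%.(%s+%u)", "\\cap{pm}.%1")
  -- Every remaining "p.m." loses both periods.
  s = s:gsub("[Pp]%.[Mm]%.", "\\cap{pm}")
  -- "Mc" followed by an uppercase letter and one more letter.
  s = s:gsub("Mc(%u%a)", "M\\macronbelow{c}%1")
  return s
end

print(substitute("Our grandmother clock rang 11 p.m. and we fled."))
print(substitute("Our grandmother clock rang 11 p.m., so we fled."))
print(substitute("Our grandmother clock rang 11 p.m. We fled."))
print(substitute("Mr. McAnulty, I presume?"))
```

The order of the rules matters: the sentence-ending rule has to run before
the general one. Two limitations of this sketch: %u and %a follow the C
locale, so non-ASCII capitals are not recognised, and a "p.m." at the very
end of the input loses its period.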

Thank you!

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / https://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : https://www.pragma-ade.nl / http://context.aanhet.net
archive  : https://bitbucket.org/phg/context-mirror/commits/
wiki     : https://contextgarden.net
___________________________________________________________________________________


* Re: String substitution using regular expressions and backreferences
From: Wolfgang Schuster via ntg-context @ 2022-08-01 20:13 UTC (permalink / raw)
  To: mailing list for ConTeXt users; +Cc: Wolfgang Schuster


Thangalin via ntg-context wrote on 2022-08-01 at 21:58:
> Hi list,
>
> I'm looking to perform text replacements.

Please don't omit important information. On TeX SE you mentioned that your
input is XML, which means a lot more can be done than your simple TeX-based
example demonstrates.

Wolfgang



* Re: String substitution using regular expressions and backreferences
From: Thangalin via ntg-context @ 2022-08-01 20:56 UTC (permalink / raw)
  To: mailing list for ConTeXt users; +Cc: Thangalin


Good point, Wolfgang.

The Markdown is translated to XHTML then typeset as XML using the setups
listed here:

https://github.com/DaveJarvis/keenwrite-themes/tree/main/xhtml

Having an XML string replacement solution would be great. I suppose that
would help prevent substitutions within pre and code blocks, too, wouldn't
it?


* Re: String substitution using regular expressions and backreferences
From: Thangalin via ntg-context @ 2022-08-25 19:44 UTC (permalink / raw)
  To: mailing list for ConTeXt users; +Cc: Thangalin


I've attempted to apply Wolfgang's subtle suggestion of using Lua to parse
the input document using a regular expression via lpeg.replacer. The
replacement itself works fine; however, in doing so the XML document
structure is converted to text, which means that it is no longer possible
to "flush" the XML for further processing as XML. The result is that any
unresolved XML tags are written verbatim to the PDF:

https://i.stack.imgur.com/9ZFND.png

There are two other issues with this approach. First is efficiency. Second
is that the processing function would have to be called for every XML
element to capture the replacement.

My original post asked about applying regex word substitution in a ConTeXt
way, such as:

\definereplacement[SubstMac][ match={Mc([A-Z].*)}, replace={\Mac \\1} ]
\definereplacement[SubstPostmeridian][ match={[Pp]\\.[Mm]\\.},
replace={\cap{pm}} ]

That seems like the cleanest approach because it would work on top of XML
or any other source document. Nevertheless, here is what I tried, which
partially works:

\startbuffer[main]
<html>
  <p>“Mr. McAnulty, I presume?”</p>
  <p>Regular text. <em>Irregular text.</em></p>
</html>
\stopbuffer

\startxmlsetups xml:xhtml
  \xmlsetsetup{\xmldocument}{*}{-}
  \xmlsetsetup{\xmldocument}{html|p|em}{xml:*}
\stopxmlsetups

\startxmlsetups xml:html
  \startdocument
    \xmlflush{#1}
  \stopdocument
\stopxmlsetups

% Paragraphs are followed by a paragraph break, but only if not nested.
\startxmlsetups xml:p
  \xmlfunction{#1}{p}
  \par
\stopxmlsetups

\startxmlsetups xml:em
  \dontleavehmode{\em\xmlflush{#1}}
\stopxmlsetups

\startluacode
function xml.functions.p( t )
  rep = { [1] = { "McAnulty", "\\Mac Anulty" } }
  x = lpeg.replacer( rep ):match( tostring( xml.text( t ) ) )

  buffers.assign( "p", context( x ) )
  context.getbuffer{ "p" }
end
\stopluacode

\xmlregistersetup{xml:xhtml}

\def\Mac{%
  % Determine the sizes of 'M' and 'c'.
  \newbox\MacMBox%
  \setbox\MacMBox\hbox{M}%
  \newbox\MacCBox%
  \setbox\MacCBox\hbox{c}%
  %
  % Cheat to dynamically derive the kerning size by putting Mc in a box.
  %
  \newbox\MacKernBox%
  \setbox\MacKernBox\hbox{\inframed[offset=\zeropoint, width=fit]{Mc}}%
  \def\MacDelta{\dimexpr\wd\MacKernBox-\wd\MacMBox-\wd\MacCBox\relax}%
  \def\MacUWidth{\dimexpr\wd\MacCBox-.75\MacDelta\relax}%
  \def\MacRule{\vrule width \MacUWidth height .04em depth \zeropoint \relax}%
  \def\MacKern{\dimexpr\wd\MacKernBox-\wd\MacMBox-\wd\MacCBox\relax}%
  \def\MacHeight{\dimexpr\ht\MacMBox-\ht\MacCBox\relax}%
  %
  % Write Mc, where c has a macron, to the document.
  %
  M{%
    \dontleavehmode{\raisebox{\MacHeight}\hbox{c}}%
    \kern-1.04\MacUWidth
    \MacRule
    \kern.08\MacUWidth
  }%
}%
\xmlprocessbuffer{main}{main}{}

As shown in the screen shot, this doesn't correctly handle nested XML
elements.

Any ideas on what approach to take to perform a string replacement in
ConTeXt?

Thanks again!


> [Your] input is XML which means a lot more can be done than your simple TeX
> based example demonstrates.
>
> Wolfgang

* Re: String substitution using regular expressions and backreferences
From: Hans Hagen via ntg-context @ 2022-08-26  7:34 UTC (permalink / raw)
  To: mailing list for ConTeXt users; +Cc: Hans Hagen

On 8/25/2022 9:44 PM, Thangalin via ntg-context wrote:
> As shown in the screen shot, this doesn't correctly handle nested XML
> elements.
> 
> Any ideas on what approach to take to perform a string replacement in
> ConTeXt?

Best stay at the XML end ...

\startbuffer[main]
<html>
   <p>“Mr. McAnulty, I presume?”</p>
   <p>Regular text. <em>Irregular text.</em></p>
</html>
\stopbuffer

\startxmlsetups xml:xhtml
   \xmlsetsetup{\xmldocument}{*}{-}
   \xmlsetsetup{\xmldocument}{html|p|em}{xml:*}
\stopxmlsetups

\startxmlsetups xml:html
     \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:p
     \xmlfunction{#1}{p}
     \xmlcontext{#1}
     \par
\stopxmlsetups

\startxmlsetups xml:em
   \dontleavehmode{\em\xmlflush{#1}}
\stopxmlsetups

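\startluacode
     -- Build the replacer once, outside the function.
     local rep = lpeg.replacer { [1] = { "McAnulty", "\\Mac Anulty" } }
     function xml.functions.p(t)
         -- t.dt holds the children of the element: strings are text,
         -- tables are nested elements.
         local dt = t.dt
         for i=1,#dt do
             local di = dt[i]
             -- Rewrite text only, so nested elements such as <em> survive.
             if type(di) == "string" then
                 dt[i] = lpeg.match(rep,di)
             end
         end
     end
\stopluacode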

\xmlregistersetup{xml:xhtml}

\startdocument
     \xmlprocessbuffer{main}{main}{}
\stopdocument
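
The reason this keeps nested elements intact is that only the string entries
of t.dt are rewritten. Element children are tables and are left alone, so the
tree can still be flushed as XML afterwards. A standalone sketch in plain Lua,
run outside ConTeXt, shows the idea. The node table is a hand-made mock,
replace_in_node and replace_text are hypothetical names, and string.gsub
stands in for lpeg.replacer:

```lua
-- Stand-in for the lpeg.replacer call: plain-text substitution.
local function replace_text(s)
  -- Parentheses drop gsub's second return value (the match count).
  return (s:gsub("McAnulty", "\\Mac Anulty"))
end

-- Rewrite only the string children of a node. Element children are tables
-- and are not touched, so the structure of the tree is preserved.
local function replace_in_node(node)
  local dt = node.dt
  for i = 1, #dt do
    local di = dt[i]
    if type(di) == "string" then
      dt[i] = replace_text(di)
    end
  end
  return node
end

-- Mock of <p>Mr. McAnulty <em>McAnulty</em> said.</p>
local p = {
  tg = "p",
  dt = { "Mr. McAnulty ", { tg = "em", dt = { "McAnulty" } }, " said." },
}

replace_in_node(p)
print(p.dt[1], type(p.dt[2]), p.dt[3])
```

Only the direct text children are visited, so in this sketch the text inside
the nested em element is not replaced. Covering it would need either a
recursive walk or the same function applied to em as well.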

But this is more fun and probably also more reliable:

\startbuffer[main]
<html>
   <p>“Mr. McAnulty, I presume?”</p>
   <p>Regular text. <em>Irregular text.</em></p>
</html>
\stopbuffer

\startxmlsetups xml:xhtml
   \xmlsetsetup{\xmldocument}{*}{-}
   \xmlsetsetup{\xmldocument}{html|p|em}{xml:*}
\stopxmlsetups

\startxmlsetups xml:html
     \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:p
     \xmlcontext{#1}
     \par
\stopxmlsetups

\startxmlsetups xml:em
   \dontleavehmode{\em\xmlflush{#1}}
\stopxmlsetups

\xmlregistersetup{xml:xhtml}

\usemodule[gimmicks] % in latest uploads

\chardef\MacAnulty = \getprivateglyphslot{MacAnulty}

\startsetups [box:mcanulty:\number\MacAnulty]
     \Mac Anulty
\stopsetups

\registerboxglyph category {mcanulty} unicode \MacAnulty \relax

\startluacode
     fonts.handlers.otf.addfeature {
         name    = "mcanulty",
         type    = "ligature",
         nocheck = true,
         data    = {
             [fonts.constructors.privateslots.MacAnulty] = {
                 "M", "c", "A", "n", "u", "l", "t", "y",
             },
         }
     }
\stopluacode

\definefontfeature[default][default][box=mcanulty,mcanulty=yes]

\startdocument
     \xmlprocessbuffer{main}{main}{}
\stopdocument

-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
        tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------