From: Thangalin via ntg-context
To: mailing list for ConTeXt users
Subject: Re: String substitution using regular expressions and backreferences
Date: Thu, 25 Aug 2022 12:44:12 -0700

I've attempted to apply Wolfgang's subtle suggestion of using Lua to parse the input document using a regular expression via lpeg.replacer. The replacement itself works fine; however, in doing so the XML document structure is converted to text, which means that it is no longer possible to "flush" the XML for further processing as XML. The result is that any unresolved XML tags are written verbatim to the PDF:

https://i.stack.imgur.com/9ZFND.png

There are two other issues with this approach. First is efficiency. Second is that the processing function would have to be called for every XML element to capture the replacement.

My original post asked about applying regex word substitution in a ConTeXt way, such as:

\definereplacement[SubstMac][ match={Mc([A-Z].*)}, replace={\Mac \\1} ]
\definereplacement[SubstPostmeridian][ match={[Pp]\\.[Mm]\\.}, replace={\cap{pm}} ]
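
To make the intended semantics concrete, here is a minimal sketch in plain Lua of what those two rules would do to a string. It is only an illustration: the rule table and the substitute function are hypothetical names, and Lua patterns stand in for real regular expressions (%u is an uppercase letter, %. a literal dot, %1 the first capture).

```lua
-- Hypothetical rule table mirroring the two \definereplacement examples.
-- Lua patterns are not full regular expressions, so the syntax differs.
local rules = {
  { match = "Mc(%u.*)",     replace = "\\Mac %1"  },
  { match = "[Pp]%.[Mm]%.", replace = "\\cap{pm}" },
}

-- Apply every rule in order and return the rewritten string.
local function substitute( text )
  for _, rule in ipairs( rules ) do
    -- gsub returns the new string and a match count; keep only the string.
    text = ( text:gsub( rule.match, rule.replace ) )
  end
  return text
end

print( substitute( "Mr. McAnulty, I presume?" ) )
print( substitute( "See you at 5 p.m." ) )
```

The captured text after "Mc" is carried into the replacement through the backreference, which is the behaviour the \\1 in the rules above is meant to express.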

A declarative interface like that seems the cleanest approach because it would work on top of XML or any other source document. Nevertheless, here is what I tried, which partially works:
\startbuffer[main]
<html>
  <p>“Mr. McAnulty, I presume?”</p>
  <p>Regular text. <em>Irregular text.</em></p>
</html>
\stopbuffer

\startxmlsetups xml:xhtml
  \xmlsetsetup{\xmldocument}{*}{-}
  \xmlsetsetup{\xmldocument}{html|p|em}{xml:*}
\stopxmlsetups

\startxmlsetups xml:html
  \startdocument
    \xmlflush{#1}
  \stopdocument
\stopxmlsetups

% Paragraphs are followed by a paragraph break, but only if not nested.
\startxmlsetups xml:p
  \xmlfunction{#1}{p}
  \par
\stopxmlsetups

\startxmlsetups xml:em
  \dontleavehmode{\em\xmlflush{#1}}
\stopxmlsetups

\startluacode
function xml.functions.p( t )
  local rep = { [1] = { "McAnulty", "\\Mac Anulty" } }
  local x = lpeg.replacer( rep ):match( tostring( xml.text( t ) ) )

  buffers.assign( "p", context( x ) )
  context.getbuffer{ "p" }
end
\stopluacode

\xmlregistersetup{xml:xhtml}

\def\Mac{%
  % Determine the sizes of 'M' and 'c'.
  \newbox\MacMBox%
  \setbox\MacMBox\hbox{M}%
  \newbox\MacCBox%
  \setbox\MacCBox\hbox{c}%
  %
  % Cheat to dynamically derive the kerning size by putting Mc in a box.
  %
  \newbox\MacKernBox%
  \setbox\MacKernBox\hbox{\inframed[offset=\zeropoint, width=fit]{Mc}}%
  \def\MacDelta{\dimexpr\wd\MacKernBox-\wd\MacMBox-\wd\MacCBox\relax}%
  \def\MacUWidth{\dimexpr\wd\MacCBox-.75\MacDelta\relax}%
  \def\MacRule{\vrule width \MacUWidth height .04em depth \zeropoint \relax}%
  \def\MacKern{\dimexpr\wd\MacKernBox-\wd\MacMBox-\wd\MacCBox\relax}%
  \def\MacHeight{\dimexpr\ht\MacMBox-\ht\MacCBox\relax}%
  %
  % Write Mc, where c has a macron, to the document.
  %
  M{%
    \dontleavehmode{\raisebox{\MacHeight}\hbox{c}}%
    \kern-1.04\MacUWidth
    \MacRule
    \kern.08\MacUWidth
  }%
}%

\xmlprocessbuffer{main}{main}{}
As shown in the screenshot, this doesn't correctly handle nested XML elements.
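
To show the behaviour I am after, here is a rough sketch in plain Lua that replaces text only inside text nodes and leaves the element structure alone. Everything in it is an assumption made for illustration: the toy tree is not ConTeXt's internal XML representation, and replace_text and serialise are hypothetical helpers, not part of any ConTeXt API.

```lua
-- Toy tree (NOT ConTeXt's XML representation): an element is a table with
-- a tag and positional children; a text node is a plain string.
local doc = {
  tag = "p",
  "Regular McAnulty text. ",
  { tag = "em", "Irregular text." },
}

-- Rewrite string children in place; recurse into element children so that
-- nested elements are preserved rather than flattened to text.
local function replace_text( node, pattern, replacement )
  for i, child in ipairs( node ) do
    if type( child ) == "string" then
      node[ i ] = ( child:gsub( pattern, replacement ) )
    else
      replace_text( child, pattern, replacement )
    end
  end
  return node
end

-- Serialise back to markup to show that the <em> element survives.
local function serialise( node )
  if type( node ) == "string" then return node end
  local parts = {}
  for _, child in ipairs( node ) do
    parts[ #parts + 1 ] = serialise( child )
  end
  return "<" .. node.tag .. ">" .. table.concat( parts ) .. "</" .. node.tag .. ">"
end

print( serialise( replace_text( doc, "Mc(%u%a*)", "\\Mac %1" ) ) )
```

In contrast, my xml.functions.p above converts the whole paragraph to a string first, which is where the nested structure gets lost.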

Any ideas on what approach to take to perform a string replacement in ConTeXt?

Thanks again!

> [Your] input is XML which means a lot more can be done than your simple TeX
> based example demonstrates.
>
> Wolfgang
