ntg-context - mailing list for ConTeXt users
* XML, dealing with whitespace
From: Denis Maier via ntg-context @ 2022-01-15 12:04 UTC (permalink / raw)
  To: ntg-context; +Cc: denis.maier



Hi all,

I have sources that look like this:

%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="UTF-8"?>
<article>
   <p>Bla Bla Bla</p>
   <p>
      <underline>
         <italic>Bla</italic>
      </underline>, Bla Bla.</p>
</article>
%%%%%%%%%%%%%%%%%%%%%

Typesetting this with ConTeXt gives me a spurious space after the underlined, italic Bla. Complete MWE:

%%%%%%%%%%%%%%%%%%%%%
\startxmlsetups xml:test
    \xmlsetsetup{#1}{*}{-}
    \xmlsetsetup{#1}{article|p|italic|underline}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:test}

\startxmlsetups xml:article
\starttext
    \xmlflush{#1}
\stoptext
\stopxmlsetups

\startxmlsetups xml:p
    \xmlflush{#1}\par
\stopxmlsetups

\startxmlsetups xml:italic
    \emph{\xmlflush{#1}}
\stopxmlsetups

\startxmlsetups xml:underline
    \underbar{\xmlflush{#1}}
\stopxmlsetups

\startbuffer[test]
<?xml version="1.0" encoding="UTF-8"?>
<article>
   <p>Bla Bla Bla</p>
   <p>
      <underline>
         <italic>Bla</italic>
      </underline>, Bla Bla.</p>
</article>
\stopbuffer

\xmlprocessbuffer{test}{test}{}
%%%%%%%%%%%%%%%%%%%%%

How can I get rid of spurious leading and trailing whitespace? I've found \xmlstrip and \xmlstripped, but I don't really understand how they work. I've also found out about
\ignorespaces\xmlflush{#1}\removeunwantedspaces
but this then has to be added to every definition, which would be a bit tedious...
There have been a couple of similar questions by Hans van der Meer about a decade ago, but I couldn't find the answer.
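
For illustration, this is roughly what that workaround looks like when applied to the two inline setups of the MWE above. It is only a sketch: I am assuming that the stray space comes from the whitespace inside <underline> and <italic>, and that \ignorespaces and \removeunwantedspaces behave inside \emph and \underbar the same way they do elsewhere.

%%%%%%%%%%%%%%%%%%%%%
\startxmlsetups xml:italic
    % skip whitespace before the flushed content, remove it after
    \emph{\ignorespaces\xmlflush{#1}\removeunwantedspaces}
\stopxmlsetups

\startxmlsetups xml:underline
    % same pair of macros, repeated for every inline element
    \underbar{\ignorespaces\xmlflush{#1}\removeunwantedspaces}
\stopxmlsetups
%%%%%%%%%%%%%%%%%%%%%

Every further inline element would need the same wrapper, which is exactly the repetition I would like to avoid.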

Then, \xmlstripanywhere is also mentioned in xml-mkiv.pdf, but it's not explained. I found one example in the sources (https://source.contextgarden.net/tex/context/modules/mkiv/x-html.mkiv?search=%5Cxmlstripanywhere#l50), but what does it do? Is it somehow needed for \xmlstrip and friends to work?

So, what would be the best way to deal with this situation? (More details below; perhaps there's an easier solution outside of ConTeXt, because the problem is actually caused by XSLT...)

Best,
Denis


P.S. Background:

I convert docx files to JATS XML with pandoc. Pandoc does quite a decent job, but I need to tweak a few things with XSLT. The actual transformation that I need works OK, but it also causes other problems.
This is the original Markdown file:

%%%%%%%%%%%%%%%%%%%%%%%
Bla Bla Bla

[*Bla*]{.underline}, Bla Bla.
%%%%%%%%%%%%%%%%%%%%%%%

Pandoc produces a JATS XML file that looks like this (simplified, empty nodes deleted):

%%%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="utf-8" ?>
<article>
<body>
<p>Bla Bla Bla</p>
<p><underline><italic>Bla</italic></underline>, Bla Bla.</p>
</body>
</article>
%%%%%%%%%%%%%%%%%%%%%%%

I use this XSLT stylesheet to tweak pandoc's output:

%%%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">

    <xsl:output method="xml" indent="yes"/>

    <!-- <xsl:strip-space elements="*"/> -->

    <!-- Identity transformation: copy each element with its attributes and children -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

<!-- <xsl:template match="node()|@*"> -->
     <!-- <xsl:copy> -->
       <!-- <xsl:apply-templates select="node()|@*"/> -->
     <!-- </xsl:copy> -->
<!-- </xsl:template> -->

</xsl:stylesheet>
%%%%%%%%%%%%%%%%%%%%%%%

This is again much simplified; I've omitted the templates that do the actual tweaking.
Anyway, both versions of the identity transformation (the active one and the commented-out one) produce this with Saxon:

%%%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="UTF-8"?>
<article>
   <body>
      <p>Bla Bla Bla</p>
      <p>
         <underline>
            <italic>Bla</italic>
         </underline>, Bla Bla.</p>
   </body>
</article>
%%%%%%%%%%%%%%%%%%%%%%%

I can get rid of all the whitespace with indent="no", but that produces a rather unreadable file.
xsl:strip-space had no effect.

Does anyone know how to improve that step? Is there a way to convince an XSLT processor not to introduce newlines after certain tags? Something like treating paragraphs as a single unit.
Am I missing something?


