ntg-context - mailing list for ConTeXt users
* XML, dealing with whitespace
From: Denis Maier via ntg-context @ 2022-01-15 12:04 UTC
  To: ntg-context; +Cc: denis.maier


Hi all,

I have sources that look like this:

%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="UTF-8"?>
<article>
   <p>Bla Bla Bla</p>
   <p>
      <underline>
         <italic>Bla</italic>
      </underline>, Bla Bla.</p>
</article>
%%%%%%%%%%%%%%%%%%%%%

Typesetting this with context gives me a spurious space after the underlined Bla in italics. Complete MWE:

%%%%%%%%%%%%%%%%%%%%%
\startxmlsetups xml:test
    \xmlsetsetup{#1}{*}{-}
    \xmlsetsetup{#1}{article|p|italic|underline}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:test}

\startxmlsetups xml:article
\starttext
    \xmlflush{#1}
\stoptext
\stopxmlsetups

\startxmlsetups xml:p
    \xmlflush{#1}\par
\stopxmlsetups

\startxmlsetups xml:italic
    \emph{\xmlflush{#1}}
\stopxmlsetups

\startxmlsetups xml:underline
    \underbar{\xmlflush{#1}}
\stopxmlsetups

\startbuffer[test]
<?xml version="1.0" encoding="UTF-8"?>
<article>
   <p>Bla Bla Bla</p>
   <p>
      <underline>
         <italic>Bla</italic>
      </underline>, Bla Bla.</p>
</article>
\stopbuffer

\xmlprocessbuffer{test}{test}{}
%%%%%%%%%%%%%%%%%%%%%

How can I get rid of spurious leading and trailing whitespace? I've found \xmlstrip and \xmlstripped, but I don't really understand how they work. I've also found out about
\ignorespaces\xmlflush{#1}\removeunwantedspaces
but this then has to be added to every definition, which would be a bit tedious...
There were a couple of similar questions by Hans van der Meer about a decade ago, but I couldn't find the answer.
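
For illustration, the per-definition workaround mentioned above could be applied to the two inline elements of the MWE roughly as follows. This is only a sketch: it reuses the setup names from the example and assumes that \ignorespaces and \removeunwantedspaces act on the flushed XML content the same way they act on ordinary running text. It has not been checked against deeper nesting or other elements.

```tex
% Sketch only: trim whitespace around the flushed content of inline elements.
% The setup names (xml:italic, xml:underline) are the ones from the MWE.
\startxmlsetups xml:italic
    \emph{\ignorespaces\xmlflush{#1}\removeunwantedspaces}
\stopxmlsetups

\startxmlsetups xml:underline
    \underbar{\ignorespaces\xmlflush{#1}\removeunwantedspaces}
\stopxmlsetups
```

Every inline element needs the same pair of commands, which is exactly the repetition that makes this approach tedious.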

Then, \xmlstripanywhere is also mentioned in xml-mkiv.pdf, but it's not explained. I found one example in the sources (https://source.contextgarden.net/tex/context/modules/mkiv/x-html.mkiv?search=%5Cxmlstripanywhere#l50), but what does it do? Is it needed for \xmlstrip and friends to work?

So, what would be the best way to deal with that situation? (More details below, perhaps there's an easier solution outside of context, because the problem is actually caused by xslt...)

Best,
Denis


P.S. Background:

I convert docx files with pandoc to JATS XML. Pandoc does quite a decent job, but I need to tweak a few things with XSLT. The actual transformation that I need works OK, but it also causes other problems.
This is the original markdown file:

%%%%%%%%%%%%%%%%%%%%%%%
Bla Bla Bla

[*Bla*]{.underline} Bla Bla.
%%%%%%%%%%%%%%%%%%%%%%%

Pandoc produces a JATS XML file that looks like this (simplified, empty nodes deleted):

%%%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="utf-8" ?>
<article>
<body>
<p>Bla Bla Bla</p>
<p><underline><italic>Bla</italic></underline>, Bla Bla.</p>
</body>
</article>
%%%%%%%%%%%%%%%%%%%%%%%

I use this XSL stylesheet to tweak pandoc's output:

%%%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">

<xsl:output
    method="xml"
    indent="yes" />

<!-- <xsl:strip-space elements="*"/> -->

<xsl:template match="*">
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

<!-- <xsl:template match="node()|@*"> -->
<!--     <xsl:copy> -->
<!--         <xsl:apply-templates select="node()|@*"/> -->
<!--     </xsl:copy> -->
<!-- </xsl:template> -->

</xsl:stylesheet>
%%%%%%%%%%%%%%%%%%%%%%%

This is again much simplified; I've omitted the templates that do the actual tweaking.
Anyway, both versions of the identity transformation produce this (using Saxon):

%%%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="UTF-8"?>
<article>
   <body>
      <p>Bla Bla Bla</p>
      <p>
         <underline>
            <italic>Bla</italic>
         </underline>, Bla Bla.</p>
   </body>
</article>
%%%%%%%%%%%%%%%%%%%%%%%

I can get rid of all the whitespace with indent="no", but that produces a rather unreadable file.
xsl:strip-space had no effect.

Maybe someone knows how to improve that step? Is there a way to convince an XSLT processor not to introduce newlines after certain tags? Something like treating paragraphs as a single unit.
Am I missing something?

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://context.aanhet.net
archive  : https://bitbucket.org/phg/context-mirror/commits/
wiki     : http://contextgarden.net
___________________________________________________________________________________

* Re: XML, dealing with whitespace
From: Wolfgang Schuster via ntg-context @ 2022-01-15 19:28 UTC
  To: mailing list for ConTeXt users, Denis Maier via ntg-context
  Cc: Wolfgang Schuster


Denis Maier via ntg-context wrote on 15.01.2022 at 13:04:
> Hi all,
>
> I have sources that look like this:
>
> %%%%%%%%%%%%%%%%%%%%%
> <?xml version="1.0" encoding="UTF-8"?>
> <article>
>    <p>Bla Bla Bla</p>
>    <p>
>       <underline>
>          <italic>Bla</italic>
>       </underline>, Bla Bla.</p>
> </article>
> %%%%%%%%%%%%%%%%%%%%%
>
> Typesetting this with context gives me a spurious space after the
> underlined Bla in italics.

There is no spurious space; the line break is just converted to a space
and I see no reason why this shouldn't happen. To remove space before or
after certain parts of text within a paragraph you can use the
\removeunwantedspaces and \ignorespaces commands.

%%%% begin example
\starttexdefinition RemovePreceding #1
     \removeunwantedspaces
     #1
\stoptexdefinition

\starttexdefinition RemoveFollowing #1
     #1
     \ignorespaces
\stoptexdefinition

\starttext

Bla \RemovePreceding{Bla} Bla

Bla \RemoveFollowing{Bla} Bla

\stoptext
%%%% end example

When only following spaces are a problem, a better alternative to
\ignorespaces is \autoinsertnextspace, which checks the following token
and ensures that no space is inserted when the next character is punctuation.

%%%% begin example
\starttexdefinition Italic #1
     \emphasized{#1}
     \autoinsertnextspace
\stoptexdefinition

\starttexdefinition Underbar #1
     \underbar{#1}
\stoptexdefinition

\starttext

Bla Bla Bla

\Underbar{\Italic{Bla} , Bla Bla.}

\stoptext
%%%% end example

Wolfgang


* Re: XML, dealing with whitespace
From: Denis Maier via ntg-context @ 2022-01-17  9:47 UTC
  To: wolfgang.schuster.lists, ntg-context, ntg-context; +Cc: denis.maier


Hi Wolfgang,

From: Wolfgang Schuster <wolfgang.schuster.lists@gmail.com>
Sent: Saturday, 15 January 2022 20:28
To: mailing list for ConTeXt users <ntg-context@ntg.nl>; Denis Maier via ntg-context <ntg-context@ntg.nl>
Cc: Maier, Denis Christian (UB) <denis.maier@unibe.ch>
Subject: Re: [NTG-context] XML, dealing with whitespace

> Denis Maier via ntg-context wrote on 15.01.2022 at 13:04:
>
> > Hi all,
> >
> > I have sources that look like this:
> >
> > %%%%%%%%%%%%%%%%%%%%%
> > <?xml version="1.0" encoding="UTF-8"?>
> > <article>
> >    <p>Bla Bla Bla</p>
> >    <p>
> >       <underline>
> >          <italic>Bla</italic>
> >       </underline>, Bla Bla.</p>
> > </article>
> > %%%%%%%%%%%%%%%%%%%%%
> >
> > Typesetting this with context gives me a spurious space after the underlined Bla in italics.
>
> There is no spurious space; the line break is just converted to a space and I see no reason why this shouldn't happen. To remove space before or after certain parts of text within a paragraph you can use the \removeunwantedspaces and \ignorespaces commands.

Yes, that's absolutely true. From TeX's point of view, the space is not spurious. It's absolutely adequate to treat the newline as a space here.

As I outlined in my original post, the problem occurs because XSLT adds this indentation. FWIW, I finally found a solution, which seems to have been added in XSLT 3.0 (after being available as a Saxon extension): there's a new attribute, suppress-indentation, on xsl:output that can be used to control this:
"This is a new property in XSLT 3.0 (it was previously available in Saxon as an extension). The value is a whitespace-separated list of element names, and it typically identifies "inline" elements that should not cause indentation; in XHTML, for example, these would be b, i, span, and the like."
(https://www.saxonica.com/documentation9.5/xsl-elements/output.html)

So, the solution to my problem is this:

<xsl:output
    method="xml"
    indent="yes"
    suppress-indentation="italic underline"
    />

Denis
