public inbox archive for pandoc-discuss@googlegroups.com
* Reference IDs in XML output
From: Albert Krewinkel @ 2021-03-12 20:22 UTC
  To: pandoc-discuss

There is a small problem which I noticed lately: citation keys are used
as part of the id of the respective reference item; e.g., if the
bibliography contains `@misc{foo, ...}` then the entry has id="ref-foo". This
can be a problem when generating XML output, as the citation keys may
contain characters which are not allowed in XML names. E.g., BibTeX
allows slashes as part of the identifier, but those are illegal in an
`id` attribute, leading to the generation of invalid XML documents. As
far as I can see, this affects JATS, TEI, HTML4, and EPUB2. The HTML5
standard is less restrictive, so EPUB3 is unaffected.
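
To make the failure concrete, here is a minimal Python sketch (not pandoc code) that checks whether a generated id would pass as an XML name. The `is_safe_xml_id` helper is hypothetical, and its pattern is a deliberately conservative, ASCII-only subset of the XML name rules, which also admit many non-ASCII characters.

```python
import re

# Conservative ASCII-only subset of the XML name rules (an assumption made
# for brevity; the real rules also allow many non-ASCII characters).
SAFE_XML_ID = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_safe_xml_id(candidate):
    """Return True if the whole string matches the conservative pattern."""
    return SAFE_XML_ID.fullmatch(candidate) is not None

for key in ["foo", "foo/bar"]:
    ref_id = "ref-" + key
    print(ref_id, is_safe_xml_id(ref_id))
# prints:
# ref-foo True
# ref-foo/bar False
```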

I'd like to fix the problem, but am not sure where and how.

- Where: in each affected writer, or in citeproc?
- How: by removing the offending characters, or by using a different
  scheme to generate reference identifiers? Numbering, hashing, …?
  Do we check for duplicates, or can we assume that identifiers with
  prefix "ref-" are reserved for pandoc?
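
The trade-offs between these options can be sketched in a few lines of Python. This is only an illustration, not a proposal for pandoc's actual code: the function names, the eight-character SHA-1 prefix, and the numbering format are all assumptions made for the example.

```python
import hashlib
import re

def strip_unsafe(key):
    """Option 1: drop every character outside a conservative ASCII set."""
    return "ref-" + re.sub(r"[^A-Za-z0-9._-]", "", key)

def hashed_id(key):
    """Option 2: derive the id from a hash of the key (stable but opaque)."""
    return "ref-" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]

def numbered_ids(keys):
    """Option 3: number the keys in order of appearance."""
    return {key: "ref-%d" % n for n, key in enumerate(keys, start=1)}

# Stripping is lossy: two distinct keys can end up with the same id.
print(strip_unsafe("foo/bar"), strip_unsafe("foobar"))
# prints: ref-foobar ref-foobar
```

Stripping keeps the ids readable but needs a duplicate check. Hashing and numbering avoid invalid characters entirely, at the cost of ids that no longer resemble the citation key; numbered ids also change whenever the order of the entries changes.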

The more I think about this, the more questions I have and by now I'm
overthinking it. Any help to get me back to the ground is appreciated.


--
Albert Krewinkel
GPG: 8eed e3e2 e8c5 6f18 81fe  e836 388d c0b2 1f63 1124

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/877dmc9l7b.fsf%40zeitkraut.de.



* Re: Reference IDs in XML output
From: TRS-80 @ 2021-03-12 21:03 UTC
  To: pandoc-discuss

On 2021-03-12 15:22, Albert Krewinkel wrote:
> The more I think about this, the more questions I have and by now I'm
> overthinking it. Any help to get me back to the ground is appreciated.

Apologies for not having any specifics to add which might be helpful.

For whatever it's worth, I just wanted to empathize with your general
position, as I have found myself there many times.  Or, as the kids
say nowadays "I know that feel, bro."  :D

Happy Friday!

Cheers,
TRS-80



* Re: Reference IDs in XML output
From: BPJ @ 2021-03-13  9:41 UTC
  To: pandoc-discuss


I recently ran into an analogous problem in a Perl program when I wanted to
use strings containing non-ASCII letters/characters as capture group names
in regular expressions (so that I could pull the name of the group that
matched and look it up in an associative array). I ended up replacing every
character other than an ASCII alphanumeric, as well as any leading digit, with
its Unicode code point number in hex between underscores, using this
substitution:

``````perl
my $name = $key =~ s{^([0-9])|([^A-Za-z0-9])}{sprintf '_%x_', ord $+}egr;
``````

and then this substitution to turn the name back into a key:

``````perl
my $key = $name =~ s{_([[:xdigit:]]+)_}{chr hex $1}egr;
``````

so underscores themselves become `_5f_`, `$` becomes `_24_`. Some (partly
silly) examples with roundtripping:

foo/bar  ->  foo_2f_bar      ->  foo/bar
24/7     ->  _32_4_2f_7      ->  24/7
€3.14    ->  _20ac_3_2e_14   ->  €3.14
šæŋ      ->  _161__e6__14b_  ->  šæŋ

It's ugly as hell but the conversion is simple and straightforward, and the
results are predictable, reasonably compact and somewhat human readable --
if all else fails you can even look up the code point numbers by hand in a
suitable utility.
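
For readers who do not use Perl, the same pair of substitutions can be sketched in Python. This is a straightforward port of the two one-liners above, not something taken from pandoc; the function names `encode_key` and `decode_name` are made up for the example.

```python
import re

def encode_key(key):
    """Replace a leading digit and every non-ASCII-alphanumeric character
    with its code point in lowercase hex between underscores."""
    return re.sub(
        r"^[0-9]|[^A-Za-z0-9]",
        lambda m: "_%x_" % ord(m.group(0)),
        key,
    )

def decode_name(name):
    """Turn each _<hex>_ group back into the character it stands for."""
    return re.sub(
        r"_([0-9A-Fa-f]+)_",
        lambda m: chr(int(m.group(1), 16)),
        name,
    )

for key in ["foo/bar", "24/7", "€3.14", "šæŋ"]:
    name = encode_key(key)
    print(key, name, decode_name(name))
```

Because the underscore itself is always escaped (as `_5f_`), every underscore in an encoded name is a delimiter, which is what makes the reverse substitution unambiguous.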


-- 
Better --help|less than helpless


* Re: Reference IDs in XML output
From: jcr @ 2021-03-13 16:12 UTC
  To: pandoc-discuss



I haven't given much thought to this, but why not reuse the algorithm that 
pandoc uses to generate IDs for headers? Whether or not you prefix ref-, I 
think it would be best to check for duplicate IDs for the sake of having 
robust output.
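
A rough Python sketch of that idea follows. It is loosely modelled on pandoc's documented rules for automatic header identifiers (lowercase, spaces become hyphens, characters other than alphanumerics, `_`, `-` and `.` are dropped, duplicates get `-1`, `-2`, ... appended), restricted to ASCII for simplicity. `slug` and `unique_ids` are hypothetical names, not pandoc functions.

```python
import re

def slug(key):
    """Header-style normalisation, ASCII-only for simplicity (assumption)."""
    text = key.lower()
    text = re.sub(r"\s+", "-", text)          # whitespace becomes hyphens
    return re.sub(r"[^a-z0-9_.-]", "", text)  # drop everything else

def unique_ids(keys):
    """Map each citation key to a 'ref-' id, numbering any duplicates."""
    seen = set()
    result = {}
    for key in keys:
        base = "ref-" + slug(key)
        candidate, n = base, 0
        while candidate in seen:
            n += 1
            candidate = "%s-%d" % (base, n)
        seen.add(candidate)
        result[key] = candidate
    return result

print(unique_ids(["Foo/Bar", "foobar"]))
# prints: {'Foo/Bar': 'ref-foobar', 'foobar': 'ref-foobar-1'}
```

The example shows why the duplicate check matters here: lowercasing and dropping characters make it easy for two distinct keys to normalise to the same id.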


* Re: Reference IDs in XML output
From: John MacFarlane @ 2021-03-14 21:18 UTC
  To: Albert Krewinkel, pandoc-discuss


If I recall correctly, we have a similar issue in the LaTeX writer,
since labels in LaTeX are limited in what they can contain.
We use this function to map an identifier to a label:

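-- Map an identifier to a string that is safe to use as a LaTeX label.
toLabel :: PandocMonad m => Text -> LW m Text
toLabel z = go `fmap` stringToLaTeX URLString z
 where
   go = T.concatMap $ \x -> case x of
     -- keep ASCII letters and digits as they are
     _ | (isLetter x || isDigit x) && isAscii x -> T.singleton x
     -- keep a small set of punctuation characters
       | x `elemText` "_-+=:;." -> T.singleton x
     -- replace anything else with "ux" followed by its hex code point
       | otherwise -> T.pack $ "ux" <> printf "%x" (ord x)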

Maybe something like this could work?
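
To see what such a mapping would do to citation keys, here is a small Python sketch of the character-level part of `toLabel`. It is an approximation under stated assumptions: the preliminary `stringToLaTeX URLString` step is left out, and `to_label` and `KEPT_PUNCTUATION` are names invented for this example.

```python
KEPT_PUNCTUATION = "_-+=:;."

def to_label(identifier):
    """Keep ASCII letters, digits and a few punctuation characters;
    replace every other character with 'ux' plus its hex code point."""
    out = []
    for ch in identifier:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch in KEPT_PUNCTUATION:
            out.append(ch)
        else:
            out.append("ux%x" % ord(ch))
    return "".join(out)

print(to_label("foo/bar"))
# prints: fooux2fbar
```

Two caveats if this were adapted for XML ids: the characters `+`, `=` and `;` are not allowed in XML names, so the kept set would have to shrink, and the mapping is not reversible, since a key that literally contains `ux2f` produces the same output as a key containing `/`.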

