* Re: Reference IDs in XML output
[not found] ` <877dmc9l7b.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
@ 2021-03-12 21:03 ` TRS-80
2021-03-13 9:41 ` BPJ
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: TRS-80 @ 2021-03-12 21:03 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
On 2021-03-12 15:22, Albert Krewinkel wrote:
> The more I think about this, the more questions I have and by now I'm
> overthinking it. Any help to get me back to the ground is appreciated.
Apologies for not having any specifics to add which might be helpful.
For whatever it's worth, I just wanted to empathize with your general
position, as I have found myself there many times. Or, as the kids
say nowadays "I know that feel, bro." :D
Happy Friday!
Cheers,
TRS-80
* Re: Reference IDs in XML output
[not found] ` <877dmc9l7b.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
2021-03-12 21:03 ` TRS-80
@ 2021-03-13 9:41 ` BPJ
2021-03-13 16:12 ` jcr
2021-03-14 21:18 ` John MacFarlane
3 siblings, 0 replies; 5+ messages in thread
From: BPJ @ 2021-03-13 9:41 UTC (permalink / raw)
To: pandoc-discuss
I recently ran into an analogous problem in a Perl program when I wanted to
use strings containing non-ASCII letters/characters as capture group names
in regular expressions (so that I could pull the name of the group that
matched and look it up in an associative array). I ended up replacing an
initial digit, and every character other than an ASCII alphanumeric, with
its Unicode code point number in hex between underscores, using this
substitution:
``````perl
my $name = $key =~ s{^([0-9])|([^A-Za-z0-9])}{sprintf '_%x_', ord $+}egr;
``````
and then this substitution to turn the name back into a key:
``````perl
my $key = $name =~ s{_([[:xdigit:]]+)_}{chr hex $1}egr;
``````
so underscores themselves become `_5f_`, `$` becomes `_24_`. Some (partly
silly) examples with roundtripping:
    foo/bar
    foo_2f_bar
    foo/bar

    24/7
    _32_4_2f_7
    24/7

    €3.14
    _20ac_3_2e_14
    €3.14

    šæŋ
    _161__e6__14b_
    šæŋ
It's ugly as hell but the conversion is simple and straightforward, the
results are predictable, reasonably compact and somewhat human readable --
if all else fails you can look up the code point numbers in a suitable
utility by hand even.
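For readers who do not use Perl, the same scheme can be sketched in Python. This is only an illustrative port of the two substitutions described in this message, not code from pandoc; the function names `key_to_name` and `name_to_key` are made up for the example.

```python
import re


def key_to_name(key: str) -> str:
    """Escape a leading digit and every character that is not an ASCII
    letter or digit as _<hex code point>_ (hypothetical helper name)."""
    return re.sub(
        r"^[0-9]|[^A-Za-z0-9]",
        lambda m: "_%x_" % ord(m.group(0)),
        key,
    )


def name_to_key(name: str) -> str:
    """Decode every _<hex>_ escape back into the character it stands for."""
    return re.sub(
        r"_([0-9A-Fa-f]+)_",
        lambda m: chr(int(m.group(1), 16)),
        name,
    )


if __name__ == "__main__":
    for key in ["foo/bar", "24/7", "€3.14", "šæŋ", "_", "$"]:
        name = key_to_name(key)
        # Print the key, its escaped form, and the decoded form side by side.
        print(key, name, name_to_key(name))
```

Because literal underscores are themselves escaped, every underscore in the encoded name belongs to an escape, which is what makes the decoding unambiguous. The encoded name contains only ASCII letters, digits and underscores and never starts with a digit.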
--
Better --help|less than helpless
On Fri, 12 Mar 2021 at 21:22, Albert Krewinkel <albert+pandoc-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
wrote:
> There is a small problem which I noticed lately: citation keys are used
> as part of the id of the respective reference item; e.g., if a citation
> has `@misc{foo, ...}` then the bibliography entry has id="ref-foo". This
> can be a problem when generating XML output, as the citation keys may
> contain characters which are not allowed in XML names. E.g., BibTeX
> allows slashes as part of the identifier, but those are illegal in an
> `id` attribute, leading to the generation of invalid XML documents. As
> far as I can see, this affects JATS, TEI, HTML4, and EPUB2. The HTML5
> standard is less restrictive, so EPUB3 is unaffected.
>
> I'd like to fix the problem, but am not sure where and how.
>
> - Where: in each affected writer, or in citeproc?
> - How: by removing the offending characters, or by using a different
>   scheme to generate reference identifiers? Numbering, hashing, …?
>   Do we check for duplicates, or can we assume that identifiers with
>   prefix "ref-" are reserved for pandoc?
>
> The more I think about this, the more questions I have and by now I'm
> overthinking it. Any help to get me back to the ground is appreciated.
>
>
> --
> Albert Krewinkel
> GPG: 8eed e3e2 e8c5 6f18 81fe e836 388d c0b2 1f63 1124
>
> --
> You received this message because you are subscribed to the Google Groups
> "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/pandoc-discuss/877dmc9l7b.fsf%40zeitkraut.de
> .
>
* Re: Reference IDs in XML output
[not found] ` <877dmc9l7b.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
2021-03-12 21:03 ` TRS-80
2021-03-13 9:41 ` BPJ
@ 2021-03-13 16:12 ` jcr
2021-03-14 21:18 ` John MacFarlane
3 siblings, 0 replies; 5+ messages in thread
From: jcr @ 2021-03-13 16:12 UTC (permalink / raw)
To: pandoc-discuss
I haven't given much thought to this, but why not reuse the algorithm that
pandoc uses to generate IDs for headers? Whether or not you prefix ref-, I
think it would be best to check for duplicate IDs for the sake of having
robust output.
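A rough sketch of that idea in Python, for illustration only. The `slugify` function is loosely modeled on how pandoc derives automatic header identifiers, restricted here to ASCII, and is not pandoc's actual implementation; the names `slugify` and `unique_ids` and the `"item"` fallback are assumptions made for the example.

```python
import re


def slugify(key: str) -> str:
    """Approximate a header-style identifier (ASCII only, hypothetical)."""
    s = key.lower()
    s = re.sub(r"\s+", "-", s)          # whitespace becomes hyphens
    s = re.sub(r"[^a-z0-9_.-]", "", s)  # drop every other disallowed character
    s = re.sub(r"^[^a-z]+", "", s)      # strip everything before the first letter
    return s or "item"                  # fallback when nothing is left


def unique_ids(keys, prefix="ref-"):
    """Map each citation key to a distinct ID, appending -1, -2, ...
    whenever a candidate ID has already been handed out."""
    used = set()
    result = {}
    for key in keys:
        base = prefix + slugify(key)
        candidate, n = base, 0
        while candidate in used:
            n += 1
            candidate = f"{base}-{n}"
        used.add(candidate)
        result[key] = candidate
    return result


if __name__ == "__main__":
    print(unique_ids(["foo/bar", "foobar", "foo"]))
```

Since this approach drops characters rather than escaping them, distinct keys such as `foo/bar` and `foobar` collapse to the same base ID, which is why the duplicate check matters here.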
* Re: Reference IDs in XML output
[not found] ` <877dmc9l7b.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
` (2 preceding siblings ...)
2021-03-13 16:12 ` jcr
@ 2021-03-14 21:18 ` John MacFarlane
3 siblings, 0 replies; 5+ messages in thread
From: John MacFarlane @ 2021-03-14 21:18 UTC (permalink / raw)
To: Albert Krewinkel, pandoc-discuss
If I recall, we have a similar issue in the LaTeX writer,
since labels in LaTeX are limited in what they can contain.
We use this function to map an identifier to a label:
    toLabel :: PandocMonad m => Text -> LW m Text
    toLabel z = go `fmap` stringToLaTeX URLString z
     where
       go = T.concatMap $ \x -> case x of
         _ | (isLetter x || isDigit x) && isAscii x -> T.singleton x
           | x `elemText` "_-+=:;." -> T.singleton x
           | otherwise -> T.pack $ "ux" <> printf "%x" (ord x)
Maybe something like this could work?
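To make the mapping easier to try out, here is a small Python sketch of the character-level part of `toLabel` (the `go` helper). It deliberately leaves out the `stringToLaTeX URLString` step, so it is an approximation rather than a port; the name `to_label` and the constant `SAFE_PUNCTUATION` are invented for the example.

```python
SAFE_PUNCTUATION = "_-+=:;."  # the punctuation that toLabel passes through


def to_label(ident: str) -> str:
    """Keep ASCII letters, digits and the safe punctuation; replace any
    other character with "ux" followed by its code point in hex."""
    out = []
    for ch in ident:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch in SAFE_PUNCTUATION:
            out.append(ch)
        else:
            out.append("ux%x" % ord(ch))
    return "".join(out)


if __name__ == "__main__":
    for ident in ["foo/bar", "a_b-c.d", "€3.14"]:
        print(ident, to_label(ident))
```

Two caveats if this were adapted for XML ids. The escape has no terminator and `u` and `x` are ordinary letters, so the mapping is not reversible and can collide (both `a/b` and a literal `aux2fb` map to `aux2fb`). The pass-through punctuation would presumably also need to shrink, since `+`, `=` and `;` are not XML name characters.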