Question regarding policy on allowed citation ID characters, specifically the definition of "alphanumerical"

public inbox archive for pandoc-discuss@googlegroups.com

* Question regarding policy on allowed citation ID characters, specifically the definition of "alphanumerical"
From: Hendrik Erz @ 2019-10-24 16:03 UTC
  To: pandoc-discuss

Dear all,

I've so far not come across a solution for an issue that's been haunting me 
for quite some time now. Different implementations of citeproc use 
different approaches to the problem, and I would like to hear whether or 
not there's clarity on that issue across the Pandoc community.

The Citeproc documentation
<https://pandoc.org/demo/example19/Extension-citations.html> on allowed
characters for citation IDs states the following:

> Each citation must have a key, composed of `@' + the citation identifier
> from the database, and may optionally have a prefix, a locator, and a
> suffix. The citation key must begin with a letter, digit, or _, and may
> contain alphanumerics, _, and internal punctuation characters
> (:.#$%&-+?<>~/).

Using the strict definition of alphanumerics, for me this translates into 
the following:

// The ID must have the following form:
// 1. Begin with an @.
// 2. Followed by one of a-zA-Z0-9_.
// 3. Optionally followed by any of a-zA-Z0-9_ and :.#$%&-+?<>~/

This means ASCII letters only. Depending on the implementation, however,
"alphanumerical" can also cover other letters, which would mean allowing
Unicode characters.
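
As a minimal sketch, the strict ASCII reading in the three steps above could be
written as a regular expression. This assumes Python's `re` module; the names
`ASCII_KEY` and `is_strict_ascii_key` are hypothetical, and the pattern is not
taken from pandoc or any citeproc implementation. It also ignores the
"internal" qualifier, so trailing punctuation is accepted.

```python
import re

# Assumption: ASCII-only reading of the documented rule, not pandoc's parser.
# "@", then one letter/digit/underscore, then any number of
# letters, digits, underscores, or the listed punctuation characters.
ASCII_KEY = re.compile(r"@[A-Za-z0-9_][A-Za-z0-9_:.#$%&\-+?<>~/]*")


def is_strict_ascii_key(text: str) -> bool:
    """Return True if the whole string is an @-key under the ASCII-only reading."""
    return ASCII_KEY.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["@doe2019", "@Doe:2019/a", "@m\u00fcller2019", "@.abc"]:
        print(candidate, is_strict_ascii_key(candidate))
```

Under this reading a key containing "ü" is rejected, which is exactly the
behaviour in question.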

Nevertheless, I am still stuck, because I am not sure exactly what should
be allowed and what should not:

   1. Pandoc Citeproc (Haskell) uses the built-in function isAlphaNum, which
   accepts not only a-z and letters with diacritics, but even non-western
   alphabets, such as Greek or Icelandic characters.
   2. citeproc-js (JavaScript) uses a hugely complicated regular expression
   that includes several Unicode ranges.
   3. Citr (a library I developed) currently adheres to the strict set of
   Latin characters only.
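
To illustrate the difference between the approaches listed above, here is a
rough sketch of a Unicode-aware check. It assumes Python's `str.isalnum()` as
a stand-in for Haskell's isAlphaNum; the two are similar but not identical
(they may disagree on combining marks, for example). The names `PUNCTUATION`
and `is_unicode_key` are hypothetical, and the key is checked without the
leading @.

```python
# Punctuation characters listed in the documentation quoted above.
PUNCTUATION = set(":.#$%&-+?<>~/")


def is_unicode_key(key: str) -> bool:
    """Check a citation key (without '@') using Unicode-aware classification.

    Assumption: str.isalnum() approximates Haskell's isAlphaNum.
    """
    if not key:
        return False
    first, rest = key[0], key[1:]
    # The first character must be a letter, digit, or underscore.
    if not (first.isalnum() or first == "_"):
        return False
    # Later characters may also be one of the listed punctuation characters.
    return all(ch.isalnum() or ch == "_" or ch in PUNCTUATION for ch in rest)


if __name__ == "__main__":
    # U+00FE (thorn), Greek alpha/beta/gamma, and a CJK character.
    for key in ["doe2019", "\u00feorn2001", "\u03b1\u03b2\u03b32015", "\u738b2019"]:
        print(key, is_unicode_key(key))
```

With this check, Icelandic, Greek, and Chinese keys pass, whereas an
ASCII-only validator rejects all three.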

Apparently, people use all kinds of characters in their citation keys.
Until now I have always read citation IDs as something that should bridge
the gap between machine-readable (as it's going to be replaced with a
nicely readable citation either way) and human-readable. But using more
than ASCII characters always runs some risk of breaking some algorithms,
especially in Python, if Unicode is not handled correctly.

So before I completely scrap my current validation function, I would like
to ask here whether a definitive definition of allowed characters already
exists, whether it is something you'd like to develop (which would make
different implementations much easier), or whether this shouldn't matter,
essentially giving way to hugely different interpretations of citation keys.

Currently, the handling of citation ID validation is inconsistent to some
extent.

Basically, my question boils down to the following: *What should citation
ID validation algorithms consider to be "alphanumeric", and what should be
excluded (apart from characters that LaTeX does not accept, obviously)?*

Thank you all in advance for reading and hopefully we can work something 
out!

Cheers,
Hendrik

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/9d39c347-6c2d-4b70-976d-4e77df9240ef%40googlegroups.com.


* Re: Question regarding policy on allowed citation ID characters, specifically the definition of "alphanumerical"
From: BPJ @ 2019-10-24 22:02 UTC
  To: pandoc-discuss

In what respect are Icelandic letters not Western?  You can hardly get
further west than Iceland in the Old World! And in what way is it not a good
thing that you can (theoretically at least) use non-*ASCII* characters in
citation ids?



* Re: Question regarding policy on allowed citation ID characters, specifically the definition of "alphanumerical"
From: John MacFarlane @ 2019-10-25  5:11 UTC
  To: Hendrik Erz, pandoc-discuss


I wouldn't have thought that 'alphanumeric' was limited to ASCII.
And it seems undesirable to force, e.g., Chinese writers to use
Latin characters in their citation keys.  I see no reason for
this.  Non-ASCII characters are perfectly machine-readable,
once you've pinned down an encoding.
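
A small sketch of the encoding point, assuming Python and UTF-8; the helper
names `roundtrip` and `misdecode` are hypothetical. A non-ASCII key survives
intact when the same encoding is used on both sides and is garbled only when
the encodings disagree.

```python
def roundtrip(key: str, encoding: str = "utf-8") -> str:
    """Encode a citation key to bytes and decode it with the same encoding."""
    return key.encode(encoding).decode(encoding)


def misdecode(key: str) -> str:
    """Encode as UTF-8 but decode as Latin-1, simulating an encoding mix-up."""
    return key.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    key = "\u738b2019"  # one CJK character (U+738B) followed by "2019"
    print(roundtrip(key) == key)   # same encoding on both sides
    print(misdecode(key) == key)   # mismatched encodings
```

Pure ASCII keys come out the same either way, which is why they are often
treated as "safe"; the problem lies in the mismatch, not in the characters.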

In any case, note that pandoc's rules for citations are in some
respects more restrictive than citeproc's, because of the need
for heuristics to identify where an author-in-text citation ends
and surrounding punctuation begins.


* Re: Question regarding policy on allowed citation ID characters, specifically the definition of "alphanumerical"
From: Hendrik Erz @ 2019-10-25  8:01 UTC
  To: pandoc-discuss


Of course alphanumeric isn't necessarily confined to ASCII, but after some
years of listening to encoding horror stories, I try to be rather
conservative about what is allowed and what is not when it comes to things
that need to be handled internally within programs.

You're absolutely right that it doesn't make sense to force writers of
non-Latin scripts to use ASCII; what I was always implicitly assuming was
that citation IDs should be "safe for everything", hence my hesitation to
allow more than ASCII letters in the IDs themselves.

Thinking back to what information I've gathered so far, and taking your
response into account, I would assume that citation IDs should basically
always be biased towards human-readability (e.g. Chinese characters instead
of Latin script), as we can safely assume that most files nowadays are
UTF-8 encoded, at least the ones that Pandoc should be concerned with. Is
this correct?

Additionally, as you write that pandoc's rules are somewhat more
restrictive, I would assume that citation IDs should only follow this rule:
*Allow any character except LaTeX control characters, whitespace, and
commas*, and narrow this further only where the context requires it?

(The reason why I'm inverting the rules to list what is *not* allowed
instead of what *is* allowed is that many JavaScript implementations don't
support Unicode property escapes such as *\p{Letter}* in regular
expressions, which were only added in ES2018; it would therefore be easier
to check for */[^\s,\\\{\}]/* than for a regular expression listing all
sorts of Unicode ranges.)
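
The inverted rule can be sketched as a deny-list check. This is a minimal
illustration in Python rather than JavaScript; the names `FORBIDDEN` and
`is_permissive_key` are hypothetical, and the forbidden set mirrors only the
characters in the regular expression above (whitespace, comma, backslash, and
braces), not every character that is special in LaTeX.

```python
import re

# Assumption: the only forbidden characters are those named in the message:
# whitespace, comma, backslash, and curly braces.
FORBIDDEN = re.compile(r"[\s,\\{}]")


def is_permissive_key(key: str) -> bool:
    """Accept any non-empty key that contains no forbidden character."""
    return bool(key) and FORBIDDEN.search(key) is None


if __name__ == "__main__":
    for key in ["doe2019", "\u738b2019", "doe 2019", "doe{2019}"]:
        print(repr(key), is_permissive_key(key))
```

A deny-list like this needs no Unicode tables, but it also accepts characters
that pandoc's own, more restrictive parsing may reject.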

Do you think this would make sense, to be as permissive as possible without
endangering compatibility with Pandoc?

Thank you already for the helpful response!

