pandoc/citeproc issues: recognizing citations

public inbox archive for pandoc-discuss@googlegroups.com

* pandoc/citeproc issues: recognizing citations
@ 2010-11-21 19:13 John MacFarlane
       [not found] ` <20101121191336.GA25657-nFAEphtLEs+AA6luYCgp0U1S2cYJDpTV9nwVQlTi/Pw@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: John MacFarlane @ 2010-11-21 19:13 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Suppose pandoc sees

    @foo

It cannot assume that this is a citation, since @foo could also
be a reference to an example list item, as in the following:

    (@foo)  my list item
    (@bar)  another list item

    The advantage (@foo) has over (@bar) is that...

(See http://johnmacfarlane.net/pandoc/README.html#numbered-examples.)

So currently, the parser determines whether '@foo' is a citation
by looking up 'foo' in the list of citation identifiers defined
in the bibliography file.  If 'foo' is found, '@foo' is treated
as a citation.  Otherwise, it is left as it is (and at the end
of the markdown reader, it will be transformed into a reference
to the appropriate list item).
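A rough sketch of the lookup described above, in Python rather than pandoc's own Haskell. The regular expression and the function name `classify_at_keys` are assumptions made for illustration; they are not pandoc's actual grammar or API.

```python
import re

# Hypothetical pattern for a bare '@key' token; pandoc's real grammar differs.
AT_KEY = re.compile(r"@([A-Za-z0-9_]+)")

def classify_at_keys(text, bibliography_keys):
    """Label each '@key' as a citation only if the key is in the bibliography."""
    result = []
    for match in AT_KEY.finditer(text):
        key = match.group(1)
        # Anything not found is left alone, to be resolved later
        # as a possible example list reference.
        kind = "citation" if key in bibliography_keys else "unresolved"
        result.append((key, kind))
    return result
```

Note that the result depends on the bibliography as well as on the document: the same text classified against `{"foo"}` and against an empty set gives different answers for `@foo`.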

There are a few problems with this:

(1)  Users could accidentally use a label for an example list
that corresponds to an item in the bibliography; this would cause
the parser to treat the label as a citation, unexpectedly. Worse,
the behavior could change if you added a new item to the bibliography
or used a different bibliography, without any change in the source
document.  It would seem much better if there were a separate,
unambiguous syntax for citations and example list labels.

(2)  Pandoc needs to read the whole bibliography before parsing.
This means that we can't have a "default bibliography" in ~/.pandoc
without slowing down *every* invocation of pandoc with a read + parse
of the bibliography.  Maybe this is okay -- I don't think we need
the default bibliography feature.

(3)  There's an awkward inconsistency in the way pandoc treats
textual citations and bracketed citations.  It checks textual citations
against the bibliography, but it doesn't do the same for bracketed ones.
Why not?  Because the parser for a bracket list of citations needs
to return a single Cite inline.  If we require the individual citations
to exist in the bibliography, then, a single missing citation will
cause the whole list to be parsed as regular text, rather than a
citation. Again, maybe this is okay -- but a few people have already
said that it seems weird to treat

    [@missing p. 3]

differently from

    @missing [p. 3] says...

Here are some possible solutions:

A.  I think the best solution, looking forward, would be to change
the syntax for numbered example lists, using ! instead of @.
Then there would be no possibility of conflict with citation keys,
and we wouldn't have to look up keys in the bibliography database
as we were parsing.  An example list would look like this:

    (!)  First example
    (!foo)  Second example, labeled 'foo'.
    (!bar)  Third example, labeled 'bar'.

    (!bar) follows from (!foo), because ...

This would solve all of the problems above.  However, it has a serious
drawback:  it would break existing documents, something I have tried
very hard to avoid doing in updates of pandoc.  I might consider it,
because numbered example lists have only been in pandoc for a little while,
and may not yet be in widespread use.

B.  An alternative would be to use a different symbol, say !, for
citations, reserving @ for the existing example lists.  Thus:

    !item1 [p. 99] says that blah [see also !item2 p. 33-34; !item3].

The problem is that @ is very natural for citations, and looks much
better in my view.  # would be another possibility:

    #item1 [p. 99] says that blah [see also #item2 p. 33-34; #item3].

However, # is much more likely to occur at the beginning of a word
in normal writing, so I think I'd avoid this.  ~, *, _, ^, $, <, > should be
avoided because they already have pandoc meanings.  & might work:

    &item1 [p. 99] says that blah [see also &item2 p. 33-34; &item3].

But it is less natural and tends to read as "and" -- also, you'd
capture things like "&c."

Other possibilities include + and =.

    +item1 [p. 99] says that blah [see also +item2 p. 33-34; +item3].

    =item1 [p. 99] says that blah [see also =item2 p. 33-34; =item3].

C.  Or, we could live with problems (1-2) and solve problem (3)
by checking ALL citations, not just textual citations, to make
sure they're in the bibliography.
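To make the consequence of option C concrete, here is a minimal Python sketch of an all-or-nothing check on a bracketed citation list. The function name `parse_bracketed`, the key pattern, and the tuple return values are hypothetical and stand in for whatever the real parser would produce.

```python
import re

# Hypothetical pattern for a citation key inside a bracketed list.
KEY = re.compile(r"@([A-Za-z0-9_]+)")

def parse_bracketed(content, bibliography_keys):
    """Parse the text between '[' and ']' as one citation, or fall back to text.

    Returns ('cite', keys) only if every ';'-separated item names a key
    found in the bibliography; otherwise returns ('text', original).
    """
    keys = []
    for part in content.split(";"):
        match = KEY.search(part)
        if match is None or match.group(1) not in bibliography_keys:
            # A single missing key turns the whole list into regular text.
            return ("text", "[" + content + "]")
        keys.append(match.group(1))
    return ("cite", keys)
```

With this rule, `[@missing p. 3]` and `@missing [p. 3]` would be treated alike, at the price that one unknown key defeats the whole list.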

Thoughts?

John


* Re: pandoc/citeproc issues: recognizing citations
       [not found] ` <20101121191336.GA25657-nFAEphtLEs+AA6luYCgp0U1S2cYJDpTV9nwVQlTi/Pw@public.gmane.org>
@ 2010-11-21 23:43   ` Tillmann Rendel
       [not found]     ` <4CE9AEBE.8090401-jNDFPZUTrfTbB13WlS47k8u21/r88PR+s0AfqQuZ5sE@public.gmane.org>
  2010-11-22 22:36   ` Andrea Rossato
  1 sibling, 1 reply; 8+ messages in thread
From: Tillmann Rendel @ 2010-11-21 23:43 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi,

John MacFarlane wrote:
> So currently, the parser determines whether '@foo' is a citation
> by looking up 'foo' in the list of citation identifiers defined
> in the bibliography file.  If 'foo' is found, '@foo' is treated
> as a citation.

I generally like this kind of behavior. It reminds me of variables in a 
programming language: When you use a name, it denotes whatever 
definition for that name is in scope. No need to use precious syntax to 
distinguish different types of denotees.

But maybe the scoping rules could be changed, so that example lists have 
priority over bibliographic references. It is usually better to have 
"more local" definitions take priority over "more global" definitions: 
In this case, the example lists in the same file are more local than the 
items in the bibliography file.

The benefit of this "local shadows global" policy is that the user, when 
searching for a name's denotation, can start searching locally, and stop 
searching when the first matching definition is found. So even in a very 
big system, most names can be understood without looking at the full system.

The same is, of course, true for the computer, so this could also solve 
the performance issue: If, after resolving example list entries, there 
are no unbound @item occurrences left, we are done and do not need to 
process the bibliography file. If there are unbound @item occurrences 
left, however, these are probably bibliographic references, so we need 
to read the bibliography anyway.
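A minimal Python sketch of this "local shadows global" idea, with the bibliography read only on demand. The patterns, the function name `resolve`, and the `load_bibliography_keys` callback are all assumptions for illustration, not anything pandoc provides.

```python
import re

# Hypothetical patterns; pandoc's real grammar for labels and keys is richer.
EXAMPLE_LABEL = re.compile(r"^\s*\(@([A-Za-z0-9_]+)\)\s", re.MULTILINE)
AT_KEY = re.compile(r"@([A-Za-z0-9_]+)")

def resolve(text, load_bibliography_keys):
    """Return (example labels, citation keys) for a document.

    Example list labels defined in the document take priority. The
    bibliography loader is called only if some '@key' remains unbound.
    """
    labels = set(EXAMPLE_LABEL.findall(text))
    unbound = set(AT_KEY.findall(text)) - labels
    if not unbound:
        # Every '@key' is a local example label: skip the bibliography.
        return labels, set()
    return labels, unbound & set(load_bibliography_keys())
```

A document that uses only example labels never triggers the loader, and a label such as `foo` stays an example reference even if the bibliography also happens to contain `foo`.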

   Tillmann


* Re: pandoc/citeproc issues: recognizing citations
       [not found]     ` <4CE9AEBE.8090401-jNDFPZUTrfTbB13WlS47k8u21/r88PR+s0AfqQuZ5sE@public.gmane.org>
@ 2010-11-22  1:51       ` dsanson
       [not found]         ` <d172f8a7-6e23-43dd-8c08-07597348d514-0AuElzF1THGuSQfxCNEuIGB/v6IoIuQBVpNB7YpNyf8@public.gmane.org>
  2010-11-22  2:23       ` John MacFarlane
  1 sibling, 1 reply; 8+ messages in thread
From: dsanson @ 2010-11-22  1:51 UTC (permalink / raw)
  To: pandoc-discuss

I remember discussion of running numbered examples, but didn't realize
it had been implemented. So count me in as one who would not be
impacted by a change to that syntax. What you call the best solution
looking forward (option A) really does seem like the best solution,
though I understand the cost of breaking existing documents.

About default bibliography files: (1) if you decide not to process
citations by default, a default bibliography file would still be
useful. `pandoc --biblio` is easier to type than
`pandoc --biblio=/Users/david/bibfile.bib`; (2) if you decide to process
citations by default (given a default bibliography file), it would remain
useful to have some option for turning this off, when, for example, using
`pandoc -f markdown -t markdown` to tidy up a file.


* Re: pandoc/citeproc issues: recognizing citations
       [not found]     ` <4CE9AEBE.8090401-jNDFPZUTrfTbB13WlS47k8u21/r88PR+s0AfqQuZ5sE@public.gmane.org>
  2010-11-22  1:51       ` dsanson
@ 2010-11-22  2:23       ` John MacFarlane
  1 sibling, 0 replies; 8+ messages in thread
From: John MacFarlane @ 2010-11-22  2:23 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

+++ Tillmann Rendel [Nov 22 10 00:43 ]:
> Hi,
> 
> John MacFarlane wrote:
> >So currently, the parser determines whether '@foo' is a citation
> >by looking up 'foo' in the list of citation identifiers defined
> >in the bibliography file.  If 'foo' is found, '@foo' is treated
> >as a citation.
> 
> I generally like this kind of behavior. It reminds me of variables
> in a programming language: When you use a name, it denotes whatever
> definition for that name is in scope. No need to use precious syntax
> to distinguish different types of denotees.
> 
> But maybe the scoping rules could be changed, so that example lists
> have priority over bibliographic references. It is usually better to
> have "more local" definitions take priority over "more global"
> definitions: In this case, the example lists in the same file are
> more local than the items in the bibliography file.
> 
> The benefit of this "local shadows global" policy is that the user,
> when searching for a name's denotation, can start searching locally,
> and stop searching when the first matching definition is found. So
> even in a very big system, most names can be understood without
> looking at the full system.
> 
> The same is, of course, true for the computer, so this could also
> solve the performance issue: If, after resolving example list
> entries, there are no unbound @item occurrences left, we are done and
> do not need to process the bibliography file. If there are example
> list entries left, however, these are probably bibliographic
> references, so we need to read the bibliography anyway.

Unfortunately, this would be difficult, given how these things
are currently dealt with.  Currently pandoc does not parse
the example list references at all, but waits till parsing
is completed, then uses some generics magic to walk the AST
and find potential example list references, turning them
into the relevant numbers if they match defined labels.

So, when pandoc is parsing a potential citation, it does not
yet know whether it is an example list label.

Why this way of dealing with example lists?  A reference to an
example can occur either before or after the example.  So, when
we parse '@foo', we don't know yet whether we'll encounter an
example list label '@foo'.  This is only known once parsing
is complete.

I suppose the following would be possible:  parse everything
as a Cite inline at first, without checking for the entries in
the bibliography file.  While parsing, construct a list of example
list labels.  Then walk the tree looking for citations, and
change the ones that should be example list references to numerals.
This ought to work.
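In outline, the second pass might look like the following Python sketch. The node representation (tuples such as `("Cite", key)` and `("Emph", [children])`) is a simplified stand-in for pandoc's AST, and `resolve_examples` is a hypothetical name.

```python
def resolve_examples(nodes, numbers):
    """Walk a simplified inline tree and rewrite example references.

    nodes   -- list of (kind, value) pairs; value is a string, or a list
               of child nodes for container kinds such as 'Emph'
    numbers -- dict mapping example list labels to their numbers,
               collected while parsing
    """
    out = []
    for kind, value in nodes:
        if kind == "Cite" and value in numbers:
            # This 'citation' is really an example reference: use its numeral.
            out.append(("Str", str(numbers[value])))
        elif isinstance(value, list):
            out.append((kind, resolve_examples(value, numbers)))
        else:
            out.append((kind, value))
    return out
```

Anything still tagged `Cite` after this pass would be handed on as a genuine citation, with no bibliography lookup needed during parsing.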

John


* Re: pandoc/citeproc issues: recognizing citations
       [not found]         ` <d172f8a7-6e23-43dd-8c08-07597348d514-0AuElzF1THGuSQfxCNEuIGB/v6IoIuQBVpNB7YpNyf8@public.gmane.org>
@ 2010-11-22 15:44           ` Nathan Gass
       [not found]             ` <4CEA8FD4.1060407-8UOIJiGH10pyDzI6CaY1VQ@public.gmane.org>
  2010-11-22 16:10           ` Nathan Gass
  1 sibling, 1 reply; 8+ messages in thread
From: Nathan Gass @ 2010-11-22 15:44 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 22.11.10 02:51, dsanson wrote:
> I remember discussion of running numbered examples, but didn't realize
> it had been implemented. So count me in as one who would not be
> impacted by a change to that syntax. The best solution looking forward
> really does seem like the best solution, though I understand the cost
> of breaking existing documents.
>
> About default bibliography files: (1) if you decide not to process
> citations by default, a default bibliography file would still be
> useful. `pandoc --biblio` is easier to type than
> `pandoc --biblio=/Users/david/bibfile.bib`;

I'd like to have this info in the pandoc file itself. IMHO we should 
avoid situations where a document has to be processed with a specific 
pandoc incantation to get sensible output, and requiring the right 
bibliography to be given on the command line is a step in this direction. 
This of course needs some way to write additional meta info in the document.

> (2) if you decide to process citations by
> default (given a default bibliography file), it would remain useful to
> have some option for turning this off, when, for example, using
> `pandoc -f markdown -t markdown` to tidy up a file.
>

In my branch I adapted the markdown writer to output citation commands. 
This is also useful for `pandoc -r latex -w markdown`. There is still 
the plain writer to get rendered citations in plain text files. I think 
this is reasonable and hope it gets adopted in mainline pandoc.

So you would use `pandoc -r markdown -t markdown` to tidy up your 
markdown file and `pandoc -r markdown -w plain` to get rendered citations.

Nathan


* Re: pandoc/citeproc issues: recognizing citations
       [not found]         ` <d172f8a7-6e23-43dd-8c08-07597348d514-0AuElzF1THGuSQfxCNEuIGB/v6IoIuQBVpNB7YpNyf8@public.gmane.org>
  2010-11-22 15:44           ` Nathan Gass
@ 2010-11-22 16:10           ` Nathan Gass
  1 sibling, 0 replies; 8+ messages in thread
From: Nathan Gass @ 2010-11-22 16:10 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 22.11.10 02:51, dsanson wrote:
> I remember discussion of running numbered examples, but didn't realize
> it had been implemented. So count me in as one who would not be
> impacted by a change to that syntax. The best solution looking forward
> really does seem like the best solution, though I understand the cost
> of breaking existing documents.
>
> About default bibliography files: (1) if you decide not to process
> citations by default, a default bibliography file would still be
> useful. `pandoc --biblio` is easier to type than
> `pandoc --biblio=/Users/david/bibfile.bib`; (2) if you decide to process
> citations by default (given a default bibliography file), it would remain
> useful to have some option for turning this off, when, for example, using
> `pandoc -f markdown -t markdown` to tidy up a file.
>

In my branch I've implemented a markdown writer that writes citations 
out as citation commands. I hope this will be the default for markdown, 
as you can still use the plain writer to get a rendered text file. This 
is necessary for commands like `pandoc -r latex -w markdown` to work as 
intended.

My branch is currently a bit behind, but I'm working on the merge.

Nathan




* Re: pandoc/citeproc issues: recognizing citations
       [not found]             ` <4CEA8FD4.1060407-8UOIJiGH10pyDzI6CaY1VQ@public.gmane.org>
@ 2010-11-22 16:33               ` John MacFarlane
  0 siblings, 0 replies; 8+ messages in thread
From: John MacFarlane @ 2010-11-22 16:33 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

+++ Nathan Gass [Nov 22 10 16:44 ]:
> On 22.11.10 02:51, dsanson wrote:
> >I remember discussion of running numbered examples, but didn't realize
> >it had been implemented. So count me in as one who would not be
> >impacted by a change to that syntax. The best solution looking forward
> >really does seem like the best solution, though I understand the cost
> >of breaking existing documents.
> >
> >About default bibliography files: (1) if you decide not to process
> >citations by default, a default bibliography file would still be
> >useful. `pandoc --biblio` is easier to type than
> >`pandoc --biblio=/Users/david/bibfile.bib`;
> 
> I'd like to have this info in the pandoc file itself. We should imho
> avoid situations, where a document has to be used with a specific
> pandoc incantation to get sensible output, and requiring the right
> bibliography given by commandline is a step in this direction. This
> of course needs some way to write additional meta info in the
> document.

One natural way to do this, if we had a <references> tag to indicate where the
bibliography goes, would be to specify the file there:

<references source="mybiblio.bib" />

> >(2) if you decide to process citations by
> >default (given a default bibliography file), it would remain useful to
> >have some option for turning this off, when, for example, using
> >`pandoc -f markdown -t markdown` to tidy up a file.
> >
> 
> In my branch I adapted the markdown writer to output citation
> commands. This is also useful for `pandoc -r latex -w markdown`.
> There is still the plain writer to get rendered citations in plain
> text files. I think this is reasonable and hope it gets adopted in
> mainline pandoc.
> 
> So you would use `pandoc -r markdown -t markdown` to tidy up your
> markdown file and `pandoc -r markdown -w plain` to get rendered
> citations.

I agree that it's good to have a way to do this, but I think people
may also want to generate markdown documents without citations.
(Generally, I have tried to keep the output of the markdown writer
compatible with standard markdown, though there are exceptions.)

It seems to me that the best approach would be to have a flag
--raw-citations or something like that.

John



* Re: pandoc/citeproc issues: recognizing citations
       [not found] ` <20101121191336.GA25657-nFAEphtLEs+AA6luYCgp0U1S2cYJDpTV9nwVQlTi/Pw@public.gmane.org>
  2010-11-21 23:43   ` Tillmann Rendel
@ 2010-11-22 22:36   ` Andrea Rossato
  1 sibling, 0 replies; 8+ messages in thread
From: Andrea Rossato @ 2010-11-22 22:36 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Sun, Nov 21, 2010 at 11:13:36AM -0800, John MacFarlane wrote:
> Suppose pandoc sees
> 
>     @foo
> 
> It cannot assume that this is a citation, since @foo could also
> be a reference to an example list item, as in the following:
> 
>     (@foo)  my list item
>     (@bar)  another list item
> 
>     The advantage (@foo) has over (@bar) is that...
> 
> (See http://johnmacfarlane.net/pandoc/README.html#numbered-examples.)
> 
> So currently, the parser determines whether '@foo' is a citation
> by looking up 'foo' in the list of citation identifiers defined
> in the bibliography file.  If 'foo' is found, '@foo' is treated
> as a citation.  Otherwise, it is left as it is (and at the end
> of the markdown reader, it will be transformed into a reference
> to the appropriate list item).
> 
> There are a few problems with this:
> 
> (1)  Users could accidentally use a label for an example list
> that corresponds to an item in the bibliography; this would cause
> the parser to treat the label as a citation, unexpectedly. Worse,
> the behavior could change if you added a new item to the bibliography
> or used a different bibliography, without any change in the source
> document.  It would seem much better if there were a separate,
> unambiguous syntax for citations and example list labels.
> 
> (2)  Pandoc needs to read the whole bibliography before parsing.
> This means that we can't have a "default bibliography" in ~/.pandoc
> without slowing down *every* invocation of pandoc with a read + parse
> of the bibliography.  Maybe this is okay -- I don't think we need
> the default bibliography feature.

I don't think we need it either. It might be nice to have a syntax
for including a bibliographic database in the document source, but
parsing a file every time pandoc is called should be avoided.

> (3)  There's an awkward inconsistency in the way pandoc treats
> textual citations and bracketed citations.  It checks textual citations
> against the bibliography, but it doesn't do the same for bracketed ones.
> Why not?  Because the parser for a bracket list of citations needs
> to return a single Cite inline.  If we require the individual citations
> to exist in the bibliography, then, a single missing citation will
> cause the whole list to be parsed as regular text, rather than a
> citation. Again, maybe this is okay -- but a few people have already
> said that it seems weird to treat
> 
>     [@missing p. 3]
> 
> differently from
> 
>     @missing [p. 3] says...
> 
> Here are some possible solutions:
> 
> A.  I think the best solution, looking forward, would be to change
> the syntax for numbered example lists, using ! instead of @.
> Then there would be no possibility of conflict with citation keys,
> and we wouldn't have to look up keys in the bibliography database
> as we were parsing.  An example list would look like this:
> 
>     (!)  First example
>     (!foo)  Second example, labeled 'foo'.
>     (!bar)  Third example, labeled 'bar'.
> 
>     (!bar) follows from (!foo), because ...
> 
> This would solve all of the problems above.  However, it has a serious
> drawback:  it would break existing documents, something I have tried
> very hard to avoid doing in updates of pandoc.  I might consider it,
> because numbered example lists have only been in pandoc for a little while,
> and may not yet be in widespread use.
> 
> B.  An alternative would be to use a different symbol, say !, for
> citations, reserving @ for the existing example lists.  Thus:
> 
>     !item1 [p. 99] says that blah [see also !item2 p. 33-34; !item3].
> 
> The problem is that @ is very natural for citations, and looks much
> better in my view.  # would be another possibility:
> 
>     #item1 [p. 99] says that blah [see also #item2 p. 33-34; #item3].
> 
> However, # is much more likely to occur at the beginning of a word
> in normal writing, so I think I'd avoid this.  ~, *, _, ^, $, <, > should be
> avoided because they already have pandoc meanings.  & might work:
> 
>     &item1 [p. 99] says that blah [see also &item2 p. 33-34; &item3].
> 
> But it is less natural and tends to read as "and" -- also, you'd
> capture things like "&c."
> 
> Other possibilities include + and =.
> 
>     +item1 [p. 99] says that blah [see also +item2 p. 33-34; +item3].
> 
>     =item1 [p. 99] says that blah [see also =item2 p. 33-34; =item3].
> 
> C.  Or, we could live with problems (1-2) and solve problem (3)
> by checking ALL citations, not just textual citations, to make
> sure they're in the bibliography.
> 
> Thoughts?

I think C would be fine. If the parser fails on a complex list of
citations, that could be a problem, but I see a failure as a hint that
something is wrong, which is always useful information.

Still I would also add:

D. We could drop textual citations and use [-@item] to get the same
result. The hyphen should be connected to the '[' (your original
grammar if I remember correctly).

I would place A third. I don't like B: I think '@' is perfect for
citations.

Andrea



