How to extract all citation keys from a document

public inbox archive for pandoc-discuss@googlegroups.com

* How to extract all citation keys from a document
@ 2013-06-04 18:34 Makaken Affe
       [not found] ` <06b38ec2-b028-48d8-89cf-50b3151158d8-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Makaken Affe @ 2013-06-04 18:34 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


I'd like to extract all citations from a Pandoc markdown document so that 
I can retrieve the corresponding BibTeX entries from reference management 
software and see which citation keys are unknown. I expected one of these 
to work:

    echo 'See [@Foo]' | pandoc -t json
    echo 'See [@Foo]' | pandoc -t json --bibliography=empty.bib

Unfortunately citations are only parsed when their citation key is defined 
in the bibliography. How can I extract all citation keys from the document?

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/06b38ec2-b028-48d8-89cf-50b3151158d8%40googlegroups.com?hl=en-US.
For more options, visit https://groups.google.com/groups/opt_out.





* Re: How to extract all citation keys from a document
       [not found] ` <06b38ec2-b028-48d8-89cf-50b3151158d8-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2013-06-04 19:01   ` John MacFarlane
       [not found]     ` <20130604190129.GC5256-nFAEphtLEs+AA6luYCgp0U1S2cYJDpTV9nwVQlTi/Pw@public.gmane.org>
  2013-06-04 19:39   ` Joseph Reagle
  1 sibling, 1 reply; 6+ messages in thread
From: John MacFarlane @ 2013-06-04 19:01 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Sorry, there's currently no way to do this.  '@foo' is ambiguous in
pandoc markdown.  It could be a reference to an example list, or a plain
string '@foo', or a citation.  Pandoc uses the actual bibliography file
to help decide.  For example, in:

```
(@foo) hello

In example @foo, I show...
```

The second @foo will be interpreted as a citation if 'foo' is a defined
key in the specified bibliography file; otherwise, it will be
interpreted as a reference to the example list item marked (@foo).
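
A minimal sketch of that decision rule in Python, for illustration only.
It restates the rule as described above and is not pandoc's actual parser;
the function name `classify_at_key` and the fallback to plain text are
assumptions.

```python
def classify_at_key(key, bib_keys, example_labels):
    """Toy restatement of the rule above; not pandoc's parser."""
    if key in bib_keys:
        # The key is defined in the bibliography: treat it as a citation.
        return "citation"
    if key in example_labels:
        # The key matches an example list label such as (@foo).
        return "example reference"
    # Assumption: anything else is left as the literal string '@key'.
    return "plain text"


# 'foo' labels an example list item; the bibliography decides the rest.
print(classify_at_key("foo", {"foo"}, {"foo"}))
print(classify_at_key("foo", set(), {"foo"}))
```

The point of the sketch is that the same text yields different results
depending on the bibliography, which is why the keys cannot be extracted
reliably without one.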

+++ Makaken Affe [Jun 04 13 11:34 ]:
>    I'd like to extract all citations from a Pandoc markdown document to
>    retrieve the corresponding BibTeX entries file from a reference
>    management software and to see which citation keys are unknown. I
>    expected one of this to work:
>        echo 'See [@Foo]' | pandoc -t json
>        echo 'See [@Foo]' | pandoc -t json --bibliography=empty.bib
>    Unfortunately citations are only parsed when their citation key is
>    defined in the bibliography. How can I extract all citation keys from
>    the document?



* Re: How to extract all citation keys from a document
       [not found]     ` <20130604190129.GC5256-nFAEphtLEs+AA6luYCgp0U1S2cYJDpTV9nwVQlTi/Pw@public.gmane.org>
@ 2013-06-04 19:36       ` Erik Hetzner
  2013-06-05 19:12       ` Makaken Affe
  1 sibling, 0 replies; 6+ messages in thread
From: Erik Hetzner @ 2013-06-04 19:36 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

At Tue, 4 Jun 2013 12:01:29 -0700,
John MacFarlane wrote:
> 
> Sorry, there's currently no way to do this.  '@foo' is ambiguous in
> pandoc markdown.  It could be a reference to an example list, or a plain
> string '@foo', or a citation.  Pandoc uses the actual bibliography file
> to help decide.  For example, in:
> 
> ```
> (@foo) hello
> 
> In example @foo, I show...
> ```
> 
> The second @foo will be interpreted as a citation if 'foo' is a defined
> key in the specified bibliography file; otherwise, it will be
> interpreted as a reference to the example list item marked (@foo).

If you don’t care about this problem, and you just want a hack that
works, you can easily scan the document with a script. See, e.g.,

  https://bitbucket.org/egh/zotxt/src/tip/scripts/extractcites.py

which builds a JSON file from @AuthorTitleDate keys via Zotero.

best, Erik






* Re: How to extract all citation keys from a document
       [not found] ` <06b38ec2-b028-48d8-89cf-50b3151158d8-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2013-06-04 19:01   ` John MacFarlane
@ 2013-06-04 19:39   ` Joseph Reagle
  1 sibling, 0 replies; 6+ messages in thread
From: Joseph Reagle @ 2013-06-04 19:39 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: Makaken Affe

On 06/04/2013 02:34 PM, Makaken Affe wrote:
> Unfortunately citations are only parsed when their citation key is
> defined in the bibliography. How can I extract all citation keys from
> the document?

Because my large bib file slows down citeproc-hs, my wrapper uses a regex 
to find the keys and generates a smaller bib file from them. See the 
function `getKeysFromMD()` [1].

[1]: https://github.com/reagle/pandoc-wrappers/blob/master/md2bib.py#L79
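
The general idea can be sketched as follows. This is not the code in
md2bib.py; the names `keys_from_markdown` and `subset_bib` are
hypothetical, the key pattern is an assumption about what keys look like,
and the BibTeX handling is deliberately naive (it assumes every entry
starts at the beginning of a line and ignores @string and @comment blocks).

```python
import re

# Assumed key shape: word characters with optional internal ':', '.' or '-'.
KEY_RE = re.compile(r"(?<!\w)@([A-Za-z0-9_]+(?:[:.-][A-Za-z0-9_]+)*)")
# Start of a BibTeX entry such as "@book{Foo2010,".
ENTRY_RE = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)


def keys_from_markdown(text):
    """Return the set of candidate citation keys found in the text."""
    return set(KEY_RE.findall(text))


def subset_bib(bib_text, wanted):
    """Keep only the entries of bib_text whose key is in wanted."""
    starts = list(ENTRY_RE.finditer(bib_text))
    kept = []
    for i, match in enumerate(starts):
        # An entry runs from its header to the start of the next entry.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bib_text)
        if match.group(2) in wanted:
            kept.append(bib_text[match.start():end].strip())
    return "\n\n".join(kept) + ("\n" if kept else "")


markdown = "See [@Foo2010; @Bar99] and @Foo2010."
bib = (
    "@book{Foo2010,\n  title={A}\n}\n\n"
    "@article{Baz2001,\n  title={B}\n}\n\n"
    "@misc{Bar99,\n  title={C}\n}\n"
)
print(subset_bib(bib, keys_from_markdown(markdown)))
```

Keys that appear in the document but not in the large bib file are simply
absent from the output.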



* Re: How to extract all citation keys from a document
       [not found]     ` <20130604190129.GC5256-nFAEphtLEs+AA6luYCgp0U1S2cYJDpTV9nwVQlTi/Pw@public.gmane.org>
  2013-06-04 19:36       ` Erik Hetzner
@ 2013-06-05 19:12       ` Makaken Affe
       [not found]         ` <aa6ac405-ac89-47e5-a8e8-896688d3325b-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: Makaken Affe @ 2013-06-05 19:12 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


Thanks for the answer and suggestions. As my citation ids have a known 
format, it's just a simple grep at the command line:

    egrep -o '@[A-Za-z]+[0-9]+[a-z]*' paper.md

fiddlosopher wrote:

> Sorry, there's currently no way to do this.  '@foo' is ambiguous in 
> pandoc markdown.  It could be a reference to an example list, or a plain 
> string '@foo', or a citation.

I just want to extract potential citation ids, so some false positives are 
OK. However, it would be nice to at least exclude obvious cases, such as 
@foo inside code blocks. Anyway, a workflow is also possible with the 
simple regexp hack: 

#. extract possible citation ids
#. remove blacklisted ids
#. get references and report which ids have not been found
#. manually adjust blacklist (false positives)
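
The first three steps above can be sketched in Python. The function name
`check_citations` and the in-memory inputs are assumptions made for the
example; in practice the known keys would come from the reference manager,
and the blacklist (step 4) would be maintained by hand.

```python
import re

# Same pattern as the egrep call above.
CANDIDATE_RE = re.compile(r"@[A-Za-z]+[0-9]+[a-z]*")


def check_citations(markdown, blacklist, known_keys):
    """Return (found, missing) lists of citation ids, without the '@'."""
    # Step 1: extract possible citation ids.
    candidates = {hit[1:] for hit in CANDIDATE_RE.findall(markdown)}
    # Step 2: remove blacklisted ids (known false positives).
    candidates -= set(blacklist)
    # Step 3: split into ids that were found and ids that were not.
    found = sorted(candidates & set(known_keys))
    missing = sorted(candidates - set(known_keys))
    return found, missing


text = "As shown by @Smith2010a and [@Jones1999], mail me at me@host42."
found, missing = check_citations(text, {"host42"}, {"Smith2010a"})
print("found:", found)
print("missing:", missing)
```

Here `host42` is a false positive from an email address, so it goes on the
blacklist rather than being reported as missing.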

Cheers!






* Re: How to extract all citation keys from a document
       [not found]         ` <aa6ac405-ac89-47e5-a8e8-896688d3325b-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2013-06-06  9:52           ` BP Jonsson
  0 siblings, 0 replies; 6+ messages in thread
From: BP Jonsson @ 2013-06-06  9:52 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

2013-06-05 21:12, Makaken Affe wrote:
> Thanks for answer and suggestions. As my citation ids have known format,
> its just a simple grep at the command line:
>
>      egrep -o '@[A-Za-z]+[0-9]+[a-z]*' paper.md
>
> fiddlosopher wrote:
>
>> Sorry, there's currently no way to do this.  '@foo' is ambiguous in
>> pandoc markdown.  It could be a reference to an example list, or a plain
>> string '@foo', or a citation.
>
> I just want to extract potential citation ids, so some false-positives are
> ok. However it would be nice at least to exclude obvious cases, such as
> @foo inside code blocks. Anyway, a workflow is also possible with the
> simple regexp hack:
>
> #. extract possible citation ids
> #. remove blacklisted ids
> #. get references and report which ids have not been found
> #. manually adjust blacklist (false positives)
>
> Cheers!
>

Try the following Perl script.
The trick is to look for what you want to skip over (code)
as well as for what you want to collect (cite keys), but
to print only the latter.  It's pretty easy with code
spans and fenced code blocks, but you have to either avoid
indented code blocks or convert them to fenced ones, because
other things are indented in markdown too!

/bpj

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open qw(:utf8 :std);

my $regex = qr{
     (?: # Capture and skip code spans/fenced code blocks
         (`+) .*? \1                # Inline span or GitHub-style block
     |   ^ \s* (~~~+) .*? ^ \s* \2  # Pandoc-style block
     )
     # Capture and print actual hits
     |   (\@[A-Za-z]+[0-9]+[a-z]*)
}msx;

my $text = do { local $/; <> }; # Slurp text into memory...

my %seen;

MATCH:
while ( $text =~ /$regex/g ) {      # DON'T forget /g !
     next MATCH unless defined $3;   # Skip code
     next MATCH if $seen{$3}++;      # Skip duplicates
     print $3, "\n";                 # Print match
}

__END__
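
For comparison, here is the same skip-or-collect idea sketched in Python.
It is a rough port and not a tested replacement for the Perl script; the
function name `citation_keys` is hypothetical. This version treats any run
of backticks as a code delimiter, so inline code spans are skipped as well
as fenced blocks, and like the Perl script it does not handle indented
code blocks.

```python
import re

PATTERN = re.compile(
    r"""
    (?: (`+) .*? \1                  # code span or backtick fence: skip
      | ^ \s* (~~~+) .*? ^ \s* \2    # tilde-fenced block: skip
    )
    | (@[A-Za-z]+[0-9]+[a-z]*)       # candidate cite key: collect
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)


def citation_keys(text):
    """Return candidate cite keys outside code, in order, without duplicates."""
    seen = []
    for match in PATTERN.finditer(text):
        key = match.group(3)
        if key is None:
            continue  # the match was code, not a cite key
        if key not in seen:
            seen.append(key)
    return seen


sample = (
    "See @Foo2010 and `@Code1` here.\n\n"
    "~~~\n@Hidden2\n~~~\n\n"
    "```\n@Fenced3\n```\n\n"
    "Also @Bar99x and @Foo2010.\n"
)
print("\n".join(citation_keys(sample)))
```

Because code is matched first and consumed whole, keys inside it never get
a chance to match the third alternative.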

