public inbox archive for pandoc-discuss@googlegroups.com
* tip: use aux2bib to get around slow bibutils parsing
@ 2011-11-10 13:38 Joseph Reagle
       [not found] ` <201111100838.06496.joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Joseph Reagle @ 2011-11-10 13:38 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: andrea rossato



I have a Python script that helps me build documents, and pandoc is of course central to it. Now that I'm using the [@Smith1999] citation syntax, there is *massive* overhead in citeproc-hs/bibutils bibtex parsing (of large files), even with the '--biblatex' option, where I'm only asking pandoc to replace markdown citation syntax with biblatex syntax. Because '--biblatex' still parses the bibtex files (to determine whether there is a corresponding key and hence translate it, or just let it be), a build which ordinarily takes 5s takes about 14s!

Here are three suggestions:
1. hope the bibutils developers speed up their parser.
2. do not require parsing of bibtex files for '--biblatex'.
3. use aux2bib to create smaller bib files.

For example:

~~~~
import codecs
import os
from subprocess import call

# BIB_FILE (path to the full bibliography) and options (parsed command-line
# flags) are module-level globals defined elsewhere in the script.

def pandoc_call(mdn_tmp_file, tex_tmp_file, build_file_base):
    """
    Call pandoc on tweaked markdown files.
    """

    bib_file = BIB_FILE
    if not options.fast:
        pass  # .... (elided in the original message)
    else:  # fast!
        aux_file = build_file_base + '.aux'
        if os.path.exists(aux_file):
            # bibutils (via pandoc/citeproc-hs) is slow on large bib files,
            # so use a smaller bib generated from the existing aux file
            print(" * calling aux2bib on %s" % aux_file)
            bib_file = build_file_base + '.bib'
            with open(bib_file, 'w') as bib_out:
                call(['aux2bib', aux_file], stdout=bib_out)

    pandoc_opts = ['-t', 'latex', '--biblatex', '--bibliography=%s' % bib_file,
                   '--no-wrap', '--tab-stop', '8']
    pandoc_cmd = ['pandoc', mdn_tmp_file]
    pandoc_cmd.extend(pandoc_opts)
    print("** pandoc cmd = '%s'" % ' '.join(pandoc_cmd))
    with codecs.open(tex_tmp_file, 'w', 'utf-8') as tex_out:
        call(pandoc_cmd, stdout=tex_out)
~~~~

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to pandoc-discuss+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/pandoc-discuss?hl=en.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
       [not found] ` <201111100838.06496.joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2011-11-21 18:41   ` David Sanson
  2011-11-21 19:27     ` Joseph Reagle
  2011-12-21 18:24     ` Joseph Reagle
  0 siblings, 2 replies; 16+ messages in thread
From: David Sanson @ 2011-11-21 18:41 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
  Cc: andrea rossato, joseph.2011-T1oY19WcHSwdnm+yROfE0A


It is frustrating how slow the parsing is on large bibliography files. In 
my testing, it doesn't much matter what format the file is in: a mods file 
parses just as slowly as a bibtex file, for example. Maybe parsing citeproc 
json is faster? I don't know how to generate bibliographies in that format.

Instead of using aux2bib, another strategy for speeding things up is to get 
the citekeys directly from the document and then extract the smaller bibtex 
file using the excellent and speedy [BibTool] utility. I threw together a 
ruby script, [extract_bib.rb] that does this. Testing on a paper with 30 
references and a bibtex database with 1149 items, on my MacBook,

   $ time pandoc -t markdown -sS junk.markdown

   real    0m0.276s
   user    0m0.183s 
   sys     0m0.019s

while

    $ time pandoc -t markdown --bibliography ~/.pandoc/default.bib -sS junk.markdown

    real    0m3.628s
    user    0m3.164s
    sys     0m0.350s

Yikes! If instead we first

    $ time extract_bib.rb junk.markdown > test.bib

    real    0m0.360s
    user    0m0.342s
    sys     0m0.015s

and then

    $ time pandoc -t markdown --bibliography test.bib -sS junk.markdown

    real    0m0.855s
    user    0m0.786s
    sys     0m0.039s

we get a fairly significant speedup. Just looking at the sys times, and 
combining the time taken to extract and then process the smaller bibtex 
file,

no bib: 0.019s
small bib: 0.054s
full bib: 0.350s
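
The same comparison can also be made on wall-clock (`real`) time, which is what a build actually waits on. A small Python sketch that only re-adds the figures reported above (the dictionary names are made up for illustration):

```python
# Timings in seconds, copied from the `time` output quoted above.
real = {"no_bib": 0.276, "full_bib": 3.628, "extract": 0.360, "small_bib": 0.855}
sys_t = {"no_bib": 0.019, "full_bib": 0.350, "extract": 0.015, "small_bib": 0.039}

def two_step_total(times):
    """Cost of extracting a small bib file and then running pandoc on it."""
    return times["extract"] + times["small_bib"]

real_total = two_step_total(real)    # extract + pandoc, wall-clock
sys_total = two_step_total(sys_t)    # extract + pandoc, sys time
real_speedup = real["full_bib"] / real_total

print("sys:  small bib %.3fs vs full bib %.3fs" % (sys_total, sys_t["full_bib"]))
print("real: small bib %.3fs vs full bib %.3fs (%.1fx faster)"
      % (real_total, real["full_bib"], real_speedup))
```

On these numbers the two-step route costs about 1.2s of wall-clock time against 3.6s for the full bibliography.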

Someone who knows their way around pandoc's --dump-args and --ignore-args 
could probably cook up a decent wrapper script that implements this basic 
process. I don't know whether or not it is worth considering eventually 
implementing something like this in citeproc-hs to help speed things up. 
Obviously using BibTool itself (or the BibTool C library) would introduce 
an additional external dependency, which would be less than ideal.
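
For the first half of that process (getting the citekeys out of the document), a minimal Python sketch might look like the following. It is not the extract_bib.rb script linked below; the regular expression is a simplified assumption about pandoc's citation-key syntax, and `find_citekeys` is a name invented for this example.

```python
import re

# Simplified pattern for pandoc-style citations such as [@Smith1999] or
# @Smith1999; pandoc's real key grammar is richer (assumption).
CITEKEY_RE = re.compile(r"(?<![\w@])@([A-Za-z0-9_][\w:.\-]*)")

def find_citekeys(markdown_text):
    """Return the set of citation keys that appear in markdown_text."""
    keys = set()
    for match in CITEKEY_RE.finditer(markdown_text):
        # Drop punctuation that belongs to the sentence, not the key.
        keys.add(match.group(1).rstrip(".:-"))
    return keys

sample = "As argued [@Smith1999; see also @doe_2005, p. 3]. Mail me@example.org."
print(sorted(find_citekeys(sample)))
```

The lookbehind is there so that ordinary email addresses are not mistaken for citations; the resulting key set could then be handed to whatever tool does the actual extraction.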

David

[BibTool]: http://www.gerd-neugebauer.de/software/TeX/BibTool/index.en.html
[extract_bib.rb]: https://gist.github.com/1383132


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
  2011-11-21 18:41   ` David Sanson
@ 2011-11-21 19:27     ` Joseph Reagle
  2011-12-21 18:24     ` Joseph Reagle
  1 sibling, 0 replies; 16+ messages in thread
From: Joseph Reagle @ 2011-11-21 19:27 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: andrea rossato


On Monday, November 21, 2011, David Sanson wrote:
> Instead of using aux2bib, another strategy for speeding things up is to get 
> the citekeys directly from the document and then extract the smaller bibtex 
> file using the excellent and speedy [BibTool] utility.

Yes, that is actually preferable because aux2bib works on the basis of having an existing aux file around from a previous latex compile. If you aren't using LaTeX as a destination format, or you add any *new* keys, those won't be in the previous aux file.
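
To make that limitation concrete, here is a rough Python sketch that reports which keys cited in a document are absent from an older aux file. It assumes the aux file records citations as BibTeX-style `\citation{...}` lines (other backends may write different macros), and both function names are invented for the example.

```python
import re

# BibTeX-style .aux files record citations as \citation{key1,key2};
# other backends may use different macros (assumption: only this form).
AUX_CITATION_RE = re.compile(r"\\citation\{([^}]*)\}")

def keys_in_aux(aux_text):
    """Collect the citation keys recorded in the text of a .aux file."""
    keys = set()
    for group in AUX_CITATION_RE.findall(aux_text):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys

def missing_from_aux(document_keys, aux_text):
    """Keys cited in the document that a stale .aux file does not know about."""
    return set(document_keys) - keys_in_aux(aux_text)

aux_text = "\\relax\n\\citation{Smith1999}\n\\citation{doe_2005,Lee2010}\n"
print(sorted(missing_from_aux({"Smith1999", "NewKey2011"}, aux_text)))
```

Any key this reports would be missing from the bib file that aux2bib generates until LaTeX has been run again.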

> I threw together a 
> ruby script, [extract_bib.rb] that does this. Testing on a paper with 30 
> references and a bibtex database with 1149 items, on my MacBook,

I plan on doing something like that in my python wrapper too so as to overcome the limitations of aux2bib.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
  2011-11-21 18:41   ` David Sanson
  2011-11-21 19:27     ` Joseph Reagle
@ 2011-12-21 18:24     ` Joseph Reagle
       [not found]       ` <201112211324.59757.joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  1 sibling, 1 reply; 16+ messages in thread
From: Joseph Reagle @ 2011-12-21 18:24 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw; +Cc: David Sanson, andrea rossato

On Monday, November 21, 2011, David Sanson wrote:
> Instead of using aux2bib, another strategy for speeding things up is to get 
> the citekeys directly from the document and then extract the smaller bibtex 
> file using the excellent and speedy [BibTool] utility. I threw together a 
> ruby script, [extract_bib.rb] that does this. Testing on a paper with 30 
> references and a bibtex database with 1149 items, on my MacBook,

I've now adopted this approach too, but got annoyed with the fickleness of BibTool and bib2bib and wrote a python library [mdn2bib] for parsing and subsetting bibtex files -- which I've now released on github with my varied pandoc wrappers.

[mdn2bib]: https://github.com/reagle/pandoc-wrappers/blob/master/mdn2bib.py
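
To give a flavour of what subsetting a bibtex file involves, here is a minimal sketch. It is not the code in mdn2bib; it is a deliberately strict illustration that assumes brace-delimited entries with balanced braces, and `subset_bibtex` is a made-up name.

```python
import re

# Start of an entry: @type{key,
ENTRY_START_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")

def subset_bibtex(bib_text, wanted_keys):
    """Return only the entries of bib_text whose key is in wanted_keys."""
    kept = []
    pos = 0
    while True:
        match = ENTRY_START_RE.search(bib_text, pos)
        if match is None:
            break
        # Walk forward from the entry's opening brace to its matching close.
        depth = 0
        for i in range(match.start(), len(bib_text)):
            if bib_text[i] == "{":
                depth += 1
            elif bib_text[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        else:
            break  # unbalanced braces: stop rather than guess
        if match.group(2) in wanted_keys:
            kept.append(bib_text[match.start():end])
        pos = end
    return "\n\n".join(kept) + ("\n" if kept else "")

bib = """@article{Smith1999,
  title = {A {Braced} Title},
  year = {1999}
}

@book{doe_2005,
  title = {Another},
  year = {2005}
}
"""
print(subset_bibtex(bib, {"doe_2005"}))
```

Because each entry is copied verbatim and never interpreted, the filter stays fast, at the price of rejecting files that a more forgiving parser would accept.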


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
       [not found]       ` <201112211324.59757.joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2018-01-08 13:21         ` Joseph
       [not found]           ` <7dbad100-7790-4e03-b274-0f954bdfef64-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Joseph @ 2018-01-08 13:21 UTC (permalink / raw)
  To: pandoc-discuss


Since 2011, using my pandoc-wrappers, I've been contentedly subsetting my bibliography files (using the keys occurring in the markdown file) to speed up pandoc/citeproc. Does anyone know whether there has been any clarity on why pandoc is so slow with large bibliographies?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
       [not found]           ` <7dbad100-7790-4e03-b274-0f954bdfef64-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-01-08 17:30             ` John MACFARLANE
  2018-01-08 18:52               ` Joseph Reagle
  0 siblings, 1 reply; 16+ messages in thread
From: John MACFARLANE @ 2018-01-08 17:30 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

What format?  If it's anything other than bibtex/biblatex,
it's due to the bibutils library.  But pandoc-citeproc
parses bibtex and biblatex itself.  Certainly speed
improvements should be possible, but we need to understand
the problem and have some good samples to test on.

+++ Joseph [Jan 08 18 05:21 ]:
> Does anyone know if there was any clarity on why pandoc is so slow with large bibliographies?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
  2018-01-08 17:30             ` John MACFARLANE
@ 2018-01-08 18:52               ` Joseph Reagle
       [not found]                 ` <b893c212-ee8f-120e-d75e-8f13effaed69-4Y/TWYOca5dXfO9P/gJGhg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Joseph Reagle @ 2018-01-08 18:52 UTC (permalink / raw)
  To: pandoc-discuss, John MACFARLANE



On 1/8/18 12:30 PM, John MACFARLANE wrote:
> What format?  If it's anything other than bibtex/biblatex,
> it's due to the bibutils library.  But pandoc-citeproc
> parses bibtex and biblatex itself. 

The bibliography files are between 4 and 5 MB.

4.5M Jan  8 13:36 readings.bib
4.9M Jan  8 13:30 readings.yaml

YAML is 3x slower than bibtex.

For my wrappers, I use a very fast (very strict) parser to subset the bibliographies to those used in the document. 

I recently found myself wishing again that I could switch to pandoc within my static blog generator (Pelican), but at 9s a page I can't.

```
╰─➤  time pandoc --bibliography=readings.yaml <<< "Rittman tweeted [@Rittman20163hl]."     

<p>Rittman tweeted <span class="citation" data-cites="Rittman20163hl">(Rittman 2016)</span>.</p>
<div id="refs" class="references">
<div id="ref-Rittman20163hl">
<p>Rittman, Mark. 2016. “3 Hrs Later and Still No Tea. Mandatory Recalibration Caused Wifi Base-Station Reset, Now Port-Scanning Network to Find Where Kettle Is Now.” Twitter. October 11, 2016. <a href="https://twitter.com/markrittman/status/785763443185942529" class="uri">https://twitter.com/markrittman/status/785763443185942529</a>.</p>
</div>
</div>
pandoc --bibliography=readings.yaml <<< "Rittman tweeted [@Rittman20163hl]."  9.13s user 0.45s system 97% cpu 9.816 total

╰─➤  time pandoc --bibliography=readings.bib <<< "Rittman tweeted [@Rittman20163hl]."

<p>Rittman tweeted <span class="citation" data-cites="Rittman20163hl">(Rittman 2016)</span>.</p>
<div id="refs" class="references">
<div id="ref-Rittman20163hl">
<p>Rittman, Mark. 2016. “3 Hrs Later and Still No Tea. Mandatory Recalibration Caused Wifi Base-Station Reset, Now Port-Scanning Network to Find Where Kettle Is Now.” Twitter. October 11, 2016. <a href="https://twitter.com/markrittman/status/785763443185942529" class="uri">https://twitter.com/markrittman/status/785763443185942529</a>.</p>
</div>
</div>
pandoc --bibliography=readings.bib <<< "Rittman tweeted [@Rittman20163hl]."  3.37s user 0.21s system 98% cpu 3.637 total
```
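
For anyone wanting to repeat this kind of comparison, a small Python harness along these lines could do the timing. It is only a sketch: `time_command` and `compare` are invented names, the pandoc command line mirrors the one used above, and newer pandoc versions may need different citation options.

```python
import subprocess
import sys
import time

def time_command(cmd, stdin_text="", repeats=3):
    """Run cmd `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, input=stdin_text.encode("utf-8"),
                       stdout=subprocess.DEVNULL, check=True)
        best = min(best, time.perf_counter() - start)
    return best

def compare(bib_files, doc="Rittman tweeted [@Rittman20163hl]."):
    """Print one timing line per bibliography file (requires pandoc on PATH)."""
    for bib in bib_files:
        seconds = time_command(["pandoc", "--bibliography=" + bib], doc)
        print("%-20s %.2fs" % (bib, seconds))

# Example call, not executed here:
# compare(["readings.yaml", "readings.bib"])
```

Taking the best of several runs reduces noise from disk caching, which matters when the files being parsed are several megabytes.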

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
       [not found]                 ` <b893c212-ee8f-120e-d75e-8f13effaed69-4Y/TWYOca5dXfO9P/gJGhg@public.gmane.org>
@ 2018-01-08 21:04                   ` John MACFARLANE
  2018-01-08 22:34                     ` Joseph Reagle
  0 siblings, 1 reply; 16+ messages in thread
From: John MACFARLANE @ 2018-01-08 21:04 UTC (permalink / raw)
  To: Joseph Reagle; +Cc: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I did some tests with a 17M bibtex file.

% time echo "@dworkin_1973" | pandoc --bibliography philosophy.json -t plain -F pandoc-citeproc
Dworkin (1973)

Dworkin, Ronald. 1973. “The Original Position.” _University of Chicago
Law Review_ 40 (3):500–533.

real    0m2.743s

% time echo "@dworkin_1973" | pandoc --bibliography philosophy.bib -t plain -F pandoc-citeproc 
Dworkin (1973)

Dworkin, Ronald. 1973. “The Original Position.” _University of Chicago
Law Review_ 40 (3):500–533.

real    0m15.746s

% time echo "@dworkin_1973" | pandoc --bibliography philosophy.yaml -t plain -F pandoc-citeproc
pandoc-citeproc: reference dworkin_1973 not found
(???)

real    0m42.967s

I don't yet understand either

(a) why it's much faster with a json bibliography than with
    bibtex, and much faster with bibtex than with pandoc yaml.

(b) why the reference isn't found in the last case with the
    yaml bibliography.

Incidentally, the yaml and json versions were created from
the bibtex version using pandoc-citeproc -j/-y.

PS. In the future please send to pandoc-discuss only, not
also to my email address.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
  2018-01-08 21:04                   ` John MACFARLANE
@ 2018-01-08 22:34                     ` Joseph Reagle
       [not found]                       ` <8eb8b80b-0e69-f5b7-36d9-afe7f8e4c348-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Joseph Reagle @ 2018-01-08 22:34 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 1/8/18 4:04 PM, John MACFARLANE wrote:
> (a) why it's much faster with a json bibliography than with
>    bibtex, and much faster with bibtex than with pandoc yaml.

Converting my bibtex to json, I also find the latter is much faster.

json   1.3s
bibtex 3.3s
yaml   9.1s

A second isn't too bad. I don't remember why I ended up with yaml instead of json when I moved from bib(la)tex; performance has long been a concern of mine with a big bibliography and many files... Perhaps because yaml can be included in the document...

The Python script I use to find keys in a markdown file and quickly create a subset from the large yaml bibliography (2 file reads, regexing in both, and a file write) takes 0.29s.
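
The subsetting half of such a script can be sketched as below. This is an illustration rather than the script itself, and it leans on an assumption about the file layout: every entry starts with an unindented, unquoted `- id: KEY` line, and the file may end with a `...` or `---` line. It filters text; it does not parse YAML.

```python
import re

# An entry begins with an unindented "- id: KEY" line (layout assumption).
ENTRY_RE = re.compile(r"^- id: *(\S+) *$", re.MULTILINE)

def subset_yaml_bibliography(yaml_text, wanted_keys):
    """Keep only the `- id:` entries whose id is in wanted_keys."""
    starts = list(ENTRY_RE.finditer(yaml_text))
    if not starts:
        return yaml_text
    header = yaml_text[:starts[0].start()]
    body_end = len(yaml_text)
    footer = ""
    # An optional closing "..." or "---" line is kept as a footer.
    tail = re.search(r"^(\.\.\.|---)\s*\Z", yaml_text, re.MULTILINE)
    if tail and tail.start() > starts[-1].start():
        body_end, footer = tail.start(), yaml_text[tail.start():]
    kept = []
    for i, match in enumerate(starts):
        end = starts[i + 1].start() if i + 1 < len(starts) else body_end
        if match.group(1) in wanted_keys:
            kept.append(yaml_text[match.start():end])
    return header + "".join(kept) + footer

# Made-up sample in the assumed layout.
sample = """---
references:
- id: Rittman20163hl
  author:
  - family: Rittman
    given: Mark
- id: dworkin_1973
  type: article-journal
...
"""
print(subset_yaml_bibliography(sample, {"dworkin_1973"}))
```

Indented list items such as `- family:` are left alone because only lines starting at column 0 count as entry boundaries.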

> (b) why the reference isn't found in the last case with the
>    yaml bibliography.

Did it complete the parse given it threw an error? (Maybe it failed at 42s after parsing exhausted a resource?)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
       [not found]                       ` <8eb8b80b-0e69-f5b7-36d9-afe7f8e4c348-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2018-01-09  1:02                         ` John MACFARLANE
  2018-01-09 13:53                           ` Joseph Reagle
  0 siblings, 1 reply; 16+ messages in thread
From: John MACFARLANE @ 2018-01-09  1:02 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

OK, I've just made some changes to pandoc-citeproc (in the
master branch).  They change the API, so this would be a
0.13 release when it comes.

With these changes, pandoc-citeproc first checks to see
what citation ids occur in the document, then focuses its
parsing efforts on these, ignoring the others.

This dramatically improves performance with large bibtex
files. I got ~1.1s performance for a 4MB bibtex file
and a document with one citation.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
  2018-01-09  1:02                         ` John MACFARLANE
@ 2018-01-09 13:53                           ` Joseph Reagle
       [not found]                             ` <879a0ce7-7ac1-407b-228b-3f2cec0adbb2-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Joseph Reagle @ 2018-01-09 13:53 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 1/8/18 8:02 PM, John MACFARLANE wrote:
> With these changes, pandoc-citeproc first checks to see what citation
> ids occur in the document, then focuses its parsing efforts on these,
> ignoring the others.

Excellent, I look forward to seeing it! Was this bibtex only though? And if it works for YAML, will it work for all ways of including biblio data?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
       [not found]                             ` <879a0ce7-7ac1-407b-228b-3f2cec0adbb2-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2018-01-09 17:57                               ` John MacFarlane
       [not found]                                 ` <20180109175719.GE43599-9Rnp8PDaXcadBw3G0RLmbRFnWt+6NQIA@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: John MacFarlane @ 2018-01-09 17:57 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

So far, bibtex only.  With YAML we just use pandoc to parse a
regular pandoc metadata block.  It might be possible to
preprocess the string first to remove irrelevant IDs,
but I'm not sure we can do that with 100% accuracy without
really parsing the YAML...



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
       [not found]                                 ` <20180109175719.GE43599-9Rnp8PDaXcadBw3G0RLmbRFnWt+6NQIA@public.gmane.org>
@ 2018-01-09 20:18                                   ` John MACFARLANE
  2018-01-09 20:59                                     ` Joseph Reagle
  0 siblings, 1 reply; 16+ messages in thread
From: John MACFARLANE @ 2018-01-09 20:18 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

OK, I just got it working with YAML.
Using an 800k YAML bibliography was almost instant!



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
  2018-01-09 20:18                                   ` John MACFARLANE
@ 2018-01-09 20:59                                     ` Joseph Reagle
       [not found]                                       ` <359f7adb-12f8-d493-fc34-57c182f02f14-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Joseph Reagle @ 2018-01-09 20:59 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 1/9/18 3:18 PM, John MACFARLANE wrote:
> OK, I just got it working with YAML.
> Using an 800k YAML bibliography was almost instant!
Great! Definitely looking forward to it as I use YAML now.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
       [not found]                                       ` <359f7adb-12f8-d493-fc34-57c182f02f14-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2018-01-09 22:53                                         ` John MACFARLANE
  2018-01-10 16:36                                           ` Joseph Reagle
  0 siblings, 1 reply; 16+ messages in thread
From: John MACFARLANE @ 2018-01-09 22:53 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

New timings, with a document with one citation, using a
17MB bibliography with 10,500 entries (on a 3-year old iMac).

yaml:   1.4s
json:   2.7s
bibtex: 4.5s

So, this is looking MUCH better, and maybe you can now
dispense with aux2bib.  (I also tried with a 4 MB
bibliography; it took 0.58s for a document with one
citation.)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: tip: use aux2bib to get around slow bibutils parsing
  2018-01-09 22:53                                         ` John MACFARLANE
@ 2018-01-10 16:36                                           ` Joseph Reagle
  0 siblings, 0 replies; 16+ messages in thread
From: Joseph Reagle @ 2018-01-10 16:36 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 1/9/18 5:53 PM, John MACFARLANE wrote:
> So, this is looking MUCH better, and maybe you can now dispense with
> aux2bib.  (I also tried with a 4 MB bibliography; it took 0.58s for a
> document with one citation.)

Awesome! Just pulled pandoc-citeproc 0.13 and my test for yaml dropped from 9.1s to 1.2s!


^ permalink raw reply	[flat|nested] 16+ messages in thread
