Re: Best way to create a large number of documents from database

From: Mojca Miklavec <mojca.miklavec.lists@gmail.com>
To: mailing list for ConTeXt users <ntg-context@ntg.nl>
Subject: Re: Best way to create a large number of documents from database
Date: Fri, 17 Apr 2020 16:37:48 +0200
Message-ID: <CALBOmsbY4nCvPaef-DjOS-1MZmrzeNRLTg=Ko2YdPYipimLBqg@mail.gmail.com>
In-Reply-To: <CALBOmsa8zz=++caJUv+cToXevXbs-g=-b8BJHqF8A1=NseGTXg@mail.gmail.com>

On Thu, 16 Apr 2020 at 16:38, Mojca Miklavec wrote:
> On Thu, 16 Apr 2020 at 11:29, Taco Hoekwater wrote:
> > > On 16 Apr 2020, at 11:12, Mojca Miklavec wrote:
> > >
> > > I have been asked to create a few thousand PDF documents from a CSV
> > > "database" today
> >
> > In CPU cycles, the fastest way is to do a single context --once
> > run generating all the pages as a single document, then using
> > mutool merge to split it into separate documents using a (shell)
> > loop.
>
> Just to make it clear: I don't really need to optimize on the CPU end,

... says the optimist ... :) :) :)

> as the bottleneck is on the other side of the keyboard, so as long as
> the CPU can process 5k pages today, I'm fine with it :) :) :)

While the bottleneck was indeed on the other side of the keyboard
(preparation certainly took longer than execution), it still took
about 2.5 hours to generate the full batch.

(I'm pretty sure I could have optimised the code further. 1 second per
run is still pretty fast [when I started using context it was more
like 30 seconds per run], but it adds up when talking about thousands
of pages. This strongly reminds me of the awesome speedup that Hans
achieved when rewriting the mplib code & the initial \sometxt changes
inside metapost, which also led to 100-fold speedups as one no longer
needed to start TeX a zillion times.)

While waiting I wanted to start being clever and do the processing in
the same folder in parallel (I have lots of cores after all), and
ended up calling a script with
    context --N={n} --output=doc-{nnnn}.pdf template.tex
    context --purge
only to notice much later that running multiple context runs in the
same folder (some of them compiling and some of them deleting the
temporary files) might not have been the best idea on the planet: many
documents ended up missing and many were corrupted, so I had to rerun
half of them.
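
For anyone tempted to try the same thing: a minimal sketch of how the
parallel runs could be kept apart, by giving every job its own scratch
folder. I have not run this against the real batch. The context
command line is the one quoted above; the function names, the 4-digit
zero padding for {nnnn}, the worker count and the idea that the
template can simply be copied into the scratch folder (images would
have to be copied too, or referenced by absolute path) are all my
assumptions.

```python
import shutil
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def render_one(n, template, out_dir, runner=subprocess.run):
    """Compile document number n in a private scratch folder (hypothetical helper)."""
    name = f"doc-{n:04d}.pdf"  # assumption: {nnnn} means 4-digit zero padding
    with tempfile.TemporaryDirectory(prefix=f"job-{n:04d}-") as scratch:
        # Assumption: the template is self-contained once copied here.
        shutil.copy(str(template), scratch)
        cmd = ["context", f"--N={n}", f"--output={name}", Path(template).name]
        # Temporary files land in the scratch folder, so no job can
        # delete or overwrite files that belong to another job.
        runner(cmd, cwd=scratch, check=True)
        target = Path(out_dir) / name
        shutil.move(str(Path(scratch) / name), str(target))
    # The scratch folder is gone at this point, which replaces the
    # separate "context --purge" call.
    return target


def render_all(numbers, template, out_dir, workers=4, runner=subprocess.run):
    """Run render_one for every number, a few jobs at a time."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = pool.map(
            lambda n: render_one(n, template, out_dir, runner), numbers
        )
        return list(jobs)
```

Calling render_all(range(1, 5001), "template.tex", "out") would then
leave doc-0001.pdf to doc-5000.pdf in out/. Threads are enough here
because the real work happens in the context subprocesses.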

One interesting statistic: I used a bunch of images (the same PNG
images in all documents; about 290 k in total).

The generated documents were 1.5 GB in size. Compressing them with
tar.gz made almost no noticeable difference (1.4 GB compressed vs.
1.5 GB uncompressed). But tar.xz compressed the 1.5 GB worth of
documents into merely 27 MB (a single document is 360 k).
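
A plausible explanation, not checked against the actual batch: gzip
(DEFLATE) can only refer back 32 KiB, far less than the size of one
document, so it never notices that the same image data recurs from one
PDF to the next. xz at its default preset uses an 8 MiB dictionary,
which spans many documents. The sketch below reproduces the effect
with synthetic data: one incompressible block (standing in for the
shared image data) repeated several times. The function name, block
size and number of copies are made up for illustration.

```python
import lzma
import random
import zlib


def compare_sizes(block_size=200_000, copies=20, seed=0):
    """Return (raw, deflate, xz) sizes in bytes for a repeated random block."""
    # Random bytes are incompressible on their own, like PNG data.
    rng = random.Random(seed)
    block = rng.getrandbits(8 * block_size).to_bytes(block_size, "little")
    data = block * copies
    # zlib is the DEFLATE algorithm used by gzip: 32 KiB window, so a
    # repeat that starts 200 000 bytes earlier is out of reach.
    deflated = zlib.compress(data, 9)
    # lzma.compress writes the .xz format at the default preset, whose
    # dictionary is large enough to cover every earlier copy.
    xz = lzma.compress(data)
    return len(data), len(deflated), len(xz)


raw, deflated, xz = compare_sizes()
print(f"raw {raw}  deflate {deflated}  xz {xz}")
```

The deflate size should stay close to the raw size, while the xz size
should be close to that of a single block.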

The documents have been e-mailed out, but now hard copies need to be
printed for the archive. I'm happy I don't need to be the one printing
and storing all that :) :) :)

Mojca