Sorry, I forgot to report the results of my text extraction tests. In short: everything worked fine.

I ran these commands:

kp@mb:git.pandoc.trunk > *pdftotext -layout myreadme_pdflatex.pdf myreadme_pdflatex--pdftotext.text*
kp@mb:git.pandoc.trunk > *pdftotext -layout myreadme_xelatex.pdf myreadme_xelatex--pdftotext.text*
kp@mb:git.pandoc.trunk > *wc -l *.text*
    2212 myreadme_pdflatex--pdftotext.text
    2230 myreadme_xelatex--pdftotext.text

This shows the number of text lines extracted (2212 and 2230, respectively). Visual inspection of the *.text files showed no problems for either of the source PDF files.

Of course, your mileage may vary as soon as you start using custom font settings with xelatex (or lualatex, should it work for you).

However, I'm very happy that pandoc + pdflatex/xelatex work so well with their default settings when it comes to fonts and text extraction (LaTeX-based PDF files used to be infamous for causing huge problems when it came to text extraction or merging them with other PDF files).

On Saturday, 11 January 2014 16:12:11 UTC+1, kurt.p...-gM/ wrote:
>
> A few weeks ago I was playing with different settings to create PDFs
> from my own Markdown files, using --latex-engine=pdflatex|xelatex|lualatex
>
> At the time I noticed there were significant performance differences:
>
>    - *pdflatex* was the fastest (but sometimes had problems with special
>    characters, like German umlauts)
>    - *xelatex* was significantly slower (but handled my umlauts out of
>    the box)
>    - *lualatex* was extremely slow, and in many cases didn't finish the
>    job at all but threw an error
>
> But I didn't have much time to investigate more deeply -- I decided to
> write most of my content in Markdown first, before turning to fine-tuning
> the style details of the different output formats.
>
> Today I found some time to start taking a deeper look at the performance.
> In order to have a common (and stable) base for these measurements, I'm
> using the main README file from pandoc's Git repository as my Markdown
> source.
>
> I'm using the freshly released version 1.12.3, installed via cabal on a
> Macbook (running Mavericks), the different LaTeX-engines were installed via
> MacPorts:
>
> kp@mb:git.pandoc.trunk > *pandoc --version*
> pandoc 1.12.3
> Compiled with texmath 0.6.6, highlighting-kate 0.5.6.
> [...]
>
>
> kp@mbp:git.pandoc.trunk > *pdflatex --version*
> pdfTeX 3.1415926-2.5-1.40.14 (TeX Live 2013/MacPorts 2013_5)
> kpathsea version 6.1.1
> Copyright 2013 Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
> There is NO warranty. Redistribution of this software is
> covered by the terms of both the pdfTeX copyright and
> the Lesser GNU General Public License.
> For more information about these matters, see the file
> named COPYING and the pdfTeX source.
> Primary author of pdfTeX: Peter Breitenlohner (eTeX)/Han The Thanh
> (pdfTeX).
> Compiled with libpng 1.6.7; using libpng 1.6.8
> Compiled with zlib 1.2.8; using zlib 1.2.8
> Compiled with poppler version 0.24.4
>
> kp@mbp:git.pandoc.trunk > *xelatex --version*
> XeTeX 3.1415926-2.5-0.9999.3-2013122212 (TeX Live 2013/MacPorts 2013_5)
> kpathsea version 6.1.1
> Copyright 2013 SIL International and Jonathan Kew.
> There is NO warranty. Redistribution of this software is
> covered by the terms of both the XeTeX copyright and
> the Lesser GNU General Public License.
> For more information about these matters, see the file
> named COPYING and the XeTeX source.
> Primary author of XeTeX: Jonathan Kew.
> Compiled with ICU version 51.2; using 51.2
> Compiled with zlib version 1.2.8; using 1.2.8
> Compiled with FreeType2 version 2.5.2; using 2.5.2
> Compiled with Graphite2 version 1.2.4; using 1.2.4
> Compiled with HarfBuzz version 0.9.25; using 0.9.25
> Using Mac OS X Core Text, Cocoa & ImageIO frameworks
>
> kp@mbp:git.pandoc.trunk > *lualatex --version*
> This is LuaTeX, Version beta-0.76.0-2013122212 (TeX Live 2013/MacPorts
> 2013_5) (rev 4627)
> Execute 'luatex --credits' for credits and version details
> There is NO warranty. Redistribution of this software is covered by
> the terms of the GNU General Public License, version 2 or (at your
> option)
> any later version. For more information about these matters, see the file
> named COPYING and the LuaTeX source.
> Copyright 2013 Taco Hoekwater, the LuaTeX Team.
>
>
>
> *Speed Differences pdflatex vs. xelatex*
>
> Here are the first results from my performance testing:
>
>
> kp@mb:git.pandoc.trunk > *time for i in {1..10}; do pandoc -f markdown
> --latex-engine=pdflatex -o myreadme_pdflatex_${i}.pdf README; done*
> real 0m19.262s
> user 0m23.205s
> sys 0m1.032s
>
> kp@mb:git.pandoc.trunk > *time for i in {1..10}; do pandoc -f markdown
> --latex-engine=xelatex -o myreadme_xelatex_${i}.pdf README; done*
> real 0m44.976s
> user 0m50.706s
> sys 0m2.519s
>
>
> So it seems fair to state that the *xelatex* path to PDF takes a bit more
> than double the time of the *pdflatex* path (about 45 s vs. 19 s for ten
> runs, a factor of roughly 2.3).
>
> *lualatex is b0rken for me*
>
> However, lualatex didn't work at all:
>
> kp@mb:git.pandoc.trunk > *time pandoc -f markdown --latex-engine=lualatex
> -o myreadme_lualatex.pdf README*
> pandoc: Error producing PDF from TeX source.
> This is LuaTeX, Version beta-0.76.0-2013122212 (rev 4627)
> restricted \write18 enabled.
> (/var/folders/80/3wtx3wys21l921zvl6mp_lp80000gn/T/tex2pdf.60796/input.tex
> LaTeX2e <2011/06/27>
> Babel <3.9f> and hyphenation patterns for 43 languages loaded.
> (/opt/local/share/texmf-texlive/tex/latex/base/article.cls
> Document Class: article 2007/10/19 v1.4h Standard LaTeX document class
> (/opt/local/share/texmf-texlive/tex/latex/base/size10.clo))
> (/opt/local/share/texmf-texlive/tex/latex/base/fontenc.sty
> (/opt/local/share/texmf-texlive/tex/latex/base/t1enc.def))
> (/opt/local/share/texmf-texlive/tex/latex/lm/lmodern.sty)
> (/opt/local/share/texmf-texlive/tex/latex/amsfonts/amssymb.sty
> (/opt/local/share/texmf-texlive/tex/latex/amsfonts/amsfonts.sty))
> (/opt/local/share/texmf-texlive/tex/latex/amsmath/amsmath.sty
> For additional information on amsmath, use the `?' option.
> (/opt/local/share/texmf-texlive/tex/latex/amsmath/amstext.sty
> (/opt/local/share/texmf-texlive/tex/latex/amsmath/amsgen.sty))
> (/opt/local/share/texmf-texlive/tex/latex/amsmath/amsbsy.sty)
> (/opt/local/share/texmf-texlive/tex/latex/amsmath/amsopn.sty))
> (/opt/local/share/texmf-texlive/tex/generic/ifxetex/ifxetex.sty)
> (/opt/local/share/texmf-texlive/tex/generic/oberdiek/ifluatex.sty)
> (/opt/local/share/texmf-texlive/tex/latex/base/fixltx2e.sty)
> (/opt/local/share/texmf-texlive/tex/latex/upquote/upquote.sty
> (/opt/local/share/texmf-texlive/tex/latex/base/textcomp.sty
> (/opt/local/share/texmf-texlive/tex/latex/base/ts1enc.def)))
> (/opt/local/share/texmf-texlive/tex/latex/fontspec/fontspec.sty
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/expl3.sty
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3names.sty
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3bootstrap.sty
> (/opt/local/share/texmf-texlive/tex/generic/oberdiek/luatex.sty
> (/opt/local/share/texmf-texlive/tex/generic/oberdiek/infwarerr.sty)
> (/opt/local/share/texmf-texlive/tex/latex/etex-pkg/etex.sty)
> (/opt/local/share/texmf-texlive/tex/generic/oberdiek/luatex-loader.sty
> (/opt/local/share/texmf-texlive/scripts/oberdiek/oberdiek.luatex.lua)))
> (/opt/local/share/texmf-texlive/tex/generic/oberdiek/pdftexcmds.sty
> (/opt/local/share/texmf-texlive/tex/generic/oberdiek/ltxcmds.sty)
> (/opt/local/share/texmf-texlive/tex/generic/oberdiek/ifpdf.sty))))
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3basics.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3expan.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3tl.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3seq.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3int.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3quark.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3prg.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3clist.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3token.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3prop.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3msg.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3file.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3skip.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3keys.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3fp.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3box.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3coffins.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3color.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3luatex.sty)
> (/opt/local/share/texmf-texlive/tex/latex/l3kernel/l3candidates.sty))
> (/opt/local/share/texmf-texlive/tex/latex/l3packages/xparse/xparse.sty)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload.sty
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/luatexbase.sty
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/luatexbase-compat.sty)
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/luatexbase-modutils.sty
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/luatexbase-loader.sty
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/luatexbase.loader.lua))
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/modutils.lua))
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/luatexbase-regs.sty)
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/luatexbase-attr.sty
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/attr.lua))
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/luatexbase-cctb.sty
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/cctb.lua))
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/luatexbase-mcb.sty
> (/opt/local/share/texmf-texlive/tex/luatex/luatexbase/mcb.lua)))
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload.lua)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-merged.lua)(using write cache:
> /Users/kurtpfeifle/.texlive2013/texmf-var/luatex-cache/generic)(
> using read cache: /opt/local/var/db/texmf/luatex-cache/generic
> /Users/kurtpfeifle/.texlive2013/texmf-var/luatex-cache/generic)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-lib-dir.lua)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-override.lua)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-loaders.lua)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-database.lua)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-colors.lua)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-features.lua)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-extralibs.lua)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-typo-krn.lua)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-letterspace.lua)
> (/opt/local/share/texmf-texlive/tex/luatex/luaotfload/luaotfload-auxiliary.lua))
> (/opt/local/share/texmf-texlive/tex/latex/fontspec/fontspec.lua)
> (/opt/local/share/texmf-texlive/tex/latex/fontspec/fontspec-patches.sty
> *************************************************
> * LaTeX warning: "xparse/redefine-command"
> *
> * Redefining document command \oldstylenums with arg. spec. 'm' on line
> 128.
> *************************************************
> ) (/opt/local/share/texmf-texlive/tex/latex/fontspec/fontspec-luatex.sty
> (/opt/local/share/texmf-texlive/tex/latex/base/fontenc.sty
> (/opt/local/share/texmf-texlive/tex/latex/euenc/eu2enc.def)
> (/opt/local/share/texmf-texlive/tex/latex/euenc/eu2lmr.fd
>
> real 3m31.995s
> user 3m20.349s
> sys 0m7.601s
>
>
> kp@mb:git.pandoc.trunk > *echo $?*
> 43
>
>
> The lualatex engine didn't produce any PDF:
>
> kp@mb:git.pandoc.trunk > *ls -tlar *.pdf*
> -rw-r--r-- 1 kp staff 470292 Jan 11 12:01 myreadme_pdflatex.pdf
> -rw-r--r-- 1 kp staff 205823 Jan 11 12:02 myreadme_xelatex.pdf
>
>
>
> *Fixing the Page Size Differences*
>
> Another significant difference in the output of the two successful PDF
> conversions:
>
>    - *xelatex* used A4 media format for the PDF pages
>    - *pdflatex* used Letter format
>
> (But I guess these defaults are built into the respective engines and have
> nothing to do with pandoc. Or do they?) This page size difference makes it
> hard to inspect the two PDFs side by side for any more subtle differences
> in their pages' appearance.
>
> So in order to make the output of the two working engines easier to
> compare, I extended my command line options:
>
> time pandoc \
>      -V "geometry:paperwidth=8.26387in" \
>      -V "geometry:paperheight=29.7cm" \
>      -V "geometry:vmargin=40pt" \
>      -V "geometry:hmargin=40pt" \
>      -f markdown \
>      --latex-engine=pdflatex \
>      -o myreadme_pdflatex.pdf \
>      README
>
>
> time pandoc \
>      -V "geometry:paperwidth=8.26387in" \
>      -V "geometry:paperheight=29.7cm" \
>      -V "geometry:vmargin=40pt" \
>      -V "geometry:hmargin=40pt" \
>      -f markdown \
>      --latex-engine=xelatex \
>      -o myreadme_xelatex.pdf \
>      README
>
>
> The timings didn't change significantly, but now I have two PDFs with the
> same page size for inspection, to see if there are any qualitative
> differences.
>
> At first glance, both PDFs look nearly identical. However, some word
> spacings are slightly different, leading to lines which wrap differently,
> which leads to more differences in line wraps on later pages, and finally
> to some pages which break differently.
>
> Not a big issue, though.
>
> *PDF Metadata*
>
> kp@mb:git.pandoc.trunk > *pdfinfo myreadme_pdflatex.pdf*
> Title:          Pandoc User's Guide
> Subject:
> Keywords:
> Author:         John MacFarlane
> Creator:        LaTeX with hyperref package
> Producer:       pdfTeX-1.40.14
> CreationDate:   Sat Jan 11 14:28:53 2014
> ModDate:        Sat Jan 11 14:28:53 2014
> Tagged:         no
> Form:           none
> Pages:          36
> Encrypted:      no
> Page size:      594.999 x 841.89 pts (A4)
> Page rot:       0
> File size:      455149 bytes
> Optimized:      no
> PDF version:    1.5
>
> kp@mb:git.pandoc.trunk > *pdfinfo myreadme_xelatex.pdf*
> Title:          Pandoc User's Guide
> Author:         John MacFarlane
> Creator:        LaTeX with hyperref package
> Producer:       xdvipdfmx (0.7.9)
> CreationDate:   Sat Jan 11 14:28:04 2014
> Tagged:         no
> Form:           none
> Pages:          37
> Encrypted:      no
> Page size:      595 x 841.89 pts (A4)
> Page rot:       0
> File size:      189281 bytes
> Optimized:      no
> PDF version:    1.5
>
>
> As you can see, there are a few significant differences:
>
>    - *File size:* pdflatex outputs ~444 kB, xelatex outputs ~185 kB
>    (difference of ~259 kB).
>    - *Page count:* pdflatex generates 36 pages, xelatex generates 37
>    pages.
>    - *Producer:* pdflatex states "pdfTeX-1.40.14", xelatex states
>    "xdvipdfmx (0.7.9)". This means xelatex takes a detour via DVI to
>    produce its PDF.
>    - *Subject* and *Keywords:* pdflatex puts these metadata fields into
>    the PDF (into its document information dictionary) but leaves them
>    empty; xelatex omits them altogether, and it omits ModDate as well.
>    - *Page size:* despite identical command line parameters, there are
>    slight differences in the page size (594.999 pts vs. 595 pts wide). I
>    assume this is because of the DVI detour of xelatex, which may
>    introduce some rounding errors when calculating stuff.
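
The page size and file size figures reported by pdfinfo can be cross-checked with a little arithmetic. The following is only a sketch: it assumes that pdfinfo reports PDF points (72 per inch, with 2.54 cm per inch), that awk is available, and the helper names in_to_pt, cm_to_pt and size_diff are made up for this illustration.

```shell
#!/bin/sh
# Convert the geometry values passed to pandoc into points
# (assumption: 72 pt = 1 in, 2.54 cm = 1 in).
in_to_pt() { awk -v v="$1" 'BEGIN { printf "%.3f\n", v * 72 }'; }
cm_to_pt() { awk -v v="$1" 'BEGIN { printf "%.2f\n", v / 2.54 * 72 }'; }

# Difference between two "File size" values from pdfinfo, in bytes and KiB.
size_diff() {
    awk -v a="$1" -v b="$2" \
        'BEGIN { d = a - b; printf "%d bytes (%.1f KiB)\n", d, d / 1024 }'
}

in_to_pt 8.26387          # paperwidth  -> 594.999
cm_to_pt 29.7             # paperheight -> 841.89
size_diff 455149 189281   # pdflatex vs. xelatex -> 265868 bytes (259.6 KiB)
```

So the pdflatex page size corresponds exactly to the requested geometry values, and the 595 pts reported for the xelatex PDF looks like the same width after rounding.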
>
>
> *PDF Fonts*
>
> kp@mb:git.pandoc.trunk > *pdffonts myreadme_pdflatex.pdf*
> name                                 type              encoding         emb sub uni object ID
> ------------------------------------ ----------------- ---------------- --- --- --- ---------
> YRKMSP+LMRoman17-Regular             Type 1            Custom           yes yes no     347  0
> FTOMDN+LMRoman12-Regular             Type 1            Custom           yes yes no     348  0
> GUKOVW+LMRoman12-Bold                Type 1            Custom           yes yes no     349  0
> CFVARR+LMRoman10-Regular             Type 1            Custom           yes yes no     350  0
> SWKNVD+LMRoman10-Italic              Type 1            Custom           yes yes no     351  0
> GCGIOZ+LMMono10-Regular              Type 1            Custom           yes yes no     353  0
> BPKJXQ+LMMonoLt10-Bold               Type 1            Custom           yes yes no     354  0
> WMNHHZ+LMRoman10-BoldItalic          Type 1            Custom           yes yes no     364  0
> JKORXP+LMRoman10-Bold                Type 1            Custom           yes yes no     365  0
> CFVARR+LMRoman10-Regular             Type 1            Custom           yes yes no     429  0
> GCGIOZ+LMMono10-Regular              Type 1            Custom           yes yes no     446  0
> UMYEZP+LMRoman7-Regular              Type 1            Custom           yes yes no     462  0
> URPVMO+LMRoman6-Regular              Type 1            Custom           yes yes no     463  0
> UAEFEH+LMRoman8-Regular              Type 1            Custom           yes yes no     465  0
> PGWEIL+LMMono8-Regular               Type 1            Custom           yes yes no     466  0
>
> kp@mb:git.pandoc.trunk > *pdffonts myreadme_xelatex.pdf*
> name                                   type            encoding         emb sub uni object ID
> -------------------------------------- --------------- ---------------- --- --- --- ---------
> ERGCXD+LMRoman17-Regular-Identity-H    CID Type 0C     Identity-H       yes yes yes      5  0
> PXEJIZ+LMRoman12-Regular-Identity-H    CID Type 0C     Identity-H       yes yes yes      7  0
> SNYTKW+LMRoman12-Bold-Identity-H       CID Type 0C     Identity-H       yes yes yes      9  0
> DKWVLY+LMRoman10-Regular-Identity-H    CID Type 0C     Identity-H       yes yes yes     11  0
> FKSWYW+LMRoman10-Italic-Identity-H     CID Type 0C     Identity-H       yes yes yes     13  0
> TFUYQQ+LMMono10-Regular-Identity-H     CID Type 0C     Identity-H       yes yes yes     57  0
> EEPFTP+LMMonoLt10-Bold-Identity-H      CID Type 0C     Identity-H       yes yes yes     59  0
> UDLNER+LMRoman10-BoldItalic-Identity-H CID Type 0C     Identity-H       yes yes yes     64  0
> JLOSBI+LMRoman10-Bold-Identity-H       CID Type 0C     Identity-H       yes yes yes     66  0
> QWNYRO+LMRoman7-Regular-Identity-H     CID Type 0C     Identity-H       yes yes yes    139  0
> IKIBKZ+LMRoman6-Regular-Identity-H     CID Type 0C     Identity-H       yes yes yes    142  0
> EIDUGE+LMRoman8-Regular-Identity-H     CID Type 0C     Identity-H       yes yes yes    144  0
> YNRKKN+LMMono8-Regular-Identity-H      CID Type 0C     Identity-H       yes yes yes    146  0
>
> So here is another significant difference:
>
>    - *pdflatex* uses Type 1 (PostScript) fonts with a custom encoding
>    - *xelatex* here converts all fonts to CID Type 0C
>    (CFF/Compact Font Format) fonts with Identity-H encoding
>
> So IMHO this font handling most likely explains a large part of the speed
> and size differences of the two PDFs: converting Type 1 fonts to CID takes
> time (but saves space), and leads to slight differences in character + word
> spacing which finally end up with an additional page being created.
>
> On the other hand, PDFs containing CID fonts with Identity-H encoding, as
> well as those containing any font with a custom encoding, sometimes do not
> play nice when it comes to copy'n'pasting text from their pages, or to
> extracting their text altogether.
>
> But let's check both of these statements...
>
> *Font Size Differences*
>
> I used the following commands to extract the fonts from the two PDFs:
>
> kp@mb:git.pandoc.trunk > *mutool extract myreadme_xelatex.pdf*
> kp@mb:git.pandoc.trunk > *mutool extract myreadme_pdflatex.pdf*
>
>
> (mutool is a companion command line tool to MuPDF.) This gave me 13 *.pfa
> and 13 *.cid files in the current directory.
> A (rough) comparison of the
> combined file sizes for each of the two groups yields this result:
>
> kp@mb:git.pandoc.trunk > *tar cvzf pfas.tar.gz *.pfa 2>/dev/null && ls
> -lh pfas.tar.gz*
>
> -rw-r--r-- 1 kurtpfeifle staff 312K Jan 11 15:53 pfas.tar.gz
> kp@mb:git.pandoc.trunk > *tar cvzf cids.tar.gz *.cid 2>/dev/null && ls
> -lh cids.tar.gz*
>
> -rw-r--r-- 1 kurtpfeifle staff 52K Jan 11 15:53 cids.tar.gz
>
> Extracted *.pfa fonts are no longer compressed, hence I re-compressed
> them inside a tarball. (Inside a PDF, all fonts are usually compressed
> too -- so this should be a more reasonable comparison than the direct
> file size sum of the fonts as they are present when extracted...)
>
> The size difference between the two font groups is ~260 kB, which
> accounts pretty well for the size difference of the respective PDFs.
>
>
> *Summary*
>
>    1. The file size difference is worth switching from the default
>    (pdflatex) engine to xelatex, if output file size is a major concern.
>    However, you pay for this gain with a conversion speed penalty.
>    2. xelatex can give you additional benefits, should you need them: you
>    can more easily switch to different fonts and use advanced OpenType
>    font features.
>    3. However, if you use Markdown/Pandoc to write a paper to submit to
>    some conference organizers who insist on embedding Type 1 fonts into
>    your PDF, you're probably better off sticking with the (default)
>    pdflatex engine.
>    4. It needs to be determined why my lualatex engine currently does not
>    work.
>    5. I would be grateful if other people on this list could test this
>    too, especially with --latex-engine=lualatex, and also post their
>    respective speed and other results [in case it's not b0rken for them as
>    well]...
>
> (It's nice when pandoc works flawlessly -- however, it is quite difficult
> to narrow down the cause of a problem when something goes wrong, like in
> this case with LuaLaTeX.
> I think I'll run a Markdown=>LaTeX conversion
> next, and then run a LaTeX=>PDF conversion manually on the command line, to
> see if I can enable some debugging switches there. Currently I do not have
> any experience with running lualatex, xelatex or pdflatex directly on the
> command line...)
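
The two-step approach mentioned above (Markdown to LaTeX with pandoc, then LaTeX to PDF by hand) could look roughly like the sketch below. The file name myreadme.tex and the helpers run and two_step are placeholders made up for this example. -s, -f, -t and -o are regular pandoc options, and -interaction=nonstopmode and -halt-on-error are standard options of the TeX Live engines. By default the script only prints the commands (DRY_RUN=1), so nothing is executed until DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: split pandoc's PDF generation into two manual steps so that the
# TeX engine's own messages and .log file can be inspected.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN=${DRY_RUN:-1}

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# two_step ENGINE: Markdown -> standalone LaTeX with pandoc,
# then LaTeX -> PDF with ENGINE (pdflatex, xelatex or lualatex).
two_step() {
    run pandoc -s -f markdown -t latex -o myreadme.tex README &&
    run "$1" -interaction=nonstopmode -halt-on-error myreadme.tex
}

two_step lualatex
```

Running it with DRY_RUN=0 executes both steps; the engine then stops at the first error and leaves a myreadme.log next to the .tex file, which is probably the first place to look.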