pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop

public inbox archive for pandoc-discuss@googlegroups.com

* pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
@ 2020-10-26 19:22 Chris Jones
       [not found] ` <af5fe26b-4d84-4dcb-bdcd-6382469c476ao-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Jones @ 2020-10-26 19:22 UTC (permalink / raw)
  To: pandoc-discuss



Six files... ~274,000 words. A pandoc conversion to EPUB last night took 
almost 4 hours. Comparable conversions on the same hardware take at most a 
couple of minutes.

How can I investigate & hopefully optimize? 

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/af5fe26b-4d84-4dcb-bdcd-6382469c476ao%40googlegroups.com.


[parent not found: <af5fe26b-4d84-4dcb-bdcd-6382469c476ao-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]

* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found] ` <af5fe26b-4d84-4dcb-bdcd-6382469c476ao-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-10-26 21:15   ` John MacFarlane
       [not found]     ` <m2a6w8ofib.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: John MacFarlane @ 2020-10-26 21:15 UTC (permalink / raw)
  To: Chris Jones, pandoc-discuss

There are a few things that can trigger pathological behavior in
the markdown parser.

One way to find out what is to divide and conquer, converting
shorter and shorter segments of your document to see if you can
find where things get slow.
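
A rough sketch of that divide-and-conquer timing, in Python. It is not part of pandoc: `split_into_chunks`, `time_chunks`, and `pandoc_convert` are hypothetical helper names, and it assumes that cutting the source at blank lines is acceptable (a cut may land inside a fenced block or a LaTeX environment, which would change how that chunk parses).

```python
import subprocess
import time
from typing import Callable, List, Tuple


def split_into_chunks(text: str, n_chunks: int) -> List[str]:
    """Split markdown text into roughly equal chunks, cutting only at blank lines."""
    blocks = text.split("\n\n")
    size = max(1, -(-len(blocks) // n_chunks))  # ceiling division
    return ["\n\n".join(blocks[i:i + size]) for i in range(0, len(blocks), size)]


def time_chunks(chunks: List[str],
                convert: Callable[[str], None]) -> List[Tuple[int, float]]:
    """Run convert on every chunk; return (chunk index, seconds), slowest first."""
    timings = []
    for index, chunk in enumerate(chunks):
        start = time.perf_counter()
        convert(chunk)
        timings.append((index, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[1], reverse=True)


def pandoc_convert(chunk: str) -> None:
    """Feed one chunk to pandoc on stdin and discard the HTML output."""
    subprocess.run(
        ["pandoc", "-f", "markdown", "-t", "html"],
        input=chunk, text=True, stdout=subprocess.DEVNULL, check=True,
    )
```

Calling `time_chunks(split_into_chunks(source, 16), pandoc_convert)` and then re-splitting the slowest chunk narrows the search step by step. If every chunk is slow, as reported later in this thread, the trigger is probably a construct that recurs throughout the source rather than a single bad passage.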

Another possibility is to use --trace, which will give you
very verbose output that will allow you to determine where
excessive backtracking is occurring.
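
A small sketch for reading a saved trace, in Python. It assumes that each trace message contains the word `Parsed` and ends with `at line N`; `last_parsed_line` is a hypothetical helper name, not something pandoc provides.

```python
import re
from typing import Optional

# Assumed shape of a trace message: "... Parsed [<block>] at line <N>"
TRACE_LINE = re.compile(r"Parsed\b.*\bat line (\d+)\s*$")


def last_parsed_line(trace_output: str) -> Optional[int]:
    """Return the line number from the last 'Parsed ... at line N' message, if any."""
    last = None
    for line in trace_output.splitlines():
        match = TRACE_LINE.search(line)
        if match:
            last = int(match.group(1))
    return last
```

The trace can be captured with something like `pandoc --trace ... 2> trace.log`, since trace messages should go to stderr. Whatever the parser is struggling with comes after the last block reported, so the returned number is a starting point for the search rather than the exact location.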

If you don't need all pandoc extensions, and you're using recent
pandoc, you might try `-f commonmark_x`, which uses the
efficient commonmark parser extended with many (but not all)
pandoc extensions.  I would expect this to be much faster.


[parent not found: <m2a6w8ofib.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>]

* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]     ` <m2a6w8ofib.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
@ 2020-10-27 20:34       ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
       [not found]         ` <e9e43a84-9ec5-4732-8dec-e6caac2e59ffn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2020-10-27 21:50       ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
  1 sibling, 1 reply; 13+ messages in thread
From: cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org @ 2020-10-27 20:34 UTC (permalink / raw)
  To: pandoc-discuss



    There are a few things that can trigger pathological behavior in
    the markdown parser.

    One way to find out what is to divide and conquer, converting
    shorter and shorter segments of your document to see if you can
    find where things get slow.

I sort of did that. Ran pandoc on each individual chapter... hoping I would 
find one that took longer than the rest of them. Not much luck with this 
approach. They all took a long time.


    Another possibility is to use --trace, which will give you
    very verbose output that will allow you to determine where
    excessive backtracking is occurring.

It's been stuck for over 10 minutes (!) on this:

] Parsed [RawBlock (Format "tex") "\\begin{center}\n\\textbf{\\old158 at line 4926

The \old switch allows switching to oldstyle numbers on the fly:
\newcommand{\old}{\addfontfeature{Numbers=OldStyle}}
Unfortunately the number is truncated (it could be 158{0..9}), and the line 
number does not tell me much. I did find a bug in my source in this 
vicinity (caused by a broken regex), but fixing it makes no difference.

Running:

    pandoc -o epub/test.epub md/title.txt md/md.* \
      --css=css/stylesheet.css --epub-embed-font=fonts/* --trace

I'm again stuck in the same exact spot.

Ah… it's come unstuck but now it's stuck on something else. Unfortunately I 
wasn't watching the trace when pandoc started rolling again.

    If you don't need all pandoc extensions, and you're using recent
    pandoc, you might try `-f commonmark_x`, which uses the
    efficient commonmark parser extended with many (but not all)
    pandoc extensions.  I would expect this to be much faster.

Sounds good. 

Could I get a quick reminder of how to install the "nightly" (hopefully a 
standalone version… I vaguely remember it's a statically linked program), or 
of where it's documented? I have done that in the past, but that was over a 
year ago and I don't remember the details.



[parent not found: <e9e43a84-9ec5-4732-8dec-e6caac2e59ffn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]

* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]         ` <e9e43a84-9ec5-4732-8dec-e6caac2e59ffn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-10-27 21:05           ` John MacFarlane
  0 siblings, 0 replies; 13+ messages in thread
From: John MacFarlane @ 2020-10-27 21:05 UTC (permalink / raw)
  To: cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, pandoc-discuss

"cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org" <cjns1989-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> There are a few things that can trigger pathological behavior in
> the markdown parser.
>
> One way to find out what is to divide and conquer, converting
> shorter and shorter segments of your document to see if you can
> find where things get slow.
>
> I sort of did that. Ran pandoc on each individual chapter... hoping I would 
> find one that took longer than the rest of them. Not much luck with this 
> approach. They all took a long time.
>
>
>     Another possibility is to use --trace, which will give you
>     very verbose output that will allow you to determine where
>     excessive backtracking is occurring.
>
> It's been stuck for over 10 minutes (!) on this:
>
> ] Parsed [RawBlock (Format "tex") "\\begin{center}\n\\textbf{\\old158 at line 4926

This got parsed, so it's stuck on whatever comes after this bit
of raw TeX (only the first part is shown; presumably it's the
whole center environment).




* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]     ` <m2a6w8ofib.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
  2020-10-27 20:34       ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
@ 2020-10-27 21:50       ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
       [not found]         ` <22d3d478-357d-464c-b407-aefd2ed81dccn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  1 sibling, 1 reply; 13+ messages in thread
From: cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org @ 2020-10-27 21:50 UTC (permalink / raw)
  To: pandoc-discuss



With the nightly version (2.11.0.4) 

    /tmp/pandoc -o epub/test.epub md/title.txt md2/ch*.md \
      --css=css/stylesheet.css --epub-embed-font=fonts/* \
      --epub-cover-image=images/cover.png

the conversion took seconds.

But pandoc complains that,

[WARNING] This document format requires a nonempty <title> element.
  Defaulting to 'title' as the title.
  To specify a title, use 'title' in metadata or --metadata title="...".

And epubcheck reports the following errors, probably related to the above 
warning:

ERROR(RSC-005): epub/test.epub/EPUB/content.opf(9,14): Error while parsing file: element "metadata" incomplete; missing required element "dc:title"
ERROR(RSC-005): epub/test.epub/EPUB/nav.xhtml(11,134): Error while parsing file: Anchors within nav elements must contain text

Check finished with errors
Messages: 0 fatal / 2 errors / 0 warnings / 0 info

epubcheck completed

The title.txt file contains:

% URBAIN DUBOIS
% La cuisine classique — Volume II

It looks as if pandoc is unable to process the content of the title.txt 
file.

When I take a look at the output everything looks good except that the raw 
latex bits are now included verbatim as if they were part of the text/data.

[parent not found: <22d3d478-357d-464c-b407-aefd2ed81dccn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]

* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]         ` <22d3d478-357d-464c-b407-aefd2ed81dccn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-10-28  0:28           ` John MacFarlane
       [not found]             ` <m2y2jrurb4.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: John MacFarlane @ 2020-10-28  0:28 UTC (permalink / raw)
  To: cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, pandoc-discuss

"cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org" <cjns1989-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> With the nightly version (2.11.0.4) 
>
>     /tmp/pandoc -o epub/test.epub md/title.txt md2/ch*.md 
> --css=css/stylesheet.css --epub-embed-font=fonts/* 
> --epub-cover-image=images/cover.png
>
> the conversion took seconds.
>
> But pandoc complains that,
>
> [WARNING] This document format requires a nonempty <title> element.
>   Defaulting to 'title' as the title.
>   To specify a title, use 'title' in metadata or --metadata title="...".
>
> And the epubcheck report the following errors probably related to the above 
> warning:
>
> ERROR(RSC-005): epub/test.epub/EPUB/content.opf(9,14): Error while parsing 
> file: element "metadata" incomplete; missing required element "dc:title"
> ERROR(RSC-005): epub/test.epub/EPUB/nav.xhtml(11,134): Error while parsing 
> file: Anchors within nav elements must contain text
>
> Check finished with errors
> Messages: 0 fatal / 2 errors / 0 warnings / 0 info
>
> epubcheck completed
>
> The title.txt file contains:
>
> % URBAIN DUBOIS
> % La cuisine classique — Volume II

Weird.  This SHOULD work.  Are you seeing anything
of this in the resulting epub?  (I.e. did it get parsed,
but not as metadata? If so, maybe you need a blank line
at the end of title.txt.)  (Also, I assume your input
format is pandoc markdown?  commonmark_x doesn't include
an extension for this kind of title.)

> When I take a look at the output everything looks good except that the raw 
> latex bits are now included verbatim as if they were part of the text/data.

They shouldn't be -- again, is pandoc markdown your input format?
Maybe a sample of how these occur in the markdown file?


[parent not found: <m2y2jrurb4.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>]

* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]             ` <m2y2jrurb4.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
@ 2020-10-28 18:10               ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
       [not found]                 ` <824220b2-6c2e-4c60-a935-e908f573a3d7n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org @ 2020-10-28 18:10 UTC (permalink / raw)
  To: pandoc-discuss


Sorry for the confusion... I copy-pasted the wrong pandoc command. The one I 
actually used for this particular run that "took seconds" was:

    pandoc -o epub/test.epub md/title.txt md/* --css=css/stylesheet.css \
      --epub-embed-font=fonts/* --epub-cover-image=images/cover.png \
      -f commonmark_x

And yes I did see (same as the raw latex stuff) the content of the 
title.txt file verbatim in the output.

So basically  in my use case this run of pandoc did little more than the 
cat command and format the output as an EPUB file. 

I have tons of script/regex-generated HTML and LaTeX code in this 
source, so it has to be pandoc.markdown input.

The odd thing is that I have been doing this for ages (even Vol. I of this 
same book, which is similar) and never had anything that took ages to 
compile.

Otherwise, with the nightly and without the "-f commonmark_x" flag, the 
situation is unchanged.

Is there any way I could take a storage dump... backtrace... or something 
when I kill the hung job?

Would some kind of filter that takes some kind of snapshot of the internal 
state of the process help?

Thanks,

CJ

P.S. I apologize for the messy reports I have sent in lately but I'm having 
major problems with this particular google group. I had to switch to google 
chrome (a mess on linux. I normally use firefox) in order to be able to 
post. And the posts I tried to send from my mail client never made it to 
the group. I think I mentioned that this is not caused by my local setup 
since I used someone else's account/machine and it still didn't go through. 
Any chance someone might look into this at some point?


[parent not found: <824220b2-6c2e-4c60-a935-e908f573a3d7n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]

* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]                 ` <824220b2-6c2e-4c60-a935-e908f573a3d7n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2020-10-29  0:04                   ` John MacFarlane
       [not found]                     ` <m28sbpucc4.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
  2020-10-30 10:21                   ` BPJ
  2020-10-30 16:49                   ` John MacFarlane
  2 siblings, 1 reply; 13+ messages in thread
From: John MacFarlane @ 2020-10-29  0:04 UTC (permalink / raw)
  To: cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, pandoc-discuss


As I mentioned, --trace is the way to get an internal snapshot
of parsing -- at least at the block level.  It sounds as if
that did tell you where the parser is getting stuck (it would
be AFTER the last traced block).

Putting raw tex blocks inside

```{=latex}
...
```

(the raw attribute syntax) will help the parser in tricky cases,
so you might try that.
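
For a source with many raw LaTeX environments, the wrapping could be scripted. The following is a minimal Python sketch, not an official pandoc tool: `fence_raw_latex` is a hypothetical helper name, and it assumes every raw block is a `\begin{...}` ... `\end{...}` pair starting at the beginning of a line, with no nested environment of the same name.

```python
import re

FENCE = "`" * 3                  # three backticks, built up to keep this listing readable
FENCE_OPEN = FENCE + "{=latex}"  # opening line of a raw attribute block

# A whole environment: \begin{name} at the start of a line up to the matching \end{name}
ENV_BLOCK = re.compile(
    r"^\\begin\{(?P<env>[A-Za-z*]+)\}.*?^\\end\{(?P=env)\}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def fence_raw_latex(markdown: str) -> str:
    """Wrap each top-level LaTeX environment in a raw attribute fence."""
    def wrap(match: "re.Match[str]") -> str:
        before = match.string[:match.start()]
        if before.endswith(FENCE_OPEN + "\n"):
            return match.group(0)  # already fenced, leave untouched
        return FENCE_OPEN + "\n" + match.group(0) + "\n" + FENCE

    return ENV_BLOCK.sub(wrap, markdown)
```

Running it on a copy of one chapter and diffing the result before touching the real source seems prudent, since the regex knows nothing about verbatim text or environments that are only inline.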


[parent not found: <m28sbpucc4.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>]

* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]                     ` <m28sbpucc4.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
@ 2020-10-29 23:35                       ` Chris Jones
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Jones @ 2020-10-29 23:35 UTC (permalink / raw)
  To: pandoc-discuss



With the raw latex explicitly marked as such, as recommended 
above, the compilation takes minutes instead of hours. To add to my 
embarrassment over this difficulty, I now remember that you told me not long 
ago to do this when pandoc goes postal. I guess I was too focused on the 
fact that I was creating an EPUB, not a latex/pdf document, to remember this 
piece of advice. After adding hundreds of such ```{=latex} tags the code does 
not look any cleaner, but it definitely addresses the problem. As for the 
generation of a pdf from the same source, it takes quite a long time but 
nothing out of the ordinary.

Thank you for your patience

On Wednesday, October 28, 2020 at 8:04:44 PM UTC-4, John MacFarlane wrote:
>
>
> As I mentioned, --trace is the way to get an internal snap shot 
> of parsing -- at least at the block level.  It sounds as if 
> that did tell you where the parser is getting stuck (it would 
> be AFTER the last traced block). 
>
> Putting raw tex blocks inside 
>
> ```{=latex} 
> ... 
> ``` 
>
> (the raw attribute syntax) will help the parser in tricky cases, 
> so you might try that. 
>
> "cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org" <cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org <javascript:>> writes: 
>
> > Sorry for the confusion.... copy-pasted the wrong pandoc command. The 
> one I 
> > actutally used  for this particular run that "took seconds" was: 
> > 
> > pandoc -o epub/test.epub md/title.txt md/* --css=css/stylesheet.css 
> > --epub-embed-font=fonts/* --epub-cover-image=images/cover.png -f 
> > commonmark_x 
> > 
> > And yes I did see (same as the raw latex stuff) the content of the 
> > title.txt file verbatim in the output. 
> > 
> > So basically  in my use case this run of pandoc did little more than the 
> > cat command and format the output as an EPUB file. 
> > 
> > I have tons of script/regex-generated of both HTML and LaTeX code in 
> this 
> > source so it has to be pandoc.markdown input. 
> > 
> > The odd thing is that I have been doing this for ages (even Vol. I of 
> this 
> > same book which is similar) and never had  anything that took ages to 
> > compile.   
> > 
> > Otherwise with nightly and  without the "-f commonmark" flag the 
> situation 
> > is unchanged. 
> > 
> > Is there any way I could take a storage dump... backtrace... or 
> something 
> > when I kill the hung job? 
> > 
> > Would some kind of filter that takes some kind of snapshot of the 
> internal 
> > state of the process help? 
> > 
> > Thanks, 
> > 
> > CJ 
> > 
> > P.S. I apologize for the messy reports I have sent in lately but I'm 
> having 
> > major problems with this particular google group. I had to switch to 
> google 
> > chrome (a mess on linux. I normally use firefox) in order to be able to 
> > post. And the posts I tried to send from my mail client never made it to 
> > the group. I think I mentioned that this is not caused by my local setup 
> > since I used someone else's account/machine and it still didn't go 
> through. 
> > Any chance someone might look into this at some point? 
> > 
> > On Tuesday, October 27, 2020 at 8:29:03 PM UTC-4 John MacFarlane wrote: 
> > 
> >> "cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org" <cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes: 
> >> 
> >> > With the nightly version (2.11.0.4) 
> >> > 
> >> > /tmp/pandoc -o epub/test.epub md/title.txt md2/ch*.md 
> >> > --css=css/stylesheet.css --epub-embed-font=fonts/* 
> >> > --epub-cover-image=images/cover.png 
> >> > 
> >> > the conversion took seconds. 
> >> > 
> >> > But pandoc complains that, 
> >> > 
> >> > [WARNING] This document format requires a nonempty <title> element. 
> >> > Defaulting to 'title' as the title. 
> >> > To specify a title, use 'title' in metadata or --metadata 
> title="...". 
> >> > 
> >> > And the epubcheck report the following errors probably related to the 
> >> above 
> >> > warning: 
> >> > 
> >> > ERROR(RSC-005): epub/test.epub/EPUB/content.opf(9,14): Error while 
> >> parsing 
> >> > file: element "metadata" incomplete; missing required element 
> "dc:title" 
> >> > ERROR(RSC-005): epub/test.epub/EPUB/nav.xhtml(11,134): Error while 
> >> parsing 
> >> > file: Anchors within nav elements must contain text 
> >> > 
> >> > Check finished with errors 
> >> > Messages: 0 fatal / 2 errors / 0 warnings / 0 info 
> >> > 
> >> > epubcheck completed 
> >> > 
> >> > The title.txt file contains: 
> >> > 
> >> > % URBAIN DUBOIS 
> >> > % La cuisine classique — Volume II 
> >> 
> >> Weird. This SHOULD work. Are you seeing anything 
> >> of this in the resulting epub? (I.e. did it get parsed, 
> >> but not as metadata? If so, maybe you need a blank line 
> >> at the end of title.txt.) (Also, I assume your input 
> >> format is pandoc markdown? commonmark_x doesn't include 
> >> an extension for this kind of title.) 
> >> 
> >> > When I take a look at the output everything looks good except that the raw
> >> > latex bits are now included verbatim as if they were part of the text/data.
> >> 
> >> They shouldn't be -- again, is pandoc markdown your input format? 
> >> Maybe a sample of how these occur in the markdown file? 
> >> 
> >> 

* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]                 ` <824220b2-6c2e-4c60-a935-e908f573a3d7n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2020-10-29  0:04                   ` John MacFarlane
@ 2020-10-30 10:21                   ` BPJ
  2020-10-30 16:49                   ` John MacFarlane
  2 siblings, 0 replies; 13+ messages in thread
From: BPJ @ 2020-10-30 10:21 UTC (permalink / raw)
  To: pandoc-discuss

On Wed, 28 Oct 2020 at 19:11, cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org <cjns1989-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> P.S. I apologize for the messy reports I have sent in lately but I'm having
> major problems with this particular google group. I had to switch to google
> chrome (a mess on linux. I normally use firefox)
>

How so? The switching as such or using Chrome? I have all of FF, Chrome and
Chromium installed on my system and just use whichever I want; usually
Chrome.

> in order to be able to post. And the posts I tried to send from my mail
> client never made it to the group. I think I mentioned that this is not
> caused by my local setup since I used someone else's account/machine and it
> still didn't go through. Any chance someone might look into this at some
> point?
>

Have you tried reading Google Groups in your email client? I have done so
for years without a hitch — so long ago that I unfortunately don't know
anymore what setting you have to (de)activate. I *think* that GG messages
automatically go to the main email address associated with your Google
account unless you deactivate it.

/bpj


* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]                 ` <824220b2-6c2e-4c60-a935-e908f573a3d7n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2020-10-29  0:04                   ` John MacFarlane
  2020-10-30 10:21                   ` BPJ
@ 2020-10-30 16:49                   ` John MacFarlane
       [not found]                     ` <m2zh43psk7.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
  2 siblings, 1 reply; 13+ messages in thread
From: John MacFarlane @ 2020-10-30 16:49 UTC (permalink / raw)
  To: cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, pandoc-discuss


> P.S. I apologize for the messy reports I have sent in lately but I'm having 
> major problems with this particular google group. I had to switch to google 
> chrome (a mess on linux. I normally use firefox) in order to be able to 
> post. And the posts I tried to send from my mail client never made it to 
> the group. I think I mentioned that this is not caused by my local setup 
> since I used someone else's account/machine and it still didn't go through. 
> Any chance someone might look into this at some point?

Google's spam filter is sometimes over-aggressive.
I've just gone in and approved some pending messages, so
maybe that fixes the problem!


[parent not found: <m2zh43psk7.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>]

* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]                     ` <m2zh43psk7.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
@ 2020-10-30 22:03                       ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
       [not found]                         ` <20201030220312.GD5998-611mE6nXTcHDOqzlkpFKJg@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org @ 2020-10-30 22:03 UTC (permalink / raw)
  To: pandoc-discuss

On Fri, Oct 30, 2020 at 12:49:44PM EDT, John MacFarlane wrote:
> 
> > P.S. I apologize for the messy reports I have sent in lately but I'm having 
> > major problems with this particular google group. I had to switch to google 
> > chrome (a mess on linux. I normally use firefox) in order to be able to 
> > post. And the posts I tried to send from my mail client never made it to 
> > the group. I think I mentioned that this is not caused by my local setup 
> > since I used someone else's account/machine and it still didn't go through. 
> > Any chance someone might look into this at some point?
> 
> Google's spam filter is sometimes over-aggressive.
> I've just gone in and approved some pending messages, so
> maybe that fixes the problem!

Much appreciated. 

Thanks,

CJ


[parent not found: <20201030220312.GD5998-611mE6nXTcHDOqzlkpFKJg@public.gmane.org>]

* Re: pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop
       [not found]                         ` <20201030220312.GD5998-611mE6nXTcHDOqzlkpFKJg@public.gmane.org>
@ 2020-10-30 22:58                           ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
  0 siblings, 0 replies; 13+ messages in thread
From: cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org @ 2020-10-30 22:58 UTC (permalink / raw)
  To: pandoc-discuss

On Fri, Oct 30, 2020 at 06:03:12PM EDT, cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
> On Fri, Oct 30, 2020 at 12:49:44PM EDT, John MacFarlane wrote:
> > 
> > Google's spam filter is sometimes over-aggressive.
> > I've just gone in and approved some pending messages, so
> > maybe that fixes the problem!
> 
> Much appreciated. 
> 
> Thanks,

Seems to have done the trick. 

The above reply to your message got through and just came back to my
mail reader.

Thanks,

CJ


end of thread, other threads:[~2020-10-30 22:58 UTC | newest]

Thread overview: 13+ messages
2020-10-26 19:22 pandoc.markdown to epub conversion took just under 4 hours on an average linux laptop Chris Jones
     [not found] ` <af5fe26b-4d84-4dcb-bdcd-6382469c476ao-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2020-10-26 21:15   ` John MacFarlane
     [not found]     ` <m2a6w8ofib.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
2020-10-27 20:34       ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
     [not found]         ` <e9e43a84-9ec5-4732-8dec-e6caac2e59ffn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2020-10-27 21:05           ` John MacFarlane
2020-10-27 21:50       ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
     [not found]         ` <22d3d478-357d-464c-b407-aefd2ed81dccn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2020-10-28  0:28           ` John MacFarlane
     [not found]             ` <m2y2jrurb4.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
2020-10-28 18:10               ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
     [not found]                 ` <824220b2-6c2e-4c60-a935-e908f573a3d7n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2020-10-29  0:04                   ` John MacFarlane
     [not found]                     ` <m28sbpucc4.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
2020-10-29 23:35                       ` Chris Jones
2020-10-30 10:21                   ` BPJ
2020-10-30 16:49                   ` John MacFarlane
     [not found]                     ` <m2zh43psk7.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
2020-10-30 22:03                       ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
     [not found]                         ` <20201030220312.GD5998-611mE6nXTcHDOqzlkpFKJg@public.gmane.org>
2020-10-30 22:58                           ` cjns...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
