Performance hit between 1.19 and 2.2 markdown->latex

public inbox archive for pandoc-discuss@googlegroups.com

* Performance hit between 1.19 and 2.2 markdown->latex
From: JeffP @ 2018-08-01 20:06 UTC
  To: pandoc-discuss


Hello,

I'm using pandoc to convert a somewhat large file (3 MB) which contains a
lot of tables, hyperlinks and cross-references.  With version 1.19 this
file converts in 9 seconds, but the final PDF output has some seemingly
random table formatting issues (the text in a table cell runs off the
edge of the page).  Upgrading to 2.2 fixes the formatting issue,
but the conversion time for this file is now over 20 minutes.

The file is just the same pattern over and over again with different text.
Processing time grows much faster than linearly with the number of
instances: 16 s at 10k lines, 2.5 min at 25k lines, 16.76 min at 40k lines.
The number of cross-references looks like the cause: when I create a large
dummy file, it still takes longer than with 1.19, but not as drastically;
however, I get a bunch of duplicate identifier warnings.
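
As a rough check on how fast the time grows, the three timings quoted above can be turned into log-log slopes. This is only a sketch based on three data points: a slope near 1 would indicate linear growth, a constant slope a fixed power law, and a rising slope something steeper (or some other effect, such as memory pressure). The function name `local_exponents` is made up for illustration.

```python
import math

# (line count, seconds) pairs taken from the timings quoted above.
timings = [(10_000, 16.0), (25_000, 2.5 * 60), (40_000, 16.76 * 60)]

def local_exponents(points):
    """Return the log-log slope between each pair of consecutive points."""
    slopes = []
    for (n1, t1), (n2, t2) in zip(points, points[1:]):
        slopes.append(math.log(t2 / t1) / math.log(n2 / n1))
    return slopes

exponents = local_exponents(timings)
for (n1, _), (n2, _), k in zip(timings, timings[1:], exponents):
    # k is the exponent in "time ~ lines^k" over this interval.
    print(f"{n1} -> {n2} lines: time ~ n^{k:.2f}")
```

Both slopes come out well above 1, and the second is larger than the first, so the slowdown is clearly worse than linear.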

Any suggestions on how to chase this down, and/or workarounds?

Jeff

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/f1326854-a34d-43bc-8f3c-046341a1366b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: Performance hit between 1.19 and 2.2 markdown->latex
From: John MacFarlane @ 2018-08-01 21:38 UTC
  To: JeffP, pandoc-discuss


If you're able to link to the file or send me the file
privately, I can have a look.

One way to diagnose this kind of thing is to use the
--trace option, which will give you some idea where
the parser is going.
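
A minimal sketch of capturing that trace from a script, assuming pandoc is on the PATH. The file names and the helper names `trace_command` and `run_with_trace` are placeholders, not anything from this thread. Pandoc writes the trace output to stderr, so the sketch redirects stderr to a log file for later inspection.

```python
import subprocess

def trace_command(source, dest):
    """Build a pandoc command line (markdown -> latex) with --trace enabled."""
    return ["pandoc", "--trace", "-f", "markdown", "-t", "latex",
            source, "-o", dest]

def run_with_trace(source, dest, log_path):
    """Run pandoc and save its stderr (where the trace goes) to log_path."""
    with open(log_path, "w", encoding="utf-8") as log:
        result = subprocess.run(trace_command(source, dest),
                                stderr=log, check=False)
    return result.returncode
```

For example, `run_with_trace("input.md", "output.tex", "trace.log")` would leave the parser trace in `trace.log`; the places where the log stalls are where to look.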



* Re: Performance hit between 1.19 and 2.2 markdown->latex
From: John MacFarlane @ 2018-08-01 23:37 UTC
  To: JeffP, pandoc-discuss


You sent me a test file which consists of a large
number of repetitions of this pattern:

```
# Header {#ref0}
\nopandoc{ \hypertarget{ref0}{} }

+------------------+-----------------+
| A                | Z               |
+==================+=================+
| B                | Y               |
+------------------+-----------------+
| C                | X               |
+------------------+-----------------+
| D                | W               |
+------------------+-----------------+
| E                | V               |
+------------------+-----------------+
| F                | U               |
+------------------+-----------------+
| G                | T               |
+------------------+-----------------+
| H                | S               |
+------------------+-----------------+
```

I discovered that if you remove the raw TeX, the file
again converts in about 9 seconds.  So I'll bet
that the difference is due to changes in the way
pandoc handles raw TeX; there has been a big
change in this between 1.19 and 2.2.

I've benchmarked table parsing and there doesn't seem
to be a big difference there.

Note:  pandoc will add hypertargets for you
automatically.  Are you sure you even need these?
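
If the raw hypertargets turn out to be unnecessary, one possible workaround is to strip them before running pandoc. The following is a sketch only: it assumes each target sits on its own line in exactly the form shown in the pattern above, and the names `RAW_TARGET` and `strip_raw_hypertargets` are made up for illustration. Check that your cross-references still resolve afterwards.

```python
import re

# Matches a line consisting solely of a \nopandoc{ \hypertarget{...}{} } wrapper.
RAW_TARGET = re.compile(
    r"^\s*\\nopandoc\{\s*\\hypertarget\{[^{}]*\}\{\}\s*\}\s*$"
)

def strip_raw_hypertargets(markdown):
    """Drop every line that contains only the raw hypertarget wrapper."""
    kept = [line for line in markdown.splitlines()
            if not RAW_TARGET.match(line)]
    return "\n".join(kept) + "\n"
```

Lines that merely mention `\hypertarget` inside other text are left alone, since the expression is anchored to the whole line.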


* Re: Performance hit between 1.19 and 2.2 markdown->latex
From: John MacFarlane @ 2018-08-01 23:43 UTC
  To: JeffP, pandoc-discuss


Can you open an issue on the GitHub tracker about
the performance regression involving raw LaTeX?
To reproduce it, it's sufficient to generate
a file with several hundred copies of the pattern
below.

John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> writes:

> You sent me a test file which consists of a large
> number of repetitions of this pattern:
>
> ```
> # Header {#ref0}
> \nopandoc{ \hypertarget{ref0}{} }
>
> +------------------+-----------------+
> | A                | Z               |
> +==================+=================+
> | B                | Y               |
> +------------------+-----------------+
> | C                | X               |
> +------------------+-----------------+
> | D                | W               |
> +------------------+-----------------+
> | E                | V               |
> +------------------+-----------------+
> | F                | U               |
> +------------------+-----------------+
> | G                | T               |
> +------------------+-----------------+
> | H                | S               |
> +------------------+-----------------+
> ```
