public inbox archive for pandoc-discuss@googlegroups.com
* Excessive memory usage with --normalize
@ 2014-06-29 20:17 Daniel Staal
  2014-06-30  3:11 ` John MacFarlane
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel Staal @ 2014-06-29 20:17 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


With all the talk about Project Gutenberg, I thought I'd try converting a 
few of their docs to Markdown, and I stumbled upon a memory issue...

I was starting with their top download: Pride and Prejudice.  I took their 
HTML version[^1], and used pandoc to get a start, but when I set it to 
running my computer slowed way down and it was taking ages.  Opening 
activity monitor I noticed that pandoc was using over 5GB of RAM, for a 
file that's ~800k.  Playing with the options I'd picked I found that the 
problem option was `--normalize`.  (I was trying to get it to clean up the 
text a bit.)

Is this known?  It doesn't seem like expected behavior - I never let pandoc 
run to completion with that option, and I have the feeling that it would 
have taken more RAM if it was able.  (As it was, it took over half the RAM 
on the box, which had other things running as well.)

Daniel T. Staal

[^1]: <http://www.gutenberg.org/files/1342/1342-h/1342-h.htm>

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------



* Re: Excessive memory usage with --normalize
  2014-06-29 20:17 Excessive memory usage with --normalize Daniel Staal
@ 2014-06-30  3:11 ` John MacFarlane
       [not found]   ` <20140630031155.GB16744-bi+AKbBUZKbivNSvqvJHCtPlBySK3R6THiGdP5j34PU@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: John MacFarlane @ 2014-06-30  3:11 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

--normalize uses syb generics, which are slow.  There may
be other problems with it.  It actually shouldn't be too much
work to rewrite it to be efficient -- I'll try to do that.

I tried pandoc without --normalize and it converted Pride
and Prejudice well.  I think you just need to preprocess
the HTML and remove a few crufty bits, like the div at
the end of each chapter with four <br /> tags.
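
A minimal sketch of that preprocessing step, in Python. It assumes the filler block is a `<div>` whose only content is four `<br />` tags, possibly separated by whitespace; the actual Gutenberg markup may differ, and `strip_filler_divs` and the sample string are hypothetical names and data made up for illustration.

```python
import re

# Assumption: the filler block is a <div> containing nothing but four <br /> tags.
# The real markup in the Gutenberg file may differ; adjust the pattern to match it.
FILLER_DIV = re.compile(
    r"<div\b[^>]*>\s*(?:<br\s*/?>\s*){4}</div>\s*",
    flags=re.IGNORECASE,
)


def strip_filler_divs(html: str) -> str:
    """Remove <div> elements that contain only four <br /> tags."""
    return FILLER_DIV.sub("", html)


# Made-up sample input, not copied from the Gutenberg file.
sample = (
    "<p>End of chapter one.</p>\n"
    '<div style="height: 4em;">\n<br /><br /><br /><br />\n</div>\n'
    "<h2>Chapter 2</h2>\n"
)
cleaned = strip_filler_divs(sample)
```

A regular expression is enough here only because the target block is assumed to be small and regular; for anything less predictable an HTML parser would be the safer choice.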


* Re: Excessive memory usage with --normalize
       [not found]   ` <20140630031155.GB16744-bi+AKbBUZKbivNSvqvJHCtPlBySK3R6THiGdP5j34PU@public.gmane.org>
@ 2014-06-30  3:52     ` Daniel Staal
  2014-06-30  6:06       ` John MacFarlane
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel Staal @ 2014-06-30  3:52 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

--As of June 29, 2014 8:11:55 PM -0700, John MacFarlane is alleged to have 
said:

> I tried pandoc without --normalize and it converted Pride
> and Prejudice well.  I think you just need to preprocess
> the HTML and remove a few crufty bits, like the div at
> the end of each chapter with four <br /> tags.

--As for the rest, it is mine.

Yeah, I was mostly trying different options to see what worked best.  I 
just felt I should report this because it was such an outlier.

Daniel T. Staal


* Re: Excessive memory usage with --normalize
  2014-06-30  3:52     ` Daniel Staal
@ 2014-06-30  6:06       ` John MacFarlane
  0 siblings, 0 replies; 6+ messages in thread
From: John MacFarlane @ 2014-06-30  6:06 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I've rewritten normalize.  Just tried it on the Jane Austen
page, and the results are encouraging.  The time penalty of
using `--normalize` was only 0.01 seconds.


* Re: Excessive memory usage with --normalize
       [not found]     ` <CADAJKhAWnUCVoUtWzn+XU3PJbRNBCNvaD6ivsTLCVnKa_Fzt2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-06-30  0:38       ` Daniel Staal
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Staal @ 2014-06-30  0:38 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

--As of June 30, 2014 1:10:01 AM +0200, BPJ is alleged to have said:

> Was the HTML excessively dirty? Out of curiosity: what was pandoc's
> memory usage on so big a file without --normalize?

The HTML isn't very dirty at all, actually.  It's hard to catch the memory 
use without --normalize: Pandoc finishes quite quickly then.  ;)  Looks 
like about 400MB, so --normalize is using more than 10 times the RAM.  (And 
my box was paging out as fast as it could with --normalize; pandoc 
obviously wanted *more*.)
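
For a run that finishes too quickly to watch in a process monitor, the peak can be read after the fact. The following is a rough sketch for Unix-like systems using Python's standard `resource` module; `peak_child_rss` is a hypothetical helper name, and the demo command is a placeholder rather than the pandoc invocation used above.

```python
import resource
import subprocess
import sys


def peak_child_rss(cmd):
    """Run cmd to completion and return the peak resident set size of child processes.

    The value comes from getrusage(RUSAGE_CHILDREN).ru_maxrss, which is reported
    in kilobytes on Linux and in bytes on macOS. It is the maximum over all
    children this process has waited for, so each measurement should be taken
    in a fresh interpreter if runs are to be compared.
    """
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Placeholder command so the sketch runs anywhere; substitute the real one.
peak = peak_child_rss([sys.executable, "-c", "pass"])
print("peak RSS of child (platform-dependent units):", peak)
```

To compare the two cases, the placeholder could be replaced with something like `["pandoc", "-f", "html", "-t", "markdown", "1342-h.htm"]`, once with and once without `--normalize`; the exact command line used in this thread was not given, so that invocation is an assumption.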

> BTW I also thought about starting from PG HTML rather than from their
> ambiguous text format(s).

I figured it was a better format to start with. ;) It works fairly well, 
actually.  Their table of contents links needed to be re-made, and the 
top/bottom boilerplate needs some formatting, but the main text appeared to 
come through fairly well.

I'm playing around with it a bit; I kinda want a couple of fully developed 
'type specimens' before I try sending them to PG.  Which may or may not 
happen - one of these days soon I'm going to actually have to do something 
that gets me money...

Daniel T. Staal


* Re: Excessive memory usage with --normalize
       [not found] ` <AEC195F0C4F4824302B6F2BD-Q0ErXNX1RuZz+/J76PBWHg@public.gmane.org>
@ 2014-06-29 23:10   ` BPJ
       [not found]     ` <CADAJKhAWnUCVoUtWzn+XU3PJbRNBCNvaD6ivsTLCVnKa_Fzt2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: BPJ @ 2014-06-29 23:10 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Was the HTML excessively dirty? Out of curiosity: what was pandoc's memory
usage on so big a file without --normalize?

BTW I also thought about starting from PG HTML rather than from their
ambiguous text format(s).

/bpj


