* Re: Excessive memory usage with --normalize
From: BPJ @ 2014-06-29 23:10 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
Was the HTML excessively dirty? Out of curiosity: what was pandoc's memory
usage on so big a file without --normalize?
BTW I also thought about starting from PG HTML rather than from their
ambiguous text format(s).
/bpj
On Sunday, 29 June 2014, Daniel Staal <DStaal-Jdbf3xiKgS8@public.gmane.org> wrote:
>
> With all the talk about Project Gutenberg, I thought I'd try converting a
> few of their docs to Markdown, and I stumbled upon a memory issue...
>
> I was starting with their top download: Pride and Prejudice. I took their
> HTML version[^1], and used pandoc to get a start, but when I set it to
> running my computer slowed way down and it was taking ages. Opening
> activity monitor I noticed that pandoc was using over 5GB of RAM, for a
> file that's ~800k. Playing with the options I'd picked I found that the
> problem option was `--normalize`. (I was trying to get it to clean up the
> text a bit.)
>
> Is this known? It doesn't seem like expected behavior - I never let pandoc
> run to completion with that option, and I have the feeling that it would
> have taken more RAM if it was able. (As it was, it took over half the RAM
> on the box, which had other things running as well.)
>
> Daniel T. Staal
>
> [^1]: <http://www.gutenberg.org/files/1342/1342-h/1342-h.htm>
>
> ---------------------------------------------------------------
> This email copyright the author. Unless otherwise noted, you
> are expressly allowed to retransmit, quote, or otherwise use
> the contents for non-commercial purposes. This copyright will
> expire 5 years after the author's death, or in 30 years,
> whichever is longer, unless such a period is in excess of
> local copyright law.
> ---------------------------------------------------------------
>
> --
> You received this message because you are subscribed to the Google Groups
> "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/pandoc-discuss/AEC195F0C4F4824302B6F2BD%40%5B192.168.1.50%5D.
> For more options, visit https://groups.google.com/d/optout.
>
* Re: Excessive memory usage with --normalize
From: Daniel Staal @ 2014-06-30 0:38 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
--As of June 30, 2014 1:10:01 AM +0200, BPJ is alleged to have said:
> Was the HTML excessively dirty? Out of curiosity: what was pandoc's
> memory usage on so big a file without --normalize?
The HTML isn't very dirty at all, actually. It's hard to catch the memory
use without --normalize: Pandoc finishes quite quickly then. ;) Looks
like about 400MB, so --normalize is using more than 10 times the RAM. (And
my box was paging out as fast as it could with --normalize; pandoc
obviously wanted *more*.)
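For anyone wanting to repeat the comparison with numbers rather than a glance at Activity Monitor, here is a minimal sketch. It assumes a Unix-like system (Linux or OS X), Python 3.9 or later, and a pandoc 1.x binary on the PATH that still accepts `--normalize`. The helper names `peak_rss` and `compare_normalize` are made up for illustration.

```python
import os
import subprocess


def peak_rss(cmd):
    """Run cmd to completion and return the peak resident set size of that
    one child process (kilobytes on Linux, bytes on OS X)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
    # os.wait4 reaps the child and returns its own resource usage.
    _, status, usage = os.wait4(proc.pid, 0)
    proc.returncode = os.waitstatus_to_exitcode(status)
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return usage.ru_maxrss


def compare_normalize(html_path):
    """Return (peak without --normalize, peak with --normalize).
    Assumes pandoc 1.x; the second run may take a long time and a lot of RAM."""
    base = ["pandoc", "-f", "html", "-t", "markdown", html_path]
    return peak_rss(base), peak_rss(base + ["--normalize"])


# Example (not run here): compare_normalize("1342-h.htm")
```

Because `os.wait4` reports usage for the single process it reaps, the two runs do not contaminate each other's peak figures.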
> BTW I also thought about starting from PG HTML rather than from their
> ambiguous text format(s).
I figured it was a better format to start with. ;) It works fairly well,
actually. Their table of contents links needed to be re-made, and the
top/bottom boilerplate needs some formatting, but the main text appeared to
come through fairly well.
I'm playing around with it a bit; I kinda want a couple of fully developed
'type specimens' before I try sending them to PG. Which may or may not
happen - one of these days soon I'm going to actually have to do something
that gets me money...
Daniel T. Staal
* Re: Excessive memory usage with --normalize
From: John MacFarlane @ 2014-06-30 6:06 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
I've rewritten normalize. Just tried it on the Jane Austen
page, and the results are encouraging. The time penalty of
using `--normalize` was only 0.01 seconds.
+++ Daniel Staal [Jun 29 14 23:52 ]:
>--As of June 29, 2014 8:11:55 PM -0700, John MacFarlane is alleged to
>have said:
>
>>I tried pandoc without --normalize and it converted Pride
>>and Prejudice well. I think you just need to preprocess
>>the HTML and remove a few crufty bits, like the div at
>>the end of each chapter with four <br /> tags.
>
>--As for the rest, it is mine.
>
>Yeah, I was mostly trying different options to see what worked best.
>I just felt I should report this because it was such an outlier.
>
>Daniel T. Staal
* Re: Excessive memory usage with --normalize
From: Daniel Staal @ 2014-06-30 3:52 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
--As of June 29, 2014 8:11:55 PM -0700, John MacFarlane is alleged to have
said:
> I tried pandoc without --normalize and it converted Pride
> and Prejudice well. I think you just need to preprocess
> the HTML and remove a few crufty bits, like the div at
> the end of each chapter with four <br /> tags.
--As for the rest, it is mine.
Yeah, I was mostly trying different options to see what worked best. I
just felt I should report this because it was such an outlier.
Daniel T. Staal
* Re: Excessive memory usage with --normalize
From: John MacFarlane @ 2014-06-30 3:11 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
--normalize uses syb generics, which are slow. There may
be other problems with it. It actually shouldn't be too much
work to rewrite it to be efficient -- I'll try to do that.
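To give a feel for why a direct rewrite can be cheap: the pandoc 1.x manual describes `--normalize` as merging adjacent Str or Emph elements and removing repeated Spaces. The toy sketch below does the Str and Space part in a single linear pass over a flat list. It is not pandoc's code (which is Haskell); the tuple representation and the name `normalize_inlines` are assumptions made for illustration only.

```python
def normalize_inlines(inlines):
    """Merge adjacent ("Str", text) items and collapse runs of ("Space",)
    in one left-to-right pass. Other items are passed through unchanged."""
    out = []
    for item in inlines:
        if out and item[0] == "Str" and out[-1][0] == "Str":
            # Two strings in a row: join them into one.
            out[-1] = ("Str", out[-1][1] + item[1])
        elif out and item[0] == "Space" and out[-1][0] == "Space":
            # Repeated space: drop it.
            continue
        else:
            out.append(item)
    return out
```

Each element is visited once, so the cost grows linearly with the length of the list, with no generic traversal machinery involved.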
I tried pandoc without --normalize and it converted Pride
and Prejudice well. I think you just need to preprocess
the HTML and remove a few crufty bits, like the div at
the end of each chapter with four <br /> tags.
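A rough sketch of that kind of preprocessing, using only the Python standard library. The exact markup in the Gutenberg file is an assumption here: the pattern expects a div whose only content is four `<br>` tags, and the name `strip_break_divs` is made up for illustration. Check the pattern against the real file before relying on it.

```python
import re

# A <div ...> containing nothing but four <br> / <br/> / <br /> tags
# (whitespace allowed between them). Assumed shape, not verified
# against the actual Gutenberg HTML.
BREAK_DIV = re.compile(
    r"<div[^>]*>\s*(?:<br\s*/?>\s*){4}</div>",
    re.IGNORECASE,
)


def strip_break_divs(html):
    """Return html with every four-<br> spacer div removed."""
    return BREAK_DIV.sub("", html)
```

The cleaned text can then be written to a new file and passed to pandoc as usual. A regular expression is enough for a fixed spacer like this; anything less regular would be better handled with a real HTML parser.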
+++ Daniel Staal [Jun 29 14 16:17 ]:
>
>With all the talk about Project Gutenberg, I thought I'd try
>converting a few of their docs to Markdown, and I stumbled upon a
>memory issue...
>
>I was starting with their top download: Pride and Prejudice. I took
>their HTML version[^1], and used pandoc to get a start, but when I set
>it to running my computer slowed way down and it was taking ages.
>Opening activity monitor I noticed that pandoc was using over 5GB of
>RAM, for a file that's ~800k. Playing with the options I'd picked I
>found that the problem option was `--normalize`. (I was trying to get
>it to clean up the text a bit.)
>
>Is this known? It doesn't seem like expected behavior - I never let
>pandoc run to completion with that option, and I have the feeling that
>it would have taken more RAM if it was able. (As it was, it took over
>half the RAM on the box, which had other things running as well.)
>
>Daniel T. Staal
>
>[^1]: <http://www.gutenberg.org/files/1342/1342-h/1342-h.htm>
* Excessive memory usage with --normalize
From: Daniel Staal @ 2014-06-29 20:17 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
With all the talk about Project Gutenberg, I thought I'd try converting a
few of their docs to Markdown, and I stumbled upon a memory issue...
I was starting with their top download: Pride and Prejudice. I took their
HTML version[^1], and used pandoc to get a start, but when I set it to
running my computer slowed way down and it was taking ages. Opening
activity monitor I noticed that pandoc was using over 5GB of RAM, for a
file that's ~800k. Playing with the options I'd picked I found that the
problem option was `--normalize`. (I was trying to get it to clean up the
text a bit.)
Is this known? It doesn't seem like expected behavior - I never let pandoc
run to completion with that option, and I have the feeling that it would
have taken more RAM if it was able. (As it was, it took over half the RAM
on the box, which had other things running as well.)
Daniel T. Staal
[^1]: <http://www.gutenberg.org/files/1342/1342-h/1342-h.htm>