From: BPJ
Newsgroups: gmane.text.pandoc
Subject: Re: Excessive memory usage with --normalize
Date: Mon, 30 Jun 2014 01:10:01 +0200

Was the HTML excessively dirty? Out of curiosity: what was pandoc's memory usage on so big a file without --normalize?

BTW I also thought about starting from PG HTML rather than from their ambiguous text format(s).

/bpj


On Sunday, 29 June 2014, Daniel Staal <DStaal@usa.net> wrote:

> With all the talk about Project Gutenberg, I thought I'd try converting a few of their docs to Markdown, and I stumbled upon a memory issue...
>
> I was starting with their top download: Pride and Prejudice. I took their HTML version[^1] and used pandoc to get a start, but when I set it running, my computer slowed way down and the conversion was taking ages. Opening Activity Monitor, I noticed that pandoc was using over 5 GB of RAM for a file that's ~800k. Playing with the options I'd picked, I found that the problem option was `--normalize`. (I was trying to get it to clean up the text a bit.)
>
> Is this known? It doesn't seem like expected behavior - I never let pandoc run to completion with that option, and I have the feeling that it would have taken more RAM if it had been able to. (As it was, it took over half the RAM on the box, which had other things running as well.)
>
> Daniel T. Staal
>
> [^1]: <http://www.gutenberg.org/files/1342/1342-h/1342-h.htm>
>
> ---------------------------------------------------------------
> This email copyright the author. Unless otherwise noted, you
> are expressly allowed to retransmit, quote, or otherwise use
> the contents for non-commercial purposes. This copyright will
> expire 5 years after the author's death, or in 30 years,
> whichever is longer, unless such a period is in excess of
> local copyright law.
> ---------------------------------------------------------------
>
> --
> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
> To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
> To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/AEC195F0C4F4824302B6F2BD%40%5B192.168.1.50%5D.
> For more options, visit https://groups.google.com/d/optout.
