From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.emacs.gnus.general/65326
Path: news.gmane.org!not-for-mail
From: Daniel Pittman <daniel@rimspace.net>
Newsgroups: gmane.emacs.gnus.general
Subject: Re: Huge memory consumption on accessing large newsgroup
Date: Tue, 02 Oct 2007 13:23:56 +1000
Organization: Cybersource: Australia's Leading Linux and Open Source Solutions Company
Message-ID: <874phatd0z.fsf@enki.rimspace.net>
References: <87wsw4u21m.fsf@gmx.de> <b4m3ayriwp1.fsf@jpl.org>
	<87sl6rh886.fsf@gmx.de> <87hcldp4j1.fsf@srcf.ucam.org>
	<m21wcf9527.fsf@lifelogs.com> <87odfjprux.fsf@enki.rimspace.net>
	<m2zlz25kn1.fsf@lifelogs.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1191295506 3946 80.91.229.12 (2 Oct 2007 03:25:06 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Tue, 2 Oct 2007 03:25:06 +0000 (UTC)
To: ding@gnus.org
Original-X-From: ding-owner+M13838@lists.math.uh.edu Tue Oct 02 05:25:02 2007
Return-path: <ding-owner+M13838@lists.math.uh.edu>
Envelope-to: ding-account@gmane.org
Original-Received: from util0.math.uh.edu ([129.7.128.18])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1IcYNZ-0007UF-UJ
	for ding-account@gmane.org; Tue, 02 Oct 2007 05:25:02 +0200
Original-Received: from localhost ([127.0.0.1] helo=lists.math.uh.edu)
	by util0.math.uh.edu with smtp (Exim 4.63)
	(envelope-from <ding-owner+M13838@lists.math.uh.edu>)
	id 1IcYMu-0007bX-4q; Mon, 01 Oct 2007 22:24:20 -0500
Original-Received: from mx2.math.uh.edu ([129.7.128.33])
	by util0.math.uh.edu with esmtps (TLSv1:AES256-SHA:256)
	(Exim 4.63)
	(envelope-from <daniel@rimspace.net>)
	id 1IcYMs-0007bI-LC
	for ding@lists.math.uh.edu; Mon, 01 Oct 2007 22:24:18 -0500
Original-Received: from quimby.gnus.org ([80.91.231.51])
	by mx2.math.uh.edu with esmtp (Exim 4.67)
	(envelope-from <daniel@rimspace.net>)
	id 1IcYMj-0001Ox-H0
	for ding@lists.math.uh.edu; Mon, 01 Oct 2007 22:24:18 -0500
Original-Received: from 203-217-31-68.perm.iinet.net.au ([203.217.31.68] helo=anu.rimspace.net)
	by quimby.gnus.org with esmtp (Exim 3.35 #1 (Debian))
	id 1IcYMZ-0005X2-00
	for <ding@gnus.org>; Tue, 02 Oct 2007 05:23:59 +0200
Original-Received: by anu.rimspace.net (Postfix, from userid 10)
	id F408B192002F; Tue,  2 Oct 2007 13:24:02 +1000 (EST)
Original-Received: by enki.rimspace.net (Postfix, from userid 1000)
	id 880F62F0568; Tue,  2 Oct 2007 13:23:56 +1000 (EST)
User-Agent: Gnus/5.110006 (No Gnus v0.6) Emacs/23.0.0 (gnu/linux)
X-Spam-Score: -2.5 (--)
List-ID: <ding.gnus.org>
Precedence: bulk
Xref: news.gmane.org gmane.emacs.gnus.general:65326
Archived-At: <http://permalink.gmane.org/gmane.emacs.gnus.general/65326>

Ted Zlatanov <tzz@lifelogs.com> writes:
> On Mon, 01 Oct 2007 11:04:54 +1000 Daniel Pittman <daniel@rimspace.net> wrote: 
> DP> Ted Zlatanov <tzz@lifelogs.com> writes:
>
>>> Consider using inversion lists.  
>
>>> They don't require pairs of integers; each value represents a flip.
>>> They have other nice properties too, though it's been a while since
>>> I looked at them so I can't name them.
>
> DP> Mmm.  The existing Gnus range code is pretty much as efficient,
> DP> has lower computational complexity for the operations Gnus uses
> DP> and is already existent and tested.
>
> We're talking about reducing memory consumption, not CPU usage.  For
> CPU usage the current structure is pretty good.  For memory usage, as
> you can see from the subject of the message, it's not ideal.

Ah.  I mis-expressed myself.

The issue with large memory consumption, as far as I could see from the
thread, was that the compressed range data type was expanded to a flat
list.  This caused, no surprise, huge memory use.


The original problem is that the code calls `gnus-uncompress-range' on
the data *at all* -- and, so, turns a nicely brief data structure into a
vast bloated million-number list.

The *solution* is to rebuild the algorithm to operate on the compressed
version (regardless of the internal representation), not to change the
representation.


In other words: the problem is that something calls
`gnus-uncompress-range', not the format of the data passed to that call.


The "efficient" part referred to both memory and CPU, and the
lower computational complexity is a consideration only after the memory
load is addressed -- but we wouldn't want to turn huge memory use into
hours of CPU time to enter the group. ;)

> We could use several data structures as needed to accomodate large
> data lists.  

Regarding the original problem: we already /have/ a representation of
large data lists and, in fact, the problem occurs related to them:

    (gnus-uncompress-range '(3437 . 30538699))

That second bit is the compressed range data.  The function, which
causes the memory bloat, turns that into a list that contains every
integer between those two values...


Using inversion lists rather than the Gnus range data here would have no
effect other than to (a) convert the Gnus range data into an inversion
list and (b) uncompress the inversion list with the same memory use.

We need to remove the code that deals with uncompressed values rather
than to change the representation of the (already memory efficient)
compressed data format. :)

> Inversion lists are nice in some cases and not others; I think they
> are worth at least some consideration since they don't require a pair
> of numbers to express a range and thus may save some memory.

They do, however, use a pair of numbers to express a single stand-alone
number in the sequence.

> I don't know the Emacs internals intimately, but I think (24 86 88 90)
> takes up less memory than ((24 . 86) (88 . 90)).  Am I wrong?

No, you are correct.  However, if the sequence was:

  24, 26, 28, 30-300 the output would be:

  (24 25 26 27 28 29 30 300)
  (24 26 26 (30 . 300))

Inversion lists are more efficient for long runs of range numbers by one
additional cons cell, but less efficient by one cons cell for each
stand-alone number that is not part of a range.

> Note Gaute's original message was about compressing pairs of integers,
> not general Gnus data structures, so please consider my message in
> that context.

That seems reasonable, and in general I agree: inversion lists are a
nice data structure, a (presumably) good implementation (with tests)
already exists and the cost of adding this new data type isn't /that/
high -- on those terms.

On the other hand the Gnus range code already exists, is well tested (if
not with stand-alone tests) and is the data structure used elsewhere in
the Gnus code.


The Gnus range code is also more efficient if stand alone outlier values
are more common than ranges, while inversion lists are more efficient if
they are less common.


The end problem, though, is that the compressed representation is
expanded to a flat list by the code.  That makes the exact compressed
data structure less important than finding /any/ solution that doesn't
expand it.


Oh, and using inversion lists here would require converting the existing
compressed data format into an inversion list and then operating on
that.  That is ... not really helpful. :)

> DP> I don't think that you would see sufficient benefit from
> DP> introducing the additional data type to justify spending your time
> DP> on it -- but since I am not volunteering to do the work I can't
> DP> tell you how to do it. ;)
>
> I am not sure if you are addressing me or Gaute.  I made a suggestion,
> and an implementation is already in place (I mentioned Kim Storm
> implemented it).  I'm willing to assist Gaute, but he has to be
> interested in my suggestion.  I don't have the time for this as a solo
> project.

I was addressing both of you -- I can express an opinion, and while I
feel qualified to comment on this since I have both worked on this
specific problem before[1] and addressed this conversation before[2].

Since I am not actually writing the code, though, if Gaute or yourself
feel like writing some Lisp that uses inversion lists I can't actually
stop you. ;)


I would strongly advise, however, that you are much better off ignoring
inversion lists here and focusing on converting the existing code to use
the compressed `gnus-range' data structure rather than expanding to a
flat list.

Also of note: the proposed "solution" of narrowing the range is not
going to help much; fixing the code is the right answer.

Regards,
        Daniel

(And because this has been a stupidly annoying couple of week in other
 areas, and because this is nice simple and essentially stress-free work
 I am getting tempted to fix it myself.

 So, maybe inversion lists were the way to get the code fixed after all,
 if not quite so directly as expected. ;) 

Footnotes: 
[1]  The nnimap code did, as far as I can tell, exactly what the code
     causing trouble here dose.

[2]  Inversion lists were suggested at the time, as I recall, though I
     don't have the enthusiasm to dig in the archive and find out who
     suggested them.

-- 
Daniel Pittman <daniel@cybersource.com.au>           Phone: 03 9621 2377
Level 4, 10 Queen St, Melbourne             Web: http://www.cyber.com.au
Cybersource: Australia's Leading Linux and Open Source Solutions Company