Date: Fri, 3 Apr 2020 17:31:10 -0400
From: Rich Felker
To: musl@lists.openwall.com
Message-ID: <20200403213110.GD11469@brightrain.aerifal.cx>
Subject: [musl] New malloc tuning for low usage

It's come to my attention that there are usage patterns where the new malloc (under development, out-of-tree at https://github.com/richfelker/mallocng-draft) uses significantly more memory. This happens when a program allocates just one object of a size between around 4k and 128k: up to nearly 3.5x the request size can end up being obtained, in the form of 1 or 2 additional slots (for the next allocation of the same size class) and slack space (sometimes called internal fragmentation) within the allocated slot due to coarse size-classing.

The second part of this, coarse classing, was introduced in commit 67bf74c2e8127c260c807b5694904324d3aeaf30 to mitigate a more extreme form of the larger first problem: due to divisibility properties, the minimum slot counts for size classes whose index is 0, 1, 2, or 3 mod 4 are 7, 3, 5, and 2, respectively. Refraining from using classes that are 0 or 2 mod 4 until there's significant usage at that size class both avoids allocating 7 or 5 slots of a large size when the program might only need a couple of such objects, and allows a single group to serve a wider range of sizes. However, it looks like even 2 or 3 slots can be undesirable -- for example, a program that nominally needs only 75k can end up taking 256k.

The mitigation I'm looking at now is a feature I've wanted to have for a while -- individually servicing allocations smaller than the hard mmap-threshold cutoff (128k) with their own mmaps until usage is sufficiently high to justify (by making the relative overhead low) allocation of whole multi-slot groups.

The data structures already represent large mmap-serviced allocations as a single-slot group, so that we can track and validate them at realloc/free (i.e. so that passing a pointer to something that "looks like" a mmap-serviced allocation can't be used to corrupt the allocator state, as is possible with musl's old/current malloc and most other implementations), and treat single-slot groups as a special case where the slot size is not the nominal size-class spacing, but instead the entire length from the header out to the end of the mapping.
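To make the single-slot special case concrete, here's a minimal sketch of the idea. The struct, field, and table names (group_hdr, maplen, class_to_size, slot_len) are made up for illustration and are not the actual mallocng data structures:

    #include <stddef.h>

    /* Hypothetical size-class table; the real table and spacing differ. */
    static const size_t class_to_size[] = { 16, 32, 48, 64, 80, 96, 112, 128 };

    struct group_hdr {
            unsigned sizeclass;  /* index into class_to_size */
            unsigned slot_count; /* 1 for an individually mmap-serviced allocation */
            size_t maplen;       /* length of the whole mapping, when mmap-serviced */
    };

    /* Usable slot length: normal groups use the nominal class spacing;
     * a single-slot group's slot runs from the end of the header out to
     * the end of the mapping. */
    static size_t slot_len(const struct group_hdr *g)
    {
            if (g->slot_count == 1)
                    return g->maplen - sizeof *g;
            return class_to_size[g->sizeclass];
    }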
With minimal changes, this same approach can be used for smaller allocations that are below the mmap threshold and thus in a size class. (Note: it's still desirable for them to be classified with their size class, rather than just like large mmap-serviced allocations, because we need to account for the total usage to be able to switch to using larger groups.) If this change could be made for all sizes, commit 67bf74c2e8127c260c807b5694904324d3aeaf30 would be obsolete and could/should be reverted.

However, servicing individual allocations with mmap only makes sense when they're at least close to a whole page in length, and when rounding up to the next whole number of pages does not have significant relative overhead. (For example, servicing a 4100-byte allocation with two 4k pages would likely be worse than creating a group of three 5.4k slots and using one of them, since only 1.4k rather than 4k would be "slack" with the latter.) Obviously there's a size threshold (e.g. 4 pages, 16k on most archs) that puts a reasonable bound (for 4 pages, 25%) on the overhead of individually mmap-servicing allocations, but it may also make sense to do this for smaller sizes that fall within some threshold below an integral number of pages (e.g. (-size)%PGSZ < PGSZ/4).

The other concern with doing this is performance. Any program with significant ongoing memory usage in a class will at some point allocate a non-single-slot group, and then, as single-slot allocations are freed, they won't be recreated. However, there's a risk that a program with small usage might only use a single object of a given size at a time, but repeatedly free it and allocate a new one, thereby spending inordinate amounts of time in mmap/munmap. This is roughly https://github.com/richfelker/mallocng-draft/issues/3, so it's not new to single-slot groups, just more severe with them since each free of such a slot is an munmap. The only easy solution I see for this is introspective counting of munmap events, backing off to disfavor the allocation strategies that cause them if they happen at too high a rate.

There's also a question of whether the hard mmap threshold should just be lowered too. 128k is probably a lot higher than it needs to be for archs with 4k pages -- just 64k would bound waste at 1/16 -- but on archs with larger pages a lower threshold probably doesn't work (and none of the above memory-saving tricks really work).
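Putting the pieces above together, the eligibility test might look roughly like the following. This is only a sketch of the policy being described; PGSZ, the usage accounting, and the back-off inputs are stand-ins, not actual mallocng interfaces:

    #include <stddef.h>
    #include <stdbool.h>

    #define PGSZ 4096UL  /* stand-in for the runtime page size */

    /* Rough sketch: service the allocation with its own mmap only while
     * usage in the size class is low, only while mmap/munmap churn has
     * not been detected, and only when rounding up to whole pages has
     * acceptable overhead -- either the size is at least 4 pages
     * (bounding overhead around 25%), or it falls within PGSZ/4 below a
     * whole number of pages. */
    static bool use_individual_mmap(size_t size, size_t class_usage,
                                    size_t usage_threshold,
                                    unsigned recent_munmaps,
                                    unsigned munmap_rate_limit)
    {
            if (class_usage >= usage_threshold)
                    return false;  /* enough usage to justify a multi-slot group */
            if (recent_munmaps > munmap_rate_limit)
                    return false;  /* back off: too much mmap/munmap churn */
            if (size >= 4*PGSZ)
                    return true;   /* overhead bounded by ~1 page in 4 */
            return (-size) % PGSZ < PGSZ/4;  /* close to a whole page multiple */
    }

The recent_munmaps/munmap_rate_limit pair is just one way to express the introspective back-off; a real version would presumably need some decay over time and per-class accounting.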