From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.3 required=5.0 tests=MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_MED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_PASS
	autolearn=ham autolearn_force=no version=3.4.2
Received: (qmail 26555 invoked from network); 4 Apr 2020 02:56:10 -0000
Received-SPF:  pass (mother.openwall.net: domain of lists.openwall.com
  designates 195.42.179.200 as permitted sender)
  receiver=inbox.vuxu.org; client-ip=195.42.179.200
  envelope-from=<musl-return-15653-ml=inbox.vuxu.org@lists.openwall.com>
Received: from mother.openwall.net (195.42.179.200)
  by inbox.vuxu.org with UTF8ESMTPZ; 4 Apr 2020 02:56:10 -0000
Received: (qmail 13858 invoked by uid 550); 4 Apr 2020 02:56:07 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 13840 invoked from network); 4 Apr 2020 02:56:06 -0000
Date: Fri, 3 Apr 2020 22:55:54 -0400
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Message-ID: <20200404025554.GG11469@brightrain.aerifal.cx>
References: <20200403213110.GD11469@brightrain.aerifal.cx>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20200403213110.GD11469@brightrain.aerifal.cx>
User-Agent: Mutt/1.5.21 (2010-09-15)
Subject: Re: [musl] New malloc tuning for low usage

On Fri, Apr 03, 2020 at 05:31:10PM -0400, Rich Felker wrote:
> The mitigation I'm looking at now is a feature I've wanted to have for
> a while -- individually servicing allocations smaller than the hard
> mmap-threshold cutoff (128k) with their own mmaps until usage is
> sufficiently high to justify (by making the relative overhead low)
> allocation of whole multi-slot groups. The data structures already
> represent large mmap-serviced allocations as a single-slot group, so
> that we can track and validate them at realloc/free (i.e. so that
> passing a pointer to something that "looks like" a mmap-serviced
> allocation can't be used to corrupt the allocator state, like is
> possible with musl's old/current malloc and most other
> implementations), and treat single-slot groups as a special case where
> the slot size is not the nominal size class spacing, but instead the
> entire length from the header out to the end of the mapping. With
> minimal changes, this same approach can be used for smaller
> allocations that are below the mmap threshold and thus in a size
> class. (Note: it's still desirable for them to be classified with
> their size class, rather than just like large mmap-serviced
> allocations, because we need to account for the total usage to be able
> to switch to using larger groups.)
> 
> If this change could be made for all sizes, commit
> 67bf74c2e8127c260c807b5694904324d3aeaf30 would be obsolete and
> could/should be reverted. However, servicing individual allocations
> with mmap only makes sense when they're at least close to a whole page
> in length, and when rounding up to the next whole number of pages does
> not have significant relative overhead. (For example, servicing a
> 4100-byte allocation with two 4k pages would likely be worse than
> creating a group of three 5.4k slots and using one of them, since only
> 1.4k rather than 4k would be "slack" with the latter.) Obviously
> there's a size threshold (e.g. 4 pages, 16k on most archs) that puts a
> reasonable bound (for 4 pages, 25%) on overhead of individually
> mmap-servicing allocations, but it may also make sense to do this for
> smaller sizes that fall within some threshold below an integral
> number of pages (e.g (-size)%PGSZ < PGSZ/4).
> 
> The other concern with doing this is performance. Any program with
> significant ongoing memory usage in a class will at some point
> allocate a non-single-slot group, and then as single-slot allocations
> are freed they won't be recreated. However, there's a risk that a
> program with small usage might only use a single object of a given
> size at a time, but repeatedly free it and allocate a new one, thereby
> spending inordinate amounts of time in mmap/munmap. This is roughly
> https://github.com/richfelker/mallocng-draft/issues/3, so it's not new
> to single-slot groups, just more severe with them since each free of
> such a slot is an munmap. The only easy solution I see for this is
> just introspective counting of munmap events and backing off to
> disfavor allocation strategies that make it happen if it happens at
> too high a rate.

I'm not 100% decided but I'm thinking it's probably a mistake to try
to use individual mmaps for sizes below ~16k. At smaller sizes that
are just below a whole page multiple (4k or 8k minus epsilon), the
savings aren't that big (at most 2 pages) vs just allocating the
smallest possible multi-slot group, and performance impact seems like
a big unknown. And while I'm okay with introspective bounce counting
to decide whether to return memory to the system, I'm skeptical of
also using it for decisions about how to allocate new memory, since it
could lead to situations where, without any other changes to memory
usage in the application or on the system, free followed by malloc of
the same size could fail.

I have a draft of the above-described functionality and it seems to
work as intended. One key trick that makes things simpler is to
consider single-slot groups (individually mmapped allocations) only
for odd size classes, since even size clases (due to coarse classes
logic in top level malloc) are only used once there's significant
usage in the coarse class.

In working on this, I noticed that it looks like the coarse size class
threshold (6) in top-level malloc() is too low. At that threshold, the
first fine-grained-class group allocation will be roughly a 100%
increase in memory usage by the class; I'd rather keep the relative
increase bounded by 50% or less. It should probably be something more
like 10 or 12 to achieve this. With 12, repeated allocations of 16k
first produce 7 individual 20k mmaps, then a 3-slot class-37
(21824-byte slots) group, then a 7-slot class-36 (18704-byte slots)
group.

One thing that's not clear to me is whether it's useful at all to
produce the 3-slot class-37 group rather than just going on making
more individual mmaps until it's time to switch to the larger group.
It's easy to tune things to do the latter, and seems to offer more
flexibility in how memory is used. It also allows slightly more
fragmentation, but the number of such objects is highly bounded to
begin with because we use increasingly larger groups as usage goes up,
so the contribution should be asymptotically irrelevant.