mailing list of musl libc
* [musl] New malloc tuning for low usage
@ 2020-04-03 21:31 Rich Felker
  2020-04-04  2:55 ` Rich Felker
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2020-04-03 21:31 UTC (permalink / raw)
  To: musl

It's come to my attention that there are usage patterns where the new
malloc (under development, out-of-tree at
https://github.com/richfelker/mallocng-draft) uses significantly more
memory. This happens when a program allocates just one object of a
size between around 4k and 128k, resulting in up to nearly 3.5x the
request size being obtained in the form of 1 or 2 additional slots
(for the next allocation of the same size class) and slack space
(sometimes called internal fragmentation) within the allocated slot
due to coarse size-classing.

The second part of this, coarse classing, was introduced in commit
67bf74c2e8127c260c807b5694904324d3aeaf30 to mitigate a more extreme
form of the first (and larger) problem: due to divisibility
properties, the minimum slot counts for size classes whose index is 0,
1, 2, or 3 mod 4 are 7, 3, 5, and 2, respectively. Refraining from
using classes that
are 0 or 2 mod 4 until there's significant usage at that size class
both avoids allocating 7 or 5 slots of a large size when the program
might only need a couple such objects, and allows a single group to
serve a wider range of sizes. However, it looks like even 2 or 3
slots can be undesirable -- for example, a program that nominally
needs only 75k can end up taking 256k.
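
As a rough illustration of the worst case (a sketch only -- the
87360-byte slot size below is an assumption based on the class
spacing, not a number taken from the mallocng source):

	#include <stdio.h>

	int main(void)
	{
		/* A single ~75k allocation landing in a size class with
		 * (assumed) 87360-byte slots, whose minimum group is 3
		 * slots since the class index is 1 mod 4. */
		size_t req = 75*1024;
		size_t slot = 87360;
		size_t group = 3*slot;
		printf("requested %zu, obtained %zu (%.1fx)\n",
		       req, group, (double)group/req);
		/* prints: requested 76800, obtained 262080 (3.4x) */
		return 0;
	}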

The mitigation I'm looking at now is a feature I've wanted to have for
a while -- individually servicing allocations smaller than the hard
mmap-threshold cutoff (128k) with their own mmaps until usage is
sufficiently high to justify (by making the relative overhead low)
allocation of whole multi-slot groups. The data structures already
represent large mmap-serviced allocations as a single-slot group, so
that we can track and validate them at realloc/free (i.e. so that
passing a pointer to something that "looks like" a mmap-serviced
allocation can't be used to corrupt the allocator state, as is
possible with musl's old/current malloc and most other
implementations), and treat single-slot groups as a special case where
the slot size is not the nominal size class spacing, but instead the
entire length from the header out to the end of the mapping. With
minimal changes, this same approach can be used for smaller
allocations that are below the mmap threshold and thus in a size
class. (Note: it's still desirable for them to be classified with
their size class, rather than just like large mmap-serviced
allocations, because we need to account for the total usage to be able
to switch to using larger groups.)

If this change could be made for all sizes, commit
67bf74c2e8127c260c807b5694904324d3aeaf30 would be obsolete and
could/should be reverted. However, servicing individual allocations
with mmap only makes sense when they're at least close to a whole page
in length, and when rounding up to the next whole number of pages does
not have significant relative overhead. (For example, servicing a
4100-byte allocation with two 4k pages would likely be worse than
creating a group of three 5.4k slots and using one of them, since only
1.4k rather than 4k would be "slack" with the latter.) Obviously
there's a size threshold (e.g. 4 pages, 16k on most archs) that puts a
reasonable bound (for 4 pages, 25%) on overhead of individually
mmap-servicing allocations, but it may also make sense to do this for
smaller sizes that fall within some threshold below an integral
number of pages (e.g. (-size)%PGSZ < PGSZ/4).
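
As a sketch of that eligibility test (illustrative only; the function
name and exact thresholds are placeholders, not settled values):

	#include <stddef.h>

	/* Would a request of this size be a candidate for servicing by
	 * an individual mmap rather than a slot in a multi-slot group? */
	static int mmap_candidate(size_t size, size_t pagesize)
	{
		/* At least 4 pages, bounding round-up overhead at 25%. */
		if (size >= 4*pagesize) return 1;
		/* Or close enough below a whole number of pages that the
		 * round-up waste is under a quarter page. */
		if ((-size) % pagesize < pagesize/4) return 1;
		return 0;
	}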

The other concern with doing this is performance. Any program with
significant ongoing memory usage in a class will at some point
allocate a non-single-slot group, and then as single-slot allocations
are freed they won't be recreated. However, there's a risk that a
program with small usage might only use a single object of a given
size at a time, but repeatedly free it and allocate a new one, thereby
spending inordinate amounts of time in mmap/munmap. This is roughly
https://github.com/richfelker/mallocng-draft/issues/3, so it's not new
to single-slot groups, just more severe with them since each free of
such a slot is an munmap. The only easy solution I see for this is
just introspective counting of munmap events and backing off to
disfavor allocation strategies that make it happen if it happens at
too high a rate.
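
One possible shape for that kind of introspection (purely a sketch of
the idea; the names and threshold here are invented, and nothing like
this exists in the draft yet):

	/* Count munmaps of single-slot groups per size class, and stop
	 * choosing the single-slot strategy for a class once the rate
	 * looks pathological. */
	static int bounce_count[48];   /* per size class; 48 is illustrative */

	static void note_single_slot_unmap(int sc)
	{
		bounce_count[sc]++;
	}

	static int single_slot_disfavored(int sc)
	{
		/* back off after some number of alloc/free round trips that
		 * each cost an mmap+munmap; a real version would need decay
		 * so the decision isn't permanent */
		return bounce_count[sc] >= 16;
	}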

There's also a question of whether the hard mmap threshold should just
be lowered too. 128k is probably a lot higher than it needs to be for
archs with 4k pages -- just 64k would bound waste at 1/16 -- but on
archs with larger pages a lower threshold probably doesn't work (and
none of the above memory-saving tricks really work).
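
(For reference, the 1/16 figure is just page size over threshold; a
trivial sketch, nothing musl-specific:)

	/* Worst-case relative waste from rounding an at-threshold
	 * allocation up to whole pages: under one page per allocation,
	 * i.e. bounded by pagesize/threshold. */
	static double mmap_waste_bound(double pagesize, double threshold)
	{
		return pagesize/threshold;   /* 4k/64k == 1/16 */
	}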


* Re: [musl] New malloc tuning for low usage
  2020-04-03 21:31 [musl] New malloc tuning for low usage Rich Felker
@ 2020-04-04  2:55 ` Rich Felker
  2020-04-04 18:19   ` Rich Felker
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2020-04-04  2:55 UTC (permalink / raw)
  To: musl

On Fri, Apr 03, 2020 at 05:31:10PM -0400, Rich Felker wrote:
> The mitigation I'm looking at now is a feature I've wanted to have for
> a while -- individually servicing allocations smaller than the hard
> mmap-threshold cutoff (128k) with their own mmaps until usage is
> sufficiently high to justify (by making the relative overhead low)
> allocation of whole multi-slot groups. The data structures already
> represent large mmap-serviced allocations as a single-slot group, so
> that we can track and validate them at realloc/free (i.e. so that
> passing a pointer to something that "looks like" a mmap-serviced
> allocation can't be used to corrupt the allocator state, as is
> possible with musl's old/current malloc and most other
> implementations), and treat single-slot groups as a special case where
> the slot size is not the nominal size class spacing, but instead the
> entire length from the header out to the end of the mapping. With
> minimal changes, this same approach can be used for smaller
> allocations that are below the mmap threshold and thus in a size
> class. (Note: it's still desirable for them to be classified with
> their size class, rather than just like large mmap-serviced
> allocations, because we need to account for the total usage to be able
> to switch to using larger groups.)
> 
> If this change could be made for all sizes, commit
> 67bf74c2e8127c260c807b5694904324d3aeaf30 would be obsolete and
> could/should be reverted. However, servicing individual allocations
> with mmap only makes sense when they're at least close to a whole page
> in length, and when rounding up to the next whole number of pages does
> not have significant relative overhead. (For example, servicing a
> 4100-byte allocation with two 4k pages would likely be worse than
> creating a group of three 5.4k slots and using one of them, since only
> 1.4k rather than 4k would be "slack" with the latter.) Obviously
> there's a size threshold (e.g. 4 pages, 16k on most archs) that puts a
> reasonable bound (for 4 pages, 25%) on overhead of individually
> mmap-servicing allocations, but it may also make sense to do this for
> smaller sizes that fall within some threshold below an integral
> number of pages (e.g. (-size)%PGSZ < PGSZ/4).
> 
> The other concern with doing this is performance. Any program with
> significant ongoing memory usage in a class will at some point
> allocate a non-single-slot group, and then as single-slot allocations
> are freed they won't be recreated. However, there's a risk that a
> program with small usage might only use a single object of a given
> size at a time, but repeatedly free it and allocate a new one, thereby
> spending inordinate amounts of time in mmap/munmap. This is roughly
> https://github.com/richfelker/mallocng-draft/issues/3, so it's not new
> to single-slot groups, just more severe with them since each free of
> such a slot is an munmap. The only easy solution I see for this is
> just introspective counting of munmap events and backing off to
> disfavor allocation strategies that make it happen if it happens at
> too high a rate.

I'm not 100% decided but I'm thinking it's probably a mistake to try
to use individual mmaps for sizes below ~16k. At smaller sizes that
are just below a whole page multiple (4k or 8k minus epsilon), the
savings aren't that big (at most 2 pages) vs just allocating the
smallest possible multi-slot group, and performance impact seems like
a big unknown. And while I'm okay with introspective bounce counting
to decide whether to return memory to the system, I'm skeptical of
also using it for decisions about how to allocate new memory, since it
could lead to situations where, without any other changes to memory
usage in the application or on the system, free followed by malloc of
the same size could fail.

I have a draft of the above-described functionality and it seems to
work as intended. One key trick that makes things simpler is to
consider single-slot groups (individually mmapped allocations) only
for odd size classes, since even size classes (due to the
coarse-classing logic in top-level malloc) are only used once there's
significant
usage in the coarse class.
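
In code terms the trick amounts to little more than a parity check on
the class index (a sketch; the helper name is invented and the cutoff
is just the ~16k floor mentioned above):

	/* Only odd-indexed classes, at or above the ~16k floor, are
	 * candidates for single-slot (individually mmapped) groups; even
	 * classes only come into play once the coarse class already has
	 * enough usage that a multi-slot group is justified anyway. */
	static int single_slot_eligible(int sc, size_t slot_size)
	{
		return (sc & 1) && slot_size >= 16*1024;
	}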

In working on this, I noticed that it looks like the coarse size class
threshold (6) in top-level malloc() is too low. At that threshold, the
first fine-grained-class group allocation will be roughly a 100%
increase in memory usage by the class; I'd rather keep the relative
increase bounded by 50% or less. It should probably be something more
like 10 or 12 to achieve this. With 12, repeated allocations of 16k
first produce 7 individual 20k mmaps, then a 3-slot class-37
(21824-byte slots) group, then a 7-slot class-36 (18704-byte slots)
group.
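
Just to sanity-check the arithmetic behind that (using only the
numbers above; this isn't mallocng code):

	#include <stdio.h>

	int main(void)
	{
		size_t mmaps = 7*20480;   /* 7 individual 20k mmaps = 143360 */
		size_t group = 3*21824;   /* 3-slot class-37 group = 65472 */
		/* the first multi-slot group grows class usage by ~46%,
		 * staying under the 50% bound */
		printf("relative increase: %.0f%%\n", 100.0*group/mmaps);
		return 0;
	}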

One thing that's not clear to me is whether it's useful at all to
produce the 3-slot class-37 group rather than just going on making
more individual mmaps until it's time to switch to the larger group.
It's easy to tune things to do the latter, and seems to offer more
flexibility in how memory is used. It also allows slightly more
fragmentation, but the number of such objects is highly bounded to
begin with because we use increasingly larger groups as usage goes up,
so the contribution should be asymptotically irrelevant.



* Re: [musl] New malloc tuning for low usage
  2020-04-04  2:55 ` Rich Felker
@ 2020-04-04 18:19   ` Rich Felker
  2020-04-05  2:20     ` Rich Felker
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2020-04-04 18:19 UTC (permalink / raw)
  To: musl

On Fri, Apr 03, 2020 at 10:55:54PM -0400, Rich Felker wrote:
> In working on this, I noticed that it looks like the coarse size class
> threshold (6) in top-level malloc() is too low. At that threshold, the
> first fine-grained-class group allocation will be roughly a 100%
> increase in memory usage by the class; I'd rather keep the relative
> increase bounded by 50% or less. It should probably be something more
> like 10 or 12 to achieve this. With 12, repeated allocations of 16k
> first produce 7 individual 20k mmaps, then a 3-slot class-37
> (21824-byte slots) group, then a 7-slot class-36 (18704-byte slots)
> group.
> 
> One thing that's not clear to me is whether it's useful at all to
> produce the 3-slot class-37 group rather than just going on making
> more individual mmaps until it's time to switch to the larger group.
> It's easy to tune things to do the latter, and seems to offer more
> flexibility in how memory is used. It also allows slightly more
> fragmentation, but the number of such objects is highly bounded to
> begin with because we use increasingly larger groups as usage goes up,
> so the contribution should be asymptotically irrelevant.

The answer is that it depends on where the sizes fall. At 16k,
rounding up to page size produces 20k usage (5 pages), but the 3-slot
class-37 group uses 5+1/3 pages per slot, so individual mmaps are
preferable. However, if we requested 20k, individual mmaps would take
24k (6 pages) while the 3-slot group would still use just 5+1/3 pages
per slot, so switching to it would be preferable. The deciding
condition seems to be just how the rounded-up-to-whole-pages request
size compares with the slot size: we should prefer individual mmaps
if (1) it's smaller than the slot size, or (2) using a multi-slot
group would be a relative usage increase in the class of more than
50% (or whatever threshold it ends up being tuned to).
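
Worked out as a quick sketch (the 20-byte in-band overhead is an
assumption; 21824 is the class-37 slot size mentioned earlier):

	#include <stdio.h>

	int main(void)
	{
		size_t pagesize = 4096, slot = 21824;   /* class-37 slot size */
		size_t reqs[] = { 16*1024, 20*1024 };
		for (int i = 0; i < 2; i++) {
			size_t req = reqs[i];
			/* what an individual mmap would take: request plus
			 * in-band overhead, rounded up to whole pages */
			size_t mapped = (req + 20 + pagesize - 1) & -pagesize;
			printf("req %zu: mmap %zu vs slot %zu -> %s\n",
			       req, mapped, slot,
			       mapped <= slot ? "prefer mmap" : "prefer group");
		}
		return 0;
	}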

I'll see if I can put together a quick implementation of this and see
how it works.

Rich


* Re: [musl] New malloc tuning for low usage
  2020-04-04 18:19   ` Rich Felker
@ 2020-04-05  2:20     ` Rich Felker
  2020-04-07 17:50       ` Rich Felker
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2020-04-05  2:20 UTC (permalink / raw)
  To: musl

On Sat, Apr 04, 2020 at 02:19:48PM -0400, Rich Felker wrote:
> On Fri, Apr 03, 2020 at 10:55:54PM -0400, Rich Felker wrote:
> > In working on this, I noticed that it looks like the coarse size class
> > threshold (6) in top-level malloc() is too low. At that threshold, the
> > first fine-grained-class group allocation will be roughly a 100%
> > increase in memory usage by the class; I'd rather keep the relative
> > increase bounded by 50% or less. It should probably be something more
> > like 10 or 12 to achieve this. With 12, repeated allocations of 16k
> > first produce 7 individual 20k mmaps, then a 3-slot class-37
> > (21824-byte slots) group, then a 7-slot class-36 (18704-byte slots)
> > group.
> > 
> > One thing that's not clear to me is whether it's useful at all to
> > produce the 3-slot class-37 group rather than just going on making
> > more individual mmaps until it's time to switch to the larger group.
> > It's easy to tune things to do the latter, and seems to offer more
> > flexibility in how memory is used. It also allows slightly more
> > fragmentation, but the number of such objects is highly bounded to
> > begin with because we use increasingly larger groups as usage goes up,
> > so the contribution should be asymptotically irrelevant.
> 
> The answer is that it depends on where the sizes fall. At 16k,
> rounding up to page size produces 20k usage (5 pages), but the 3-slot
> class-37 group uses 5+1/3 pages per slot, so individual mmaps are
> preferable. However, if we requested 20k, individual mmaps would take
> 24k (6 pages) while the 3-slot group would still use just 5+1/3 pages
> per slot, so switching to it would be preferable. The deciding
> condition seems to be just how the rounded-up-to-whole-pages request
> size compares with the slot size: we should prefer individual mmaps
> if (1) it's smaller than the slot size, or (2) using a multi-slot
> group would be a relative usage increase in the class of more than
> 50% (or whatever threshold it ends up being tuned to).
> 
> I'll see if I can put together a quick implementation of this and see
> how it works.

This seems to be working very well with the condition:

	if (sc >= 35 && cnt<=3 && (size*cnt > usage/2 || ((req+20+pagesize-1) & -pagesize) <= size))
	    ^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	 at least ~16k    wanted to make a smaller        requested size
	                  group but hit lower cnt         rounded up to
	                  limit; see loop above           page <= slot size

at the end of the else clause for if (sc < 8) in alloc_group. Here req
is a new argument to expose the size of the actual request malloc
made, so that for single-slot groups (mmap serviced allocations) we
can allocate just the minimum needed rather than the nominal slot
size.
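
For the single-slot path itself, the point of passing req down is
roughly the following (a sketch; the function shape is invented, and
the 20 is the same in-band overhead as in the condition above):

	#include <sys/mman.h>
	#include <stddef.h>

	/* Map just enough pages for the actual request rather than the
	 * nominal slot size of the class. */
	static void *map_single_slot(size_t req, size_t pagesize)
	{
		size_t needed = (req + 20 + pagesize - 1) & -pagesize;
		void *p = mmap(0, needed, PROT_READ|PROT_WRITE,
		               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		return p == MAP_FAILED ? 0 : p;
	}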

Rich


* Re: [musl] New malloc tuning for low usage
  2020-04-05  2:20     ` Rich Felker
@ 2020-04-07 17:50       ` Rich Felker
  0 siblings, 0 replies; 5+ messages in thread
From: Rich Felker @ 2020-04-07 17:50 UTC (permalink / raw)
  To: musl

On Sat, Apr 04, 2020 at 10:20:23PM -0400, Rich Felker wrote:
> > The answer is that it depends on where the sizes fall. At 16k,
> > rounding up to page size produces 20k usage (5 pages), but the 3-slot
> > class-37 group uses 5+1/3 pages per slot, so individual mmaps are
> > preferable. However, if we requested 20k, individual mmaps would take
> > 24k (6 pages) while the 3-slot group would still use just 5+1/3 pages
> > per slot, so switching to it would be preferable. The deciding
> > condition seems to be just how the rounded-up-to-whole-pages request
> > size compares with the slot size: we should prefer individual mmaps
> > if (1) it's smaller than the slot size, or (2) using a multi-slot
> > group would be a relative usage increase in the class of more than
> > 50% (or whatever threshold it ends up being tuned to).
> > 
> > I'll see if I can put together a quick implementation of this and see
> > how it works.
> 
> This seems to be working very well with the condition:
> 
> 	if (sc >= 35 && cnt<=3 && (size*cnt > usage/2 || ((req+20+pagesize-1) & -pagesize) <= size))
> 	    ^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 	 at least ~16k    wanted to make a smaller        requested size
> 	                  group but hit lower cnt         rounded up to
> 	                  limit; see loop above           page <= slot size
> 
> at the end of the else clause for if (sc < 8) in alloc_group. Here req
> is a new argument to expose the size of the actual request malloc
> made, so that for single-slot groups (mmap serviced allocations) we
> can allocate just the minimum needed rather than the nominal slot
> size.

This isn't quite right for arbitrary page size; in particular there's
a missing condition that the potential multi-slot group is actually at
least as large as the single-slot mmap rounded up to page size. This
can be expressed as size*cnt >= ROUND(req+20). It's automatically true
for sc>=35 with PGSZ==4k, but not with 64k.

In summary, it seems there are 3 necessary conditions for considering
use of a single-slot group (individual mmap):

- It would actually be smaller than a multi-slot group
  (otherwise you're just wasting memory)

- The absolute size is large enough to justify syscall overhead
  (otherwise you can get pathological performance from alloc/free cycles)

- Current usage is low enough that a multi-slot group wouldn't obey
  the desired growth bounds on usage
  (otherwise you get vm space fragmentation)

I think it's preferable to break the third condition down into two
(either-or) cases:

- size*cnt > usage/2 (i.e. multi-slot would grow usage by >50%), or

- ROUND(req+20) < size && "low usage" (i.e. slot slack/internal
  fragmentation is sufficiently high that an individual mmap not only
  avoids costly preallocation but actually saves memory)

The second condition here is especially helpful in the presence of
"coarse size classing", since it will almost always be true until the
threshold to stop coarse classing has been reached, and it negates all
the potential waste coarse classing would otherwise introduce.
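
Putting the pieces together as a sketch (identifiers are illustrative,
not the draft code; the rounding is up to whole pages, and low_usage
stands in for whatever form that test ends up taking):

	#include <stddef.h>

	/* Should this allocation get a single-slot group (individual
	 * mmap) instead of a slot in a multi-slot group of cnt slots of
	 * the given size? */
	static int use_single_slot(size_t req, size_t size, size_t cnt,
	                           size_t usage, int low_usage,
	                           size_t pagesize)
	{
		/* what the individual mmap would actually take */
		size_t mapped = (req + 20 + pagesize - 1) & -pagesize;
		/* it must not be larger than the multi-slot group */
		if (mapped > size*cnt) return 0;
		/* and large enough to justify mmap/munmap syscall cost */
		if (size < 16*1024) return 0;   /* "~16k" floor (sc >= 35) */
		/* then take it if the multi-slot group would grow class
		 * usage by more than 50%, or if the slot slack is high
		 * enough that the mmap saves memory outright */
		return size*cnt > usage/2 || (mapped < size && low_usage);
	}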

It would be possible to just disable coarse classing for size ranges
eligible for individual mmap, and in some ways that would be cleaner,
but it would require duplicating the eligibility logic in two places
that would be difficult to keep exactly in sync.

Rich

