From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/12763
Path: news.gmane.org!.POSTED!not-for-mail
From: Rich Felker <dalias@libc.org>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: What's wrong with musl's malloc
Date: Mon, 23 Apr 2018 15:47:22 -0400
Message-ID: <20180423194722.GB3094@brightrain.aerifal.cx>
References: <20180420200904.GS3094@brightrain.aerifal.cx>
 <20180422193450.GA11638@voyager>
 <20180423020010.GX3094@brightrain.aerifal.cx>
 <20180423190205.GB11638@voyager>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: blaine.gmane.org 1524512733 32755 195.159.176.226 (23 Apr 2018 19:45:33 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Mon, 23 Apr 2018 19:45:33 +0000 (UTC)
User-Agent: Mutt/1.5.21 (2010-09-15)
To: musl@lists.openwall.com
Original-X-From: musl-return-12779-gllmg-musl=m.gmane.org@lists.openwall.com Mon Apr 23 21:45:28 2018
Return-path: <musl-return-12779-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.84_2)
	(envelope-from <musl-return-12779-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1fAhOy-0008R6-0s
	for gllmg-musl@m.gmane.org; Mon, 23 Apr 2018 21:45:28 +0200
Original-Received: (qmail 26165 invoked by uid 550); 23 Apr 2018 19:47:35 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 26140 invoked from network); 23 Apr 2018 19:47:35 -0000
Content-Disposition: inline
In-Reply-To: <20180423190205.GB11638@voyager>
Original-Sender: Rich Felker <dalias@aerifal.cx>
Xref: news.gmane.org gmane.linux.lib.musl.general:12763
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/12763>

On Mon, Apr 23, 2018 at 09:02:05PM +0200, Markus Wichmann wrote:
> > There is also no bin for "sized 2k to 4k". It's "sized 2048 bytes to
> > 2560 bytes", etc. -- granularity of bin sizes is in 1/4 steps up to
> > the next power of 2. So If you do start with a 4k chunk, after
> > splitting off 50 bytes to allocate, you have left a chunk in the
> > "sized 3586 up to 4096" bin, and the 2050-byte allocation is perfectly
> > satisfiable from it. You could adjust the example to work, but then
> > the fragmentation you find as a result is much lower.
> > 
> 
> Yup, by my calculation I get about 400 bytes of fragmentation in the
> worst case. The "just shy of 4k" bin starts at 3616 bytes, so after
> allocating a tiny amount of memory, allocating 3648 bytes or more will
> require heap expansion even though the memory to fulfill the request is
> already available. And expand_heap() will only round up to one page
> size. So afterwards, roughly 3700 bytes will be allocated in two pages.
> 
> The example also does not scale, as repeating the tiny allocation will
> not get another page. And repeatedly allocating 3650 bytes will allocate
> one page per request, but it is doubtful that a better scheme is
> available.

Ah, I was going to correct you here, but I think you've identified a
significant inefficiency in the current implementation. My intent was
that expand_heap merge the newly obtained space with the existing top
of the heap, but in fact it leaves a free gap below it. Merging would
require temporarily taking possession of the existing free chunk just
below it, which brings us back to the original problem this thread
described (non-atomicity, ability of other threads to 'see' the moment
when it's temporarily taken) unlesswe have a solution for that.

So, if things worked as I originally envisioned, there would be no
fragmentation here. But as it's working now, there is, and it could be
significant. This might be the biggest source of unintended
fragmentation we've found yet.

> > > > In the long term, I think the whole malloc design we're using now is
> > > > wrong, and should be replaced, but until the time comes to actually do
> > > > that, it may make sense to try to improve some of the worst
> > > > properties, or even to reduce or eliminate the granularity of the
> > > > locking if it proves not to be very valuable on real-world loads.
> > > > 
> > > 
> > > Care to elaborate on what is wrong with the current design of the
> > > malloc? Everything you named so far was just implementation issues, but
> > > nothing that's design related.
> > 
> > Designs that require splitting/merging chunks inherently require
> > excessive synchronization that makes them not scale well to many
> > threads/cores (and awful on NUMA, if anyone cares). They also have
> > inherent fragmentation problems with certain allocation patterns, like
> > alternating between small/large allocations then freeing all the large
> > ones, which is a reasonable pattern (e.g. allocating large working
> > spaces and small result spaces, then freeing all the working spaces).
> > 
> 
> How do we work around this, then? We could go the tcmalloc route and
> essentially run the allocator separately for each thread (arena
> allocation), which reduces the amount of synchronization at the cost of
> more fragmentation; this time cross-thread fragmentation (no-one will be
> able to utilize all the memory that is lost to overhead across all
> threads).
> 
> We could also go the dietlibc route, which uses fixed widths for each
> bin. There is a bin of 16-byte elements, and it is a list of chunks that
> are all 16 bytes in size. If someone wants 16 bytes, we know where to
> look. Of course, if someone wants first 16, then 32, then 48 bytes, that
> will allocate 96 bytes in three pages. Fragmentation doesn't begin to
> describe it.

One possible approach I'm considering is treating the partitioning of
virtual address space into chunks as immutable -- then, little or no
synchronization is needed to access the data structures, and rather
than having to start the search for a free chunk from a global bin,
each thread could have its own "current position" in the graph of free
chunks for each size, in such a way that they naturally (at least
statistically) avoid stomping on the same memory.

Early in process lifetime, when not much has been allocated, the
partioning could be purely based on demand, avoiding over-allocation
but creating structures that aren't as flexible for future reuse.
Later, groups of up to 32 chunks of the same size could be created
such that a single atomic could allocate one. Consecutive ones could
also be allocated together in principle but I'm concerned that might
be problematic.

One could envision having lockable bins (like the current bin system)
populated not with individual chunks but with groups of [up to] 32
chunks of the same size, where when a thread allocates the last free
chunk in a group, it locks the bin and shuffles the data structures in
it around to pick (or allocate) a new group of free chunks. This would
drop the need for locks to ~1/32 malloc calls, and much less if the
mallocs are frequently paired with frees, in which case you'd get by
with mostly just atomic test-and-set-bit operations.

> > Designs where all the metadata (and especially freelist pointers) are
> > inline are highly vulnerable to heap overflow and use-after-free
> > attacks. If possible, I'd rather have less metadata inline and have it
> > efficiently coupled with something out-of-line that's harder to
> > attack. This might also improve performance and reduce contention,
> > e.g. if we used atomic bitmasks in an index of chunks to allocate/free
> > them rather than having to move them in and out of linked lists.
> > 
> 
> That is a very good idea. Say, don't OS memory managers have to do it
> this way? Maybe pick up some pointers from them...

If you mean like the Linux kernel's, it has very different usage
patterns, a lot more control over the usage patterns, and different
allocation functions that document the caller's intent to the
allocator, versus a userspace malloc. So I think it's solving a
simpler problem.

Rich