From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/12763 Path: news.gmane.org!.POSTED!not-for-mail From: Rich Felker Newsgroups: gmane.linux.lib.musl.general Subject: Re: What's wrong with musl's malloc Date: Mon, 23 Apr 2018 15:47:22 -0400 Message-ID: <20180423194722.GB3094@brightrain.aerifal.cx> References: <20180420200904.GS3094@brightrain.aerifal.cx> <20180422193450.GA11638@voyager> <20180423020010.GX3094@brightrain.aerifal.cx> <20180423190205.GB11638@voyager> Reply-To: musl@lists.openwall.com NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: blaine.gmane.org 1524512733 32755 195.159.176.226 (23 Apr 2018 19:45:33 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Mon, 23 Apr 2018 19:45:33 +0000 (UTC) User-Agent: Mutt/1.5.21 (2010-09-15) To: musl@lists.openwall.com Original-X-From: musl-return-12779-gllmg-musl=m.gmane.org@lists.openwall.com Mon Apr 23 21:45:28 2018 Return-path: Envelope-to: gllmg-musl@m.gmane.org Original-Received: from mother.openwall.net ([195.42.179.200]) by blaine.gmane.org with smtp (Exim 4.84_2) (envelope-from ) id 1fAhOy-0008R6-0s for gllmg-musl@m.gmane.org; Mon, 23 Apr 2018 21:45:28 +0200 Original-Received: (qmail 26165 invoked by uid 550); 23 Apr 2018 19:47:35 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Original-Received: (qmail 26140 invoked from network); 23 Apr 2018 19:47:35 -0000 Content-Disposition: inline In-Reply-To: <20180423190205.GB11638@voyager> Original-Sender: Rich Felker Xref: news.gmane.org gmane.linux.lib.musl.general:12763 Archived-At: On Mon, Apr 23, 2018 at 09:02:05PM +0200, Markus Wichmann wrote: > > There is also no bin for "sized 2k to 4k". It's "sized 2048 bytes to > > 2560 bytes", etc. -- granularity of bin sizes is in 1/4 steps up to > > the next power of 2. So If you do start with a 4k chunk, after > > splitting off 50 bytes to allocate, you have left a chunk in the > > "sized 3586 up to 4096" bin, and the 2050-byte allocation is perfectly > > satisfiable from it. You could adjust the example to work, but then > > the fragmentation you find as a result is much lower. > > > > Yup, by my calculation I get about 400 bytes of fragmentation in the > worst case. The "just shy of 4k" bin starts at 3616 bytes, so after > allocating a tiny amount of memory, allocating 3648 bytes or more will > require heap expansion even though the memory to fulfill the request is > already available. And expand_heap() will only round up to one page > size. So afterwards, roughly 3700 bytes will be allocated in two pages. > > The example also does not scale, as repeating the tiny allocation will > not get another page. And repeatedly allocating 3650 bytes will allocate > one page per request, but it is doubtful that a better scheme is > available. Ah, I was going to correct you here, but I think you've identified a significant inefficiency in the current implementation. My intent was that expand_heap merge the newly obtained space with the existing top of the heap, but in fact it leaves a free gap below it. Merging would require temporarily taking possession of the existing free chunk just below it, which brings us back to the original problem this thread described (non-atomicity, ability of other threads to 'see' the moment when it's temporarily taken) unlesswe have a solution for that. So, if things worked as I originally envisioned, there would be no fragmentation here. But as it's working now, there is, and it could be significant. This might be the biggest source of unintended fragmentation we've found yet. > > > > In the long term, I think the whole malloc design we're using now is > > > > wrong, and should be replaced, but until the time comes to actually do > > > > that, it may make sense to try to improve some of the worst > > > > properties, or even to reduce or eliminate the granularity of the > > > > locking if it proves not to be very valuable on real-world loads. > > > > > > > > > > Care to elaborate on what is wrong with the current design of the > > > malloc? Everything you named so far was just implementation issues, but > > > nothing that's design related. > > > > Designs that require splitting/merging chunks inherently require > > excessive synchronization that makes them not scale well to many > > threads/cores (and awful on NUMA, if anyone cares). They also have > > inherent fragmentation problems with certain allocation patterns, like > > alternating between small/large allocations then freeing all the large > > ones, which is a reasonable pattern (e.g. allocating large working > > spaces and small result spaces, then freeing all the working spaces). > > > > How do we work around this, then? We could go the tcmalloc route and > essentially run the allocator separately for each thread (arena > allocation), which reduces the amount of synchronization at the cost of > more fragmentation; this time cross-thread fragmentation (no-one will be > able to utilize all the memory that is lost to overhead across all > threads). > > We could also go the dietlibc route, which uses fixed widths for each > bin. There is a bin of 16-byte elements, and it is a list of chunks that > are all 16 bytes in size. If someone wants 16 bytes, we know where to > look. Of course, if someone wants first 16, then 32, then 48 bytes, that > will allocate 96 bytes in three pages. Fragmentation doesn't begin to > describe it. One possible approach I'm considering is treating the partitioning of virtual address space into chunks as immutable -- then, little or no synchronization is needed to access the data structures, and rather than having to start the search for a free chunk from a global bin, each thread could have its own "current position" in the graph of free chunks for each size, in such a way that they naturally (at least statistically) avoid stomping on the same memory. Early in process lifetime, when not much has been allocated, the partioning could be purely based on demand, avoiding over-allocation but creating structures that aren't as flexible for future reuse. Later, groups of up to 32 chunks of the same size could be created such that a single atomic could allocate one. Consecutive ones could also be allocated together in principle but I'm concerned that might be problematic. One could envision having lockable bins (like the current bin system) populated not with individual chunks but with groups of [up to] 32 chunks of the same size, where when a thread allocates the last free chunk in a group, it locks the bin and shuffles the data structures in it around to pick (or allocate) a new group of free chunks. This would drop the need for locks to ~1/32 malloc calls, and much less if the mallocs are frequently paired with frees, in which case you'd get by with mostly just atomic test-and-set-bit operations. > > Designs where all the metadata (and especially freelist pointers) are > > inline are highly vulnerable to heap overflow and use-after-free > > attacks. If possible, I'd rather have less metadata inline and have it > > efficiently coupled with something out-of-line that's harder to > > attack. This might also improve performance and reduce contention, > > e.g. if we used atomic bitmasks in an index of chunks to allocate/free > > them rather than having to move them in and out of linked lists. > > > > That is a very good idea. Say, don't OS memory managers have to do it > this way? Maybe pick up some pointers from them... If you mean like the Linux kernel's, it has very different usage patterns, a lot more control over the usage patterns, and different allocation functions that document the caller's intent to the allocator, versus a userspace malloc. So I think it's solving a simpler problem. Rich