From: Markus Wichmann
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: What's wrong with musl's malloc
Date: Mon, 23 Apr 2018 21:02:05 +0200
Message-ID: <20180423190205.GB11638@voyager>
In-Reply-To: <20180423020010.GX3094@brightrain.aerifal.cx>
To: musl@lists.openwall.com

On Sun, Apr 22, 2018 at 10:00:10PM -0400, Rich Felker wrote:
> This is incorrect. Up to size 512 bytes (1024 on 64-bit archs), every
> single multiple of 16 (or of 32, on 64-bit) is its own bin. Only past
> that point does sparse spacing of bins begin.
>

Yeah, I see that now. My error was assuming that bin_index() was
basically just a binary logarithm. It is not. Beyond the linear range
it is some logarithm, but not a binary one, which is easy to see by
plotting the function's results. Maybe I should have verified my
assumptions before posting.
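
To make that spacing concrete, here is a toy version of the mapping --
not musl's actual bin_index(); UNIT, LINEAR_MAX and toy_bin are
invented names, with the 32-bit values of 16 and 512 bytes plugged in.
Below the linear limit every multiple of UNIT gets its own bin; above
it, bins advance in quarter steps of the enclosing power of two:

/*
 * Toy illustration only, NOT musl's bin_index(). Sizes are assumed
 * already rounded to a multiple of UNIT.
 */
#include <stdio.h>
#include <stddef.h>

#define UNIT        16   /* 32 on 64-bit archs */
#define LINEAR_MAX 512   /* 1024 on 64-bit archs */

static int toy_bin(size_t n)
{
    if (n < LINEAR_MAX)
        return (int)(n / UNIT);                 /* one bin per UNIT */
    int lg = 0;
    for (size_t t = n; t > 1; t >>= 1)
        lg++;                                   /* floor(log2(n)) */
    /* quarter step within the octave [2^lg, 2^(lg+1)) */
    int quarter = (int)((n - ((size_t)1 << lg)) / (((size_t)1 << lg) / 4));
    return LINEAR_MAX / UNIT + 4 * (lg - 9) + quarter;  /* 9 = log2(512) */
}

int main(void)
{
    size_t sizes[] = { 64, 496, 512, 2048, 2544, 2560, 3072, 3584, 4096 };
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("%4zu -> bin %d\n", sizes[i], toy_bin(sizes[i]));
    return 0;
}

The exact bin numbers differ from musl's, but the shape is the same:
2048 and 2544 land in one bin, 2560 starts the next.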
> There is also no bin for "sized 2k to 4k". It's "sized 2048 bytes to
> 2560 bytes", etc. -- granularity of bin sizes is in 1/4 steps up to
> the next power of 2. So if you do start with a 4k chunk, after
> splitting off 50 bytes to allocate, you have left a chunk in the
> "sized 3586 up to 4096" bin, and the 2050-byte allocation is
> perfectly satisfiable from it. You could adjust the example to work,
> but then the fragmentation you find as a result is much lower.
>

Yup, by my calculation I get about 400 bytes of fragmentation in the
worst case. The "just shy of 4k" bin starts at 3616 bytes, so after
allocating a tiny amount of memory, allocating 3648 bytes or more will
require heap expansion even though the memory to fulfill the request
is already available. And expand_heap() will only round up to one page
size, so afterwards roughly 3700 bytes will be allocated in two pages.
The example also does not scale: repeating the tiny allocation will
not consume another page, and while repeatedly allocating 3650 bytes
will cost one page per request, it is doubtful that a better scheme is
available.

> > The benefit of this scheme, of course, is that allocations in the
> > single-threaded case are bounded time: The algorithm is to pop off
> > the first chunk in the bins large enough to support the request, or
> > to allocate the memory necessary for that from the OS. In the worst
> > case, allocation is
> >
> > - OS memory allocation
> > - allocation of a chunk from the bin
> > - splitting that chunk
> >
> > Each of these is constant time. I am not sure optimizing
> > fragmentation is worth reducing the performance for. Especially in
> > today's desktop and server systems, where anything below 16GB of
> > RAM will just get laughed off the stage (slight exaggeration).
>
> I've rarely used a system with 16GB ram, and plenty of systems using
> musl have under 1GB. The old musl git/web server had 256MB before the
> host shut it down; now we're stuck with a more expensive one with
> something like 768MB or 1GB. Also 32-bit archs have a hard *virtual*
> limit of 2, 3, or 4 GB (mostly 2 or 3) regardless of how much
> physical memory you have. Treating 16GB like something reasonable to
> have is how you get the glibc & mainstream browser ecosystem...
>

Yeah, that statement was not terribly well thought through. My point
was, though, that the current design manages to make single-threaded
allocation constant time (well, as constant as expand_heap()), at the
cost of more fragmentation than necessary.

Of course 16GB is a ridiculous number. That's why I chose it. My
largest system has 8GB, and that was after scavenging some parts out
of laptops due for the scrap heap.
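
As a reference point for the bounded-time argument above, the
single-threaded fast path boils down to roughly the following sketch.
Every name in it is hypothetical and the real code differs in many
details (chunk headers, locking, bin count):

/*
 * Rough sketch of a bounded-time bin allocator, for illustration only.
 */
#include <stddef.h>
#include <stdint.h>

struct chunk { size_t size; struct chunk *next; };

extern struct chunk *bin_head[64];    /* hypothetical per-bin free lists  */
extern uint64_t      bin_nonempty;    /* bit i set <=> bin i has a chunk  */
extern int  size_to_bin(size_t n);            /* smallest bin that fits n */
extern void put_in_bin(struct chunk *c);      /* file chunk under its bin */
extern struct chunk *grab_from_os(size_t n);  /* expand_heap()-like growth */

struct chunk *alloc_chunk(size_t n)
{
    int want = size_to_bin(n);
    uint64_t usable = bin_nonempty & -((uint64_t)1 << want); /* bins >= want */
    if (!usable)
        return grab_from_os(n);                /* constant work + syscall */

    int i = __builtin_ctzll(usable);           /* first usable bin */
    struct chunk *c = bin_head[i];             /* pop the head: O(1) */
    bin_head[i] = c->next;
    if (!bin_head[i])
        bin_nonempty &= ~((uint64_t)1 << i);

    if (c->size - n >= sizeof(struct chunk)) { /* split off the remainder */
        struct chunk *rest = (struct chunk *)((char *)c + n);
        rest->size = c->size - n;
        c->size = n;
        put_in_bin(rest);                      /* also O(1) */
    }
    return c;
}

Each step is constant time; the fragmentation cost comes entirely from
which bin the remainder lands in.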
> > > In the long term, I think the whole malloc design we're using
> > > now is wrong, and should be replaced, but until the time comes
> > > to actually do that, it may make sense to try to improve some of
> > > the worst properties, or even to reduce or eliminate the
> > > granularity of the locking if it proves not to be very valuable
> > > on real-world loads.
> > >
> >
> > Care to elaborate on what is wrong with the current design of the
> > malloc? Everything you named so far was just implementation
> > issues, but nothing that's design related.
>
> Designs that require splitting/merging chunks inherently require
> excessive synchronization that makes them not scale well to many
> threads/cores (and awful on NUMA, if anyone cares). They also have
> inherent fragmentation problems with certain allocation patterns,
> like alternating between small/large allocations then freeing all
> the large ones, which is a reasonable pattern (e.g. allocating large
> working spaces and small result spaces, then freeing all the working
> spaces).
>

How do we work around this, then?

We could go the tcmalloc route and essentially run the allocator
separately for each thread (arena allocation), which reduces the
amount of synchronization at the cost of more fragmentation, this time
across threads: no thread can use the memory lost to overhead in the
other threads' arenas.

We could also go the dietlibc route, which uses a fixed width for each
bin. There is a bin of 16-byte elements, and it is a list of chunks
that are all 16 bytes in size. If someone wants 16 bytes, we know
where to look. Of course, if someone wants first 16, then 32, then 48
bytes, that will consume three pages to serve 96 bytes. Fragmentation
doesn't begin to describe it.

> Designs where all the metadata (and especially freelist pointers)
> are inline are highly vulnerable to heap overflow and use-after-free
> attacks. If possible, I'd rather have less metadata inline and have
> it efficiently coupled with something out-of-line that's harder to
> attack. This might also improve performance and reduce contention,
> e.g. if we used atomic bitmasks in an index of chunks to
> allocate/free them rather than having to move them in and out of
> linked lists.
>

That is a very good idea. Say, don't OS memory managers have to do it
this way? Maybe pick up some pointers from them...

> Here when I say "whole design is wrong" I'm considering all these as
> the same basic design -- they're all dlmalloc variants. Doing better
> in important ways while not doing worse in any important way is a
> hard problem. But it seems like one worth solving.
>
> Rich

And here I thought memory management was a solved problem. Maybe it's
time I hit a library again.

Ciao,
Markus
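
For a concrete picture of the out-of-line metadata idea discussed
above, a minimal sketch follows, assuming a single slab of fixed-size
slots and entirely invented names: the free/used state lives in a
separate atomic bitmask, so allocation and free are a compare-and-swap
or fetch-or on the mask rather than linked-list surgery on pointers
stored inside the chunks.

/*
 * Minimal sketch of out-of-line, bitmask-based bookkeeping.
 * Not musl code; all names are invented.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define SLOT_SIZE 64
#define NSLOTS    64

static unsigned char slab[NSLOTS * SLOT_SIZE];
static _Atomic uint64_t freemask = UINT64_MAX;  /* bit i set <=> slot i free */

static void *slot_alloc(void)
{
    uint64_t m = atomic_load_explicit(&freemask, memory_order_relaxed);
    while (m) {
        int i = __builtin_ctzll(m);             /* lowest free slot */
        if (atomic_compare_exchange_weak_explicit(&freemask, &m,
                m & ~((uint64_t)1 << i),
                memory_order_acquire, memory_order_relaxed))
            return slab + (size_t)i * SLOT_SIZE;
        /* failed CAS reloaded m with the current mask; just retry */
    }
    return NULL;                                /* slab exhausted */
}

static void slot_free(void *p)
{
    uint64_t bit = (uint64_t)1 << (((unsigned char *)p - slab) / SLOT_SIZE);
    atomic_fetch_or_explicit(&freemask, bit, memory_order_release);
}

int main(void)
{
    void *a = slot_alloc(), *b = slot_alloc();
    printf("a=%p b=%p\n", a, b);
    slot_free(a);
    slot_free(b);
    return 0;
}

The tradeoff is that each bitmask only covers same-sized slots, so
something else still has to manage the slabs themselves.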