From: Markus Wichmann
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: What's wrong with musl's malloc
Date: Mon, 23 Apr 2018 21:02:05 +0200
Message-ID: <20180423190205.GB11638@voyager>
In-Reply-To: <20180423020010.GX3094@brightrain.aerifal.cx>
To: musl@lists.openwall.com

On Sun, Apr 22, 2018 at 10:00:10PM -0400, Rich Felker wrote:
> This is incorrect. Up to size 512 bytes (1024 on 64-bit archs), every
> single multiple of 16 (or of 32, on 64-bit) is its own bin. Only past
> that point does sparse spacing of bins begin.
>

Yeah, I see that now. My error was assuming that bin_index() was
basically just a binary logarithm. It is not. Beyond the linear range
it is some logarithm, but not a binary one, which is easy to see by
plotting the function's results. Maybe I should have verified my
assumptions before posting.
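
To make that spacing concrete, here is a toy version of the mapping --
not musl's actual bin_index(); UNIT, LINEAR_MAX and toy_bin are
invented names, with the 32-bit values of 16 and 512 bytes plugged in.
Below the linear limit every multiple of UNIT gets its own bin; above
it, bins advance in quarter steps of the enclosing power of two:

/*
 * Toy illustration only, NOT musl's bin_index(). Sizes are assumed
 * already rounded to a multiple of UNIT.
 */
#include <stdio.h>
#include <stddef.h>

#define UNIT        16   /* 32 on 64-bit archs */
#define LINEAR_MAX 512   /* 1024 on 64-bit archs */

static int toy_bin(size_t n)
{
    if (n < LINEAR_MAX)
        return (int)(n / UNIT);                 /* one bin per UNIT */
    int lg = 0;
    for (size_t t = n; t > 1; t >>= 1)
        lg++;                                   /* floor(log2(n)) */
    /* quarter step within the octave [2^lg, 2^(lg+1)) */
    int quarter = (int)((n - ((size_t)1 << lg)) / (((size_t)1 << lg) / 4));
    return LINEAR_MAX / UNIT + 4 * (lg - 9) + quarter;  /* 9 = log2(512) */
}

int main(void)
{
    size_t sizes[] = { 64, 496, 512, 2048, 2544, 2560, 3072, 3584, 4096 };
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("%4zu -> bin %d\n", sizes[i], toy_bin(sizes[i]));
    return 0;
}

The exact bin numbers differ from musl's, but the shape is the same:
2048 and 2544 land in one bin, 2560 starts the next.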
> There is also no bin for "sized 2k to 4k". It's "sized 2048 bytes to
> 2560 bytes", etc. -- granularity of bin sizes is in 1/4 steps up to
> the next power of 2. So if you do start with a 4k chunk, after
> splitting off 50 bytes to allocate, you have left a chunk in the
> "sized 3586 up to 4096" bin, and the 2050-byte allocation is
> perfectly satisfiable from it. You could adjust the example to work,
> but then the fragmentation you find as a result is much lower.
>

Yup, by my calculation I get about 400 bytes of fragmentation in the
worst case. The "just shy of 4k" bin starts at 3616 bytes, so after
allocating a tiny amount of memory, allocating 3648 bytes or more will
require heap expansion even though the memory to fulfill the request
is already available. And expand_heap() will only round up to one page
size, so afterwards roughly 3700 bytes will be allocated in two pages.
The example also does not scale: repeating the tiny allocation will
not consume another page, and while repeatedly allocating 3650 bytes
will cost one page per request, it is doubtful that a better scheme is
available.

> > The benefit of this scheme, of course, is that allocations in the
> > single-threaded case are bounded time: The algorithm is to pop off
> > the first chunk in the bins large enough to support the request, or
> > to allocate the memory necessary for that from the OS. In the worst
> > case, allocation is
> >
> > - OS memory allocation
> > - allocation of a chunk from the bin
> > - splitting that chunk
> >
> > Each of these is constant time. I am not sure optimizing
> > fragmentation is worth reducing the performance for. Especially in
> > today's desktop and server systems, where anything below 16GB of
> > RAM will just get laughed off the stage (slight exaggeration).
>
> I've rarely used a system with 16GB ram, and plenty of systems using
> musl have under 1GB. The old musl git/web server had 256MB before the
> host shut it down; now we're stuck with a more expensive one with
> something like 768MB or 1GB. Also 32-bit archs have a hard *virtual*
> limit of 2, 3, or 4 GB (mostly 2 or 3) regardless of how much
> physical memory you have. Treating 16GB like something reasonable to
> have is how you get the glibc & mainstream browser ecosystem...
>

Yeah, that statement was not terribly well thought through. My point
was, though, that the current design manages to make single-threaded
allocation constant time (well, as constant as expand_heap()), at the
cost of more fragmentation than necessary.

Of course 16GB is a ridiculous number. That's why I chose it. My
largest system has 8GB, and that was after scavenging some parts out
of laptops due for the scrap heap.
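
As a reference point for the bounded-time argument above, the
single-threaded fast path boils down to roughly the following sketch.
Every name in it is hypothetical and the real code differs in many
details (chunk headers, locking, bin count):

/*
 * Rough sketch of a bounded-time bin allocator, for illustration only.
 */
#include <stddef.h>
#include <stdint.h>

struct chunk { size_t size; struct chunk *next; };

extern struct chunk *bin_head[64];    /* hypothetical per-bin free lists  */
extern uint64_t      bin_nonempty;    /* bit i set <=> bin i has a chunk  */
extern int  size_to_bin(size_t n);            /* smallest bin that fits n */
extern void put_in_bin(struct chunk *c);      /* file chunk under its bin */
extern struct chunk *grab_from_os(size_t n);  /* expand_heap()-like growth */

struct chunk *alloc_chunk(size_t n)
{
    int want = size_to_bin(n);
    uint64_t usable = bin_nonempty & -((uint64_t)1 << want); /* bins >= want */
    if (!usable)
        return grab_from_os(n);                /* constant work + syscall */

    int i = __builtin_ctzll(usable);           /* first usable bin */
    struct chunk *c = bin_head[i];             /* pop the head: O(1) */
    bin_head[i] = c->next;
    if (!bin_head[i])
        bin_nonempty &= ~((uint64_t)1 << i);

    if (c->size - n >= sizeof(struct chunk)) { /* split off the remainder */
        struct chunk *rest = (struct chunk *)((char *)c + n);
        rest->size = c->size - n;
        c->size = n;
        put_in_bin(rest);                      /* also O(1) */
    }
    return c;
}

Each step is constant time; the fragmentation cost comes entirely from
which bin the remainder lands in.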
> > > In the long term, I think the whole malloc design we're using
> > > now is wrong, and should be replaced, but until the time comes
> > > to actually do that, it may make sense to try to improve some of
> > > the worst properties, or even to reduce or eliminate the
> > > granularity of the locking if it proves not to be very valuable
> > > on real-world loads.
> > >
> >
> > Care to elaborate on what is wrong with the current design of the
> > malloc? Everything you named so far was just implementation
> > issues, but nothing that's design related.
>
> Designs that require splitting/merging chunks inherently require
> excessive synchronization that makes them not scale well to many
> threads/cores (and awful on NUMA, if anyone cares). They also have
> inherent fragmentation problems with certain allocation patterns,
> like alternating between small/large allocations then freeing all
> the large ones, which is a reasonable pattern (e.g. allocating large
> working spaces and small result spaces, then freeing all the working
> spaces).
>

How do we work around this, then?

We could go the tcmalloc route and essentially run the allocator
separately for each thread (arena allocation), which reduces the
amount of synchronization at the cost of more fragmentation, this time
across threads: no thread can use the memory lost to overhead in the
other threads' arenas.

We could also go the dietlibc route, which uses a fixed width for each
bin. There is a bin of 16-byte elements, and it is a list of chunks
that are all 16 bytes in size. If someone wants 16 bytes, we know
where to look. Of course, if someone wants first 16, then 32, then 48
bytes, that will consume three pages to serve 96 bytes. Fragmentation
doesn't begin to describe it.

> Designs where all the metadata (and especially freelist pointers)
> are inline are highly vulnerable to heap overflow and use-after-free
> attacks. If possible, I'd rather have less metadata inline and have
> it efficiently coupled with something out-of-line that's harder to
> attack. This might also improve performance and reduce contention,
> e.g. if we used atomic bitmasks in an index of chunks to
> allocate/free them rather than having to move them in and out of
> linked lists.
>

That is a very good idea. Say, don't OS memory managers have to do it
this way? Maybe pick up some pointers from them...

> Here when I say "whole design is wrong" I'm considering all these as
> the same basic design -- they're all dlmalloc variants. Doing better
> in important ways while not doing worse in any important way is a
> hard problem. But it seems like one worth solving.
>
> Rich

And here I thought memory management was a solved problem. Maybe it's
time I hit a library again.

Ciao,
Markus
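
For a concrete picture of the out-of-line metadata idea discussed
above, a minimal sketch follows, assuming a single slab of fixed-size
slots and entirely invented names: the free/used state lives in a
separate atomic bitmask, so allocation and free are a compare-and-swap
or fetch-or on the mask rather than linked-list surgery on pointers
stored inside the chunks.

/*
 * Minimal sketch of out-of-line, bitmask-based bookkeeping.
 * Not musl code; all names are invented.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define SLOT_SIZE 64
#define NSLOTS    64

static unsigned char slab[NSLOTS * SLOT_SIZE];
static _Atomic uint64_t freemask = UINT64_MAX;  /* bit i set <=> slot i free */

static void *slot_alloc(void)
{
    uint64_t m = atomic_load_explicit(&freemask, memory_order_relaxed);
    while (m) {
        int i = __builtin_ctzll(m);             /* lowest free slot */
        if (atomic_compare_exchange_weak_explicit(&freemask, &m,
                m & ~((uint64_t)1 << i),
                memory_order_acquire, memory_order_relaxed))
            return slab + (size_t)i * SLOT_SIZE;
        /* failed CAS reloaded m with the current mask; just retry */
    }
    return NULL;                                /* slab exhausted */
}

static void slot_free(void *p)
{
    uint64_t bit = (uint64_t)1 << (((unsigned char *)p - slab) / SLOT_SIZE);
    atomic_fetch_or_explicit(&freemask, bit, memory_order_release);
}

int main(void)
{
    void *a = slot_alloc(), *b = slot_alloc();
    printf("a=%p b=%p\n", a, b);
    slot_free(a);
    slot_free(b);
    return 0;
}

The tradeoff is that each bitmask only covers same-sized slots, so
something else still has to manage the slabs themselves.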