From: Markus Wichmann
Subject: Re: What's wrong with musl's malloc
Date: Sun, 22 Apr 2018 21:34:50 +0200
To: musl@lists.openwall.com
Message-ID: <20180422193450.GA11638@voyager>
In-Reply-To: <20180420200904.GS3094@brightrain.aerifal.cx>

On Fri, Apr 20, 2018 at 04:09:04PM -0400, Rich Felker wrote:
> I've talked on and off about musl's malloc needing to be overhauled or
> replaced, and gaining the ability to experiment with that is one of
> the motivations for making malloc interposable/replaceable. Here's a
> brief write-up of what's wrong that needs to be addressed:
>

Yeah, I was about to ask for one.

>
> The main benefit of musl's malloc vs the standard dlmalloc algorithms
> it's based on is the fine-grained locking. As long as there are binned
> free chunks of various sizes available, threads calling malloc will
> only contend for a lock when they're requesting allocations of the
> same or a similar size. This works out well under artificial random
> loads; I'm not sure how much it helps on typical real-world loads.
>
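For anyone reading along who hasn't looked at the source: the scheme
boils down to roughly the following. This is only a simplified sketch
with made-up names, a made-up size-class mapping and 64 bins; it is not
the actual musl code, and lock initialization and the rest of the
allocator are omitted:

    /* Sketch of per-bin locking: one lock per size class, so threads
     * only contend when they allocate chunks of the same class.
     * Not the musl implementation; names and mapping are made up. */
    #include <pthread.h>
    #include <stddef.h>

    struct chunk { struct chunk *next; size_t size; };

    struct bin {
        pthread_mutex_t lock;
        struct chunk   *head;   /* free chunks of this size class */
    };

    static struct bin bins[64]; /* assume the locks get initialized at startup */

    /* First bin whose chunks are guaranteed to fit n bytes
     * (made-up power-of-two classes; musl's real mapping is finer). */
    static int bin_index(size_t n)
    {
        int i = 0;
        while (i < 63 && (size_t)32 << i < n) i++;
        return i;
    }

    /* Grab a free chunk of a suitable class, if there is one.
     * A real allocator would go on to scan larger bins if this one
     * happens to be empty. */
    static struct chunk *take_free_chunk(size_t n)
    {
        struct bin *b = &bins[bin_index(n)];
        pthread_mutex_lock(&b->lock);
        struct chunk *c = b->head;
        if (c) b->head = c->next;
        pthread_mutex_unlock(&b->lock);
        return c;
    }

Two threads only contend if they happen to ask for the same size class
at the same time.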
That chiefly depends on the design of the programs. Mine tend to avoid
heap allocation wherever possible, and where that is impossible, to
coalesce allocations into a few large requests. However, a lot of
object-oriented code I have seen (even in C) allocates many small
objects. But that is the good case: for small sizes, many bins are
available. If a program makes random requests for anything between 16
and 128 bytes, there are four bins for that. Add 2k to each of those
requests and they all fall into a single bin.

>
> In any case, the fine-grained locking also created a problem I didn't
> anticipate when I designed it: when allocating memory that involves
> splitting a free chunk, or when freeing memory that will get merged
> with an adjacent free chunk, the operation as a whole is not atomic;
> rather, a large free chunk is consumed, possibly emptying the bin it
> lies in, split or merged, then returned either to the same or a
> different bin. By saying this operation is non-atomic, I mean other
> threads see the intermediate state where the large free chunk has been
> consumed but not yet returned, and when this happens, they
> unnecessarily split another free chunk or expand the heap rather than
> waiting for it to be returned. This yields bad, potentially
> catastrophic, fragmentation.
>

Fragmentation is bad with the current malloc() even in the
single-threaded case. A simple example: a request for 50 bytes followed
by a request for 2050 bytes. If the heap was empty before, the first
request will allocate a page from the OS. Assuming a 4k page, malloc
will split off the rest of the page and put it in the bin for "chunks
sized 2k - 4k". The second request, however, will be rounded up, and
since the bin for "chunks sized 4k - 8k" is still empty, two more pages
will be allocated from the system. Then the requested 2k and change
will be split off, leaving the heap with one free chunk just under 4k
in the "2k - 4k" bin and one free chunk of roughly 6k in the "4k - 8k"
bin. Three pages were allocated where one would have sufficed. (That is
one definition of fragmentation: a request could not be serviced even
though the resources were available.)

The benefit of this scheme, of course, is that allocation in the
single-threaded case takes bounded time: the algorithm is to pop off
the first chunk from the first non-empty bin large enough to satisfy
the request, or, failing that, to allocate the necessary memory from
the OS. In the worst case, an allocation consists of

- OS memory allocation,
- allocation of a chunk from a bin, and
- splitting that chunk.

Each of these steps is constant time (a sketch of this path follows
below). I am not sure that reducing fragmentation is worth giving up
this performance, especially on today's desktop and server systems,
where anything below 16GB of RAM just gets laughed off the stage
(slight exaggeration).
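In rough C, that path looks something like this. Again, this is only a
sketch of the scheme as described above, with made-up power-of-two size
classes, no locking and no free(); it is not the actual musl code:

    /* Sketch of the bounded-time allocation path described above.
     * Made-up size classes, no header kept for free(), no locking;
     * not the musl implementation. */
    #define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS */
    #include <stddef.h>
    #include <sys/mman.h>

    #define NBINS 64
    struct chunk { struct chunk *next; size_t size; };
    static struct chunk *bins[NBINS];   /* one free list per size class */

    /* First bin whose chunks are guaranteed to be >= n bytes. */
    static int bin_up(size_t n)
    {
        int i = 0;
        for (size_t s = (n - 1) / 32; s; s >>= 1) i++;
        return i < NBINS ? i : NBINS - 1;
    }

    /* Bin whose size range [32<<i, 64<<i) contains a chunk of size s. */
    static int bin_down(size_t s)
    {
        int i = -1;
        for (s /= 32; s; s >>= 1) i++;
        return i < NBINS ? i : NBINS - 1;
    }

    static void *alloc_sketch(size_t n)
    {
        if (n < 32) n = 32;
        n = (n + 31) & ~(size_t)31;      /* round up to chunk granularity */

        /* Pop the first chunk from the first big-enough bin: the scan
         * is bounded by NBINS, the unlink itself is O(1). */
        struct chunk *c = 0;
        for (int i = bin_up(n); i < NBINS && !c; i++) {
            c = bins[i];
            if (c) bins[i] = c->next;
        }

        /* All suitable bins empty: get fresh pages from the OS. */
        if (!c) {
            size_t len = (n + 4095) & ~(size_t)4095;
            c = mmap(0, len, PROT_READ|PROT_WRITE,
                     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            if (c == MAP_FAILED) return 0;
            c->size = len;
        }

        /* Split off the unused tail, O(1), and put it back in a bin. */
        if (c->size - n >= 64) {
            struct chunk *r = (struct chunk *)((char *)c + n);
            r->size = c->size - n;
            c->size = n;
            int j = bin_down(r->size);
            r->next = bins[j];
            bins[j] = r;
        }
        return c;  /* a real allocator would keep a header here for free() */
    }

Every step (the bounded bin scan, the pop and the split) is constant
time; only the mmap() in the empty-bin case does real work.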
[...]
>
> In the long term, I think the whole malloc design we're using now is
> wrong, and should be replaced, but until the time comes to actually do
> that, it may make sense to try to improve some of the worst
> properties, or even to reduce or eliminate the granularity of the
> locking if it proves not to be very valuable on real-world loads.
>

Care to elaborate on what is wrong with the design of the current
malloc? Everything you have named so far is an implementation issue,
not a design issue.

So far, I am also not impressed with the competition. dietlibc, for
instance, runs essentially the same algorithm, but with a big global
lock in the multi-threaded case. tcmalloc runs essentially the same
algorithm, but with an arena for each thread; therefore every chunk has
to know which arena it came from, which increases the overhead by one
pointer per chunk. But it is all fundamentally the same; it is not like
with sorting algorithms, where you get meaningful differences and
tradeoffs. No, the tradeoffs appear to lie entirely in the area I'd
call tuning: where to put the mmap threshold, how to lock, whether to
optimize for locality of reference and tiny allocations, all that
stuff. But as far as I can tell, they all suffer from the problem
described above. Well, OK, for 32-bit versions of dietlibc, 2050 bytes
is already beyond the mmap threshold. But if it weren't, my point would
still stand.

> Rich

Ciao,
Markus