From: Markus Wichmann
Subject: Re: What's wrong with musl's malloc
Date: Sun, 22 Apr 2018 21:34:50 +0200
To: musl@lists.openwall.com
Message-ID: <20180422193450.GA11638@voyager>
In-Reply-To: <20180420200904.GS3094@brightrain.aerifal.cx>

On Fri, Apr 20, 2018 at 04:09:04PM -0400, Rich Felker wrote:
> I've talked on and off about musl's malloc needing to be overhauled or
> replaced, and gaining the ability to experiment with that is one of
> the motivations for making malloc interposable/replaceable. Here's a
> brief write-up of what's wrong that needs to be addressed:
>

Yeah, I was about to ask for one.

>
> The main benefit of musl's malloc vs the standard dlmalloc algorithms
> it's based on is the fine-grained locking. As long as there are binned
> free chunks of various sizes available, threads calling malloc will
> only contend for a lock when they're requesting allocations of the
> same or a similar size. This works out well under artificial random
> loads; I'm not sure how much it helps on typical real-world loads.
>
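For anyone reading along who hasn't looked at the source: the scheme
boils down to roughly the following. This is only a simplified sketch
with made-up names, a made-up size-class mapping and 64 bins; it is not
the actual musl code, and lock initialization and the rest of the
allocator are omitted:

    /* Sketch of per-bin locking: one lock per size class, so threads
     * only contend when they allocate chunks of the same class.
     * Not the musl implementation; names and mapping are made up. */
    #include <pthread.h>
    #include <stddef.h>

    struct chunk { struct chunk *next; size_t size; };

    struct bin {
        pthread_mutex_t lock;
        struct chunk   *head;   /* free chunks of this size class */
    };

    static struct bin bins[64]; /* assume the locks get initialized at startup */

    /* First bin whose chunks are guaranteed to fit n bytes
     * (made-up power-of-two classes; musl's real mapping is finer). */
    static int bin_index(size_t n)
    {
        int i = 0;
        while (i < 63 && (size_t)32 << i < n) i++;
        return i;
    }

    /* Grab a free chunk of a suitable class, if there is one.
     * A real allocator would go on to scan larger bins if this one
     * happens to be empty. */
    static struct chunk *take_free_chunk(size_t n)
    {
        struct bin *b = &bins[bin_index(n)];
        pthread_mutex_lock(&b->lock);
        struct chunk *c = b->head;
        if (c) b->head = c->next;
        pthread_mutex_unlock(&b->lock);
        return c;
    }

Two threads only contend if they happen to ask for the same size class
at the same time.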
That chiefly depends on the design of the programs. Mine tend to avoid
heap allocation wherever possible, and where that is impossible, to
coalesce allocations into a few large requests. However, a lot of
object-oriented code I have seen (even in C) allocates many small
objects. But that is the good case: for small sizes, many bins are
available. If a program makes random requests for anything between 16
and 128 bytes, there are four bins for that. Add 2k to each of those
requests and they all fall into a single bin.

>
> In any case, the fine-grained locking also created a problem I didn't
> anticipate when I designed it: when allocating memory that involves
> splitting a free chunk, or when freeing memory that will get merged
> with an adjacent free chunk, the operation as a whole is not atomic;
> rather, a large free chunk is consumed, possibly emptying the bin it
> lies in, split or merged, then returned either to the same or a
> different bin. By saying this operation is non-atomic, I mean other
> threads see the intermediate state where the large free chunk has been
> consumed but not yet returned, and when this happens, they
> unnecessarily split another free chunk or expand the heap rather than
> waiting for it to be returned. This yields bad, potentially
> catastrophic, fragmentation.
>

Fragmentation is bad with the current malloc() even in the
single-threaded case. A simple example: a request for 50 bytes followed
by a request for 2050 bytes. If the heap was empty before, the first
request will allocate a page from the OS. Assuming a 4k page, malloc
will split off the rest of the page and put it in the bin for "chunks
sized 2k - 4k". The second request, however, will be rounded up, and
since the bin for "chunks sized 4k - 8k" is still empty, two more pages
will be allocated from the system. Then the requested 2k and change
will be split off, leaving the heap with one free chunk just under 4k
in the "2k - 4k" bin and one free chunk of roughly 6k in the "4k - 8k"
bin. Three pages were allocated where one would have sufficed. (That is
one definition of fragmentation: a request could not be serviced even
though the resources were available.)

The benefit of this scheme, of course, is that allocation in the
single-threaded case takes bounded time: the algorithm is to pop off
the first chunk from the first non-empty bin large enough to satisfy
the request, or, failing that, to allocate the necessary memory from
the OS. In the worst case, an allocation consists of

- OS memory allocation,
- allocation of a chunk from a bin, and
- splitting that chunk.

Each of these steps is constant time (a sketch of this path follows
below). I am not sure that reducing fragmentation is worth giving up
this performance, especially on today's desktop and server systems,
where anything below 16GB of RAM just gets laughed off the stage
(slight exaggeration).
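In rough C, that path looks something like this. Again, this is only a
sketch of the scheme as described above, with made-up power-of-two size
classes, no locking and no free(); it is not the actual musl code:

    /* Sketch of the bounded-time allocation path described above.
     * Made-up size classes, no header kept for free(), no locking;
     * not the musl implementation. */
    #define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS */
    #include <stddef.h>
    #include <sys/mman.h>

    #define NBINS 64
    struct chunk { struct chunk *next; size_t size; };
    static struct chunk *bins[NBINS];   /* one free list per size class */

    /* First bin whose chunks are guaranteed to be >= n bytes. */
    static int bin_up(size_t n)
    {
        int i = 0;
        for (size_t s = (n - 1) / 32; s; s >>= 1) i++;
        return i < NBINS ? i : NBINS - 1;
    }

    /* Bin whose size range [32<<i, 64<<i) contains a chunk of size s. */
    static int bin_down(size_t s)
    {
        int i = -1;
        for (s /= 32; s; s >>= 1) i++;
        return i < NBINS ? i : NBINS - 1;
    }

    static void *alloc_sketch(size_t n)
    {
        if (n < 32) n = 32;
        n = (n + 31) & ~(size_t)31;      /* round up to chunk granularity */

        /* Pop the first chunk from the first big-enough bin: the scan
         * is bounded by NBINS, the unlink itself is O(1). */
        struct chunk *c = 0;
        for (int i = bin_up(n); i < NBINS && !c; i++) {
            c = bins[i];
            if (c) bins[i] = c->next;
        }

        /* All suitable bins empty: get fresh pages from the OS. */
        if (!c) {
            size_t len = (n + 4095) & ~(size_t)4095;
            c = mmap(0, len, PROT_READ|PROT_WRITE,
                     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            if (c == MAP_FAILED) return 0;
            c->size = len;
        }

        /* Split off the unused tail, O(1), and put it back in a bin. */
        if (c->size - n >= 64) {
            struct chunk *r = (struct chunk *)((char *)c + n);
            r->size = c->size - n;
            c->size = n;
            int j = bin_down(r->size);
            r->next = bins[j];
            bins[j] = r;
        }
        return c;  /* a real allocator would keep a header here for free() */
    }

Every step (the bounded bin scan, the pop and the split) is constant
time; only the mmap() in the empty-bin case does real work.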
[...]
>
> In the long term, I think the whole malloc design we're using now is
> wrong, and should be replaced, but until the time comes to actually do
> that, it may make sense to try to improve some of the worst
> properties, or even to reduce or eliminate the granularity of the
> locking if it proves not to be very valuable on real-world loads.
>

Care to elaborate on what is wrong with the design of the current
malloc? Everything you have named so far is an implementation issue,
not a design issue.

So far, I am also not impressed with the competition. dietlibc, for
instance, runs essentially the same algorithm, but with a big global
lock in the multi-threaded case. tcmalloc runs essentially the same
algorithm, but with an arena for each thread; therefore every chunk has
to know which arena it came from, which increases the overhead by one
pointer per chunk. But it is all fundamentally the same; it is not like
with sorting algorithms, where you get meaningful differences and
tradeoffs. No, the tradeoffs appear to lie entirely in the area I'd
call tuning: where to put the mmap threshold, how to lock, whether to
optimize for locality of reference and tiny allocations, all that
stuff. But as far as I can tell, they all suffer from the problem
described above. Well, OK, for 32-bit versions of dietlibc, 2050 bytes
is already beyond the mmap threshold. But if it weren't, my point would
still stand.

> Rich

Ciao,
Markus