From: Michael Clark
Subject: Re: What's wrong with musl's malloc
Date: Mon, 23 Apr 2018 09:40:08 +1200
To: musl@lists.openwall.com
In-reply-to: <20180420200904.GS3094@brightrain.aerifal.cx>

> On 21/04/2018, at 8:09 AM, Rich Felker wrote:
>
> I've talked on and off about musl's malloc needing to be overhauled or
> replaced, and gaining the ability to experiment with that is one of
> the motivations for making malloc interposable/replaceable. Here's a
> brief write-up of what's wrong that needs to be addressed:
>
> The main benefit of musl's malloc vs the standard dlmalloc algorithms
> it's based on is the fine-grained locking. As long as there are binned
> free chunks of various sizes available, threads calling malloc will
> only contend for a lock when they're requesting allocations of same or
> similar size. This works out well under artificial random loads; I'm
> not sure how much it helps on typical real-world loads.
>
> In any case, the fine-grained locking also created a problem I didn't
> anticipate when I designed it: when allocating memory that involves
> splitting a free chunk, or when freeing memory that will get merged
> with an adjacent free chunk, the operation as a whole is not atomic;
> rather, a large free chunk is consumed, possibly emptying the bin it
> lies in, split or merged, then returned either to the same or a
> different bin. By saying this operation is non-atomic, I mean other
> threads see the intermediate state where the large free chunk has been
> consumed but not yet returned, and when this happens, they
> unnecessarily split another free chunk or expand the heap rather than
> waiting for it to be returned. This yields bad, potentially
> catastrophic, fragmentation.
>
> The malloc side already partially mitigates this with the "pretrim"
> function, which is used whenever the leftover part after splitting
> would remain in the same bin it was already in. In particular this
> covers splitting small allocations off a large "top of heap" zone, but
> it doesn't help with splitting small chunks. And the free side is much
> worse -- large free zones momentarily disappear every time something
> adjacent to them is freed.
>
> In order to overcome this fully while maintaining the fine-grained
> locking, we'd need the ability to hold multiple locks concurrently,
> which would require a protocol for lock acquisition order to avoid
> deadlock. But there may be cheap mitigations that could be made in
> free like on the malloc side. For instance it might be possible to
> make a simpler protocol whereby we could always hold the lock for bin
> 63, thereby avoiding exposing a state where large free zones are
> unavailable. If there's a way to make this work, moving the binmap
> masking for newly-emptied bins from unbin() to unlock_bin() would
> ensure that malloc doesn't "see" a "temporary empty" state.
>
> In the long term, I think the whole malloc design we're using now is
> wrong, and should be replaced, but until the time comes to actually do
> that, it may make sense to try to improve some of the worst
> properties, or even to reduce or eliminate the granularity of the
> locking if it proves not to be very valuable on real-world loads.

Hi Rich,
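If I'm reading the write-up right, the problematic sequence is roughly the
one below. This is just a toy restatement to make sure I follow -- the names
(toy_alloc, bin_for, the 64-entry bins array) are invented for illustration
and are not musl's actual internals:

/* Toy restatement of the window -- NOT musl's code; names are invented.
 * One mutex per size bin; consuming a chunk, splitting it, and putting
 * the remainder back happen under separate lock acquisitions. */
#include <pthread.h>
#include <stddef.h>

struct chunk { struct chunk *next; size_t size; };
struct bin   { pthread_mutex_t lock; struct chunk *head; };

/* assume bins[] and their mutexes get initialized at startup */
static struct bin bins[64];

/* stand-in for a real size-to-bin mapping */
static struct bin *bin_for(size_t size) { return &bins[size & 63]; }

struct chunk *toy_alloc(size_t want)
{
    struct bin *b = bin_for(want);

    pthread_mutex_lock(&b->lock);
    struct chunk *c = b->head;          /* consume a free chunk...       */
    if (c) b->head = c->next;           /* ...the bin may now look empty */
    pthread_mutex_unlock(&b->lock);

    if (!c) return NULL;                /* caller would go split another
                                           chunk or expand the heap      */

    /* window: the leftover is invisible to every other thread, so a
       concurrent caller that finds this bin empty splits some other
       large chunk or grows the heap instead of waiting */

    if (c->size > want + sizeof(struct chunk)) {
        struct chunk *rest = (struct chunk *)((char *)c + want);
        rest->size = c->size - want;
        c->size = want;

        struct bin *b2 = bin_for(rest->size);
        pthread_mutex_lock(&b2->lock);
        rest->next = b2->head;          /* leftover only reappears here  */
        b2->head = rest;
        pthread_mutex_unlock(&b2->lock);
    }
    return c;
}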
Have you considered importing jemalloc?

It has been battle-tested for 12 years in FreeBSD and in many other apps. It's
bundled and used in MariaDB, which is undoubtedly a demanding user, and
presumably Monty approved the choice of allocator, so it shouldn't be a bad
choice. I'd trust the allocator choice of a well-respected database architect.

From looking at the Lockless, Inc. benchmarks, jemalloc has good performance
from very small to very large allocations. From the Lockless Inc. memory
allocator comparisons: "The disadvantage of the jemalloc allocator is its
memory usage. It uses power-of-two sized bins, which leads to a greatly
increased memory footprint compared to other allocators. This can affect
real-world performance due to excess cache and TLB misses."

From the wiki:

- https://github.com/jemalloc/jemalloc/wiki/Background

"jemalloc is a general purpose malloc(3) implementation that emphasizes
fragmentation avoidance and scalable concurrency support. It is intended for
use as the system-provided memory allocator, as in FreeBSD's libc library, as
well as for linking into C/C++ applications. jemalloc provides many
introspection, memory management, and tuning features beyond the standard
allocator functionality."

It's a libc allocator.

Here are some of the major users (who, it's fair to say, must have exercised
due consideration in choosing their memory allocator, given the list includes
a browser (Firefox's mozjemalloc fork), several databases, and caches):

- FreeBSD
- NetBSD
- Mozilla Firefox
- Varnish
- Cassandra
- Redis
- hhvm
- MariaDB
- Aerospike
- Android

I'm curious how big the linked code is... assuming code size is an important
consideration.

Looking at the code, it doesn't look too big, and it's interesting to note
that the recent changes to glibc to support per-thread pools use the same
tcache feature name as jemalloc, whose per-thread pool implementation is in
tcache.c:

- https://sourceware.org/ml/libc-alpha/2017-01/msg00452.html
- https://sourceware.org/ml/libc-alpha/2017-01/msg00524.html

musl libc could vendor the jemalloc code with some sed scripts to modify
includes so that jemalloc fits into the musl build structure and can be
easily synced with upstream. Ideally musl would want to import/vendor rather
than use a submodule so musl remains stand-alone. I assume many projects have
done the same. MariaDB bundles it. Firefox forked it.

I think the primary disadvantage of jemalloc is its power-of-2 size bins.
Non-power-of-2 strides for small structures have better caching performance
for many workloads, due to the more stochastic cache eviction you get when
strides are not a power of 2. I've noticed that power-of-2 alignments can
have pathological performance in some cases, e.g. an array with
sizeof(struct) == 28 or 36 performing much better than one with
sizeof(struct) == 32. That said, most apps would use a single malloc call for
these kinds of performance-critical arrays, so it's only an issue for apps
that individually malloc many small objects. Worth considering.
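To make the stride point concrete, here's the shape of experiment I mean -- a
throwaway microbenchmark, not code from musl or jemalloc, and the constants
(K, n, the record sizes) are arbitrary. It visits one field in every 128th
record, so with a 32-byte record the byte stride between accesses is a power
of two and the accesses pile onto a few cache sets, while 28- and 36-byte
records spread across more sets. How big a difference that makes is very
machine-dependent:

/* Throwaway microbenchmark sketch: visit one 4-byte field in every K-th
 * record of a large array, for record sizes 28, 32 and 36. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define K 128          /* records skipped between consecutive accesses */

static double walk(const unsigned char *base, size_t size, size_t n, int reps)
{
    volatile uint32_t sink = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (size_t j = 0; j < K; j++)
            for (size_t i = j; i < n; i += K)
                sink += *(const uint32_t *)(base + i * size);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    const size_t n = 1 << 20;                  /* number of records */
    const size_t sizes[] = { 28, 32, 36 };
    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++) {
        unsigned char *a = calloc(n, sizes[s]);
        if (!a) return 1;
        for (size_t i = 0; i < n; i++)         /* touch every record once */
            *(uint32_t *)(a + i * sizes[s]) = (uint32_t)i;
        printf("record size %2zu: %.3fs\n", sizes[s], walk(a, sizes[s], n, 20));
        free(a);
    }
    return 0;
}

This only models the array-walk case, of course; the allocator version of the
problem is many individually malloc'd small objects landing at
power-of-two-related offsets.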
BTW - I owe you an update on the riscv musl port. I'll post one soon... I've
been busy on the QEMU RISC-V port, whose 'virt' board has recently been used
by Fedora and Debian distro porters for riscv64 arch support (glibc-based
distros). Stefan O'Rear got MTTCG working on qemu/target/riscv so we can run
SMP Linux in QEMU with reasonable performance. Debian, for example, now has
QEMU RISC-V builders for the riscv64 arch.

- https://github.com/riscv/riscv-qemu/wiki

It will be nice to get the riscv musl support upstream. The major issue at
present is ELF thread-local storage (pthreads are working but GCC's __thread
is not). The RISC-V TLS model is not documented, so it requires some reverse
engineering of the riscv support in GCC/binutils/glibc. I haven't been able
to isolate the issue and could benefit from some help from someone more
knowledgeable in that area (ELF TLS and architecture-specific TLS models).

Regards,
Michael