mailing list of musl libc
From: Rich Felker <dalias@libc.org>
To: baiyang <baiyang@gmail.com>
Cc: musl <musl@lists.openwall.com>
Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)
Date: Mon, 19 Sep 2022 09:43:19 -0400
Message-ID: <20220919134319.GN9709@brightrain.aerifal.cx>
In-Reply-To: <2022091915532777412615@gmail.com>

On Mon, Sep 19, 2022 at 03:53:30PM +0800, baiyang wrote:
> Hi there,
> 
> As we have discussed at
> https://github.com/openwrt/openwrt/issues/10752. The
> malloc_usable_size() function in musl 1.2 (mallocng) seems to have
> some performance issues.
> 
> It causes realloc and free to spend too much time getting the chunk size.
> 
> As we mentioned in the discussion, tcmalloc and some other
> allocators can also accurately obtain the size class corresponding
> to a memory block and its precise size, and do so very quickly.
> 
> Can we improve the existing malloc_usable_size algorithm in
> mallocng? This should significantly improve mallocng's performance.

Can you please start by identifying the real-world case in which
you're hitting a performance degradation? Made-up tests are generally
not helpful and will almost always lead to focusing on the wrong
problem.

For now I'm going to focus on some things from the linked thread:

> > Considering that realloc itself contains a complete
> > malloc_usable_size (refer to here and here), most (66.7%) of the
> > realloc time is actually spent doing malloc_usable_size.

In your test that increments the realloc size by one each iteration,
only one in every PAGESIZE calls has any real work to do. The rest do
nothing but set_size after obtaining the metadata on the object
they're acting on. It's completely expected that the runtime of these
will be dominated by obtaining the metadata; this isn't evidence of
anything wrong. Moreover, it's almost surely a lot more than
66.7%. Most of the 0.8s difference is likely spent on the 2560 mmap
syscalls and page faults accessing the new pages they produce.
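
To be concrete, the shape of test I understand you're describing is
roughly the following sketch (the starting size and growth bound here
are placeholders of mine, not values taken from your test):

  #include <stdlib.h>

  int main(void)
  {
      void *p = malloc(1);
      /* grow the allocation one byte at a time */
      for (size_t n = 2; p && n <= 10*1024*1024; n++) {
          void *q = realloc(p, n);
          if (!q) break;
          p = q;
      }
      /* only roughly one call per page of growth has to move or
         extend the allocation; every other call just re-derives the
         metadata and updates the stored size */
      free(p);
      return 0;
  }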

> > In implementations such as glibc, tcmalloc, the Microsoft CRT
> > (_msize), Mac OS X (malloc_size), and musl 1.1, even on low-end
> > embedded processors, the cost of malloc_usable_size per 10 million
> > calls is mostly no more than a few hundred milliseconds.

It looks like mallocng's malloc_usable_size is taking around 150 ns
per call on your system, vs maybe 30-50 ns for others?
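
(Those numbers are rough estimates. If you want a direct measurement,
a minimal timing loop along these lines would give one; the allocation
size and call count below are arbitrary choices of mine, not anything
taken from your test:)

  #define _POSIX_C_SOURCE 200809L
  #include <malloc.h>   /* malloc_usable_size */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void)
  {
      enum { N = 10000000 };
      void *p = malloc(1000);
      if (!p) return 1;
      size_t sink = 0;   /* keeps the calls from being optimized out */
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < N; i++)
          sink += malloc_usable_size(p);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      double ns = ((t1.tv_sec - t0.tv_sec)*1e9
                   + (t1.tv_nsec - t0.tv_nsec)) / N;
      printf("%.1f ns/call (usable size %zu)\n", ns, sink/N);
      free(p);
      return 0;
  }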

> > In addition, this very slow slab size acquisition algorithm also
> > needs to be called on every free (see here). So we believe it is
> > the main reason for the malloc/free and realloc performance
> > degradation in version 1.2.

Unless you have an application that's explicitly using
malloc_usable_size all over the place, it's highly unlikely that this
is the cause of your real-world performance problems. The vast
majority of reported problems with malloc performance have been in
multithreaded applications, where the dominating time cost is
fundamental: the synchronization cost of maintaining global
consistency. There
you'll expect to find very similar performance figures from any other
allocator with global consistency, such as hardened_malloc.

If you're really having single-threaded performance problems that
aren't just in made-up benchmarks, please see if you can narrow down
the cause empirically rather than speculatively. For example, run the
program under perf and look at where the time is being spent.

> > If we can improve its speed and make it close to implementations
> > like tcmalloc (tcmalloc can also accurately return the size of the
> > size class to which the chunk belongs), it should significantly
> > improve the performance of mallocng (at least in single-threaded
> > scenarios).

tcmalloc is fast by not having global consistency, not being hardened
against memory errors like double-free and use-after-free, and not
avoiding fragmentation and excessive memory usage. Likewise for most
of the others. The run-time costs in mallocng for looking up the
out-of-band metadata are largely fundamental to it being out-of-band
(not subject to direct falsification via typically exploitable
application bugs), size-efficient, 32-bit-compatible,
nommu-compatible, etc. Other approaches, like the one in
hardened_malloc, can access the metadata somewhat more efficiently,
at the price of not being at all amenable to small systems, whose
support is a core goal of musl that we can't really disregard.
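
To illustrate the cost structure only (this is a deliberately
simplified sketch, not mallocng's actual code, layout, or checks),
compare what the two kinds of lookup have to do:

  #include <stddef.h>
  #include <stdint.h>
  #include <stdlib.h>

  /* In-band style: the usable size is a header word sitting right
     next to the allocation, so the lookup is one load, but a heap
     overflow in the application can overwrite that word and the
     allocator will believe the forged value. */
  size_t inband_usable_size(void *p)
  {
      return ((size_t *)p)[-1];
  }

  /* Out-of-band style (hypothetical structures, purely for
     illustration): follow an offset from the slot back to its group,
     then a pointer to metadata the allocator keeps outside the
     allocations, and cross-check the two before trusting the size.
     Several dependent loads plus a branch instead of one load. */
  struct meta  { void *group_base; uint32_t stride; };
  struct group { struct meta *meta; };

  size_t oob_usable_size(void *p)
  {
      uint16_t off = ((uint16_t *)p)[-1];          /* slot -> group */
      struct group *g = (struct group *)((char *)p - off);
      struct meta *m = g->meta;                    /* group -> meta */
      if (m->group_base != g)                      /* consistency check */
          abort();
      return m->stride;
  }

The real design has more to derive and verify than this, but the point
is just that the extra indirections and validation are inherent to
keeping the metadata where application bugs can't forge it.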

I can't say for sure that there isn't any room for optimization in
the metadata fetching, though. Looking at the assembly output might be
informative, to see if we're doing anything that's making the compiler
emit gratuitously inefficient code.

Rich
