Best Regards BaiYang baiyang@gmail.com http://i.baiy.cn |
From: Rich FelkerDate: 2022-09-19 21:43To: baiyangCC: muslSubject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)On Mon, Sep 19, 2022 at 03:53:30PM +0800, baiyang wrote:> Hi there,>> As we have discussed at> https://github.com/openwrt/openwrt/issues/10752. The> malloc_usable_size() function in musl 1.2 (mallocng) seems to have> some performance issues.>> It caused realloc and free spends too long time for get the chunk size.>> As we mentioned in the discussion, tcmalloc and some other> allocators can also accurately obtain the size class corresponding> to a memory block and its precise size, and it is also very fast at> the same time.>> Can we make some improvements to the existing malloc_usable_size> algorithm in mallocng? This should significantly improve the> performance of existing algorithms.Can you please start from a point of identifying the real-world casein which you're hitting a performance degredation? Made-up tests aregenerally not helpful and will almost always lead to focusing on thewrong problem.For now I'm going to focus on some things from the linked thread:> > Considering that realloc itself contains a complete> > malloc_usable_size (refer to here and here), So actually most> > (66.7%) of the realloc time is spent doing malloc_usable_size.In your test that increments the realloc size by one each iteration,only one in every PAGESIZE calls has any real work to do. The rest donothing but set_size after obtaining the metadata on the objectthey're acting on. It's completely expected that the runtime of thesewill be dominated by obtaining the metadata; this isn't evidence ofanything wrong. And, moreover, it's almost surely a lot more than66.7%. Most of the 0.8s difference is likely spent on the 2560 mmapsyscalls and page faults accessing the new pages they produce.> > In implementations such as: glibc, tcmalloc, msw crt (_msize), mac> > os x (malloc_size), and musl 1.1, even on low-end embedded> > processors, the consumption of malloc_usable_size per 10 million> > calls is mostly not more than a few hundred milliseconds.It looks like mallocng's malloc_usable_size is taking around 150 nsper call on your system, vs maybe 30-50 for others?> > In addition, this very slow slab size acquisition algorithm also> > needs to be called every time free (see here). So we believe it> > should be the main reason for malloc/free and realloc performance> > degradation in version 1..2.Unless you have an application that's explicitly usingmalloc_usable_size all over the place, it's highly unlikely that thisis the cause of your real-world performance problems. The vastmajority of reported problems with malloc performance have been inmultithreaded applications, where the dominating time cost isfundamental: synchronization cost of having global consistency. Thereyou'll expect to find very similar performance figures from any otherallocator with global consistency, such as hardened_malloc.If you're really having single-threaded performance problems thataren't just in made-up benchmarks, please see if you can narrow downthe cause empirically rather than speculatively. For example, runningthe program under perf and looking at where the time is being spent.> > If we can improve its speed and make it close to implementations> > like tcmalloc (tcmalloc can also accurately return the size of the> > size class to which the chunk belongs), it should significantly> > improve the performance of mallocng (at least in single-threaded> > scenarios) .tcmalloc is fast by not having global consistency, not being hardenedagainst memory errors like double-free and use-after-free, and notavoiding fragmentation and excessive memory usage. Likewise for mostof the others. The run time costs in mallocng for looking up theout-of-band metadata are largely fundamental to it being out-of-band(not subject to direct falsification via typically exploitableapplication bugs), size-efficient, 32-bit-compatible,nommu-compatible, etc. Other approaches like in hardened_malloc can bemoderately more efficient to access the metadata, at the price of notbeing at all amenable to small systems, which are a core goal of muslwe can't really disregard.I can't say for sure there's not any room for optimization in themetadata fetching though. Looking at the assembly output might beinformative, to see if we're doing anything that's making the compileremit gratuitously inefficient code.Rich