Best Regards
BaiYang
baiyang@gmail.com
http://i.baiy.cn
From: Rich Felker
Date: 2022-09-20 13:41
To: baiyang
CC: musl
Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)

On Tue, Sep 20, 2022 at 11:53:52AM +0800, baiyang wrote:
> > The ones that return some value larger than the requested size are
> > returning "the requested size, rounded up to a multiple of 16" or
> > similar. Not "the requested size plus 1500 bytes".
> ...
> > They don't return 8100. They return something like 6608 or 6624.
>
> No, AFAIK, there are many allocators whose return value of
> malloc_usable_size is 1KB (or more) larger than the requested value
> at malloc time.
> For example: if you do "void* p = malloc(6700)" on tcmalloc, then
> "malloc_usable_size(p)" will return **8192**. Far more than just
> "rounded up to a multiple of 16".

OK, thanks for checking and correcting.

> > This does not follow at all. tcmalloc is fast because it does not have
> > global consistency, does not have any notable hardening, and (see the
> > name) keeps large numbers of freed slots *cached* to reuse, thereby
> > using lots of extra memory. Its malloc_usable_size is not fast because
> > of returning the wrong value, if it even does return the wrong value
> > (I have no idea).
>
> We don't need to refer to these features of tcmalloc, we only need
> to refer to its malloc_usable_size algorithm.

Those (mis)features are what provide a fast path here, regardless of
whether you care about them.

> > It's fast because they store the size in-band right
> > next to the allocated memory and trust that it's valid, rather than
> > computing it from out-of-band metadata that is not subject to
> > falsification unless the attacker already has nearly full control of
> > execution.
>
> No, if I understand correctly, tcmalloc doesn't store the size
> in-band right next to the allocated memory.
> On the contrary, when
> executing malloc_usable_size(p) (actually GetSize(p)), it will first
> find the size class corresponding to p through a quick lookup table,
> and then return the length of the size class. See:
> https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099

OK, I was confusing tcmalloc with the more conventional "thread-local
freelist caching on top of dlmalloc type base" allocator strategy.
tcmalloc, however, is indeed one of the gigantic ones.

> My understanding: the biggest impediment to our inability to apply
> similar optimizations is that we have to return 6700, not 8192 (of
> course, you've denied this is the reason).

Your understanding is wrong. I've told you how you can measure that
it's wrong. You insist on being stuck on it for no good reason.

If you want to understand *why* tcmalloc is different, start with the
comments at the top of the file you linked:

> // 4. The pagemap (which maps from page-number to descriptor),
> //    can be read without holding any locks, and written while holding
> //    the "pageheap_lock".
> //
> // This multi-threaded access to the pagemap is safe for fairly
> // subtle reasons. We basically assume that when an object X is
> // allocated by thread A and deallocated by thread B, there must
> // have been appropriate synchronization in the handoff of object
> // X from thread A to thread B.

This is the kind of thing I mean by lack of global consistency (no
synchronization around access to these data structures) and lack of
any meaningful hardening (*assuming* no memory lifetime usage errors
in the calling application).

The GetSize function you cited uses this global pagemap to go straight
from a page address to a sizeclass, via what amounts to a two-level or
three-level table indexed by upper bits of the address (the comment
says the 3-level form is only used in a slower but lower-memory-use
configuration). These tables, at least in the 2-level form, are
utterly *massive*.
I'm not sure if it creates them PROT_NONE and then only instantiates
real memory for the (initially fairly sparse) parts that get used, or
if it just allocates these giant things relying on overcommit, but
either way that's not compatible with small-memory-space systems or
with nommu.

On top of that, this approach relies on laying out whole pages (likely
large slabs of many pages at a time) of identical-sized objects so
that the size and other properties can be looked up by page number. I
have not looked into the details of "how bad" it gets, but it
completely precludes having any small processes, and precludes
promptly returning freed memory to the system, since *changing* the
pagemap is going to be costly and they're going to avoid doing it
(note the above comment on locking).

mallocng does not have any global mapping optimizing translation from
addresses to groups/metadata objects. Because we insist on global
consistency (a prerequisite for being able to define strong hardening
properties) and on being able to return freed memory promptly to the
system, maintaining such a data structure would cost a lot more time
(performance) than anything it could give, and it would make lock-free
operations (like your malloc_usable_size, or trivial realloc calls)
potentially require locking.

Instead of using the numeric value of the address to map to metadata,
we chase offsets from the object base address to the metadata, then
validate that it round-trips back to conclude that we didn't just
follow random junk from the caller passing an invalid/dangling pointer
in, or from this data being overwritten via heap-based buffer
overflows.

Fundamentally, this pointer chasing is going to be a little bit more
expensive than just using address bits as table indices, but not
really all that much. At least half of the cost difference, and
probably a lot more, is not the pointer/offset chasing but the
validation (hardening). If hypothetically you wanted to turn that all
off (e.g.
by defining the assert macro to a no-op) you could have it
be a lot faster, and still have low memory usage too. I'm not sure,
but for single-threaded loads I would not be surprised if it were
getting close to tcmalloc speed. Just casually building with assert()
defined to nop out the tests, I got double performance on your TEST2.
Of course, I don't recommend doing this. But it's an interesting test
for performing *measurement* (which you so far refuse to do) of what's
actually making the performance differences.

> On the other hand, if the low speed is not caused by having to
> return 6700, then we should be able to use a similar quick lookup
> table optimization ("tc_globals.pagemap().sizeclass(p)") to achieve
> at least dozens of times performance improvement.

Once again, the big difference is not the "6700". The
tc_globals.pagemap().sizeclass(p) in tcmalloc corresponds to lines
6-11 of malloc_usable_size in mallocng, not line 12, and the bulk of
the work here is in lines 6-11, mainly lines 7 and 10. I did a similar
casual test removing line 12 and just returning something based on the
earlier computations, and it made something like a 30% reduction in
test run time (with or without the hardening asserts nopped out).

Rich