From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=5.0 tests=MAILING_LIST_MULTI, RCVD_IN_MSPIKE_H2 autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 14556 invoked from network); 20 Sep 2022 05:42:06 -0000 Received: from second.openwall.net (193.110.157.125) by inbox.vuxu.org with ESMTPUTF8; 20 Sep 2022 05:42:06 -0000 Received: (qmail 22470 invoked by uid 550); 20 Sep 2022 05:42:03 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Reply-To: musl@lists.openwall.com Received: (qmail 22430 invoked from network); 20 Sep 2022 05:42:02 -0000 Date: Tue, 20 Sep 2022 01:41:50 -0400 From: Rich Felker To: baiyang Cc: musl Message-ID: <20220920054149.GK9709@brightrain.aerifal.cx> References: <2022092008254998320584@gmail.com> <20220920003811.GF9709@brightrain.aerifal.cx> <2022092008470636285288@gmail.com> <20220920010056.GG9709@brightrain.aerifal.cx> <2022092009180277847194@gmail.com> <20220920021511.GH9709@brightrain.aerifal.cx> <20220920103500598557106@gmail.com> <20220920032806.GI9709@brightrain.aerifal.cx> <20220920115350521974120@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20220920115350521974120@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Subject: Re: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Tue, Sep 20, 2022 at 11:53:52AM +0800, baiyang wrote: > > The ones that return some value larger than the requested size are > > returning "the requested size, rounded up to a multiple of 16" or > > similar. Not "the requested size plus 1500 bytes". > ... > > They don't return 8100. They return something like 6608 or 6624. > > No, AFAIK, There are many allocators whose return value of > malloc_usable_size is 1KB (or more) larger than the requested value > at malloc time. > For Example: if you do "void* p = malloc(6700)" on tcmalloc, then > "malloc_usable_size(p)" will return **8192**. Far more than just > "rounded up to a multiple of 16". OK, thanks for checking and correcting. > > This does not follow at all. tcmalloc is fast because it does not have > > global consistency, does not have any notable hardening, and (see the > > name) keeps large numbers of freed slots *cached* to reuse, thereby > > using lots of extra memory. Its malloc_usable_size is not fast because > > of returning the wrong value, if it even does return the wrong value > > (I have no idea). > > We don't need to refer to these features of tcmalloc, we only need > to refer to its malloc_usable_size algorithm. Those (mis)features are what provide a fast path here, regardless of whether you care about them. > > It's fast because they store the size in-band right > > next to the allocated memory and trust that it's valid, rather than > > computing it from out-of-band metadata that is not subject to > > falsification unless the attacker already has nearly full control of > > execution. > > No, if I understand correctly, tcmalloce doesn't store the size > in-band right next to the allocated memory. On the contrary, when > executing malloc_usable_size(p) (actually GetSize(p)), it will first > find the size class corresponding to p through a quick lookup table, > and then return the length of the size class. See: > https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099 OK, I was confusing tcmalloc with the more conventional "thread-local freelist caching on top of dlmalloc type base" allocator strategy. Indeed tcmalloc however is one of the gigantic ones. > My understanding: the biggest impediment to our inability to apply > similar optimizations is that we have to return 6700, not 8192 (of > course, you've denied this is the reason). Your understanding is wrong. I've told you how you can measure that it's wrong. You insist on being stuck on it for no good reason. If you want to understand *why* tcmalloc is different, start with the comments at the top of the file you linked: > // 4. The pagemap (which maps from page-number to descriptor), > // can be read without holding any locks, and written while holding > // the "pageheap_lock". > // > // This multi-threaded access to the pagemap is safe for fairly > // subtle reasons. We basically assume that when an object X is > // allocated by thread A and deallocated by thread B, there must > // have been appropriate synchronization in the handoff of object > // X from thread A to thread B. This is the kind of thing I mean by lack of global consistency (no synchronization around access to these data structures) and lack of any meaningful hardening (*assuming* no memory lifetime usage errors in the calling application). The GetSize function you cited uses this global pagemap to go straight from a page address to a sizeclass, via what amounts to a two-level or three-level table indexed by upper bits of the address (comment says 3-level is only used in slower but lower-mem-use configuration). These tables, at least in the 2-level form, are utterly *massive*. I'm not sure if it creates them PROT_NONE and then only instantiates real memory for the (initially fairly sparse) parts that get used, or if it just allocates these giant things relying on overcommit, but either way that's not compatible with small-memory-space systems or with nommu. On top of that, this approach relies on laying out whole pages (likely large slabs of many pages at a time) of identical-sized objects so that the size and other properties can be looked up by page number. I have not looked into the details of "how bad" it gets, but it completely precludes having any small processes, and precludes promptly returning freed memory to the system, since *changing* the pagemap is going to be costly and they're going to avoid doing it (note the above comment on locking). mallocng does not have any global mapping optimizing translation from addresses to groups/metadata objects. Because we insist on global consistency (a prerequisite for being able to define strong hardening properties) and on being able to return freed memory promptly to the system, maintaining such a data structure would cost a lot more time (performance) than anything it could give, and it would make lock-free operations (like your malloc_usable_size, or trivial realloc calls) potentially require locking. Instead of using the numeric value of the address to map to metadata, we chase offsets from the object base address to the metadata, then validate that it round-trips back to conclude that we didn't just follow random junk from the caller passing an invalid/dangling pointer in, or from this data being overwritten via heap-based buffer overflows. Fundamentally, this pointer chasing is going to be a little bit more expensive than just using address bits as table indices, but not really all that much. At least half of the cost difference, and probably a lot more, is not the pointer/offset chasing but the validation (hardening). If hypothetically you wanted to turn that all off (e.g. by defining the assert macro to a no-op) you could have it be a lot faster, and still have low memory usage too. I'm not sure but for single-threaded loads I would not be surprised if it were getting close to tcmalloc speed. Just casually building with assert() defined to nop out the tests, I got double performance on your TEST2. Of course, I don't recommend doing this. But it's an interesting test for performing *measurement* (which you so far refuse to do) of what's actually making the performance differences. > On the other hand, if the low speed is not caused by having to > return 6700, then we should be able to use a similar quick lookup > table optimization ("tc_globals.pagemap().sizeclass(p)") to achieve > at least dozens of times performance improvement. Once again, the big difference is not the "6700". The tc_globals.pagemap().sizeclass(p) in tcmalloc corresponds to lines 6-11 of malloc_usable_size in mallocng, not line 12, and the bulk of the work here is in lines 6-11, mainly lines 7 and 10. I did a similar casual test removing line 12 and just returning something based on the earlier computations, and it made something like a 30% reduction in test run time (with or without the hardening asserts nopped out). Rich