Date: Fri, 3 Apr 2020 17:31:10 -0400
From: Rich Felker
To: musl@lists.openwall.com
Message-ID: <20200403213110.GD11469@brightrain.aerifal.cx>
Subject: [musl] New malloc tuning for low usage

It's come to my attention that there are usage patterns where the new malloc (under development, out-of-tree at https://github.com/richfelker/mallocng-draft) uses significantly more memory. This happens when a program allocates just one object of a size between around 4k and 128k: up to nearly 3.5x the request size can end up being obtained, in the form of 1 or 2 additional slots (for the next allocation of the same size class) and slack space (sometimes called internal fragmentation) within the allocated slot due to coarse size-classing.

The second part of this, coarse classing, was introduced in commit 67bf74c2e8127c260c807b5694904324d3aeaf30 to mitigate a more extreme form of the larger first problem: due to divisibility properties, the minimum slot counts for size classes whose index is 0, 1, 2, or 3 mod 4 are 7, 3, 5, and 2, respectively. Refraining from using classes that are 0 or 2 mod 4 until there's significant usage at that size class both avoids allocating 7 or 5 slots of a large size when the program might only need a couple of such objects, and allows a single group to serve a wider range of sizes. However, it looks like even 2 or 3 slots can be undesirable -- for example, a program that nominally needs only 75k can end up taking 256k.

The mitigation I'm looking at now is a feature I've wanted to have for a while -- individually servicing allocations smaller than the hard mmap-threshold cutoff (128k) with their own mmaps until usage is sufficiently high to justify (by making the relative overhead low) allocation of whole multi-slot groups.

The data structures already represent large mmap-serviced allocations as a single-slot group, so that we can track and validate them at realloc/free (i.e. so that passing a pointer to something that "looks like" a mmap-serviced allocation can't be used to corrupt the allocator state, as is possible with musl's old/current malloc and most other implementations), and treat single-slot groups as a special case where the slot size is not the nominal size-class spacing, but instead the entire length from the header out to the end of the mapping.
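To make the single-slot special case concrete, here's a minimal sketch of the idea. The struct, field, and table names (group_hdr, maplen, class_to_size, slot_len) are made up for illustration and are not the actual mallocng data structures:

    #include <stddef.h>

    /* Hypothetical size-class table; the real table and spacing differ. */
    static const size_t class_to_size[] = { 16, 32, 48, 64, 80, 96, 112, 128 };

    struct group_hdr {
            unsigned sizeclass;  /* index into class_to_size */
            unsigned slot_count; /* 1 for an individually mmap-serviced allocation */
            size_t maplen;       /* length of the whole mapping, when mmap-serviced */
    };

    /* Usable slot length: normal groups use the nominal class spacing;
     * a single-slot group's slot runs from the end of the header out to
     * the end of the mapping. */
    static size_t slot_len(const struct group_hdr *g)
    {
            if (g->slot_count == 1)
                    return g->maplen - sizeof *g;
            return class_to_size[g->sizeclass];
    }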
With minimal changes, this same approach can be used for smaller allocations that are below the mmap threshold and thus in a size class. (Note: it's still desirable for them to be classified with their size class, rather than just like large mmap-serviced allocations, because we need to account for the total usage to be able to switch to using larger groups.) If this change could be made for all sizes, commit 67bf74c2e8127c260c807b5694904324d3aeaf30 would be obsolete and could/should be reverted.

However, servicing individual allocations with mmap only makes sense when they're at least close to a whole page in length, and when rounding up to the next whole number of pages does not have significant relative overhead. (For example, servicing a 4100-byte allocation with two 4k pages would likely be worse than creating a group of three 5.4k slots and using one of them, since only 1.4k rather than 4k would be "slack" with the latter.) Obviously there's a size threshold (e.g. 4 pages, 16k on most archs) that puts a reasonable bound (for 4 pages, 25%) on the overhead of individually mmap-servicing allocations, but it may also make sense to do this for smaller sizes that fall within some threshold below an integral number of pages (e.g. (-size)%PGSZ < PGSZ/4).

The other concern with doing this is performance. Any program with significant ongoing memory usage in a class will at some point allocate a non-single-slot group, and then, as single-slot allocations are freed, they won't be recreated. However, there's a risk that a program with small usage might only use a single object of a given size at a time, but repeatedly free it and allocate a new one, thereby spending inordinate amounts of time in mmap/munmap. This is roughly https://github.com/richfelker/mallocng-draft/issues/3, so it's not new to single-slot groups, just more severe with them since each free of such a slot is an munmap. The only easy solution I see for this is introspective counting of munmap events, backing off to disfavor the allocation strategies that cause them if they happen at too high a rate.

There's also a question of whether the hard mmap threshold should just be lowered too. 128k is probably a lot higher than it needs to be for archs with 4k pages -- just 64k would bound waste at 1/16 -- but on archs with larger pages a lower threshold probably doesn't work (and none of the above memory-saving tricks really work).
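Putting the pieces above together, the eligibility test might look roughly like the following. This is only a sketch of the policy being described; PGSZ, the usage accounting, and the back-off inputs are stand-ins, not actual mallocng interfaces:

    #include <stddef.h>
    #include <stdbool.h>

    #define PGSZ 4096UL  /* stand-in for the runtime page size */

    /* Rough sketch: service the allocation with its own mmap only while
     * usage in the size class is low, only while mmap/munmap churn has
     * not been detected, and only when rounding up to whole pages has
     * acceptable overhead -- either the size is at least 4 pages
     * (bounding overhead around 25%), or it falls within PGSZ/4 below a
     * whole number of pages. */
    static bool use_individual_mmap(size_t size, size_t class_usage,
                                    size_t usage_threshold,
                                    unsigned recent_munmaps,
                                    unsigned munmap_rate_limit)
    {
            if (class_usage >= usage_threshold)
                    return false;  /* enough usage to justify a multi-slot group */
            if (recent_munmaps > munmap_rate_limit)
                    return false;  /* back off: too much mmap/munmap churn */
            if (size >= 4*PGSZ)
                    return true;   /* overhead bounded by ~1 page in 4 */
            return (-size) % PGSZ < PGSZ/4;  /* close to a whole page multiple */
    }

The recent_munmaps/munmap_rate_limit pair is just one way to express the introspective back-off; a real version would presumably need some decay over time and per-class accounting.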