From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-3.3 required=5.0 tests=MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_PASS autolearn=ham autolearn_force=no version=3.4.2 Received: (qmail 26555 invoked from network); 4 Apr 2020 02:56:10 -0000 Received-SPF: pass (mother.openwall.net: domain of lists.openwall.com designates 195.42.179.200 as permitted sender) receiver=inbox.vuxu.org; client-ip=195.42.179.200 envelope-from= Received: from mother.openwall.net (195.42.179.200) by inbox.vuxu.org with UTF8ESMTPZ; 4 Apr 2020 02:56:10 -0000 Received: (qmail 13858 invoked by uid 550); 4 Apr 2020 02:56:07 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Reply-To: musl@lists.openwall.com Received: (qmail 13840 invoked from network); 4 Apr 2020 02:56:06 -0000 Date: Fri, 3 Apr 2020 22:55:54 -0400 From: Rich Felker To: musl@lists.openwall.com Message-ID: <20200404025554.GG11469@brightrain.aerifal.cx> References: <20200403213110.GD11469@brightrain.aerifal.cx> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200403213110.GD11469@brightrain.aerifal.cx> User-Agent: Mutt/1.5.21 (2010-09-15) Subject: Re: [musl] New malloc tuning for low usage On Fri, Apr 03, 2020 at 05:31:10PM -0400, Rich Felker wrote: > The mitigation I'm looking at now is a feature I've wanted to have for > a while -- individually servicing allocations smaller than the hard > mmap-threshold cutoff (128k) with their own mmaps until usage is > sufficiently high to justify (by making the relative overhead low) > allocation of whole multi-slot groups. The data structures already > represent large mmap-serviced allocations as a single-slot group, so > that we can track and validate them at realloc/free (i.e. so that > passing a pointer to something that "looks like" a mmap-serviced > allocation can't be used to corrupt the allocator state, like is > possible with musl's old/current malloc and most other > implementations), and treat single-slot groups as a special case where > the slot size is not the nominal size class spacing, but instead the > entire length from the header out to the end of the mapping. With > minimal changes, this same approach can be used for smaller > allocations that are below the mmap threshold and thus in a size > class. (Note: it's still desirable for them to be classified with > their size class, rather than just like large mmap-serviced > allocations, because we need to account for the total usage to be able > to switch to using larger groups.) > > If this change could be made for all sizes, commit > 67bf74c2e8127c260c807b5694904324d3aeaf30 would be obsolete and > could/should be reverted. However, servicing individual allocations > with mmap only makes sense when they're at least close to a whole page > in length, and when rounding up to the next whole number of pages does > not have significant relative overhead. (For example, servicing a > 4100-byte allocation with two 4k pages would likely be worse than > creating a group of three 5.4k slots and using one of them, since only > 1.4k rather than 4k would be "slack" with the latter.) Obviously > there's a size threshold (e.g. 4 pages, 16k on most archs) that puts a > reasonable bound (for 4 pages, 25%) on overhead of individually > mmap-servicing allocations, but it may also make sense to do this for > smaller sizes that fall within some threshold below an integral > number of pages (e.g (-size)%PGSZ < PGSZ/4). > > The other concern with doing this is performance. Any program with > significant ongoing memory usage in a class will at some point > allocate a non-single-slot group, and then as single-slot allocations > are freed they won't be recreated. However, there's a risk that a > program with small usage might only use a single object of a given > size at a time, but repeatedly free it and allocate a new one, thereby > spending inordinate amounts of time in mmap/munmap. This is roughly > https://github.com/richfelker/mallocng-draft/issues/3, so it's not new > to single-slot groups, just more severe with them since each free of > such a slot is an munmap. The only easy solution I see for this is > just introspective counting of munmap events and backing off to > disfavor allocation strategies that make it happen if it happens at > too high a rate. I'm not 100% decided but I'm thinking it's probably a mistake to try to use individual mmaps for sizes below ~16k. At smaller sizes that are just below a whole page multiple (4k or 8k minus epsilon), the savings aren't that big (at most 2 pages) vs just allocating the smallest possible multi-slot group, and performance impact seems like a big unknown. And while I'm okay with introspective bounce counting to decide whether to return memory to the system, I'm skeptical of also using it for decisions about how to allocate new memory, since it could lead to situations where, without any other changes to memory usage in the application or on the system, free followed by malloc of the same size could fail. I have a draft of the above-described functionality and it seems to work as intended. One key trick that makes things simpler is to consider single-slot groups (individually mmapped allocations) only for odd size classes, since even size clases (due to coarse classes logic in top level malloc) are only used once there's significant usage in the coarse class. In working on this, I noticed that it looks like the coarse size class threshold (6) in top-level malloc() is too low. At that threshold, the first fine-grained-class group allocation will be roughly a 100% increase in memory usage by the class; I'd rather keep the relative increase bounded by 50% or less. It should probably be something more like 10 or 12 to achieve this. With 12, repeated allocations of 16k first produce 7 individual 20k mmaps, then a 3-slot class-37 (21824-byte slots) group, then a 7-slot class-36 (18704-byte slots) group. One thing that's not clear to me is whether it's useful at all to produce the 3-slot class-37 group rather than just going on making more individual mmaps until it's time to switch to the larger group. It's easy to tune things to do the latter, and seems to offer more flexibility in how memory is used. It also allows slightly more fragmentation, but the number of such objects is highly bounded to begin with because we use increasingly larger groups as usage goes up, so the contribution should be asymptotically irrelevant.