Date: Tue, 7 Apr 2020 13:50:07 -0400
From: Rich Felker
To: musl@lists.openwall.com
Message-ID: <20200407175007.GL11469@brightrain.aerifal.cx>
References: <20200403213110.GD11469@brightrain.aerifal.cx> <20200404025554.GG11469@brightrain.aerifal.cx> <20200404181948.GH11469@brightrain.aerifal.cx> <20200405022023.GI11469@brightrain.aerifal.cx>
In-Reply-To: <20200405022023.GI11469@brightrain.aerifal.cx>
Subject: Re: [musl] New malloc tuning for low usage

On Sat, Apr 04, 2020 at 10:20:23PM -0400, Rich Felker wrote:
> > The answer is that it depends on where the sizes fall. At 16k,
> > rounding up to page size produces 20k usage (5 pages) but the 3-slot
> > class-37 group uses 5+1/3 pages, so individual mmaps are preferable.
> > However if we requested 20k, individual mmaps would be 24k (6 pages)
> > while the 3-slot group would still just use 5+1/3 pages, and would be
> > preferable to switch to. The condition seems to be just whether the
> > rounded-up-to-whole-pages request size is larger than the slot size,
> > and we should prefer individual mmaps if (1) it's smaller than the
> > slot size, or (2) using a multi-slot group would be a relative usage
> > increase in the class of more than 50% (or whatever threshold it ends
> > up being tuned to).
> >
> > I'll see if I can put together a quick implementation of this and see
> > how it works.
>
> This seems to be working very well with the condition:
>
>     if (sc >= 35 && cnt<=3 && (size*cnt > usage/2 || ((req+20+pagesize-1) & -pagesize) <= size))
>
> where:
>
>     sc >= 35  -  at least ~16k
>     cnt<=3    -  wanted to make a smaller group but hit lower cnt
>                  limit; see loop above
>     ((req+20+pagesize-1) & -pagesize) <= size
>               -  requested size rounded up to page <= slot size
>
> at the end of the else clause for if (sc < 8) in alloc_group. Here req
> is a new argument to expose the size of the actual request malloc
> made, so that for single-slot groups (mmap serviced allocations) we
> can allocate just the minimum needed rather than the nominal slot
> size.

This isn't quite right for arbitrary page size; in particular there's
a missing condition that the potential multi-slot group is actually
larger than the single-slot mmap rounded up to page size. This can be
expressed as size*cnt >= ROUND(req+20). It's automatically true for
sc>=35 with PGSZ==4k but not with 64k.
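To make that concrete, here is a rough sketch (not the actual mallocng
code) of what the test quoted above might look like with the missing
check added. The names sc, cnt, size, usage, req and the 20-byte
per-slot overhead follow the quoted condition; round_up() is just an
assumed page-rounding helper:

#include <stddef.h>

/* Assumed helper: round x up to a whole number of pages
 * (pagesize is taken to be a power of two). */
static size_t round_up(size_t x, size_t pagesize)
{
    return (x + pagesize - 1) & -pagesize;
}

/* Sketch only: should this allocation be served by a single-slot
 * (individually mmapped) group rather than a multi-slot group?
 * sc, cnt, size, usage and req have the same meaning as in the
 * quoted condition. */
static int prefer_single_slot(int sc, size_t cnt, size_t size,
                              size_t usage, size_t req, size_t pagesize)
{
    /* slot size at least ~16k, and the loop above has already
     * reduced cnt as far as it can go */
    if (sc < 35 || cnt > 3) return 0;

    /* missing condition: the multi-slot group must actually be
     * larger than the single-slot mmap rounded up to page size;
     * implied by sc>=35 with 4k pages, but not with 64k pages */
    if (size*cnt < round_up(req+20, pagesize)) return 0;

    /* original either-or condition from the quoted message */
    return size*cnt > usage/2
        || round_up(req+20, pagesize) <= size;
}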
In summary, it seems there are 3 necessary conditions to consider use
of a single-slot group (individual mmap):

- It would actually be smaller than the multi-slot group (otherwise
  you're just wasting memory)

- The absolute size is large enough to justify syscall overhead
  (otherwise you can get pathological performance from alloc/free
  cycles)

- Current usage is low enough that the multi-slot group wouldn't obey
  the desired growth bounds on usage (otherwise you get vm space
  fragmentation)

I think it's preferable to break the third condition down into two
(either-or) cases:

- size*cnt > usage/2 (i.e. multi-slot would grow usage by >50%), or

- ROUND(req+20) < size && "low usage" (i.e. slot slack/internal
  fragmentation is sufficiently high that individual mmap not just
  avoids costly preallocation but actually saves memory)

The second condition here is especially helpful in the presence of
"coarse size classing", since it will almost always be true as long as
the threshold to stop coarse classing has been reached, and it negates
all the potential waste. It would be possible just to disable coarse
classing for size ranges eligible for individual mmap, and in some
ways that would be cleaner, but it requires duplicating the
eligibility logic in two places where it's difficult for them to get
exactly the same result.

Rich
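For illustration, here is one way the three conditions above, with the
third broken into the two either-or cases, might combine into a single
eligibility test. This is only a sketch under assumptions, not musl's
actual implementation: low_usage() is a placeholder for whatever
notion of "low usage" ends up being used, and the 16k threshold stands
in for the sc>=35 check from the quoted condition:

#include <stddef.h>

/* Assumed helper: round x up to a whole number of pages. */
static size_t page_round(size_t x, size_t pagesize)
{
    return (x + pagesize - 1) & -pagesize;
}

/* Placeholder for the unspecified "low usage" notion above; the
 * threshold here is arbitrary, purely so the sketch is complete. */
static int low_usage(size_t usage)
{
    return usage < 128*1024;
}

/* Sketch of the three necessary conditions for using a single-slot
 * group (individual mmap) for a request of req bytes that would
 * otherwise land in a cnt-slot group with slot size `size`, given
 * current usage `usage` for the class.  ROUND(req+20) from the text
 * is page_round(req+20, pagesize) here. */
static int single_slot_eligible(size_t req, size_t size, size_t cnt,
                                size_t usage, size_t pagesize)
{
    size_t mmap_len = page_round(req+20, pagesize);

    /* 1. it would actually be smaller than the multi-slot group */
    if (mmap_len > size*cnt) return 0;

    /* 2. absolute size large enough to justify syscall overhead;
     * 16k stands in for the sc>=35 threshold used earlier */
    if (size < 16384) return 0;

    /* 3a. multi-slot group would grow usage by more than 50%, or
     * 3b. slot slack is high enough that the individual mmap saves
     *     memory outright, and usage is still low */
    return size*cnt > usage/2
        || (mmap_len < size && low_usage(usage));
}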