Date: Tue, 7 Apr 2020 13:50:07 -0400
From: Rich Felker
To: musl@lists.openwall.com
Message-ID: <20200407175007.GL11469@brightrain.aerifal.cx>
References: <20200403213110.GD11469@brightrain.aerifal.cx> <20200404025554.GG11469@brightrain.aerifal.cx> <20200404181948.GH11469@brightrain.aerifal.cx> <20200405022023.GI11469@brightrain.aerifal.cx>
In-Reply-To: <20200405022023.GI11469@brightrain.aerifal.cx>
Subject: Re: [musl] New malloc tuning for low usage

On Sat, Apr 04, 2020 at 10:20:23PM -0400, Rich Felker wrote:
> > The answer is that it depends on where the sizes fall. At 16k,
> > rounding up to page size produces 20k usage (5 pages) but the 3-slot
> > class-37 group uses 5+1/3 pages, so individual mmaps are preferable.
> > However if we requested 20k, individual mmaps would be 24k (6 pages)
> > while the 3-slot group would still just use 5+1/3 pages, and would be
> > preferable to switch to. The condition seems to be just whether the
> > rounded-up-to-whole-pages request size is larger than the slot size,
> > and we should prefer individual mmaps if (1) it's smaller than the
> > slot size, or (2) using a multi-slot group would be a relative usage
> > increase in the class of more than 50% (or whatever threshold it ends
> > up being tuned to).
> >
> > I'll see if I can put together a quick implementation of this and see
> > how it works.
>
> This seems to be working very well with the condition:
>
>     if (sc >= 35 && cnt<=3 && (size*cnt > usage/2 || ((req+20+pagesize-1) & -pagesize) <= size))
>
> where:
>
>     sc >= 35  -  at least ~16k
>     cnt<=3    -  wanted to make a smaller group but hit lower cnt
>                  limit; see loop above
>     ((req+20+pagesize-1) & -pagesize) <= size
>               -  requested size rounded up to page <= slot size
>
> at the end of the else clause for if (sc < 8) in alloc_group. Here req
> is a new argument to expose the size of the actual request malloc
> made, so that for single-slot groups (mmap serviced allocations) we
> can allocate just the minimum needed rather than the nominal slot
> size.

This isn't quite right for arbitrary page size; in particular there's
a missing condition that the potential multi-slot group is actually
larger than the single-slot mmap rounded up to page size. This can be
expressed as size*cnt >= ROUND(req+20). It's automatically true for
sc>=35 with PGSZ==4k but not with 64k.
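To make that concrete, here is a rough sketch (not the actual mallocng
code) of what the test quoted above might look like with the missing
check added. The names sc, cnt, size, usage, req and the 20-byte
per-slot overhead follow the quoted condition; round_up() is just an
assumed page-rounding helper:

#include <stddef.h>

/* Assumed helper: round x up to a whole number of pages
 * (pagesize is taken to be a power of two). */
static size_t round_up(size_t x, size_t pagesize)
{
    return (x + pagesize - 1) & -pagesize;
}

/* Sketch only: should this allocation be served by a single-slot
 * (individually mmapped) group rather than a multi-slot group?
 * sc, cnt, size, usage and req have the same meaning as in the
 * quoted condition. */
static int prefer_single_slot(int sc, size_t cnt, size_t size,
                              size_t usage, size_t req, size_t pagesize)
{
    /* slot size at least ~16k, and the loop above has already
     * reduced cnt as far as it can go */
    if (sc < 35 || cnt > 3) return 0;

    /* missing condition: the multi-slot group must actually be
     * larger than the single-slot mmap rounded up to page size;
     * implied by sc>=35 with 4k pages, but not with 64k pages */
    if (size*cnt < round_up(req+20, pagesize)) return 0;

    /* original either-or condition from the quoted message */
    return size*cnt > usage/2
        || round_up(req+20, pagesize) <= size;
}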
In summary, it seems there are 3 necessary conditions to consider use
of a single-slot group (individual mmap):

- It would actually be smaller than the multi-slot group (otherwise
  you're just wasting memory)

- The absolute size is large enough to justify syscall overhead
  (otherwise you can get pathological performance from alloc/free
  cycles)

- Current usage is low enough that the multi-slot group wouldn't obey
  the desired growth bounds on usage (otherwise you get vm space
  fragmentation)

I think it's preferable to break the third condition down into two
(either-or) cases:

- size*cnt > usage/2 (i.e. multi-slot would grow usage by >50%), or

- ROUND(req+20) < size && "low usage" (i.e. slot slack/internal
  fragmentation is sufficiently high that individual mmap not just
  avoids costly preallocation but actually saves memory)

The second condition here is especially helpful in the presence of
"coarse size classing", since it will almost always be true as long as
the threshold to stop coarse classing has been reached, and it negates
all the potential waste. It would be possible just to disable coarse
classing for size ranges eligible for individual mmap, and in some
ways that would be cleaner, but it requires duplicating the
eligibility logic in two places where it's difficult for them to get
exactly the same result.

Rich
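For illustration, here is one way the three conditions above, with the
third broken into the two either-or cases, might combine into a single
eligibility test. This is only a sketch under assumptions, not musl's
actual implementation: low_usage() is a placeholder for whatever
notion of "low usage" ends up being used, and the 16k threshold stands
in for the sc>=35 check from the quoted condition:

#include <stddef.h>

/* Assumed helper: round x up to a whole number of pages. */
static size_t page_round(size_t x, size_t pagesize)
{
    return (x + pagesize - 1) & -pagesize;
}

/* Placeholder for the unspecified "low usage" notion above; the
 * threshold here is arbitrary, purely so the sketch is complete. */
static int low_usage(size_t usage)
{
    return usage < 128*1024;
}

/* Sketch of the three necessary conditions for using a single-slot
 * group (individual mmap) for a request of req bytes that would
 * otherwise land in a cnt-slot group with slot size `size`, given
 * current usage `usage` for the class.  ROUND(req+20) from the text
 * is page_round(req+20, pagesize) here. */
static int single_slot_eligible(size_t req, size_t size, size_t cnt,
                                size_t usage, size_t pagesize)
{
    size_t mmap_len = page_round(req+20, pagesize);

    /* 1. it would actually be smaller than the multi-slot group */
    if (mmap_len > size*cnt) return 0;

    /* 2. absolute size large enough to justify syscall overhead;
     * 16k stands in for the sc>=35 threshold used earlier */
    if (size < 16384) return 0;

    /* 3a. multi-slot group would grow usage by more than 50%, or
     * 3b. slot slack is high enough that the individual mmap saves
     *     memory outright, and usage is still low */
    return size*cnt > usage/2
        || (mmap_len < size && low_usage(usage));
}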