From mboxrd@z Thu Jan  1 00:00:00 1970
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
Reply-To: musl@lists.openwall.com
Date: Sat, 4 Apr 2020 22:20:23 -0400
From: Rich Felker
To: musl@lists.openwall.com
Message-ID: <20200405022023.GI11469@brightrain.aerifal.cx>
References: <20200403213110.GD11469@brightrain.aerifal.cx>
 <20200404025554.GG11469@brightrain.aerifal.cx>
 <20200404181948.GH11469@brightrain.aerifal.cx>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20200404181948.GH11469@brightrain.aerifal.cx>
User-Agent: Mutt/1.5.21 (2010-09-15)
Subject: Re: [musl] New malloc tuning for low usage

On Sat, Apr 04, 2020 at 02:19:48PM -0400, Rich Felker wrote:
> On Fri, Apr 03, 2020 at 10:55:54PM -0400, Rich Felker wrote:
> > In working on this, I noticed that it looks like the coarse size
> > class threshold (6) in top-level malloc() is too low. At that
> > threshold, the first fine-grained-class group allocation will be
> > roughly a 100% increase in memory usage by the class; I'd rather
> > keep the relative increase bounded by 50% or less.
> > It should probably be something more like 10 or 12 to achieve
> > this. With 12, repeated allocations of 16k first produce 7
> > individual 20k mmaps, then a 3-slot class-37 (21824-byte slots)
> > group, then a 7-slot class-36 (18704-byte slots) group.
> >
> > One thing that's not clear to me is whether it's useful at all to
> > produce the 3-slot class-37 group rather than just going on making
> > more individual mmaps until it's time to switch to the larger
> > group. It's easy to tune things to do the latter, and it seems to
> > offer more flexibility in how memory is used. It also allows
> > slightly more fragmentation, but the number of such objects is
> > highly bounded to begin with because we use increasingly larger
> > groups as usage goes up, so the contribution should be
> > asymptotically irrelevant.
>
> The answer is that it depends on where the sizes fall. At 16k,
> rounding up to page size produces 20k usage (5 pages) but the 3-slot
> class-37 group uses 5+1/3 pages per slot, so individual mmaps are
> preferable. However, if we requested 20k, individual mmaps would be
> 24k (6 pages) while the 3-slot group would still use just 5+1/3
> pages, and would be preferable to switch to. The condition seems to
> be just whether the rounded-up-to-whole-pages request size is larger
> than the slot size, and we should prefer individual mmaps if (1)
> it's smaller than the slot size, or (2) using a multi-slot group
> would be a relative usage increase in the class of more than 50% (or
> whatever threshold it ends up being tuned to).
>
> I'll see if I can put together a quick implementation of this and
> see how it works.
This seems to be working very well with the condition:

	if (sc >= 35 && cnt<=3 && (size*cnt > usage/2
	    ^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	    at least    wanted to make a smaller
	    ~16k        group but hit lower cnt
	                limit; see loop above
	 || ((req+20+pagesize-1) & -pagesize) <= size))
	    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	    requested size rounded up to page <= slot size

at the end of the else clause for if (sc < 8) in alloc_group. Here req
is a new argument to expose the size of the actual request malloc made,
so that for single-slot groups (mmap serviced allocations) we can
allocate just the minimum needed rather than the nominal slot size.

Rich
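The quoted condition, restated as a standalone predicate for
illustration (a sketch based only on the expression above; the
function name and parameter framing are mine, not musl's, and per the
surrounding description a true result means an individual mmap, i.e. a
single-slot group, is preferred over a small multi-slot group):

```c
#include <stddef.h>

/* Sketch of the quoted alloc_group condition (names are mine):
 *   - sc >= 35: size classes of roughly 16k and up, where whole-page
 *     sizing of individual mmaps matters;
 *   - cnt <= 3: a smaller group was wanted but the count hit its
 *     lower limit;
 *   - size*cnt > usage/2: the group would grow class usage by more
 *     than 50%; or
 *   - the request, plus 20-byte overhead, rounded up to whole pages,
 *     fits in no more bytes than one nominal slot. */
static int prefer_individual_mmap(int sc, size_t cnt, size_t size,
                                  size_t usage, size_t req, size_t pagesize)
{
	return sc >= 35 && cnt <= 3 &&
	       (size*cnt > usage/2 ||
	        ((req + 20 + pagesize - 1) & -pagesize) <= size);
}
```

For example, with the thread's numbers (class 37, 21824-byte slots,
143360 bytes of existing usage, 4096-byte pages), a 16k request
triggers the predicate via the page-rounding clause, while a 20k
request does not.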