Date: Mon, 18 May 2020 15:42:04 -0400
From: Rich Felker
To: musl@lists.openwall.com
Message-ID: <20200518194204.GG21576@brightrain.aerifal.cx>
Subject: [musl] Review of mallocng motivation/goals

Since it's come up again lately what the motivation for mallocng is, I
thought I'd review the original document with its statement of
motivation/goals, discussing what is or isn't achieved and how accurate
the original assessment of the situation was. Quoted from the attachment
to https://www.openwall.com/lists/musl/2019/10/22/3 below:

> Desirable qualities of existing allocator:

First, I think these were highly overstated; one of the biggest findings
of working on mallocng has been that oldmalloc didn't have a lot of the
desirable qualities I thought it did. The main ones that were real were:

> - Very small code size.
>
> - Near-zero constant minimal overhead, preventing large numbers of
>   small processes with only minor use of malloc from consuming excess
>   resources.
And as I've found, this "minimal overhead" is only really the case for
certain low-usage patterns. It's possible to create patterns, even at
low usage, where fragmentation blows up and essentially requires double
allocation of everything. The ldso "reclaim_gaps" hack (dynamic-linked
only) masks this fairly effectively by providing some extra isolated
zone for small allocations, keeping them from fragmenting the space
between large ones until usage reaches a certain level, but the problem
is still there.

> - Low, well-understood internal fragmentation.

Narrator: It was not well-understood.

> - Returning of (physical storage for) large contiguous freed memory to
>   the operating system.

It was better than nothing, and not much more can be said. The minimal
unit for return was 128k, and it happened unpredictably.

> - Detection of heap-based overflows (with limitations) at realloc/free
>   time via clobbered headers/footers.
>
> - Detection of double-free (with inherent limitations) via chunk
>   header/flags.

These really turned out to be a bare minimum compared to what you can
do. What I was missing was that it's possible to have strong
internal-consistency properties for the allocator itself even if the
allocated memory can be overflowed/clobbered/UAF'd by the application.

> Known flaws in existing allocator:
>
> - Race condition (not data race, a distinct concept) in determining
>   the availability of free chunk of desired size while
>   splitting/merging is in progress leads to external fragmentation
>   (unnecessary splitting of larger chunks) and over-expansion of heap
>   (via failure to see large free zone in bin 63).

This was absolutely the case, but wasn't the only way oldmalloc could
over-allocate. The other ways were not unboundedly bad like this, but
potentially had ~2x usage.
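As a hedged illustration of such a pattern (my sketch against a generic
split/merge allocator, not a test of oldmalloc specifically, and
dependent on the allocator actually placing these adjacently):
alternating large and small allocations, then freeing only the large
ones, leaves each large free region pinned between small live
allocations so it can never coalesce or be returned:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of a fragmentation blow-up pattern for a split/merge
 * allocator: each small allocation lands just after a large one, so
 * once the large ones are freed, the small live allocations keep the
 * N large free regions from coalescing or being returned. */
#define N 64

int demo_pattern(void)
{
    void *big[N], *small[N];
    for (int i = 0; i < N; i++) {
        big[i] = malloc(65536);
        small[i] = malloc(16);
        if (!big[i] || !small[i]) return -1;
    }
    for (int i = 0; i < N; i++) free(big[i]);
    /* The heap may now hold N isolated ~64k holes pinned by 16-byte
     * allocations: usage near double the live data. */
    for (int i = 0; i < N; i++) free(small[i]);
    return 0;
}
```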
> - Reliance on fine-grained per-bin locking to achieve good performance
>   on typical (not even all; it doesn't help when all threads want the
>   same size of allocation) multithreaded loads fundamentally precludes
>   fixing the above race without significant performance regressions.

In light of a new design that maintains global consistency, which is
inherently expensive, I don't think this was really a horrible mistake.
Fine-grained locking can be a good thing if you can avoid the need for
multiple lock/unlock cycles.

> - The overall dlmalloc design (arbitrary split/merge) inherently has
>   usage patterns that can produce catastrophic external fragmentation.

This is absolutely still the case. It might have been possible to
improve on, but only with tradeoffs comparable in cost to the
improvements made in mallocng, without the corresponding additional
benefits.

> - Heuristic for returning dirty pages to kernel (via MADV_DONTNEED) is
>   over-aggressive in some cases (bad performance), under-aggressive in
>   others (failure to return memory), and it's not clear that it can be
>   significantly improved in a dlmalloc-like design.

I think it could have been improved significantly.

> Together, these necessitate not just enhancements to the existing
> allocator, but a completely different design.

Among the things mentioned in the motivation document, a completely new
design was most necessitated by hardening. I don't see any viable way a
dlmalloc-type design could have achieved the types of hardening mallocng
can without a major performance or memory-usage hit. Arbitrary
split/merge just isn't amenable to efficient paths for locating
out-of-band metadata. Grouping free slots and letting them share a
pointer back to metadata, as well as the contents of the metadata, was
essential to making mallocng's overhead acceptable at small allocation
sizes.
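A minimal sketch of that grouping idea (the names and layout here are
hypothetical illustrations, not mallocng's actual structures): slots of
one size class live in a group whose header holds a single pointer to
out-of-band metadata, so recovering the metadata from a slot pointer
costs only an index computation, not per-slot headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: one out-of-band metadata record shared by a
 * whole group of equal-sized slots. */
struct meta {
    uint32_t slot_size;   /* size of every slot in this group */
    uint32_t free_mask;   /* bitmap of free slots */
};

struct group {
    struct meta *meta;    /* single back-pointer shared by all slots */
    unsigned char slots[];
};

/* Recover the group (and thus the shared metadata) from a slot
 * pointer, given the slot's index within its group. */
static struct group *group_of(void *slot, unsigned idx, size_t size)
{
    return (struct group *)((char *)slot - idx * size
                            - offsetof(struct group, slots));
}
```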
Otherwise I think there would be viable ways to greatly reduce
fragmentation/blow-up with an arbitrary-division dlmalloc-type design.

> Goals for new allocator:
>
> Primarily, the new allocator should eliminate the above known flaws
> while preserving the good properties of the existing allocator to the
> maximum extent that's possible/practical. In addition, it should:
>
> - Harden protections against heap-based overflows and double free not
>   to be susceptible to attacks that blindly replace clobbered
>   headers/footers with valid data. This likely involves out-of-band
>   metadata and random secrets, but may be achieved in whatever way is
>   determined to be most effective and compatible with other goals.

This went basically as expected.

> - Provide either strong rigorous bounds on external fragmentation, or
>   heuristically mitigate the chance of reasonable usage patterns
>   leading to unbounded external fragmentation.

External fragmentation with mallocng essentially occurs only at the
level of kernel-allocated VMAs, and should be limited strongly by
geometric slot-count growth and use of power-of-two page counts (which
avoids leaving odd gap lengths nothing will fit in).

> - Expand or restructure the heap only when no free chunks that satisfy
>   an allocation request (without significant internal fragmentation)
>   exist.

I'm not sure exactly what I meant by this, or how it's not automatically
true. I think it was just meant to say that we shouldn't have visibility
of inconsistent state (like the temporary allocation of the whole free
zone in oldmalloc, or a thread-local cache/arena that other threads
can't see) causing new memory to be obtained from the system when
there's existing available memory that could be used. So it's not so
much a goal as a "design constraint".

> - Take advantage of realloc as an opportunity to defragment.

This goal has largely been ignored so far, because with layout by size
classes there essentially isn't fragmentation.
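The VMA-level bound mentioned above (power-of-two page counts) can be
illustrated with a hypothetical rounding rule, my sketch rather than
mallocng's exact policy: sizing each group mapping to a power-of-two
number of pages means a freed mapping leaves a hole a later
power-of-two request can fill exactly.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096  /* assumed page size, for illustration only */

/* Round a byte count up to a power-of-two number of pages, so freed
 * mappings leave power-of-two holes rather than odd gap lengths
 * nothing will fit in. */
static size_t pow2_pages(size_t bytes)
{
    size_t pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    size_t p = 1;
    while (p < pages) p <<= 1;
    return p;
}
```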
realloc naturally moves a changing-size object to an appropriate place
for its new size. However, further enhancement here is possible. realloc
could additionally detect the case where it's preserving size and the
slot is the last non-free one in its group, and move the object to an
empty slot in another group so that the old group can be freed. Beyond
the special case of "last non-free slot", though, anything we can do
here is heuristic and might not help, depending on the caller's usage
pattern, which can't be known.

> - Improve performance, both under reasonable contention and under
>   single-thread load (with or without existence of other threads).
>   This likely amounts to short fast-case code paths and designing
>   synchronization around minimizing the number of atomic/lock
>   operations.

For the most part I don't think any significant improvement was made
here, and I'm not sure it's possible with global consistency and lack of
over-allocation as requirements. Still, mallocng may get a significant
boost to performance when it's integrated with musl and can elide locks
in single-threaded use and use specialized inline locks (vs pthread
mutex) when multithreaded.
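The "last non-free slot" special case above could be detected cheaply
with a per-group bitmap; a hypothetical sketch (the type and field
names are mine, not mallocng's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-group state: a bitmap of in-use slots. */
struct grp {
    uint32_t used_mask;
};

/* True iff slot idx is the only in-use slot in its group, i.e. a
 * size-preserving realloc could migrate the object elsewhere and let
 * the whole group be freed. */
static bool last_nonfree(const struct grp *g, unsigned idx)
{
    return g->used_mask == (UINT32_C(1) << idx);
}
```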