[-- Attachment #1: Type: text/plain, Size: 1214 bytes --]

Alongside time64 and other enhancements, over the next few months I'll
be working on a major project to replace musl's malloc, which has known
fundamental problems that can't really be fixed without significant
redesign.

The early stages of this, in progress now, involve preparing tests that
can identify pitfalls into pathological fragmentation, some of which may
be avoidable and others of which may be fundamental, for use in
assessing possible allocation strategies and other existing malloc
implementations.

I'm attaching a document describing motivations and goals for the
project, and a dead simple minimal test to demonstrate the unbounded
heap expansion problem that's the core motivation. The latter produces
no output but can be watched via top(1) or similar.

I'll follow up soon with some thoughts/findings so far on designs that
might work for us, or that don't work but have interesting properties
that suggest they may lead to workable designs.

I'm not going to try to do development of this inside the musl tree;
getting the ability to experiment with and do this out-of-tree, then
integrate it later, was one of the motivations for supporting malloc
replacement, which musl 1.1.20 introduced.

[-- Attachment #2: newmalloc.txt --]
[-- Type: text/plain, Size: 3042 bytes --]

musl libc next-gen malloc implementation

*** Motivation and Design Goals

Desirable qualities of the existing allocator:

- Very small code size.
- Near-zero constant minimal overhead, preventing large numbers of
  small processes with only minor use of malloc from consuming excess
  resources.
- Low, well-understood internal fragmentation.
- Returning of (physical storage for) large contiguous freed memory to
  the operating system.
- Detection of heap-based overflows (with limitations) at realloc/free
  time via clobbered headers/footers.
- Detection of double-free (with inherent limitations) via chunk
  header/flags.
Known flaws in the existing allocator:

- A race condition (not a data race, a distinct concept) in determining
  the availability of a free chunk of the desired size while
  splitting/merging is in progress leads to external fragmentation
  (unnecessary splitting of larger chunks) and over-expansion of the
  heap (via failure to see a large free zone in bin 63).
- Reliance on fine-grained per-bin locking to achieve good performance
  on typical multithreaded loads (not even all; it doesn't help when
  all threads want the same size of allocation) fundamentally precludes
  fixing the above race without significant performance regressions.
- The overall dlmalloc design (arbitrary split/merge) inherently has
  usage patterns that can produce catastrophic external fragmentation.
- The heuristic for returning dirty pages to the kernel (via
  MADV_DONTNEED) is over-aggressive in some cases (bad performance),
  under-aggressive in others (failure to return memory), and it's not
  clear that it can be significantly improved in a dlmalloc-like
  design.

Together, these necessitate not just enhancements to the existing
allocator, but a completely different design.

Goals for the new allocator:

Primarily, the new allocator should eliminate the above known flaws
while preserving the good properties of the existing allocator to the
maximum extent that's possible/practical. In addition, it should:

- Harden protections against heap-based overflows and double-free so
  that they're not susceptible to attacks that blindly replace
  clobbered headers/footers with valid data. This likely involves
  out-of-band metadata and random secrets, but may be achieved in
  whatever way is determined to be most effective and compatible with
  other goals.
- Provide either strong rigorous bounds on external fragmentation, or
  heuristically mitigate the chance of reasonable usage patterns
  leading to unbounded external fragmentation.
- Expand or restructure the heap only when no free chunks that satisfy
  an allocation request (without significant internal fragmentation)
  exist.
- Take advantage of realloc as an opportunity to defragment.
- Improve performance, both under reasonable contention and under
  single-threaded load (with or without the existence of other
  threads). This likely amounts to short fast-case code paths and
  designing synchronization around minimizing the number of atomic/lock
  operations.

[-- Attachment #3: expand_race.c --]
[-- Type: text/plain, Size: 259 bytes --]

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
	for (;;) {
		void *p = malloc(100000);
		free(p);
	}
}

int main()
{
	for (int i=0; i<10; i++)
		pthread_create(&(pthread_t){0}, 0, worker, 0);
	for (;;) pause();
}
On Tue, Oct 22, 2019 at 01:40:51PM -0400, Rich Felker wrote:
> Alongside time64 and other enhancements, over the next few months I'll
> be working on a major project to replace musl's malloc, which has
> known fundamental problems that can't really be fixed without
> significant redesign.
>
> [...]
>
> I'll follow up soon with some thoughts/findings so far on designs that
> might work for us, or that don't work but have interesting properties
> that suggest they may lead to workable designs.

The following was written incrementally over the past week or so, and
as I indicated before it's not so much about a design that works, but
about identifying properties that will motivate a good design.

The current dlmalloc-style split/merge of free chunks is problematic in
a lot of ways (complex interaction with fragmentation, hard to make
efficient with threads because access to neighbors being merged has to
be synchronized, etc.), so I want to first explore some simple designs
that completely lack it.

1. Extending a bump allocator

The simplest of all is a bump allocator with free stacks. This is just
one step beyond musl's current simple_malloc (used for static linking
without free). The only changes are that you round up requests to
discrete size classes, and then implement free as pushing onto a free
stack, with one stack per size class. Because it's a stack, a singly
linked list suffices, and you can push/pop solely with atomics.

One big pro is that there is no progressive fragmentation. If you
successfully make a complex sequence of allocations, free some subset,
then allocate the previously-freed sizes again in a different order,
it's guaranteed to succeed. This is a really desirable property for
long-lived processes in certain kinds of memory-constrained systems. Of
course, the permanence of how memory is split up can also be a con.
After successfully performing very large numbers of small allocations
and freeing them, future large allocations may not be possible, and
vice versa.

Stack order is also undesirable for catching or mitigating double-free
and use-after-free. LRU order will give the best results there.
Switching to LRU/FIFO order is possible, but no longer admits simple
atomics for alloc/free.

2. Zoned size classes

The above bump allocators with free essentially partition address space
into separate memory usable for each size class in a way that's
(one-time-per-run) adaptive, and completely unorganized. Another simple
type of allocator partitions (virtual) address space *statically*,
reserving for each size a large range of addresses (PROT_NONE, later
mprotectable to writable storage) usable as slabs/object pools for
objects of that particular size. This type of design appears in
GrapheneOS's hardened_malloc:

https://github.com/GrapheneOS/hardened_malloc

At first this seems pretty magical -- you automatically get a complete
lack of fragmentation... of *virtual* memory. It's possible to fragment
the zones themselves, requiring far more physical memory than the
logical amount allocated, but the physical memory usage by a given size
class is bounded by the peak usage in that size class at any point in
the program's lifetime. So that's pretty good.

In order to be able to do this, however, your virtual address space has
to be vastly larger than the physical memory available. Ideally, each
size's zone should be larger than the total amount of physical memory
(ram+swap) you might have, so that you're not imposing an arbitrary
limit on what the application can do with it. Figuring around 50 size
classes and leaving at least half of virtual address space free for
mmap/large allocations, I figure your virtual address space should be
at least 100x larger than physical memory availability in order for
this to be viable. For 32-bit, that means it's not viable beyond around
30 MB of physical memory.
Ouch. Indeed, GrapheneOS's hardened_malloc is explicitly only for
64-bit archs, and this is why.

Aside from that, this kind of design also suffers from extreme
over-allocation in small programs, and it's not compatible with nommu
at all (since it relies on the mmu to cheat fragmentation). For
example, if you allocate just one object of each of 50 size classes,
you'll be using at least 50 pages (200k) already.

One advantage of this approach is that it admits nice out-of-band
metadata approaches, which let you make strong guarantees about the
consistency of the allocator state under heap corruption/overflows.

So... neither of these approaches is appropriate for the future of
musl's malloc. They're not sufficiently general-purpose/generalizable
to all environments and archs musl supports. But they do suggest
properties worth trying to achieve in an alternative. I've made some
good progress on this, which I'll follow up on in a separate post.
* Rich Felker:
> Aside from that, this kind of design also suffers from extreme
> over-allocation in small programs, and it's not compatible with nommu
> at all (since it relies on the mmu to cheat fragmentation). For
> example, if you allocate just one object of each of 50 size classes,
> you'll be using at least 50 pages (200k) already.
I've recently seen an announcement of a malloc which uses alias mappings
to encode the size class in the address. This way, you would only need
to start out with one or two pages in userspace. But the page table
overhead in the kernel (which is missing from your calculation) is still
there because I don't think the kernel can share the page tables for the
alias mappings.
Another concern with such allocators is the increased TLB pressure and
page walking overhead.
Thanks,
Florian
Work on the new malloc is well underway, and I have a draft version now
public at:

https://github.com/richfelker/mallocng-draft

Some highlights:

- Smallest size-class now has 12 of 16 bytes usable for storage,
  compared to 8 of 16 (32-bit archs) or 16 of 32 (64-bit archs) with
  the old malloc, plus 1.5 bytes (32-bit) or 2.5 bytes (64-bit) of
  out-of-band metadata. This overhead (5.5 or 6.5 bytes per allocation)
  is uniform across all sizes.
- Small-size allocation granularity/alignment is now 16-byte on both
  32-bit and 64-bit archs. No more wasting 64 bytes for a 36-byte
  structure.
- Metadata defining heap state is all out-of-band and protected below
  by guard pages, where it's not easily reachable through typical
  attack vectors.
- Freed memory is quickly returnable to the kernel.
- Common-path malloc and free (asymptotically should be ~95% of calls
  under high use) do not require exclusive locks -- readlock+atomic for
  malloc, pure-atomic for free.
- Size classes are not a straight geometric progression but instead
  include 1/7, 1/6, and 1/5 of powers of two in between the whole
  power-of-two steps, optimizing fit in whole power-of-two numbers of
  pages.
- Code size is very close to the same as the old allocator, and may end
  up being smaller.

This is still early in development, and it won't be merged in musl
until the next release cycle, after the new year. There may be plenty
of bugs or suboptimal behavior. And I still need to write up the
design.

For anyone wanting to play around with the draft, it can be linked or
used via LD_PRELOAD. A dump_heap function is provided which can show
allocator state. In the status masks, a=available and f=freed; these
are distinct states to allow some control over the interval to reuse of
freed memory (which gives increased safety against UAF/DF). The "_"
status is in-use.

Rich
On Thu, Nov 28, 2019 at 04:56:42PM -0500, Rich Felker wrote:
> Work on the new malloc is well underway, and I have a draft version
> now public at:
>
> https://github.com/richfelker/mallocng-draft
>
> Some highlights:
>
> - Smallest size-class now has 12 of 16 bytes usable for storage,
> compared to 8 of 16 (32-bit archs) or 16 of 32 (64-bit archs) with
> the old malloc, plus 1.5 bytes (32-bit) or 2.5 bytes (64-bit) of
> out-of-band metadata. This overhead (5.5 or 6.5 bytes per
> allocation) is uniform across all sizes.
Make that 6 or 7 since there's also a 16-byte (counting alignment,
which is most of it) group header that imposes 0.5 bytes of overhead
per slot for a full-length 32-slot group.
Rich
Is your malloc implementation any better than
https://github.com/microsoft/mimalloc? AFAIK one can use mimalloc in
one's projects as it has a compatible license. The only thing that
bothers me is that it is done by Microsoft Research.

On 11/29/19, Rich Felker <dalias@libc.org> wrote:
> On Thu, Nov 28, 2019 at 04:56:42PM -0500, Rich Felker wrote:
>> Work on the new malloc is well underway, and I have a draft version
>> now public at:
>>
>> https://github.com/richfelker/mallocng-draft
>>
>> Some highlights:
>>
>> - Smallest size-class now has 12 of 16 bytes usable for storage,
>>   compared to 8 of 16 (32-bit archs) or 16 of 32 (64-bit archs) with
>>   the old malloc, plus 1.5 bytes (32-bit) or 2.5 bytes (64-bit) of
>>   out-of-band metadata. This overhead (5.5 or 6.5 bytes per
>>   allocation) is uniform across all sizes.
>
> Make that 6 or 7 since there's also a 16-byte (counting alignment,
> which is most of it) group header that imposes 0.5 bytes of overhead
> per slot for a full-length 32-slot group.
>
> Rich

-- 
Respectfully,
Boytsov Vasya
On Fri, Nov 29, 2019 at 09:37:15AM +0300, Vasya Boytsov wrote:
> Is your malloc implementation any better than
> https://github.com/microsoft/mimalloc?
> AFAIK one can use mimalloc in their projects as it has compatible license.
> the only thing that bothers is that it is done by Microsoft research.

"Better" is not meaningful without specifying criteria for evaluation.
The information on mimalloc is sparse, not covering the basic design it
uses AFAICT, but rather mainly the new properties it has, which are
mostly for the purpose of its specialized sharding API (not
useful/usable for the malloc API). So without reading into it much
more, it's hard to say.

From the summary, their idea of "small" is on a completely different
order of magnitude from musl's. 6kLOC is gigantic, far larger than any
single component of musl. Of course it might be really sparse code with
lots of whitespace and comments, but it still doesn't bode well.

Regarding "fast", everyone aiming for "fast" claims fast with their own
benchmarks. I generally don't trust malloc benchmarks after N times of
digging into them and finding they were tuned just above/below a
threshold to make a competitor look bad. That's not to say the same
happened here, but "fast at what?" is a real question, and the only
meaningful test seems to be measurement under real-world workloads.

The security measures mentioned seem weaker than the WIP new malloc's
and likely not much better than the current one's. But it's hard to
evaluate security/hardening measures without a model of proposed
attacks and which ones they mitigate. This is one thing I've focused on
in the new malloc, classifying and addressing them such that we can
rule out entire classes. Roughly, without already having a complex
capability to follow multiple pointers and clobber the resulting
memory, only the application data stored in the heap, not the
allocator's heap state, is subject to corruption by common methods
(linear overflow, UAF, DF).
Not all of this is implemented yet in the draft.

Those things aside, ultimately the answer is that the question is not
what is the best malloc for some arbitrary definition of best, but
what's the best that fits into the constraints of musl, including:

- low absolute space cost at low malloc utilization
- low relative space cost at moderate to high utilization
- compatibility with 32-bit address space
- compatibility with nommu
- compatibility with RLIMIT_AS, no-overcommit configurations

and also provides hardening improvements over the current allocator,
which gets by with only minimal measures.

Nobody should be expecting magical performance improvements out of
this. The likely user-facing changes will be:

- elimination of heap size explosion in multithreaded apps (known bug)
- better return of freed memory to the system
- greatly reduced exploitability of bugs related to malloc usage

Speed could go up or down. Hopefully it goes up for lots of things. It
will be a "win" in terms of speed if it merely avoids going down by
more than fixing the "heap size explosion" bug would make it go down
with the current dlmalloc-type design -- see:

https://www.openwall.com/lists/musl/2019/04/12/4

since fixing this bug is absolutely necessary; it keeps impacting
users.

Rich

> On 11/29/19, Rich Felker <dalias@libc.org> wrote:
> > On Thu, Nov 28, 2019 at 04:56:42PM -0500, Rich Felker wrote:
> >> Work on the new malloc is well underway, and I have a draft version
> >> now public at:
> >>
> >> https://github.com/richfelker/mallocng-draft
> >>
> >> Some highlights:
> >>
> >> - Smallest size-class now has 12 of 16 bytes usable for storage,
> >>   compared to 8 of 16 (32-bit archs) or 16 of 32 (64-bit archs) with
> >>   the old malloc, plus 1.5 bytes (32-bit) or 2.5 bytes (64-bit) of
> >>   out-of-band metadata. This overhead (5.5 or 6.5 bytes per
> >>   allocation) is uniform across all sizes.
> >
> > Make that 6 or 7 since there's also a 16-byte (counting alignment,
> > which is most of it) group header that imposes 0.5 bytes of overhead
> > per slot for a full-length 32-slot group.
> >
> > Rich
On Thu, Nov 28, 2019 at 11:56 PM Rich Felker <dalias@libc.org> wrote:
>
> Work on the new malloc is well underway, and I have a draft version
> now public at:
>
> https://github.com/richfelker/mallocng-draft
>

Hi Rich,
will it cover this problem:
https://www.openwall.com/lists/musl/2018/02/09/4 ?

Regards,
Roman
On Thu, Nov 28, 2019 at 04:56:42PM -0500, Rich Felker wrote:
> Work on the new malloc is well underway, and I have a draft version
> now public at:
>
> https://github.com/richfelker/mallocng-draft
>
So I read a bit of the code and noticed a mistake: On error, mmap()
returns MAP_FAILED ((void*)-1) rather than 0.
Ciao,
Markus
On Fri, Nov 29, 2019 at 08:45:58PM +0100, Markus Wichmann wrote:
> On Thu, Nov 28, 2019 at 04:56:42PM -0500, Rich Felker wrote:
> > Work on the new malloc is well underway, and I have a draft version
> > now public at:
> >
> > https://github.com/richfelker/mallocng-draft
> >
>
> So I read a bit of the code and noticed a mistake: On error, mmap()
> returns MAP_FAILED ((void*)-1) rather than 0.
Thanks! Yep, I noticed that a day or two ago while reading but forgot
to mark it down on my list of things to fix. Thanks for catching it.
Rich
On Fri, Nov 29, 2019 at 06:41:08PM +0200, Roman Yeryomin wrote:
> On Thu, Nov 28, 2019 at 11:56 PM Rich Felker <dalias@libc.org> wrote:
> >
> > Work on the new malloc is well underway, and I have a draft version
> > now public at:
> >
> > https://github.com/richfelker/mallocng-draft
>
> Hi Rich,
> will it cover this problem: https://www.openwall.com/lists/musl/2018/02/09/4 ?
Yes. This is the main initial motivation for redoing malloc. I haven't
run that specific test case, but I have one which should be roughly
equivalent and it passes fine (no exploding heap usage).
Rich
On Fri, Nov 29, 2019 at 08:55:23AM -0500, Rich Felker wrote:
> On Fri, Nov 29, 2019 at 09:37:15AM +0300, Vasya Boytsov wrote:
> > Is your malloc implementation any better than
> > https://github.com/microsoft/mimalloc?
> > AFAIK one can use mimalloc in their projects as it has compatible license.
> > the only thing that bothers is that it is done by Microsoft research.
>
> "Better" is not meaningful without specifying criteria for evaluation.
> The information on mimalloc is sparse, not covering the basic design
> it uses AFAICT, rather mainly the new properties it has which are
> mostly for the purpose of its specialized sharding API (not
> useful/usable for malloc API). So without reading into it much more,
> it's hard to say.
I've done a bit more reading of the source (mainly headers) and it
looks like this allocator is in the "extreme performance at the
expense of extreme size costs" camp with jemalloc, etc. It allocates
"segments" of 2MB at a time from the OS, with 2MB alignment (this
probably kills ASLR effectiveness), and splits them up into 512k or
64k "pages" (which becomes an overloaded term) belonging to particular
threads. It looks like each "page" consists of blocks all of the same
size, so that, for N threads each using M different allocation sizes,
you're consuming at least N*M*64k of memory.
So indeed, it looks like it's not using an approach that's relevant to
musl.
On Thu, Nov 28, 2019 at 04:56:42PM -0500, Rich Felker wrote:
> Work on the new malloc is well underway, and I have a draft version
> now public at:
>
> https://github.com/richfelker/mallocng-draft
>
> Some highlights:

And some updates: Since posting this, I've found and fixed some bugs.

One thing I'm really happy about is that I didn't have to wade through
any application-level memory corruption. Aside from issues from
compilers doing bad things without -ffreestanding, and the MAP_FAILED
issue I never actually hit, all of them were caught as assertion
failure traps. This is very different from my experience developing the
old malloc in musl, and suggests good coverage for the consistency
checking which is tied to hardening. (Note: some of the consistency
checks are probably overzealous and unrelated to likely attack vectors,
and may make sense to disable later to improve performance.)

So now, as of:

https://github.com/richfelker/mallocng-draft/commits/afc39b01c82100cbb3f343c6e0ca1bc963e4ce23

it's working to run (via LD_PRELOAD interposition) firefox, gimp,
inkscape, and a number of less-demanding applications I've tried. I
haven't done any rigorous testing, but at first glance firefox memory
usage "seems" to be more stable, and varies up/down with usage rather
than just going up.

Strategy for creating new groups and how soon to reuse freed memory
probably still has a lot of suboptimal properties, but I think the new
allocator is usable/testable at this point.

Rich
On Sat, Nov 30, 2019 at 05:11:50PM -0500, Rich Felker wrote:
> On Thu, Nov 28, 2019 at 04:56:42PM -0500, Rich Felker wrote:
> > Work on the new malloc is well underway, and I have a draft version
> > now public at:
> >
> > https://github.com/richfelker/mallocng-draft
> >
> > Some highlights:
>
> And some updates:
>
> [...]
>
> Strategy for creating new groups and how soon to reuse freed memory
> probably still has a lot of suboptimal properties, but I think the new
> allocator is usable/testable at this point.
This has been improved a lot, to the point that it's near what I'd
call finished now. Logic is now in place to limit early growth of the
heap, so that a small number of allocations in a new size class don't
cause eager allocation of largeish groups. The main change I still
want to make here is allowing direct mmap of objects well below
MMAP_THRESHOLD (~128k), likely all the way down to ~8x PAGESIZE (where
waste from page alignment is bounded by 12.5%). With 4k pages, that's
the last 8 (of 48) size classes.
In some sense, we could just have MMAP_THRESHOLD be smaller, but (1)
that breaks archs with large pages, and (2) it likely leads to bad
fragmentation when you have large numbers of such objects. Instead,
what I'd like to do is simply allow single-slot groups for these size
classes, mapping only as many pages as they need (as opposed to
rounding up to size class; like for large mmap-serviced allocations),
but still accounting usage so that we can choose to allocate large
groups once usage is high.
Rich