* What's wrong with musl's malloc @ 2018-04-20 20:09 Rich Felker 2018-04-21 5:28 ` Rich Felker ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Rich Felker @ 2018-04-20 20:09 UTC (permalink / raw) To: musl I've talked on and off about musl's malloc needing to be overhauled or replaced, and gaining the ability to experiment with that is one of the motivations for making malloc interposable/replaceable. Here's a brief write-up of what's wrong that needs to be addressed: The main benefit of musl's malloc vs the standard dlmalloc algorithms it's based on is the fine-grained locking. As long as there are binned free chunks of various sizes available, threads calling malloc will only contend for a lock when they're requesting allocations of same or similar size. This works out well under artificial random loads; I'm not sure how much it helps on typical real-world loads. In any case, the fine-grained locking also created a problem I didn't anticipate when I designed it: when allocating memory that involves splitting a free chunk, or when freeing memory that will get merged with an adjacent free chunk, the operation as a whole is not atomic; rather, a large free chunk is consumed, possibly emptying the bin it lies in, split or merged, then returned either to the same or a different bin. By saying this operation is non-atomic, I mean other threads see the intermediate state where the large free chunk has been consumed but not yet returned, and when this happens, they unnecessarily split another free chunk or expand the heap rather than waiting for it to be returned. This yields bad, potentially catastrophic, fragmentation. The malloc side already partially mitigates this with the "pretrim" function, which is used whenever the leftover part after splitting would remain in the same bin it was already in. In particular this covers splitting small allocations off a large "top of heap" zone, but it doesn't help with splitting small chunks.
And the free side is much worse -- large free zones momentarily disappear every time something adjacent to them is freed. In order to overcome this fully while maintaining the fine-grained locking, we'd need the ability to hold multiple locks concurrently, which would require a protocol for lock acquisition order to avoid deadlock. But there may be cheap mitigations that could be made in free like on the malloc side. For instance it might be possible to make a simpler protocol whereby we could always hold the lock for bin 63, thereby avoiding exposing a state where large free zones are unavailable. If there's a way to make this work, moving the binmap masking for newly-emptied bins from unbin() to unlock_bin() would ensure that malloc doesn't "see" a "temporary empty" state. In the long term, I think the whole malloc design we're using now is wrong, and should be replaced, but until the time comes to actually do that, it may make sense to try to improve some of the worst properties, or even to reduce or eliminate the granularity of the locking if it proves not to be very valuable on real-world loads. Rich ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What's wrong with musl's malloc 2018-04-20 20:09 What's wrong with musl's malloc Rich Felker @ 2018-04-21 5:28 ` Rich Felker 2018-04-21 5:52 ` Rich Felker 2018-04-22 19:34 ` Markus Wichmann 2018-04-22 21:40 ` Michael Clark 2 siblings, 1 reply; 12+ messages in thread From: Rich Felker @ 2018-04-21 5:28 UTC (permalink / raw) To: musl On Fri, Apr 20, 2018 at 04:09:04PM -0400, Rich Felker wrote: > I've talked on and off about musl's malloc needing to be overhauled or > replaced, and gaining the ability to experiment with that is one of > the motivations for making malloc interposable/replacable. Here's a > brief write-up of what's wrong that needs to be addressed: > > > The main benefits of musl's malloc vs the standard dlmalloc algorithms > it's based on is the fine-grained locking. As long as there are binned > free chunks of various sizes available, threads calling malloc will > only contend for a lock when they're requesting allocations of same or > similar size. This works out well under artificial random loads; I'm > not sure how much it helps on typical real-world loads. > > > In any case, the fine-grained locking also created a problem I didn't > anticipate when I designed it: when allocating memory that involves > splitting a free chunk, or when freeing memory that will get merged > with an adjacent free chunk, the operation as a whole is not atomic; > rather, a large free chunk is consumed, possibly emptying the bin it > lies in, split or merged, then returned either to the same or a > different bin. By saying this operation is non-atomic, I mean other > threads see the intermediate state where the large free chunk has been > consumed but not yet returned, and when this happens, they > unnecessarily split another free chunk or expand the heap rather than > waiting for it to be returned. This yields bad, potentially > catastrophic, fragmentation. 
> > The malloc side already partially mitigates this with the "pretrim" > function, which is used whenever the leftover part after splitting > would remain in the same bin it was already in. In particular this > covers splitting small allocations off a large "top of heap" zone, but > it doesn't help with splitting small chunks. And the free side is much > worse -- large free zones momentarily disappear every time something > adjacent to them is freed. > > In order to overcome this fully while maintaining the fine-grained > locking, we'd need the ability to hold multiple locks concurrently, > which would require a protocol for lock acquisition order to avoid > deadlock. But there may be cheap mitigations that could be made in > free like on the malloc side. For instance it might be possible to > make a simpler protocol whereby we could always hold the lock for bin > 63, thereby avoiding exposing a state where large free zones are > unavailable. If there's a way to make this work, moving the binmap > masking for newly-emptied bins from unbin() to unlock_bin() would > ensure that malloc doesn't "see" a "temporary empty" state. One protocol that might work: When locking multiple bins, they must be locked in decreasing order. For malloc, this is the natural thing to do -- the bin you return the excess to will be the same or smaller than the bin you allocate from. For free, take the freelock from the beginning, rather than just momentarily at the end. Then you know the bin sizes you'll be working with (up to at most 3 -- self, prev, and next) and can lock them in the protocol order, merge them, and bin the result without any looping. The MADV_DONTNEED operation (the most expensive and contention-generating part when it happens) can be deferred until after the freelock and all but one of the bin locks are released. Overall I like this because it creates an opportunity to simplify the code.
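[Editor's note: a minimal sketch of the "lock bins in decreasing order" protocol described above. This is illustrative only, not musl source; lock_bin/unlock_bin stand in for musl's internal helpers, and lock_bins_ordered is a hypothetical helper name.]

```c
#include <pthread.h>

#define NBINS 64
static pthread_mutex_t binlock[NBINS];
static pthread_once_t binlock_once = PTHREAD_ONCE_INIT;

static void binlock_init(void)
{
    for (int i = 0; i < NBINS; i++) pthread_mutex_init(&binlock[i], 0);
}

static void lock_bin(int i)
{
    pthread_once(&binlock_once, binlock_init);
    pthread_mutex_lock(&binlock[i]);
}

static void unlock_bin(int i)
{
    pthread_mutex_unlock(&binlock[i]);
}

/* Lock up to three bins (self, prev, next) in strictly decreasing
   index order, skipping duplicates. Because every thread acquires
   multiple bin locks in the same global order, no lock-order cycle --
   and hence no deadlock -- is possible. Returns the number of
   distinct bins locked; bins[] holds them in locking order. */
static int lock_bins_ordered(int bins[], int n)
{
    for (int i = 1; i < n; i++)            /* insertion sort, descending */
        for (int j = i; j > 0 && bins[j] > bins[j-1]; j--) {
            int t = bins[j]; bins[j] = bins[j-1]; bins[j-1] = t;
        }
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (m && bins[m-1] == bins[i]) continue;   /* duplicate bin */
        bins[m] = bins[i];
        lock_bin(bins[m]);
        m++;
    }
    return m;
}
```

A free that touches bins {5, 63, 5} would thus lock bin 63 first, then bin 5 once, matching the "same or smaller" order malloc naturally uses when returning leftover chunks.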
My guess is that performance would be roughly the same, or at least not significantly worse (and probably significantly better for single-threaded use), and the chances at catastrophic fragmentation from contention would be gone. I still think a complete redesign is the right way forward in the long term. Rich ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What's wrong with musl's malloc 2018-04-21 5:28 ` Rich Felker @ 2018-04-21 5:52 ` Rich Felker 0 siblings, 0 replies; 12+ messages in thread From: Rich Felker @ 2018-04-21 5:52 UTC (permalink / raw) To: musl On Sat, Apr 21, 2018 at 01:28:12AM -0400, Rich Felker wrote: > One protocol that might work: > > When locking multiple bins, they must be locked in decreasing order. > > For malloc, this is the natural thing to do -- the bin you return the > excess to will be the same or smaller than the bin you allocate from. > > For free, take the freelock from the beginning, rather than just > momentarily at the end. Then you know the bin sizes you'll be working ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > with (up to at most 3 -- self, prev, and next) and can lock them in ^^^^^^^^^^^^^^^^^^^^^ I don't think this is actually quite correct. A concurrent malloc could consume part of either of the adjacent free chunks. It's okay if it consumes the whole thing, but if it only consumes part, you'll end up with a new adjacent free chunk of a different size (different bin). Of course loop-and-retry is an option but doesn't seem like a good one; the retry time could be unbounded. Not sure if this is salvageable or not. Problems like this keep pointing in the direction of wanting to lock chunks (via their header/footer) rather than locking just bins. On the plus side it's much finer-grained locking and MT-friendly. Unfortunately it's also more complex and might have more fixed overhead. Rich ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What's wrong with musl's malloc 2018-04-20 20:09 What's wrong with musl's malloc Rich Felker 2018-04-21 5:28 ` Rich Felker @ 2018-04-22 19:34 ` Markus Wichmann 2018-04-23 2:00 ` Rich Felker 2018-04-23 6:50 ` A. Wilcox 2018-04-22 21:40 ` Michael Clark 2 siblings, 2 replies; 12+ messages in thread From: Markus Wichmann @ 2018-04-22 19:34 UTC (permalink / raw) To: musl On Fri, Apr 20, 2018 at 04:09:04PM -0400, Rich Felker wrote: > I've talked on and off about musl's malloc needing to be overhauled or > replaced, and gaining the ability to experiment with that is one of > the motivations for making malloc interposable/replacable. Here's a > brief write-up of what's wrong that needs to be addressed: > Yeah, I was about to ask for one. > > The main benefits of musl's malloc vs the standard dlmalloc algorithms > it's based on is the fine-grained locking. As long as there are binned > free chunks of various sizes available, threads calling malloc will > only contend for a lock when they're requesting allocations of same or > similar size. This works out well under artificial random loads; I'm > not sure how much it helps on typical real-world loads. > That chiefly depends on the design of the programs. Mine tend to avoid heap allocation wherever possible, and wherever impossible tend to coalesce allocations into few large requests. However, a lot of object-oriented code I have seen (even in C) allocates many small objects. But that is good: In the small numbers, there are many bins available. If the program makes random requests for anything between 16 and 128 bytes, there are 4 bins for that. Add 2k to each of these and they all fall into one bin. 
> > In any case, the fine-grained locking also created a problem I didn't > anticipate when I designed it: when allocating memory that involves > splitting a free chunk, or when freeing memory that will get merged > with an adjacent free chunk, the operation as a whole is not atomic; > rather, a large free chunk is consumed, possibly emptying the bin it > lies in, split or merged, then returned either to the same or a > different bin. By saying this operation is non-atomic, I mean other > threads see the intermediate state where the large free chunk has been > consumed but not yet returned, and when this happens, they > unnecessarily split another free chunk or expand the heap rather than > waiting for it to be returned. This yields bad, potentially > catastrophic, fragmentation. > Fragmentation is bad with the current malloc() even in the single threaded case. A simple example is a request for fifty bytes followed by a request for two-thousand fifty bytes. If the stack was empty before, the first request will allocate a page from the OS. Assuming that was 4k, malloc will now split off the rest of the page and put it in the bin for "chunks sized 2k - 4k". The second request however will be rounded up, and seeing as the bin for "chunks sized 4k - 8k" is still empty, two more pages will be allocated from the system. Then the requested 2k and change will be split off, leaving the heap with one free chunk in the "2k - 4k" bin that is just a bit beneath 4k, and one free chunk in the "4k - 8k" bin that is circa 6k in size. Three pages were allocated where one would have sufficed. (That is one definition of fragmentation: The request could not be serviced although the resources were available.) The benefit of this scheme, of course, is that allocations in the single-threaded case are bounded time: The algorithm is to pop off the first chunk in the bins large enough to support the request, or to allocate the memory necessary for that from the OS.
In the worst case, allocation is
- OS memory allocation
- allocation of a chunk from the bin
- splitting that chunk
Each of these is constant time. I am not sure optimizing fragmentation is worth reducing the performance for. Especially in today's desktop and server systems, where anything below 16GB of RAM will just get laughed off the stage (slight exaggeration). [...] > > In the long term, I think the whole malloc design we're using now is > wrong, and should be replaced, but until the time comes to actually do > that, it may make sense to try to improve some of the worst > properties, or even to reduce or eliminate the granularity of the > locking if it proves not to be very valuable on real-world loads. > Care to elaborate on what is wrong with the current design of the malloc? Everything you named so far was just implementation issues, but nothing that's design related. So far, I am also not impressed with the competition. dietlibc for instance runs essentially the same algorithm, but with a big global lock in the multi-threaded case. tcmalloc runs essentially the same algorithm, but with an arena for each thread. Therefore every chunk has to know which arena it came from, thus increasing the overhead by a pointer. But it is all fundamentally the same. Not like with sorting algorithms, where you get meaningful differences and tradeoffs. No, the tradeoffs appear to be entirely in the area I'd call tuning: Where do you put the mmap threshold, how do you lock, do you optimize locality of reference and tiny allocations, all that stuff. But as far as I can tell, they all suffer from the problem described above. Well, OK, for 32-bit versions of dietlibc, 2050 bytes is already beyond the mmap threshold. But if it weren't, my point would still stand. > Rich Ciao, Markus ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What's wrong with musl's malloc 2018-04-22 19:34 ` Markus Wichmann @ 2018-04-23 2:00 ` Rich Felker 2018-04-23 19:02 ` Markus Wichmann 2018-04-23 6:50 ` A. Wilcox 1 sibling, 1 reply; 12+ messages in thread From: Rich Felker @ 2018-04-23 2:00 UTC (permalink / raw) To: musl On Sun, Apr 22, 2018 at 09:34:50PM +0200, Markus Wichmann wrote: > On Fri, Apr 20, 2018 at 04:09:04PM -0400, Rich Felker wrote: > > I've talked on and off about musl's malloc needing to be overhauled or > > replaced, and gaining the ability to experiment with that is one of > > the motivations for making malloc interposable/replacable. Here's a > > brief write-up of what's wrong that needs to be addressed: > > Yeah, I was about to ask for one. > > > The main benefits of musl's malloc vs the standard dlmalloc algorithms > > it's based on is the fine-grained locking. As long as there are binned > > free chunks of various sizes available, threads calling malloc will > > only contend for a lock when they're requesting allocations of same or > > similar size. This works out well under artificial random loads; I'm > > not sure how much it helps on typical real-world loads. > > That chiefly depends on the design of the programs. Mine tend to avoid > heap allocation wherever possible, and wherever impossible tend to > coalesce allocations into few large requests. However, a lot of For such programs, the properties of malloc are largely irrelevant, so the only optimization that makes sense for them is ensuring that the malloc implementation not do anything dumb like allocate tens to hundreds of kB or even MB for expensive bookkeeping structures, etc. > object-oriented code I have seen (even in C) allocates many small > objects. Indeed, this is the class of programs that are most affected, and includes things like web browsers. > But that is good: In the small numbers, there are many bins available. 
> If the program makes random requests for anything between 16 and 128 > bytes, there are 4 bins for that. Add 2k to each of these and they all > fall into one bin. This is incorrect. Up to size 512 bytes (1024 on 64-bit archs), every single multiple of 16 (or of 32, on 64-bit) is its own bin. Only past that point does sparse spacing of bins begin. > > In any case, the fine-grained locking also created a problem I didn't > > anticipate when I designed it: when allocating memory that involves > > splitting a free chunk, or when freeing memory that will get merged > > with an adjacent free chunk, the operation as a whole is not atomic; > > rather, a large free chunk is consumed, possibly emptying the bin it > > lies in, split or merged, then returned either to the same or a > > different bin. By saying this operation is non-atomic, I mean other > > threads see the intermediate state where the large free chunk has been > > consumed but not yet returned, and when this happens, they > > unnecessarily split another free chunk or expand the heap rather than > > waiting for it to be returned. This yields bad, potentially > > catastrophic, fragmentation. > > > > Fragmentation is bad with the current malloc() even in the single > threaded case. A simple example is a request for fifty bytes followed by > a request for two-thousand fifty bytes. If the stack was empty before, > the first request will allocate a page from the OS. Assuming that was > 4k, malloc will now split off the rest of the page and put it in the bin > for "chunks sized 2k - 4k". The second request however will be rounded > up, and seeing as the bin for "chunks sized 4k - 8k" is still empty, two > more pages will be allocated from the system. Then the requested 2k and > change will be split off, leaving the heap with one free chunk in the > "2k - 4k" bin that is just a bit beneath 4k, and one free chunk in the > "4k - 8k" bin that is circa 6k in size. 
Three pages were allocated where > one would have sufficed. (That is one definition of fragmentation: The > request could not be serviced although the resources were available.) This is not how it works, at least not at scale. When heap is expanded via brk, the expansion merges with the existing top of the heap; discrete pages are not relevant. When it's expanded via mmap (because brk doesn't work/is blocked on some other mapping), it's slightly true for the first few allocations, but asymptotically irrelevant since we don't keep allocating small maps. The amount of new anon memory mapped grows exponentially, doubling every two times it happens. There is also no bin for "sized 2k to 4k". It's "sized 2048 bytes to 2560 bytes", etc. -- granularity of bin sizes is in 1/4 steps up to the next power of 2. So if you do start with a 4k chunk, after splitting off 50 bytes to allocate, you have left a chunk in the "sized 3586 up to 4096" bin, and the 2050-byte allocation is perfectly satisfiable from it. You could adjust the example to work, but then the fragmentation you find as a result is much lower. > The benefit of this scheme, of course, is that allocations in the > single-threaded case are bounded time: The algorithm is to pop off the > first chunk in the bins large enough to support the request, or to > allocate the memory necessary for that from the OS. In the worst case, > allocation is > - OS memory allocation > - allocation of a chunk from the bin > - splitting that chunk > > Each of these is constant time. I am not sure optimizing fragmentation > is worth reducing the performance for. Especially in today's desktop > and server systems, where anything below 16GB of RAM will just get > laughed off the stage (slight exaggeration). I've rarely used a system with 16GB ram, and plenty of systems using musl have under 1GB. The old musl git/web server had 256MB before the host shut down; now we're stuck with a more expensive one with something like 768MB or 1GB.
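[Editor's note: an illustrative size-to-bin mapping showing the spacing described above -- one bin per 16-byte step for small sizes, then quarter steps between consecutive powers of two. This is NOT musl's actual bin_index(), which uses different offsets and a lookup table; sketch_bin_index and its constants are hypothetical.]

```c
#include <stddef.h>

/* Map a request size to a bin index: sizes up to 512 bytes get one
   bin per 16-byte multiple; beyond that, bin boundaries fall at 1/4
   steps between powers of two (e.g. 2048, 2560, 3072, 3584, 4096). */
static int sketch_bin_index(size_t size)
{
    size_t units = size / 16;
    if (units <= 32) return (int)units;       /* linear region */
    int msb = 0;                              /* floor(log2(units)) */
    for (size_t t = units; t > 1; t >>= 1) msb++;
    /* The two bits below the top bit select which quarter step of
       the current power-of-two range the size falls in. */
    int quarter = (int)((units >> (msb - 2)) & 3);
    return 33 + (msb - 5) * 4 + quarter;      /* hypothetical offset */
}
```

With this mapping, 48- and 64-byte requests land in distinct bins, while everything from 2048 up to (but not including) 2560 bytes shares one bin, matching the "sized 2048 bytes to 2560 bytes" granularity in the message above.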
Also 32-bit archs have a hard *virtual* limit of 2, 3, or 4 GB (mostly 2 or 3) regardless of how much physical memory you have. Treating 16GB like something reasonable to have is how you get the glibc & mainstream browser ecosystem... > > In the long term, I think the whole malloc design we're using now is > > wrong, and should be replaced, but until the time comes to actually do > > that, it may make sense to try to improve some of the worst > > properties, or even to reduce or eliminate the granularity of the > > locking if it proves not to be very valuable on real-world loads. > > > > Care to elaborate on what is wrong with the current design of the > malloc? Everything you named so far was just implementation issues, but > nothing that's design related. Designs that require splitting/merging chunks inherently require excessive synchronization that makes them not scale well to many threads/cores (and awful on NUMA, if anyone cares). They also have inherent fragmentation problems with certain allocation patterns, like alternating between small/large allocations then freeing all the large ones, which is a reasonable pattern (e.g. allocating large working spaces and small result spaces, then freeing all the working spaces). Designs where all the metadata (and especially freelist pointers) are inline are highly vulnerable to heap overflow and use-after-free attacks. If possible, I'd rather have less metadata inline and have it efficiently coupled with something out-of-line that's harder to attack. This might also improve performance and reduce contention, e.g. if we used atomic bitmasks in an index of chunks to allocate/free them rather than having to move them in and out of linked lists. > So far, I am also not impressed with the competition. dietlibc for > instance runs essentially the same algorithm, but with a big global lock > in the multi-threaded case. tcmalloc runs essentially the same > algorithm, but with an arena for each thread. 
Therefore every chunk has > to know which arena it came from, thus increasing the overhead by a > pointer. But it is all fundamentally the same. Not like with sorting > algorithms, where you get meaningful differences and tradeoffs. No, the Here when I say "whole design is wrong" I'm considering all these as the same basic design -- they're all dlmalloc variants. Doing better in important ways while not doing worse in any important way is a hard problem. But it seems like one worth solving. Rich ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What's wrong with musl's malloc 2018-04-23 2:00 ` Rich Felker @ 2018-04-23 19:02 ` Markus Wichmann 2018-04-23 19:47 ` Rich Felker 0 siblings, 1 reply; 12+ messages in thread From: Markus Wichmann @ 2018-04-23 19:02 UTC (permalink / raw) To: musl On Sun, Apr 22, 2018 at 10:00:10PM -0400, Rich Felker wrote: > This is incorrect. Up to size 512 bytes (1024 on 64-bit archs), every > single multiple of 16 (or of 32, on 64-bit) is its own bin. Only past > that point does sparse spacing of bins begin. > Yeah, I see that now. My error was assuming that bin_index() was basically just a binary logarithm. It is not. It is some logarithm beyond the linear range, but not a binary log. Which one can easily see by plotting the results of that function in a graph. Maybe I should have verified my assumptions before posting. > There is also no bin for "sized 2k to 4k". It's "sized 2048 bytes to > 2560 bytes", etc. -- granularity of bin sizes is in 1/4 steps up to > the next power of 2. So If you do start with a 4k chunk, after > splitting off 50 bytes to allocate, you have left a chunk in the > "sized 3586 up to 4096" bin, and the 2050-byte allocation is perfectly > satisfiable from it. You could adjust the example to work, but then > the fragmentation you find as a result is much lower. > Yup, by my calculation I get about 400 bytes of fragmentation in the worst case. The "just shy of 4k" bin starts at 3616 bytes, so after allocating a tiny amount of memory, allocating 3648 bytes or more will require heap expansion even though the memory to fulfill the request is already available. And expand_heap() will only round up to one page size. So afterwards, roughly 3700 bytes will be allocated in two pages. The example also does not scale, as repeating the tiny allocation will not get another page. And repeatedly allocating 3650 bytes will allocate one page per request, but it is doubtful that a better scheme is available. 
> > The benefit of this scheme, of course, is that allocations in the > > single-threaded case are bounded time: The algorithm is to pop off the > > first chunk in the bins large enough to support the request, or to > > allocate the memory necessary for that from the OS. In the worst case, > > allocation is > > - OS memory allocation > > - allocation of a chunk from the bin > > - splitting that chunk > > > > Each of these is constant time. I am not sure optimizing fragmentation > > is worth reducing the performance for. Especially in today's desktop > > and server systems, where anything below 16GB of RAM will just get > > laughed off the stage (slight exaggeration). > > I've rarely used a system with 16GB ram, and plenty of systems using > musl have under 1GB. The old musl git/web server had 256MB before the > host shutdown had; now we're stuck with a more expensive one with > something like 768MB or 1GB. Also 32-bit archs have a hard *virtual* > limit of 2, 3, or 4 GB (mostly 2 or 3) regardless of how much physical > memory you have. Treating 16GB like something reasonable to have is > how you get the glibc & mainstream browser ecosystem... > Yeah, that statement was not terribly well thought through. My point was, though, that the current design manages to make single-threaded allocation constant time (well, as constant as expand_heap()). At the cost of more fragmentation than necessary. Of course 16GB is a ridiculous number. That's why I chose it. My largest system has 8GB, and that was after scavenging some parts out of laptops due for the scrap heap. > > > In the long term, I think the whole malloc design we're using now is > > > wrong, and should be replaced, but until the time comes to actually do > > > that, it may make sense to try to improve some of the worst > > > properties, or even to reduce or eliminate the granularity of the > > > locking if it proves not to be very valuable on real-world loads. 
> > > > > > > Care to elaborate on what is wrong with the current design of the > > malloc? Everything you named so far was just implementation issues, but > > nothing that's design related. > > Designs that require splitting/merging chunks inherently require > excessive synchronization that makes them not scale well to many > threads/cores (and awful on NUMA, if anyone cares). They also have > inherent fragmentation problems with certain allocation patterns, like > alternating between small/large allocations then freeing all the large > ones, which is a reasonable pattern (e.g. allocating large working > spaces and small result spaces, then freeing all the working spaces). > How do we work around this, then? We could go the tcmalloc route and essentially run the allocator separately for each thread (arena allocation), which reduces the amount of synchronization at the cost of more fragmentation; this time cross-thread fragmentation (no-one will be able to utilize all the memory that is lost to overhead across all threads). We could also go the dietlibc route, which uses fixed widths for each bin. There is a bin of 16-byte elements, and it is a list of chunks that are all 16 bytes in size. If someone wants 16 bytes, we know where to look. Of course, if someone wants first 16, then 32, then 48 bytes, that will allocate 96 bytes in three pages. Fragmentation doesn't begin to describe it. > Designs where all the metadata (and especially freelist pointers) are > inline are highly vulnerable to heap overflow and use-after-free > attacks. If possible, I'd rather have less metadata inline and have it > efficiently coupled with something out-of-line that's harder to > attack. This might also improve performance and reduce contention, > e.g. if we used atomic bitmasks in an index of chunks to allocate/free > them rather than having to move them in and out of linked lists. > That is a very good idea. Say, don't OS memory managers have to do it this way? 
Maybe pick up some pointers from them... > Here when I say "whole design is wrong" I'm considering all these as > the same basic design -- they're all dlmalloc variants. Doing better > in important ways while not doing worse in any important way is a hard > problem. But it seems like one worth solving. > > Rich And here I thought memory management was a solved problem. Maybe it's time I hit a library again. Ciao, Markus ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What's wrong with musl's malloc 2018-04-23 19:02 ` Markus Wichmann @ 2018-04-23 19:47 ` Rich Felker 0 siblings, 0 replies; 12+ messages in thread From: Rich Felker @ 2018-04-23 19:47 UTC (permalink / raw) To: musl On Mon, Apr 23, 2018 at 09:02:05PM +0200, Markus Wichmann wrote: > > There is also no bin for "sized 2k to 4k". It's "sized 2048 bytes to > > 2560 bytes", etc. -- granularity of bin sizes is in 1/4 steps up to > > the next power of 2. So if you do start with a 4k chunk, after > > splitting off 50 bytes to allocate, you have left a chunk in the > > "sized 3586 up to 4096" bin, and the 2050-byte allocation is perfectly > > satisfiable from it. You could adjust the example to work, but then > > the fragmentation you find as a result is much lower. > > > > Yup, by my calculation I get about 400 bytes of fragmentation in the > worst case. The "just shy of 4k" bin starts at 3616 bytes, so after > allocating a tiny amount of memory, allocating 3648 bytes or more will > require heap expansion even though the memory to fulfill the request is > already available. And expand_heap() will only round up to one page > size. So afterwards, roughly 3700 bytes will be allocated in two pages. > > The example also does not scale, as repeating the tiny allocation will > not get another page. And repeatedly allocating 3650 bytes will allocate > one page per request, but it is doubtful that a better scheme is > available. Ah, I was going to correct you here, but I think you've identified a significant inefficiency in the current implementation. My intent was that expand_heap merge the newly obtained space with the existing top of the heap, but in fact it leaves a free gap below it. Merging would require temporarily taking possession of the existing free chunk just below it, which brings us back to the original problem this thread described (non-atomicity, ability of other threads to 'see' the moment when it's temporarily taken) unless we have a solution for that.
So, if things worked as I originally envisioned, there would be no fragmentation here. But as it's working now, there is, and it could be significant. This might be the biggest source of unintended fragmentation we've found yet. > > > > In the long term, I think the whole malloc design we're using now is > > > > wrong, and should be replaced, but until the time comes to actually do > > > > that, it may make sense to try to improve some of the worst > > > > properties, or even to reduce or eliminate the granularity of the > > > > locking if it proves not to be very valuable on real-world loads. > > > > > > > > > > Care to elaborate on what is wrong with the current design of the > > > malloc? Everything you named so far was just implementation issues, but > > > nothing that's design related. > > > > Designs that require splitting/merging chunks inherently require > > excessive synchronization that makes them not scale well to many > > threads/cores (and awful on NUMA, if anyone cares). They also have > > inherent fragmentation problems with certain allocation patterns, like > > alternating between small/large allocations then freeing all the large > > ones, which is a reasonable pattern (e.g. allocating large working > > spaces and small result spaces, then freeing all the working spaces). > > > > How do we work around this, then? We could go the tcmalloc route and > essentially run the allocator separately for each thread (arena > allocation), which reduces the amount of synchronization at the cost of > more fragmentation; this time cross-thread fragmentation (no-one will be > able to utilize all the memory that is lost to overhead across all > threads). > > We could also go the dietlibc route, which uses fixed widths for each > bin. There is a bin of 16-byte elements, and it is a list of chunks that > are all 16 bytes in size. If someone wants 16 bytes, we know where to > look. 
Of course, if someone wants first 16, then 32, then 48 bytes, that > will allocate 96 bytes in three pages. Fragmentation doesn't begin to > describe it. One possible approach I'm considering is treating the partitioning of virtual address space into chunks as immutable -- then, little or no synchronization is needed to access the data structures, and rather than having to start the search for a free chunk from a global bin, each thread could have its own "current position" in the graph of free chunks for each size, in such a way that they naturally (at least statistically) avoid stomping on the same memory. Early in process lifetime, when not much has been allocated, the partitioning could be purely based on demand, avoiding over-allocation but creating structures that aren't as flexible for future reuse. Later, groups of up to 32 chunks of the same size could be created such that a single atomic operation could allocate one. Consecutive ones could also be allocated together in principle but I'm concerned that might be problematic. One could envision having lockable bins (like the current bin system) populated not with individual chunks but with groups of [up to] 32 chunks of the same size, where when a thread allocates the last free chunk in a group, it locks the bin and shuffles the data structures in it around to pick (or allocate) a new group of free chunks. This would drop the need for locks to ~1/32 of malloc calls, and much less if the mallocs are frequently paired with frees, in which case you'd get by with mostly just atomic test-and-set-bit operations.
if we used atomic bitmasks in an index of chunks to allocate/free > > them rather than having to move them in and out of linked lists. > > > > That is a very good idea. Say, don't OS memory managers have to do it > this way? Maybe pick up some pointers from them... If you mean like the Linux kernel's, it has very different usage patterns, a lot more control over the usage patterns, and different allocation functions that document the caller's intent to the allocator, versus a userspace malloc. So I think it's solving a simpler problem. Rich ^ permalink raw reply [flat|nested] 12+ messages in thread
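A minimal sketch of the atomic-bitmask idea discussed in the message above: a group of up to 32 same-size chunks whose free slots are tracked by a single atomic mask, so the common alloc/free path needs no lock. The struct layout and names here are invented for illustration, not a proposed musl interface.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* One group of up to 32 same-size chunks. Bit i of free_mask set
 * means slot i is free. Allocation and free are single lock-free
 * atomics; a lock would only be needed when the group runs empty
 * and the bin's group list must be reshuffled. */
struct group {
    _Atomic uint32_t free_mask;
    unsigned char *base;   /* start of the 32 contiguous chunks */
    size_t chunk_size;
};

static void *group_alloc(struct group *g)
{
    uint32_t mask = atomic_load_explicit(&g->free_mask, memory_order_relaxed);
    while (mask) {
        int i = __builtin_ctz(mask);         /* lowest free slot */
        uint32_t bit = (uint32_t)1 << i;
        /* try to claim slot i; on failure, mask is reloaded and we retry */
        if (atomic_compare_exchange_weak_explicit(&g->free_mask, &mask,
                mask & ~bit, memory_order_acquire, memory_order_relaxed))
            return g->base + (size_t)i * g->chunk_size;
    }
    return 0; /* group exhausted: caller would lock the bin and refill */
}

static void group_free(struct group *g, void *p)
{
    size_t i = ((unsigned char *)p - g->base) / g->chunk_size;
    atomic_fetch_or_explicit(&g->free_mask, (uint32_t)1 << i,
                             memory_order_release);
}
```

This matches the "~1/32 of malloc calls need a lock" arithmetic: only the path that returns 0 here would fall back to locking.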
* Re: What's wrong with musl's malloc 2018-04-22 19:34 ` Markus Wichmann 2018-04-23 2:00 ` Rich Felker @ 2018-04-23 6:50 ` A. Wilcox 2018-04-23 16:43 ` Rich Felker 1 sibling, 1 reply; 12+ messages in thread From: A. Wilcox @ 2018-04-23 6:50 UTC (permalink / raw) To: musl On 04/22/18 14:34, Markus Wichmann wrote: > Each of these is constant time. I am not sure optimizing fragmentation > is worth reducing the performance for. Especially in today's desktop > and server systems, where anything below 16GB of RAM will just get > laughed off the stage (slight exaggeration). this is so far off the mark I am actually writing a reply on a Sunday. FreeBSD is currently discussing optimising NFS to run better on 256 MB RAM i386 machines, and is considering the case where a non-zero number of people are running it in 64 MB virtual machines. Adelie is running KDE Plasma 5 on a Pentium III with 256 MB RAM, with enough left over for a few tabs in Otter browser. In fact, our base specifications for a server install are a 40 MB 486 or better (you can squeeze text mode much lower, but apk won't run right below 40; our baseline is 32 MB, however, on the PowerPC). "Today's" systems fall into two buckets: computers with insane specs bought by people in the upper classes, and used computers with lower specs bought by people in the lower classes. after spending a significant chunk of my life (~15 years) in both, I can't see defences like "anything below 16GB of RAM will get laughed off the stage" as anything *but* internalised classism. as engineers, our jobs are to make software for all people, not for the chosen few that can afford 16GB of RAM. hell, my new Talos cost me so much that even though I'd technically be considered upper-class now, I still couldn't afford more than 8 GB RAM for it in the beginning (I'm planning to upgrade in the next months as my finances allow). I'm sorry if this comes off as ranty or preachy.
I'm just trying to enlighten everyone that 1) not everyone has or *can afford* 16 GB RAM; 2) that's a poor excuse for not tuning software to be the best it can be. After all, wasted memory is wasted memory, whether you have a lot or a little. (Wouldn't you rather fit more photos / videos / text in there instead of wasting it on malloc overhead?) Best to you and yours, --arw -- A. Wilcox (awilfox) Project Lead, Adélie Linux http://adelielinux.org ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What's wrong with musl's malloc 2018-04-23 6:50 ` A. Wilcox @ 2018-04-23 16:43 ` Rich Felker 0 siblings, 0 replies; 12+ messages in thread From: Rich Felker @ 2018-04-23 16:43 UTC (permalink / raw) To: musl On Mon, Apr 23, 2018 at 01:50:47AM -0500, A. Wilcox wrote: > On 04/22/18 14:34, Markus Wichmann wrote: > > Each of these is constant time. I am not sure optimizing fragmentation > > is worth reducing the performance for. Especially in today's desktop > > and server systems, where anything below 16GB of RAM will just get > > laughed off the stage (slight exaggeration). > > > this is so far off the mark I am actually writing a reply on a Sunday. > > FreeBSD is currently discussing optimising NFS to run better on 256 MB > RAM i386 machines, and is considering the case where a non-zero number > of people are running it in 64 MB virtual machines. > > Adelie is running KDE Plasma 5 on a Pentium III with 256 MB RAM, with > enough left over for a few tabs in Otter browser. In fact, our base > specifications for a server install are a 40 MB 486 or better (you can > squeeze text mode much lower, but apk won't run right below 40; our > baseline is 32 MB, however, on the PowerPC). > > "Today's" systems fall in to two buckets: computers with insane specs > bought by people in the upper classes, and used computers with lower > specs bought by people in the lower classes. after spending a > significant chunk of my life (~15 years) in both, I can't see defences > like "anything below 16GB of RAM will get laughed off the stage" as > anything *but* internalised classism. as engineers, our jobs are to > make software for all people, not for the chosen few that can afford > 16GB of RAM. > > hell, my new Talos cost me so much that even though I'd technically be > considered upper-class now, I still couldn't afford more than 8 GB RAM > for it in the beginning (I'm planning to upgrade in the next months as > my finances allow). 
> > I'm sorry if this comes off as ranty or preachy. I'm just trying to > enlighten everyone that 1) not everyone has or *can afford* 16 GB RAM; > 2) that's a poor excuse for not tuning software to be the best it can be. > > After all, wasted memory is wasted memory, whether you have a lot or a > little. (Wouldn't you rather fit more photos / videos / text in there > instead of wasting it on malloc overhead? This. All of this. Considerations like the above were largely the motivation for musl. Beyond the question of whether everyone has or can afford 16GB of RAM financially -- while I believe it's important, I'm also less idealistic than I used to be about whether we can engineer efficient replacements for bloated stuff faster than Moore's law makes the bloated stuff usable on affordable machines -- there are questions of what you can do with what you can afford that immediately become relevant once you stop thinking just about the OS as something that runs on a physical present-day high-end box and consider uses like containers, full virtual machines, embedded systems, etc. I recall hearing a story I never verified that there was one PC game with virtualized computers inside the game world running musl on them. This is also the whole reason people are interested in Alpine on Docker. Rich ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What's wrong with musl's malloc 2018-04-20 20:09 What's wrong with musl's malloc Rich Felker 2018-04-21 5:28 ` Rich Felker 2018-04-22 19:34 ` Markus Wichmann @ 2018-04-22 21:40 ` Michael Clark 2018-04-23 2:09 ` Rich Felker 2018-04-23 2:51 ` musl riscv port Rich Felker 2 siblings, 2 replies; 12+ messages in thread From: Michael Clark @ 2018-04-22 21:40 UTC (permalink / raw) To: musl > On 21/04/2018, at 8:09 AM, Rich Felker <dalias@libc.org> wrote: > > I've talked on and off about musl's malloc needing to be overhauled or > replaced, and gaining the ability to experiment with that is one of > the motivations for making malloc interposable/replacable. Here's a > brief write-up of what's wrong that needs to be addressed: > > > The main benefits of musl's malloc vs the standard dlmalloc algorithms > it's based on is the fine-grained locking. As long as there are binned > free chunks of various sizes available, threads calling malloc will > only contend for a lock when they're requesting allocations of same or > similar size. This works out well under artificial random loads; I'm > not sure how much it helps on typical real-world loads. > > > In any case, the fine-grained locking also created a problem I didn't > anticipate when I designed it: when allocating memory that involves > splitting a free chunk, or when freeing memory that will get merged > with an adjacent free chunk, the operation as a whole is not atomic; > rather, a large free chunk is consumed, possibly emptying the bin it > lies in, split or merged, then returned either to the same or a > different bin. By saying this operation is non-atomic, I mean other > threads see the intermediate state where the large free chunk has been > consumed but not yet returned, and when this happens, they > unnecessarily split another free chunk or expand the heap rather than > waiting for it to be returned. This yields bad, potentially > catastrophic, fragmentation. 
> > The malloc side already partially mitigates this with the "pretrim" > function, which is used whenever the leftover part after splitting > would remain in the same bin it was already in. In particular his > covers splitting small allocations off a large "top of heap" zone, but > it doesn't help with splitting small chunks. And the free side is much > worse -- large free zones momentarily disappear every time something > adjacent to them is freed. > > In order to overcome this fully while maintaining the fine-grained > locking, we'd need the ability to hold multiple locks concurrently, > which would require a protocol for lock acquisition order to avoid > deadlock. But there may be cheap mitigations that could be made in > free like on the malloc side. For instance it might be possible to > make a simpler protocol whereby we could always hold the lock for bin > 63, thereby avoiding exposing a state where large free zones are > unavailable. If there's a way to make this work, moving the binmap > masking for newly-emptied bins from unbin() to unlock_bin() would > ensure that malloc doesn't "see" a "temporary empty" state. > > > In the long term, I think the whole malloc design we're using now is > wrong, and should be replaced, but until the time comes to actually do > that, it may make sense to try to improve some of the worst > properties, or even to reduce or eliminate the granularity of the > locking if it proves not to be very valuable on real-world loads. Hi Rich, Have you considered importing jemalloc? It’s been battle tested for 12 years in FreeBSD and in many other apps. It’s bundled and used in MariaDB, which is undoubtedly a demanding user, and presumably Monty approved the choice of allocator, so it shouldn’t be a bad choice. I’d trust the allocator choice of a well respected database architect. From looking at the Lockless, Inc benchmarks jemalloc has good performance from very small to very large allocations.
From the Lockless Inc memory allocator comparisons: “The disadvantage of the jemalloc allocator is its memory usage. It uses power-of-two sized bins, which leads to a greatly increased memory footprint compared to other allocators. This can affect real-world performance due to excess cache and TLB misses.” From the wiki: - https://github.com/jemalloc/jemalloc/wiki/Background “jemalloc is a general purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support. It is intended for use as the system-provided memory allocator, as in FreeBSD's libc library, as well as for linking into C/C++ applications. jemalloc provides many introspection, memory management, and tuning features beyond the standard allocator functionality.” It’s a libc allocator. Here are some of the major users (who it’s fair to say must have exercised due consideration in choosing their memory allocator given there are browsers (Firefox’s mozjemalloc fork), several databases and caches): - FreeBSD - NetBSD - Mozilla Firefox - Varnish - Cassandra - Redis - hhvm - MariaDB - Aerospike - Android Curious how big the linked code is... assuming code size is an important consideration... Looking at the code, ... it doesn’t look too big... and it’s interesting to note that the recent changes to glibc to support per thread pools uses the same tcache feature name as jemalloc whose per thread pool implementation is in tcache.c - https://sourceware.org/ml/libc-alpha/2017-01/msg00452.html - https://sourceware.org/ml/libc-alpha/2017-01/msg00524.html musl libc could vendor the jemalloc code with some sed scripts to modify includes so that jemalloc fits into the musl build structure and can be easily synced with upstream. Ideally musl would want to import/vendor vs using a submodule so musl remains stand-alone. I assume many projects have done the same. MariaDB bundles it. Firefox forked it. I think the primary disadvantage of jemalloc is power of 2 size bins. 
Non power of 2 size strides for small structures have better caching performance for many workloads due to stochastic cache eviction when not using power of 2 strides. I’ve noticed that power of 2 alignments can have pathological performance in some cases, e.g. witnessing an array with sizeof(struct) == 28 or 36 perform much better than sizeof(struct) == 32. That said most apps would be using a single malloc call for these types of performance critical arrays so it’s only an issue for apps that individually malloc many small objects. Worth considering. BTW - I owe you an update on the riscv musl port. I’ll post one soon... I’ve been busy on the QEMU RISC-V port, whose ‘virt’ board has recently been used by Fedora and Debian distro porters for riscv64 arch support (glibc based distros). Stefan O’Rear got MTTCG working on qemu/target/riscv so we can run SMP Linux in QEMU with reasonable performance. Debian for example now has QEMU RISC-V builders for the riscv64 arch. - https://github.com/riscv/riscv-qemu/wiki It will be nice to get the riscv musl support upstream. The major issue at present is ELF thread local storage (pthreads are working but GCC’s __thread is not). The RISC-V TLS model is not documented so it requires some reverse engineering of the riscv support in GCC/binutils/glibc. I haven’t been able to isolate the issue and could benefit from some help by someone more knowledgeable in that area (ELF TLS and architecture specific TLS models). Regards, Michael ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What's wrong with musl's malloc 2018-04-22 21:40 ` Michael Clark @ 2018-04-23 2:09 ` Rich Felker 2018-04-23 2:51 ` musl riscv port Rich Felker 1 sibling, 0 replies; 12+ messages in thread From: Rich Felker @ 2018-04-23 2:09 UTC (permalink / raw) To: musl On Mon, Apr 23, 2018 at 09:40:08AM +1200, Michael Clark wrote: > > > On 21/04/2018, at 8:09 AM, Rich Felker <dalias@libc.org> wrote: > > > > I've talked on and off about musl's malloc needing to be overhauled or > > replaced, and gaining the ability to experiment with that is one of > > the motivations for making malloc interposable/replacable. Here's a > > brief write-up of what's wrong that needs to be addressed: > > > > > > The main benefits of musl's malloc vs the standard dlmalloc algorithms > > it's based on is the fine-grained locking. As long as there are binned > > free chunks of various sizes available, threads calling malloc will > > only contend for a lock when they're requesting allocations of same or > > similar size. This works out well under artificial random loads; I'm > > not sure how much it helps on typical real-world loads. > > > > > > In any case, the fine-grained locking also created a problem I didn't > > anticipate when I designed it: when allocating memory that involves > > splitting a free chunk, or when freeing memory that will get merged > > with an adjacent free chunk, the operation as a whole is not atomic; > > rather, a large free chunk is consumed, possibly emptying the bin it > > lies in, split or merged, then returned either to the same or a > > different bin. By saying this operation is non-atomic, I mean other > > threads see the intermediate state where the large free chunk has been > > consumed but not yet returned, and when this happens, they > > unnecessarily split another free chunk or expand the heap rather than > > waiting for it to be returned. This yields bad, potentially > > catastrophic, fragmentation. 
> > > > The malloc side already partially mitigates this with the "pretrim" > > function, which is used whenever the leftover part after splitting > > would remain in the same bin it was already in. In particular his > > covers splitting small allocations off a large "top of heap" zone, but > > it doesn't help with splitting small chunks. And the free side is much > > worse -- large free zones momentarily disappear every time something > > adjacent to them is freed. > > > > In order to overcome this fully while maintaining the fine-grained > > locking, we'd need the ability to hold multiple locks concurrently, > > which would require a protocol for lock acquisition order to avoid > > deadlock. But there may be cheap mitigations that could be made in > > free like on the malloc side. For instance it might be possible to > > make a simpler protocol whereby we could always hold the lock for bin > > 63, thereby avoiding exposing a state where large free zones are > > unavailable. If there's a way to make this work, moving the binmap > > masking for newly-emptied bins from unbin() to unlock_bin() would > > ensure that malloc doesn't "see" a "temporary empty" state. > > > > > > In the long term, I think the whole malloc design we're using now is > > wrong, and should be replaced, but until the time comes to actually do > > that, it may make sense to try to improve some of the worst > > properties, or even to reduce or eliminate the granularity of the > > locking if it proves not to be very valuable on real-world loads. > > Hi Rich, > > Have you considered importing jemalloc? No, and that's not happening. It's got serious bloat problems, problems with undermining ASLR, and is optimized pretty much only for being as fast as possible without caring how much memory you use. That's nothing like the goals of musl. 
With the recent changes to allow interposing malloc, projects that want to use it now can, but it definitely shouldn't be imposed on the vast majority of programs that will only hurt (grow much bigger in memory) from using it. Aside from that, importing nontrivial code is pretty much off the table -- every time we've done it it's been almost entirely a mistake (TRE being the worst), since it ends up being lots of code that nobody understands with bugs and UB and often vulns. Just this release cycle, blatant nonsensical special cases in the OpenBSD complex math code bit us. > It’s been battle tested for 12 years in FreeBSD and in many other > apps. Given it’s bundled and used in MariaDB which is undoubtedly a > demanding user and presumably Monty approved the choice of > allocator, so it shouldn’t be a bad choice. I’d trust the allocator > choice of a well respected database architect. > > From looking at the Lockless, Inc benchmarks jemalloc has good These benchmarks were specifically tuned for thresholds that make glibc's ptmalloc look bad. They're not a credible source. While Lockless Inc has some nice documentation on multithreading topics, I'm a lot more skeptical of their product and performance claims about other products... > performance from very small to very large allocations. From the > Lockless Inc memory allocator comparisons: “The disadvantage of the > jemalloc allocator is its memory usage. It uses power-of-two sized > bins, which leads to a greatly increased memory footprint compared > to other allocators. This can affect real-world performance due to > excess cache and TLB misses.” Yes, see above. Rich ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: musl riscv port 2018-04-22 21:40 ` Michael Clark 2018-04-23 2:09 ` Rich Felker @ 2018-04-23 2:51 ` Rich Felker 1 sibling, 0 replies; 12+ messages in thread From: Rich Felker @ 2018-04-23 2:51 UTC (permalink / raw) To: musl On Mon, Apr 23, 2018 at 09:40:08AM +1200, Michael Clark wrote: > BTW - I owe you an update on the riscv musl port. I’ll post one > soon.... I’ve been busy on the QEMU RISC-V port, whose ‘virt’ board > has recently been used by Fedora and Debian distro porters for > riscv64 arch support (glibc based distros). Stefan O’Rear got MTTCG > working on qemu/target/riscv so we can run SMP Linux in QEMU with > reasonable performance. Debian for example now has QEMU RISC-V > builders for the riscv64 arch. > > - https://github.com/riscv/riscv-qemu/wiki > > It will be nice to get the riscv musl support upstream. The major > issue at present is ELF thread local storage (pthreads are working > but GCC’s __thread is not). The RISC-V TLS model is not documented > so it requires some reverse engineering of the riscv support in > GCC/binutils/glibc. I haven’t been able to isolate the issue and > could benefit from some help by someone more knowledgable in that > area (ELF TLS and architecture specific TLS models). If you have specific questions here I can probably help you find answers quickly. It probably suffices just to compile and link a trivial test program using a single TLS variable, and see what offset from the thread pointer the compiler accesses it at. That should tell you if TLS is above or below the thread pointer, and if there's any offset that needs to be applied, what it is. Once it's all working, the big thing that might take some additional effort to get it merged is making sure contributions have been documented. I know a bunch of different people worked on this on and off and I want to make sure they all get credited properly. Rich ^ permalink raw reply [flat|nested] 12+ messages in thread
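The trivial test program Rich suggests can be as small as this; the variable and function names are arbitrary. Compiling it with `-S` and reading the generated assembly shows how the TLS variable is addressed relative to the thread pointer.

```c
/* Compile with e.g. `cc -S -O1 tls_probe.c` (file name arbitrary) and
 * inspect the assembly to see what offset from the thread pointer the
 * compiler uses to reach tls_var. That reveals whether TLS sits above
 * or below the thread pointer and what bias, if any, is applied. */
__thread int tls_var = 42;

int read_tls(void)
{
    return tls_var;
}
```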
end of thread, other threads:[~2018-04-23 19:47 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-04-20 20:09 What's wrong with musl's malloc Rich Felker 2018-04-21 5:28 ` Rich Felker 2018-04-21 5:52 ` Rich Felker 2018-04-22 19:34 ` Markus Wichmann 2018-04-23 2:00 ` Rich Felker 2018-04-23 19:02 ` Markus Wichmann 2018-04-23 19:47 ` Rich Felker 2018-04-23 6:50 ` A. Wilcox 2018-04-23 16:43 ` Rich Felker 2018-04-22 21:40 ` Michael Clark 2018-04-23 2:09 ` Rich Felker 2018-04-23 2:51 ` musl riscv port Rich Felker
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).