mailing list of musl libc
* malloc implementation survey: omalloc
@ 2018-07-29 19:26 Markus Wichmann
  2018-07-30 15:09 ` Christopher Friedt
  2018-07-31  0:47 ` Rich Felker
  0 siblings, 2 replies; 12+ messages in thread
From: Markus Wichmann @ 2018-07-29 19:26 UTC (permalink / raw)
  To: musl

Hi all,

we discussed rewriting malloc() a while back because, as I recall, Rich
wasn't satisfied with the internal storage the current system uses
(i.e. metadata is stored alongside the returned pointer), as well as with
some corner cases of lock contention, although fine-grained locking is a
nice feature in itself.

I therefore had a look at existing malloc() algorithms, the rationale
being that I thought malloc() to be a solved problem, so we only have to
find the right solution.

As it turns out, it appears Doug Lea was simply too successful: Many
allocators follow his pattern in one way or another. But some systems
buck the trend.

So today I found omalloc, the allocator OpenBSD uses, in this nice
repository:

https://github.com/emeryberger/Malloc-Implementations

Among other implementations, of course.

Well, as written, the omalloc implementation there fulfills one of the
criteria laid out by Rich, namely external storage (metadata saved away
from the returned memory). Locking, however, is monolithic; maybe that is
something we can work on ourselves...

So I thought I'd describe the algorithm as an overview and we can have a
discussion on whether to pursue a similar algorithm ourselves.

1. Data Structures
==================

A /region/ is a large, i.e. multiple-of-page-sized chunk of memory. A
region info struct knows the region's location and size.

A /chunk/ is a small power-of-two-sized piece of memory, at least 16
bytes in size. I have no idea where that minimum came from.

A /chunk info struct/ contains the location of a chunk page (chunks are
at most half a page in size), the size of each chunk in the page, the
number of free chunks, a bitmap of free chunks and a pointer to the next
like-sized chunk info struct.

Global data structures are: A linked list of free chunk info structs, a
linked list of chunk info structs with free elements for each possible
size between 16 bytes and half a page, and a cache of 256 free regions.

The most important data structure is a hash table. Each entry consists
of two machine words, the first of which is the key, sans its lowest
log2(PAGE_SIZE) bits. In essence, this means an entry has a key of 20 or
52 bits, one value of 12 bits and another value of one machine word.
Adjust by one where appropriate for 8K archs. The hash table contains 512
entries after initialization and can only grow. The aim is to keep the
load factor below 3/4 (i.e. if the number of free entries falls below a
quarter of the total number of entries, the table is grown by doubling
the total number of entries).

I'll call the three parts page number, chunk size, and region size.
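
To make the layout concrete, here is a rough C sketch of how these
structures could look. The names and exact field widths are my own guesses
for illustration, not omalloc's actual definitions; assume a 4K page size.

    #include <stddef.h>
    #include <stdint.h>

    #define MALLOC_PAGESIZE 4096
    #define MALLOC_MINSIZE    16                     /* smallest chunk */
    #define MALLOC_MAXCHUNK (MALLOC_PAGESIZE / 2)    /* chunks are at most half a page */

    /* region info: a multiple-of-page-sized mapping and its size */
    struct region_info {
        void  *addr;
        size_t size;
    };

    /* chunk info: describes one page carved into equally sized chunks */
    struct chunk_info {
        struct chunk_info *next;  /* next header with free chunks of this size */
        void   *page;             /* the page the chunks live in */
        unsigned short size;      /* size of each chunk in bytes */
        unsigned short free;      /* number of currently free chunks */
        unsigned short total;     /* total number of chunks in the page */
        /* free bitmap, 1 = free; sized for the worst case of 16-byte chunks */
        unsigned short bits[MALLOC_PAGESIZE / MALLOC_MINSIZE / 16];
    };

    /* hash table entry: two machine words.  The first holds the page number,
     * with the low log2(PAGE_SIZE) bits reused for the chunk size field
     * (0 means "this entry describes a region").  The second holds the
     * region size, or a pointer to the chunk_info header for a chunk page. */
    struct ht_entry {
        uintptr_t page_and_chunksize;
        uintptr_t size_or_info;
    };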

2. Algorithms
=============
a. Allocation
-------------

Allocation is split in two cases: Large and small.

For large allocations (> 1/2 page) we only allocate a region to service
the request. This means the size is rounded up to the next multiple of
the page size. Then we search our region cache for a region large enough.
If we find one that fits exactly, we return it. If we find a larger one,
but none that fits exactly, we return the end of that region and adjust
the saved size. If we find none of these, we escalate to the OS.

If we did manage to find a region, we save in the hash table an entry
with page number equal to the page number of the returned page, chunk
size set to zero and region size set to the allocation size, and return
the allocated page(s).

For small allocations (<= 1/2 page) we need to allocate a chunk. For
this, we round the request up to the next power of two, but at least to
16.

Then we check whether we have a chunk info header in the slot for that
size. If not, we need to allocate one by taking a header from the list of
free chunk info headers. If that list is also empty, we allocate a page
and fill it with free chunk info headers; they are constant-sized.

With a chunk info header in hand, we allocate a page for the chunks, then
fill in the header, most importantly setting the bitmap to 1 for every
valid chunk. Then we save in the global hash table an entry with the page
number of the page containing the memory we'll actually return, the
binary log of the chunk size as the chunk size field, and a pointer to
the chunk info header in place of the region size.

Allocating a chunk then means finding a set bit, clearing it, and
returning the corresponding pointer into the page the header points to.
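
Continuing the sketch from above (again with my own hypothetical names),
that bitmap search boils down to something like this:

    /* hypothetical sketch: carve one chunk out of a chunk page */
    static void *alloc_chunk(struct chunk_info *ci)
    {
        size_t nwords = ci->total / 16 + (ci->total % 16 != 0);
        for (size_t w = 0; w < nwords; w++) {
            if (!ci->bits[w]) continue;              /* no free chunk in this word */
            int b = 0;
            while (!(ci->bits[w] & (1u << b))) b++;  /* find a set (free) bit */
            ci->bits[w] &= ~(1u << b);               /* mark it as allocated */
            ci->free--;
            return (char *)ci->page + ((size_t)w * 16 + b) * ci->size;
        }
        return 0;  /* callers only pick headers with free > 0, so not reached */
    }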

b. Freeing
----------

The page number of the pointer is looked up in the hash table. If not
found, the pointer is invalid.

If found, then we need to look at the chunk size portion of the hash
table entry. If that is zero then we need to free a region, else we need
to free a chunk.

To free a region, we remove the entry from the hash table, then add the
region to our cache using a very weird heuristic. In any case, any entry
thrown out of the cache in the process, as well as the current region
should it not be added to the cache, is unmapped.

To free a chunk, we set the corresponding bit and increase the counter
of free chunks in the chunk info header, whose address we have from the
repurposed region size part of the hash table entry. If this sets the
number of free chunks to one, we add the chunk info header to the list
of chunk info headers with free chunks for the given size. Also, if now
every chunk in the page is free, we remove the chunk info header from
the bucket list, add it to the list of free chunk info headers and unmap
the page.
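
A compact sketch of that path, with the list operations and the unmap left
as comments since their exact form depends on the surrounding bookkeeping
(again my naming, not omalloc's):

    /* hypothetical sketch: give one chunk back to its page */
    static void free_chunk(struct chunk_info *ci, void *p)
    {
        size_t idx = (size_t)((char *)p - (char *)ci->page) / ci->size;
        ci->bits[idx / 16] |= 1u << (idx % 16);   /* mark the chunk free again */
        if (++ci->free == 1) {
            /* page had been full: put the header back on the bucket list
             * for its chunk size so allocations can find it again */
        }
        if (ci->free == ci->total) {
            /* page is now entirely free: unlink the header from the bucket
             * list, munmap(ci->page, MALLOC_PAGESIZE), and put the header
             * on the list of free chunk info headers */
        }
    }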

c. Reallocation
---------------

If both the old and the new allocation size are larger than half a page,
we try to remap. Otherwise we have little choice but to allocate a new
block and copy everything.

d. Memory donation
------------------

Entire pages can be added to the cache of free regions. Smaller blocks
can be added to the list of free chunk info headers. We have no use for
even smaller blocks.

3. Critique
===========

As written in the above repo, omalloc takes a global lock around each
operation. This is probably unacceptable. Also, while the implementation
offers a high degree of customizability, it issues syscall upon syscall,
most of which can probably be removed.

The hash table uses linear probing to resolve collisions, albeit
backwards. According to Wikipedia this encourages primary clustering,
i.e. used entries tend to clump together. Also, the hash table growth
algorithm isn't real-time; it allocates the new table and completely
re-keys all entries from the old one. This means that in the worst case
an allocation has unbounded run time, though this amortizes.
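
For reference, a lookup with backwards linear probing comes down to
roughly the following sketch; the hash function is a placeholder of my
own, and it reuses the hypothetical ht_entry layout from section 1:

    /* hypothetical sketch: find the entry for a pointer, probing backwards.
     * mask is the table size minus one; the table size is a power of two. */
    static struct ht_entry *ht_find(struct ht_entry *tbl, size_t mask, void *p)
    {
        uintptr_t page = (uintptr_t)p & ~(uintptr_t)(MALLOC_PAGESIZE - 1);
        size_t i = (size_t)(page / MALLOC_PAGESIZE * 0x9e3779b9u) & mask;  /* placeholder hash */
        while (tbl[i].page_and_chunksize) {
            if ((tbl[i].page_and_chunksize & ~(uintptr_t)(MALLOC_PAGESIZE - 1)) == page)
                return &tbl[i];
            i = (i - 1) & mask;   /* linear probing, backwards */
        }
        return 0;                 /* not found: the pointer is invalid */
    }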

Let's talk parallelism: The hash table can be secured with a lock. Both
allocation and deallocation need to write to the table, so a mutex will
have to do. As for the chunk info header lists, they could all be
implemented in a lock-free way. Unfortunately that means a chunk info
header can disappear from one list for a moment and re-appear in another
a moment later, which can lead to a thread seeing a list as empty when it
actually isn't. Protecting all lists with the same lock would solve that
issue but reduce parallelism further. Finally, the free page cache...
could be protected with yet another lock. We also need a lock for each
chunk info header, unless we implement the bitmap in a lock-free way, but
then the bitmap and the "number of free chunks" counter are no longer
guaranteed to be consistent. Decisions, decisions...
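
To illustrate the consistency problem, here is a C11-atomics sketch of my
own (not omalloc code) in which the bit and the free counter are updated
by two separate atomic operations, so another thread can observe them out
of step:

    #include <stdatomic.h>

    /* hypothetical sketch: claim bit b of a chunk bitmap without a lock */
    static int try_claim(atomic_uint *bits, atomic_uint *nfree, int b)
    {
        unsigned old = atomic_fetch_and(bits, ~(1u << b));  /* clear the bit */
        if (!(old & (1u << b)))
            return 0;                /* somebody else took this chunk first */
        atomic_fetch_sub(nfree, 1);  /* the counter lags behind the bitmap here */
        return 1;
    }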

Finally, the run time. All external-storage solutions require iterating
over structures. At the moment, in the single-threaded case, our malloc
merely has to remove a single element from a list. omalloc, on the other
hand, always has to iterate over a bitmap for a small allocation to find
a bit that is set, and for a large allocation always has to search the
free page cache, or make a syscall, or both in the expensive case.

So, is this a direction to pursue, or should we look further?

Ciao,
Markus



* Re: malloc implementation survey: omalloc
  2018-07-29 19:26 malloc implementation survey: omalloc Markus Wichmann
@ 2018-07-30 15:09 ` Christopher Friedt
  2018-07-30 19:14   ` Christopher Friedt
  2018-07-31  7:49   ` Markus Wichmann
  2018-07-31  0:47 ` Rich Felker
  1 sibling, 2 replies; 12+ messages in thread
From: Christopher Friedt @ 2018-07-30 15:09 UTC (permalink / raw)
  To: musl


I haven't looked at omalloc, but I wrote a dead-simple buddy allocator
for Bionic some time ago, with support for malloc(3), calloc(3),
realloc(3), free(3), and posix_memalign(3). It would obviously also
support aligned_alloc(3) for C11.

It ran well on everything from an ARM Cortex-M0 to an Intel Core i7. The
code compiled to 504 .text bytes on the Cortex-M0, IIRC.

I originally wrote it for use on the kernel side of an RTOS, but it could
easily be extended to make a syscall when a process runs out of RAM.

Obviously, a shortcoming is that the memory blocks must be PoT sized and
there is the potential for fragmentation. Otherwise, though, the
meta-information is intrinsic to the pointer, which obviates the need for
external storage.

Every call takes approximately the same average time, which is about
log2(depth) for a binary tree.

Objectively speaking, though, the choice of a new malloc implementation
should be based on algorithmic complexity, but also on size and speed
metrics.

C



* Re: malloc implementation survey: omalloc
  2018-07-30 15:09 ` Christopher Friedt
@ 2018-07-30 19:14   ` Christopher Friedt
  2018-07-30 19:22     ` Christopher Friedt
  2018-07-31  7:49   ` Markus Wichmann
  1 sibling, 1 reply; 12+ messages in thread
From: Christopher Friedt @ 2018-07-30 19:14 UTC (permalink / raw)
  To: musl


On Mon, Jul 30, 2018, 11:09 AM Christopher Friedt, <chrisfriedt@gmail.com>
wrote:

> otential for fragmentation. Otherwise though, the meta-information is
> intrinsic within the pointer, which obliviates the need for external
> storage.
>

I should clarify.

The meta-info in my buddy allocator is only intrinsic if the size of RAM
and the offset are constant. Additionally, there is a bitmap of constant
size in .data [1].

Of course, in userspace, there would need to be 1 void*, 1 size_t, and a
bitmap per PoT chunk of memory to manage.

For a 4096-byte page, on a 32-bit system, that would be 4 + 4 + 256 = 304
bytes of overhead.


C

[1]
The size of the bitmap (in bits) is given directly by the geometric series:

F(n) = sum_{k=0}^{n-1} r^k = (1 - r^n)/(1 - r)

r is 2.

If 2^U is the memory size, and 2^L is the smallest granularity of
allocation, then n = U - L + 1, and the number of bits in the bitmap is F(
n ).

E.g. 64 bytes of memory, min 8-byte allocation, then F( 4 ) = 15 bits (2
bytes) are needed for the bitmap.
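
In code form, the same computation is just (a sketch, with r = 2 as
above):

    #include <stddef.h>

    /* bits in the buddy bitmap: F(n) = (1 - 2^n)/(1 - 2) = 2^n - 1 */
    static size_t buddy_bitmap_bits(unsigned U, unsigned L)
    {
        unsigned n = U - L + 1;
        return ((size_t)1 << n) - 1;
    }
    /* buddy_bitmap_bits(6, 3) == 15: 64 bytes of memory, 8-byte granularity */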



* Re: malloc implementation survey: omalloc
  2018-07-30 19:14   ` Christopher Friedt
@ 2018-07-30 19:22     ` Christopher Friedt
  2018-07-30 19:34       ` Christopher Friedt
  0 siblings, 1 reply; 12+ messages in thread
From: Christopher Friedt @ 2018-07-30 19:22 UTC (permalink / raw)
  To: musl


On Mon, Jul 30, 2018, 3:14 PM Christopher Friedt, <chrisfriedt@gmail.com>
wrote:

> For a 4096-byte page, on a 32-bit system, that would be 4 + 4 + 256 = 304
> bytes of overhead.
>

Sorry, today is a bad day for math.

For 4096-byte pages on a 32-bit system, 2^U = 4096, so U = 12 and
n = 12 - 3 + 1 = 10. The overhead is then 4 + 4 + 128 = 136 bytes.

>



* Re: malloc implementation survey: omalloc
  2018-07-30 19:22     ` Christopher Friedt
@ 2018-07-30 19:34       ` Christopher Friedt
  0 siblings, 0 replies; 12+ messages in thread
From: Christopher Friedt @ 2018-07-30 19:34 UTC (permalink / raw)
  To: musl


The buddy malloc overhead approaches 3.125% asymptotically.



* Re: malloc implementation survey: omalloc
  2018-07-29 19:26 malloc implementation survey: omalloc Markus Wichmann
  2018-07-30 15:09 ` Christopher Friedt
@ 2018-07-31  0:47 ` Rich Felker
  2018-07-31  1:44   ` Rich Felker
  1 sibling, 1 reply; 12+ messages in thread
From: Rich Felker @ 2018-07-31  0:47 UTC (permalink / raw)
  To: musl

On Sun, Jul 29, 2018 at 09:26:18PM +0200, Markus Wichmann wrote:
> Hi all,
> 
> we discussed rewriting malloc() a while back, because, as I recall, Rich
> wasn't satisfied with the internal storage the current system is using
> (i.e. metadata is stored with the returned pointer) as well as some
> corner cases on lock contention, although fine grained locking is a nice
> feature in itself.
> 
> I therefore had a look at existing malloc() algorithms, the rationale
> being that I thought malloc() to be a solved problem, so we only have to
> find the right solution.
> 
> As it turns out, it appears Doug Lea was simply too successful: Many
> allocators follow his pattern in one way or another. But some systems
> buck the trend.
> 
> So today I found omalloc, the allocator OpenBSD uses, in this nice
> repository:
> 
> https://github.com/emeryberger/Malloc-Implementations

I haven't looked at it in a couple years, but last I did, the OpenBSD
allocator was not practical. It used individual mmaps for even
moderately large allocations, and used guard pages with each, which
with the default Linux VMA limits puts an upper bound of 32768 on the
number of non-tiny (roughly, larger-than-page IIRC) allocations.

It does have much stronger hardening against overflows than musl's
current malloc or any other allocator, but it seemed inferior in all
other ways.

I'll read the rest of your description later and see if there's
anything new that's interesting.

Rich



* Re: malloc implementation survey: omalloc
  2018-07-31  0:47 ` Rich Felker
@ 2018-07-31  1:44   ` Rich Felker
  2018-08-30 18:16     ` A. Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Rich Felker @ 2018-07-31  1:44 UTC (permalink / raw)
  To: musl

On Mon, Jul 30, 2018 at 08:47:28PM -0400, Rich Felker wrote:
> On Sun, Jul 29, 2018 at 09:26:18PM +0200, Markus Wichmann wrote:
> > Hi all,
> > 
> > we discussed rewriting malloc() a while back, because, as I recall, Rich
> > wasn't satisfied with the internal storage the current system is using
> > (i.e. metadata is stored with the returned pointer) as well as some
> > corner cases on lock contention, although fine grained locking is a nice
> > feature in itself.
> > 
> > I therefore had a look at existing malloc() algorithms, the rationale
> > being that I thought malloc() to be a solved problem, so we only have to
> > find the right solution.
> > 
> > As it turns out, it appears Doug Lea was simply too successful: Many
> > allocators follow his pattern in one way or another. But some systems
> > buck the trend.
> > 
> > So today I found omalloc, the allocator OpenBSD uses, in this nice
> > repository:
> > 
> > https://github.com/emeryberger/Malloc-Implementations
> 
> I haven't looked at it in a couple years, but last I did, the OpenBSD
> allocator was not practical. It used individual mmaps for even
> moderately large allocations, and used guard pages with each, which
> with the default Linux VMA limits puts an upper bound of 32768 on the
> number of non-tiny (roughly, larger-than-page IIRC) allocations.
> 
> It does have much stronger hardening against overflows than musl's
> current malloc or any other allocator, but it seemed inferior in all
> other ways.
> 
> I'll read the rest of your description later and see if there's
> anything new that's interesting.

I've now read your description and it seems to agree with what I
remember, except maybe replacing "individual mmaps" with some reuse of
regions, and the guard page thing likely being optional (or no longer
used because it was too inefficient?).

Some ideas like treating different size ranges differently might be
worthwhile (this should make a big impact limiting worst-case
fragmentation). For instance I could see having all allocations below
a certain size have to come from "pages" of 512 bytes, broken into
16-byte runs with a 32-bit atomic bitmap of allocated runs, and
likewise for a few more size ranges. This is the kind of thing I've
been thinking about for a while, but I haven't come up with any good
approaches to estimating the cons.
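
As a concrete sketch of the 512-byte idea (my reading of it, not a
worked-out design; __builtin_ctz is the GCC/Clang builtin): one 32-bit
word describes a 512-byte group of 16-byte runs, and a CAS loop claims a
free run.

    #include <stdatomic.h>
    #include <stdint.h>

    /* hypothetical sketch: one 32-bit word covers a 512-byte group,
     * i.e. 32 runs of 16 bytes, one bit per run, 1 = free */
    struct group {
        _Atomic uint32_t freemap;   /* kept outside the group's memory itself */
        unsigned char *mem;         /* the 512 bytes this word describes */
    };

    static void *group_alloc(struct group *g)
    {
        uint32_t old = atomic_load(&g->freemap);
        while (old) {
            int b = __builtin_ctz(old);   /* lowest free run */
            if (atomic_compare_exchange_weak(&g->freemap, &old,
                                             old & ~((uint32_t)1 << b)))
                return g->mem + 16 * b;
            /* CAS failed: old has been reloaded, try again */
        }
        return 0;   /* group is full; try another group or map a new one */
    }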

One thing that would probably be really useful is empirical data on
what range of sizes "need to be fast" for bloated OO stuff that does
rapid malloc/free cycles. Conceptually it should only be small sizes,
assuming the whole allocated space is actually used, since beyond a
certain size, actual use of the allocated memory (loads/stores) will
dominate the time spent allocating and freeing it, but the cutoff
isn't clear.

Rich



* Re: malloc implementation survey: omalloc
  2018-07-30 15:09 ` Christopher Friedt
  2018-07-30 19:14   ` Christopher Friedt
@ 2018-07-31  7:49   ` Markus Wichmann
  2018-07-31 14:25     ` Rich Felker
  2018-07-31 14:49     ` Christopher Friedt
  1 sibling, 2 replies; 12+ messages in thread
From: Markus Wichmann @ 2018-07-31  7:49 UTC (permalink / raw)
  To: musl

On Mon, Jul 30, 2018 at 11:09:05AM -0400, Christopher Friedt wrote:
> I haven't looked at omalloc, but I wrote a deadly simple buddy allocator
> for Bionic some time ago with support for malloc(3), calloc(3), realloc(3),
> free(3), and posix_memalign(3). It would obviously also support
> aligned_alloc(3) for C11.
> 

Most allocators can meet these requirements. omalloc is capable of
delivering naturally aligned chunks of memory for everything up to a
page. Beyond that, we'd need to trick Linux's virtual memory allocator
into giving us the alignment we need, typically via the old "allocate
size+align bytes" trick (or pages, in this case).
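
For reference, the over-allocate-and-trim trick looks roughly like this
(a sketch, assuming len and align are already multiples of the page size
and that overflow of len + align has been checked elsewhere):

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* hypothetical sketch: return len bytes aligned to align (both page multiples) */
    static void *mmap_aligned(size_t len, size_t align)
    {
        size_t span = len + align;   /* slack so an aligned block must fit */
        unsigned char *p = mmap(0, span, PROT_READ|PROT_WRITE,
                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 0;
        uintptr_t a = ((uintptr_t)p + align - 1) & ~(uintptr_t)(align - 1);
        if (a - (uintptr_t)p)                    /* trim the unaligned head */
            munmap(p, a - (uintptr_t)p);
        if ((uintptr_t)p + span - (a + len))     /* trim the unused tail */
            munmap((void *)(a + len), (uintptr_t)p + span - (a + len));
        return (void *)a;
    }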

> It ran well on everything from a arm cortex-m0 to an intel core i7. The
> code compiled to 504 .text bytes, on cortex-m0, iirc.
> 
> I wrote it originally for using on the kernel-side of an rtos, but it could
> easily be extended to make a syscall when a process runs out of ram.
> 
> Obviously, a shortcoming is that the memory blocks must be PoT and there is
> the potential for fragmentation. Otherwise though, the meta-information is
> intrinsic within the pointer, which obliviates the need for external
> storage.
> 

That does sound interesting, but I don't really follow. The meta-info
which we need to save is the length of the object returned. How can you
encode that in the pointer without any external storage? Using internal
storage? Yeah, that is where we are right now, and that means there's
only one buffer overflow between us and oblivion at any given time.

Or do you essentially have a large block of memory, with one area for
the 16-byte objects, one for the 32-byte ones, &c. But then what do you
do if your given area runs out and the application still wants more?

Just so we don't talk past each other: External storage in this case
means the meta-data is saved away from the returned memory; ideally,
away from all user-visible memory, so that only a rampaging memory
corruption issue can destroy it. Internal storage means the information
is saved close to the returned memory.

> Every call takes approximately the same average time which is about log2(
> depth ) for binary trees.
> 
> Objectively speaking though, in terms of choosing a new malloc
> implementation, the best one should be based on algorithmic complexity, but
> also size and speed metrics.
> 
> C

That much is clear as well. As I understood it, though, Rich didn't
merely want to improve the existing malloc; he wanted to find a new
implementation, which had best have some desirable properties, namely
resilience against buffer overflows as well as fine-grained locking for
good multi-thread performance. omalloc has one of these, and speed can be
optimized.

Reading omalloc is complicated by the tons of customizability in the
algorithm. For some reason, they thought it would be a good idea to have
a file, /etc/malloc.conf, be a symlink that does not point to a file,
but rather has the configuration string as its target. Ugly. But it
means they can read it in a single syscall.

I still have no idea how many calls to the RNG are for security, and how
many are necessary to the algorithm. For instance, if a chunk has to be
handed out, they don't return the first free chunk but a random free
chunk: draw a random number between 0 and the number of free chunks and
skip that many free bits in the bitmap. Now, is that done to reduce the
predictability of the returned address, to reduce memory fragmentation,
or for some other reason? I really can't tell.
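
Roughly, the random pick looks like this sketch, reusing the hypothetical
chunk_info layout from the earlier sketch, with rng() standing in for
omalloc's random source:

    unsigned rng(void);   /* hypothetical stand-in for the random source */

    /* hypothetical sketch: return the r-th free chunk instead of the first;
     * the caller guarantees ci->free > 0, so the loop terminates */
    static void *alloc_chunk_random(struct chunk_info *ci)
    {
        unsigned r = rng() % ci->free;   /* which free chunk to take */
        for (unsigned idx = 0; ; idx++) {
            if (!(ci->bits[idx / 16] & (1u << (idx % 16))))
                continue;                /* chunk is allocated, skip it */
            if (r-- == 0) {
                ci->bits[idx / 16] &= ~(1u << (idx % 16));
                ci->free--;
                return (char *)ci->page + (size_t)idx * ci->size;
            }
        }
    }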

Now, the applications that tax the allocator more than any others are OO
applications that constantly allocate and free small objects, Firefox and
the like. But those also use multiple threads, so having an arena
allocator would be nice, and in fact, including libtcmalloc did speed up
Firefox a lot. libtcmalloc essentially runs dlmalloc in a thread-local
version. We could modify omalloc like this as well. Not sure how, though.
Have multiple hash tables? But how do we locate the one that contains the
object we're supposed to be freeing? Iterate over all of them? That's
going to hurt run time. And how do we extend the list of hash tables?
And, more importantly, how do we shrink it when a thread exits?

What I am getting at is this: We can't just benchmark a few algorithms
and stick with the fastest one. Nor is a simple algorithmic analysis
enough, since often enough, simple lists can be replaced with more
complicated containers yielding better performance for some loads. It's
complicated...

Ciao,
Markus



* Re: malloc implementation survey: omalloc
  2018-07-31  7:49   ` Markus Wichmann
@ 2018-07-31 14:25     ` Rich Felker
  2018-07-31 14:49     ` Christopher Friedt
  1 sibling, 0 replies; 12+ messages in thread
From: Rich Felker @ 2018-07-31 14:25 UTC (permalink / raw)
  To: musl

On Tue, Jul 31, 2018 at 09:49:30AM +0200, Markus Wichmann wrote:
> > It ran well on everything from a arm cortex-m0 to an intel core i7. The
> > code compiled to 504 .text bytes, on cortex-m0, iirc.
> > 
> > I wrote it originally for using on the kernel-side of an rtos, but it could
> > easily be extended to make a syscall when a process runs out of ram.
> > 
> > Obviously, a shortcoming is that the memory blocks must be PoT and there is
> > the potential for fragmentation. Otherwise though, the meta-information is
> > intrinsic within the pointer, which obliviates the need for external
> > storage.
> > 
> 
> That does sound interesting, but I don't really follow. The meta-info
> which we need to save is the length of the object returned. How can you
> encode that in the pointer without any external storage? Using internal
> storage? Yeah, that is where we are right now, and that means there's
> only one buffer overflow between us and oblivion at any given time.

This isn't the case with a minimally reasonable design, including
musl's current one. The chunk header and footer must match; if they
don't, it's proof of UB and the allocator terminates the program. The
current implementation is not hardened against replacing real headers
with apparently valid, matching ones, but this hardening can be
achieved by xor'ing them with a canary. One could even make an array
of canaries, indexed by some bits or a hash of the pointer, so that
stealing the canary for one chunk doesn't let you forge another one.
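
A sketch of that idea (the field layout, canary setup and naming here are
made up for illustration; musl's real chunk header differs):

    #include <stdint.h>
    #include <stdlib.h>

    /* hypothetical sketch: the in-band size words are XOR'd with a canary so
     * that a plain overwrite cannot produce a matching header/footer pair */
    static uintptr_t canary[16];   /* filled with random values at startup */

    static uintptr_t key_for(void *base)
    {
        return canary[((uintptr_t)base >> 4) & 15];   /* canary picked by pointer bits */
    }

    /* header sits just below the chunk, footer just past its stated end */
    static void check_chunk(unsigned char *base)
    {
        uintptr_t k = key_for(base);
        size_t size = ((uintptr_t *)(void *)base)[-1] ^ k;  /* decode the stored size */
        uintptr_t ftr = *(uintptr_t *)(void *)(base + size);
        if ((ftr ^ k) != size)   /* footer must decode to the same size */
            abort();             /* proof of corruption: terminate */
    }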

In many ways, internal storage of metadata is *more hardened* because
the allocator is able to catch overflows this way. With no internal
data for the allocator to consistency-check, attackers can just target
application data in adjacent chunks and don't have to worry that the
allocator will catch the attack and terminate.

> What I am getting at is this: We can't just benchmark a few algorithms
> and stick with the fastest one. Nor is a simple algorithmic analysis
> enough, since often enough, simple lists can be replaced with more
> complicated containers yielding better performance for some loads. It's
> complicated...

Because being fast and not wasting memory or producing catastrophic
fragmentation are usually mutually exclusive. Eliminating runaway
waste and fragmentation without a major performance regression is the
primary goal. Being faster is a secondary goal.

I know the "main source" of bad behavior in the current design can be
eliminated by getting rid of the fine-grained locks and switching to a
global lock. It can probably also be solved by holding multiple locks
at once and a complex lock order protocol, but it's not clear that
wouldn't be slower than just using a global lock. Major new direction
in design has a better chance of finding something where locality of
access needed for allocator operations is not a significant tradeoff
with fragmentation or over-allocation.

Rich



* Re: malloc implementation survey: omalloc
  2018-07-31  7:49   ` Markus Wichmann
  2018-07-31 14:25     ` Rich Felker
@ 2018-07-31 14:49     ` Christopher Friedt
  1 sibling, 0 replies; 12+ messages in thread
From: Christopher Friedt @ 2018-07-31 14:49 UTC (permalink / raw)
  To: musl


On Tue, Jul 31, 2018, 3:49 AM Markus Wichmann, <nullplan@gmx.net> wrote:

> On Mon, Jul 30, 2018 at 11:09:05AM -0400, Christopher Friedt wrote:
> > I wrote it originally for using on the kernel-side of an rtos, but it
> could
> > easily be extended to make a syscall when a process runs out of ram.
> >
> > Obviously, a shortcoming is that the memory blocks must be PoT and there
> is
> > the potential for fragmentation. Otherwise though, the meta-information
> is
> > intrinsic within the pointer, which obliviates the need for external
> > storage.
> >
>
> That does sound interesting, but I don't really follow. The meta-info
> which we need to save is the length of the object returned. How can you
> encode that in the pointer without any external storage? Using internal
> storage? Yeah, that is where we are right now, and that means there's
> only one buffer overflow between us and oblivion at any given time.
>
> Or do you essentially have a large block of memory, with one area for
> the 16-byte objects, one for the 32-byte ones, &c. But then what do you
> do if your given area runs out and the application still wants more?


https://en.wikipedia.org/wiki/Buddy_memory_allocation

The 'bins' do overlap (e.g. a 64 byte chunk can also be used as 2 32-byte
chunks or "buddies"), but their size (when using a bitmap for metadata) is
encoded by the position of the bit in the map. This also encodes if a
parent (or child) of a specific memory region is already in use or free.
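
For reference, the index arithmetic that makes the metadata intrinsic is
just the usual implicit-tree math; the following is a sketch along those
lines, not the actual Bionic code:

    #include <stddef.h>

    /* hypothetical sketch: implicit buddy tree stored as a flat bitmap.
     * Node 0 is the whole arena; level k splits it into 2^k blocks. */
    static size_t parent(size_t i)     { return (i - 1) / 2; }
    static size_t left_child(size_t i) { return 2 * i + 1; }
    static size_t buddy(size_t i)      { return ((i - 1) ^ 1) + 1; }

    /* the block size follows from the bit index alone */
    static size_t block_size(size_t arena_size, size_t i)
    {
        size_t level = 0, first = 0;
        while (i >= first + ((size_t)1 << level)) {   /* find the level of node i */
            first += (size_t)1 << level;
            level++;
        }
        return arena_size >> level;
    }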

My original implementation was for an RTOS with a compile-time-fixed
memory size. In user space (since doubling RAM usage is obviously bad) it
would be best to use some container (e.g. an interval tree) to hold
buddy-malloc nodes and general slabs of mmap. Current usage, a
configurable fragmentation threshold, and the system page size would
determine whether an allocation uses an existing buddy-malloc node, a new
buddy-malloc node (a PoT number of pages), or a sequence of pages. Pages
would, of course, be allocated with mmap.

I've personally not implemented malloc debugging for a buddy allocator
though, since bits are precious enough, and there is no fundamental
requirement for it, although technically, it would mean adding some padding
with a computable bit pattern. I suppose if checking is enabled, it can be
run before free(3) returns or before realloc(3) does anything. A global
check on every function call would be more thorough, but obviously more
expensive.



* Re: malloc implementation survey: omalloc
  2018-07-31  1:44   ` Rich Felker
@ 2018-08-30 18:16     ` A. Wilcox
  2018-08-30 18:46       ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: A. Wilcox @ 2018-08-30 18:16 UTC (permalink / raw)
  To: musl



On 07/30/18 20:44, Rich Felker wrote:
> One thing that would probably be really useful is empirical data on
> what range of sizes "need to be fast" for bloated OO stuff that does
> rapid malloc/free cycles. Conceptually it should only be small sizes,
> assuming the whole allocated space is actually used, since beyond a
> certain size, actual use of the allocated memory (loads/stores) will
> dominate the time spent allocating and freeing it, but the cutoff
> isn't clear.
> 
> Rich
> 


I can probably instrument the malloc() patterns of Firefox and Plasma
Shell.  Can you provide me with the desired format?

I would probably simply change musl to print "a:%lld\n" to stderr for
each allocation, and then I could turn that into better-formatted data.

Best,
--arw


-- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org




* Re: malloc implementation survey: omalloc
  2018-08-30 18:16     ` A. Wilcox
@ 2018-08-30 18:46       ` Rich Felker
  0 siblings, 0 replies; 12+ messages in thread
From: Rich Felker @ 2018-08-30 18:46 UTC (permalink / raw)
  To: musl

On Thu, Aug 30, 2018 at 01:16:27PM -0500, A. Wilcox wrote:
> On 07/30/18 20:44, Rich Felker wrote:
> > One thing that would probably be really useful is empirical data on
> > what range of sizes "need to be fast" for bloated OO stuff that does
> > rapid malloc/free cycles. Conceptually it should only be small sizes,
> > assuming the whole allocated space is actually used, since beyond a
> > certain size, actual use of the allocated memory (loads/stores) will
> > dominate the time spent allocating and freeing it, but the cutoff
> > isn't clear.
> 
> I can probably instrument the malloc() patterns of Firefox and Plasma
> Shell.  Can you provide me the desired format?
> 
> I would probably simply change musl to print "a:%lld\n" to stderr for
> each allocation, and then I could turn it into better formatted data.

In order to measure lifetime/life cycle, you need an identifier to tie
realloc/free back to the allocation. Printing the resulting pointer
value should work, but then I think you'd want to postprocess the file
to add unique sequence numbers (or just replace the pointers with
sequence numbers) so that subsequent reuse of the same address
wouldn't get associated with the same allocation. Timestamps might be
useful too, in order to measure lifetime not just in terms of the number
of intervening allocations but in actual wall clock time.

In order to make any of this perform acceptably and somewhat (though
nowhere near entirely) reduce instrumentation noise, I think you'd
want to write in binary form and heavily buffer (buffered stdio/fwrite
would not be bad, or you could roll your own).
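
A minimal sketch of what such a record and logging call could look like,
where the field choice is mine and only meant to illustrate what the
post-processing would need:

    #include <stdint.h>
    #include <stdio.h>

    /* hypothetical sketch: one fixed-size record per malloc/realloc/free call */
    struct alloc_rec {
        uint64_t ptr;    /* returned (or freed) address, ties the calls together */
        uint64_t old;    /* previous address for realloc, 0 otherwise */
        uint64_t size;   /* requested size, 0 for free */
        uint64_t nsec;   /* CLOCK_MONOTONIC timestamp in nanoseconds */
        uint8_t  op;     /* 0 = malloc, 1 = realloc, 2 = free */
    };

    static FILE *trace;  /* opened once at startup, heavily buffered via setvbuf() */

    static void log_rec(const struct alloc_rec *r)
    {
        fwrite(r, sizeof *r, 1, trace);   /* binary, one fwrite per call */
    }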

Rich



Thread overview: 12+ messages
2018-07-29 19:26 malloc implementation survey: omalloc Markus Wichmann
2018-07-30 15:09 ` Christopher Friedt
2018-07-30 19:14   ` Christopher Friedt
2018-07-30 19:22     ` Christopher Friedt
2018-07-30 19:34       ` Christopher Friedt
2018-07-31  7:49   ` Markus Wichmann
2018-07-31 14:25     ` Rich Felker
2018-07-31 14:49     ` Christopher Friedt
2018-07-31  0:47 ` Rich Felker
2018-07-31  1:44   ` Rich Felker
2018-08-30 18:16     ` A. Wilcox
2018-08-30 18:46       ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
