From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/13086
Path: news.gmane.org!.POSTED!not-for-mail
From: Markus Wichmann <nullplan@gmx.net>
Newsgroups: gmane.linux.lib.musl.general
Subject: malloc implementation survey: omalloc
Date: Sun, 29 Jul 2018 21:26:18 +0200
Message-ID: <20180729192618.GA22386@voyager>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: blaine.gmane.org 1532892269 3807 195.159.176.226 (29 Jul 2018 19:24:29 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Sun, 29 Jul 2018 19:24:29 +0000 (UTC)
User-Agent: Mutt/1.9.4 (2018-02-28)
To: musl@lists.openwall.com
Original-X-From: musl-return-13102-gllmg-musl=m.gmane.org@lists.openwall.com Sun Jul 29 21:24:25 2018
Return-path: <musl-return-13102-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.84_2)
	(envelope-from <musl-return-13102-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1fjrIn-0000t8-Gx
	for gllmg-musl@m.gmane.org; Sun, 29 Jul 2018 21:24:25 +0200
Original-Received: (qmail 26401 invoked by uid 550); 29 Jul 2018 19:26:33 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 26364 invoked from network); 29 Jul 2018 19:26:32 -0000
Content-Disposition: inline
Xref: news.gmane.org gmane.linux.lib.musl.general:13086
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/13086>

Hi all,

we discussed rewriting malloc() a while back, because, as I recall, Rich
wasn't satisfied with the internal storage the current system is using
(i.e. metadata is stored with the returned pointer) as well as some
corner cases on lock contention, although fine-grained locking is a nice
feature in itself.

I therefore had a look at existing malloc() algorithms, the rationale
being that I thought malloc() to be a solved problem, so we only have to
find the right solution.

As it turns out, it appears Doug Lea was simply too successful: Many
allocators follow his pattern in one way or another. But some systems
buck the trend.

So today I found omalloc, the allocator OpenBSD uses, in this nice
repository:

https://github.com/emeryberger/Malloc-Implementations

Among other implementations, of course.

Well, as written, the omalloc implementation there fulfills one of the
criteria laid out by Rich, namely external storage (metadata saved away
from the returned memory). Locking however is monolithic. Maybe
something we can work on ourselves...

So I thought I'd describe the algorithm as an overview and we can have a
discussion on whether to pursue a similar algorithm ourselves.

1. Data Structures
==================

A /region/ is a large, i.e. multiple-of-page-sized chunk of memory. A
region info struct knows the region's location and size.

A /chunk/ is a small power-of-two byte sized chunk of memory, but at
least 16 bytes. I have no idea where that decision came from.

A /chunk info struct/ contains the location of a chunk page (chunks are
at most half a page in size), the size of each chunk in the page, the
number of free chunks, a bitmap of free chunks and a pointer to the next
like-sized chunk info struct.

Global data structures are: A linked list of free chunk info structs, a
linked list of chunk info structs with free elements for each possible
size between 16 bytes and half a page, and a cache of 256 free regions.

The most important data structure is a hash table. The hash table
contains entries of two machine words. The first word holds the key, a
page number, in all but its lowest lb(PAGE_SIZE) bits; those low bits
hold a small second value. In essence, this means an entry has a key of
20 or 52 bits, one value of 12 bits and another value of one machine
word. Adjust by one where appropriate for 8K archs. The hash table
contains 512 entries after initialization and can only grow. The aim is
to keep the load factor below 3/4 (i.e. if the number of free entries
falls below a quarter of the total number of entries, the table is grown
by doubling the number of total entries).

I'll call the three parts page number, chunk size, and region size.
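
To make the packing concrete, here is a minimal sketch of such an entry.
It assumes 4K pages, and all names (struct entry, make_entry and so on)
are mine, not omalloc's:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4K pages */
#define PAGE_MASK  (((uintptr_t)1 << PAGE_SHIFT) - 1)

/* Hypothetical layout of one hash table entry: two machine words. */
struct entry {
    uintptr_t key;   /* page number in the high bits, chunk size in the low bits */
    uintptr_t value; /* "region size": a byte count, or a chunk info pointer */
};

static struct entry make_entry(void *page, unsigned chunk_size, uintptr_t value)
{
    struct entry e;
    assert(((uintptr_t)page & PAGE_MASK) == 0); /* page must be page aligned */
    assert(chunk_size <= PAGE_MASK);            /* must fit in the low bits */
    e.key = (uintptr_t)page | chunk_size;
    e.value = value;
    return e;
}

/* The two halves of the first word. */
static uintptr_t entry_page(struct entry e)
{
    return e.key & ~PAGE_MASK;
}

static unsigned entry_chunk_size(struct entry e)
{
    return (unsigned)(e.key & PAGE_MASK);
}
```

A chunk size of zero then marks a region, and anything else a chunk
page, as described in the sections below.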

2. Algorithms
=============
a. Allocation
-------------

Allocation is split in two cases: Large and small.

For large allocations (> 1/2 page) we only allocate a region to service
the request. This means the size is rounded up to the next multiple of
the page size.
Then we search our region cache for a region large enough. If we find
one that fits exactly, we return it. If we find a larger one, but none
that fits exactly, we return the end of the region and adjust the saved
size. If we find none of these, we escalate to the OS.

If we did manage to find a region, we save in the hash table an entry
with page number equal to the page number of the returned page, chunk
size set to zero and region size set to the allocation size, and return
the allocated page(s).
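
The cache search could be sketched roughly like this. This is my own
reconstruction from the description above, not the omalloc code; it
assumes 4K pages, and cache_take and page_round are made-up names:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 256  /* cache size from the description above */
#define PAGE_SIZE   4096 /* assumption: 4K pages */

struct region { char *addr; size_t size; }; /* size 0 marks an empty slot */

static struct region cache[CACHE_SLOTS];

/* Round a byte count up to a whole number of pages. */
static size_t page_round(size_t n)
{
    return (n + PAGE_SIZE - 1) & ~(size_t)(PAGE_SIZE - 1);
}

/* Return `size` bytes from the cache, or NULL if we must ask the OS. */
static void *cache_take(size_t size)
{
    struct region *bigger = NULL;

    assert(size != 0 && size % PAGE_SIZE == 0);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct region *r = &cache[i];
        if (r->size == size) { /* exact fit: hand out the whole region */
            void *p = r->addr;
            r->addr = NULL;
            r->size = 0;
            return p;
        }
        if (r->size > size && !bigger)
            bigger = r; /* remember the first larger region */
    }
    if (!bigger)
        return NULL;
    bigger->size -= size; /* carve the request off the end */
    return bigger->addr + bigger->size;
}
```

Taking the end of a larger region means only the saved size has to
change; the cached region keeps its start address.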

For small allocations (<= 1/2 page) we need to allocate a chunk. For
this, we round the request up to the next power of two, but at least to
16.
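
As a sketch, the rounding amounts to the following (chunk_shift is a
name I made up, and 4K pages are assumed). It returns the binary log of
the chunk size, since that is what ends up in the hash table:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096 /* assumption: 4K pages */
#define MIN_SHIFT 4    /* 1 << 4 == 16, the minimum chunk size */

/* Binary log of the chunk size used for a small request of n bytes. */
static unsigned chunk_shift(size_t n)
{
    unsigned shift = MIN_SHIFT;

    assert(n <= PAGE_SIZE / 2); /* larger requests get a region instead */
    while (((size_t)1 << shift) < n)
        shift++;
    return shift;
}
```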

Then we check if we have a chunk info header in the slot for that size.
If not, we need to allocate one, by getting a header from the list of
free chunk info headers. If that is also empty, we allocate a page for
it and fill it with free chunk info headers. They are constant sized.

With a chunk info header in hand, we allocate a page for it, then fill
in the chunk info header, most importantly setting the bitmap to 1 for
all valid chunks. Then we save in the global hash table the page number
of the page containing the actual memory we'll return, as chunk size the
binary log of the size of each chunk, and as region size a pointer to
the chunk info header.

Allocation of a chunk then means finding a one-bit, setting it to zero
and returning the corresponding pointer in the page the header is
pointing to.
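
A minimal sketch of the chunk info header and the bitmap search, under
my own assumptions: 4K pages, a plain bit-by-bit scan, and field and
function names that are hypothetical rather than omalloc's:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define PAGE_SIZE    4096 /* assumption: 4K pages */
#define MAX_CHUNKS   (PAGE_SIZE / 16)
#define WORD_BITS    (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS ((MAX_CHUNKS + WORD_BITS - 1) / WORD_BITS)

struct chunk_info {
    struct chunk_info *next;          /* next like-sized header */
    char *page;                       /* the chunk page */
    unsigned shift;                   /* binary log of the chunk size */
    unsigned total;                   /* chunks in the page */
    unsigned nfree;                   /* number of free chunks */
    unsigned long bits[BITMAP_WORDS]; /* 1 = free */
};

static void chunk_info_init(struct chunk_info *ci, char *page, unsigned shift)
{
    assert(shift >= 4 && ((size_t)1 << shift) <= PAGE_SIZE / 2);
    ci->next = NULL;
    ci->page = page;
    ci->shift = shift;
    ci->total = ci->nfree = PAGE_SIZE >> shift;
    for (size_t w = 0; w < BITMAP_WORDS; w++)
        ci->bits[w] = 0;
    for (unsigned i = 0; i < ci->total; i++) /* mark all valid chunks free */
        ci->bits[i / WORD_BITS] |= 1UL << (i % WORD_BITS);
}

/* Find a one-bit, clear it, return the matching chunk (NULL if full). */
static void *chunk_take(struct chunk_info *ci)
{
    for (unsigned i = 0; i < ci->total; i++) {
        unsigned long mask = 1UL << (i % WORD_BITS);
        if (ci->bits[i / WORD_BITS] & mask) {
            ci->bits[i / WORD_BITS] &= ~mask;
            ci->nfree--;
            return ci->page + ((size_t)i << ci->shift);
        }
    }
    return NULL;
}
```

A real implementation would presumably skip whole zero words instead of
testing every bit, but the idea is the same.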

b. Freeing
----------

The page number of the pointer is looked up in the hash table. If not
found, the pointer is invalid.

If found, then we need to look at the chunk size portion of the hash
table entry. If that is zero then we need to free a region, else we need
to free a chunk.

To free a region, we remove the entry from the hash table, then add the
region to our cache using a very weird heuristic. In any case, any entry
thrown out of the cache in the process, as well as the current region,
should it not be added to the cache, will be unmapped.

To free a chunk, we set the corresponding bit and increase the counter
of free chunks in the chunk info header, whose address we have from the
repurposed region size part of the hash table entry. If this set the
number of free chunks to one, we add the chunk info header to the list
of chunk info headers with free chunks for the given size. Also, if now
every chunk in the page is free, we remove the chunk info header from
the bucket list, add it to the list of free chunk info headers and unmap
the page.
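
In code, freeing a chunk might look like the sketch below. The header
is pared down to what this step needs, the list and unmap operations
are left to the caller, and all names are my own invention. 4K pages
are assumed again:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096 /* assumption: 4K pages */

/* Pared-down chunk info header, just enough for this sketch. */
struct chunk_info {
    char *page;                         /* the chunk page */
    unsigned shift;                     /* binary log of the chunk size */
    unsigned total;                     /* chunks in the page */
    unsigned nfree;                     /* number of free chunks */
    uint64_t bits[PAGE_SIZE / 16 / 64]; /* 1 = free */
};

enum free_result { STILL_PARTIAL, BECAME_AVAILABLE, PAGE_EMPTY };

/* Mark the chunk at p free and tell the caller what list work is due. */
static enum free_result chunk_release(struct chunk_info *ci, void *p)
{
    size_t off = (size_t)((char *)p - ci->page);
    size_t i = off >> ci->shift;
    uint64_t mask = (uint64_t)1 << (i % 64);

    assert(off < PAGE_SIZE);
    assert((off & (((size_t)1 << ci->shift) - 1)) == 0); /* chunk aligned */
    assert(!(ci->bits[i / 64] & mask));                  /* no double free */

    ci->bits[i / 64] |= mask;
    ci->nfree++;
    if (ci->nfree == ci->total)
        return PAGE_EMPTY;       /* unlink header, recycle it, unmap page */
    if (ci->nfree == 1)
        return BECAME_AVAILABLE; /* link header into the list for its size */
    return STILL_PARTIAL;
}
```

Since chunks are at most half a page, a page holds at least two of
them, so the "first free chunk" and "page now empty" cases cannot
coincide.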

c. Reallocation
---------------

If both the old and new allocation sizes are larger than half a page, we
try to remap. Otherwise we have little choice but to allocate a new
block and copy everything.

d. Memory donation
------------------

Entire pages can be added to the cache of free regions. Smaller blocks
can be added to the list of free chunk info headers. We have no use for
even smaller blocks.

3. Critique
===========

As written in the above repo, omalloc takes a global lock around each
operation. This is probably unacceptable. Also, while the implementation
offers a high degree of customizability, it uses syscall upon syscall,
most of which can probably be removed.

The hash table uses linear probing to resolve conflicts, albeit
backwards. According to Wikipedia this encourages primary clustering,
i.e. used entries tend to clump together. Also, the hash table growth
algorithm isn't realtime; it allocates the new table and completely
re-keys all entries from the old one. This means in the worst case,
allocation has unbounded run-time, though this amortizes.
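
For illustration, here is a sketch of backwards linear probing with
grow-by-doubling. It is not the omalloc code: the hash is a placeholder
(just the page number), the key is the bare page address with zero
meaning "empty", deletion is left out, and the names are mine:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12 /* assumption: 4K pages */

struct entry { uintptr_t key, value; }; /* key 0 marks an empty slot */

struct table {
    struct entry *slots;
    size_t total; /* always a power of two */
    size_t nfree;
};

static size_t slot_for(const struct table *t, uintptr_t page)
{
    return (size_t)(page >> PAGE_SHIFT) & (t->total - 1); /* placeholder hash */
}

static int table_init(struct table *t, size_t total)
{
    assert(total != 0 && (total & (total - 1)) == 0);
    t->slots = calloc(total, sizeof *t->slots);
    t->total = t->nfree = total;
    return t->slots ? 0 : -1;
}

static void put(struct table *t, uintptr_t page, uintptr_t value)
{
    size_t i = slot_for(t, page);
    while (t->slots[i].key)
        i = (i - 1) & (t->total - 1); /* probe backwards, wrapping around */
    t->slots[i].key = page;
    t->slots[i].value = value;
    t->nfree--;
}

static int table_insert(struct table *t, uintptr_t page, uintptr_t value)
{
    /* Would fewer than a quarter of the slots stay free? Then double. */
    if ((t->nfree - 1) * 4 < t->total) {
        struct table bigger;
        if (table_init(&bigger, t->total * 2) < 0)
            return -1;
        for (size_t i = 0; i < t->total; i++) /* re-key every old entry */
            if (t->slots[i].key)
                put(&bigger, t->slots[i].key, t->slots[i].value);
        free(t->slots);
        *t = bigger;
    }
    put(t, page, value);
    return 0;
}

static struct entry *table_find(const struct table *t, uintptr_t page)
{
    size_t i = slot_for(t, page);
    while (t->slots[i].key) {
        if (t->slots[i].key == page)
            return &t->slots[i];
        i = (i - 1) & (t->total - 1);
    }
    return NULL;
}
```

The loop in table_insert is the non-realtime part: a single unlucky
insertion pays for re-keying the whole table.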

Let's talk parallelism: The hash table can be secured with a lock. Both
allocation and deallocation need to write to the table, so a mutex will
have to do. As for the chunk info header lists, they could all be
implemented in a lock-free way. Unfortunately this means that a chunk
info header can disappear from one list for a moment and re-appear in
another a moment later. Which can lead to a thread seeing a list empty
which actually isn't. Protecting all lists with the same lock would
solve that issue but reduce parallelism more. Finally, the free page
cache... could be protected with a different lock. We also need a lock
for each chunk info header. Unless we implement the bitmap in a
lock-free way, but that means that the bitmap and the "number of free
chunks" counter are no longer guaranteed to be consistent. Decisions,
decisions...
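
To illustrate the consistency problem, here is a sketch of a lock-free
bitmap using C11 atomics. It assumes a single 64-bit bitmap word per
page and made-up names, and it is only one of several ways to do this:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock-free variant: one bitmap word plus a counter. */
struct lf_chunks {
    atomic_ullong bits; /* 1 = free */
    atomic_uint nfree;  /* number of free chunks */
};

/* Claim one free chunk; returns its index, or -1 if none is left. */
static int lf_take(struct lf_chunks *c)
{
    unsigned long long old = atomic_load(&c->bits);

    while (old != 0) {
        int i = 0;
        while (!((old >> i) & 1)) /* lowest set bit in our snapshot */
            i++;
        /* Try to clear that bit; on failure `old` is reloaded for us. */
        if (atomic_compare_exchange_weak(&c->bits, &old,
                                         old & ~(1ULL << i))) {
            /* Separate step: until this runs, another thread can see
             * the bit cleared while nfree still counts the chunk. */
            atomic_fetch_sub(&c->nfree, 1);
            return i;
        }
    }
    return -1;
}
```

The window between the compare-and-swap and the counter update is
exactly where bitmap and counter disagree.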

Finally, the run time. All external storage solutions require iteration
over structures. At the moment, in the single threaded case, our malloc
requires merely the removal of a single element from a list. omalloc, on
the other hand, for small allocations always requires an iteration over
a bitmap to find a bit that is set. And for large allocations it always
requires searching the free page cache, or a syscall. Or both, in the
expensive case.

So, is this a direction to pursue, or should we look further?

Ciao,
Markus