* [musl] [PATCH 1/1] Add documentation for mallocng
2021-11-05 13:53 [musl] [PATCH 0/1] Add documentation for mallocng yoan.picchi
@ 2021-11-05 13:53 ` yoan.picchi
0 siblings, 0 replies; 2+ messages in thread
From: yoan.picchi @ 2021-11-05 13:53 UTC (permalink / raw)
To: musl; +Cc: Yoan Picchi
From: Yoan Picchi <yoan.picchi@arm.com>
This is an attempt at describing the various data structures of
mallocng and how they interact with each other.
Signed-off-by: Yoan Picchi <yoan.picchi@arm.com>
Change-Id: I37659bbbc6fb0c23dbd93c9eba27e51bfeb4715d
---
src/malloc/mallocng/readme.txt | 284 +++++++++++++++++++++++++++++++++
1 file changed, 284 insertions(+)
create mode 100644 src/malloc/mallocng/readme.txt
diff --git a/src/malloc/mallocng/readme.txt b/src/malloc/mallocng/readme.txt
new file mode 100644
index 00000000..c0c04847
--- /dev/null
+++ b/src/malloc/mallocng/readme.txt
@@ -0,0 +1,284 @@
+This doc is intended to give the reader a high-level view of how musl's malloc
+works by explaining its data structures. It is aimed at anyone who wants to
+work on malloc or is simply curious about how it works.
+
+You can find an alternative unofficial shorter explanation here:
+https://gist.github.com/MaskRay/ac54b26d72452ac77ac578f2e625369f
+And some other information can be found in the mailing list archive:
+https://www.openwall.com/lists/musl/2021/10/04/2
+
+
+
+ GROUP
+
+The group is the structure that will actually hold the user data. Given that
+this structure will be required for any size of allocation, it must be small
+to reduce the memory waste for tiny allocations, yet be able to hold a large
+amount of data. There are two slightly different usages depending on the size
+of the data to hold.
+
+ For small allocations (<128KiB):
+A group has several slots, up to 32. The header part of the group is 16B long
+on 64-bit platforms. This size is given by the UNIT constant, which also
+defines the minimal alignment inside a group. The slots follow the header.
+
+ Small group:
++---------------------+
+| meta* meta | \
+| char active_idx:5   | } UNIT (16B)
+| char padding[7] | /
+| |
+| +---------------+ | \
+| | slot | | } sizeclass x UNIT
+| +---------------+ | /
+| | slot | |
+| +---------------+ |
+| ... x32 |
+| +---------------+ |
+| | slot | |
+| +---------------+ |
++---------------------+
+
+Note that a slot uses some memory just before its start to store the in-band
+data (IB). This is safe because a slot's last 4B are always unused. For the
+first slot, the "padding" field provides the needed space, which is why it must
+be at least 4 bytes long. The IB data stores the information required to find
+the various fields in the slot. It is not defined by any struct and is accessed
+solely through pointer arithmetic. If it were defined as a struct, it would
+look like this:
+
+ Basic slot: IB:
+ +---------------------+ <- before the slot +----------------------+
+ | IB (4B) | | char need_big_offset |
+ +---------------------+ | char index:5 |
++-------------------------+ <- slot's start | char reserved:3 |
+| +---------------------+ | | uint16_t offset |
+| | | | +----------------------+
+| | user data | |
+| | | |
+| +---------------------+ | \
+| | canary (1B) | | \
+| +---------------------+ | \
+| | slack | | \
+| +---------------------+ | } size = reserved
+| | footer size (4B) | | /
+| | (exist only if | | /
+| | reserved == 5) | | /
+| +---------------------+ | /
+| | unused (4B) | |
+| +---------------------+ |
++-------------------------+
+
+"need_big_offset" is always zero for multislot groups.
+"index" is the slot's index. It is used to find the slot's end once we have the
+meta (explained later on). It is somewhat redundant, as it could be computed on
+the fly from the slot's address, the group's address and the sizeclass, but by
+keeping the number of slots at no more than 32, these 5 index bits save us that
+calculation.
+"reserved" specifies where the user data actually ends relative to the end of
+the slot, so that a canary can be placed right at the end of the user's data
+to detect overflows. If "reserved" is 0, there is no space for the canary, so
+none is used. Since "reserved" is only 3 bits, larger footers are handled by
+setting it to 5 and writing the actual footer size directly in the slot, where
+there is enough space for it.
+"offset" is the distance between this slot's start and the "storage" field of
+the group (the start of the first slot). Since all slots of a group are
+contiguous, this is what lets us reach the metadata stored at the start of the
+group. Note that this offset is only a 16-bit value. To increase its range, it
+is counted in UNITs (currently 16 bytes), which is why 16 appears in a lot of
+places. This 16-bit offset is also why the groups of the largest sizeclass
+only have up to 8 slots: 8*8191 = 2^16-8.
+This is the meaning of the basic IB data, and it is enough to run malloc, but
+it cannot support aligned_alloc yet. The group header being 16 bytes long,
+what if one needs an alignment of 64B?
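The pointer arithmetic used to decode the IB can be sketched in C as follows. This is a simplified model for illustration only, not musl's actual code; it assumes the layout described above (index in the low 5 bits of the byte at p-3, 16-bit offset in UNITs at p-2):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UNIT 16

/* Recover the group's start from a user pointer p: read the 16-bit
 * offset (in UNITs) stored just before the slot; the group header is
 * UNIT*offset + UNIT bytes before p (one extra UNIT for the header
 * itself, since offset is measured from the first slot). */
static unsigned char *group_start(unsigned char *p)
{
    uint16_t off;
    memcpy(&off, p - 2, sizeof off);   /* offset in UNITs */
    return p - UNIT * off - UNIT;
}

/* The slot's index lives in the low 5 bits of the byte at p-3. */
static int slot_index(const unsigned char *p)
{
    return p[-3] & 31;
}
```

A quick walk-through: for a slot starting 2 UNITs after the first slot, the stored offset is 2, and `group_start` lands back on the group header.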
+
+The slots being a fixed size means that most malloc calls will not fill one
+entirely, leaving some wiggle room as to where to place the user data within
+the slot: this is the "slack". But if the user data starts further into the
+slot, how do we find the IB? The solution is to move the IB data as well.
+This would mean that, once malloc returns, we would no longer know where the
+user data starts within the slot. This is not an issue, though, because the
+IB at the slot's original position is repurposed: in this modified IB, the
+"offset_size" field gives the offset between the start of the slot and the
+start of the user data. To make sure no one uses this IB as a "normal" IB,
+its "reserved" field is set to 7, which triggers an assert if it is used to
+get the metadata.
+
+This offset is not only used by aligned_alloc; it is used in nearly all
+allocations. By cycling through every valid offset in a slot as it is freed
+and reused, we increase the time before the same user address is handed out
+again, which makes it easier to catch a double free.
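The cycling described above can be sketched like this (purely illustrative; the real code derives the set of valid offsets from the slot geometry and IB constraints, not from this simple formula):

```c
#include <assert.h>

/* Sketch: rotate the user-data start offset through the valid
 * positions of a slot so that a freed-and-reallocated slot hands back
 * a different address. All quantities are in UNITs; the names are
 * illustrative, not musl's. */
static int next_start_offset(int prev_offset, int slot_units, int need_units)
{
    int max_offset = slot_units - need_units;  /* last valid start */
    if (max_offset <= 0) return 0;             /* no slack, no cycling */
    return (prev_offset + 1) % (max_offset + 1);
}
```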
+
+ Complex slot: Modified IB:
+ +---------------------+ +----------------------+
+ | modified IB (4B) | | need_big_offset = 0 |
+ +---------------------+ | index:5 = undefined |
++-------------------------+ <- slot's start | reserved:3 = 7 |
+| +---------------------+ | | uint16_t offset_size |
+| | optional offset | | +----------------------+
+| +---------------------+ |
+| +---------------------+ |
+| | IB (4B) | |
+| +---------------------+ |
+| +---------------------+ |
+| | | |
+| | user data | |
+| | | |
+| +---------------------+ |
+| | canary (1B) | |
+| +---------------------+ |
+| | slack | |
+| +---------------------+ |
+| | footer size (4B) | |
+| +---------------------+ |
+| | unused (4B) | |
+| +---------------------+ |
++-------------------------+
+
+
+ For large allocations (≥128KiB):
+A group will only hold a single slot. It is not tracked and reused the way
+small groups are: it is mmapped upon allocation and munmapped when freed.
+Given that such a group has no maximum size, one may need it aligned to a few
+MiB through aligned_alloc, which is more than the 16-bit "offset" (and
+"offset_size") fields can express. In this case, "need_big_offset" is set to 1
+and 4 additional bytes are added to the IB to hold a 32-bit offset. Note that
+this still limits the alignment to 64GiB (2^32 * UNIT).
+It may look like a problem that there are only 7 bytes of padding for this,
+but since there is only one slot, "active_idx" is never used, so we can afford
+to overwrite it with the extra IB.
+
+Also note that in most cases there is some slack in large groups as well,
+because mmap allocates at page granularity. Internally, musl does this
+bookkeeping with 4KiB pages, treating larger hardware pages as a bunch of
+4KiB pages. As such, the size of the slot is a multiple of 4KiB rather than
+a multiple of the actual page size.
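The rounding and the resulting slack can be sketched as follows (a minimal model of the page-granularity math, not mallocng's exact accounting, which also includes the group header and IB):

```c
#include <assert.h>
#include <stddef.h>

#define PGSZ 4096  /* musl's fixed internal 4KiB granularity */

/* Number of 4KiB pages needed to hold n bytes. */
static size_t pages_needed(size_t n)
{
    return (n + PGSZ - 1) / PGSZ;  /* round up to page granularity */
}

/* Slack: bytes mapped beyond what the request actually needs. */
static size_t slack(size_t n)
{
    return pages_needed(n) * PGSZ - n;
}
```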
+
+ Large group:
++---------------------+
+| meta* meta | \
+| char active_idx:5   | } UNIT (16B)
+| char padding[7] | /
+| |
+| +---------------+ |
+| | extra IB (4B) | |
+| +---------------+ | \
+| | | | \
+| | slot | | } maplen x 4KiB
+| | | | /
+| +---------------+ | /
++---------------------+ <- end at a 4KiB boundary
+
+Other than those few changes, large groups behave like small ones.
+
+
+
+ META
+
+A group only carries the first part of its metadata. The second part lives in
+the associated meta object. There is one meta per group, and they each hold a
+pointer to the other.
+
+ Meta:
++--------------------+
+| meta* prev, next | <- form a circular list
+| group* mem |
+| int avail_mask | \ define 3 states for each slot
+| int freed_mask | /
+| size_t last_idx:5 |
+| size_t freeable:1 |
+| size_t sizeclass:6 | <- size if small group
+| size_t maplen:52 | <- size if large group
++--------------------+
+
+This is where the size of the slots is defined, through the "sizeclass" field.
+It indexes a global array of 48 fixed sizes ranging from 1 to 8191. These
+sizes are in UNITs, so they are multiplied by UNIT to get the actual size in
+bytes, hence the small/large group threshold just under 128KiB. Sizeclasses
+[48-62] are unused, and a "sizeclass" of 63 means the group is a large one.
+Because of the 16-bit limit on "offset", the number of slots is limited for
+the largest of the small groups. That number is kept in "last_idx".
+"avail_mask" and "freed_mask" are both bitmaps holding the status of every
+slot in the group. The former records the slots that are available and have
+never been used; the latter records the slots that have been used and freed
+since, or are currently inactive. When one calls malloc, it first attempts to
+allocate from an "avail_mask" slot. Failing that, it uses a freed slot or
+activates new slots. This allocation order is meant to keep memory pages
+clean: a group can easily span several pages, so it is better to fill slots
+in pages that have already been dirtied. This is what the "active_idx" field
+of the group is used for: when a new slot is allocated in a group, all the
+slots that lie in the same page get activated.
+"maplen" is where the size of a large group is stored, as the number of 4KiB
+pages the group spans.
+Then there is "freeable", which can be a bit confusing. While it obviously
+specifies whether the group can be freed, in practice a group is non-freeable
+only when a new meta is allocated on zeroed memory.
+And last, the "prev" and "next" fields. They link the meta objects of a
+sizeclass into a circular list, so that when a slot is required, malloc can
+easily go through the various metas until it finds a satisfactory one. In
+practice it does not search for the optimal one, but uses heuristics such as
+"no slot available and a new slot would have to be activated? take the next
+meta instead".
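The sizeclass arithmetic and the slot-selection order described above can be sketched like this (illustrative only; mallocng's real logic also involves the activation rules and heuristics just mentioned):

```c
#include <assert.h>
#include <stdint.h>

#define UNIT 16

/* Slot size in bytes for a small group: the sizeclass size, expressed
 * in UNITs, times UNIT. 8191 is the largest class, giving 131056
 * bytes, just under the 128KiB large-group threshold. */
static unsigned slot_bytes(unsigned class_units)
{
    return class_units * UNIT;
}

/* Sketch of the allocation order: prefer a never-used available slot,
 * then a freed one; -1 means new slots must be activated (or the next
 * meta tried instead). */
static int pick_slot(uint32_t avail_mask, uint32_t freed_mask)
{
    uint32_t m = avail_mask ? avail_mask : freed_mask;
    if (!m) return -1;
    int i = 0;
    while (!(m & ((uint32_t)1 << i))) i++;  /* find lowest set bit */
    return i;
}
```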
+
+
+
+ META_AREA
+
+We need some way to hold the memory in which the meta objects are allocated.
+That is the purpose of the meta_areas.
+
+ Meta_area:
++--------------------+
+| int64_t check |
+| meta_area* next | <- form a linked list
+| int nslot |
+| meta slots[] | <- slots to store meta objects
++--------------------+
+
+These work like a simple linked list in which each link is bigger than the
+previous one. Growing the sizes exponentially ensures there will only be a
+small number of links and mmap calls. "nslot" gives the number of slots in
+which a meta object can be stored in this link (not to be confused with the
+slots of groups). In practice, though, "nslot" currently seems unused, and
+the slots are tracked through the malloc_context object instead.
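The effect of the exponential growth can be sketched with a small calculation (the doubling factor is an assumption for illustration; only the principle, O(log N) links for N metas, matters):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: if each new meta_area holds twice as many metas as the
 * previous one, covering `metas` objects takes only a logarithmic
 * number of links (and hence of mmap calls). */
static int links_needed(size_t metas, size_t first_cap)
{
    int links = 0;
    size_t cap = first_cap, have = 0;
    while (have < metas) {
        have += cap;  /* this link's capacity */
        cap *= 2;     /* next link is twice as big (assumed factor) */
        links++;
    }
    return links;
}
```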
+
+
+
+ MALLOC_CONTEXT
+
+And finally, the context. This is a global singleton mainly used to manage the
+meta objects. Some of the fields have been omitted for brevity.
+
+ Malloc_context:
++------------------------------+
+| int64_t secret |
+| meta* free_meta_head | <- head of the free meta
+| meta* avail_meta | <- next meta in meta_area
+| size_t avail_meta_count |
+| size_t avail_meta_area_count |
+| meta_area* head, tail |
+| meta* active[48] | <- circular list for each sizeclass
+| size_t usage_by_class[48] | <- number of slots for each sizeclass
+| uint8_t unmap_seq[32] | \ balance fragmentation/slot reuse
+| uint8_t bounce[32] | /
++------------------------------+
+
+"head" and "tail" allow easy access to the meta_areas. Given that meta slots
+are not tracked, we only ever need access to the tail of the list. There is a
+memory protection trick around the meta_areas: when a new meta_area is
+created, only its first 4KiB page is made available for slots. The number of
+slots fitting in that page is written to "avail_meta_count". When
+"avail_meta_count" runs out, another 4KiB worth of slots is added and
+"avail_meta_area_count" is decremented. When it reaches 0, a new meta_area is
+mmap-ed. The point of doing so is that the pages are initially mapped with
+neither read nor write permission, and those rights are only granted as
+malloc needs them.
+"avail_meta" and "free_meta_head" are the two sources of meta objects. The
+former points to the first free slot in the tail meta_area. The latter is a
+pool of all the metas that have been freed. This matters because, once again,
+the meta slots are not tracked in the meta_area: if the freed metas were not
+retained in this list, they would be lost and leak. This way, freed metas are
+reused in priority over initializing new ones.
+For easy access, there is one circular list of metas per sizeclass, in
+"active". Along with those, "usage_by_class" counts how many slots exist for
+each sizeclass.
+"bounce" and "unmap_seq" seem to be used to balance fragmentation against
+address reuse.
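The interplay of the two counters can be sketched as a tiny state machine (a simplified model; the meta size and pages-per-area values are assumptions for illustration, and the real code does the mprotect/mmap work this sketch only counts):

```c
#include <assert.h>

#define PGSZ    4096
#define META_SZ 64   /* assumed meta size, for illustration only */

/* Sketch: take metas from the current unprotected page; when the page
 * runs out, "unprotect" another page of the area; when the area runs
 * out, a fresh meta_area would be mmap-ed. Returns how many fresh
 * meta_areas were "mapped" while serving n requests. */
struct ctx_sketch { int meta_count, area_pages; };

static int take_metas(struct ctx_sketch *c, int n)
{
    int mapped = 0;
    while (n--) {
        if (!c->meta_count) {              /* current page exhausted */
            if (!c->area_pages) {          /* whole area exhausted */
                c->area_pages = 8;         /* assumed pages per area */
                mapped++;
            }
            c->area_pages--;               /* unprotect one more page */
            c->meta_count = PGSZ / META_SZ;
        }
        c->meta_count--;
    }
    return mapped;
}
```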
+
--
2.27.0