* [musl] [PATCH 1/1] Add documentation for mallocng
2021-11-05 13:53 [musl] [PATCH 0/1] Add documentation for mallocng yoan.picchi
@ 2021-11-05 13:53 ` yoan.picchi
0 siblings, 0 replies; 2+ messages in thread
From: yoan.picchi @ 2021-11-05 13:53 UTC (permalink / raw)
To: musl; +Cc: Yoan Picchi
From: Yoan Picchi <yoan.picchi@arm.com>
This is an attempt at describing the various data structures of
mallocng and how they interact with each other.
Signed-off-by: Yoan Picchi <yoan.picchi@arm.com>
Change-Id: I37659bbbc6fb0c23dbd93c9eba27e51bfeb4715d
---
src/malloc/mallocng/readme.txt | 284 +++++++++++++++++++++++++++++++++
1 file changed, 284 insertions(+)
create mode 100644 src/malloc/mallocng/readme.txt
diff --git a/src/malloc/mallocng/readme.txt b/src/malloc/mallocng/readme.txt
new file mode 100644
index 00000000..c0c04847
--- /dev/null
+++ b/src/malloc/mallocng/readme.txt
@@ -0,0 +1,284 @@
+This doc is intended to give the reader a high-level view of how musl's malloc
+works by explaining its data structures. It is aimed at anyone who wants to
+work on malloc or is simply curious about how it works.
+
+You can find an alternative unofficial shorter explanation here:
+https://gist.github.com/MaskRay/ac54b26d72452ac77ac578f2e625369f
+And some other information can be found in the mailing list archive:
+https://www.openwall.com/lists/musl/2021/10/04/2
+
+
+
+ GROUP
+
+The group is the structure that will actually hold the user data. Given that
+this structure will be required for any size of allocation, it must be small
+to reduce the memory waste for tiny allocations, yet be able to hold a large
+amount of data. There are two slightly different usages depending on the size
+of the data to hold.
+
+ For small allocations (<128KiB):
+A group has several slots, up to 32. The header part of the group is 16B long
+on 64-bit platforms. This size is given by the UNIT constant, which also
+defines the minimal alignment inside a group. The slots follow the header.
+
+ Small group:
++---------------------+
+| meta* meta | \
+| char active_idx:5   | } UNIT (16B)
+| char padding[7] | /
+| |
+| +---------------+ | \
+| | slot | | } sizeclass x UNIT
+| +---------------+ | /
+| | slot | |
+| +---------------+ |
+| ... x32 |
+| +---------------+ |
+| | slot | |
+| +---------------+ |
++---------------------+
+
+Note that a slot uses some memory just before its start to store the in-band
+data (IB). This is safe because a slot's last 4B are always unused. For the
+first slot, the "padding" field provides the needed space, which is why it must
+be at least 4 bytes long. The IB data stores the information required to find
+the various fields in the slot. It is not defined by any struct and is accessed
+solely through pointer arithmetic. If it were defined as a struct, it would
+look like this:
+
+ Basic slot: IB:
+ +---------------------+ <- before the slot +----------------------+
+ | IB (4B) | | char need_big_offset |
+ +---------------------+ | char index:5 |
++-------------------------+ <- slot's start | char reserved:3 |
+| +---------------------+ | | uint16_t offset |
+| | | | +----------------------+
+| | user data | |
+| | | |
+| +---------------------+ | \
+| | canary (1B) | | \
+| +---------------------+ | \
+| | slack | | \
+| +---------------------+ | } size = reserved
+| | footer size (4B) | | /
+| | (exist only if | | /
+| | reserved == 5) | | /
+| +---------------------+ | /
+| | unused (4B) | |
+| +---------------------+ |
++-------------------------+
+
+"need_big_offset" is always zero for multislot groups.
+"index" is the slot's index. It is used to find the slot's end once we have the
+meta (explained later on). It is somewhat redundant, as it could be computed on
+the fly from the slot's address, the group's address and the sizeclass, but by
+keeping the number of slots at no more than 32, these 5 index bits save us that
+calculation.
+"reserved" specifies where the user data actually ends relative to the end of
+the slot, so that a canary can be placed right at the end of the user's data
+to detect overflows. If "reserved" is 0, there is no space for the canary, so
+none is used. Since "reserved" is only 3 bits, larger footers are handled by
+setting it to 5 and writing the actual footer size directly in the slot, where
+there is enough space for it.
+"offset" is the distance between this slot's start and the "storage" field of
+the group (the start of the first slot). Since all slots of a group are
+contiguous, this is what lets us reach the metadata stored at the start of the
+group. Note that this offset is only a 16-bit value. To increase its range, it
+is counted in UNITs (currently 16 bytes), which is why 16 appears in a lot of
+places. This 16-bit offset is also why the groups of the largest sizeclass
+only have up to 8 slots: 8*8191 = 2^16-8.
+This is the meaning of the basic IB data, and it is enough to run malloc, but
+it cannot support aligned_alloc yet. The group header being 16 bytes long,
+what if one needs an alignment of 64B?
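The pointer arithmetic used to decode the IB can be sketched in C as follows. This is a simplified model for illustration only, not musl's actual code; it assumes the layout described above (index in the low 5 bits of the byte at p-3, 16-bit offset in UNITs at p-2):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UNIT 16

/* Recover the group's start from a user pointer p: read the 16-bit
 * offset (in UNITs) stored just before the slot; the group header is
 * UNIT*offset + UNIT bytes before p (one extra UNIT for the header
 * itself, since offset is measured from the first slot). */
static unsigned char *group_start(unsigned char *p)
{
    uint16_t off;
    memcpy(&off, p - 2, sizeof off);   /* offset in UNITs */
    return p - UNIT * off - UNIT;
}

/* The slot's index lives in the low 5 bits of the byte at p-3. */
static int slot_index(const unsigned char *p)
{
    return p[-3] & 31;
}
```

A quick walk-through: for a slot starting 2 UNITs after the first slot, the stored offset is 2, and `group_start` lands back on the group header.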
+
+The slots being a fixed size means that most malloc calls will not fill one
+entirely, leaving some wiggle room as to where to place the user data within
+the slot: this is the "slack". But if the user data starts further into the
+slot, how do we find the IB? The solution is to move the IB data as well.
+This would mean that, once malloc returns, we would no longer know where the
+user data starts within the slot. This is not an issue, though, because the
+IB at the slot's original position is repurposed: in this modified IB, the
+"offset_size" field gives the offset between the start of the slot and the
+start of the user data. To make sure no one uses this IB as a "normal" IB,
+its "reserved" field is set to 7, which triggers an assert if it is used to
+get the metadata.
+
+This offset is not only used by aligned_alloc; it is used in nearly all
+allocations. By cycling through every valid offset in a slot as it is freed
+and reused, we increase the time before the same user address is handed out
+again, which makes it easier to catch a double free.
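The cycling described above can be sketched like this (purely illustrative; the real code derives the set of valid offsets from the slot geometry and IB constraints, not from this simple formula):

```c
#include <assert.h>

/* Sketch: rotate the user-data start offset through the valid
 * positions of a slot so that a freed-and-reallocated slot hands back
 * a different address. All quantities are in UNITs; the names are
 * illustrative, not musl's. */
static int next_start_offset(int prev_offset, int slot_units, int need_units)
{
    int max_offset = slot_units - need_units;  /* last valid start */
    if (max_offset <= 0) return 0;             /* no slack, no cycling */
    return (prev_offset + 1) % (max_offset + 1);
}
```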
+
+ Complex slot: Modified IB:
+ +---------------------+ +----------------------+
+ | modified IB (4B) | | need_big_offset = 0 |
+ +---------------------+ | index:5 = undefined |
++-------------------------+ <- slot's start | reserved:3 = 7 |
+| +---------------------+ | | uint16_t offset_size |
+| | optional offset | | +----------------------+
+| +---------------------+ |
+| +---------------------+ |
+| | IB (4B) | |
+| +---------------------+ |
+| +---------------------+ |
+| | | |
+| | user data | |
+| | | |
+| +---------------------+ |
+| | canary (1B) | |
+| +---------------------+ |
+| | slack | |
+| +---------------------+ |
+| | footer size (4B) | |
+| +---------------------+ |
+| | unused (4B) | |
+| +---------------------+ |
++-------------------------+
+
+
+ For large allocations (≥128KiB):
+A group will only hold a single slot. It is not tracked and reused the way
+small groups are: it is mmapped upon allocation and munmapped when freed.
+Given that such a group has no maximum size, one may need it aligned to a few
+MiB through aligned_alloc, which is more than the 16-bit "offset" (and
+"offset_size") fields can express. In this case, "need_big_offset" is set to 1
+and 4 additional bytes are added to the IB to hold a 32-bit offset. Note that
+this still limits the alignment to 64GiB (2^32 * UNIT).
+It may look like a problem that there are only 7 bytes of padding for this,
+but since there is only one slot, "active_idx" is never used, so we can afford
+to overwrite it with the extra IB.
+
+Also note that in most cases there is some slack in large groups as well,
+because mmap allocates at page granularity. Internally, musl does this
+bookkeeping with 4KiB pages, treating larger hardware pages as a bunch of
+4KiB pages. As such, the size of the slot is a multiple of 4KiB rather than
+a multiple of the actual page size.
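The rounding and the resulting slack can be sketched as follows (a minimal model of the page-granularity math, not mallocng's exact accounting, which also includes the group header and IB):

```c
#include <assert.h>
#include <stddef.h>

#define PGSZ 4096  /* musl's fixed internal 4KiB granularity */

/* Number of 4KiB pages needed to hold n bytes. */
static size_t pages_needed(size_t n)
{
    return (n + PGSZ - 1) / PGSZ;  /* round up to page granularity */
}

/* Slack: bytes mapped beyond what the request actually needs. */
static size_t slack(size_t n)
{
    return pages_needed(n) * PGSZ - n;
}
```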
+
+ Large group:
++---------------------+
+| meta* meta | \
+| char active_idx:5   | } UNIT (16B)
+| char padding[7] | /
+| |
+| +---------------+ |
+| | extra IB (4B) | |
+| +---------------+ | \
+| | | | \
+| | slot | | } maplen x 4KiB
+| | | | /
+| +---------------+ | /
++---------------------+ <- end at a 4KiB boundary
+
+Other than those few changes, large groups behave like small ones.
+
+
+
+ META
+
+A group only carries the first part of its metadata. The second part lives in
+the associated meta object. There is one meta per group, and they each hold a
+pointer to the other.
+
+ Meta:
++--------------------+
+| meta* prev, next | <- form a circular list
+| group* mem |
+| int avail_mask | \ define 3 states for each slot
+| int freed_mask | /
+| size_t last_idx:5 |
+| size_t freeable:1 |
+| size_t sizeclass:6 | <- size if small group
+| size_t maplen:52 | <- size if large group
++--------------------+
+
+This is where the size of the slots is defined, through the "sizeclass" field.
+It indexes a global array of 48 fixed sizes ranging from 1 to 8191. These
+sizes are in UNITs, so they are multiplied by UNIT to get the actual size in
+bytes, hence the small/large group threshold just under 128KiB. Sizeclasses
+[48-62] are unused, and a "sizeclass" of 63 means the group is a large one.
+Because of the 16-bit limit on "offset", the number of slots is limited for
+the largest of the small groups. That number is kept in "last_idx".
+"avail_mask" and "freed_mask" are both bitmaps holding the status of every
+slot in the group. The former records the slots that are available and have
+never been used; the latter records the slots that have been used and freed
+since, or are currently inactive. When one calls malloc, it first attempts to
+allocate from an "avail_mask" slot. Failing that, it uses a freed slot or
+activates new slots. This allocation order is meant to keep memory pages
+clean: a group can easily span several pages, so it is better to fill slots
+in pages that have already been dirtied. This is what the "active_idx" field
+of the group is used for: when a new slot is allocated in a group, all the
+slots that lie in the same page get activated.
+"maplen" is where the size of a large group is stored, as the number of 4KiB
+pages the group spans.
+Then there is "freeable", which can be a bit confusing. While it obviously
+specifies whether the group can be freed, in practice a group is non-freeable
+only when a new meta is allocated on zeroed memory.
+And last, the "prev" and "next" fields. They link the meta objects of a
+sizeclass into a circular list, so that when a slot is required, malloc can
+easily go through the various metas until it finds a satisfactory one. In
+practice it does not search for the optimal one, but uses heuristics such as
+"no slot available and a new slot would have to be activated? take the next
+meta instead".
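The sizeclass arithmetic and the slot-selection order described above can be sketched like this (illustrative only; mallocng's real logic also involves the activation rules and heuristics just mentioned):

```c
#include <assert.h>
#include <stdint.h>

#define UNIT 16

/* Slot size in bytes for a small group: the sizeclass size, expressed
 * in UNITs, times UNIT. 8191 is the largest class, giving 131056
 * bytes, just under the 128KiB large-group threshold. */
static unsigned slot_bytes(unsigned class_units)
{
    return class_units * UNIT;
}

/* Sketch of the allocation order: prefer a never-used available slot,
 * then a freed one; -1 means new slots must be activated (or the next
 * meta tried instead). */
static int pick_slot(uint32_t avail_mask, uint32_t freed_mask)
{
    uint32_t m = avail_mask ? avail_mask : freed_mask;
    if (!m) return -1;
    int i = 0;
    while (!(m & ((uint32_t)1 << i))) i++;  /* find lowest set bit */
    return i;
}
```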
+
+
+
+ META_AREA
+
+We need some way to hold the memory in which the meta objects are allocated.
+That is the purpose of the meta_areas.
+
+ Meta_area:
++--------------------+
+| int64_t check |
+| meta_area* next | <- form a linked list
+| int nslot |
+| meta slots[] | <- slots to store meta objects
++--------------------+
+
+These work like a simple linked list in which each link is bigger than the
+previous one. Growing the sizes exponentially ensures there will only be a
+small number of links and mmap calls. "nslot" gives the number of slots in
+which a meta object can be stored in this link (not to be confused with the
+slots of groups). In practice, though, "nslot" currently seems unused, and
+the slots are tracked through the malloc_context object instead.
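The effect of the exponential growth can be sketched with a small calculation (the doubling factor is an assumption for illustration; only the principle, O(log N) links for N metas, matters):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: if each new meta_area holds twice as many metas as the
 * previous one, covering `metas` objects takes only a logarithmic
 * number of links (and hence of mmap calls). */
static int links_needed(size_t metas, size_t first_cap)
{
    int links = 0;
    size_t cap = first_cap, have = 0;
    while (have < metas) {
        have += cap;  /* this link's capacity */
        cap *= 2;     /* next link is twice as big (assumed factor) */
        links++;
    }
    return links;
}
```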
+
+
+
+ MALLOC_CONTEXT
+
+And finally, the context. This is a global singleton mainly used to manage the
+meta objects. Some of the fields have been omitted for brevity.
+
+ Malloc_context:
++------------------------------+
+| int64_t secret |
+| meta* free_meta_head | <- head of the free meta
+| meta* avail_meta | <- next meta in meta_area
+| size_t avail_meta_count |
+| size_t avail_meta_area_count |
+| meta_area* head, tail |
+| meta* active[48] | <- circular list for each sizeclass
+| size_t usage_by_class[48] | <- number of slots for each sizeclass
+| uint8_t unmap_seq[32] | \ balance fragmentation/slot reuse
+| uint8_t bounce[32] | /
++------------------------------+
+
+"head" and "tail" allow easy access to the meta_areas. Given that meta slots
+are not tracked, we only ever need access to the tail of the list. There is a
+memory protection trick around the meta_areas: when a new meta_area is
+created, only its first 4KiB page is made available for slots. The number of
+slots fitting in that page is written to "avail_meta_count". When
+"avail_meta_count" runs out, another 4KiB worth of slots is added and
+"avail_meta_area_count" is decremented. When it reaches 0, a new meta_area is
+mmap-ed. The point of doing so is that the pages are initially mapped with
+neither read nor write permission, and those rights are only granted as
+malloc needs them.
+"avail_meta" and "free_meta_head" are the two sources of meta objects. The
+former points to the first free slot in the tail meta_area. The latter is a
+pool of all the metas that have been freed. This matters because, once again,
+the meta slots are not tracked in the meta_area: if the freed metas were not
+retained in this list, they would be lost and leak. This way, freed metas are
+reused in priority over initializing new ones.
+For easy access, there is one circular list of metas per sizeclass, in
+"active". Along with those, "usage_by_class" counts how many slots exist for
+each sizeclass.
+"bounce" and "unmap_seq" seem to be used to balance fragmentation against
+address reuse.
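The interplay of the two counters can be sketched as a tiny state machine (a simplified model; the meta size and pages-per-area values are assumptions for illustration, and the real code does the mprotect/mmap work this sketch only counts):

```c
#include <assert.h>

#define PGSZ    4096
#define META_SZ 64   /* assumed meta size, for illustration only */

/* Sketch: take metas from the current unprotected page; when the page
 * runs out, "unprotect" another page of the area; when the area runs
 * out, a fresh meta_area would be mmap-ed. Returns how many fresh
 * meta_areas were "mapped" while serving n requests. */
struct ctx_sketch { int meta_count, area_pages; };

static int take_metas(struct ctx_sketch *c, int n)
{
    int mapped = 0;
    while (n--) {
        if (!c->meta_count) {              /* current page exhausted */
            if (!c->area_pages) {          /* whole area exhausted */
                c->area_pages = 8;         /* assumed pages per area */
                mapped++;
            }
            c->area_pages--;               /* unprotect one more page */
            c->meta_count = PGSZ / META_SZ;
        }
        c->meta_count--;
    }
    return mapped;
}
```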
+
--
2.27.0