From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 24 Jan 2022 09:42:36 -0500
From: Rich Felker
To: Yoan Picchi
Cc: musl@lists.openwall.com, yoan.picchi@arm.com
Message-ID: <20220124144234.GG7074@brightrain.aerifal.cx>
In-Reply-To: <52ad54d5-ffc0-6136-6a27-2fe75075c383@foss.arm.com>
Subject: Re: [musl] [PATCH 1/1] Add documentation for mallocng

On Mon, Jan 24, 2022 at 01:25:01PM +0000, Yoan Picchi wrote:
> Hello.
> Having received no review nor acknowledgement for the last few months,
> I'm sending this reminder.
> This patch adds some documentation for mallocng, as the current lack
> thereof makes it time consuming to work with.

Hi. I'm sorry for not replying sooner. Documentation is intentionally
not maintained in the musl tree itself. There is a long-term official
doc project that's stalled, but even that is expected only to document
public contracts. The place to put documentation of internals,
musl-specific behaviors, etc. for immediate benefit is the wiki.
If there doesn't seem to be a good place for this document on the wiki
now, I or someone else can add one.

Thanks for your interest/work in documenting this.

Rich

> From: yoan.picchi@foss.arm.com
> Sent: 05 November 2021 13:36
> To: musl@lists.openwall.com
> Cc: Yoan Picchi
> Subject: [PATCH 1/1] Add documentation for mallocng
>
> From: Yoan Picchi
>
> This is an attempt at describing the various data structures of
> mallocng and how they interact with each other.
>
> Signed-off-by: Yoan Picchi
> Change-Id: I37659bbbc6fb0c23dbd93c9eba27e51bfeb4715d
> ---
>  src/malloc/mallocng/readme.txt | 284 +++++++++++++++++++++++++++++++++
>  1 file changed, 284 insertions(+)
>  create mode 100644 src/malloc/mallocng/readme.txt
>
> diff --git a/src/malloc/mallocng/readme.txt b/src/malloc/mallocng/readme.txt
> new file mode 100644
> index 00000000..c0c04847
> --- /dev/null
> +++ b/src/malloc/mallocng/readme.txt
> @@ -0,0 +1,284 @@
> +This doc is intended to give the reader a high-level view of how
> +musl's malloc works by explaining its data structures. It is targeted
> +at whoever wants to work on malloc or is just curious about how it
> +works.
> +
> +You can find an alternative, unofficial, shorter explanation here:
> +https://gist.github.com/MaskRay/ac54b26d72452ac77ac578f2e625369f
> +And some other information can be found in the mailing list archive:
> +https://www.openwall.com/lists/musl/2021/10/04/2
> +
> +
> +
> +    GROUP
> +
> +The group is the structure that actually holds the user data. Given
> +that this structure is required for every allocation size, it must be
> +small to reduce the memory waste for tiny allocations, yet be able to
> +hold a large amount of data. There are two slightly different usages
> +depending on the size of the data to hold.
> +
> +    For small allocations (<128KiB):
> +A group has several slots, up to 32. The header part of the group is
> +16B long on 64-bit platforms.
> +This is defined by the UNIT constant, which also represents the
> +minimal alignment inside a group. Then start the slots.
> +
> +     Small group:
> ++---------------------+
> +| meta* meta          | \
> +| int number_active:5 |  } UNIT (16B)
> +| char padding[7]     | /
> +|                     |
> +|  +---------------+  | \
> +|  |      slot     |  |  } sizeclass x UNIT
> +|  +---------------+  | /
> +|  |      slot     |  |
> +|  +---------------+  |
> +|       ... x32       |
> +|  +---------------+  |
> +|  |      slot     |  |
> +|  +---------------+  |
> ++---------------------+
> +
> +Note that a slot uses some memory before the slot's start to store
> +the in-band data (IB). This is safe because a slot's last 4B are
> +always unused. For the first slot, "padding" must be at least 4 bytes
> +long to fit the IB data. The IB data stores the information required
> +to find the various fields in the slot.
> +IB is not defined by any struct; it is accessed solely through
> +pointer arithmetic.
> +If the IB had a structure, it would look like this:
> +
> +        Basic slot:                              IB:
> +  +---------------------+ <- before the slot    +----------------------+
> +  |       IB (4B)       |                       | char need_big_offset |
> +  +---------------------+                       | char index:5         |
> ++-------------------------+ <- slot's start     | char reserved:3      |
> +| +---------------------+ |                     | uint16_t offset      |
> +| |                     | |                     +----------------------+
> +| |      user data      | |
> +| |                     | |
> +| +---------------------+ | \
> +| |     canary (1B)     | |  \
> +| +---------------------+ |   \
> +| |        slack        | |    \
> +| +---------------------+ |     } size = reserved
> +| |   footer size (4B)  | |    /
> +| |   (exists only if   | |   /
> +| |    reserved == 5)   | |  /
> +| +---------------------+ | /
> +| |     unused (4B)     | |
> +| +---------------------+ |
> ++-------------------------+
> +
> +"need_big_offset" is always zero for multislot groups.
> +"index" is the slot's index. It is used to find the slot's end once
> +we have the meta (explained later on). It might seem redundant, as
> +one could compute it on the fly from the slot's address, the group's
> +address and the sizeclass, but by keeping the number of slots to 32,
> +these 5 index bits save us that calculation.
> +"reserved" specifies where the user data actually ends relative to
> +the end of the slot. This way a canary can be placed right at the end
> +of the user's data to detect overflows. If "reserved" is 0, there is
> +no space for the canary, so none is used. "reserved" being only 3
> +bits long, larger footers are handled by writing the size of the
> +footer directly into the slot when there is enough space.
> +"offset" is the offset between this slot's start and the "storage"
> +field of the group (the start of the first slot).
> +Given that all group slots are contiguous, this is used to reach the
> +metadata stored at the start of the group. Note that this offset is
> +only a 16-bit value. To increase its range, it gets multiplied by
> +UNIT, which is currently set to 16 and is the reason why 16 appears
> +in a lot of places. This 16-bit offset explains why the largest
> +multislot groups will only have up to 8 slots: 8*8191 = 2^16-8.
> +This is the meaning of the basic IB data, and it is enough to run
> +malloc, but it cannot support aligned_alloc yet. The group metadata
> +being 16 bytes long, what if one needs an alignment of 64B?
> +
> +The slots being a fixed size means that most malloc calls won't
> +actually fill them entirely, so we have some wiggle room as to where
> +to place the user data in a slot: this is the "slack". But if the
> +user data starts further into the slot, how do we find the IB? The
> +solution is to move the IB data as well. This would mean, though,
> +that once we return from the malloc call, we no longer know where the
> +user data starts. This is no issue, because the original IB location
> +is repurposed: in the modified IB, the "offset_size" field gives us
> +the offset between the start of the slot and the start of the user
> +data. To make sure one doesn't use this IB as a "normal" IB, the
> +"reserved" field is set to 7, which triggers an assert if it is used
> +to get the metadata.
> +
> +This offset is not only used in aligned_alloc; it is used in nearly
> +all allocations. By cycling through every valid offset in a slot upon
> +free and reuse, we increase the time before the same user address is
> +reused. This makes it easier to catch a double free.
> +
> +      Complex slot:                            Modified IB:
> +  +---------------------+                    +----------------------+
> +  |   modified IB (4B)  |                    | need_big_offset = 0  |
> +  +---------------------+                    | index:5 = undefined  |
> ++-------------------------+ <- slot's start  | reserved:3 = 7       |
> +| +---------------------+ |                  | uint16_t offset_size |
> +| |   optional offset   | |                  +----------------------+
> +| +---------------------+ |
> +| |       IB (4B)       | |
> +| +---------------------+ |
> +| |                     | |
> +| |      user data      | |
> +| |                     | |
> +| +---------------------+ |
> +| |     canary (1B)     | |
> +| +---------------------+ |
> +| |        slack        | |
> +| +---------------------+ |
> +| |   footer size (4B)  | |
> +| +---------------------+ |
> +| |     unused (4B)     | |
> +| +---------------------+ |
> ++-------------------------+
> +
> +
> +    For large allocations (≥128KiB):
> +A group will only hold a single slot. It is not tracked and reused
> +the same way small groups are. It is mmapped upon allocation and
> +munmapped when freed.
> +Given that this group has no maximum size, it is possible that one
> +needs it aligned to a few MiB using aligned_alloc. This would be
> +above what the 16-bit offset (and offset_size) fields can hold. In
> +this case, "need_big_offset" is set to 1 and an additional 4B is
> +added to the IB to hold a 32-bit offset.
> +Note that this limits alignment to 64GiB (offset*UNIT).
> +It may look like a problem that there are only 7 bytes of padding,
> +but given that there is only one slot, we never use "number_active",
> +so we can afford to overwrite it with the extra IB.
> +
> +Also note that in most cases there is some slack in large groups as
> +well, because mmap allocates at page granularity. Internally, musl
> +works with 4KiB pages and simply considers larger pages as a bunch of
> +4KiB pages. As such, the size of the slot is a multiple of those 4KiB
> +pages and not a multiple of the actual page size.
> +
> +     Large group:
> ++---------------------+
> +| meta* meta          | \
> +| int number_active:5 |  } UNIT (16B)
> +| char padding[7]     | /
> +|                     |
> +|  +---------------+  |
> +|  | extra IB (4B) |  |
> +|  +---------------+  | \
> +|  |               |  |  \
> +|  |      slot     |  |   } maplen x 4KiB
> +|  |               |  |  /
> +|  +---------------+  | /
> ++---------------------+  <- ends at a 4KiB boundary
> +
> +Other than those few changes, large groups behave like small ones.
> +
> +
> +
> +    META
> +
> +Groups only hold the first part of the metadata. The second part is
> +the bound meta object. There is one for each group, and they each
> +have a pointer to the other.
> +
> +        Meta:
> ++--------------------+
> +| meta* prev, next   | <- make a circular list
> +| group* mem         |
> +| int avail_mask     | \ define 3 states for each slot
> +| int freed_mask     | /
> +| size_t last_idx:5  |
> +| size_t freeable:1  |
> +| size_t sizeclass:6 | <- size if small group
> +| size_t maplen:52   | <- size if large group
> ++--------------------+
> +
> +This is where the size of the slots is defined, via the "sizeclass"
> +field. It indexes a global array of 48 fixed sizes ranging from 1 to
> +8191. These sizes are then multiplied by UNIT to get the actual size
> +in memory, hence the small/large group threshold at 128KiB.
> +Sizeclasses [48-62] are unused, and a sizeclass of 63 means it is a
> +large group.
> +Given the limitation of the offset to 16 bits, the number of slots is
> +limited for the largest of the small groups. It is set in "last_idx".
> +The "avail_mask" and "freed_mask" are both bitmasks holding the
> +status of every slot in the group. The former tells whether the slot
> +is available and has never been used; the latter whether the slot has
> +been used but has since been freed, or is currently inactive. When
> +one calls malloc, it first attempts to allocate from an "avail_mask"
> +slot. Failing that, it uses a freed slot or activates some new slots.
> +This allocation order is used to keep memory pages clean. A group can
> +easily span several pages, so it is better to fill in slots in pages
> +that have already been dirtied. This is what the "active_idx" field
> +in group is used for. When a new slot is allocated in a group, all
> +the slots that are in the same page get activated.
> +"maplen" is where the size of a large group is stored. It is the
> +number of 4KiB pages the group spans.
> +Then there is "freeable", which can be a bit more confusing. While it
> +obviously specifies whether the group is freeable or not, currently a
> +group is not freeable only when a new meta is allocated on zeroed
> +memory.
> +And last, the "prev" and "next" fields. These let us make a circular
> +doubly-linked list out of the meta objects. This way, when one
> +requires a slot, malloc can easily go through the various metas until
> +it finds a satisfactory one. In practice it doesn't search for the
> +optimal one but just uses heuristics like "no slot available and need
> +to activate a new slot? take the next meta instead".
> +
> +
> +
> +    META_AREA
> +
> +We need some way to hold the memory where the metas will be
> +allocated. That's the purpose of the meta_areas.
> +
> +     Meta_area:
> ++--------------------+
> +| int64_t check      |
> +| meta_area* next    | <- form a linked list
> +| int nslot          |
> +| meta slots[]       | <- slots to store meta objects
> ++--------------------+
> +
> +These work like a simple linked list where each link is bigger than
> +the previous one.
> +Growing the size exponentially ensures there will only be a small
> +number of links and mmap calls. "nslot" gives the number of slots in
> +which meta objects can be stored in this link (not to be confused
> +with slots in groups). Though, in practice, "nslot" seems to be
> +currently unused and the slots are tracked in the malloc_context
> +object instead.
> +
> +
> +
> +    MALLOC_CONTEXT
> +
> +And finally, the context. This is a global singleton mainly used to
> +manage the meta objects. Some of the fields have been omitted for
> +brevity.
> +
> +        Malloc_context:
> ++------------------------------+
> +| int64_t secret               |
> +| meta* free_meta_head         | <- head of the freed metas
> +| meta* avail_meta             | <- next meta in the meta_area
> +| size_t avail_meta_count      |
> +| size_t avail_meta_area_count |
> +| meta_area* head, tail        |
> +| meta* active[48]             | <- circular list for each sizeclass
> +| size_t usage_by_class[48]    | <- number of slots per sizeclass
> +| uint8_t unmap_seq[32]        | \ balance fragmentation/slot reuse
> +| uint8_t bounce[32]           | /
> ++------------------------------+
> +
> +"head" and "tail" allow easy access to the meta_areas. Given that
> +meta slots are not tracked, we mostly only need access to the tail of
> +the list.
> +There is a kind of memory-protection trick with the meta_area. When a
> +new meta_area is created, only the first 4KiB page is made available
> +for the slots. The number of slots fitting in that page is written to
> +"avail_meta_count". When "avail_meta_count" runs out of slots,
> +another 4KiB worth of slots is added and "avail_meta_area_count" is
> +decremented. When it reaches 0, a new meta_area is mmap-ed. The point
> +of doing this is that the pages originally mapped are initialized
> +with no right to read or write, and those rights are given back just
> +as malloc needs them. "avail_meta" and "free_meta_head" are the two
> +sources of meta objects.
> +The first is a pointer to the first free slot in the tail meta_area.
> +The second is a pool of all the metas that have been freed.
> +This is important because, once again, the meta slots are not tracked
> +in the meta_area. If we weren't retaining the freed metas in this
> +list, they would be lost and we would have a memory leak. This way,
> +we prioritize reusing freed metas over initializing new ones.
> +For easy access, there is a circular list for each sizeclass
> +("active"). Along with those, there are counters of how many slots
> +exist for any given sizeclass: "usage_by_class".
> +"bounce" and "unmap_seq" seem to be used for balancing fragmentation
> +and address reuse.
> --
> 2.27.0