This document gives the reader a high-level view of how musl's malloc
(mallocng) works by explaining its data structures. It is aimed at anyone who
wants to work on malloc or is simply curious about how it works.

A shorter, unofficial explanation can be found here:
https://gist.github.com/MaskRay/ac54b26d72452ac77ac578f2e625369f
Some additional information can be found in the mailing list archive:
https://www.openwall.com/lists/musl/2021/10/04/2



    GROUP

The group is the structure that actually holds the user data. Since this
structure is required for every allocation, whatever its size, it must be
small enough not to waste memory on tiny allocations, yet able to hold large
amounts of data. There are two slightly different usages depending on the
size of the data to hold.

    For small allocations (<128KiB):
A group has several slots, up to 32. The header part of the group is 16B long
on 64-bit platforms. This is defined by the UNIT constant, which also
represents the minimal alignment inside a group. The slots start right after
the header.

     Small group:
+---------------------+
| meta* meta          | \
| char active_idx:5   |
| char padding[7]     | /
|                     |
|  +---------------+  | \
|  |      slot     |  |  } sizeclass x UNIT
|  +---------------+  | /
|  |      slot     |  |
|  +---------------+  |
|       ... x32       |
|  +---------------+  |
|  |      slot     |  |
|  +---------------+  |
+---------------------+
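
For reference, the header above corresponds to a C struct along these lines
(close to the definition in mallocng's meta.h):

#define UNIT 16

struct meta;  /* the bound metadata object, described in the META section */

struct group {
    struct meta *meta;            /* back-pointer to the bound meta */
    unsigned char active_idx:5;   /* index of the last activated slot */
    char pad[UNIT - sizeof(struct meta *) - 1];
    unsigned char storage[];      /* the slots start here */
};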

Note that a slot uses some memory before the slot's start to store its
in-band data (IB). This is safe because a slot's last 4 bytes are always
unused. For the first slot, the group's "padding" provides the at least 4
bytes needed to fit the IB data. The IB data stores the information required
to find the various fields in the slot. The IB is not defined by any struct,
and is accessed solely through pointer arithmetic. If it had a structure, it
would look like this:

        Basic slot:                                         IB:
  +---------------------+   <- before the slot    +----------------------+
  |       IB (4B)       |                         | char need_big_offset |
  +---------------------+                         | char index:5         |
+-------------------------+ <- slot's start       | char reserved:3      |
| +---------------------+ |                       | uint16_t offset      |
| |                     | |                       +----------------------+
| |      user data      | |
| |                     | |
| +---------------------+ | \
| |     canary (1B)     | |  \
| +---------------------+ |   \
| |        slack        | |    \
| +---------------------+ |     } size = reserved
| |   footer size (4B)  | |    /
| |    (exist only if   | |   /
| |    reserved == 5)   | |  /
| +---------------------+ | /
| |     unused (4B)     | |
| +---------------------+ |
+-------------------------+

"need_big_offset" is always null for multislot groups.
"index" is the slot's index. It is used to find the slot's end once we have the
meta (explained later on). It might be redundant as one can calculate it on the
fly from the slot's address, the group's address and the sizeclass, but by
keeping the number of slots to 32, those 5 index bits save us the aforenoted
calculation.
"reserved" specifies where the user data actually ends compared to the end of
the slot. This way a canary can be placed to detect overflows right at the end
of the user's data. If "reserved" is 0, then we have no space for the canary
and so, we don't use any. Being only 3 bits long, we deal with larger footers
by writing the size of the footer directly in the slot when we have enough
space. "offset" is the offset between this slot's start, and the "storage"
field in group (the start of the first slot). Given that all group slots are
contiguous, this is used to access the metadata stored at the start of the
group. Note that this offset is only a 16 bit value. To increase the range, it
gets multiplied by UNIT, which is currently set to 16 and is the reason why 16
appears in a lot of places. This offset of 16 bits explains why the largest
multislot groups will only have up to 8 slots: 8*8191 = 2^16-8.
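
Since the IB has no struct, it is read with plain pointer arithmetic. The
following sketch shows how these fields take us from a user pointer p back to
the group, modeled on get_meta() in meta.h (with the validation checks left
out):

#include <stdint.h>

#define UNIT 16

/* Locate the group from a user pointer p by decoding the 4-byte IB
 * stored just before it. Sketch only: the real get_meta() also
 * validates the result and rejects "reserved == 7" headers. */
static unsigned char *ib_to_group(const unsigned char *p)
{
    uint32_t offset = *(const uint16_t *)(p - 2); /* counted in UNITs */
    int index = p[-3] & 31;                       /* low 5 bits */
    int reserved = p[-3] >> 5;                    /* high 3 bits */
    if (p[-4])  /* need_big_offset: 32-bit offset of large groups */
        offset = *(const uint32_t *)(p - 8);
    (void)index; (void)reserved;
    /* p - UNIT*offset is the first slot ("storage"); one UNIT further
     * back is the group header holding the meta pointer. */
    return (unsigned char *)(p - UNIT*offset - UNIT);
}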
This is the meaning of the basic IB data, and it is enough to run malloc, but
it cannot support aligned_alloc yet. The group metadata being 16 bytes long,
what happens if one needs an alignment of 64B?

Slots being of a fixed size means that most malloc calls won't fill one
entirely, leaving some wiggle room as to where to place the user data within
the slot: this is the "slack". But if the user data starts further into the
slot, how do we find the IB? The solution is to move the IB data as well, so
that it still sits right before the user data. This would seem to mean that,
once we return from the malloc call, we no longer know where in the slot the
user data starts. It is not a problem, though, because the original IB
location at the slot's nominal start gets repurposed: in this modified IB,
the "offset_size" field gives the offset between the start of the slot and
the start of the user data. To make sure nobody uses this modified IB as a
"normal" IB, its "reserved" field is set to 7, which triggers an assert if it
is ever used to reach the metadata.

This offset is not only used by aligned_alloc; it is used in nearly all
allocations. By cycling through every valid offset in a slot as it is freed
and reused, we increase the time before the same user address is handed out
again. This makes it easier to catch a double free.
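
A sketch of this cycling, modeled on enframe() in malloc.c (simplified: the
real code bounds an oversized offset with a bitmask rather than a modulo,
and goes on to write the canary and the footer):

#include <stdint.h>

#define UNIT 16

/* Choose where the user data starts inside a freshly claimed slot.
 * "base" is the slot's nominal start, "slack" the free room in the
 * slot in UNITs, "ctr" a per-context counter. */
static unsigned char *place_user_data(unsigned char *base, int slack, int ctr)
{
    /* previous offset + 1 if the slot carried a modified IB,
     * otherwise start from the counter */
    int off = (base[-3] ? *(uint16_t *)(base - 2) + 1 : ctr) & 255;
    if (off > slack) off %= slack + 1;  /* crude bound, see lead-in */
    if (off) {
        /* repurpose the IB at the slot's nominal start: reserved=7
         * marks it as a modified IB whose 16-bit field holds the
         * offset to the user data */
        *(uint16_t *)(base - 2) = (uint16_t)off;
        base[-3] = 7 << 5;
        base += UNIT * off;
    }
    return base;  /* the real IB is then written just before this */
}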

      Complex slot:                                    Modified IB:
  +---------------------+                         +----------------------+
  |   modified IB (4B)  |                         | need_big_offset = 0  |
  +---------------------+                         | index:5 = undefined  |
+-------------------------+ <- slot's start       | reserved:3 = 7       |
| +---------------------+ |                       | uint16_t offset_size |
| |   optional offset   | |                       +----------------------+
| +---------------------+ |
| +---------------------+ |
| |       IB (4B)       | |
| +---------------------+ |
| +---------------------+ |
| |                     | |
| |      user data      | |
| |                     | |
| +---------------------+ |
| |     canary (1B)     | |
| +---------------------+ |
| |        slack        | |
| +---------------------+ |
| |   footer size (4B)  | |
| +---------------------+ |
| |     unused (4B)     | |
| +---------------------+ |
+-------------------------+


    For large allocations (≥128KiB):
A group only holds a single slot. It is not tracked and reused the way small
groups are: it is mmapped upon allocation, and munmapped when freed. Since
this kind of group has no maximum size, one may need it aligned to a few MiB
using aligned_alloc, which is beyond what the 16-bit "offset" (and
"offset_size") fields can hold. In that case, "need_big_offset" is set to 1
and an additional 4B is added in front of the IB to hold a 32-bit offset.
Note that this still limits alignment to 64GiB (2^32 x UNIT).
It may look like a problem that there are only 7 bytes of padding to hold
this extra IB, but since there is only one slot, "active_idx" is never used,
so we can afford to overwrite it.

Also note that in most cases there is some slack in large groups as well,
because mmap allocates at page granularity. Internally, musl works with 4KiB
pages and simply treats a larger hardware page as a run of 4KiB pages. As
such, the size of the slot is a multiple of 4KiB, not necessarily a multiple
of the actual page size.
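
For illustration, the number of 4KiB pages such a group occupies (stored as
"maplen" in the meta, described below) can be computed along these lines.
This is a hypothetical helper, not the actual musl code; the real rounding
happens in the allocation path in malloc.c:

#include <stddef.h>

#define UNIT 16
#define IB 4

/* Pages needed for a single-slot (mmapped) group holding n bytes of
 * user data: one UNIT for the group header, IB bytes of in-band
 * data, then round up to whole 4KiB pages. */
static size_t large_group_pages(size_t n)
{
    size_t needed = UNIT + IB + n;
    return (needed + 4095) / 4096;
}

For example, a 1MiB request needs (16 + 4 + 1048576 + 4095) / 4096 = 257
pages, leaving nearly 4KiB of slack at the end of the slot.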

     Large group:
+---------------------+
| meta* meta          | \
| char active_idx:5   |
| char padding[7]     | /
|                     |
|  +---------------+  |
|  | extra IB (4B) |  |
|  +---------------+  | \
|  |               |  |  \
|  |      slot     |  |   } maplen x 4KiB
|  |               |  |  /
|  +---------------+  | /
+---------------------+  <- end at a 4KiB boundary

Other than those few changes, large groups behave like small ones.



    META

Groups only hold the first part of the metadata. The second part lives in the
bound meta object. There is one meta per group, and each holds a pointer to
the other.

        Meta:
+--------------------+
| meta* prev, next   | <- form a circular list
| group* mem         |
| int avail_mask     | \ define 3 states for each slot
| int freed_mask     | /
| size_t last_idx:5  |
| size_t freeable:1  |
| size_t sizeclass:6 | <- size if small group
| size_t maplen:52   | <- size if large group
+--------------------+
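
This corresponds closely to the C definition in meta.h:

#include <stdint.h>

struct meta {
    struct meta *prev, *next;  /* circular list of metas per sizeclass */
    struct group *mem;         /* the bound group */
    volatile int avail_mask, freed_mask;
    uintptr_t last_idx:5;
    uintptr_t freeable:1;
    uintptr_t sizeclass:6;
    uintptr_t maplen:8*sizeof(uintptr_t)-12;  /* 52 bits on 64-bit */
};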

This is where the size of the slots is defined, via the "sizeclass" field. It
indexes a global array of 48 fixed sizes ranging from 1 to 8191 UNITs. These
sizes are multiplied by UNIT to get the actual size in memory, hence the
small/large group threshold around 128KiB. Sizeclasses [48-62] are unused,
and a "sizeclass" of 63 means the group is a large one.
Because "offset" is limited to 16 bits, the number of slots in the largest of
the small groups is limited as well; the index of a group's last slot is
stored in "last_idx".
The "avail_mask" and "freed_mask" are both bitmaps to hold the status of every
slot in the group. The former holds whether the slot is available and never
used, the latter whether the slot has been used but has been freed, or is
currently inactive. When one calls malloc, it'll first attempt to allocate into
an "avail_mask" slot. Failing this, it'll use a freed slot or activate some new
slots. This allocation procedure is used to keep memory pages clean. A group
can easily span several pages. As such, it's better to fill in slots in pages
that have already been dirtied. This is what the "active_idx" field in group is
used for. When a new slot is alloc-ed in a group, all the slots that are in the
same page get activated.
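
A minimal sketch of this order, reusing the struct meta shown above (a
simplification for illustration: the real alloc_slot()/try_avail() logic in
malloc.c operates on the masks atomically and activates whole pages of slots
at a time):

/* Pick a slot index from a meta's bitmasks: clean slots first, then
 * freed ones. Returns -1 when the caller must activate new slots or
 * move on to another meta. */
static int pick_slot(struct meta *m)
{
    if (m->avail_mask) {
        int idx = __builtin_ctz(m->avail_mask);  /* lowest clean slot */
        m->avail_mask &= m->avail_mask - 1;      /* claim it */
        return idx;
    }
    if (m->freed_mask) {
        int idx = __builtin_ctz(m->freed_mask);  /* reuse a freed slot */
        m->freed_mask &= m->freed_mask - 1;
        return idx;
    }
    return -1;
}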
"maplen" is where the size of a large group is stored. It is the number of 4KiB
pages the group spans.
Then there is "freeable", which can be a bit more confusing. While it obviously
specifies whether the group is freeable or not, currently a group is not
freeable only when a new meta is allocated on zeroed memory.
And last, the "prev" and "next" fields. This is so we can make a circular
buffer out of those meta objects. This way, when one requires a slot, malloc
can easily go through the various metas until it finds a satisfactory one. In
practice it doesn't search for the optimal one but just uses some heuristics
like "no slot available, need to activate a new slot ? take the next meta
instead".



    META_AREA

We need some way to hold the memory in which the meta objects themselves are
allocated. That is the purpose of the meta_areas.

     Meta_area:
+--------------------+
| uint64_t check     |
| meta_area* next    | <- form a linked list
| int nslots         |
| meta slots[]       | <- slots to store meta objects
+--------------------+
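
In C terms (close to the definition in meta.h):

#include <stdint.h>

struct meta_area {
    uint64_t check;          /* copy of the context's secret */
    struct meta_area *next;  /* next, larger, link */
    int nslots;              /* meta slots in this link */
    struct meta slots[];
};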

These work as a simple linked list in which each link is bigger than the
previous one. Growing the size exponentially ensures there will only be a
small number of links and mmap calls: if each new link doubles in size, k
mappings are enough for a population of metas that has grown by a factor of
2^k. "check" holds a copy of the context's secret and allows validating that
a supposed meta_area really is one. "nslots" gives the number of slots in
which meta objects can be stored in this link (not to be confused with the
slots of a group). In practice, though, "nslots" currently seems to be
unused, and the slots are tracked through the malloc_context object instead.



    MALLOC_CONTEXT

And finally, the context. This is a global singleton mainly used to manage
the meta objects. Some of its fields have been omitted for brevity.

        Malloc_context:
+------------------------------+
| uint64_t secret              |
| meta* free_meta_head         | <- head of the free meta
| meta* avail_meta             | <- next meta in meta_area
| size_t avail_meta_count      |
| size_t avail_meta_area_count |
| meta_area* head, tail        |
| meta* active[48]             | <- circular list head per sizeclass
| size_t usage_by_class[48]    | <- number of slots per sizeclass
| uint8_t unmap_seq[32]        | \ balance fragmentation/slot reuse
| uint8_t bounce[32]           | /
+------------------------------+

"head" and "tail" allow easy access to the meta_area. Given that meta slots are
not tracked, we only need access to the tail of the list. There is some kind of
memory protection trick with the meta area. When a new meta_area is created,
only the first 4KiB page is made available for the slots. The number of slots
fitting in that page is written down in "avail_meta_count". When
"avail_meta_count" runs out of slots, another 4KiB worth of slots are added and
"avail_meta_area_count" is decremented once. When it reaches 0, a new meta_area
is mmap-ed. The point of doing so is that the pages originally mapped are
initialized with no right to read nor write, and those rights are given back
just as the malloc needs them. "avail_meta" and "free_meta_head" are the two
sources of meta objects. The first is a pointer to the first free slot in the
tail meta_area. The second is a pool of all the metas that have been freed.
This is important because once again, the meta slots are not tracked in the
meta area. If we weren't retaining the freed meta in this list, they would be
lost and it would be a memory leak. This way, we prioritize reusing freed meta
instead of initializing new ones. For easy access, there are a circular buffer
for each sizeclass. Along with those there are counters of how many slots exist
for any given sizeclass: "usage_by_class".
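
A sketch of how those two sources are consulted, using the field names from
the diagram above (a hypothetical simplification of alloc_meta() in malloc.c,
which also handles the page protection and maps new meta_areas on demand):

static struct meta *take_meta(struct malloc_context *ctx)
{
    struct meta *m = ctx->free_meta_head;    /* prefer recycled metas */
    if (m) {
        /* simplified: the real code fully unlinks m from the
         * circular list, fixing up the prev pointers too */
        ctx->free_meta_head = (m->next == m) ? 0 : m->next;
        return m;
    }
    if (ctx->avail_meta_count) {             /* else take a fresh slot */
        ctx->avail_meta_count--;
        return ctx->avail_meta++;
    }
    return 0;  /* must extend the area or mmap a new one (not shown) */
}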
The "bounce" and "unmap_seq" seems to be used for balancing fragmentation and
address reuse.
