[musl] mallocng progress and growth chart

mailing list of musl libc
 help / color / mirror / code / Atom feed

* [musl] mallocng progress and growth chart
@ 2020-05-10 18:09 Rich Felker
  2020-05-11 17:51 ` Rich Felker
  2020-05-16  0:29 ` Rich Felker
  0 siblings, 2 replies; 12+ messages in thread
From: Rich Felker @ 2020-05-10 18:09 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2188 bytes --]

I just pushed a big queued set of changes to mallocng, including some
directions that didn't go well and were mostly reversed but with
improvements to the old. These include:

- Fixes to definitions of some size classes where counts didn't fit in
  a power-of-two sized mmap or containing-group.

- Fixing bad interaction between realloc and coarse size classing
  (always causing malloc-memcpy-free due to size class not matching;
  this was what was making apk and some other things painfully slow).

- Coarse size classing down to class 4 (except 6->7). This tends to
  save memory in low-usage programs.

- Improvements to logic for single-slot and low-count groups to avoid
  situations that waste lots of memory as well as ones that
  unnecessarily do eagar allocation.

- Flipping definition/dependency of aligned_alloc/memalign so that
  legacy nonstd function is not used in definition of std functions.

- New bounce counter with sequence numbers and decay, so that bounces
  with lots of intervening expensive (mmap/munmap) ops don't get
  accounted, and classes in bouncing state (retaining a free group
  unnecessarily to avoid cost of un- and re-mapping) can return to
  non-bouncing state.

I've also hacked up a test program that probes and prints the growth
sequence at each size just below a step in size class or page
boundary. This is intended to verify that the emergent behavior from
the allocation strategy rules in mallocng isn't doing anything
horribly stupid.

For each size, 200 mallocs are performed and if the result is the
start of a new group, the size/type of the allocation is printed. The
form 'NxZ' indicates a normal group of N slots each of slot size Z
(Z-4 usable). The form 'Nk' indicates a single-slot group where the
mapping length (N kB) is less than the nominal slot size; it shows
places where the strategy chose to use individual mmap below the hard
mmap threshold to save space and preserve individual freeability. (And
in principle these allocations should be amenable to mremap for
realloc, but it's not used yet.)

Source (with nasty poking at implementation internals, so build with
-no-builtin or -ffreestanding) is also attached.

Rich

[-- Attachment #2: growth.c --]
[-- Type: text/plain, Size: 1549 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/wait.h>

struct meta {
	struct meta *prev, *next;
	struct group *mem;
	volatile int avail_mask, freed_mask;
	uintptr_t last_idx:5;
	uintptr_t freeable:1;
	uintptr_t sizeclass:6;
	uintptr_t maplen:8*sizeof(uintptr_t)-12;
};

const uint16_t size_classes[] = {
	1, 2, 3, 4, 5, 6, 7, 8,
	9, 10, 12, 15,
	18, 20, 25, 31,
	36, 42, 50, 63,
	72, 84, 102, 127,
	146, 170, 204, 255,
	292, 340, 409, 511,
	584, 682, 818, 1023,
	1169, 1364, 1637, 2047,
	2340, 2730, 3276, 4095,
	4680, 5460, 6552, 8191,
};

void runtest(size_t size)
{
	pid_t pid;
	if ((pid=fork())) {
		waitpid(pid, 0, 0);
		return;
	}
	unsigned char *p, *ptrs[200];
	printf("%5zu: ", size);
	for (int i=0; i<sizeof ptrs / sizeof *ptrs; i++) {
		ptrs[i] = p = malloc(size);
		int idx = p[-3] & 31;
		struct meta *m = *(struct meta **)(p-16-*(uint16_t *)(p-2)*16);
		if (idx==0) {
			if (!m->maplen || m->maplen*4096UL>=size_classes[m->sizeclass]*16+16)
				printf(" %2dx%-5d", m->last_idx+1, size_classes[m->sizeclass]*16);
			else
				printf(" %7luk", m->maplen*4UL);
		} else {
			//printf("[        ]");
		}
	}
	for (int i=0; i<sizeof ptrs / sizeof *ptrs; i++) free(ptrs[i]);
	printf("\n");
	exit(0);
}

int main() {
	if (1) for (int i=0; i<44; i++) {
		size_t size = size_classes[i]*16;
		runtest(size-4);
		size_t pgaligned = (size+4095) & -4096;
		if (pgaligned != size+16 && pgaligned < size_classes[i+1]*16)
			runtest(pgaligned-20);
	}
//	runtest(4076);
//	runtest(4076);
//	runtest(4076);
}

[-- Attachment #3: growth.txt --]
[-- Type: text/plain, Size: 13062 bytes --]

   12:  30x16    30x16    30x16    30x16    30x16    30x16    30x16   
   28:  15x32    15x32    15x32    15x32    15x32    15x32    15x32    15x32    15x32    31x32    31x32    31x32   
   44:  10x48    10x48    10x48    10x48    10x48    10x48    10x48    10x48    20x48    20x48    20x48    20x48    20x48    20x48   
   60:   7x64     7x64     7x64     7x64     7x64     7x64     7x64     7x64     7x64    15x64    15x64    15x64    15x64    15x64    31x64    31x64   
   76:   5x96     5x96     6x80     6x80     6x80     6x80     6x80     6x80     6x80     6x80    12x80    12x80    12x80    12x80    12x80    25x80    25x80    25x80    25x80   
   92:   5x96     5x96     5x96     5x96     5x96     5x96     5x96     5x96    10x96    10x96    10x96    10x96    10x96    21x96    21x96    21x96    21x96    21x96    21x96   
  108:   4x112    4x112    4x112    4x112    4x112    4x112    4x112    4x112    8x112    8x112    8x112    8x112    8x112   18x112   18x112   18x112   18x112   18x112   18x112   18x112   18x112  
  124:   7x128    7x128    7x128    7x128    7x128    7x128    7x128    7x128    7x128   15x128   15x128   15x128   15x128   15x128   31x128   31x128  
  140:   3x160    3x160    3x160    3x160    6x144    6x144    6x144    6x144    6x144    6x144    6x144    6x144    6x144    6x144   14x144   14x144   14x144   14x144   28x144   28x144   28x144  
  156:   3x160    3x160    3x160    3x160    3x160    3x160    3x160    3x160    6x160    6x160    6x160    6x160   12x160   12x160   12x160   12x160   24x160   24x160   24x160   24x160   24x160  
  188:   2x240    2x240    2x240    2x240    2x240    5x192    5x192    5x192    5x192    5x192    5x192    5x192    5x192   10x192   10x192   10x192   10x192   20x192   20x192   20x192   20x192   20x192   20x192  
  236:   2x240    2x240    2x240    2x240    2x240    2x240    2x240    2x240    4x240    4x240    4x240    4x240    8x240    8x240    8x240    8x240   16x240   16x240   16x240   16x240   32x240   32x240   32x240  
  284:   3x320    3x320    3x320    3x320    7x288    7x288    7x288    7x288    7x288    7x288    7x288    7x288   14x288   14x288   14x288   14x288   28x288   28x288   28x288  
  316:   3x320    3x320    3x320    3x320    3x320    3x320    3x320    3x320    6x320    6x320    6x320    6x320   12x320   12x320   12x320   12x320   24x320   24x320   24x320   24x320   24x320  
  396:   2x496    2x496    2x496    2x496    2x496    5x400    5x400    5x400    5x400    5x400    5x400    5x400    5x400   10x400   10x400   10x400   10x400   20x400   20x400   20x400   20x400   20x400   20x400  
  492:   2x496    2x496    2x496    2x496    2x496    2x496    2x496    2x496    4x496    4x496    4x496    4x496    8x496    8x496    8x496    8x496   16x496   16x496   16x496   16x496   32x496   32x496   32x496  
  572:   3x672    3x672    3x672    3x672    7x576    7x576    7x576    7x576    7x576    7x576    7x576    7x576   14x576   14x576   14x576   14x576   28x576   28x576   28x576  
  668:   3x672    3x672    3x672    3x672    3x672    3x672    3x672    3x672    6x672    6x672    6x672    6x672   12x672   12x672   12x672   12x672   24x672   24x672   24x672   24x672   24x672  
  796:   2x1008   2x1008   2x1008   2x1008   2x1008   5x800    5x800    5x800    5x800    5x800    5x800    5x800    5x800   10x800   10x800   10x800   10x800   20x800   20x800   20x800   20x800   20x800   20x800  
 1004:   2x1008   2x1008   2x1008   2x1008   2x1008   2x1008   2x1008   2x1008   4x1008   4x1008   4x1008   4x1008   8x1008   8x1008   8x1008   8x1008  16x1008  16x1008  16x1008  16x1008  32x1008  32x1008  32x1008 
 1148:   3x1344   3x1344   3x1344   3x1344   7x1152   7x1152   7x1152   7x1152   7x1152   7x1152   7x1152   7x1152  14x1152  14x1152  14x1152  14x1152  28x1152  28x1152  28x1152 
 1340:   3x1344   3x1344   3x1344   3x1344   3x1344   3x1344   3x1344   3x1344   6x1344   6x1344   6x1344   6x1344  12x1344  12x1344  12x1344  12x1344  24x1344  24x1344  24x1344  24x1344  24x1344 
 1628:   2x2032   2x2032   2x2032   2x2032   2x2032   5x1632   5x1632   5x1632   5x1632   5x1632   5x1632   5x1632   5x1632  10x1632  10x1632  10x1632  10x1632  20x1632  20x1632  20x1632  20x1632  20x1632  20x1632 
 2028:   2x2032   2x2032   2x2032   2x2032   2x2032   2x2032   2x2032   2x2032   4x2032   4x2032   4x2032   4x2032   8x2032   8x2032   8x2032   8x2032  16x2032  16x2032  16x2032  16x2032  32x2032  32x2032  32x2032 
 2332:   3x2720   3x2720   3x2720   3x2720   5x2336   5x2336   5x2336   5x2336   5x2336   5x2336   7x2336   7x2336   7x2336   7x2336  14x2336  14x2336  14x2336  14x2336  28x2336  28x2336  28x2336 
 2716:   3x2720   3x2720   3x2720   3x2720   3x2720   3x2720   3x2720   3x2720   6x2720   6x2720   6x2720   6x2720  12x2720  12x2720  12x2720  12x2720  24x2720  24x2720  24x2720  24x2720  24x2720 
 3260:   1x4080   1x4080   1x4080   1x4080   1x4080   1x4080   1x4080   1x4080   2x4080   5x3264   5x3264   5x3264   5x3264   5x3264   5x3264   5x3264   5x3264  10x3264  10x3264  10x3264  10x3264  20x3264  20x3264  20x3264  20x3264  20x3264  20x3264 
 4076:   1x4080   1x4080   1x4080   1x4080   1x4080   1x4080   1x4080   1x4080   2x4080   2x4080   2x4080   2x4080   4x4080   4x4080   4x4080   4x4080   8x4080   8x4080   8x4080   8x4080  16x4080  16x4080  16x4080  16x4080  32x4080  32x4080  32x4080 
 4668:   2x5440   2x5440   2x5440   2x5440   2x5440   5x4672   5x4672   5x4672   5x4672   5x4672   5x4672   7x4672   7x4672   7x4672   7x4672  14x4672  14x4672  14x4672  14x4672  28x4672  28x4672  28x4672 
 5436:   2x5440   2x5440   2x5440   2x5440   2x5440   2x5440   3x5440   3x5440   3x5440   3x5440   6x5440   6x5440   6x5440   6x5440  12x5440  12x5440  12x5440  12x5440  24x5440  24x5440  24x5440  24x5440  24x5440 
 6540:   1x8176   1x8176   1x8176   1x8176   1x8176   1x8176   1x8176   1x8176   2x8176   3x6544   3x6544   3x6544   3x6544   3x6544   3x6544   3x6544   5x6544   5x6544   5x6544   5x6544  10x6544  10x6544  10x6544  10x6544  20x6544  20x6544  20x6544  20x6544  20x6544  20x6544 
 8172:   1x8176   1x8176   1x8176   1x8176   1x8176   1x8176   1x8176   1x8176   2x8176   2x8176   2x8176   2x8176   4x8176   4x8176   4x8176   4x8176   8x8176   8x8176   8x8176   8x8176  16x8176  16x8176  16x8176  16x8176  32x8176  32x8176  32x8176 
 9340:   2x10912  2x10912  2x10912  2x10912  2x10912  3x9344   3x9344   3x9344   3x9344   3x9344   3x9344   3x9344   3x9344   3x9344   3x9344   7x9344   7x9344   7x9344   7x9344  14x9344  14x9344  14x9344  14x9344  28x9344  28x9344  28x9344 
10908:   2x10912  2x10912  2x10912  2x10912  2x10912  2x10912  3x10912  3x10912  3x10912  3x10912  6x10912  6x10912  6x10912  6x10912 12x10912 12x10912 12x10912 12x10912 24x10912 24x10912 24x10912 24x10912 24x10912
12268:       12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k      12k 10x13088 10x13088 10x13088 10x13088 20x13088 20x13088 20x13088 20x13088 20x13088 20x13088
13084:   1x16368  1x16368  1x16368  1x16368  1x16368  1x16368  1x16368  1x16368  2x16368  1x13088  1x13088  1x13088  1x13088  1x13088  1x13088  3x13088  3x13088  3x13088  3x13088  3x13088  5x13088  5x13088  5x13088  5x13088 10x13088 10x13088 10x13088 10x13088 20x13088 20x13088 20x13088 20x13088 20x13088 20x13088
16364:   1x16368  1x16368  1x16368  1x16368  1x16368  1x16368  1x16368  1x16368  2x16368  2x16368  2x16368  2x16368  4x16368  4x16368  4x16368  4x16368  8x16368  8x16368  8x16368  8x16368 16x16368 16x16368 16x16368 16x16368 32x16368 32x16368 32x16368
18700:       20k      20k      20k      20k      20k      20k      20k      20k      20k      20k  1x18704  1x18704  1x18704  1x18704  1x18704  1x18704  3x18704  3x18704  3x18704  3x18704  3x18704  3x18704  3x18704  3x18704  7x18704  7x18704  7x18704  7x18704 14x18704 14x18704 14x18704 14x18704 28x18704 28x18704 28x18704
20460:       20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k      20k 12x21824 12x21824 12x21824 12x21824 24x21824 24x21824 24x21824 24x21824 24x21824
21820:   1x21824  1x21824  1x21824  1x21824  2x21824  2x21824  2x21824  2x21824  3x21824  3x21824  3x21824  3x21824  6x21824  6x21824  6x21824  6x21824 12x21824 12x21824 12x21824 12x21824 24x21824 24x21824 24x21824 24x21824 24x21824
24556:       24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k      24k 10x26192 10x26192 10x26192 10x26192 20x26192 20x26192 20x26192 20x26192 20x26192 20x26192
26188:       28k      28k      28k      28k      28k      28k      28k      28k      28k      28k  1x26192  1x26192  1x26192  1x26192  1x26192  1x26192  3x26192  3x26192  3x26192  3x26192  3x26192  5x26192  5x26192  5x26192  5x26192 10x26192 10x26192 10x26192 10x26192 20x26192 20x26192 20x26192 20x26192 20x26192 20x26192
28652:       28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k      28k  8x32752  8x32752  8x32752  8x32752 16x32752 16x32752 16x32752 16x32752 32x32752 32x32752 32x32752
32748:   1x32752  1x32752  1x32752  1x32752  1x32752  1x32752  1x32752  1x32752  2x32752  2x32752  2x32752  2x32752  4x32752  4x32752  4x32752  4x32752  8x32752  8x32752  8x32752  8x32752 16x32752 16x32752 16x32752 16x32752 32x32752 32x32752 32x32752
37436:       40k      40k      40k      40k      40k      40k      40k      40k      40k      40k  1x37440  1x37440  1x37440  1x37440  1x37440  1x37440  3x37440  3x37440  3x37440  3x37440  3x37440  3x37440  3x37440  3x37440  7x37440  7x37440  7x37440  7x37440 14x37440 14x37440 14x37440 14x37440 28x37440 28x37440 28x37440
40940:       40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k      40k 12x43680 12x43680 12x43680 12x43680 24x43680 24x43680 24x43680 24x43680 24x43680
43676:   1x43680  1x43680  1x43680  1x43680  2x43680  2x43680  2x43680  2x43680  3x43680  3x43680  3x43680  3x43680  6x43680  6x43680  6x43680  6x43680 12x43680 12x43680 12x43680 12x43680 24x43680 24x43680 24x43680 24x43680 24x43680
45036:       44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k      44k 10x52416 10x52416 10x52416 10x52416 20x52416 20x52416 20x52416 20x52416 20x52416 20x52416
52412:       52k      52k      52k      52k      52k      52k      52k      52k      52k      52k  1x52416  1x52416  1x52416  1x52416  1x52416  1x52416  3x52416  3x52416  3x52416  3x52416  3x52416  5x52416  5x52416  5x52416  5x52416 10x52416 10x52416 10x52416 10x52416 20x52416 20x52416 20x52416 20x52416 20x52416 20x52416
53228:       52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k      52k  8x65520  8x65520  8x65520  8x65520 16x65520 16x65520 16x65520 16x65520 16x65520 16x65520 16x65520 16x65520 16x65520
65516:   1x65520  1x65520  1x65520  1x65520  1x65520  1x65520  1x65520  1x65520  2x65520  2x65520  2x65520  2x65520  4x65520  4x65520  4x65520  4x65520  8x65520  8x65520  8x65520  8x65520 16x65520 16x65520 16x65520 16x65520 16x65520 16x65520 16x65520 16x65520 16x65520

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-05-10 18:09 [musl] mallocng progress and growth chart Rich Felker
@ 2020-05-11 17:51 ` Rich Felker
  2020-05-16  0:29 ` Rich Felker
  1 sibling, 0 replies; 12+ messages in thread
From: Rich Felker @ 2020-05-11 17:51 UTC (permalink / raw)
  To: musl

On Sun, May 10, 2020 at 02:09:34PM -0400, Rich Felker wrote:
> I just pushed a big queued set of changes to mallocng, including some
> directions that didn't go well and were mostly reversed but with
> improvements to the old. These include:
> 
> - Fixes to definitions of some size classes where counts didn't fit in
>   a power-of-two sized mmap or containing-group.
> 
> - Fixing bad interaction between realloc and coarse size classing
>   (always causing malloc-memcpy-free due to size class not matching;
>   this was what was making apk and some other things painfully slow).
> 
> - Coarse size classing down to class 4 (except 6->7). This tends to
>   save memory in low-usage programs.
> 
> - Improvements to logic for single-slot and low-count groups to avoid
>   situations that waste lots of memory as well as ones that
>   unnecessarily do eagar allocation.
> 
> - Flipping definition/dependency of aligned_alloc/memalign so that
>   legacy nonstd function is not used in definition of std functions.
> 
> - New bounce counter with sequence numbers and decay, so that bounces
>   with lots of intervening expensive (mmap/munmap) ops don't get
>   accounted, and classes in bouncing state (retaining a free group
>   unnecessarily to avoid cost of un- and re-mapping) can return to
>   non-bouncing state.

I left off one fairly important one:

- Hardening enhancement: clearing the pointer back to the group's meta
  struct before freeing the group, so that the next time the memory is
  handed out in an allocation the uninitialized memory doesn't contain
  pointers to malloc internal structures. For example:

	free(malloc(42));
	p=malloc(492);

  would include a pointer to the freed meta structure for the nested
  group, at offset 0 or 16 if I'm not mistaken. This could allow
  uninitialized memory leak plus a second difficult-to-exploit bug to
  be transformed into an easier-to-exploit bug.

This also suggested to me that we could clear the pointer back to meta
whenever a group is entirely free slots but not itself freed, and
restore it only when it's subsequently used. This might have minor
hardening benefits too, but it highlights that we could use MADV_FREE
on such groups, which might be valuable at large sizes. OTOH MADV_FREE
has some anti-hardening properties (potentially loses offset cycling
state under attacker-controlled load, allowing the attacker to obtain
more predictable addresses) that make it less attractive to use here.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-05-10 18:09 [musl] mallocng progress and growth chart Rich Felker
  2020-05-11 17:51 ` Rich Felker
@ 2020-05-16  0:29 ` Rich Felker
  2020-05-16  3:29   ` Rich Felker
  2020-05-18 18:35   ` Rich Felker
  1 sibling, 2 replies; 12+ messages in thread
From: Rich Felker @ 2020-05-16  0:29 UTC (permalink / raw)
  To: musl

On Sun, May 10, 2020 at 02:09:34PM -0400, Rich Felker wrote:
>  4668:   2x5440   2x5440   2x5440   2x5440   2x5440   5x4672   5x4672   5x4672   5x4672   5x4672   5x4672   7x4672   ...

This turns out to be just about the worst edge case we have, and in a
sense one that's fundamental. Sadly there are a number of
applications, including bash, that do a lot of malloc(4096). The ones
that just allocate and don't have any complex malloc/free patterns
will see somewhat higher usage with mallocng, and I don't think
there's any way around that. (Note: oldmalloc also has problems here
under certain patterns of alloc/free, due to bin_index vs bin_index_up
discrepancy!)

I have some changes I'm about to push that help this somewhat. The
2x5440 count-reduction (this is 3x with proper-fit count) is overly
costly at this size, and imposes a 12.5% waste on top of the slack
from coarse size classing and the base slack from mapping 4096 into a
4672 size class. Getting rid of it, and accounting for existing coarse
size class usage when doing the 7->5 reduction, produce:

 4668:   3x5440   3x5440   3x5440   3x5440   5x4672   7x4672   ...

which seems like about the best we can do. The initial allocation of
3x rather than 2x only uses one additional page to get an additional
slot that can be used before needing to mmap again, which is a big win
(essentially that third slot doesn't have any overhead) except in the
case where it's never used, and only a small loss (1 page) even then.

The same thing happens at the next doubling for malloc(8192), and the
same mitigation applies. However with that:

 9340:   3x10912  3x10912  3x10912  3x10912  3x9344   7x9344   ...

the coarse size classing is dubious because the size is sufficiently
large that a 7->3 count reduction can be used, with the same count the
coarsse class would have, but with a 28k rather than 32k mapping.

Unfortunately the decision here depends on knowing page size, which
isn't constant at the point where it needs to be made. For integration
with musl, page size is initially known even if it's variable, so we
could possibly make a decision not to use coarse sizing based on that
here, but standalone mallocng lacks that knowledge (page size isn't
known until after first alloc_meta). This might could be reworked.
There's a fairly small range of sizes that would benefit (larger ones
are better off with individual mmap because page size quickly becomes
"finer than" size classes), but the benefit seems fairly significant
(not wasting an extra 1.3k each for the first 12 malloc(8192)'s) at
the sizes where it helps.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-05-16  0:29 ` Rich Felker
@ 2020-05-16  3:29   ` Rich Felker
  2020-05-17  3:30     ` Rich Felker
  2020-05-18 18:35   ` Rich Felker
  1 sibling, 1 reply; 12+ messages in thread
From: Rich Felker @ 2020-05-16  3:29 UTC (permalink / raw)
  To: musl

On Fri, May 15, 2020 at 08:29:13PM -0400, Rich Felker wrote:
> The same thing happens at the next doubling for malloc(8192), and the
> same mitigation applies. However with that:
> 
>  9340:   3x10912  3x10912  3x10912  3x10912  3x9344   7x9344   ...
> 
> the coarse size classing is dubious because the size is sufficiently
> large that a 7->3 count reduction can be used, with the same count the
> coarsse class would have, but with a 28k rather than 32k mapping.
> 
> Unfortunately the decision here depends on knowing page size, which
> isn't constant at the point where it needs to be made. For integration
> with musl, page size is initially known even if it's variable, so we
> could possibly make a decision not to use coarse sizing based on that
> here, but standalone mallocng lacks that knowledge (page size isn't
> known until after first alloc_meta). This might could be reworked.
> There's a fairly small range of sizes that would benefit (larger ones
> are better off with individual mmap because page size quickly becomes
> "finer than" size classes), but the benefit seems fairly significant
> (not wasting an extra 1.3k each for the first 12 malloc(8192)'s) at
> the sizes where it helps.

It might just make sense to always disable coarse size classing
starting at this size. The absolute amount of over-allocation is
sufficiently high that it's probably not justified. On archs with
larger page size, more pages may be mapped (e.g. 7x9344 can't be
reduced to 3x if page size is over 4k, and can't be reduced to 5x if
page size is over 16k) but having too much memory mapped is generally
an expected consequence of ridiculous page sizes.

Another possibly horrible idea for dealing with exact page sizes at
low usage: pair groups of short length. Rather than needing to choose
between coarse classing 3x5440 (16k) or a 5x4672 (5 whole slots, 24k)
for a malloc(4095), at low useage we could create a pseudo-2x4672
where the second slot is truncated and contains a nested 1x3264 that's
only freeable together with the outer group. This relation always
works: if size class k is just under a power of two (k == 3 mod 4),
classes k+1 and k-1 add up to just under 2 times class k. (This
follows from n/5 + 2n/7 == (7n+10n)/35 == 17n/35 <= n/2 == 2*n/4,
where n/5, 2n/7, and n/4 are the sizes of class k-1, k+1, and k,
respectively.)

This gives a strategy that always works for allocating very-low-count
of arbitrary size classes, as long as we're willing to allocate a slot
for the complementary size at the same time. And in some ways it's
nicer than coarse classing -- rather than overallocating the requested
slot in hopes that the grouped slots will be useful for larger
allocations too, it allocates a smaller complementary pair in hopes
that the complementary size will be useful.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-05-16  3:29   ` Rich Felker
@ 2020-05-17  3:30     ` Rich Felker
  2020-05-18 18:53       ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Rich Felker @ 2020-05-17  3:30 UTC (permalink / raw)
  To: musl

On Fri, May 15, 2020 at 11:29:01PM -0400, Rich Felker wrote:
> On Fri, May 15, 2020 at 08:29:13PM -0400, Rich Felker wrote:
> > The same thing happens at the next doubling for malloc(8192), and the
> > same mitigation applies. However with that:
> > 
> >  9340:   3x10912  3x10912  3x10912  3x10912  3x9344   7x9344   ...
> > 
> > the coarse size classing is dubious because the size is sufficiently
> > large that a 7->3 count reduction can be used, with the same count the
> > coarsse class would have, but with a 28k rather than 32k mapping.
> > 
> > Unfortunately the decision here depends on knowing page size, which
> > isn't constant at the point where it needs to be made. For integration
> > with musl, page size is initially known even if it's variable, so we
> > could possibly make a decision not to use coarse sizing based on that
> > here, but standalone mallocng lacks that knowledge (page size isn't
> > known until after first alloc_meta). This might could be reworked.
> > There's a fairly small range of sizes that would benefit (larger ones
> > are better off with individual mmap because page size quickly becomes
> > "finer than" size classes), but the benefit seems fairly significant
> > (not wasting an extra 1.3k each for the first 12 malloc(8192)'s) at
> > the sizes where it helps.
> 
> It might just make sense to always disable coarse size classing
> starting at this size. The absolute amount of over-allocation is
> sufficiently high that it's probably not justified. On archs with
> larger page size, more pages may be mapped (e.g. 7x9344 can't be
> reduced to 3x if page size is over 4k, and can't be reduced to 5x if
> page size is over 16k) but having too much memory mapped is generally
> an expected consequence of ridiculous page sizes.
> 
> Another possibly horrible idea for dealing with exact page sizes at
> low usage: pair groups of short length. Rather than needing to choose
> between coarse classing 3x5440 (16k) or a 5x4672 (5 whole slots, 24k)
> for a malloc(4095), at low useage we could create a pseudo-2x4672
> where the second slot is truncated and contains a nested 1x3264 that's
> only freeable together with the outer group. This relation always
> works: if size class k is just under a power of two (k == 3 mod 4),
> classes k+1 and k-1 add up to just under 2 times class k. (This
> follows from n/5 + 2n/7 == (7n+10n)/35 == 17n/35 <= n/2 == 2*n/4,
> where n/5, 2n/7, and n/4 are the sizes of class k-1, k+1, and k,
> respectively.)
> 
> This gives a strategy that always works for allocating very-low-count
> of arbitrary size classes, as long as we're willing to allocate a slot
> for the complementary size at the same time. And in some ways it's
> nicer than coarse classing -- rather than overallocating the requested
> slot in hopes that the grouped slots will be useful for larger
> allocations too, it allocates a smaller complementary pair in hopes
> that the complementary size will be useful.

Another alternative for avoiding eagar commit at low usage, which
works for all but nommu: when adding groups with nontrivial slot count
at low usage, don't activate all the slots right away. Reserve vm
space for 7 slots for a 7x4672, but only unprotect the first 2 pages,
and treat it as a group of just 1 slot until there are no slots free
and one is needed. Then, unprotect another page (or more if needed to
fit another slot, as would be needed at larger sizes) and adjust the
slot count to match. (Conceptually; implementation-wise, the slot
count would be fixed, and there would just be a limit on the number of
slots made avilable when transformed from "freed" to "available" for
activation.)

Note that this is what happens anyway with physical memory as clean
anonymous pages are first touched, but (1) doing it without explicit
unprotect over-counts the not-yet-used slots for commit charge
purposes and breaks tightly-memory-constrained environments (global
commit limit or cgroup) and (2) when all slots are initially available
as they are now, repeated free/malloc cycles for the same size will
round-robin all the slots, touching them all.

Here, property (2) is of course desirable for hardening at moderate to
high usage, but at low usage UAF tends to be less of a concern
(because you don't have complex data structures with complex lifetimes
if you hardly have any malloc).

Note also that (2) could be solved without addressing (1) just by
skipping the protection aspect of this idea and only using the
available-slot-limiting part.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-05-16  0:29 ` Rich Felker
  2020-05-16  3:29   ` Rich Felker
@ 2020-05-18 18:35   ` Rich Felker
  1 sibling, 0 replies; 12+ messages in thread
From: Rich Felker @ 2020-05-18 18:35 UTC (permalink / raw)
  To: musl

On Fri, May 15, 2020 at 08:29:13PM -0400, Rich Felker wrote:
> On Sun, May 10, 2020 at 02:09:34PM -0400, Rich Felker wrote:
> >  4668:   2x5440   2x5440   2x5440   2x5440   2x5440   5x4672   5x4672   5x4672   5x4672   5x4672   5x4672   7x4672   ...
> 
> This turns out to be just about the worst edge case we have, and in a
> sense one that's fundamental. Sadly there are a number of
> applications, including bash, that do a lot of malloc(4096). The ones
> that just allocate and don't have any complex malloc/free patterns
> will see somewhat higher usage with mallocng, and I don't think
> there's any way around that. (Note: oldmalloc also has problems here
> under certain patterns of alloc/free, due to bin_index vs bin_index_up
> discrepancy!)

Actually oldmalloc "cheats" for the exact case mentioned here, 4096,
because it's an exact size class boundary where
bin_index(4096)==bin_index_up(4096). For any other size (e.g. 4080 or
4097) the catastrophic non-reusable-chunk thing will happen.

This actually suggests that, if trying to improve/salvage a dlmalloc
type design, it would be useful to round up to size classes before
allocating, and to avoid trimming down to exact size. Of course this
would use considerably more memory initially (just like mallocng uses
more than oldmalloc) for straight bump allocation with no free, but it
should give less fragmentation and more-stable usage over time. Of
course there might be new failure modes too when doing that,
especially since split/merge won't preserve the property of being
exact size classes.

An interesting experiment would be to hack up oldmalloc to round up to
size classes, then compare memory usage with mallocng. I suspect we'd
then find mallocng almost always uses less memory.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-05-17  3:30     ` Rich Felker
@ 2020-05-18 18:53       ` Rich Felker
  2020-05-25 15:45         ` Pirmin Walthert
  0 siblings, 1 reply; 12+ messages in thread
From: Rich Felker @ 2020-05-18 18:53 UTC (permalink / raw)
  To: musl

On Sat, May 16, 2020 at 11:30:25PM -0400, Rich Felker wrote:
> Another alternative for avoiding eagar commit at low usage, which
> works for all but nommu: when adding groups with nontrivial slot count
> at low usage, don't activate all the slots right away. Reserve vm
> space for 7 slots for a 7x4672, but only unprotect the first 2 pages,
> and treat it as a group of just 1 slot until there are no slots free
> and one is needed. Then, unprotect another page (or more if needed to
> fit another slot, as would be needed at larger sizes) and adjust the
> slot count to match. (Conceptually; implementation-wise, the slot
> count would be fixed, and there would just be a limit on the number of
> slots made avilable when transformed from "freed" to "available" for
> activation.)
> 
> Note that this is what happens anyway with physical memory as clean
> anonymous pages are first touched, but (1) doing it without explicit
> unprotect over-counts the not-yet-used slots for commit charge
> purposes and breaks tightly-memory-constrained environments (global
> commit limit or cgroup) and (2) when all slots are initially available
> as they are now, repeated free/malloc cycles for the same size will
> round-robin all the slots, touching them all.
> 
> Here, property (2) is of course desirable for hardening at moderate to
> high usage, but at low usage UAF tends to be less of a concern
> (because you don't have complex data structures with complex lifetimes
> if you hardly have any malloc).
> 
> Note also that (2) could be solved without addressing (1) just by
> skipping the protection aspect of this idea and only using the
> available-slot-limiting part.

One abstract way of thinking about the above is that it's just a
per-size-class bump allocator, pre-reserving enough virtual address
space to end sufficiently close to a page boundary that there's no
significant memory waste. This is actually fairly elegant, and might
obsolete some of the other measures taken to avoid overly eagar
allocation. So this might be a worthwhile direction to pursue.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-05-18 18:53       ` Rich Felker
@ 2020-05-25 15:45         ` Pirmin Walthert
  2020-05-25 17:54           ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Pirmin Walthert @ 2020-05-25 15:45 UTC (permalink / raw)
  To: musl

Am 18.05.20 um 20:53 schrieb Rich Felker:

> On Sat, May 16, 2020 at 11:30:25PM -0400, Rich Felker wrote:
>> Another alternative for avoiding eagar commit at low usage, which
>> works for all but nommu: when adding groups with nontrivial slot count
>> at low usage, don't activate all the slots right away. Reserve vm
>> space for 7 slots for a 7x4672, but only unprotect the first 2 pages,
>> and treat it as a group of just 1 slot until there are no slots free
>> and one is needed. Then, unprotect another page (or more if needed to
>> fit another slot, as would be needed at larger sizes) and adjust the
>> slot count to match. (Conceptually; implementation-wise, the slot
>> count would be fixed, and there would just be a limit on the number of
>> slots made avilable when transformed from "freed" to "available" for
>> activation.)
>>
>> Note that this is what happens anyway with physical memory as clean
>> anonymous pages are first touched, but (1) doing it without explicit
>> unprotect over-counts the not-yet-used slots for commit charge
>> purposes and breaks tightly-memory-constrained environments (global
>> commit limit or cgroup) and (2) when all slots are initially available
>> as they are now, repeated free/malloc cycles for the same size will
>> round-robin all the slots, touching them all.
>>
>> Here, property (2) is of course desirable for hardening at moderate to
>> high usage, but at low usage UAF tends to be less of a concern
>> (because you don't have complex data structures with complex lifetimes
>> if you hardly have any malloc).
>> c
>> Note also that (2) could be solved without addressing (1) just by
>> skipping the protection aspect of this idea and only using the
>> available-slot-limiting part.
> One abstract way of thinking about the above is that it's just a
> per-size-class bump allocator, pre-reserving enough virtual address
> space to end sufficiently close to a page boundary that there's no
> significant memory waste. This is actually fairly elegant, and might
> obsolete some of the other measures taken to avoid overly eagar
> allocation. So this might be a worthwhile direction to pursue.

Dear Rich,

Currently we use mallocng in production for most applications in our 
"embedded like" virtualised system setups, it even helped to find some 
bugs (for example in asterisk) as mallocng was less forgiving than the 
old malloc implementation. So if you're interested in real world 
feedback: everything seems to be running quite smoothly so far, thanks 
for this great work.

Currently we use the git version of April 24th, so the version before 
you merged the huge optimization changes. As you mentioned in your 
"brainstorming mails", if I got them right, that you might rethink a few 
of these changes, I'd like to ask: do you think it would be better to 
use the current git-master version rather than the version of April 24th 
(we are not THAT memory constrained, so stability is the most important 
thing) or do you think it would be better to stick on the old version 
and wait for the next changes to be merged?

Best regards,

Pirmin


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-05-25 15:45         ` Pirmin Walthert
@ 2020-05-25 17:54           ` Rich Felker
  2020-05-25 18:13             ` Pirmin Walthert
  0 siblings, 1 reply; 12+ messages in thread
From: Rich Felker @ 2020-05-25 17:54 UTC (permalink / raw)
  To: Pirmin Walthert; +Cc: musl

On Mon, May 25, 2020 at 05:45:33PM +0200, Pirmin Walthert wrote:
> Am 18.05.20 um 20:53 schrieb Rich Felker:
> 
> >On Sat, May 16, 2020 at 11:30:25PM -0400, Rich Felker wrote:
> >>Another alternative for avoiding eagar commit at low usage, which
> >>works for all but nommu: when adding groups with nontrivial slot count
> >>at low usage, don't activate all the slots right away. Reserve vm
> >>space for 7 slots for a 7x4672, but only unprotect the first 2 pages,
> >>and treat it as a group of just 1 slot until there are no slots free
> >>and one is needed. Then, unprotect another page (or more if needed to
> >>fit another slot, as would be needed at larger sizes) and adjust the
> >>slot count to match. (Conceptually; implementation-wise, the slot
> >>count would be fixed, and there would just be a limit on the number of
> >>slots made avilable when transformed from "freed" to "available" for
> >>activation.)
> >>
> >>Note that this is what happens anyway with physical memory as clean
> >>anonymous pages are first touched, but (1) doing it without explicit
> >>unprotect over-counts the not-yet-used slots for commit charge
> >>purposes and breaks tightly-memory-constrained environments (global
> >>commit limit or cgroup) and (2) when all slots are initially available
> >>as they are now, repeated free/malloc cycles for the same size will
> >>round-robin all the slots, touching them all.
> >>
> >>Here, property (2) is of course desirable for hardening at moderate to
> >>high usage, but at low usage UAF tends to be less of a concern
> >>(because you don't have complex data structures with complex lifetimes
> >>if you hardly have any malloc).
> >>c
> >>Note also that (2) could be solved without addressing (1) just by
> >>skipping the protection aspect of this idea and only using the
> >>available-slot-limiting part.
> >One abstract way of thinking about the above is that it's just a
> >per-size-class bump allocator, pre-reserving enough virtual address
> >space to end sufficiently close to a page boundary that there's no
> >significant memory waste. This is actually fairly elegant, and might
> >obsolete some of the other measures taken to avoid overly eagar
> >allocation. So this might be a worthwhile direction to pursue.
> 
> Dear Rich,
> 
> Currently we use mallocng in production for most applications in our
> "embedded like" virtualised system setups, it even helped to find
> some bugs (for example in asterisk) as mallocng was less forgiving
> than the old malloc implementation. So if you're interested in real
> world feedback: everything seems to be running quite smoothly so
> far, thanks for this great work.
> 
> Currently we use the git version of April 24th, so the version
> before you merged the huge optimization changes. As you mentioned in
> your "brainstorming mails", if I got them right, that you might
> rethink a few of these changes, I'd like to ask: do you think it
> would be better to use the current git-master version rather than
> the version of April 24th (we are not THAT memory constrained, so
> stability is the most important thing) or do you think it would be
> better to stick on the old version and wait for the next changes to
> be merged?

Thanks for the feedback!

Which are the "huge optimization changes" you're wondering about?
Indeed there's a large series of commits after the version you're
using but I think you're possibly misattributing them.

A number of the commits are bug fixes -- mostly not for hard bugs, but
for unwanted and unintended behaviors:

a709dde fix unexpected allocation of 7x144 group in non-power-of-two slot
dda5a88 fix exact size tracking in memalign
915a914 adjust several size classes to fix nested groups of non-power-of-2 size
7acd61e allow in-place realloc when ideal size class is off-by-one
caca917 add support for aligned_alloc alignments 1M and over

There were also quite a few around an idea that didn't go well and was
mostly reverted, but with major improvements to the original behavior:

5bff93c overhaul bounce counter to work with map sizes instead of size classes
71262cd tune bounce counter to avoid triggering early
9601aaa prevent overflow of unmap counter part of bounce counter
aca1f32 don't let the mmap cache limit grow unboundedly or overflow
6fbee31 second partial overhaul of bounce counter system
150de6e revert from map cache to old okay_to_free scheme, but improved
1e972da initial conversion of bounce counting to use sequence numbers, decay
e3eecb1 factor bounce/sequence counter logic into meta.h
6693738 account seq for individually-mmapped allocations above hard threshold
4443f64 fix complete regression (malloc always fails) on variable-pagesize archs

If you don't care about low usage, that whole change series is fairly
unimportant, but should be harmless. It just changes decisions about
choices where either choice produces as valid state for the allocator
but there are tradeoffs between memory usage and performance. The new
behavior should be better, though.

A few commits were reordering the dependency between memalign and the
standard memalign-variant functions, which is a minor namespace
detail:

da4c88e rename aligned_alloc.c
04407f7 reverse dependency order of memalign and aligned_alloc
74e6657 rename aligned_alloc source file back to its proper name
c990cb1 rename memalign source file back to its proper name

A couple were hardening:

5bf4e92 clear group header pointer to meta when freeing groups
bd04c75 in get_meta, check offset against maplen (minor hardening)
77cea57 add support for allocating meta areas via legacy brk

And pretty much all the rest of the changes are tuning behavior for
"optimization" of some sort or another, which may be what you were
referring to:

26143c4 limit slot count growth to 25% instead of 50% of current usage in class
a9187f0 remove unnecessary optimization tuning flags from Makefile CFLAGS
045cc04 move coarse size classing logic to malloc fast path
8348a82 eliminate med_twos_tab
e619034 allow slot count 1 for size classes 3 mod 4 as natural/first-class
c9d54f4 activate coarse size classing for small classes down to 4 (but not 6)
44092d8 improve individual-mmap decision
d355eaf remove slot count reduction to 1 for size classes 1 mod 3
c555ebe fix off-by-one in logic to use single-slot groups
9d5ec34 switch from MADV_DONTNEED to MADV_FREE for large free slots
584c7aa avoid over-use of reduced-count groups due to coarse size classing
f9bfb0a increase threshold for 3->2 slot reduction to 16 pages
20da09e disable coarse size classing for large classes (over 8k)

I don't think any of these changes are potentially obsoleted by
further ideas in the above thread. I am working on delaying activation
of slots until they're actually needed, so that we don't dirty pages
we could avoid touching, but I proposed this as an alternative to
other more complex tricks that I didn't really like, which have not
been implemented and probably won't be now.

So, in summary, I don't see any good reason not to go with latest.

Rich

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-05-25 17:54           ` Rich Felker
@ 2020-05-25 18:13             ` Pirmin Walthert
  2020-06-04  7:04               ` Pirmin Walthert
  0 siblings, 1 reply; 12+ messages in thread
From: Pirmin Walthert @ 2020-05-25 18:13 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

Am 25.05.20 um 19:54 schrieb Rich Felker:

> On Mon, May 25, 2020 at 05:45:33PM +0200, Pirmin Walthert wrote:
>> Am 18.05.20 um 20:53 schrieb Rich Felker:
>>
>>> On Sat, May 16, 2020 at 11:30:25PM -0400, Rich Felker wrote:
>>>> Another alternative for avoiding eagar commit at low usage, which
>>>> works for all but nommu: when adding groups with nontrivial slot count
>>>> at low usage, don't activate all the slots right away. Reserve vm
>>>> space for 7 slots for a 7x4672, but only unprotect the first 2 pages,
>>>> and treat it as a group of just 1 slot until there are no slots free
>>>> and one is needed. Then, unprotect another page (or more if needed to
>>>> fit another slot, as would be needed at larger sizes) and adjust the
>>>> slot count to match. (Conceptually; implementation-wise, the slot
>>>> count would be fixed, and there would just be a limit on the number of
>>>> slots made avilable when transformed from "freed" to "available" for
>>>> activation.)
>>>>
>>>> Note that this is what happens anyway with physical memory as clean
>>>> anonymous pages are first touched, but (1) doing it without explicit
>>>> unprotect over-counts the not-yet-used slots for commit charge
>>>> purposes and breaks tightly-memory-constrained environments (global
>>>> commit limit or cgroup) and (2) when all slots are initially available
>>>> as they are now, repeated free/malloc cycles for the same size will
>>>> round-robin all the slots, touching them all.
>>>>
>>>> Here, property (2) is of course desirable for hardening at moderate to
>>>> high usage, but at low usage UAF tends to be less of a concern
>>>> (because you don't have complex data structures with complex lifetimes
>>>> if you hardly have any malloc).
>>>> c
>>>> Note also that (2) could be solved without addressing (1) just by
>>>> skipping the protection aspect of this idea and only using the
>>>> available-slot-limiting part.
>>> One abstract way of thinking about the above is that it's just a
>>> per-size-class bump allocator, pre-reserving enough virtual address
>>> space to end sufficiently close to a page boundary that there's no
>>> significant memory waste. This is actually fairly elegant, and might
>>> obsolete some of the other measures taken to avoid overly eagar
>>> allocation. So this might be a worthwhile direction to pursue.
>> Dear Rich,
>>
>> Currently we use mallocng in production for most applications in our
>> "embedded like" virtualised system setups, it even helped to find
>> some bugs (for example in asterisk) as mallocng was less forgiving
>> than the old malloc implementation. So if you're interested in real
>> world feedback: everything seems to be running quite smoothly so
>> far, thanks for this great work.
>>
>> Currently we use the git version of April 24th, so the version
>> before you merged the huge optimization changes. As you mentioned in
>> your "brainstorming mails", if I got them right, that you might
>> rethink a few of these changes, I'd like to ask: do you think it
>> would be better to use the current git-master version rather than
>> the version of April 24th (we are not THAT memory constrained, so
>> stability is the most important thing) or do you think it would be
>> better to stick on the old version and wait for the next changes to
>> be merged?
> Thanks for the feedback!
>
> Which are the "huge optimization changes" you're wondering about?
> Indeed there's a large series of commits after the version you're
> using but I think you're possibly misattributing them.
>
> A number of the commits are bug fixes -- mostly not for hard bugs, but
> for unwanted and unintended behaviors:
>
> a709dde fix unexpected allocation of 7x144 group in non-power-of-two slot
> dda5a88 fix exact size tracking in memalign
> 915a914 adjust several size classes to fix nested groups of non-power-of-2 size
> 7acd61e allow in-place realloc when ideal size class is off-by-one
> caca917 add support for aligned_alloc alignments 1M and over
>
> There were also quite a few around an idea that didn't go well and was
> mostly reverted, but with major improvements to the original behavior:
>
> 5bff93c overhaul bounce counter to work with map sizes instead of size classes
> 71262cd tune bounce counter to avoid triggering early
> 9601aaa prevent overflow of unmap counter part of bounce counter
> aca1f32 don't let the mmap cache limit grow unboundedly or overflow
> 6fbee31 second partial overhaul of bounce counter system
> 150de6e revert from map cache to old okay_to_free scheme, but improved
> 1e972da initial conversion of bounce counting to use sequence numbers, decay
> e3eecb1 factor bounce/sequence counter logic into meta.h
> 6693738 account seq for individually-mmapped allocations above hard threshold
> 4443f64 fix complete regression (malloc always fails) on variable-pagesize archs
>
> If you don't care about low usage, that whole change series is fairly
> unimportant, but should be harmless. It just changes decisions about
> choices where either choice produces as valid state for the allocator
> but there are tradeoffs between memory usage and performance. The new
> behavior should be better, though.
>
> A few commits were reordering the dependency between memalign and the
> standard memalign-variant functions, which is a minor namespace
> detail:
>
> da4c88e rename aligned_alloc.c
> 04407f7 reverse dependency order of memalign and aligned_alloc
> 74e6657 rename aligned_alloc source file back to its proper name
> c990cb1 rename memalign source file back to its proper name
>
> A couple were hardening:
>
> 5bf4e92 clear group header pointer to meta when freeing groups
> bd04c75 in get_meta, check offset against maplen (minor hardening)
> 77cea57 add support for allocating meta areas via legacy brk
>
> And pretty much all the rest of the changes are tuning behavior for
> "optimization" of some sort or another, which may be what you were
> referring to:
>
> 26143c4 limit slot count growth to 25% instead of 50% of current usage in class
> a9187f0 remove unnecessary optimization tuning flags from Makefile CFLAGS
> 045cc04 move coarse size classing logic to malloc fast path
> 8348a82 eliminate med_twos_tab
> e619034 allow slot count 1 for size classes 3 mod 4 as natural/first-class
> c9d54f4 activate coarse size classing for small classes down to 4 (but not 6)
> 44092d8 improve individual-mmap decision
> d355eaf remove slot count reduction to 1 for size classes 1 mod 3
> c555ebe fix off-by-one in logic to use single-slot groups
> 9d5ec34 switch from MADV_DONTNEED to MADV_FREE for large free slots
> 584c7aa avoid over-use of reduced-count groups due to coarse size classing
> f9bfb0a increase threshold for 3->2 slot reduction to 16 pages
> 20da09e disable coarse size classing for large classes (over 8k)
>
> I don't think any of these changes are potentially obsoleted by
> further ideas in the above thread. I am working on delaying activation
> of slots until they're actually needed, so that we don't dirty pages
> we could avoid touching, but I proposed this as an alternative to
> other more complex tricks that I didn't really like, which have not
> been implemented and probably won't be now.
>
> So, in summary, I don't see any good reason not to go with latest.
>
> Rich

Many thanks for your detailed answer. I'll give it a try then!

Pirmin


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-05-25 18:13             ` Pirmin Walthert
@ 2020-06-04  7:04               ` Pirmin Walthert
  2020-06-04 15:11                 ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Pirmin Walthert @ 2020-06-04  7:04 UTC (permalink / raw)
  To: musl


Am 25.05.20 um 20:13 schrieb Pirmin Walthert:
> Am 25.05.20 um 19:54 schrieb Rich Felker:
>
>> On Mon, May 25, 2020 at 05:45:33PM +0200, Pirmin Walthert wrote:
>>> Am 18.05.20 um 20:53 schrieb Rich Felker:
>>>
>>>> On Sat, May 16, 2020 at 11:30:25PM -0400, Rich Felker wrote:
>>>>> Another alternative for avoiding eagar commit at low usage, which
>>>>> works for all but nommu: when adding groups with nontrivial slot 
>>>>> count
>>>>> at low usage, don't activate all the slots right away. Reserve vm
>>>>> space for 7 slots for a 7x4672, but only unprotect the first 2 pages,
>>>>> and treat it as a group of just 1 slot until there are no slots free
>>>>> and one is needed. Then, unprotect another page (or more if needed to
>>>>> fit another slot, as would be needed at larger sizes) and adjust the
>>>>> slot count to match. (Conceptually; implementation-wise, the slot
>>>>> count would be fixed, and there would just be a limit on the 
>>>>> number of
>>>>> slots made avilable when transformed from "freed" to "available" for
>>>>> activation.)
>>>>>
>>>>> Note that this is what happens anyway with physical memory as clean
>>>>> anonymous pages are first touched, but (1) doing it without explicit
>>>>> unprotect over-counts the not-yet-used slots for commit charge
>>>>> purposes and breaks tightly-memory-constrained environments (global
>>>>> commit limit or cgroup) and (2) when all slots are initially 
>>>>> available
>>>>> as they are now, repeated free/malloc cycles for the same size will
>>>>> round-robin all the slots, touching them all.
>>>>>
>>>>> Here, property (2) is of course desirable for hardening at 
>>>>> moderate to
>>>>> high usage, but at low usage UAF tends to be less of a concern
>>>>> (because you don't have complex data structures with complex 
>>>>> lifetimes
>>>>> if you hardly have any malloc).
>>>>> c
>>>>> Note also that (2) could be solved without addressing (1) just by
>>>>> skipping the protection aspect of this idea and only using the
>>>>> available-slot-limiting part.
>>>> One abstract way of thinking about the above is that it's just a
>>>> per-size-class bump allocator, pre-reserving enough virtual address
>>>> space to end sufficiently close to a page boundary that there's no
>>>> significant memory waste. This is actually fairly elegant, and might
>>>> obsolete some of the other measures taken to avoid overly eagar
>>>> allocation. So this might be a worthwhile direction to pursue.
>>> Dear Rich,
>>>
>>> Currently we use mallocng in production for most applications in our
>>> "embedded like" virtualised system setups, it even helped to find
>>> some bugs (for example in asterisk) as mallocng was less forgiving
>>> than the old malloc implementation. So if you're interested in real
>>> world feedback: everything seems to be running quite smoothly so
>>> far, thanks for this great work.
>>>
>>> Currently we use the git version of April 24th, so the version
>>> before you merged the huge optimization changes. As you mentioned in
>>> your "brainstorming mails", if I got them right, that you might
>>> rethink a few of these changes, I'd like to ask: do you think it
>>> would be better to use the current git-master version rather than
>>> the version of April 24th (we are not THAT memory constrained, so
>>> stability is the most important thing) or do you think it would be
>>> better to stick on the old version and wait for the next changes to
>>> be merged?
>> Thanks for the feedback!
>>
>> Which are the "huge optimization changes" you're wondering about?
>> Indeed there's a large series of commits after the version you're
>> using but I think you're possibly misattributing them.
>>
>> A number of the commits are bug fixes -- mostly not for hard bugs, but
>> for unwanted and unintended behaviors:
>>
>> a709dde fix unexpected allocation of 7x144 group in non-power-of-two 
>> slot
>> dda5a88 fix exact size tracking in memalign
>> 915a914 adjust several size classes to fix nested groups of 
>> non-power-of-2 size
>> 7acd61e allow in-place realloc when ideal size class is off-by-one
>> caca917 add support for aligned_alloc alignments 1M and over
>>
>> There were also quite a few around an idea that didn't go well and was
>> mostly reverted, but with major improvements to the original behavior:
>>
>> 5bff93c overhaul bounce counter to work with map sizes instead of 
>> size classes
>> 71262cd tune bounce counter to avoid triggering early
>> 9601aaa prevent overflow of unmap counter part of bounce counter
>> aca1f32 don't let the mmap cache limit grow unboundedly or overflow
>> 6fbee31 second partial overhaul of bounce counter system
>> 150de6e revert from map cache to old okay_to_free scheme, but improved
>> 1e972da initial conversion of bounce counting to use sequence 
>> numbers, decay
>> e3eecb1 factor bounce/sequence counter logic into meta.h
>> 6693738 account seq for individually-mmapped allocations above hard 
>> threshold
>> 4443f64 fix complete regression (malloc always fails) on 
>> variable-pagesize archs
>>
>> If you don't care about low usage, that whole change series is fairly
>> unimportant, but should be harmless. It just changes decisions about
>> choices where either choice produces as valid state for the allocator
>> but there are tradeoffs between memory usage and performance. The new
>> behavior should be better, though.
>>
>> A few commits were reordering the dependency between memalign and the
>> standard memalign-variant functions, which is a minor namespace
>> detail:
>>
>> da4c88e rename aligned_alloc.c
>> 04407f7 reverse dependency order of memalign and aligned_alloc
>> 74e6657 rename aligned_alloc source file back to its proper name
>> c990cb1 rename memalign source file back to its proper name
>>
>> A couple were hardening:
>>
>> 5bf4e92 clear group header pointer to meta when freeing groups
>> bd04c75 in get_meta, check offset against maplen (minor hardening)
>> 77cea57 add support for allocating meta areas via legacy brk
>>
>> And pretty much all the rest of the changes are tuning behavior for
>> "optimization" of some sort or another, which may be what you were
>> referring to:
>>
>> 26143c4 limit slot count growth to 25% instead of 50% of current 
>> usage in class
>> a9187f0 remove unnecessary optimization tuning flags from Makefile 
>> CFLAGS
>> 045cc04 move coarse size classing logic to malloc fast path
>> 8348a82 eliminate med_twos_tab
>> e619034 allow slot count 1 for size classes 3 mod 4 as 
>> natural/first-class
>> c9d54f4 activate coarse size classing for small classes down to 4 
>> (but not 6)
>> 44092d8 improve individual-mmap decision
>> d355eaf remove slot count reduction to 1 for size classes 1 mod 3
>> c555ebe fix off-by-one in logic to use single-slot groups
>> 9d5ec34 switch from MADV_DONTNEED to MADV_FREE for large free slots
>> 584c7aa avoid over-use of reduced-count groups due to coarse size 
>> classing
>> f9bfb0a increase threshold for 3->2 slot reduction to 16 pages
>> 20da09e disable coarse size classing for large classes (over 8k)
>>
>> I don't think any of these changes are potentially obsoleted by
>> further ideas in the above thread. I am working on delaying activation
>> of slots until they're actually needed, so that we don't dirty pages
>> we could avoid touching, but I proposed this as an alternative to
>> other more complex tricks that I didn't really like, which have not
>> been implemented and probably won't be now.
>>
>> So, in summary, I don't see any good reason not to go with latest.
>>
>> Rich
>
> Many thanks for your detailed answer. I'll give it a try then!
>
> Pirmin
>
FYI: everything working without any issues so far on about 150 systems 
(using the most current version from 27th of May). Even got around an 
OOM issue during DOS that could be reproduced with the old malloc 
implementation (maybe because of the fragmentation issue?). Server 
software running on these systems: lighttpd, php-fpm 7.3, asterisk 16, 
slapd, isc-dhcpd, dropbear

Pirmin


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] mallocng progress and growth chart
  2020-06-04  7:04               ` Pirmin Walthert
@ 2020-06-04 15:11                 ` Rich Felker
  0 siblings, 0 replies; 12+ messages in thread
From: Rich Felker @ 2020-06-04 15:11 UTC (permalink / raw)
  To: musl

On Thu, Jun 04, 2020 at 09:04:47AM +0200, Pirmin Walthert wrote:
> FYI: everything working without any issues so far on about 150
> systems (using the most current version from 27th of May). Even got
> around an OOM issue during DOS that could be reproduced with the old
> malloc implementation (maybe because of the fragmentation issue?).
> Server software running on these systems: lighttpd, php-fpm 7.3,
> asterisk 16, slapd, isc-dhcpd, dropbear

Thanks for the update! That's really good to hear.

Rich

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-06-04 15:11 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-10 18:09 [musl] mallocng progress and growth chart Rich Felker
2020-05-11 17:51 ` Rich Felker
2020-05-16  0:29 ` Rich Felker
2020-05-16  3:29   ` Rich Felker
2020-05-17  3:30     ` Rich Felker
2020-05-18 18:53       ` Rich Felker
2020-05-25 15:45         ` Pirmin Walthert
2020-05-25 17:54           ` Rich Felker
2020-05-25 18:13             ` Pirmin Walthert
2020-06-04  7:04               ` Pirmin Walthert
2020-06-04 15:11                 ` Rich Felker
2020-05-18 18:35   ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).