Hi Rich,

I am quite interested in this topic, and I made a comparison between glibc and musl with the following code:
```
#include <stdlib.h>

#define MAXF 4096
void *tobefree[MAXF];

int main() {
    long long i;
    int v, k;
    size_t s, c = 0;
    char *p;
    for (i = 0; i < 100000000L; i++) {
        v = rand();
        s = ((v % 256) + 1) * 1024;   /* random size between 1KB and 256KB */
        p = (char *)malloc(s);
        p[1023] = 0;                  /* touch the first page so it is really mapped */
        if (c >= MAXF) {              /* pool is full: free one entry at random */
            k = v % c;
            free(tobefree[k]);
            tobefree[k] = tobefree[--c];
        }
        tobefree[c++] = p;
    }
    return 0;
}
```
The results show a significant difference.
With glibc (running within a Debian docker image):
# gcc -o m.debian -O0 app_malloc.c
# time ./m.debian
real    0m37.529s
user    0m36.677s
sys    0m0.771s
With musl (running within an Alpine 3.15 docker image):
# gcc -o m.alpine -O0 app_malloc.c
# time ./m.alpine
real    6m 30.51s
user    1m 36.67s
sys    4m 53.31s

musl seems to spend way too much time in the kernel, while glibc keeps most of the work in userspace.
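The user/sys split can also be cross-checked in-process with getrusage(), which reports CPU time and page-fault counts directly; here is a minimal sketch with a shortened version of the loop (the iteration count is arbitrary, just enough to see the trend):
```
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void) {
    /* shortened version of the benchmark loop above */
    for (long i = 0; i < 1000000; i++) {
        char *p = malloc(((rand() % 256) + 1) * 1024);
        p[1023] = 0;   /* touch the first page */
        free(p);
    }

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("user %ld.%06lds  sys %ld.%06lds\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    printf("minor faults %ld  major faults %ld\n",
           ru.ru_minflt, ru.ru_majflt);
    return 0;
}
```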
I used perf_event_open to profile those programs:
The musl profile (302899 samples total) shows that this malloc/free sequence spends most of its time handling page faults and in the munmap/madvise/mmap syscalls.
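For completeness, here is a minimal sketch of the perf_event_open setup; it only counts page faults for the calling process (my actual profiler is sampling-based, this just shows the interface, and the small workload here is only a placeholder):
```
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Thin wrapper: libc does not provide a perf_event_open() symbol. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_PAGE_FAULTS;
    attr.disabled = 1;   /* enable explicitly once set up */

    /* Count page faults of the calling process on any CPU. */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* placeholder workload: allocate and touch fresh memory */
    for (int i = 0; i < 1000; i++) {
        char *p = malloc(64 * 1024);
        p[0] = 0;
        free(p);
    }

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("page faults: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```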

But the glibc profile (29072 samples total) is much lighter: page faults are still the biggest cost, while glibc itself spends significant time in "free".

It seems to me that glibc caches memory obtained from the kernel and reuses it, avoiding most of the page faults and syscalls.
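If that caching explanation is right, glibc's behavior should also be steerable through its mallopt() tunables, which control when memory is handed back to the kernel. A glibc-specific sketch (the threshold values are arbitrary, purely for illustration):
```
#include <malloc.h>   /* glibc-specific: mallopt() and the M_* constants */

int main(void) {
    /* Keep freed memory cached in the arena instead of trimming it
       back to the kernel. */
    mallopt(M_TRIM_THRESHOLD, 128 * 1024 * 1024);

    /* Serve larger requests from the cached heap rather than from
       fresh mmap/munmap pairs. */
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);

    /* ... run the malloc/free loop from above ... */
    return 0;
}
```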
Should this performance difference concern real-world applications? On average, musl actually spends about 3~4µs per malloc/free pair (≈390s / 10^8 iterations), which I think is still quite acceptable for real-world applications.

(It also seems to me that the performance difference has nothing to do with malloc_usable_size; that was indeed just a speculative guess without any basis.)



David Wang